ATLAS DAQ Note 77
18 Sept 1997
R. K. Bock, A. Bogaerts, R. Hauser, C. Hortnagl,
K. Koski, P. Werner, A. Guglielmi, O. Orel

Benchmarking Communication Systems

(Submitted, in modified form, to Parallel Computing)
1 Introduction
Following the market trend of high-performance computing towards parallel systems
available at decreasing cost, we believe that there is a finite chance that much or
maybe all of the computing load of level-2 triggers in ATLAS, and certainly all of
level 3, can be executed in a farm constituted by commercial parallel systems.
Performance evaluation of computer systems uses, among other criteria,
results of benchmarking, viz. measured execution times obtained by running a
representative job mix, usually without investing substantial effort in optimising
for the system at hand. In order to assess how far parallel systems can con-
tribute to the solution of our trigger problem, we have designed a comparatively naive
set of application-independent communication benchmarks; they are documented in
ATLAS DAQ notes 48 and 61. The results are, in the first instance, large tables of
measured communication times.
Our goal was to derive from these detailed results several basic communication
parameters. They include obvious ones like bandwidth, but also various overheads
associated with switching technologies or arising from interfacing to operating sys-
tems, and measures of traffic interference. We will eventually use these parameters
for comparing different system possibilities, in particular as input to detailed and
full-scale ATLAS modelling. While not replacing more detailed benchmarking of
applications, they do give more useful information than the combination of CPU
benchmarks with bandwidth numbers.
The tested systems include a number of different architectures, from clusters of
workstations to tightly coupled massively parallel systems. We also included the
technologies that were prominent in the ATLAS demonstrator program, SCI and
ATM. The benchmark package includes versions for many different communication
technologies and programming interfaces, such as shared memory, MPI, Digital's
Memory Channel, the Cray T3E shared memory API and the Meiko CS-2.
2 Description of the benchmarks
To assess the performance of a number of commercially available parallel systems, we
defined a set of abstract basic communication benchmarks, which are not specific to
our application. We also added two more application-oriented benchmarks, which
represent much simplified versions of the currently proposed second-level trigger
solutions [2].
All abstract basic benchmarks are executed for a varying number of packet sizes
(minimum, 64, 256, 1024 bytes) and a varying number of processors where applicable
(2, 4, 8, 16, 32, 64). Packet sizes are restricted to those expected in our application,
although some implementations have scanned a wider parameter space.
A more detailed definition of the benchmarks can be found in CERN ATLAS
documents (http://atlasinfo.cern.ch/Atlas/GROUPS/DAQTRIG/NOTES/note61/rudi.ps.Z and
http://atlasinfo.cern.ch/Atlas/GROUPS/DAQTRIG/NOTES/note48/ATLAS_DAQ_48.ps.Z). An example
implementation in C is also available from a web site
(http://www.cern.ch/RD11/combench/combench.tar.Z). Default implementations for MPI
and shared memory are available, and the software has also been adapted to several
low-level libraries from different vendors, including the Cray T3E, Meiko CS-2 and
Digital's Memory Channel.
The following benchmark programs have been used (N is the total number of
nodes):
Ping-pong
One node sends data to another node, and waits for the data to be sent back.
The benchmark measures the round-trip time.
Two-way
Both nodes send and receive data simultaneously.
All-to-all
All nodes send to each other simultaneously. Increasing the number of
nodes in the system increases the overall throughput.
Pairs
N/2 nodes send to N/2 receivers one-directionally.
Outfarming and Broadcast
For outfarming, one node sends packets to N - 1 receivers, while broadcast
uses hardware broadcast, if present. Thus in outfarming the data can be
different in each send, whereas broadcast always sends the same data.
Funnel and Push-farm
In the funnel benchmark, N - 1 senders send data to one receiver.
The push-farm represents a type of communication in which the data is sent
from N - 1 nodes to one receiver, much the same way as in the funnel. The
difference is that in the push-farm benchmark additional computing cycles can
be included; the computing represents the analysis of the received data, and
we execute dummy code lasting 0, 100, 400 or 1600 microseconds. Each time,
before the computing cycle is started, the request for the next data item has
already been issued, allowing overlap of computing and communication.
Pull-farm
The pull-farm represents a type of communication in which first a control message
(64 bytes) is sent from the receiver to N - 1 senders, and subsequently an
amount of data (1024 bytes) is received back from each sender. Computing
cycles can be included in the same way as in the push-farm.
A graphical representation of the benchmark topologies is given in Figure 1.
Of particular relevance to our application are the benchmarks ping-pong, pairs,
push-farm and pull-farm. The latter two have obviously been specifically designed
to correspond to communication patterns typical for our application. Ping-pong
tests the request-acknowledge cycle, which is needed in several kinds of transmis-
sions. The pairs benchmark tests one-way communication performance from point
to point, which is characteristic of communication without need for acknowledge-
ment.
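To make the definitions concrete, the following is a minimal sketch of how the
ping-pong benchmark can be expressed with MPI, one of the default implementations
of the suite. It is an illustration only, not the actual benchmark code; the packet-size
scan, warm-up iterations and statistics of the real implementation are omitted.

    /* Minimal ping-pong sketch (illustration only): rank 0 sends a packet to
     * rank 1 and waits for it to come back; the elapsed time divided by the
     * number of iterations is the round-trip time reported by the benchmark. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    #define PACKET_SIZE 64        /* one of the benchmark packet sizes (bytes) */
    #define ITERATIONS  1000

    int main(int argc, char **argv)
    {
        int rank, i;
        char buf[PACKET_SIZE];
        double t0, t1;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        memset(buf, 0, sizeof(buf));

        t0 = MPI_Wtime();
        for (i = 0; i < ITERATIONS; i++) {
            if (rank == 0) {
                MPI_Send(buf, PACKET_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, PACKET_SIZE, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, PACKET_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, PACKET_SIZE, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("round-trip time: %.2f us\n", (t1 - t0) / ITERATIONS * 1e6);

        MPI_Finalize();
        return 0;
    }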
Figure 1: A graphical representation of the communication benchmarks (ping-pong,
two-way, all-to-all, pairs, outfarming and broadcast, funnel, push-farm and pull-farm).
The dot indicates which time is measured; the delay ∆d marks the computing time
inserted in the push-farm and pull-farm.
3 Implementation
The benchmarks have been implemented in different technologies by designing a
separate intermediate layer for each technology, as illustrated in Figure 2. This layer
contains the message passing routines, such as sending and receiving a message,
initialisation, cleaning up and broadcasting. The routines include non-blocking send
and receive operations.
By using this kind of layered approach, the porting of the benchmarks has been
made more straightforward. When implementing a version for a new technology,
only the intermediate layer has to be changed.
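To illustrate the idea, the intermediate layer can be thought of as a small set of C
routines of roughly the following shape. The names and signatures below are hypothetical
and only indicate the kind of interface that each technology-specific implementation has
to provide; they are not the actual function names of the benchmark suite.

    /* Hypothetical sketch of the technology-independent intermediate layer.
     * Each communication technology (MPI, shared memory, SCI, ATM, Memory
     * Channel, ...) provides its own implementation of these routines. */

    typedef struct cb_request cb_request;   /* handle for non-blocking operations */

    int  cb_init(int *argc, char ***argv);  /* initialise the communication layer */
    void cb_finalize(void);                 /* clean up                           */
    int  cb_rank(void);                     /* logical node number                */
    int  cb_nodes(void);                    /* total number of nodes N            */

    /* blocking point-to-point operations */
    int  cb_send(int dest, const void *buf, int nbytes);
    int  cb_recv(int src,  void *buf, int nbytes);

    /* non-blocking operations, needed to overlap computing and communication
     * in the push-farm and pull-farm benchmarks */
    int  cb_isend(int dest, const void *buf, int nbytes, cb_request *req);
    int  cb_irecv(int src,  void *buf, int nbytes, cb_request *req);
    int  cb_wait(cb_request *req);

    /* broadcast, mapped to hardware broadcast where the technology supports it */
    int  cb_bcast(int root, void *buf, int nbytes);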
Since two different ATM libraries were used, two different implementations
of the low level interface had to be programmed for that particular technology. Since
the application programming interface of different ATM hardware, for example, is
usually proprietary, the interface has to be implemented separately for each system.
The push-farm and pull-farm results for ATM use specific traffic generators as
data sources. Additionally, the programs used in these measurements differ slightly
from the other benchmarks.
The benchmarks for the Mercury RACEway system (RACE is a registered trademark
of Mercury Computer Systems, Inc.; see
http://www.mc.com/Technical_bulletins/mtb4smp-race/smp-v-race.html) have been designed
using the PeakWare toolkit, previously known as CapCASE [14]; PeakWare is a
trademark of MATRA SYSTEMES & INFORMATION.
Figure 2: The structure of the benchmark implementation: the ATLAS communication
benchmarks are written once per application on top of a low level interface that is
written once per technology, which in turn maps onto MPI, SCI, ATM, shared memory,
Memory Channel and other communication hardware. The low level interface has been
programmed separately for each technology.
4 Platforms
The measurements have been done on a number of technologies:
- Scalable Coherent Interface (SCI) on a RIO2 8061 embedded processor board
  (http://www.ces.ch/Products/Products.html) using the LynxOS operating system, on
  PCs under Linux, and on Digital Alphas under Digital UNIX
- Digital Memory Channel connecting Digital Alphas
- Asynchronous Transfer Mode (ATM), on RIO2 (and RTPC) embedded pro-
  cessor boards with the LynxOS operating system, on PCs under Windows
  NT, and on Digital Alphas under Digital UNIX
- Cray T3E shared memory Application Programming Interface
- RACEway bus, using the Matra Systemes & Information PeakWare toolkit
- T9000 using IEEE 1355 DS links (GPMIMD)
- Meiko CS-2 communication library
In addition, shared memory and the Message Passing Interface (MPI) have been
used on multiple systems, as opposed to lower-level APIs. Shared memory has
been benchmarked on Digital 8400, Silicon Graphics Challenge and Origin systems.
The tested MPI platforms include shared memory multiprocessors, such as the Digital
8400 system and the Silicon Graphics Challenge, clusters, such as Digital's Memory
Channel, and conventional distributed memory systems, such as the IBM SP2, Cray
T3E and Meiko CS-2.
We will here describe in more detail the test procedure for three of these tech-
nologies: Digital's Memory Channel, SCI and ATM.
4.1 Memory Channel
Memory Channel [5] is a proprietary network technology from Digital Equipment
Corporation (Memory Channel is a registered trademark of Digital Equipment
Corporation). It is commercially targeted as an inter-node transport medium in Digital
UNIX TruCluster configurations. It typically interconnects several AlphaServers
with multiple processors each, and thus extends the scalability of Digital's product
line to installations with currently up to 96 (= 8 × 12) parallel Alpha processors.
Figure 3: Configuration in the Memory Channel tests: one AlphaServer 4000 5/300
and four AlphaStation 200 4/166 systems connected to a MEMORY CHANNEL hub.
Later an AlphaStation 500 system was added to the configuration, replacing one of
the older stations.
Memory Channel presents the abstraction of a unique shared address space to
all processes, regardless of their attachment to remote CPUs. Inter-node com-
munication can be pursued with low overhead because, after an initial phase of
memory-mapping, single user-level CPU store and load instructions suffice to launch
communication, and the required protection mechanisms are enforced in hardware.
Unlike cache-coherent non-uniform memory architectures (CC-NUMAs) in par-
ticular, Memory Channel opts for a slimmer solution which exposes differences between
local and remote memories at the application level. It does not provide a strong
memory coherency model; thus senders and receivers have to follow software-based
protocols for distinguishing between outdated and fresh copies of data. Further-
more, its memory mappings are asymmetric, i.e. applications must be prepared to
use different virtual addresses for reading from and writing to physically unique
remote locations.
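As a purely generic illustration of the kind of software protocol this implies (this is
not Digital's Memory Channel API, whose calls are documented in [6], and all names
here are invented for the example), a sender and a receiver communicating through a
mapped region might mark fresh data with a sequence flag along the following lines:

    /* Generic sketch of a flag-based protocol over a mapped communication
     * region; illustration only, not the Memory Channel API.  The transmit
     * and receive pointers may refer to different virtual addresses, since
     * the mappings are asymmetric. */
    #include <stdint.h>
    #include <string.h>

    #define PACKET_SIZE 1024

    struct channel_slot {
        volatile uint32_t sequence;   /* incremented when fresh data is present */
        char              data[PACKET_SIZE];
    };

    /* Sender side: write the payload first, then publish it by updating the
     * sequence number (a real implementation also needs a write barrier). */
    void slot_send(struct channel_slot *tx, const char *payload, uint32_t seq)
    {
        memcpy(tx->data, payload, PACKET_SIZE);
        tx->sequence = seq;
    }

    /* Receiver side: poll until the expected sequence number appears, i.e.
     * until the copy seen through the receive mapping is known to be fresh. */
    void slot_recv(const struct channel_slot *rx, char *payload, uint32_t seq)
    {
        while (rx->sequence != seq)
            ;                         /* spin until fresh data arrives */
        memcpy(payload, rx->data, PACKET_SIZE);
    }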
The quoted results were obtained with the following experimental setup: one
AlphaServer 4000 5/300 (299 MHz Alpha 21164 EV5, 96 KB + 2 MB off-chip cache)
and four AlphaStation 200 4/166 (166 MHz Alpha 21064 EV4.5, 512 KB off-chip
cache) were combined in a Digital UNIX TruCluster. Inter-node connectivity was
established over Memory Channel-to-PCI adapter cards (revision 1.5), a Memory
Channel hub with five line-cards and copper link cables. All workstations had
128 MB RAM configurations and ran copies of the Digital UNIX 4.0B operating
system. The test environment is illustrated in Figure 3.
The tests used a two-copy implementation of the low level interface's message
passing library that ran on top of the Memory Channel API library [6]. It is under-
stood that the results can be improved by avoiding the second copy, at the price of
violating the protocol stack of Figure 2.
4.2 Scalable Coherent Interface (SCI)
SCI [8] is an enabling technology for cache-coherent non-uniform memory architec-
tures (CC-NUMAs). It aims at better scalability for shared-memory multiprocessors
than what bus-based schemes can achieve, by using multiple point-to-point links and
a directory-based cache coherency scheme. The internal composition of the network,
which aggregates links into topologies of ringlets and switches, is effectively hidden
from clients behind an interface which resembles backplane buses.
Figure 4: Configuration in the SCI tests: an AlphaStation 500/400, an AlphaServer
4000 5/300 and three RIO2 8061 boards connected through an SCI switch.
In comparison to Memory Channel, which has been adopted by a single vendor
for market-ready solutions, SCI is an IEEE 1596 standard, with increasingly strong
commitments e.g. from Data General
(http://www.dg.com/numaliine/html/sci_interconnect_chipset_and_adapter.html), Sequent
(http://www.sequent.com/numaq/), Sun (http://www.sun.com/hpc/tech/interconnect.html)
and Siemens-Nixdorf (http://www.sni.de/public/sni.htm).
For our tests we used Dolphin Interconnect Solutions PCI-SCI cards (rev. B) [7],
which conform to the 32-bit PCI local-bus specification. This PCI implementation
did not offer SCI's cache coherency; it enabled us to utilise a variety of different
nodes, in particular also including embedded processor boards, which is of great
importance to some of the conceived parallel applications in our domain.
The cards, attached to 18-DE-200 link cables for a 16-bit parallel, electrical
implementation of the physical layer, offer up to 200 MB/s of aggregate bandwidth
on the medium. We observed that the obtained bandwidths (at most about 70 MB/s)
were limited by the performance of the PCI buses. This restriction is avoided by
systems which integrate SCI at the system-bus level, at the obvious price of giving up
reusable interface cards.
Our results were obtained with the following equipment: point-to-point measure-
ments refer to our fastest pair of nodes, i.e. an AlphaStation 500/400 (400 MHz
Alpha 21164 EV5.6, 96 KB + 2 MB off-chip cache) and an AlphaServer 4000 5/300.
Tests involving more than two nodes ran on the AlphaServer and a pool of RIO2
8061 (100 MHz PowerPC 604) VME-embedded processor boards.
The low level interface's two-copy message passing library operated on top of
an implementation of a draft version of the SCI PHY-API [9], whose general aim is
to provide a standard for low-level software access to SCI services.
SCI tests have been done using the configuration illustrated in Figure 4.
4.3 ATM
4.3.1 Abstract benchmarks
ATM tests were carried out using point-to-point connections between a number
of different systems. Additionally, a testbed consisting of Digital AlphaStation
200, RIO2 and RTPC systems and a FORE ATM switch was used. The testbed
configuration is illustrated in Figure 5. The maximum number of nodes available
during the tests was five. In addition, new-generation Digital AlphaStation 500 and
AlphaServer 4000 systems were available for ping-pong tests.
Figure 5: ATM testbed configuration: two AlphaStation 200 4/166 systems, an RTPC
and two RIO2 8061 boards connected through a FORE ATM switch (equipment on
loan from RD31).
A general problem for the ATM benchmarking arose from the limited availability
of the systems; our testbed was relatively heterogeneous: two different types of
system and two different ATM libraries were used. The Digital systems used
Digital's ATMSOCK library version 1.0, which will become a commercial product [10]
(it had not been released at the time of the tests). The RIO2s and the RTPC used
the ATMNicLib library [13], which is an efficient implementation reducing overheads
by bypassing the kernel (a library of utility functions is called instead of a device
driver) and avoiding data copies (the NicStar network interface has direct access to
user buffers). Thus the ATM library implementation running on the RIO2s and the
RTPC was especially tuned.
Most of the benchmarks, including all the testbed runs, have been run using
AlphaStation 200 systems, since newer systems were not available for testing at that
moment. The point-to-point results from the newer Digital systems (AlphaServer
4000, AlphaStation 500) were added later.
Minimum overheads correspond to the smallest packet size used in the measure-
ments, which was either 1 byte or 8 bytes. The largest packet size used here was
1024 bytes, although some of the ping-pong tests were additionally run with larger
packet sizes.
In the testbed measurements full optimisation was not used. The point-to-
point measurements have been done with full optimisations between two AlphaSta-
tion 200s, between an AlphaStation 500 and an AlphaServer 4000, and between a
RIO2 and the RTPC.
The basic implementation of the benchmarks uses Unspecified Bit Rate (UBR)
connections, and sends at full speed; as ATM has no flow control, the receiver can
in some cases lose packets. This is especially apparent in the benchmarks pairs, funnel
and push-farm. In the funnel benchmark, when four senders and large packets were
used, no meaningful measurements were possible. To avoid the problem of losing
packets, Constant Bit Rate (CBR) connections with reduced bandwidth for each
sender could be used instead. Tests using CBR have been carried out in addition
[11], [12].
The minimum round-trip time divided by two for the RIO2 was around 80-100 microsec-
onds in the ping-pong benchmark, in which both sides receive and send. The same
overhead for the Digital AlphaStation 200 is around 200 microseconds. For newer Dig-
ital systems the ping-pong overhead is close to the one obtained on the RIO2s.
The performance difference between the measurements with Digital processors of
different generations is surprisingly large; it should be attributed not so much
to the faster processor technology (communication benchmarks are not very CPU
intensive), but to other architectural changes which have taken place during recent
years.
For receiving only (benchmark pairs), the RIO2 can achieve a very low receiver over-
head, less than 10 microseconds for one byte. The Digital Alpha's receiver overhead in
pairs is around 20-30 microseconds for one byte. The fast RIO2 results demonstrate
the performance gain which can be achieved by investing in low-level design work
on ATM drivers [4], compared with directly available commercial implementations.
The ATM link speed was nominally 155 Mbit/s. The maximum user bandwidth,
i.e. the link speed minus the control data, was around 135 Mbit/s.
For example, when sending at full speed (UBR) one-directionally between RIO2s, already
with a 1024 byte packet size the link is almost entirely used (129.6 Mbit/s).
However, since the nominal speed of 155 Mbit/s is not very high compared with some
other communication technologies of today, it would be interesting to see the effect
of ATM connections with higher speed (e.g. 622 Mbit/s).
4.3.2 Application benchmarks: push-farm and pull-farm
The performance of the receiver processor in the push-farm tests was measured
on the upgraded demonstrator of the RD31 project [11]. In this system, sender
processors are replaced by ATM traffic generators which emulate the senders. The
receiver processor establishes a Constant Bit Rate connection with each sender.
The bandwidth of each such channel is scaled to 1/(number of senders) of the link
bandwidth, to avoid congestion in the switching network and the receiver. We measured
the maximum event rate that a receiver can handle for various packet sizes, numbers
of messages to be grouped and processing times. For small messages, when the total
amount of event data does not exceed 2 kbytes, the performance of the system is
determined by the software and hardware overhead of the receiver,
T_oh,push = 30 µs + 8 µs × (number of senders).
For large messages the maximum event rate is limited by the usable bandwidth of
the 155 Mbit/s ATM links. For example, when each of the four senders sends 1
kbyte of data, the total data transmission time is 250 µs and the maximum event rate
is 4 kHz.
The measurements for the pull-farm benchmark were made on the demonstrator
for ATLAS described in [12]. For this implementation of the pull protocol, the
overhead to handle small events (less than 2 kbytes) in the receiver is
T_oh,pull = 200 µs + 13 µs × (number of senders).
For events bigger than 8 kbytes the link bandwidth limits the maximum event
rate per receiver. For intermediate event sizes no simple formula can be derived.
For example, when four senders send 1 kbyte of data each, the maximum event rate is
about 2.66 kHz, which corresponds to a T_oh,pull of 376 µs.
For both the push-farm and pull-farm, when the sum of overhead and processing
time is larger than the event transfer time, the maximum event rate is 1/(overhead +
processing time). When the data transfer time is dominant, the maximum event
rate is 1/(transfer time).
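As a rough illustration of how these two regimes combine, the sketch below evaluates
the maximum event rate from the overhead formulas above. The usable bandwidth of
131 Mbit/s is an assumed example value, chosen only because it reproduces the 250 µs
quoted for four 1 kbyte fragments; all function names are invented for the example.

    /* Sketch: maximum event rate of an ATM push-farm / pull-farm receiver,
     * following the overhead formulas quoted in the text.  The usable
     * bandwidth below is an assumed example value, not a measured constant. */
    #include <stdio.h>

    #define USABLE_MBIT_PER_S 131.0   /* assumed usable ATM bandwidth */

    /* receiver overhead in microseconds: 30 us + 8 us per sender (push-farm) */
    static double push_overhead_us(int senders) { return 30.0 + 8.0 * senders; }

    /* receiver overhead in microseconds: 200 us + 13 us per sender (pull-farm) */
    static double pull_overhead_us(int senders) { return 200.0 + 13.0 * senders; }

    /* time to transfer one event of 'senders' fragments of 'bytes' each (us) */
    static double transfer_us(int senders, int bytes)
    {
        return senders * bytes * 8.0 / USABLE_MBIT_PER_S;
    }

    /* the slower of the two limits determines the maximum event rate (kHz) */
    static double max_rate_khz(double overhead_us, double processing_us, double xfer_us)
    {
        double cycle = overhead_us + processing_us;
        if (xfer_us > cycle)
            cycle = xfer_us;
        return 1000.0 / cycle;
    }

    int main(void)
    {
        int senders = 4;

        /* transfer-limited push-farm: 1 kbyte from each of four senders -> about 4 kHz */
        printf("push-farm, 1 kbyte fragments: %.1f kHz\n",
               max_rate_khz(push_overhead_us(senders), 0.0, transfer_us(senders, 1024)));

        /* overhead-limited pull-farm with small fragments (total below 2 kbytes);
         * note that for intermediate event sizes the text stresses that no simple
         * formula applies, so the measured 2.66 kHz for 1 kbyte fragments is not
         * reproduced by this sketch. */
        printf("pull-farm, 256 byte fragments: %.1f kHz\n",
               max_rate_khz(pull_overhead_us(senders), 0.0, transfer_us(senders, 256)));

        return 0;
    }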
5 Results
5.1 An overview of the parameters
We have condensed the large number of different benchmark results into a few
meaningful parameters, as shown in Figure 6. A complete set of results is available
from the web site (http://www.cern.ch/RD11/combench/results.html).
The parameters are the following:
In ping-pong we define one parameter: the overhead, derived by dividing the
round-trip time for the smallest packet size by two. This parameter represents the
latency of the communication.
Latency has also been measured in two-way. However, in this measurement the
time has not been divided by two, since both nodes are sending and receiving
data simultaneously. Note that the two-way latency is larger than in ping-pong,
since the setup times of both sending and receiving are included.
In the pairs benchmark we define two parameters: overhead and effective band-
width. The overhead is the per-message time of the one-directional communication
for the smallest packet. The effective bandwidth has been calculated from this
benchmark (and not from the previous ones), since in one-directional communication
the speed is not limited by waiting for an acknowledgement each time a message
is sent. We extract the effective bandwidth using the one kilobyte packet as a
reference. It should be noted that this is in most cases not the upper limit for
bandwidth; some systems achieve substantially higher bandwidth only with packets
much larger than 1 kbyte.
A parameter describing the broadcasting capabilities of the system has been
extracted from the outfarming and broadcast benchmarks by dividing the out-
farming ratio by the broadcast ratio. The ratios have been calculated by dividing
the time for 2N nodes by the time for N receiving nodes, from 2 to 4, 4 to 8 and
8 to 16 processors when possible, and taking the average. The parameter
shows how well broadcasting has been implemented in each system: the larger
the number, the more efficient the broadcasting is compared to the outfarming
performance of the same system.
The funnel (and push-farm) benchmark represents a typical data collection
approach, in which a number of nodes send their data to one receiver. A typical
example has been chosen: four nodes each sending a 1 kbyte packet to one node. The
cycle time to complete the operation, i.e. the inverse frequency, is given as a parameter.
The results of funnel and push-farm differ from each other only slightly. We present
mainly the results from funnel, since it has been run on a larger number of systems.
Where up-to-date funnel results have not been available, push-farm results were
used instead.
The pull-farm represents another type of data collection, in which first a read request
is sent. Here too a typical example of four senders and one receiver has been
chosen. In pull-farm the packet sizes have been fixed: 64 bytes for the control
message requesting the data and 1 kbyte for the actual data. The cycle time,
i.e. the inverse frequency, is presented.
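As a small illustration of how the condensed parameters are obtained from raw
benchmark times, consider the sketch below. All variable names are hypothetical and
the input values are placeholders, not measured results.

    /* Sketch of deriving the condensed parameters from raw benchmark times;
     * the numerical values are placeholders, not measurements. */
    #include <stdio.h>

    int main(void)
    {
        int i;

        /* ping-pong latency: round-trip time of the smallest packet, divided by two */
        double pingpong_rtt_us = 10.0;
        double latency_us = pingpong_rtt_us / 2.0;

        /* pairs: effective bandwidth from the one-directional time of a 1 kbyte
         * packet (bytes per microsecond is numerically equal to MB/s) */
        double pairs_1k_us = 30.0;
        double bandwidth_mb_s = 1024.0 / pairs_1k_us;

        /* broadcast parameter: average outfarming scaling ratio divided by the
         * average broadcast scaling ratio, where each ratio is t(2N)/t(N) for
         * the steps 2->4, 4->8 and 8->16 receivers */
        double out_ratio[3]   = {1.90, 1.80, 1.85};
        double bcast_ratio[3] = {1.10, 1.00, 1.05};
        double out_avg = 0.0, bcast_avg = 0.0;
        for (i = 0; i < 3; i++) {
            out_avg   += out_ratio[i]   / 3.0;
            bcast_avg += bcast_ratio[i] / 3.0;
        }
        double broadcast_param = out_avg / bcast_avg;

        printf("latency %.1f us, bandwidth %.1f MB/s, broadcast parameter %.2f\n",
               latency_us, bandwidth_mb_s, broadcast_param);
        return 0;
    }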
The current implementations of the push-farm and pull-farm benchmarks allow,
in principle, overlap between computation and communication. They
use non-blocking communication primitives to start gathering the fragments be-
longing to the (n+1)-th event before starting the calculations on the
n-th event. Neither benchmark attempts to request fragments belonging to more
than one future event in advance. This is partly because only few implementations
of the communication layer allow multiple outstanding send or receive operations
between two processes at a time.
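The overlapping receiver loop can be sketched as follows in its MPI variant; this is an
illustration only, not the actual benchmark code, and it assumes that the receiver is
rank 0 and the senders are ranks 1 to SENDERS.

    /* Sketch of a push-farm receiver with overlap of computing and communication:
     * the receives for event n+1 are posted before the dummy computation on
     * event n is started (illustration only). */
    #include <mpi.h>

    #define FRAG_SIZE 1024
    #define SENDERS   4
    #define EVENTS    1000

    static void dummy_compute(char *fragment, int nbytes)
    {
        (void)fragment; (void)nbytes;   /* stands for the 0-1600 us of dummy work */
    }

    void push_farm_receiver(void)
    {
        static char buf[2][SENDERS][FRAG_SIZE];  /* double buffer: current / next event */
        MPI_Request req[2][SENDERS];
        int cur = 0, ev, s;

        /* pre-post the receives for the first event */
        for (s = 0; s < SENDERS; s++)
            MPI_Irecv(buf[cur][s], FRAG_SIZE, MPI_BYTE, s + 1, 0,
                      MPI_COMM_WORLD, &req[cur][s]);

        for (ev = 0; ev < EVENTS; ev++) {
            int nxt = 1 - cur;

            /* request the fragments of event ev+1 before computing on event ev */
            if (ev + 1 < EVENTS)
                for (s = 0; s < SENDERS; s++)
                    MPI_Irecv(buf[nxt][s], FRAG_SIZE, MPI_BYTE, s + 1, 0,
                              MPI_COMM_WORLD, &req[nxt][s]);

            /* wait until all fragments of event ev have arrived, then "analyse" them */
            MPI_Waitall(SENDERS, req[cur], MPI_STATUSES_IGNORE);
            for (s = 0; s < SENDERS; s++)
                dummy_compute(buf[cur][s], FRAG_SIZE);

            cur = nxt;
        }
    }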
Architecture      ping-pong  two-way   pairs     pairs    broadc.  funnel    pull-farm  over-
                  latency    latency   overhead  bandw.   ratio    cycle t.  cycle t.   lap
                  (µs)       (µs)      (µs)      (MB/s)            (µs)      (µs)       (%)
ATM, RIO            104.8      108.9      9.1     16.2     2.05     250 **    376 **     -
ATM, DEC             85.4       90.3     33.3     13.6     -        -         -          -
Cray T3E              4.8        5.5      5.0     57.4     1.55      35.7      57.2      12
Cray T3E (MPI)       21.7       34.4     19.2     19.1     1.67     243.7     349.6       0
DEC MC                6.4        7.8      3.7     31.2     1.92      83.3 *    90.0      71
DEC MC (MPI)         26.5       51.3     21.3     11.4     1.28     -         -          -
DEC 8400              4.0        7.8      4.5     32.2     1.15     128.3 *   138.3       0
DEC 8400 (MPI)       13.3       22.9     11.0     13.1     1.60     -         -          -
GPMIMD                6.6       -        13.1      3.1     1.05    1132.4     -          -
IBM SP2 (MPI)        74.2       88.5     31.0     10.8     1.56     -         -          -
Matra, Raceway        8.8       12.5     10.0     59.0     -         47.5     -          -
Meiko (Channel)      20.3       34.4     22.4     11.4     1.88     124.0 *   172.0      78
Meiko (MPI)         128.5      137.0    102.0      6.2     1.23     -         -          -
Parsytec            217.0      354.0    188.0      3.7     1.50    1103       -          -
SCI, DEC              9.9       14.9      8.3     38.5     -        -         -          -
SCI, RIO/DEC         12.5       18.1      9.9     21.2     -         76.6 *    98.3      61
SGI Origin            7.1       12.7      8.8     31.3     1.25     178.3     207.2      -
SGI Challenge        12.9       20.1     14.0     12.2     1.23     262.6     -          -
SGI Chall. (MPI)     66.1       81.4     34.8      5.8     -        546.4     -          -
Figure 6: Parameters extracted from the benchmark results. Push-farm results were
used instead of funnel when either funnel results were not available or the push-farm
results were considerably newer; these are marked with an asterisk (*). The push-farm
and pull-farm results for ATM, marked with a double asterisk (**), have been obtained
with slightly different measurements and are described in more detail in section 4.3.2.
We parameterised the observed amount of overlap as 1 - a/t_{r,0} (in per cent),
where a is the (application-specific) communication overhead, and t_{r,0} is the time
spent by the receiver for gathering fragments (communication) with d = 0 (no
calculation). The overhead a was obtained as the average of t_{r,d} - d for different
values of d, that is, the overall excess over the time d that is required for calculating
alone. We used results from measurements in which four senders relayed fragments
of 1024 bytes each to one receiver. The overlap presented in the table has been
taken from the push-farm.
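A minimal sketch of this parameterisation is given below; the timing values t_{r,d}
are invented for illustration and are not measured results.

    /* Sketch of the overlap parameterisation: a is the average of t_{r,d} - d
     * over the computation times d, and the overlap is 1 - a / t_{r,0}.
     * The timing values below are invented for illustration only. */
    #include <stdio.h>

    int main(void)
    {
        double d[4]   = {0.0, 100.0, 400.0, 1600.0};  /* dummy computation times (us) */
        double t_r[4] = {90.0, 150.0, 430.0, 1640.0}; /* example receiver times (us)  */
        double a = 0.0;
        int i;

        /* average excess of the receiver time over the computation time alone */
        for (i = 0; i < 4; i++)
            a += (t_r[i] - d[i]) / 4.0;

        /* fraction of the communication overhead hidden behind the computation */
        printf("a = %.1f us, overlap = %.0f %%\n", a, 100.0 * (1.0 - a / t_r[0]));
        return 0;
    }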
5.2 Discussion of the results
The measurements provide a large amount of data. The parameters extracted from these
results attempt to compress this large amount of benchmark data into a few mean-
ingful numbers, which can be used to compare the communication performance
of different systems.
From the parameters in Figure 6 a number of observations can be made. The
latencies vary considerably, from a few to a few hundred microseconds. The same
observation applies to the overhead in the pairs benchmark. The lowest overheads are
not necessarily obtained by the tightly coupled shared memory systems, even though
that might be expected: for example, Digital's Memory Channel reaches less
than 4 µs and the Cray T3E 5 µs.
The bandwidths measured by sending a one kilobyte packet vary between 3.7 and
59.0 MB/s, although many of the systems can do better with a larger packet size.
MPI results were obtained on multiple systems on which lower-level APIs
were also available. The latencies and overheads with MPI, at least with the current MPI
versions, are large, typically 3-6 times larger than with lower-level APIs. On the other
hand, since MPI is available on multiple platforms, it provides portability; it can
be debated whether the difference in performance justifies the additional programming
work.
The overlap results show the best overlap for the Meiko CS-2, as is explicable
by its powerful per-node communication co-processors. The AlphaServer 8400
occupies the other end of the spectrum as a typical SMP multiprocessor, with no
overlap at all (and low absolute latencies at the same time). Some other observed
overlaps must be attributed to artifacts of the implementations of the intermediate
software layers, which stem from different authors, and are not so easily explained.
The scalability of the benchmarked systems does not appear strongly from the
parameters. This is partly due to the fact that in many of the benchmarked config-
urations only a few processors were available. A few of the systems, the Silicon Graphics
Origin and the Cray T3E, could be tested with a larger number of nodes. These systems
demonstrated quite good communication network scalability up to the tested 44
(SGI) and 64 (Cray) processors.
The large number of benchmark results has been obtained over a time span of
more than a year. During that time a number of hardware and software upgrades
took place, so that the results do not necessarily represent the most up-to-date situa-
tion. In addition, the maximum number of processors presented in some benchmark
results depended on access to the systems.
One would expect the overhead in the pairs benchmark to be consistently
smaller than the latency in ping-pong, since only the receiving setup time is present
in addition to the transfer time, and no flow control from the receiver back to the
sender is used. However, in some cases the pairs overhead is larger. In one case, this
kind of behaviour can be explained by background load (the tested systems could
not always be dedicated); in another there was a configuration change during some
of the measurements. The difference in these cases is quite small, and the suspicious
times are also at the lower end (mostly less than 10 µs with few exceptions), so
the statistical error of the measurements might also influence the results.
The parameters in Figure 6 represent critical aspects of the systems, but are in
no way sufficient to generate all measured benchmark results. They may be seen,
however, as parameters that can be used in a model.
The current implementation does a memory copy at each end of the data transfer,
which is not an optimal approach for some of the technologies, for example for shared
memory.
6 Summary
Benchmark suites such as Parkbench (http://www.netlib.org/parkbench/) measure the
multiprocessor performance of a system by running a set of predefined applications
or kernels of applications. Only a small part of the Parkbench suite deals with
communication, however. We feel that our benchmark suite can serve as a useful tool
for comparing the raw communication performance of parallel systems, as it is
available to application-level programs.
A large number of results for these benchmarks is available. This makes
it possible to compare the communication performance of different systems widely.
Many of the latest-generation parallel technologies have been measured.
Several of the systems show communication overheads below 10 µs for small
packets. Some of the systems have additionally proven good scalability, which
has been tested in some cases with up to 64 processors. Given the good scalability
of some of the systems within the tested range, it is predictable that scaling to
hundreds of processors will, either already now or at least in the near future, also
be quite efficient, for example with some of the tightly coupled parallel systems or
shared memory systems using a NUMA (non-uniform memory access) architecture.
The communication pattern most typical for our application, pull-farm with
four senders, is completed on some systems in around 60-90 µs. The best result for
the push-farm, for the same number of processors and 1 kbyte of data, is around 40
µs. We consider these numbers promising for our trigger work.
7 Future work
The benchmarks described in this document have been run on a large number of
different systems. This provides an extensive set of results, from which information
about the different communication networks can be extracted. However, new and
faster systems constantly arrive on the market; we intend to subject them to the
same procedures.
Parallel systems are evolving, too. Many of the main vendors are developing
systems based on clustering shared memory systems, in which each node thus is
a multiprocessor system itself. This kind of two-level (or deeper) communication
hierarchy creates new challenges for future releases of the communication benchmarks.
8 Acknowledgements
We would like to thank Digital Equipment Corporation for their close co-operation
during this benchmark work.
We would like to thank the Center for Scientific Computing (CSC) in Finland for
the usage of their supercomputer systems.
We would also like to thank Irakli Mandjavidze (DSM/DAPNIA), Andreu Pacheco
(CERN), Denis Calvet (DSM/DAPNIA) and the CERN RD31 group for technical
assistance, and for the opportunity to use the ATM switch and related hardware in
building the testbed for the measurements.
In addition, we thank the following persons, who have contributed to running the
benchmarks: Igor Zacharov (Silicon Graphics), Raynald Huaulme (Matra Systemes
& Information), Iosif Legrand (DESY), Ruud van Wijk (NIKHEF), Roger Heeley
(CERN) and John Apostolakis (CERN).
References
[1] ATLAS Technical Proposal. CERN/LHCC/94-43, 1994.
[2] ATLAS Level-2 Trigger Groups, ATLAS Second-Level Trigger Options. CHEP'97
conference proceedings, http://sgi.ifh.de/CHEP97/paper/paper/466.ps.
[3] J. Apostolakis et al., Abstract Communication Benchmarks on Parallel Systems
for Real-time Applications. CHEP'97 conference proceedings,
http://sgi.ifh.de/CHEP97/paper/paper/460.ps.
[4] Private communications from Irakli Mandjavidze and Denis Calvet.
[5] R. B. Gillett. Memory Channel Network for PCI. IEEE Micro, February 1996.
http://www.digital.com:80/info/hpc/ref/gillett_ieee.pdf.
[6] Digital Equipment Corporation. TruCluster Production Server Software. MEMORY
CHANNEL Application Programming Interfaces. Part Number AA-QTN4B-TE,
September 1996.
http://www.unix.digital.com/faqs/publications/cluster_doc/PS_MC_API/TOC.HTM#TOC.
[7] Dolphin Interconnect Solutions A.S., Oslo, Norway. PCI-SCI Bridge Functional
Specification. Version 3.1 (confidential), November 1996.
[8] IEEE Computer Society. IEEE Standard for Scalable Coherent Interface (SCI).
IEEE Std 1596-1992 (recognised as an American National Standard, ANSI),
August 1993.
[9] IEEE Computer Society. Physical layer Application Programming Interface for
the Scalable Coherent Interface (SCI PHY-API). IEEE Std P1596.9/Draft 0.41b,
March 23, 1997. http://sci.lbl.gov/sciapi/draft/book041b.pdf.
[10] Digital Equipment Corporation. Digital UNIX Native ATM Application Pro-
gramming Interface. Programmer's reference for PVC operations. Version 2.0,
September 16, 1996.
[11] M. Costa et al. Lessons from ATM-based event builder demonstrators and chal-
lenges for LHC-scale systems. Proceedings of the Second Workshop on Electronics
for LHC Experiments, Balatonfüred, Hungary, 23-27 September 1996.
[12] D. Calvet et al. Operation and Performance of an ATM based Demonstrator
for the Sequential Option of the ATLAS Trigger. Proceedings of the Tenth IEEE
Real Time Conference, Beaune, France, 21-26 September 1997.
[13] D. Calvet et al. Performance Analysis of ATM Network Interfaces for Data
Acquisition Applications. Proceedings of the Second International Data Acquisition
Workshop on Networked Data Acquisition Systems. World Scientific Publishing,
1997, pp. 73-80.
[14] A. Clouard et al. CapCASE: A Graphical Development Tool Supporting
Scalable, Heterogeneous Multicomputers. Proceedings of the International Con-
ference on Signal Processing Applications and Technology (ICSPAT '96), pp.
873-879, Boston, USA, October 7-10, 1996.