ATLAS DAQ Note 77

18 Sept 1997

R.K.Bo ck, A.Bogaerts, R. Hauser, C. Hortnagl,

K. Koski, P.Werner, A.Guglielmi, O.Orel

Benchmarking Communication Systems

(Submitted, in modified form, to Parallel Computing)

1 Introduction

Following the market trend of high-performance computing towards parallel systems available at decreasing cost, we believe that there is a finite chance that much or maybe all of the computing load of level-2 triggers in ATLAS, and certainly all of level 3, can be executed in a farm constituted by commercial parallel systems.

The computer performance evaluation of systems uses, among other criteria, results of benchmarking, viz. measured execution times obtained by running a representative job mix, usually without investing substantial effort in optimising for the system at hand. In order to assess how far parallel systems can contribute to the solution of our trigger problem, we have designed a comparatively naive set of application-independent communication benchmarks; they are documented in ATLAS DAQ notes 48 and 61. The results are, in the first instance, large tables of measured communication times.

Our goal was to derive from these detailed results several basic communication parameters. They include obvious ones like bandwidth, but also various overheads associated with switching technologies or arising from interfacing to operating systems, and measures of traffic interference. We will eventually use these parameters for comparing different system possibilities, in particular as input to detailed and full-scale ATLAS modelling. While not replacing more detailed benchmarking of applications, they do give more useful information than the combination of CPU benchmarks with bandwidth numbers.

The tested systems include a number of different architectures, from clusters of workstations to tightly coupled massively parallel systems. We also included the technologies that were prominent in the ATLAS demonstrator program, SCI and ATM. The benchmark package includes versions for many different communication technologies and programming interfaces, such as shared memory, MPI, Digital's Memory Channel, the Cray T3E shared memory API and the Meiko CS-2.

2 Description of the benchmarks

To assess the performance of a number of commercially available parallel systems, we defined a set of abstract basic communication benchmarks, which are not specific to our application. We also added two more application-oriented benchmarks, which represent much simplified versions of the currently proposed second-level trigger solutions [2].

All abstract basic benchmarks are executed for a varying number of packet sizes (minimum, 64, 256, 1024 bytes) and, where applicable, a varying number of processors (2, 4, 8, 16, 32, 64). Packet sizes are restricted to those expected in our application, although some implementations have scanned a wider parameter space.


A more detailed definition of the benchmarks can be found in CERN ATLAS documents (http://atlasinfo.cern.ch/Atlas/GROUPS/DAQTRIG/NOTES/note61/rudi.ps.Z and http://atlasinfo.cern.ch/Atlas/GROUPS/DAQTRIG/NOTES/note48/ATLAS DAQ 48.ps.Z). An example implementation in C is also available from a web site (http://www.cern.ch/RD11/combench/combench.tar.Z). Default implementations for MPI and shared memory are available, and the software has also been adapted to several low-level libraries from different vendors, including the Cray T3E, the Meiko CS-2 and Digital's Memory Channel.

The following benchmark programs have been used (N is the total number of nodes):

Ping-pong
One node sends data to another node and waits for the data to be sent back. The benchmark measures the round-trip time.

Two-way
Both nodes are sending and receiving data simultaneously.

All-to-all
All nodes are sending to each other simultaneously. Increasing the number of nodes in the system increases the overall throughput.

Pairs
N/2 nodes send to N/2 receivers one-directionally.

Outfarming and Broadcast
For outfarming, one node sends packets to N − 1 receivers, while broadcast uses hardware broadcast, if present. Thus in outfarming the data could be different in each send, whereas broadcast always sends the same data.

Funnel and Push-farm
In the funnel benchmark N − 1 senders send data to one receiver.
The push-farm represents a type of communication in which the data is sent from N − 1 nodes to one receiver, much the same way as in the funnel. The difference is that in the push-farm benchmark additional computing cycles can be included; the computing represents the analysis of the received data, and we execute dummy code lasting 0, 100, 400 or 1600 microseconds. Each time, before the computing cycle is started, the request for the next data item has already been issued, allowing overlap of computing and communication.

Pull-farm
Pull-farm represents a type of communication in which first a control message (64 bytes) is sent from the receiver to N − 1 senders, and subsequently an amount of data (1024 bytes) is received back from each sender. Computing cycles can be included in the same way as in push-farm. A minimal sketch of the receiver side of this pattern is given after the list.
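For illustration, the pull-farm receiver can be pictured in MPI terms roughly as shown below; this is only a sketch under assumptions of our own (receiver on rank 0, senders on ranks 1 to N − 1, hypothetical function name), not the actual benchmark code, which sits on a per-technology low-level interface.

    #include <mpi.h>

    #define NSENDERS 4   /* example: four senders, one receiver */

    /* Pull-farm receiver sketch: send a 64-byte control message to each
     * sender, then collect 1024 bytes of data back from each of them. */
    static void pull_farm_receiver(void)
    {
        char request[64] = {0};
        static char data[NSENDERS][1024];
        MPI_Request req[NSENDERS];

        for (int s = 0; s < NSENDERS; s++)      /* request the fragments */
            MPI_Send(request, 64, MPI_BYTE, s + 1, 0, MPI_COMM_WORLD);
        for (int s = 0; s < NSENDERS; s++)      /* collect the replies */
            MPI_Irecv(data[s], 1024, MPI_BYTE, s + 1, 1, MPI_COMM_WORLD, &req[s]);
        MPI_Waitall(NSENDERS, req, MPI_STATUSES_IGNORE);
    }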

A graphical representation of the benchmark topologies is given in Figure 1.

Of particular relevance to our application are the benchmarks ping-pong, pairs, push-farm and pull-farm. The latter two have obviously been specifically designed to correspond to communication patterns typical for our application. Ping-pong tests the request-acknowledge cycle, which is needed in several kinds of transmissions. The pairs benchmark tests one-way communication performance from point to point, which is characteristic for communication without need for acknowledgement.
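As an illustration of the kind of measurement involved, the core of a ping-pong run in the default MPI implementation could look roughly as follows; this is a minimal sketch, not the benchmark package's actual code, and the packet size and repetition count are example values.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal ping-pong sketch: rank 0 sends a packet to rank 1 and waits for
     * it to be returned; the average round-trip time is reported. */
    int main(int argc, char **argv)
    {
        int rank, reps = 1000, size = 64;      /* example values only */
        char buf[1024];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double rtt_us = (MPI_Wtime() - t0) / reps * 1e6;
        if (rank == 0)
            printf("%d-byte round trip: %.1f us (latency ~ %.1f us)\n",
                   size, rtt_us, rtt_us / 2.0);
        MPI_Finalize();
        return 0;
    }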

Figure 1: A graphical representation of the communication benchmarks (ping-pong, two-way, all-to-all, pairs, outfarming and broadcast, funnel, push-farm and pull-farm). The dot indicates which time is measured.

3 Implementation

The benchmarks have been implemented on different technologies by designing a separate intermediate layer for each technology, as illustrated in Figure 2. This layer contains the message passing routines, such as sending and receiving a message, initialisation, cleaning up and broadcasting. The routines include non-blocking send and receive operations.

By using this kind of layered approach, the porting of the benchmarks has been made more straightforward. When implementing a version for a new technology, only the intermediate layer has to be changed.
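For illustration, the intermediate layer can be pictured as a small set of C routines of the following kind; the names below are hypothetical and not the actual interface of the benchmark package, but each technology would provide its own implementation behind such a fixed interface.

    /* Hypothetical sketch of a per-technology low-level interface; the
     * benchmark code above it is written once, and only this layer is
     * reimplemented for each technology. */
    #ifndef COMBENCH_LOWLEVEL_H
    #define COMBENCH_LOWLEVEL_H

    #include <stddef.h>

    typedef struct cb_request cb_request;  /* handle for non-blocking operations */

    int  cb_init(int *argc, char ***argv); /* set up the technology              */
    void cb_finalize(void);                /* clean up                           */

    int  cb_rank(void);                    /* this node's id, 0 .. cb_nodes()-1  */
    int  cb_nodes(void);                   /* total number of nodes, N           */

    /* Blocking and non-blocking point-to-point transfers. */
    int  cb_send(int dest, const void *buf, size_t len);
    int  cb_recv(int src, void *buf, size_t len);
    int  cb_isend(int dest, const void *buf, size_t len, cb_request *req);
    int  cb_irecv(int src, void *buf, size_t len, cb_request *req);
    int  cb_wait(cb_request *req);

    /* Broadcast, mapped to hardware broadcast where the technology offers it. */
    int  cb_broadcast(int root, void *buf, size_t len);

    #endif /* COMBENCH_LOWLEVEL_H */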

Since two different ATM libraries were used, two different implementations of the low-level interface had to be programmed for that particular technology. Since the application programming interface of different ATM hardware, for example, is usually proprietary, the interface has to be implemented separately for each system. The push-farm and pull-farm results for ATM use specific traffic generators as data sources. Additionally, the programs used in these measurements differ slightly from the other benchmarks.

The benchmarks for the Mercury RACEway system (RACE is a registered trademark of Mercury Computer Systems, Inc.; see http://www.mc.com/Technical bulletins/mtb4smp-race/smp-v-race.html) have been designed using the PeakWare toolkit (PeakWare is a trademark of MATRA SYSTEMES & INFORMATION), previously known as CapCASE [14].


Figure 2: The structure of the benchmark implementation: the ATLAS communication benchmarks are written once per application on top of a low-level interface that is written once per technology (MPI, SCI, ATM, shared memory, MEMORY CHANNEL and other communication hardware). The low-level interface has been programmed separately for each technology.

4 Platforms

The measurements have been done on a number of technologies:

- Scalable Coherent Interface (SCI) on a RIO2 8061 embedded processor board (http://www.ces.ch/Products/Products.html) using the LynxOS operating system, on PCs under Windows NT, and on Digital Alphas under Digital UNIX
- Digital Memory Channel connecting Digital Alphas
- Asynchronous Transfer Mode (ATM), on RIO2 (and RTPC) embedded processor boards with the LynxOS operating system, on PCs under Windows NT, and on Digital Alphas under Digital UNIX
- Cray T3E shared memory Application Programming Interface
- RACEway bus, using the Matra Systemes & Information PeakWare toolkit
- T9000 using IEEE 1355 DS links (GPMIMD)
- Meiko CS-2 communication library

In addition, shared memory and the Message Passing Interface (MPI) have been used on multiple systems, as opposed to the lower-level APIs. Shared memory has been benchmarked on Digital 8400, Silicon Graphics Challenge and Origin systems. The tested MPI platforms include shared memory multiprocessors, such as the Digital 8400 system and the Silicon Graphics Challenge; clusters, such as Digital's Memory Channel; and conventional distributed memory systems, such as the IBM SP2, Cray T3E and Meiko CS-2.

We will describe here in more detail the test procedure for three of these technologies: Digital's Memory Channel, SCI and ATM.

4.1 Memory Channel

Memory Channel [5] is a proprietary network technology from Digital Equipment Corporation (Memory Channel is a registered trademark of Digital Equipment Corporation); it is commercially targeted as an inter-node transport medium in Digital UNIX TruCluster configurations. It typically interconnects several AlphaServers with multiple processors each, and thus extends the scalability of Digital's product line to installations with currently up to 96 (= 8 × 12) parallel Alpha processors.

Figure 3: Configuration in the Memory Channel tests (one AlphaServer 4000 5/300 and four AlphaStation 200 4/166 systems connected to a MEMORY CHANNEL hub). Later an AlphaStation 500 system was added to the configuration, replacing one of the older stations.

Memory Channel presents the abstraction of a unique shared address space to all processes, regardless of their attachment to remote CPUs. Inter-node communication can be pursued with low overhead because, after an initial phase of memory-mapping, single user-level CPU store and load instructions suffice to launch communication, and the required protection mechanisms are enforced in hardware.

Unlike cache-coherent non-uniform memory architectures (CC-NUMAs) in particular, Memory Channel opts for a slimmer solution which exposes the differences between local and remote memories at the application level. It does not provide a strong memory coherency model; thus senders and receivers have to follow software-based protocols for distinguishing between outdated and fresh copies of data. Furthermore, its memory mappings are asymmetric, i.e. applications must be prepared to use different virtual addresses for reading from and writing to physically unique remote locations.
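The resulting programming style can be sketched as follows; this is an illustration only, not the Memory Channel API: tx_slot and rx_slot stand for hypothetical pointers obtained during the initial memory-mapping phase, and they deliberately have different virtual addresses for the same remote region, as described above.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical layout of a one-way message slot in a mapped region: a
     * payload followed by a sequence flag that marks the data as fresh. */
    typedef struct {
        char              payload[1024];
        volatile uint32_t sequence;     /* 0 means "empty"; written last */
    } mc_slot;

    /* Sender side: plain user-level stores through the transmit mapping. */
    static void mc_send(mc_slot *tx_slot, const void *msg, size_t len, uint32_t seq)
    {
        memcpy(tx_slot->payload, msg, len);
        /* a real implementation would need a write barrier here */
        tx_slot->sequence = seq;        /* publish: flag is written last */
    }

    /* Receiver side: reads through the receive mapping (a different virtual
     * address); software polls the flag, since no hardware coherency model
     * distinguishes outdated from fresh data. */
    static void mc_recv(mc_slot *rx_slot, void *msg, size_t len, uint32_t seq)
    {
        while (rx_slot->sequence != seq)
            ;                           /* spin until the data is fresh */
        memcpy(msg, rx_slot->payload, len);
    }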

The quoted results were obtained with the following experimental setup: one AlphaServer 4000 5/300 (299 MHz Alpha 21164 EV5, 96 KB + 2 MB off-chip cache) and four AlphaStations 200 4/166 (166 MHz Alpha 21064 EV4.5, 512 KB off-chip cache) were combined in a Digital UNIX TruCluster. Inter-node connectivity was established over Memory Channel-to-PCI adapter cards (revision 1.5), a Memory Channel hub with five line-cards and copper link cables. All workstations had 128 MB RAM configurations and ran copies of the Digital UNIX 4.0B operating system. The test environment is illustrated in Figure 3.

The tests used a two-copy implementation of the low-level interface's message passing library that ran on top of the Memory Channel API library [6]. It is understood that the results can be improved by avoiding the second copy, at the price of violating the protocol stack of Figure 2.

4.2 Scalable Coherent Interface (SCI)

SCI [8] is an enabling technology for cache-coherent non-uniform memory architectures (CC-NUMAs). It aims at better scalability for shared-memory multiprocessors than what bus-based schemes can achieve, by using multiple point-to-point links and a directory-based cache coherency scheme. The internal composition of the network, which aggregates links into topologies of ringlets and switches, is effectively hidden from clients behind an interface which resembles backplane buses.

Figure 4: Configuration in the SCI tests (three RIO2 8061 boards, an AlphaStation 500/400 and an AlphaServer 4000 5/300 connected through an SCI switch).

In comparison to Memory Channel, which has been adopted by a single vendor for market-ready solutions, SCI is an IEEE 1596 standard, with increasingly strong commitments e.g. from Data General (http://www.dg.com/numaliine/html/sci interconnect chipset and adapter.html), Sequent (http://www.sequent.com/numaq/), Sun (http://www.sun.com/hpc/tech/interconnect.html) and Siemens-Nixdorf (http://www.sni.de/public/sni.htm).

For our tests we used Dolphin Interconnect Solutions PCI-SCI cards (rev. B) [7], which conform to the 32-bit PCI local-bus specification. This PCI implementation did not offer SCI's cache coherency; it enabled us to utilise a variety of different nodes, in particular also including embedded processor boards, which is of great importance to some of the conceived parallel applications in our domain.

The cards, attached to 18-DE-200 link cables for a 16-bit parallel, electrical implementation of the physical layer, offer up to 200 MB/s of aggregate bandwidth on the medium. We observed that the obtained bandwidths (about 70 MB/s maximum) were limited by the performance of the PCI buses. This restriction is avoided by systems which integrate SCI at the system-bus level, at the obvious price of giving up reusable interface cards.

Our results were obtained with the following equipment: the point-to-point measurements refer to our fastest pair of nodes, i.e. an AlphaStation 500/400 (400 MHz Alpha 21164 EV5.6, 96 KB + 2 MB off-chip cache) and an AlphaServer 4000 5/300. Tests involving more than two nodes ran on the AlphaServer and a pool of RIO2 8061 (100 MHz PowerPC 604) VME-embedded processor boards.

The low-level interface's two-copy message passing library operated on top of an implementation of a draft version of the SCI PHY-API [9], whose general aim is to provide a standard for low-level software access to SCI services.

The SCI tests have been done using the configuration illustrated in Figure 4.

4.3 ATM

4.3.1 Abstract benchmarks

ATM tests were carried out by using point-to-point connections between a number of different systems. Additionally, a testbed consisting of Digital AlphaStation 200, RIO2 and RTPC systems and a FORE ATM switch was used. The testbed configuration is illustrated in Figure 5. The maximum number of nodes available during the tests was five. In addition, new-generation Digital AlphaStation 500 and AlphaServer 4000 systems were available for ping-pong tests.

Figure 5: ATM testbed configuration (two AlphaStation 200 4/166 systems, an RTPC and two RIO2 8061 boards connected to a FORE switch; equipment on loan from RD31).

A general problem for the ATM benchmarking arose from the limited availability of the systems; our testbed was relatively heterogeneous: two different types of system and two different ATM libraries were used. The Digital systems used Digital's ATMSOCK library version 1.0, which will become a commercial product [10] (it had not been released at the time of the tests). The RIO2s and the RTPC used the ATMNicLib library [13], which was an efficient implementation reducing overheads by bypassing the kernel (a library of utility functions was called instead of a device driver) and avoiding data copies (the NicStar network interface has direct access to user buffers). Thus the ATM library implementation running on the RIO2s and the RTPC was especially tuned.

Most of the benchmarks, including all the testbed runs, have been run using AlphaStation 200 systems, since newer systems were not available for testing at that moment. The point-to-point results from the newer Digital systems (AlphaServer 4000, AlphaStation 500) were added later.

Minimum overheads correspond to the smallest packet size used in the measurements, which was either 1 byte or 8 bytes. The largest packet size used here was 1024 bytes, although some of the ping-pong tests were additionally run with larger packet sizes.

In the testbed measurements full optimisation was not used. The point-to-point measurements have been done with full optimisation between two AlphaStation 200 systems, between an AlphaStation 500 and an AlphaServer 4000, and between a RIO2 and the RTPC.

The basic implementation of the benchmarks uses Unspecified Bit Rate (UBR) connections and sends at full speed; as ATM has no flow control, the receiver can in some cases lose packets. This is especially apparent in the benchmarks pairs, funnel and push farm. In the funnel benchmarks, when four senders and large packets were used, no meaningful measurements were possible. To avoid the problem of losing packets, Constant Bit Rate (CBR) connections with reduced bandwidth for each sender could be used instead. Tests using CBR have been carried out in addition [11], [12].

The minimum round-trip time divided by two for the RIO2 was around 80-100 microseconds in the ping-pong benchmark, in which both sides receive and send. The same overhead for the Digital AlphaStation 200 is around 200 microseconds. For the newer Digital systems the ping-pong overhead is close to the one obtained on the RIO2s. The performance difference between the measurements with Digital processors of different generations is surprisingly large; it should be attributed not so much to the faster processor technology (communication benchmarks are not very CPU intensive), but to other architectural changes which have taken place during recent years.

For receiving only (benchmark pairs), the RIO2 can achieve a very low receiver overhead, less than 10 microseconds for one byte. Digital Alpha's receiver overhead in pairs is around 20-30 microseconds for one byte. The fast RIO2 results demonstrate the performance gain which can be achieved by investing in low-level design work on ATM drivers [4], compared with directly available commercial implementations.

The ATM link speed was nominally 155 Mbit/s. The maximum user bandwidth, i.e. the link speed minus the control-data overhead, was around 135 Mbit/s. For example, when sending at full speed (UBR) one-directionally between RIO2s, already with a 1024-byte packet size the link speed is almost entirely used (129.6 Mbit/s). However, since the nominal speed of 155 Mbit/s is not very high compared with some other communication technologies of today, it would be interesting to see the effects of ATM connections with higher speed (e.g. 622 Mbit/s).
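For orientation, the 135 Mbit/s figure is consistent with the standard framing overheads of a 155 Mbit/s (OC-3) link; the short sketch below works this out under the usual assumptions (149.76 Mbit/s of cell payload after SONET overhead, 48 payload bytes per 53-byte cell, an 8-byte AAL5 trailer), and is an estimate rather than a measured value.

    #include <stdio.h>

    /* Rough estimate of the usable bandwidth of a 155 Mbit/s ATM link,
     * assuming standard SONET and AAL5 framing overheads (illustration only). */
    int main(void)
    {
        const double cell_stream_mbit = 149.76;        /* after SONET overhead   */
        const double cell_payload = 48.0 / 53.0;       /* payload bytes per cell */
        printf("maximum user bandwidth ~ %.1f Mbit/s\n",
               cell_stream_mbit * cell_payload);       /* ~135.6 Mbit/s          */

        /* User-data rate for 1024-byte packets: payload plus an 8-byte AAL5
         * trailer, padded to a whole number of 48-byte cell payloads. */
        int payload = 1024, trailer = 8;
        int cells = (payload + trailer + 47) / 48;     /* 22 cells               */
        printf("1024-byte packets: ~ %.1f Mbit/s of user data\n",
               cell_stream_mbit * payload / (cells * 53.0));  /* ~131 Mbit/s     */
        return 0;
    }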

4.3.2 Application benchmarks: push farm and pull farm

The performance of the receiver processor in the push farm tests was measured on the upgraded demonstrator of the RD31 project [11]. In this system, sender processors are replaced by ATM traffic generators which emulate the senders. The receiver processor establishes a Constant Bit Rate connection with each sender. The bandwidth of this channel is 1/(number of senders) of the link bandwidth, to avoid congestion in the switching network and the receiver. We measured the maximum event rate that a receiver can handle for various packet sizes, numbers of messages to be grouped and processing times. For small messages, when the total amount of event data does not exceed 2 kBytes, the performance of the system is determined by the software and hardware overhead of the receiver, T_oh,push = 30 µs + 8 µs × Number_of_Senders.

For large messages the maximum event rate is limited by the usable bandwidth of the 155 Mbit/s ATM links. For example, when each of the four senders sends 1 kbyte of data, the total data transmission time is 250 µs and the maximum event rate is 4 kHz.

The measurements for the pull farm benchmark were made on the demonstrator for ATLAS described in [12]. For this implementation of the pull protocol, the overhead to handle small events (less than 2 kBytes) in the receiver is T_oh,pull = 200 µs + 13 µs × Number_of_Senders.

For events bigger than 8 kBytes the link bandwidth limits the maximum event rate per receiver. For intermediate event sizes no simple formula can be derived. For example, when four senders send 1 kbyte of data each, the maximum event rate is about 2.66 kHz, which corresponds to a T_oh,pull of 376 µs.

For both the push farm and the pull farm, when the sum of overhead and processing time is larger than the event transfer time, the maximum event rate is 1/(overhead + processing time). When the data transfer time is dominant, the maximum event rate is 1/(transfer time).
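As an illustration of how these two regimes combine, a minimal sketch with the overhead constants quoted above (and a hypothetical helper name) is given below; it reproduces the 4 kHz figure for the push farm with four senders and 1 kbyte fragments.

    #include <stdio.h>

    /* Maximum event rate from the rules quoted in the text: the larger of
     * (overhead + processing time) and the transfer time dominates. */
    static double max_event_rate_hz(double overhead_us, double processing_us,
                                    double transfer_us)
    {
        double busy_us = overhead_us + processing_us;
        double limit_us = (busy_us > transfer_us) ? busy_us : transfer_us;
        return 1e6 / limit_us;
    }

    int main(void)
    {
        int senders = 4;
        double t_oh_push = 30.0 + 8.0 * senders;   /* 62 us receiver overhead     */
        double transfer_us = 250.0;                /* 4 x 1 kbyte on the ATM link */
        printf("push farm, 4 x 1 kbyte: %.1f kHz\n",
               max_event_rate_hz(t_oh_push, 0.0, transfer_us) / 1e3);  /* ~4.0 */
        return 0;
    }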

5 Results

5.1 An overview of the parameters

We have condensed the large number of different benchmark results into a few meaningful parameters, as shown in Figure 6. A complete set of results is available from the web site (http://www.cern.ch/RD11/combench/results.html).

The parameters are the following:

In ping-pong we define one parameter: the overhead, derived by dividing the round-trip time for the smallest packet size by two. This parameter represents the latency of the communication.

Latency has also been measured in two-way. However, in this measurement the time has not been divided by two, since both nodes are sending and receiving data simultaneously. Note that the two-way latency is larger than in ping-pong, since the setup times of both sending and receiving are included.

In the pairs benchmark we define two parameters: overhead and effective bandwidth. The overhead is the one-directional communication time for the smallest packet. The effective bandwidth has been calculated from this benchmark (and not from the previous ones), since in one-directional communication the speed is not limited by waiting for an acknowledgement each time a message is sent. We extract the effective bandwidth using the one-kilobyte packet as a reference. It should be noted that this is in most cases not the upper limit for bandwidth; some systems achieve substantially higher bandwidth only with packets much larger than 1 kbyte.

A parameter describing the broadcasting capabilities of the system has been extracted from the outfarming and broadcast benchmarks by dividing the outfarming ratio by the broadcasting ratio. The ratios have been calculated by dividing the time for 2 × N nodes by the time for N receiving nodes, from 2 to 4, 4 to 8 and 8 to 16 processors where possible, and taking the average. The parameter shows how well broadcasting has been implemented in each system; the larger the number, the more efficient the broadcast is compared to the outfarming performance of the same system.
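As a sketch of how this parameter could be computed from raw timings, the fragment below uses a hypothetical helper and invented example values (times at 2, 4, 8 and 16 receiving nodes); it is not taken from the measured data.

    #include <stdio.h>

    /* Average scaling ratio t(2N)/t(N) over the available node-count doublings. */
    static double avg_ratio(const double t[], int n)
    {
        double sum = 0.0;
        for (int i = 0; i + 1 < n; i++)
            sum += t[i + 1] / t[i];
        return sum / (n - 1);
    }

    int main(void)
    {
        /* Hypothetical times (us) at 2, 4, 8 and 16 receiving nodes. */
        double t_outfarm[]   = {20.0, 42.0, 90.0, 190.0};
        double t_broadcast[] = {20.0, 26.0, 34.0, 45.0};
        /* Broadcast parameter: outfarming ratio divided by broadcast ratio;
         * larger values mean broadcast is implemented more efficiently. */
        printf("broadcast parameter = %.2f\n",
               avg_ratio(t_outfarm, 4) / avg_ratio(t_broadcast, 4));
        return 0;
    }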

The funnel (and push-farm) benchmark represents a typical data collection approach, in which a number of nodes send their data to one receiver. A typical example has been chosen: four nodes sending a 1 kbyte packet each to one node. The cycle time to complete the operation, i.e. the inverse frequency, is given as a parameter. The results of funnel and push-farm differ from each other only slightly. We present mainly the results from funnel, since it has been run on a larger number of systems. Where up-to-date funnel results have not been available, push-farm results were used instead.

Pull-farm represents another type of data collection, in which first a read request is sent. Also here a typical example of four senders and one receiver has been chosen. In pull-farm the packet sizes have been fixed: 64 bytes for the control message requesting the data and 1 kbyte for the actual data. The cycle time, i.e. the inverse frequency, is presented.

The current implementations of the push-farm and pull-farm communication benchmarks allow, in principle, overlap between computation and communication. They use non-blocking communication primitives to start gathering fragments belonging to the (n + 1)-th event before starting the calculations on the n-th event. Neither benchmark attempts to request fragments belonging to more than one future event in advance. This is partly because only a few implementations of the communication layer allow multiple outstanding send or receive operations between two processes at a time.
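This overlap mechanism can be illustrated with a minimal sketch of a push-farm receiver loop written against MPI non-blocking receives; the sketch is ours (the per-technology low-level interface of the package differs), and compute_dummy, the rank assignment and the buffer sizes are hypothetical.

    #include <mpi.h>

    #define NSENDERS 4
    #define FRAGSIZE 1024

    /* Hypothetical stand-in for the dummy calculation of 0/100/400/1600 us. */
    static void compute_dummy(double microseconds) { (void)microseconds; }

    /* Push-farm receiver: the receives for event n+1 are posted before the
     * calculation on event n starts, so communication and computation can
     * overlap; no fragments beyond one future event are requested. */
    static void push_farm_receiver(int nevents, double d_us)
    {
        static char buf[2][NSENDERS][FRAGSIZE];   /* double-buffered fragments */
        MPI_Request req[2][NSENDERS];
        int cur = 0;

        for (int s = 0; s < NSENDERS; s++)        /* receives for the first event */
            MPI_Irecv(buf[cur][s], FRAGSIZE, MPI_BYTE, s + 1, 0,
                      MPI_COMM_WORLD, &req[cur][s]);

        for (int n = 0; n < nevents; n++) {
            int nxt = 1 - cur;
            if (n + 1 < nevents)                  /* prefetch event n+1 */
                for (int s = 0; s < NSENDERS; s++)
                    MPI_Irecv(buf[nxt][s], FRAGSIZE, MPI_BYTE, s + 1, 0,
                              MPI_COMM_WORLD, &req[nxt][s]);

            MPI_Waitall(NSENDERS, req[cur], MPI_STATUSES_IGNORE); /* event n done */
            compute_dummy(d_us);                  /* "analysis" on event n */
            cur = nxt;
        }
    }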

Architecture        ping-pong  two-way   pairs    pairs    broadcast  funnel      pull-farm   overlap
                    latency    latency   overhd   bandw.   ratio      cycle time  cycle time
                    (µs)       (µs)      (µs)     (MB/s)              (µs)        (µs)        (%)

ATM, RIO              104.8      108.9      9.1     16.2      2.05      250 **      376 **       -
ATM, DEC               85.4       90.3     33.3     13.6      -           -           -          -
Cray T3E                4.8        5.5      5.0     57.4      1.55       35.7        57.2       12
Cray T3E (MPI)         21.7       34.4     19.2     19.1      1.67      243.7       349.6        0
DEC MC                  6.4        7.8      3.7     31.2      1.92       83.3 *      90.0       71
DEC MC (MPI)           26.5       51.3     21.3     11.4      1.28        -           -          -
DEC 8400                4.0        7.8      4.5     32.2      1.15      128.3 *     138.3        0
DEC 8400 (MPI)         13.3       22.9     11.0     13.1      1.60        -           -          -
GPMIMD                  6.6        -       13.1      3.1      1.05     1132.4         -          -
IBM SP2 (MPI)          74.2       88.5     31.0     10.8      1.56        -           -          -
Matra, Raceway          8.8       12.5     10.0     59.0      -          47.5         -          -
Meiko (Channel)        20.3       34.4     22.4     11.4      1.88      124.0 *     172.0       78
Meiko (MPI)           128.5      137.0    102.0      6.2      1.23        -           -          -
Parsytec              217.0      354.0    188.0      3.7      1.50     1103           -          -
SCI, DEC                9.9       14.9      8.3     38.5      -           -           -          -
SCI, RIO/DEC           12.5       18.1      9.9     21.2      -          76.6 *      98.3       61
SGI Origin              7.1       12.7      8.8     31.3      1.25      178.3       207.2        -
SGI Challenge          12.9       20.1     14.0     12.2      1.23      262.6         -          -
SGI Chall. (MPI)       66.1       81.4     34.8      5.8      -         546.4         -          -

Figure 6: Parameters extracted from the benchmark results. Push-farm results were used instead of funnel when either funnel results were not available or the push-farm results were considerably newer; these push-farm results are marked with an asterisk (*). The push-farm and pull-farm results for ATM, marked with a double asterisk (**), have been obtained from slightly different measurements and are described in more detail in section 4.3.2.

We parameterised the observed amount of overlap by 1 − a/t_{r,0} (in per cent), where a is the (application-specific) communication overhead, and t_{r,0} is the time spent by the receiver for gathering fragments (communication) with d = 0 (no calculation). The overhead a was obtained as the average of t_{r,d} − d for different values of d, that is, the overall excess over the time d that is required for calculating alone. We used results from measurements in which four senders relayed fragments of 1024 bytes each to one receiver. The overlap presented in the table has been taken from the push-farm.
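A minimal sketch of this parameter extraction is shown below; the helper is hypothetical and the example values are invented, not measured data (the measured push-farm times would be used in practice).

    #include <stdio.h>

    /* Overlap parameter (in per cent) from push-farm receiver times t_{r,d}
     * measured at computation times d, following 1 - a/t_{r,0} with
     * a = average of (t_{r,d} - d).  All times in microseconds. */
    static double overlap_percent(const double t_r[], const double d[], int n)
    {
        double a = 0.0;
        for (int i = 0; i < n; i++)
            a += t_r[i] - d[i];
        a /= n;
        return 100.0 * (1.0 - a / t_r[0]);  /* t_r[0] corresponds to d = 0 */
    }

    int main(void)
    {
        /* Invented example values, not measured results. */
        double d[]   = {0.0, 100.0, 400.0, 1600.0};
        double t_r[] = {120.0, 160.0, 440.0, 1640.0};
        printf("overlap = %.0f %%\n", overlap_percent(t_r, d, 4));
        return 0;
    }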

5.2 Discussion of the results

The results provide a large amount of data. The parameters extracted from them attempt to compress this large amount of benchmark data into a few meaningful numbers, which can be used to compare the communication performance of different systems.

From the parameters in Figure 6 a number of observations can be made. The latencies vary considerably, from a few to a few hundred microseconds. The same observation applies to the overhead in the pairs benchmark. The lowest overheads are not necessarily obtained by the tightly coupled shared memory systems, even though that might be expected: for example, Digital's Memory Channel reaches less than 4 µs and the Cray T3E 5 µs.

The bandwidths measured by sending a kilobyte packet vary between 3.7 and 59.0 MB/s, although many of the systems can do better with a larger packet size.

MPI results were obtained on multiple systems on which lower-level APIs were also available. The latencies and overheads with MPI, at least with the current MPI versions, are large, typically 3-6 times larger than with the lower-level APIs. On the other hand, since MPI is available on multiple platforms, it provides portability; it can be debated whether the difference in performance justifies the additional programming work.

The overlap results show the best overlap for the Meiko CS-2, as is explicable from its powerful per-node communication co-processors. The AlphaServer 8400 occupies the other end of the spectrum as a typical SMP multiprocessor, with no overlap at all (and low absolute latencies at the same time). Some other observed overlaps must be attributed to artifacts of the intermediate software layers, which stemmed from different authors, and are not so easily explained.

The scalability of the benchmarked systems does not appear strongly from the parameters. This is partly due to the fact that in many of the benchmarked configurations only a few processors were available. A few of the systems, the Silicon Graphics Origin and the Cray T3E, could be tested with a large number of nodes. These systems demonstrated quite good communication network scalability up to the tested 44 (SGI) and 64 (Cray) processors.

The large number of benchmark results has been obtained over a time span of more than a year. During that time a number of hardware and software upgrades took place, so the results do not necessarily represent the most up-to-date situation. In addition, the maximum number of processors presented in some benchmark results depended on our access to the systems.

It would be expected that in the pairs benchmark the overhead would be consistently smaller than the latency in ping-pong, since only the receiving setup time is present in addition to the transfer time, and no flow control from the receiver back to the sender is used. However, in some cases the pairs overhead is larger. In one case this behaviour can be explained by background load (the tested systems could not always be dedicated); in another there was a configuration change during some of the measurements. The difference is in these cases quite small, and the suspicious times are also at the lower end (mostly less than 10 µs, with few exceptions), so the statistical error of the measurements might also influence the results.

The parameters in Figure 6 represent critical aspects of the systems, but are in no way sufficient to generate all measured benchmark results. They may be seen, however, as parameters that can be used in a model.

The current implementation does a memory copy at each end of the data transfer, which is not optimal for some of the technologies, for example for shared memory.

6 Summary

Benchmark suites such as Parkbench (http://www.netlib.org/parkbench/) measure the multiprocessor performance of a system by running a set of predefined applications or kernels of applications. Only a small part of the Parkbench suite deals with communication, however. We feel that our benchmark suite can serve as a useful tool in comparing the raw communication performance of parallel systems, as it is available to application-level programs.

There is a large number of results available for these benchmarks. This makes it possible to compare the communication performance of different systems widely. Many of the latest-generation parallel technologies have been measured.

Several of the systems show communication overheads below 10 µs for small packets. Some of the systems have additionally proven good scalability, which has been tested in some cases with up to 64 processors. Given the good scalability of some of the systems within the tested range, it can be expected that scaling to hundreds of processors will, either already now or at least in the near future, also be quite efficient, for example with some of the tightly coupled parallel systems or with shared memory systems using a NUMA (non-uniform memory access) architecture.

The communication parameter most typical for our application, pull-farm with four senders, is completed in some systems in around 60-90 µs. The best result for the push-farm, for the same number of processors and 1 kbyte of data, is around 40 µs. We consider these numbers promising for our trigger work.

7 Future work

The benchmarks described in this document have been run on a large number of different systems. This provides an extensive set of results, from which information about the different communication networks can be extracted. However, new and faster systems constantly arrive on the market; we intend to subject them to the same procedures.

Parallel systems are evolving, too. Many of the main vendors are developing systems based on clustering shared memory systems, in which each node thus is a multiprocessor system itself. This kind of two-level (or more) communication hierarchy creates new challenges for future releases of the communication benchmarks.

8 Acknowledgements

We would like to thank Digital Equipment Corporation for their close co-operation during this benchmark work.

We would like to thank the Center for Scientific Computing (CSC) in Finland for the usage of their supercomputer systems.

We would also like to thank Irakli Mandjavidze (DSM/DAPNIA), Andreu Pacheco (CERN), Denis Calvet (DSM/DAPNIA) and the CERN RD31 group for technical assistance, and for the opportunity to use the ATM switch and related hardware in building the testbed for the measurements.

In addition, we thank the following persons, who have contributed to running the benchmarks: Igor Zacharov (Silicon Graphics), Raynald Huaulme (Matra Systemes & Information), Iosif Legrand (DESY), Ruud van Wijk (NIKHEF), Roger Heeley (CERN) and John Apostolakis (CERN).

References

[1] ATLAS Technical Proposal. CERN/LHCC/94-43, 1994.

[2] ATLAS Level-2 Trigger Groups, ATLAS Second-Level Trigger Options. CHEP'97 conference proceedings, http://sgi.ifh.de/CHEP97/paper/paper/466.ps.

[3] J. Apostolakis et al., Abstract Communication Benchmarks on Parallel Systems for Real-time Applications. CHEP'97 conference proceedings, http://sgi.ifh.de/CHEP97/paper/paper/460.ps.

[4] Private communications from Irakli Mandjavidze and Denis Calvet.

[5] R. B. Gillett. Memory Channel Network for PCI. IEEE Micro, February 1996. http://www.digital.com:80/info/hpc/ref/gillett ieee.pdf

[6] Digital Equipment Corporation. TruCluster Production Server Software. MEMORY CHANNEL Application Programming Interfaces. Part Number AA-QTN4B-TE, September 1996. http://www.unix.digital.com/faqs/publications/cluster doc/PS MC API/TOC.HTM#TOC

[7] Dolphin Interconnect Solutions A.S., Oslo, Norway. PCI-SCI Bridge Functional Specification. Version 3.1 (confidential), November 1996.

[8] IEEE Computer Society. IEEE Standard for Scalable Coherent Interface (SCI). IEEE Std 1596-1992 (recognised as an American National Standard (ANSI)), August 1993.

[9] IEEE Computer Society. Physical layer Application Programming Interface for the Scalable Coherent Interface (SCI PHY-API). IEEE Std P1596.9/Draft 0.41b, March 23, 1997. http://sci.lbl.gov/sciapi/draft/book041b.pdf

[10] Digital Equipment Corporation. Digital UNIX Native ATM Application Programming Interface. Programmer's reference for PVC operations. Version 2.0, September 16, 1996.

[11] M. Costa et al. Lessons from ATM-based event builder demonstrators and challenges for LHC-scale systems. Proceedings of the Second Workshop on Electronics for LHC Experiments, Balatonfüred, Hungary, 23-27 September 1996.

[12] D. Calvet et al. Operation and Performance of an ATM based Demonstrator for the Sequential Option of the ATLAS Trigger. Proceedings of the Tenth IEEE Real Time Conference, Beaune, France, 21-26 September 1997.

[13] D. Calvet et al. Performance Analysis of ATM Network Interfaces for Data Acquisition Applications. Proceedings of the Second International Data Acquisition Workshop on Networked Data Acquisition Systems. World Scientific Publishing, 1997, pp. 73-80.

[14] A. Clouard et al. CapCASE: A Graphical Development Tool Supporting Scalable, Heterogeneous Multicomputers. Proceedings of the International Conference on Signal Processing Applications and Technology (ICSPAT '96), pp. 873-879, Boston, USA, October 7-10, 1996.
