Evaluation of Real-Time Fiber Communications for Parallel Collective Operations*

Parvathi Rajagopal¹ and Amy W. Apon²

¹ Vanderbilt University, Nashville, TN 37235, [email protected]
² University of Arkansas, Fayetteville, AR 72701, [email protected]

* This work was supported in part by NSF Research Instrumentation Grant 9617384 and by equipment support from Belobox Systems, Inc., Irvine, California.
Abstract. Real-Time Fiber Communications (RTFC) is a gigabit-speed network that has been designed for damage-tolerant local area networks. In addition to its damage-tolerant characteristics, it has several features that make it attractive as a possible interconnection technology for parallel applications in a cluster of workstations. These characteristics include support for broadcast and multicast messaging, memory cache in the network interface card, and support for very fine-grain writes to the network cache. Broadcast data is captured in the network cache of all workstations in the network, providing a distributed shared memory capability. In this paper, RTFC is introduced. The performance of standard MPI collective communications using TCP protocols over RTFC is evaluated and compared experimentally with that of Fast Ethernet. It is found that the MPI message passing libraries over traditional TCP protocols over RTFC perform well with respect to Fast Ethernet. Also, a new approach that uses direct network cache movement of buffers for collective operations is evaluated. It is found that execution time for parallel collective communications may be improved via effective use of network cache.
1 Introduction
Real-Time Fiber Communications (RTFC) is a unique local and campus area network with the damage-tolerant characteristics necessary for high availability in mission-critical applications. The RTFC standard [5] has been in development over a period of several years, and first-generation products are available through two independent vendors. RTFC arose from the requirements of military surface ship combat systems. Existing communication systems, while fast enough, did not have the necessary fault and damage tolerance characteristics. When constructed in a damage-tolerant configuration, RTFC will automatically reconfigure itself via a process known as rostering whenever damage to a node, link, or hub occurs.
In addition to its damage-tolerant features, RTFC is a ring that uses a connectionless data-link layer with minimum overhead [9]. Unlike switched technologies such as Myrinet and ATM, which may perform multicast and broadcast communications by sending unicast messages to all destinations, RTFC directly supports the multicast and broadcast communications that are important in parallel applications. Therefore, it seems appropriate to explore the advantages that RTFC may have in parallel collective communications, i.e., those that are variants of broadcast and multicast messaging.
The purpose of this paper is to introduce Real-Time Fiber Communications and to evaluate its applicability for parallel collective operations. The evaluation is based on measurements and experimentation on a cluster of workstations that is interconnected via RTFC. The paper is organized as follows.
– Section 2 introduces Real-Time Fiber Communications,
– Section 3 describes the experimental workstation and network platform,
– Section 4 describes the software platform for collective communications,
– Section 5 gives the measurement results, and
– Section 6 gives conclusions and a description of related and future work.
2 Overview of Real-Time Fiber Communications
The basic network topology of RTFC is a set of nodes that are connected via a hub or switch in a simple star configuration. The nodes form a logical ring, and each node has an assigned number in increasing order around the ring. One physical star forms a single non-redundant fault-tolerant segment. The maximum number of nodes in a segment is 254. The distance from a node to a hub or switch is normally up to 300 meters for 62.5 micron fiber-optic media. For applications that do not require high levels of fault tolerance, it is also possible to configure RTFC as a simple physical ring using only the nodes and the network interface cards in the nodes. At the physical layer, RTFC is fully compliant with the FC-0 and FC-1 layers of the Fibre Channel FC-PH specification.
The RTFC ring is a variant of a register insertion ring [1], and guarantees real-time, in-order, reliable delivery of data. During normal operation, packets are removed from the ring by the sending node, i.e., a source removal policy is used. While this limits the maximum throughput on the ring to the bandwidth of the links only, this policy has certain advantages in that sending nodes are always aware of traffic on the entire ring. Since each node can see all of the ring traffic, it is possible for nodes to monitor their individual contribution to ring traffic to ensure that the ring buffers do not fill, and that the ring tour time is bounded. The unique purge node in the segment removes any orphan packets that traverse the ring more than one time.
Bridge nodes may be used to connect segments, and redundant bridge nodes may be used to connect segments to meet the damage-tolerance requirements of a multi-segmented network. For applications that require high levels of fault tolerance, it is possible to construct redundant copies of the ring. By adding channels to the network interface cards at each node and adding additional hubs or switches, up to four physically identical rings may be constructed with the same set of nodes. In this manner, RTFC can be configured to be dual, triple, or quad-redundant for fault tolerance. The communication path from one node to the next higher-numbered node in the ring goes through a single hub. While other hubs receive a copy of all packets and forward them, the receiving node only listens to one hub, usually the one with the lowest number if no damage has been sustained. If a node, hub, or communication link fails in an RTFC segment, then a process known as rostering determines the new logical ring. This ability to reconfigure the route between working nodes, i.e., the logical ring, in real time after sustaining wide-scale damage is a distinguishing characteristic of RTFC.
RTFC supports traditional unicast, multicast, and broadcast messaging between nodes. In addition, RTFC supports the concept of network cache. Network cache is memory in the network interface card that is mapped to the user address space and is accessible to the user either via direct memory access (DMA) or via copy by the CPU. One use of network cache is to provide an efficient hardware implementation of distributed shared memory across the nodes in a segment. Each network interface card can support up to 256MB of network cache. Some portion of network cache is reserved for system use, depending on the version of system software running, but the remaining portion of network cache can be read or written by user processes running on any workstation in the network.
[Figure: two nodes, X and Y, each with a processor/local memory connected via the system bus to an RTFC NIC; a write to the network cache on one node is copied (reflected) over the RTFC ring to the network cache on the other.]
Fig. 1. Operation of the RTFC network cache
Broadcast data written to the network cache of any node appears in the network cache of all nodes on that segment within one ring-tour, as illustrated in Figure 1. The ring-tour time is the time that it takes a micropacket to travel from the source around the ring one time. The total hardware latency to travel from the source around the ring one time can be calculated to be less than one millisecond on the largest ring at maximum distance [5].
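As a rough illustrative check (not taken from the standard), the ring-tour time can be bounded by summing per-hop costs around the ring; the per-node forwarding delay t_node is an assumed symbol here, and each hop traverses up to twice the 300-meter node-to-hub distance:

\[
t_{\mathrm{tour}} \;\lesssim\; N\left(t_{\mathrm{node}} + \frac{2\,d_{\max}}{v}\right),
\qquad N \le 254,\quad d_{\max} \le 300\,\mathrm{m},\quad v \approx 2\times 10^{8}\,\mathrm{m/s}.
\]

Propagation alone contributes at most 254 × 600 m / (2 × 10⁸ m/s) ≈ 0.76 ms, consistent with the sub-millisecond figure cited from the standard [5].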
The data link layer in RTFC is defined by the RTFC standard and replaces the FC-2 layer defined in Fibre Channel. Data at the data link layer is transferred primarily using two sizes of hardware-generated micropackets. For signals and small messages, a fixed-size, 12-byte (96-bit) micropacket is used [9].
The data rate of RTFC can be calculated. RTFC sends at the standard Fibre Channel rate of approximately 1 Gbps. Fibre Channel uses 8B/10B data encoding, giving a user data rate of 1 Gbps × 80% = 800 Mbps. Each 96-bit micropacket carries 32 data bits and is followed by a 32-bit idle, so the percentage of data bits is 32 data bits / (96 bits/micropacket + 32 idle bits) = 25%. Therefore, the user data rate is approximately 200 Mbps. For longer messages, a high-payload micropacket longer than 12 bytes is supported in hardware that obtains up to 80% efficiency of the theoretical fiber-optic bandwidth. All testing was performed with first-generation hardware that only supports 96-bit micropackets.
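Written out, the calculation above is:

\[
1\,\mathrm{Gbps}\times\frac{8}{10} = 800\,\mathrm{Mbps}\ \text{(8B/10B encoding)},
\qquad
\frac{32\ \text{data bits}}{96\ \text{micropacket bits} + 32\ \text{idle bits}} = 25\%,
\]
\[
800\,\mathrm{Mbps}\times 0.25 = 200\,\mathrm{Mbps}\ \text{user data rate for 96-bit micropackets}.
\]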
3 Experimental Workstation and Network Platform
The platform used in this study consists of three dual-processor VMEbus-based workstations, connected by a 3Com SuperStack II Hub 100 TX Fast Ethernet hub for the 100 Mbps tests and by a VMEbus RTFC gigabit FiberHub. Each workstation is identified by an Ethernet network IP address and host name, and similarly by an RTFC network IP address and host name.
A simple block diagram of the workstation hardware is shown in Figure 2. Each of the three workstations has a Motorola MVME3600 VME processor module, as outlined by the dashed line in the figure. The MVME3600 consists of a processor/memory module and a base module, assembled together to form a VMEbus board-level computer. The base module and processor/memory module are both 6U VME boards. The combination occupies two slots in a VME rack. The processor/memory module consists of two 200 MHz PowerPC 604 processors and 64MB of memory. The processor modules have a 64-bit PCI connection to allow mating with the MVME3600 series of baseboards. The base I/O module consists of a Digital 2114X 10/100 Ethernet/Fast Ethernet interface and video, I/O, keyboard, and mouse connectors. The MVME761 transition module (not shown in Figure 2) provides connector access to the 10/100BaseT port. A P2 adapter (not shown in Figure 2) routes the Ethernet signals to the MVME761 transition module. Each workstation is connected to the RTFC network via a V01PAK1-01A FiberLAN SBC network interface, each with 1MB of SRAM network cache memory.
[Figure: the MVME3600 attaches to the VME backplane through P1/P2 connectors. The base module carries the DEC 2114X NIC, SCSI, and VME bridge on the PCI bus; the FiberLAN NIC with its network cache attaches through the 64-bit PCI connector; the processor/memory module carries CPU 0, CPU 1, and memory on the processor bus.]
Fig. 2. Simple block diagram of the workstation architecture
System software on the workstations is the IBM AIX 4.1.5 operating system. The MPICH implementation of the MPI Message Passing Interface is chosen as the application platform [4]. MPICH network communication that is based on TCP/IP can be made to run over Fast Ethernet or over RTFC, depending on the host names that are specified in the procgroup configuration file (a sample file is sketched after this paragraph). The MPICH implementation of MPI provides unicast point-to-point communications, and broadcast and multicast collective communications. MPICH collective communications are implemented using point-to-point operations, which are in turn implemented through message passing libraries built for the specific hardware. The performance of the collective operations in MPICH is enhanced by the use of asynchronous non-blocking sends and receives where possible, and by using a tree-structured communication model where possible.
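For illustration, a ch_p4-style procgroup file of the kind described above might read as follows; the host names and program path are hypothetical, and choosing the RTFC host names (rather than the Ethernet ones) is what routes MPICH's TCP/IP traffic over RTFC:

    local 0
    node2-rtfc 1 /home/user/coll_bench
    node3-rtfc 1 /home/user/coll_bench

Each non-local line names a host, the number of processes to start there, and the full path of the executable; substituting the Fast Ethernet host names runs the identical program over Fast Ethernet.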
In addition to TCP/IP, messaging is provided over RTFC via the Advanced MultiProcessor/Multicomputer Distributed Computing (AmpDC) environment [2], a product of Belobox Systems, Inc. However, an implementation of the AmpDC middleware for RTFC was not yet available for AIX. At the time of this study, a partial implementation of some RTFC primitive components for AIX was available, including a function to write data directly to network cache memory from local workstation memory using DMA or CPU access, and a function to read data directly from network cache to local workstation memory. Also available were two functions that return a pointer to the first and last available long-word location of user-managed network cache, respectively.
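The primitive names an_busWrite and an_busRead appear in Figure 3; the prototypes below, and the names of the two pointer-returning functions, are illustrative assumptions rather than the vendor's actual API:

    #include <stddef.h>

    /* Copy len bytes from local workstation memory into network cache
       (via DMA or CPU copy). Returns 0 on success (assumed convention). */
    int an_busWrite(void *netcache_dst, const void *local_src, size_t len);

    /* Copy len bytes from network cache into local workstation memory. */
    int an_busRead(void *local_dst, const void *netcache_src, size_t len);

    /* First and last available long-word locations of user-managed
       network cache (names hypothetical). */
    long *an_cacheFirst(void);
    long *an_cacheLast(void);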
Standard benchmarking techniques were used throughout this study. All runs were executed for at least one hundred trials. The timings obtained were cleared of outliers by eliminating the bottom and top ten percent of readings. The network was isolated, in that other traffic was eliminated and no other applications were running on the workstations at the time. Each call was executed a few times before the actual timed call was executed, in order to eliminate time taken for setting up connections, loading of the cache, and other housekeeping functions. Measurements were made on both RTFC and Fast Ethernet.
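A minimal sketch of the trimming step described above (drop the bottom and top ten percent of readings, average the rest); this is an illustration, not the authors' measurement harness:

    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Mean of the middle 80% of n timing samples (n >= 10 assumed). */
    double trimmed_mean(double *t, int n)
    {
        int cut = n / 10;            /* samples dropped at each end */
        double sum = 0.0;
        qsort(t, n, sizeof(double), cmp_double);
        for (int i = cut; i < n - cut; i++)
            sum += t[i];
        return sum / (n - 2 * cut);
    }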
4 Software Platform for Collective Communications
An experimental software platform was developed for evaluating the performance of MPI collective communications by using the network cache architecture [8]. Collective calls for broadcast, scatter, gather, allgather, and alltoall were designed to work similarly to their corresponding MPI calls. These calls were implemented on the experimental platform by using the basic RTFC read and write calls.
So that processes do not overwrite each other's data, network cache was partitioned among the processes [6]. Each process writes to its own area of network cache. Each process may read from one or more areas of network cache, depending on the specific collective operation. An initialize function finds the first and last locations of network cache that are available to user applications, and partitions the available memory equally among the processes. The assumption is made that at any given time there is only one parallel application (i.e., MPI program) executing, and that this application has access to all of the network cache that is available to user programs. If there is a need to execute more than one parallel program at a time, then an external scheduler should be used that is responsible for the additional partitioning of network cache that would be required.
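A sketch of the initialize step under the one-application assumption, using the hypothetical an_cacheFirst/an_cacheLast primitives from Section 3; the partitioning arithmetic here is illustrative, not taken from [6] or [8]:

    #include <stddef.h>

    long *an_cacheFirst(void);   /* hypothetical, see Section 3 */
    long *an_cacheLast(void);

    /* One process's slice of user-managed network cache. */
    struct nc_region { long *base; size_t nwords; };

    /* Divide the user-visible network cache into nprocs equal slices;
       the process with the given rank owns the rank-th contiguous slice. */
    struct nc_region nc_partition(int rank, int nprocs)
    {
        long *first = an_cacheFirst();
        long *last  = an_cacheLast();
        size_t total = (size_t)(last - first + 1);   /* in long words */
        size_t share = total / (size_t)nprocs;
        struct nc_region r = { first + (size_t)rank * share, share };
        return r;
    }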
Collective calls are considered complete when the area of network cache used by the process can be reused. Since the experimental platform does not yet support interrupts at the receiving node via the RTFC primitives available at the time of these tests, there is a difficulty in how to signal a receiving node that a message has arrived. In order to demonstrate proof-of-concept, MPI messaging over Fast Ethernet is used in the prototype code to notify a set of nodes of the start of each collective operation and to synchronize nodes between collective operations.
In MPI, processes involved in a collective call can return as soon as their participation in the routine is completed. Completion means that the buffer, i.e., the area of network cache used by the process, is free to be used by other processes. The method for signaling completion of each of the calls varies depending on the operations being performed. In general, the sending processes write to network cache and inform all other processes where to find the data in network cache, the receiving processes read the data from the network cache, and an MPI synchronization call ensures that the buffer is safe before it is re-used.
For example, in the case of broadcast, the root writes the contents of the send buffer to its area in network cache and sends the starting address to all other processes in the group. The other processes receive the address, read the information from network cache, and write it to their local memory. Once the processes have written the message to local memory, they send a completed message to the root, which exits upon receiving the completed message from all receivers.
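A sketch of this broadcast scheme, combining MPI point-to-point notification with the hypothetical an_busWrite/an_busRead primitives from Section 3; the function name, tags, and the assumption that network cache is mapped identically on every node are all illustrative:

    #include <stddef.h>
    #include <mpi.h>

    int an_busWrite(void *netcache_dst, const void *local_src, size_t len);
    int an_busRead(void *local_dst, const void *netcache_src, size_t len);

    #define TAG_ADDR 1
    #define TAG_DONE 2

    /* Broadcast len bytes from root through network cache; my_area is this
       process's partition of network cache (see nc_partition above). */
    void nc_bcast(void *buf, size_t len, int root, void *my_area, MPI_Comm comm)
    {
        int rank, size, p;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == root) {
            an_busWrite(my_area, buf, len);      /* reflected to all NICs */
            for (p = 0; p < size; p++)           /* publish the address */
                if (p != root)
                    MPI_Send(&my_area, sizeof my_area, MPI_BYTE, p,
                             TAG_ADDR, comm);
            for (p = 0; p < size; p++)           /* buffer reusable after this */
                if (p != root)
                    MPI_Recv(NULL, 0, MPI_BYTE, p, TAG_DONE, comm,
                             MPI_STATUS_IGNORE);
        } else {
            void *src;
            MPI_Recv(&src, sizeof src, MPI_BYTE, root, TAG_ADDR, comm,
                     MPI_STATUS_IGNORE);
            an_busRead(buf, src, len);           /* copy from local NIC cache */
            MPI_Send(NULL, 0, MPI_BYTE, root, TAG_DONE, comm);
        }
    }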
As another example, in the case of allgather, each process sends the starting address of its data to all other processes. That is, the address is allgathered to the other processes. Once a process has read the contents of network cache, it sends a completed message to the sending process, which then exits upon receiving the completed messages. Figure 3 illustrates the flow of execution for allgather. Figure 4 illustrates the partitioning of network cache and the use of local memory for allgather. Similar algorithms are implemented for scatter, gather, and alltoall messaging.
[Figure: start → an_busWrite the message to network cache → MPI_Allgather: all processes send and receive the address → an_busRead the message from network cache → MPI_Send read-complete to all proc's → MPI_Recv read-complete from all proc's → stop.]
Fig. 3. Flowchart for the allgather operation
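Continuing the broadcast sketch above (same includes, primitives, and TAG_DONE, plus the identical-mapping assumption), an illustrative version of the allgather in Figure 3; MAX_PROCS is an arbitrary bound added for the example:

    enum { MAX_PROCS = 64 };   /* illustrative bound; size <= MAX_PROCS assumed */

    void nc_allgather(const void *sendbuf, void *recvbuf, size_t len,
                      void *my_area, MPI_Comm comm)
    {
        int rank, size, p, nreq = 0;
        void *addrs[MAX_PROCS];
        MPI_Request reqs[MAX_PROCS];

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        an_busWrite(my_area, sendbuf, len);      /* publish my block */

        /* The address is allgathered to all other processes. */
        MPI_Allgather(&my_area, sizeof my_area, MPI_BYTE,
                      addrs, sizeof(void *), MPI_BYTE, comm);

        for (p = 0; p < size; p++)               /* read every block */
            an_busRead((char *)recvbuf + (size_t)p * len, addrs[p], len);

        /* Read-complete exchange; receives are posted first so the
           zero-byte sends cannot deadlock. */
        for (p = 0; p < size; p++)
            if (p != rank)
                MPI_Irecv(NULL, 0, MPI_BYTE, p, TAG_DONE, comm, &reqs[nreq++]);
        for (p = 0; p < size; p++)
            if (p != rank)
                MPI_Send(NULL, 0, MPI_BYTE, p, TAG_DONE, comm);
        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
    }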
5 Measurement Results for Collective Communication
Performance measurements were conducted for the experimental code running over RTFC, and also for MPI over TCP/IP on both RTFC and Fast Ethernet. The time from the start of the collective call until the completion of the last process is the metric used in this study [7]. Measurements were performed for broadcast, alltoall, allgather, scatter, and gather, for two and three processes.
[Figure: network cache memory partitioned into areas where P0, P1, P2, and P3 each send, alongside local workstation memory on node P0 holding a receive buffer and the data to be sent, connected over the VME bus (Fig. 4, referenced above).]