Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture
Richard P. Martin, Amin M. Vahdat, David E. Culler, and Thomas E. Anderson. Computer Science Division, University of California, Berkeley, CA 94720.
Abstract

This work provides a systematic study of the impact of communication performance on parallel applications in a high performance network of workstations. We develop an experimental system in which the communication latency, overhead, and bandwidth can be independently varied to observe the effects on a wide range of applications. Our results indicate that current efforts to improve cluster communication performance to that of tightly integrated parallel machines result in significantly improved application performance. We show that applications demonstrate strong sensitivity to overhead, slowing down by a factor of 60 on 32 processors when overhead is increased from 3 to 103 µs. Applications in this study are also sensitive to per-message bandwidth, but are surprisingly tolerant of increased latency and lower per-byte bandwidth. Finally, most applications demonstrate a highly linear dependence on both overhead and per-message bandwidth, indicating that further improvements in communication performance will continue to improve application performance.

[Footnote: This work was supported in part by the Defense Advanced Research Projects Agency (N00600-93-C-2481, F30602-95-C-0014), the National Science Foundation (CDA 9401156), Sun Microsystems, California MICRO, Hewlett Packard, Intel, Microsoft, and Mitsubishi. Anderson was also supported by a National Science Foundation Presidential Faculty Fellowship. The authors can be contacted at {rmartin, vahdat, culler, tea}@cs.berkeley.edu.]

1 Introduction

Many research efforts in parallel computer architecture have focused on improving various aspects of communication performance. These investigations cover a vast spectrum of alternatives, ranging from integrating message transactions into the memory controller [5, 10, 29, 41] or the cache controller [1, 20, 32], to incorporating messaging deep into the processor [9, 11, 12, 17, 22, 23, 36, 40], integrating the network interface on the memory bus [7, 31], providing dedicated message processors [6, 8, 37], providing various kinds of bulk transfer support [5, 26, 29, 37], supporting reflective memory operations [7, 21], and providing lean communication software layers [35, 43, 44]. Recently, we have seen a shift to designs that accept a reduction in communication performance to obtain greater generality (e.g., Flash vs. Dash), greater opportunity for specialization (e.g., Tempest [38]), or a cleaner communication interface (e.g., T3E vs. T3D). At the same time, a number of investigations are focusing on bringing the communication performance of clusters closer to that of the more tightly integrated parallel machines [3, 12, 21, 35, 43]. Moving forward from these research alternatives, a crucial question to answer is how much the improvements in communication performance actually improve application performance.

The goal of this work is to provide a systematic study of the impact of communication performance on parallel applications. It focuses on a high performance cluster architecture, for which a fast Active Message layer has been developed for a low latency, high bandwidth network. We want to quantify the performance impact of our communication enhancements on applications and to understand whether they have gone far enough. Furthermore, we want to understand which aspects of communication performance are most important. The main contributions of this work are (i) a reproducible empirical apparatus for measuring the effects of variations in communication performance for clusters, (ii) a methodology for a systematic investigation of these effects, and (iii) an in-depth study of application sensitivity to latency, overhead, and bandwidth, quantifying application performance in response to changes in communication performance.

Our approach is to determine application sensitivity to machine communication characteristics by running a benchmark suite on a large cluster in which the communication layer has been modified to allow the latency, overhead, per-message bandwidth, and per-byte bandwidth to be adjusted independently. This four-parameter characterization of communication performance is based on the LogP model [2, 14], the framework for our systematic investigation of the communication design space. By adjusting these parameters, we can observe changes in the execution time of applications on a spectrum of systems ranging from the current high-performance cluster to conventional LAN-based clusters. We measure a suite of applications with a wide range of program characteristics, e.g., coarse-grained vs. fine-grained and read-based vs. write-based, to enable us to draw conclusions about the effect of communication characteristics on classes of applications.

Our results show that, in general, applications are most sensitive to communication overhead. This effect can easily be predicted from communication frequency. The sensitivity to message rate and data transfer bandwidth is less pronounced and more complex. Applications are least sensitive to the actual network transit latency, and the effects are qualitatively different from what is exhibited for the other parameters. Overall, the trends indicate that the efforts to improve communication performance pay off, and further improvements will continue to improve application performance. However, these efforts should focus on reducing overhead.

We believe that there are several advantages to our approach of running real programs with realistic inputs on a flexible hardware prototype that can vary its performance characteristics. The interactions influencing a parallel program's overall performance can be very complex, so changing the performance of one aspect of the system may cause subtle changes to the program's behavior. For example, changing the communication overhead may change the load balance, the synchronization behavior, the contention, or other aspects of a parallel program. By measuring the full program on a modified machine, we observe the summary effect of the complex underlying interactions. Also, we are able to run applications on realistic input sizes, so we escape the difficulties of attempting to size the machine parameters down to levels appropriate for the small problems feasible on a simulator and then extrapolating to the real case [45]. These issues have driven a number of efforts to develop powerful simulators [38, 39], as well as to develop flexible hardware prototypes [24].

The drawback of a real system is that it is best suited to investigate design points that are "slower" than the base hardware. Thus, to perform the study we must use a prototype communication layer and network hardware with better performance than what is generally available. We are then able to scale back the performance to observe the "slowdown" relative to the initial, aggressive design point. By observing the slowdown as a function of network performance, we can extrapolate back from the initial design point to more aggressive hypothetical designs. We have constructed such an apparatus for clusters using commercially available hardware and publicly available research software.

The remainder of the paper is organized as follows. After providing the necessary background in Section 2, Section 3 describes the experimental setup and our methodology for emulating designs with a range of communication performance. In addition, we outline a microbenchmarking technique to calibrate the effective communication characteristics of our experimental apparatus. Section 4 describes the characteristics of the applications in our benchmark suite and reports their overall communication requirements, such

[Figure 1: LogP Abstract Machine. The figure shows P processor/memory modules (PM) connected by an interconnection network of limited capacity (at most ⌈L/g⌉ messages to or from a processor). The LogP model describes an abstract machine configuration in terms of four performance parameters: L, the latency experienced in each communication event, o, the overhead experienced by the sending and receiving processors, g, the gap between successive sends or successive receives by a processor, and P, the number of processors/memory modules.]

... communicate by point-to-point messages is characterized by four parameters (illustrated in Figure 1):

L: the latency, or delay, incurred in communicating a message containing a small number of words from its source processor/memory module to its target.

o: the overhead, defined as the length of time that a processor is engaged in the transmission or reception of each message; during this time, the processor cannot perform other operations.

g: the gap, defined as the minimum time interval between consecutive message transmissions or consecutive message receptions at a module; this is the time it takes for a message to cross through the bandwidth bottleneck in the system.

P: the number of processor/memory modules.

L, o, and g are specified in units of time. It is assumed that the network has a finite capacity, such that at most ⌈L/g⌉ messages can be in transit from any processor or to any processor at any time. If a processor attempts to transmit a message that would exceed this limit, it stalls until the message can be sent without exceeding the capacity limit.

The simplest communication operation, sending a single packet
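The LogP parameters lend themselves to simple back-of-the-envelope cost estimates. The sketch below is illustrative only, not part of the paper's apparatus, and the parameter values are invented; it computes the usual LogP quantities: the end-to-end time of a single small message (2o + L), the completion time of a burst of back-to-back sends (spaced by at least max(o, g)), and the network capacity bound (⌈L/g⌉).

```python
from math import ceil

def one_way_time(L, o):
    """End-to-end time for one small message: send overhead,
    network latency, then receive overhead (2o + L)."""
    return o + L + o

def burst_send_time(n, L, o, g):
    """Time until the last of n back-to-back messages is received.
    Successive sends are spaced by at least max(o, g); the final
    message then needs L to cross the network and o to be received."""
    spacing = max(o, g)
    return (n - 1) * spacing + o + L + o

def capacity_limit(L, g):
    """At most ceil(L/g) messages may be in transit to or from
    any single processor at a time."""
    return ceil(L / g)

# Illustrative values (microseconds): o = 3, g = 6, L = 10
print(one_way_time(10, 3))           # 16
print(burst_send_time(4, 10, 3, 6))  # 3*6 + 3 + 10 + 3 = 34
print(capacity_limit(10, 6))         # ceil(10/6) = 2
```

Such estimates make the paper's headline result intuitive: because o is paid on every send and every receive, raising o from 3 to 103 µs multiplies the per-message cost directly, whereas added latency L can often be overlapped with computation.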