Performance Comparison of LAM/MPI, MPICH, and MVICH on a Linux Cluster Connected by a Gigabit Ethernet Network

Performance Comparison of LAM/MPI, MPICH, and MVICH on a Linux Cluster connected by a Gigabit Ethernet Network. ∗ Hong Ong‡ and Paul A. Farrell† Department of Mathematics and Computer Science, Kent State University, Kent, OH 44242. ‡ [email protected] † [email protected] Abstract usually accomplished through parallel programming libraries such as MPI, PVM [Geist], and BSP [Jonathan]. These environments provide well- We evaluate and compare the performance of LAM, defined portable mechanisms for which concurrent MPICH, and MVICH on a Linux cluster connected applications can be developed easily. In particular, by a Gigabit Ethernet network. Performance statis- MPI has been widely accepted in the scientific par- tics are collected using NetPIPE which show the allel computing area. The use of MPI has broadened behavior of LAM/MPI and MPICH over a gigabit over time as well. Two of the most extensively used network. Since LAM and MPICH use the TCP/IP MPI implementations are MPICH [Gropp, Gropp2] socket interface for communicating messages, it is from Mississippi State University and Argonne Na- critical to have high TCP/IP performance. De- tional Laboratory and LAM [LSC] originally from spite many efforts to improve TCP/IP performance, Ohio Supercomputing Center. LAM is now being the performance graphs presented here indicate that maintained by the University of Notre Dame. The the overhead incurred in protocol stack process- modular design taken by MPICH and LAM has al- ing is still high. Recent developments such as the lowed research organizations and commercial ven- VIA-based MPI implementation MVICH can im- dors to port the software to a great variety of mul- prove communication throughput and thus give the tiprocessor and multicomputer platforms and dis- promise of enabling distributed applications to im- tributed environments. prove performance. Naturally, there has been great interest in the performance of LAM and MPICH for enabling high- performance computing in clusters. Large scale dis- 1 Introduction tributed applications using MPI ( either LAM or MPICH ) as communication transport on a cluster of computers impose heavy demands on communica- Due to the increase in network hardware speed and tion networks. Gigabit Ethernet technology, among the availability of low cost high performance work- others high-speed networks, can in principle pro- stations, cluster computing has become increasingly vide the required bandwidth to meet these demands. popular. Many research institutes, universities, Moreover it also holds the promise of considerable and industrial sites around the world have started price reductions, possibly even to commodity levels, to purchase/build low cost clusters, such as Linux as Gigabit over copper devices become more avail- Beowulf-class clusters, for their parallel processing able and use increases. However, it has also shifted needs at a fraction of the price of mainframes or the communication bottleneck from network media supercomputers. to protocol processing. Since LAM and MPICH use TCP/UDP socket interfaces to communicate mes- On these cluster systems, parallel processing is sages between nodes, there have been great efforts in reducing the overhead incurred in processing the ∗This work was supported in part by NSF CDA 9617541, TCP/IP stacks. However, the efforts have yielded NSF ASC 9720221, by the OBR Investment Fund Ohio Com- only moderate improvement. Since then, many sys- munication and Computing ATM Research Network (OCAR- net), and through the Ohio Board of Regents Computer Sci- tems such as U-Net [Welsh], BIP [Geoffray], and Ac- ence Enhancement Initiative tive Message [Martin] have been proposed to provide the overhead of end-to-end Internet protocols did low latency and high bandwidth message-passing not significantly contribute to the poor network per- between clusters of workstations and I/O devices formance since the latency equation was primarily that are connected by a network. More recently, dominated by the underlying network links. How- the Virtual Interface Architecture (VIA) [Compaq] ever, recent improvements in network technology has been developed to standardize these ideas. VIA and processor speeds has made the overhead of the defines mechanisms that will bypass layers of pro- Internet protocol stacks the dominant factor in the tocol stacks and avoid intermediate copies of data latency equation. during sending and receiving messages. Elimina- tion of this overhead not only enables significant The VIA specification defines mechanisms to avoid communication performance increases but will also this communication bottleneck by eliminating the result in a significant decrease in processor utiliza- intermediate copies of data. This effectively reduces tion by the communication subsystem. Since the latency and lowers the impact on bandwidth. The introduction of VIA, there have been several soft- mechanisms also enable a process to disable inter- ware and hardware implementations of VIA. Berke- rupts under heavy workloads and enable interrupts ley VIA [Buonadonna], Giganet VIA [Speight], M- only on wait-for-completion. This indirectly avoids VIA [MVIA], and FirmVIA [Banikazemi] are among context switch overhead since the mechanisms do these implementations. This has also led to the re- not need to switch to the protocol stacks or to an- cent development of VIA-based MPI communica- other process. tions libraries, noticeably MVICH [MVICH]. The VIA specification only requires control and The rest of this paper is organized as follows: In setup to go through the OS kernel. Users (also Section 2.1, we briefly overview the VIA architec- known as VI Consumers) can transfer their data ture. In Section 2.2, we give a brief description to/from network interfaces directly without oper- of Gigabit Ethernet technology. The testing en- ating system intervention via a pair of send and re- vironment is given in Section 3. In Section 4, we ceive work queues. And, a process can own multiple present performance results for TCP/IP, LAM and work queues at any given time. MPICH. Preliminary performance results using VIA and MVICH on a Gigabit Ethernet network will also VIA is based on a standard software interface and be presented. Finally, conclusions and future work a hardware interface model. The separation of are presented in Section 5. hardware interface and software interface makes VI highly portable between computing platforms and network interface cards (NICs). The software interface is composed of the VI Provider library (VIPL) 2 VIA and Gigabit Ethernet and the VI kernel agent. The hardware interface is the VI NIC which is media dependent. By providing a standard software interface to the network, 2.1 VIA Overview VIA can achieve the network performance needed by communication intensive applications. The VIA [INTEL, INTEL2] interface model was de- VIA supports send/receive and remote direct veloped based on the observation by researchers that memory access (RDMA) read/write types of the most significant overhead was the time required data movements. These operations describe the to communicate between processor, memory, and gather/scatter memory locations to the connected I/O subsystems that are directly connected to a VI. To initiate these operations, a registered descrip- network. In particular, communication time is not tor should be placed on the VI work queue. The scaling to each individual component of the whole current revision of the VIA specification defines the system and can possibly increase exponentially in a semantics of a DMA Read operation but does not cluster of workstations. require that the network interface support it. This communication overhead is caused by the The VI kernel agent provides synchronization by time accumulated when messages move through providing the scheduling semantics for blocking different layers of the Internet protocol suite of calls. As a privileged entity, it controls hardware TCP/UDP/IP in the operating system. In the past, interrupts from the VI architecture on a global, and per VI basis. The VI kernel agent also supports In the event that the receive buffers approach their buffer registration and de-registration. The regis- maximum capacity, a high water mark interrupts tration of buffers allows the enforcement of protec- the MAC control of the receiving node and sends tion across process boundaries via page ownership. a signal to the sending node instructing it to halt Privileged kernel processing is required to perform packet transmission for a specified period of time virtual-to-physical address translation and to wire until the buffer can catch up. The sending node the associated pages into memory. stops packet transmission until the time interval is past or until it receives a new packet from the receiving node with a time interval of zero. It then resumes packet transmission. The high water mark 2.2 Gigabit Ethernet Technology ensures that enough buffer capacity remains to give the MAC time to inform the other devices to shut down the flow of data before the buffer capacity Gigabit Ethernet [GEA], also known as the IEEE overflows. Similarly, there is a low water mark to Standard 802.3z, is the latest Ethernet technology. notify the MAC control when there is enough open Like Ethernet, Gigabit Ethernet is a media access capacity in the buffer to restart the flow of incoming control (MAC) and physical-layer (PHY) technol- data. ogy. It offers one gigabit per second (1 Gbps) raw bandwidth which is 10 times faster than fast Eth- Full-duplex transmission can

Performance Comparison of LAM/MPI, MPICH, and MVICH on a Linux Cluster Connected by a Gigabit Ethernet Network

Designing an Ultra Low-Overhead Multithreading Runtime for Nim

Fiber Context As a ﬁrst-Class Object

Fiber-Based Architecture for NFV Cloud Databases

A Thread Performance Comparison: Windows NT and Solaris on a Symmetric Multiprocessor

Concurrent Cilk: Lazy Promotion from Tasks to Threads in C/C++

The Impact of Process Placement and Oversubscription on Application Performance: a Case Study for Exascale Computing

And Computer Systems I: Introduction Systems

Enabling Technologies for Distributed and Cloud Computing Dr

Eindhoven University of Technology MASTER Acceleration of A

HPX-Agseries

Fibers Are Not (P)Threads the Case for Loose Coupling of Asynchronous Programming Models and MPI Through Continuations

Use of High Perfo Nce Networks and Supercomputers Ime Flight Simulation