
On the Implementation and Evaluation of Berkeley Sockets on the Maestro2 Cluster Computing Environment

Ricardo Guapo, Shinichi Yamagiwa and Leonel Sousa

INESC-ID, Rua Alves Redol 9, Apartado 13069, 1000-029 Lisboa, Portugal
Instituto Superior Técnico, Av. Rovisco Pais, 1049-001 Lisboa, Portugal
{guapo, yama, las}@sips.inesc-id.pt

Abstract

The support of "legacy protocols" on cluster environments is important to avoid rewriting the code of applications, but this support should not prevent achieving the maximum communication performance. This paper addresses this issue by implementing the Berkeley sockets interface over the MMP message passing protocol, which is the lower layer of the Maestro2 cluster communication network. Experimental results show that MMP-Sockets offers a minimum latency of 25 µs and a maximum throughput of 1250 Mbps. These values correspond to a relative increase of 80% in latency and a decrease of about 30% in throughput with respect to communication based only on MMP. However, MMP-Sockets increases the compatibility and portability of the developed applications.

1. Introduction

Performance extracted from off-the-shelf components is the key to designing efficient commodity cluster computers. For parallel applications in clusters, it is important to reduce the overheads in the network hardware, such as the copy operations between user application and driver buffers and also the excessive TCP header processing for local communication.

To achieve the potential network performance of cluster computer hardware, dedicated software is applied to implement the communication interface among an application's parallel processes. An example of such specific networks is the Myrinet [10] network hardware and its dedicated low-level protocols GM [1], PM [4] and BIP [8]. GM, PM and BIP optimize the communication with the Myrinet network hardware and implement a zero-copy network interface, by accessing directly, to send and receive messages, the user memory space that is locked by the OS (pindowned). These protocols touch the hardware directly from the user application to bypass the thick conventional network protocol layers. Thus, such communication software is able to achieve almost 100% of the potential network hardware performance.

To take advantage of such high performance protocols, the applications need to use this proprietary communication software. When the applications have to be moved to a different platform, namely to a different cluster, a serious problem arises in keeping the code compatible, namely in what concerns communication. The programmer then has to rearrange the communication part of the applications for the new dedicated communication software.

To address this portability/compatibility problem of applications on different cluster hardware, the MPI standard was proposed [9]. MPI defines a standard API for a message passing library. It is suitable for parallel applications for which the communication size and timing are defined statically, such as matrix computation, the Fast Fourier Transform (FFT) and LU decomposition. However, not all programs demanding high performance communication have precise knowledge about the amount of data transferred, e.g. HTTP-based video transcoding [6] and large databases.

Even if ported to MPI, those kinds of applications still require a TCP connection to clients outside the cluster. This forces two communication methods/libraries, TCP and MPI, to be used together, which could be avoided if TCP were already available at high performance for intra-cluster communication as the standard unified protocol. Therefore, an effort has recently started to bring cluster high performance communication to applications that use TCP, by implementing libraries which enable the execution of those applications in clusters [3].

This paper is focused on the design and implementation of a fully compatible TCP communication library over dedicated communication software that can achieve high performance communication for cluster computing. For that, the Berkeley sockets interface was implemented over the MMP message passing protocol, which is the lower layer of the Maestro2 cluster communication environment. Moreover, we discuss the variation of the communication performance with the change of the parameters in our socket implementation.

The paper is organized as follows: the next section presents an overview of the high performance network technology Maestro2, with the dedicated communication software called MMP. Section 3 presents the design and implementation of the Berkeley sockets API over the MMP communication library. Section 4 presents experimental results and evaluates the relative performance of the newly developed library. Section 5 finally concludes this paper.

2. Background

[Figure 1. Maestro network diagram and send operation subactions]

The Maestro2 network [11][12] is an intelligent network that has been developed to address the overheads of intra-cluster communication. The Maestro network is supported by a dedicated communication protocol, implemented as a message passing library, called MMP.

Maestro2 network: The Maestro2 network is composed of network interfaces (NI) and a Switch Box (SB), as shown in the diagram of figure 1. Each network interface is connected via two Low Voltage Differential Signalling (LVDS) [5] cables for transmission and reception, and is connected to a commodity computer, such as a personal computer or a workstation, via a 64bit@66MHz PCI bus. The connection between NI and SB is full-duplex and the peak bandwidth is 3.2 Gbps. Currently, the SB has ports to directly connect up to eight NIs. One or more ports can increase the fan-out of the switch by cascading SBs. The NI includes an NI manager, a PCI interface, network FIFO buffers, a link controller (MLC-X) and an LVDS transmitter/receiver. The NI manager works as a processor element in charge of handling communications, and consists of a PowerPC603e@300MHz and 64 Mbyte of SDRAM. The PCI interface maps the address space of the SDRAM and part of the host processor's memory into the PowerPC's address space. 8-Kbyte network FIFO buffers store incoming and outgoing messages. The MLC-X is a full duplex link layer controller on which continuous network burst transfer is implemented. MLC-X supports two communication channels between the network FIFO buffers. Finally, the LVDS transmitter/receiver drives the physical medium under control of the MLC-X. It transmits and receives data via its 3.2 Gbps full duplex link. The PCI interface, network buffers and MLC-X are implemented in a Virtex-II FPGA (Field Programmable Gate Array) chip [13]. The SB consists of four SB interfaces, an SB manager and a switch controller. Each SB interface manages two ports and includes a message analyzer. The message analyzer extracts the header of each incoming message and passes it to the SB manager. The SB manager consists of a PowerPC603e, 32 Mbyte of SDRAM, and a routing circuit that generates and writes requests to the switch controller.

Maestro2 Message Passing protocol: To achieve high performance with Maestro2 PC clusters, the low latency and high bandwidth dedicated communication library MMP was developed. MMP consists of a user level library and the Maestro2 communication firmware that directly controls the NI and SB. The two key observations in developing MMP were: i) to avoid unnecessary overhead, so that the communication does not degrade the overall performance of the parallel applications; ii) in accessing the CPU, computation must have higher priority than communication.

Regarding the first issue, MMP implements the zero-copy mechanism on the OS (Operating System), eliminating data copies between user and kernel memory spaces and avoiding system calls. The zero-copy communication technique consists of exchanging messages between the sender's application memory space and the receiver's one directly, by using DMA operations between network hardware buffers and application buffers that are allocated and locked [7] (generally called pindowned).

To deal with the second issue, the MMP communication functions were made non-blocking. In addition, complex communication operations, such as the creation of data chunks, migrate to the physical network. Therefore, the communication code is overlapped with the computation code on the host processor. MMP provides primitive message-based non-blocking communication interface functions (MMP_send(), MMP_recv()) and the corresponding request completion synchronization functions (MMP_send_wait(), MMP_recv_wait()).

MMP uses a connection-less data transmission system based on messages, similar to UDP. The arguments of the sending and receiving functions indicate only the receiver and sender end-point identification, besides message identification parameters. When an application calls the MMP_Initialize() function, MMP provides an individual "port" that identifies the application end-point in the MMP network. Figure 1 presents the sub-actions performed by an MMP send operation: 1) the application creates the message in the pindowned memory; 2) it notifies the send request; 3) the MMP firmware in the NI issues DMA transfers between main memory and the network buffers; 4) the message is transmitted by the link layer to the SB; and 5) the request status is updated to completed.

It was on top of the MMP library that the Berkeley socket interface was implemented in this work. The next section focuses on the implementation aspects.
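To make the MMP programming model of figure 1 concrete, the sketch below shows how an application might post a non-blocking transfer and later synchronize on its completion. Only the function names (MMP_Initialize(), MMP_send(), MMP_recv(), MMP_send_wait(), MMP_recv_wait()) come from the text above; the argument lists, the pindowned-buffer allocator and the constants are assumptions made purely for illustration.

/* Minimal usage sketch of the MMP non-blocking primitives (assumed signatures). */
#include <stdlib.h>

#define MSG_TAG   7               /* assumed message identification parameter      */
#define PEER_PORT 1               /* assumed MMP end-point ("port") of the peer    */
#define BUF_SIZE  (16 * 1024)

extern int   MMP_Initialize(int *my_port);                /* assumed signature     */
extern void *MMP_alloc_pindowned(size_t size);            /* assumed helper        */
extern int   MMP_send(int dst_port, int tag, void *buf, size_t len, int *req);
extern int   MMP_recv(int src_port, int tag, void *buf, size_t len, int *req);
extern int   MMP_send_wait(int req);                      /* completion sync       */
extern int   MMP_recv_wait(int req);

int main(void)
{
    int my_port, sreq, rreq;
    MMP_Initialize(&my_port);            /* obtain this end-point's MMP port       */

    /* Messages must live in pindowned (locked) memory so the NI can DMA them. */
    char *out = MMP_alloc_pindowned(BUF_SIZE);
    char *in  = MMP_alloc_pindowned(BUF_SIZE);

    MMP_send(PEER_PORT, MSG_TAG, out, BUF_SIZE, &sreq);   /* returns immediately   */
    MMP_recv(PEER_PORT, MSG_TAG, in,  BUF_SIZE, &rreq);

    /* ... computation overlapped with the ongoing transfers ... */

    MMP_send_wait(sreq);                 /* synchronize on request completion      */
    MMP_recv_wait(rreq);
    return 0;
}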

3. Berkeley Socket Implementation

[Figure 2. MMP-Sockets layered model and main components: Berkeley interface; core engine with the core data structure (connection information: file descriptors, MMP endpoint, connection state), the reception wait queue and the MMP transmission buffers; MMP automation layer; and the MMP and TCP communication protocols]

The Berkeley Socket compatible communication library (MMP-Sockets) was implemented over MMP, using the C and C++ programming languages.

To provide full compatibility with Berkeley sockets, MMP-Sockets must support the functions which interact with file descriptors. Since the concept of a file descriptor does not exist in MMP, functions to create and maintain file descriptors had to be designed in order to keep compatibility. Dedicated data structures are used to manage all connections and also the data transmissions over MMP, while assuring data integrity.

MMP-Sockets works in user mode. This improves performance, because context switching between user mode and kernel mode is almost never required for communication operations.

3.1. Structure

The implementation presents a layered structure, to maximize compatibility and portability and to allow easier future developments, such as porting to other message passing protocols.

MMP-Sockets was implemented using a three-layer stack, as depicted in gray in figure 2: the upper layer interfaces with the application, providing the API calls; the middle layer, the core engine, accommodates all the required data structures and the associated methods to perform the communication tasks; the lowest layer interfaces with MMP and automatizes some MMP procedures.

Inside the core engine, methods are provided to access the data structures, and they are categorized in two main classes: data managing methods and data consulting methods. Managing methods are intended for the insertion, removal or change of information in the data structure, while consulting methods are used to browse, search and retrieve information from the data structure. Managing methods thus perform the maintenance of the data required to communicate, while consulting methods assist the communication procedure itself and can therefore be included in the class of transmission methods. The transmission methods class is orthogonal to the managing and consulting classes, including all the methods of the consulting class and only the "change of information" maintenance methods of the managing class. Transmission methods can be located at the interface level between the core engine and the transmission protocols used, TCP and MMP. Each implementation code is in charge of managing its own data structure, to allow the establishment and the use of the MMP channel, together with its state. The state reveals the capability to use (or not) the MMP channel to transmit the application data. When special events that imply a change of the MMP channel state occur, control messages are exchanged between the communication partners to inform of or request a state change.

The interface layer uses the methods and data structures provided by the core engine layer to fulfill the communication task. When MMP use is disabled, it can interface directly with the TCP protocol. It includes not only the Berkeley sockets API but also fork(), dup(), dup2(), close(), execve(), shutdown() and exit().

The communication operation is assured by the two main transport layers. To assist MMP communication, there is a middleware layer that automatizes some of the more repetitive and demanding MMP tasks (e.g. channel creation). MMP transmission data structures, such as request handlers and pindowned buffers, are maintained by this layer.

3.2. Layers

The core engine presents a main data structure to manage the connections and the MMP transmissions. This main data structure is a list of records where each record represents an open regular connection with an optional companion MMP channel. The list is implemented as a fixed size array of records to avoid memory allocation overheads. In each record, all information relevant to the connection and its status is preserved: the file descriptors associated with the socket, the MMP endpoint information and the connection state. Each socket connection is uniquely identified by the first associated file descriptor.

As the send() operation is blocking, there is no need for a waiting queue to send data. However, reception is an asynchronous operation and more data can arrive than was requested, so a waiting queue had to be implemented. Data received by the implementation waits in the reception queue until there is an appropriate receive request from the application.

Transmission and reception buffers are required for MMP data transmission. These buffers are managed at the middleware/automation layer. Their size and number were experimentally determined by extensive testing, whose results are presented in the experimental results section. Along with the buffers come the necessary controlling structures, which include the request handlers that manage the MMP transmissions. These structures allow each transmission to be monitored independently and, by that, each buffer to be used more efficiently. The MMP automatization layer is responsible for delivering the received MMP messages in order to the upper core engine.
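The C sketch below illustrates one possible shape for the connection record and the fixed-size table just described. The paper only specifies which pieces of information each record holds (the associated file descriptors, the MMP endpoint and the channel state), that a reception waiting queue exists, and that the list is a fixed-size array; all names, types and sizes here are assumptions.

/* Illustrative sketch of the core-engine connection table (assumed names/sizes). */
#include <stddef.h>

#define MAX_CONNECTIONS 64            /* assumed capacity of the fixed array        */
#define MAX_DUP_FDS      8            /* fds that may reference the same connection */

enum mmp_channel_state {              /* can the MMP channel carry application data? */
    CHANNEL_DISABLED,                 /* e.g. socket shared after fork(): use TCP    */
    CHANNEL_ENABLED
};

struct recv_queue_entry {             /* data received ahead of a recv() request     */
    struct recv_queue_entry *next;
    size_t len;
    char   data[];
};

struct connection_record {
    int    fds[MAX_DUP_FDS];          /* first entry uniquely identifies the record  */
    int    nfds;
    int    mmp_peer_port;             /* MMP end-point of the remote side            */
    enum mmp_channel_state state;
    struct recv_queue_entry *recv_queue_head;   /* reception waiting queue           */
};

/* Fixed-size array of records: avoids memory-allocation overhead on the data path. */
static struct connection_record conn_table[MAX_CONNECTIONS];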
1) Managing methods: Each MMP-Sockets API function, when called, uses the available methods to perform specific tasks, such as: i) resource allocation to initialize the MMP-Sockets environment; ii) record insertion, iii) record modification and iv) record removal to manage the open connections list.

i) Resource allocation: The MMP-Sockets environment initialization, e.g. data structure allocation and MMP initialization, is performed when the socket() function is called. Data structure de-allocation and MMP endpoint release are done when the application exits normally through exit() or loads another binary image through execve(). An application always keeps the same MMP port, and it performs the MMP initialization and MMP shutdown operations only once.

ii) Record insertion: Insertion into the list of open connections is only performed by the accept() and connect() functions. During the initial connecting sequence, the necessary information is exchanged and used to fill an entry in the list. If both end points can use MMP between them, this is registered in the corresponding record. Upon a sending or receiving operation, the record is checked to verify the possibility of using the MMP transmission channel. If MMP is disabled in one of the applications, or the information exchange fails, the establishment of the MMP companion channel is aborted and the regular socket is used for this particular connection.

iii) Record modifiers: There are special interface functions that, when called, represent a special event and might change the state of the MMP channel, such as: fork(), dup()/dup2(), close(), shutdown(), execve(). As a result of a fork() operation, a socket becomes shared by more than one process, which invalidates the use of MMP for data transmissions, and the normal TCP socket connection is used to exchange data. When either the parent or the child closes its end, MMP channel use is enabled again for the remaining connection. The dup() and dup2() functions duplicate file descriptors, so it is necessary to keep track of the file descriptors referencing the connections managed by MMP-Sockets. The effects of shutdown() are registered in the corresponding record to avoid unnecessary transmissions. The execve() function forces the complete shutdown of the MMP-Sockets environment: the MMP environment resources are de-allocated, while the normal TCP sockets are kept open, as they will be passed on to the new binary image.

iv) Record removal: The close() function is the only one that clears records in the list of open connections. When an application closes the only file descriptor referring to a socket which has a record in the list, procedures are taken to close the MMP channel along with the socket connection, and the record is removed from the list.

2) Transmission methods: The methods that manage the other data structures, such as the waiting queues and the transmission buffers, are called when the interface transmission functions send() and recv() are executed. Upon a send() request from an application, the relevant connection record is consulted to confirm the possibility of using the MMP transmission. If MMP transmission is enabled, the application data to be sent is packed into the send buffers along with the MMP-Sockets header and transmitted to the receiver. When the amount of data is greater than the message payload size, the data is fragmented into several messages. To respond to a recv() request, data is retrieved from the associated waiting queue; if the queue is empty, it is necessary to wait for data to be received.

3.3. Performance improvements

The size property of an MMP message is rounded up to a multiple of 32, as the granularity of the Maestro2 transmission system is 32 bytes. However, the applications' send() and recv() calls do not comply with this granularity. In order not to reduce the useful payload size, which would decrease performance, the TAG identification property of the MMP message header is filled with the correct size, so that the receiver can ignore the stuffing bytes.

A performance enhancing strategy was adopted for the MMP reception procedure. To avoid using a fixed size for the exchanged messages, or a rendez-vous approach to communicate the message size, both of which degrade performance, messages with a size equal to or smaller than the size expressed in the request are considered a match whenever all other parameters match. This allows any MMP message with a size smaller than or equal to the buffer size to be received in any buffer. The use of wildcards for the sender endpoint identification parameter allows the use of any buffer in any of the created MMP channels, maximizing the resources' usage. The disadvantages are that the maximum size of the MMP messages is equal to the size of the individual buffers used and that all buffers must be of the same size.
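A minimal sketch of the padding strategy just described, assuming illustrative names: the length handed to MMP is rounded up to the 32-byte Maestro2 granularity, while the exact payload length travels in the message TAG so that the receiver can discard the stuffing bytes.

/* Padding helper sketch for the 32-byte Maestro2 granularity (assumed names). */
#include <stddef.h>

#define MMP_GRANULARITY 32

static inline size_t mmp_wire_size(size_t payload_len)
{
    /* round up to the next multiple of 32 bytes */
    return (payload_len + MMP_GRANULARITY - 1) & ~(size_t)(MMP_GRANULARITY - 1);
}

/* Example: send(fd, buf, 11) -> wire size 32, TAG carries 11.                  */
/* On reception, only TAG bytes are copied out to the application, e.g.:        */
/*     size_t useful = tag;              /- true length, not wire length -/     */
/*     memcpy(user_buf, pindowned_buf, useful);                                 */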

[Figure 3. MMP-Sockets transmission sub-actions example: i) send(sockfd, "Hello World!", 11); ii) copy into a send buffer; iii) MMP-Sockets header added; iv) MMP transport; v) recv(sockfd, string, 5); vi) delivery to the application]

3.4. Transmission procedure

The following paragraphs detail the steps involved in a complete transmission of data between two applications using the MMP-Sockets communication library. The main actions are depicted in figure 3. It is assumed in the following presentation that both applications successfully completed the MMP-Sockets initialization and were able to establish a connection between them.

When the sending application/process calls the send() function (figure 3, i), a procedure that filters the send request is initiated, to check whether the send() can use the MMP protocol. If use of the MMP channel is allowed, the transmission is processed by the MMP channel, starting with data packing. The application data is copied and packed into a send buffer (ii), along with the MMP-Sockets header (iii), and sent to the receiver. Each message is sent (iv) by issuing an MMP_send() request, and the buffer can be reused when the request status is confirmed as completed.

When the receiving application calls the recv() function (v), the filtering procedure is also performed. The waiting queue is checked to see if there is already pending data to be delivered: if it exists, it is passed to the application and removed from the queue; if not, the call waits for a message with data relative to this request. As soon as it is received, the data is delivered to the application (vi).
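Putting sections 3.2 to 3.4 together, the following condensed C sketch shows one possible shape of the send() path: filter the request against the connection record, fall back to the regular TCP socket when the MMP channel cannot be used, and otherwise fragment the user data into pindowned send buffers, each carrying an MMP-Sockets header. It builds on the record sketch given after section 3.2; the helper functions, the header layout and the MMP signatures are assumptions, and the wait after every fragment keeps the sketch serial, whereas the actual library pipelines across several buffers.

/* Illustrative send() path sketch (assumed helpers, header layout and signatures). */
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

struct mmps_header { size_t total_len; size_t frag_len; };     /* assumed layout    */
struct send_buffer { struct mmps_header hdr; char payload[]; };

#define PAYLOAD_SIZE (16 * 1024 - sizeof(struct mmps_header))  /* one 16-KB buffer  */

extern struct connection_record *lookup_record(int fd);        /* consulting method */
extern struct send_buffer *acquire_send_buffer(void);          /* pindowned buffer  */
extern int MMP_send(int port, int tag, void *buf, size_t len, int *req);
extern int MMP_send_wait(int req);
extern ssize_t tcp_send_fallback(int fd, const void *buf, size_t len);

ssize_t mmps_send(int fd, const void *buf, size_t len)
{
    struct connection_record *rec = lookup_record(fd);
    if (rec == NULL || rec->state != CHANNEL_ENABLED)
        return tcp_send_fallback(fd, buf, len);     /* regular TCP socket path      */

    size_t sent = 0;
    while (sent < len) {                            /* fragment large requests      */
        size_t chunk = len - sent;
        if (chunk > PAYLOAD_SIZE)
            chunk = PAYLOAD_SIZE;

        struct send_buffer *sb = acquire_send_buffer();
        sb->hdr.total_len = len;
        sb->hdr.frag_len  = chunk;
        memcpy(sb->payload, (const char *)buf + sent, chunk);

        int req;
        MMP_send(rec->mmp_peer_port, (int)chunk,    /* TAG carries the exact length */
                 sb, sizeof(sb->hdr) + chunk, &req);
        MMP_send_wait(req);   /* serial here; the real library overlaps fragments   */
        sent += chunk;
    }
    return (ssize_t)sent;
}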
4. Results - Performance Evaluation

This section describes the most relevant aspects of the experiments performed in order to tune and evaluate the performance of the proposed MMP-Sockets implementation on the Maestro2 system.

The first set of experiments is performed to determine the best values for the buffer parameters, namely quantity and size. These are important parameters: if the buffers are too large, they consume unnecessary resources and can affect performance, while very small buffers increase fragmentation and transmission overhead, decreasing the overall performance. The buffer quantity can also affect performance: with too few buffers the waiting time for resources will surely increase, while a very large quantity of buffers might waste an important resource, the physical host memory.

The second experiment concerns the performance loss due to the mandatory copy operations between the pindowned buffers (PB) of the new library and the user application buffers (UB). Due to this memory-to-memory copy operation, MMP-Sockets can be classified as a one-copy communication library, and the experiment evaluates this overhead and concludes how to minimize it.

A last experiment presents the comparison between the MMP and the MMP-Sockets performance, using the optimum values for the parameters found in the previous experiments. It allows evaluating the performance loss due to the stacking of the compatibility-increasing library (MMP-Sockets) on top of the high performance communication library (MMP).

4.1. Tools and Experimental Setup

The NetPIPE [2] benchmark was used to evaluate the performance of both the MMP communication library and the implemented MMP-Sockets communication library, as well as the performance of the memcpy() operation inside the Maestro2 hosts. NetPIPE is a communication performance evaluation tool that measures the time taken to migrate a message with a fixed amount of data, while implementing a ping-pong pattern. It also allows probing the system by adding a small perturbation to each fixed size message, to detect anomalies due to buffer boundaries. NetPIPE has a clear layered architecture offering a common API, which makes its use independent of the communication protocol, e.g. MMP.

The experimental results described in this section were obtained using two Maestro2 hosts with the NICs connected directly to each other. The host machine for the Maestro2 network interfaces consisted of a dual 1.0 GHz Intel Pentium-3, a Serverworks HE-SL chipset motherboard with a 64-bit PCI bus working at 66 MHz, and 512 Mbytes of PC133 SDRAM as main memory.
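For reference, the fragment below illustrates the ping-pong measurement pattern that NetPIPE applies; it is not NetPIPE's code, just a hypothetical sketch of the idea, written against the Berkeley sockets API that MMP-Sockets exposes. Half of the averaged round-trip time approximates the one-way latency, and the block size divided by that time gives the throughput for that block size.

/* Ping-pong timing sketch (illustrative only, not NetPIPE itself). */
#include <stdio.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/socket.h>

static void pingpong(int sockfd, char *buf, size_t len, int repeats)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < repeats; i++) {
        send(sockfd, buf, len, 0);                    /* goes through MMP-Sockets  */
        size_t got = 0;
        while (got < len) {                           /* the peer echoes the block */
            ssize_t r = recv(sockfd, buf + got, len - got, 0);
            if (r <= 0) return;
            got += (size_t)r;
        }
    }
    gettimeofday(&t1, NULL);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    double one_way_us = secs / repeats / 2.0 * 1e6;
    double mbps = (8.0 * len * repeats * 2.0) / secs / 1e6;
    printf("%zu bytes: %.1f us one-way, %.1f Mbps\n", len, one_way_us, mbps);
}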

4.2. Buffer Analysis

During the implementation, the communication buffers were a key factor for the achievable performance. Experimental analysis allows discovering the best set of values for the number of buffers and their size. The analysis was performed by exhaustively testing several pairs of values, to understand whether they affect the performance independently or not. The pairs of values are obtained by continuously varying the number of buffers between 1 and 16, and the size of the buffers over all the powers of two between 2^10 and 2^20 bytes. These numbers refer to an individual operation, sending or receiving, as the two are similar.

[Figure 4. Throughput performance of MMP-Sockets with 16-KB buffers, for 1, 2, 15 and 16 buffers]

[Figure 5. MMP-Sockets throughput performance using 4 buffers, for buffer sizes from 1-KB to 1-MB]

Figure 4 presents the most relevant results of multiple runs with the buffer size fixed at 16-KB. The lines for the buffer quantities from three to fourteen are not drawn because they present a behavior similar to the lines for two, 15 and 16 buffers. The observed minimum latency is approximately equal for all buffer quantities, meaning that the buffer quantity does not influence the minimum latency, which is about 25 µs.

All lines present a similar behavior for data block sizes which do not trigger fragmentation; for data block sizes which do get fragmented, the performance differs depending on whether there is one PB or multiple PBs. When using more than one buffer, performance increases rapidly once the transmission starts to use more than one buffer, due to the pipelining of the transmission stages. To conclude the analysis of figure 4, it can be noticed that, when using only one buffer, the performance decays for sizes slightly larger than the sizes for which the peak performance is achieved. This effect is due to the serialization of the time consuming transmission operations (figure 3, ii, iv, vi), as will be observed in the second experiment. Furthermore, the performance improves by increasing the number of buffers, but with 16 buffers the queue processing overhead already decreases the peak performance. Thus, the optimal buffer quantity for this environment is 15 buffers.

The graph in figure 5 plots the lines combining all the buffer sizes (powers of 2) with 4 buffers. All lines present a similar behavior for small messages, and so the minimum latency is approximately equal for all buffer sizes, meaning that the buffer size does not influence the minimum latency (about 25 µs). For small PB sizes, saturation is reached very soon due to the need for fragmentation, and even the possibility of using the four buffers concurrently is not enough for very large messages. The fragmentation, and the implied saturation, creates such a bottleneck that, even with pipelined transmission, the performance does not rise above 700 Mbps for the 2-KB PB size. This is even lower than the performance obtained with the serial transmission that uses only one large PB. Only for PB sizes greater than 8-KB is full advantage taken of the pipelined transmission. The lines plotted in figure 5 clearly demonstrate that the PB size influences the point at which pipelined transmission becomes effective. From the multitude of tests performed, the sizes of 16-KB and 32-KB show the highest average throughput. The throughput results for these two buffer sizes (16-KB and 32-KB) are presented in figure 6, while the minimum latency for small messages is about 25 µs in both cases.

From the analysis of these results, it can be concluded that MMP-Sockets should use fifteen buffers of 16-KBytes and, if possible, adapt the buffer size to 32-KBytes when applications request single transmissions of more than 512-KBytes. From the latency point of view, there is no significant difference between the two buffer sizes. Since using buffers with different sizes does not allow taking advantage of the strategies described in section 3.3, 16-KBytes was adopted as the optimal value for the buffer size.

4.3. Intra-host copy operations analysis

This experiment analyzes the cause of the drop in throughput performance around data sizes of 128-KB when using a PB larger than the transmitted data, to force serialization, as visible in figure 5.

[Figure 6. Throughput of MMP-Sockets for buffer sizes of 16-KB and 32-KB]

The transmission time, or latency, for a message of size s is given by the time used in the MMP transmission of the message plus the time used in the implementation for the inherent processing by the sending and receiving nodes. An analysis can be performed to compare the three time consuming terms of equation (1) below, where T_oh represents the sum of the fixed processing overheads of sender and receiver:

T_lat(s) = T_MMP(s) + T_cpy^{UB->PB}(s) + T_cpy^{PB->UB}(s) + T_oh    (1)

Table 1 presents such an analysis, where T_cpy(s) represents the time measured when performing a memcpy() operation of size s between PBs; T_lat(s) represents the time for the full transmission of a block of size s by the implementation, using only one PB of 1-MB to force serialization; T_MMP(s) is the time spent in the transmission of a block of size s by the MMP library; T_cpy/T_MMP represents the relation between the time spent performing the memcpy() operation and the time spent in the MMP transmission of a message of the same size; and 2*T_cpy/T_lat represents the ratio between the overall time spent in memcpy() operations and the transmission time using MMP-Sockets. This last column of table 1 indicates the relative importance of the memcpy() operations in the serial transmission time (1). All times were measured using the appropriate NetPIPE implementation.

Table 1. Times spent in the main transmission stages for a message of size s

  Size (Bytes)   T_MMP (µs)   T_cpy (µs)   T_lat (µs)   T_cpy/T_MMP   2*T_cpy/T_lat
  32                   13.4          0.3         26.4            2%              2%
  16 k                 80.9         64.2        164.0           79%             78%
  64 k                285.3        276.7        583.2           97%             95%
  512 k              2188.0       2439.8       6831.4          112%             71%
  1 M                4363.6       4901.9      13872.0          112%             71%
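As a quick check of how the last two columns of Table 1 follow from the measured times, take the 64 k row:

\frac{T_{cpy}}{T_{MMP}} = \frac{276.7}{285.3} \approx 0.97 \;(97\%),
\qquad
\frac{2\,T_{cpy}}{T_{lat}} = \frac{2 \times 276.7}{583.2} \approx 0.95 \;(95\%)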

The times presented in table 1 show that the time spent in the memcpy() operation, performed inside the hosts involved, becomes equal to or higher than the time spent in the MMP transmission for messages larger than 16-KB. This means that more than 70% of the total transmission time is spent in the copy operations inside each host.

It is from the transfer block size of 64-KB onwards that the memcpy() operation starts to play an important role in the loss of performance: the time spent in the memcpy() is of the same order of magnitude as the MMP transmission latency, making it approximately two thirds of the time spent in a serial transmission. In figure 7, which presents the MMP throughput performance, the MMP transmission time grows linearly after the saturation point, shifting the cause of the loss of performance to the other operations in the transmission. The implementation time presents two main components: the execution of the code that processes the request and the copy operation between the UB and the PB. The request processing operations consist mainly of look-up operations, so their overhead remains constant whatever the transmitted size.

The analysis above concludes that the use of very large pindowned buffers in the interface with the MMP layer should be avoided, since it leads to a set of side effects that cause a loss of performance: it serializes the main stages of the transmission, reducing the achievable performance, and it exposes the performance drop caused by the memcpy() operation, which has a relevant overhead for transfer sizes above 16-KB.

[Figure 7. MMP-Sockets and original MMP performance]

4.4. Performance of MMP-Sockets versus original MMP layers

Figure 7 depicts the throughput achieved by the MMP-Sockets library and by the MMP communication system. For small data transfers, the transmission time doubles, reducing the offered performance to about 50% of the original one. The MMP minimum latency is about 14 µs, while the MMP-Sockets one is about 25 µs. When the transfer size induces the use of pipelining in the transmission, the performance ratio increases to 70% of the original one and, as the transfer size increases even further, this ratio decays smoothly down to 60% of the original performance offered by the MMP layer.

The observation that for small messages the performance ratio stays at 50% is due to the fact that the time used by MMP-Sockets has the same growth rate as the time used in the MMP transmission. Pipelining of the transmission stages allows a boost to 70% of the original performance, mainly due to the overlapping of the copying and transmission operations. As the transfer size increases even more, the parallelization reaches its saturation point and the performance ratio drops to 60%, mostly owing to pipeline saturation. Therefore, with MMP-Sockets implemented over the dedicated communication library of a cluster, the software programmer should keep the messages of each call as large as possible.

5. Conclusion and future work

In this paper a new communication library for connection-oriented communication over a connection-less protocol was proposed, and implementation and performance evaluation results were presented. The developed library, which implements the Berkeley sockets interface on top of the Maestro2 technology, increases the compatibility of the Maestro2 cluster. The results for the new communication library show a significant gain in performance compared with using the kernel system communication library (which normally uses an off-the-shelf Ethernet connection), therefore fulfilling the main goals of this research work. Experimental results show that a maximum throughput of 1250 Mbps is achieved by applications using MMP-Sockets, with a latency of about 25 µs for messages sized up to 128 Bytes. The overhead of MMP-Sockets over the MMP library results in a performance of about 50% to 70% of the original MMP layer. Furthermore, the compatibility with other applications that do not use the new MMP-Sockets communication library is maintained, by using the regular TCP connection. The regular connection also serves as a backup connection in case the Maestro2 environment fails, making the implementation more robust.

The experience acquired from this research work enforces the following ideas: i) in the strategy for fragmenting the application data into several datagrams, the number and size of the datagrams are fundamental and depend on the particular network features; ii) latency is affected by the intrinsic operations required to perform each data transmission request, e.g. look-up functions, and the number of those operations should be minimized to reduce the incurred overhead; iii) data copy operations inside the host memory should be avoided, as they are a major performance drawback, and should be addressed by using zero-copy mechanisms, such as providing the means for the network interface to copy the data directly from its buffers to the application data buffers.

Towards a transparent interface to the application, it would also be important to implement the MMP-Sockets idea as a kernel module. Moreover, this would allow overcoming some limitations imposed by MMP, such as the number of applications using it being limited to the number of available MMP ports. However, it is expected that a kernel module implementation would degrade the transmission latency, due to the increase in the processing overhead, thus relegating this approach in favor of the more efficient solution of a linkable library. Furthermore, the MMP-Sockets data structures and methods are prepared to support, if required, the thread programming paradigm and the datagram-based Berkeley sockets interface functions.

References

[1] G. Brown. GM Reference Manual. Myricom, November 2002.
[2] D. Turner. NetPIPE - Network Protocol Independent Performance Evaluator, June 2004.
[3] M. Fischer. GMSOCKS - A Direct Sockets Implementation on Myrinet. In CLUSTER '01: Proceedings of the 3rd IEEE International Conference on Cluster Computing, page 204, Washington, DC, USA, 2001. IEEE Computer Society.
[4] H. Tezuka et al. PM: a high-performance communication library for multi-user parallel environments, 1996.
[5] IEEE Computer Society. IEEE Standard for Low-Voltage Differential Signaling (LVDS) for Scalable Coherent Interface (SCI). IEEE Standard 1596.3-1996, July 1996.
[6] J. Guo et al. A cluster-based active router architecture supporting video/audio stream transcoding service. In IPDPS '03: Proceedings of the 17th International Symposium on Parallel and Distributed Processing, page 44.2, Washington, DC, USA, 2003. IEEE Computer Society.
[7] J. Corbet, A. Rubini, and G. Kroah-Hartman. Linux Device Drivers, 3rd Edition. O'Reilly and Associates, February 2005.
[8] L. Prylli et al. Modeling of a high speed network to maximize throughput performance: the experience of BIP over Myrinet. In Parallel and Distributed Processing Techniques and Applications (PDPTA'98), 1998.
[9] MPI Forum. MPI: A message-passing interface standard.
[10] Myricom. Myrinet. http://www.myri.com/.
[11] S. Yamagiwa et al. Design and implementation of a message passing library on the Maestro network. In IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, pages 87-90, August 2001.
[12] S. Yamagiwa et al. On the performance of Maestro2 High Performance Network Equipment, Using New Improvement Techniques. In Proceedings of IEEE IPCCC 2004, Phoenix, USA, April 2004.
[13] Xilinx Inc. Virtex-II Platform FPGA Data Sheet, 2002.