Efficient Inter-Core Communications on Manycore Machines

ZIMP: Efficient Inter-core Communications on Manycore Machines

Pierre-Louis Aublin, Sonia Ben Mokhtar, Gilles Muller, Vivien Quéma
Grenoble University, CNRS - LIRIS, INRIA, CNRS - LIG

Abstract: Modern computers have an increasing number of cores and, as exemplified by the recent Barrelfish operating system, the software they execute increasingly resembles distributed, message-passing systems. To support this evolution, there is a need for very efficient inter-core communication mechanisms. Current operating systems provide various inter-core communication mechanisms, but it is not yet clear how they behave on manycore processors. In this report, we study seven mechanisms that are considered state-of-the-art. We show that these mechanisms have two main drawbacks that limit their efficiency: they perform costly memory copy operations and they do not provide efficient support for one-to-many communications. We thus propose ZIMP, a new inter-core communication mechanism that implements zero-copy inter-core message communications and that efficiently handles one-to-many communications. We evaluate ZIMP on three manycore machines, having respectively 8, 16 and 24 cores, using both micro- and macro-benchmarks (a consensus and a checkpointing protocol). Our evaluation shows that ZIMP consistently improves the performance of existing mechanisms, by up to one order of magnitude.

I. INTRODUCTION

Modern computers have an increasing number of cores and, as envisioned by the designers of the recent Barrelfish operating system [1], the software they run increasingly resembles distributed, message-passing systems. Moreover, it is now accepted that one of the major challenges of the next decade for system researchers will be to enable modern manycore systems to tolerate both software [2], [3] and hardware failures [4]. To tolerate failures in such message-passing manycore systems, agreement [5] and broadcast protocols [6], [7] can be used for maintaining a consistent state across cores. Furthermore, checkpointing algorithms [8], [9], [10], [11], [12], [13], [14], [15], [16] can provide a means to resist failures of software distributed across cores. In this area, Yabandeh and Guerraoui have pioneered fault tolerance for manycore architectures with PaxosInside, an adaptation of the Paxos protocol for manycore machines [17].

In order to support efficient execution of such protocols, there is a need for very efficient inter-core communication mechanisms enabling both one-to-one and one-to-many communications. Current operating systems provide various inter-core communication mechanisms. However, it is not yet clear how these mechanisms behave on manycore processors, especially when messages are intended for many recipient cores.

In this report, we study seven communication mechanisms, which are considered state-of-the-art. More precisely, we study TCP, UDP and Unix domain sockets, pipes, IPC and POSIX message queues. These mechanisms are supported by traditional operating systems such as Linux. We also study Barrelfish message passing, the inter-process communication mechanism of Barrelfish OS [1]. We show that all these mechanisms have two main drawbacks that limit their efficiency. Firstly, they perform costly memory copy operations: to send a message, multiple copies of it have to be created in memory. Secondly, they do not provide efficient support for one-to-many communications: to send a message to N receivers, N calls to the send primitive of the communication mechanism need to be performed.

We propose ZIMP (Zero-copy Inter-core Message Passing), a new, efficient, inter-core communication mechanism. More particularly, ZIMP provides a zero-copy send primitive by allocating messages directly in a shared area of memory. It also efficiently handles one-to-many communications by allowing a message to be sent once and read multiple times.

We evaluate ZIMP on three manycore machines, having respectively 8, 16 and 24 cores, using both micro- and macro-benchmarks. Our macro-benchmarks consist of two real-world applications, namely PaxosInside [17] and a checkpointing protocol [8]. The performance evaluation shows that for PaxosInside, ZIMP improves the throughput of the best state-of-the-art mechanism by up to 473%. It also shows that for the checkpointing protocol, ZIMP reduces the latency of the best state-of-the-art mechanism by up to 58%.

The remainder of this report is organized as follows. In Section II, we describe the existing communication mechanisms that can be used for inter-core communications. In Section III, we present ZIMP, our efficient communication mechanism devised for manycore machines. We then evaluate the performance of the state-of-the-art mechanisms and of ZIMP in Section IV. We finally discuss related work in Section V, before concluding the report in Section VI.

II. BACKGROUND

In this section, we start by reviewing seven state-of-the-art inter-core communication mechanisms. We then analyze the drawbacks of all presented mechanisms and discuss the need for a new efficient inter-core communication mechanism.

A. Inter-core communication mechanisms

We review seven inter-core communication mechanisms. Six of them are provided by traditional Linux/Unix operating systems. Among these mechanisms, Unix domain sockets, pipes, IPC message queues and POSIX message queues have been devised for communication between processes residing on the same host. We also review TCP and UDP sockets, which have been designed for communication over an IP network but are often used for communication on the same host. We finally present Barrelfish message passing, a communication mechanism that has been specifically devised for manycore machines.

1) Unix domain sockets: Unix domain sockets are a communication mechanism specifically designed for communications between processes residing on the same host. In order to use Unix domain sockets, the sender and the receiver both create a socket, i.e., a data structure with a pointer to a linked list of datagrams. To send a message, a sender creates a buffer and uses the sendto() system call. Then, the kernel copies the buffer from the user space to the kernel space and adds it to the list. To receive messages, a receiver uses the recvfrom() system call. Finally, the kernel copies the last entry of the list from the kernel space to the user space.

2) TCP and UDP sockets: TCP and UDP sockets are two mechanisms that have been designed to allow processes residing on distinct machines to communicate over an IP network. Nevertheless, these mechanisms can also be used by processes residing on the same host to communicate by using the loopback interface. From a design point of view, UDP and TCP sockets share similarities with Unix domain sockets: they use the same system calls and messages have to be copied from the user to the kernel space, and vice versa. There are nevertheless some differences: UDP and TCP sockets require large messages to be fragmented into smaller packets (the maximum packet size is 16kB using the loopback interface). Moreover, to send a set of packets, TCP and UDP sockets first place them on the sender's socket list. Packets are then pushed by the kernel on the loopback interface. Upon an interrupt, the kernel receives the packets and places them on the receiver's socket list. Note that TCP and UDP do not offer the same interface and do not provide the same guarantees. More precisely, TCP is lossless and stream-oriented, while UDP can lose messages (if the receiver socket list is full) and is datagram-oriented. Finally, UDP datagrams have a maximum size of 65kB. Hence, to send larger messages using UDP, the application needs to fragment them into chunks of 65kB or less, resulting in additional system calls.

3) Pipes: A pipe is a data structure stored in kernel space and containing a circular list of small buffer entries. Each entry has a size of 4kB, which corresponds to the size of a memory page. To send a message using a pipe, a process uses the write() system call. Then, the kernel splits the message into 4kB chunks and copies the chunks from the user space to the pipe's circular list. To receive messages from a pipe, a receiver uses the read() system call. This system call copies the content of the pipe to a user space buffer.

4) IPC and POSIX message queues: IPC and POSIX message queues (noted IPC MQ and POSIX MQ in the following, respectively) are queues of messages residing in kernel space. Although the design of both mechanisms is similar, they exhibit some differences. For instance, an IPC MQ can be used between several senders and several receivers, while a POSIX MQ can be used between multiple senders but only one receiver. To send a message, a sender uses the msgsnd() system call (see footnote 1). Then, the kernel copies the message from the user space to the kernel space and adds the message to the queue. To receive a message, a receiver uses the msgrcv() system call. When a message is present, it is copied from the kernel space to the user space.

Footnote 1: We describe the interface used in IPC message queues. The interface used in POSIX message queues is similar.

5) Barrelfish message passing: Barrelfish message passing (noted Barrelfish MP) is a user-level point-to-point communication mechanism that has been devised for the Barrelfish operating system (see footnote 2). For each sender-receiver pair, there are two circular buffers of messages that act as unidirectional communication channels. The size of these channels, S, is defined when they are created. Messages have a fixed size, which is a multiple of the size of a cache line, and they are aligned on a [...]

Footnote 2: We use the publicly available Barrelfish source code release of March 2011, available at http://www.barrelfish.org.

[...] the N receivers. The second line of the table shows that, to receive a message, the message is copied N times by the N receivers (once for each receiver).
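The one-to-many cost described above (N calls to the send primitive, then one copy per receiver) can be illustrated with a minimal Python sketch over Unix domain datagram sockets. This is not code from the report: the function name `send_to_all` is hypothetical, and `send()`/`recv()` on connected socket pairs stand in for the `sendto()`/`recvfrom()` calls named in the text. It assumes a Unix-like host and a message small enough to fit in the socket buffer.

```python
import socket


def send_to_all(message: bytes, n_receivers: int):
    """Send one message to n_receivers, one Unix domain socket pair each.

    Returns the number of send calls issued and the list of received copies.
    Hypothetical helper for illustration only.
    """
    # One connected datagram socket pair per receiver (assumption: all
    # endpoints live in this process to keep the sketch self-contained).
    pairs = [
        socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
        for _ in range(n_receivers)
    ]
    send_calls = 0
    received = []
    try:
        # The sender has no multicast primitive: it issues one send per
        # receiver, and each send copies the message from user to kernel space.
        for sender_end, _ in pairs:
            sender_end.send(message)
            send_calls += 1
        # Each receiver copies its own instance from kernel to user space.
        for _, receiver_end in pairs:
            received.append(receiver_end.recv(len(message)))
    finally:
        for sender_end, receiver_end in pairs:
            sender_end.close()
            receiver_end.close()
    return send_calls, received
```

Calling `send_to_all(b"hello", 4)` issues four send calls for a single logical message, which is the behavior that a send-once, read-many primitive is meant to avoid.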
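The pipe path from Section II-A.3 (a write() that copies data into the kernel, then a read() that copies it back out) might look like the following Python sketch. The helper name `pipe_round_trip` and the constant `PIPE_ENTRY` are illustrative assumptions; the 4kB value is taken from the text. The sketch assumes the message fits in the pipe so that the write does not block; the splitting into 4kB entries happens inside the kernel and is not visible to the caller.

```python
import os

PIPE_ENTRY = 4096  # size of one pipe buffer entry according to the text (4kB)


def pipe_round_trip(message: bytes) -> bytes:
    """Write `message` into a pipe and read it back in the same process.

    Hypothetical helper; assumes `message` fits in the pipe's capacity.
    """
    read_fd, write_fd = os.pipe()
    try:
        # write(): the kernel copies the data from user space into the pipe.
        written = 0
        while written < len(message):
            written += os.write(write_fd, message[written:])
        # Close the write end so the reader sees end-of-file after the data.
        os.close(write_fd)
        write_fd = None

        # read(): the kernel copies the pipe content into user-space buffers.
        chunks = []
        while True:
            chunk = os.read(read_fd, PIPE_ENTRY)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        os.close(read_fd)
        if write_fd is not None:
            os.close(write_fd)
```

With a 12kB message (three 4kB entries), the data crosses the user/kernel boundary twice: once on the write and once on the read.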
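To make the Barrelfish MP channel description in Section II-A.5 more concrete, here is a single-process Python model of one unidirectional channel: a circular buffer of S fixed-size message slots whose size is a multiple of the cache-line size. This is only a sketch under stated assumptions. The class name `RingChannel`, the 64-byte cache line, and the use of a counter to detect full and empty states are not taken from the report or from the Barrelfish sources, and real channels live in memory shared between cores rather than in one Python object.

```python
CACHE_LINE = 64  # assumed cache-line size in bytes


class RingChannel:
    """Toy model of one unidirectional channel with fixed-size slots."""

    def __init__(self, slots: int, msg_size: int = CACHE_LINE):
        if slots <= 0:
            raise ValueError("channel needs at least one slot")
        if msg_size <= 0 or msg_size % CACHE_LINE != 0:
            raise ValueError("message size must be a multiple of a cache line")
        self.slots = slots        # channel size S, fixed at creation
        self.msg_size = msg_size  # fixed message size
        self.buf = bytearray(slots * msg_size)
        self.head = 0             # next slot to read
        self.tail = 0             # next slot to write
        self.count = 0            # occupied slots (assumed bookkeeping)

    def send(self, payload: bytes) -> bool:
        """Copy payload into the next free slot; return False if full."""
        if len(payload) > self.msg_size:
            raise ValueError("payload larger than the fixed message size")
        if self.count == self.slots:
            return False
        start = self.tail * self.msg_size
        # Pad with zero bytes so every message occupies a whole slot.
        self.buf[start:start + self.msg_size] = payload.ljust(self.msg_size, b"\0")
        self.tail = (self.tail + 1) % self.slots
        self.count += 1
        return True

    def recv(self):
        """Return the oldest message (padded), or None if the channel is empty."""
        if self.count == 0:
            return None
        start = self.head * self.msg_size
        message = bytes(self.buf[start:start + self.msg_size])
        self.head = (self.head + 1) % self.slots
        self.count -= 1
        return message
```

Because the channel is point-to-point, a sender that wants to reach N receivers in this model would need N such channels and N sends, one per receiver.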
