MigrOS: Transparent Operating Systems Live Migration Support for Containerised RDMA-applications

Maksym Planeta (TU Dresden), Jan Bierbaum (TU Dresden), Leo Sahaya Daphne Antony (AMOLF), Torsten Hoefler (ETH Zürich), Hermann Härtig (TU Dresden)

Abstract

Major data centre providers are introducing RDMA-based networks for their tenants, as well as for operating their underlying infrastructure. In comparison to traditional socket-based network stacks, RDMA-based networks offer higher throughput, lower latency, and reduced CPU overhead. However, RDMA networks make transparent checkpoint and migration operations much more difficult. The difficulties arise because RDMA network architectures remove the OS from the critical path of communication. As a result, the OS loses control over active RDMA network connections, which is required for live migration of RDMA-applications. This paper presents MigrOS, an OS-level architecture for transparent live migration of RDMA-applications. MigrOS introduces changes at the OS software level and small changes to the RDMA communication protocol. As a proof of concept, we integrate the proposed changes into SoftRoCE, an open-source kernel-level implementation of an RDMA communication protocol. We designed these changes to introduce no runtime overhead, apart from the actual migration costs. MigrOS allows seamless live migration of applications in data centre settings. It also allows HPC clusters to explore new scheduling strategies, which currently do not consider migration as an option to reallocate resources.

1 Introduction

Cloud computing is undergoing a phase of rapidly increasing network performance. This trend implies higher requirements on the data and packet processing rate and results in the adoption of high-performance network stacks [10, 12, 13, 23, 64, 72, 75]. RDMA network architectures address this demand by offloading packet processing onto specialised circuitry of the network interface controllers (RDMA NICs). These RDMA NICs process packets much faster than CPUs.

User applications communicate directly with the NICs to send and receive messages using specialised RDMA APIs, like IB verbs.¹ This direct access minimises network latency, which made RDMA networks ubiquitous in HPC [37, 45, 50, 67] and increasingly common in the data centre context [30, 56, 66]. As a result, major data centre providers already offer RDMA connectivity to end-users [1, 2].

¹IB verbs is the most common low-level API for RDMA networks.

Similarly, containers have become ubiquitous for lightweight virtualisation in data centre settings. Containerised applications do not depend on the software stack of the host, thus greatly simplifying distributed application deployment and administration. However, RDMA networks and containerisation are at odds when employed together: the former tries to bring applications and the underlying hardware "closer" to each other, whereas the latter facilitates the opposite. This paper, in particular, addresses the migratability of containerised RDMA-applications through OS-level techniques.

The ability to live-migrate applications has long been available for virtual machines (VMs) and is widely appreciated in cloud computing [25, 28, 38, 59, 63]. We expect live migration to become even more popular with the growth of disaggregated [26, 33], serverless [73], and fog computing [61]. In contrast to VMs, containerised applications share the kernel, and thus their state, with the host system. In general, it is still possible to extract the relevant container state from the kernel and restore it on another host later on. This recoverable state includes open TCP connections, shell sessions, and file locks [9, 42]. However, the state of RDMA communication channels is not recoverable by existing systems, and hence applications using RDMA cannot be checkpointed or migrated.

To outline the conceptual difficulties involved in saving the state of RDMA communication channels, we compare a traditional TCP/IP-based network stack and the IB verbs API (see Figure 1). First, with a traditional network stack, the kernel fully controls when the communication happens: applications need to perform system calls to send or receive a message. In IB verbs, because of direct communication between the NIC and the application, the OS has no communication interception points, except for tearing down the connection. Although the OS can stop a process from sending further messages, the NIC may still silently change the application state. Second, part of the connection state resides at the NIC and is inaccessible to the OS. Creating a consistent checkpoint is impossible in this situation.

Figure 1: Traditional (left) and RDMA (right) network stacks. In traditional networks, the user application triggers the NIC via the kernel (1). After receiving a packet, the NIC notifies the application back through the kernel (2). In RDMA networks the application communicates directly with the NIC (3) and vice versa (4) without kernel intervention. Traditional networks require a message copy between application buffers and NIC-accessible kernel buffers. RDMA NICs can access the message buffer in the application memory directly.

In this paper, we propose MigrOS, an architecture enabling transparent live migration of containerised RDMA-applications at the OS level. We identify the missing hardware capabilities of existing RDMA-enabled NICs required for transparent live migration. We augment the underlying RoCEv2 communication protocol to update the physical addresses of a migrated container transparently. We modify a software RoCEv2 implementation to show that the required protocol changes are small and do not affect the critical path of the communication. Finally, we demonstrate an end-to-end live migration flow of containerised RDMA-applications.

2 Background

This section gives a short introduction to containerisation and RDMA networking. We further outline live migration and how RDMA networking obstructs this process.

2.1 Containers

In Linux, processes and process trees can be logically separated from the rest of the system using namespace isolation. Namespaces allow process creation with an isolated view on the file system, network devices, users, etc. Container runtimes leverage namespaces and other low-level kernel mechanisms [3, 52] to create a complete system view without external dependencies. Considering their close relation, we use the terms container and process interchangeably in this paper. A distributed application may comprise multiple containers across a network: a Spark application, for example, can run the master and each worker in an isolated container, and an MPI [29] application can containerise each rank.

2.2 Infiniband verbs

The IB verbs API is today's de-facto standard for high-performance RDMA communication. It enables applications to achieve high throughput and low latency by accessing the NIC directly (OS-bypass), avoiding unnecessary memory movement (zero-copy), and delegating packet processing to the NIC (offloading).

Figure 2: Primitives of the IB verbs library. Each queue pair (QP) comprises a send and a receive queue and has multiple IDs; node-global IDs (GUID, GID, LID; grey in the figure) are shared by all QPs on the same node.

Figure 2 shows the IB verbs objects involved in communication. Memory regions (MRs) represent pinned memory shared between the application and the NIC. Queue pairs (QPs), comprising a send queue (SQ) and a receive queue (RQ), represent connections. To reduce memory footprint, multiple QPs can replace their individual RQs with a single shared receive queue (SRQ). Completion queues (CQs) inform the application about completed communication requests. A protection domain (PD) groups all these IB verbs objects together and represents the process address space to the NIC.
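For illustration, the following sketch (not taken from the paper) shows how an application creates these objects with the standard libibverbs calls; device selection and error handling are reduced to a minimum.

    /* Minimal sketch: creating the IB verbs objects described above with
     * the standard libibverbs API. The first device is used and errors
     * are only checked with asserts. */
    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        assert(devs && num > 0);

        struct ibv_context *ctx = ibv_open_device(devs[0]);        /* device context    */
        struct ibv_pd *pd = ibv_alloc_pd(ctx);                     /* protection domain */
        struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0); /* completion queue  */

        size_t len = 1 << 20;                                      /* 1 MiB buffer */
        void *buf = aligned_alloc(4096, len);
        memset(buf, 0, len);

        /* Memory region: pins the buffer and hands the mapping to the NIC. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE |
                                       IBV_ACCESS_REMOTE_READ);

        /* Queue pair of type RC (reliable connection), as used by MigrOS. */
        struct ibv_qp_init_attr qpa = {
            .send_cq = cq, .recv_cq = cq,
            .qp_type = IBV_QPT_RC,
            .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &qpa);
        assert(pd && cq && mr && qp);

        /* ... exchange QPN/GID/LID and MR keys out of band, then connect ... */

        ibv_destroy_qp(qp); ibv_dereg_mr(mr); ibv_destroy_cq(cq);
        ibv_dealloc_pd(pd); ibv_close_device(ctx); ibv_free_device_list(devs);
        free(buf);
        return 0;
    }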

To establish a connection, an application needs to exchange the following addressing information: memory protection keys to enable access to remote MRs, the global vendor-assigned address (GUID), the routable address (GID), the non-routable address (LID), and the node-specific QP number (QPN). This exchange happens over another network, like TCP/IP. During connection setup, each QP is configured for a specific type of service. We implement MigrOS for the Reliable Connection (RC) type of service, which provides reliable in-order message delivery between two communication partners.

The application sends or receives messages by posting send requests (SR) or receive requests (RR) to a QP. These requests describe the message structure and refer to memory buffers within previously created MRs. The application checks for the completion of outstanding work requests by polling the CQ for work completions (WC).

There are various implementations of the IB verbs API for different hardware, including InfiniBand [12], iWarp [31], and RoCE [7, 8]. InfiniBand is generally the fastest among these but requires specialised NICs and switches. RoCE and iWarp provide RDMA capabilities in Ethernet networks. They still require hardware support in the NIC, but they do not depend on specialised switches and thus make it easier to incorporate RDMA into an existing infrastructure. This work focuses on RoCEv2, a version of the RoCE protocol.

To enable RDMA-application migration, it is important to consider the following challenges:

1. User applications have to use physical network addresses (QPN, LID, GID, GUID), and the IB verbs API does not specify a way of virtualising these.
2. The NIC can write to any memory it shares with the application without the OS noticing.
3. The OS cannot instruct the NIC to pause the communication, except by abruptly terminating it.
4. User applications are not prepared for a connection changing its destination address and going into an erroneous state. As a result, the applications will terminate abruptly.
5. Although the OS is aware of all IB verbs objects created by the application, it does not control the whole state of these objects, as the state partially resides on the NIC.

We address all of these challenges in Section 3.

2.3 CRIU

CRIU is a software framework for transparently checkpointing and restoring the state of Linux processes [9]. It enables live migration, snapshots, or remote debugging of processes, process trees, and containers. To extract the user-space application state, CRIU uses conventional debugging mechanisms [5, 6]. However, to extract the state of process-specific kernel objects, CRIU depends on special interfaces.

To restore a process, CRIU creates a new process that initially runs the CRIU executable, which reads the image of the target process and recreates all OS objects on its behalf. This approach allows CRIU to utilise the available OS mechanisms to run most of the recovery without the need for significant kernel modifications. Finally, CRIU removes any traces of itself from the process.

CRIU is also capable of restoring the state of TCP connections. This feature is crucial for the live migration of distributed applications [42]. The Linux kernel introduced a new TCP connection state, TCP_REPAIR, for that purpose. In this state a user-level process can change the state of send and receive message queues, get and set message sequence numbers and timestamps, or open and close a connection without notifying the other side.

As of now, if CRIU attempts to checkpoint an RDMA-application, it will detect IB verbs objects and refuse to proceed. Discarding IB verbs objects in the naive hope that the application will be able to recover is failure-prone: once an application runs into an erroneous IB verbs object, in most cases it will hang or crash. Thus, we provide explicit support for IB verbs objects in CRIU (see Section 3).
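As a further illustration of the IB verbs workflow described in Section 2.2, the sketch below (not taken from the paper) connects an RC QP and posts a single send request with standard libibverbs calls. The remote QPN, LID, and starting PSN are assumed to have been exchanged out of band; the attribute values are common defaults, not prescriptions.

    /* Sketch: bring an RC QP to RTS (Reset -> Init -> RTR -> RTS) and post
     * one send request, then poll the CQ for its work completion. */
    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    static int connect_rc(struct ibv_qp *qp, uint32_t remote_qpn,
                          uint16_t remote_lid, uint32_t psn)
    {
        struct ibv_qp_attr a;

        memset(&a, 0, sizeof(a));
        a.qp_state = IBV_QPS_INIT;
        a.pkey_index = 0;
        a.port_num = 1;
        a.qp_access_flags = IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ;
        if (ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                                  IBV_QP_PORT | IBV_QP_ACCESS_FLAGS))
            return -1;

        memset(&a, 0, sizeof(a));
        a.qp_state = IBV_QPS_RTR;                /* ready to receive */
        a.path_mtu = IBV_MTU_1024;
        a.dest_qp_num = remote_qpn;              /* partner QPN from the exchange */
        a.rq_psn = psn;
        a.max_dest_rd_atomic = 1;
        a.min_rnr_timer = 12;
        a.ah_attr.dlid = remote_lid;             /* partner LID (GID for routed RoCE) */
        a.ah_attr.port_num = 1;
        if (ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                                  IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                                  IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
            return -1;

        memset(&a, 0, sizeof(a));
        a.qp_state = IBV_QPS_RTS;                /* ready to send */
        a.timeout = 14; a.retry_cnt = 7; a.rnr_retry = 7;
        a.sq_psn = psn;
        a.max_rd_atomic = 1;
        return ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_TIMEOUT |
                                     IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                                     IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC);
    }

    /* Post a send request (SR) and wait for its work completion (WC). */
    static int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                             struct ibv_mr *mr, void *buf, uint32_t len)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len,
                               .lkey = mr->lkey };
        struct ibv_send_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1,
                                  .opcode = IBV_WR_SEND,
                                  .send_flags = IBV_SEND_SIGNALED };
        struct ibv_send_wr *bad;
        struct ibv_wc wc;

        if (ibv_post_send(qp, &wr, &bad))
            return -1;
        while (ibv_poll_cq(cq, 1, &wc) == 0)     /* poll the CQ for the WC */
            ;
        return wc.status == IBV_WC_SUCCESS ? 0 : -1;
    }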


3 Design

MigrOS is based on modern container runtimes and reuses much of the existing infrastructure with minimal changes. Most importantly, we require no modification of the software running inside the container (see Section 3.1).

Existing container runtimes rely on CRIU for checkpoint/restore functionality [3, 15, 16, 52]. Therefore, it is sufficient to extend CRIU with IB verbs support to checkpoint and restore containerised RDMA-applications. Section 3.2 describes our modifications to the IB verbs API and how CRIU uses them. We also add two new QP states to enable CRIU to create consistent checkpoints (see Section 3.3). Finally, Section 3.4 describes minimal changes to the packet-level RoCEv2 protocol to ensure that each QP maintains correct information about the location of its partner QP.

3.1 Software Stack

Typically, access to the RDMA network is hidden deep inside the software stack. Figure 3 gives an example of a containerised RDMA-application. The container image comes with all library dependencies, like the libc, but not the kernel-level drivers. The application uses a stack of communication libraries, comprising Open MPI [29], Open UCX [67] (not shown), and IB verbs. Normally, to migrate, the container runtime would require the application inside the container to terminate and later recover all IB verbs objects; this removes transparency from live migration.

Figure 3: Container migration architecture. Software inside the container, including the user-level driver (ibv-cont, grey), is unmodified. The host runs CRIU as well as kernel-level (ibv-kern) and user-level (ibv-migr, green) drivers modified for migratability.

MigrOS runs alongside the container and comprises a container runtime (e.g., Docker [52]), CRIU, and the IB verbs library. We modified CRIU to make it aware of IB verbs, so that it can successfully save IB verbs objects when CRIU traverses the kernel objects belonging to the container. We extend the IB verbs library (m-ibv-user and ibv-kern) to enable serialisation and deserialisation of IB verbs objects. Importantly, the API extension is backwards compatible with the IB verbs library running inside the container. Thus, both m-ibv-user and ibv-user use the same kernel version of IB verbs. MigrOS requires no modifications of any software inside the container.

3.2 Checkpoint/Restore API

To enable checkpoint/restore for processes and containers, we extend the IB verbs API with two new calls (see Listing 1): ibv_dump_context and ibv_restore_object. The dump call returns a dump of all IB verbs objects within a specific IB verbs context. The dumping runs almost entirely inside the kernel for two reasons. First, some links between the objects are only visible at the kernel level. Second, to get a consistent checkpoint it is crucial to ensure an atomic dump.

    int ibv_dump_context(
        struct ibv_context *ctx,
        int *count, void *dump,
        size_t length);

    int ibv_restore_object(
        struct ibv_context *ctx,
        void **object,
        int object_type, int cmd,
        void *args, size_t length);

Listing 1: Checkpoint/Restart extension for the IB verbs API. ibv_dump_context creates an image of the IB verbs context ctx with count objects and stores it in the caller-provided memory region dump of size length. ibv_restore_object executes the restore command cmd for an individual object (QP, CQ, etc.) of type object_type. The call expects a list of arguments specific to the object type and recovery command; args is an opaque pointer to the argument buffer of size length. A pointer to the restored object is returned via object.

Of course, the existing IB verbs API allows the creation of new objects. However, it is not expressive enough for restoring objects. For example, when restoring a completion queue (CQ), the current API does not allow specifying the address of the shared memory region for this queue. It is also not possible to recreate a queue pair (QP) directly in the Ready-to-Send (RTS) state. Instead, the QP has to traverse all intermediate states before reaching RTS.

We introduce the fine-grained ibv_restore_object call to restore IB verbs objects one by one, for situations when the existing API is not sufficient. In turn, MigrOS uses the extended IB verbs API to save and restore the IB verbs state of applications. During recovery, MigrOS reads the object dump and applies a specific recovery procedure for each object type. For example, to recover a QP, MigrOS calls ibv_restore_object with the command CREATE and progresses the QP through the Init, RTR, and RTS states using ibv_modify_qp. The contents of memory regions or QP buffers are recovered using standard file and memory operations. Finally, MigrOS brings the queue to its original state using the REFILL command of the restore call.

3.3 Queue Pair States

Before communication can commence, an application establishes a connection, bringing a QP through a sequence of states (depicted in Figure 4). Each newly created QP is in the Reset (R) state. To send and receive messages, a QP must reach its final Ready-to-Send (RTS) state. Before reaching RTS, the QP traverses the Init and Ready-to-Receive (RTR) states. In case of an error, the QP goes into one of the error states: Error (E) or Send Queue Error (SQE). In the Send Queue Drain (SQD) state, a QP does not accept new send requests; apart from that, SQD is equivalent to the RTS state.
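For illustration, a recovery procedure along the lines described in Section 3.2 could look as follows. The sketch uses the ibv_restore_object call from Listing 1, but the argument block, the object-type constant, and the C-level names of the CREATE and REFILL commands are hypothetical placeholders; the paper does not spell out their definitions.

    /* Illustrative sketch only: drive the proposed restore API for a QP.
     * Everything marked "assumed" is not defined by the paper. */
    #include <stdint.h>
    #include <infiniband/verbs.h>

    int ibv_restore_object(struct ibv_context *ctx, void **object,
                           int object_type, int cmd,
                           void *args, size_t length);   /* Listing 1 */

    enum { MIGROS_RESTORE_CREATE = 0, MIGROS_RESTORE_REFILL = 1 }; /* assumed */
    enum { MIGROS_OBJECT_QP = 3 };                                 /* assumed */

    struct restore_qp_args {               /* hypothetical argument block      */
        struct ibv_qp_init_attr init;      /* original creation attributes     */
        uint32_t qpn;                      /* QPN recorded in the checkpoint   */
        /* ... driver-specific ring-buffer and WQE state ... */
    };

    static struct ibv_qp *restore_qp(struct ibv_context *ctx,
                                     struct restore_qp_args *args,
                                     struct ibv_qp_attr *saved)
    {
        void *obj = NULL;

        /* 1. Recreate the QP object with its original identifiers. */
        if (ibv_restore_object(ctx, &obj, MIGROS_OBJECT_QP,
                               MIGROS_RESTORE_CREATE, args, sizeof(*args)))
            return NULL;
        struct ibv_qp *qp = obj;

        /* 2. Walk the QP through Init -> RTR -> RTS with standard verbs. */
        struct ibv_qp_attr a = *saved;
        a.qp_state = IBV_QPS_INIT;
        ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                              IBV_QP_PORT | IBV_QP_ACCESS_FLAGS);
        a.qp_state = IBV_QPS_RTR;
        ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                              IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                              IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
        a.qp_state = IBV_QPS_RTS;
        ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                              IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                              IBV_QP_MAX_QP_RD_ATOMIC);

        /* 3. Restore driver-internal queue state; this also triggers the
         *    resume message to the partner QP (Section 3.4). */
        ibv_restore_object(ctx, &obj, MIGROS_OBJECT_QP,
                           MIGROS_RESTORE_REFILL, args, sizeof(*args));
        return qp;
    }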

In addition to the existing states, we add two new states invisible to the user application (see Figure 4): Stopped (S) and Paused (P). When the kernel executes ibv_dump_context, all QPs of the specified context go into the Stopped state. A stopped QP does not send or receive any messages. The QPs remain stopped until they are destroyed together with the checkpointed process.

A QP becomes Paused when it learns that its destination QP has become Stopped (see Section 3.4). A paused QP does not send messages, but also has no other QP to receive messages from. A QP remains paused until the migrated destination QP is restored at a new location and sends a message with the new location address. The paused QP retains the new location of the destination QP and returns to the RTS state. After that, the communication can continue.

Figure 4: QP state diagram. Normal states and state transitions are controlled by the user application. A QP is put into an error state either by the OS or the NIC. The new states Stopped and Paused are used for connection migration.

3.4 Connection Migration

There are two considerations when migrating a connection. First, during the migration, the communication partner of the migrating container must not confuse migration with a network failure. Second, once the migration is complete, all partners of the migrated node need to learn its new address.

We address the first issue by extending RoCEv2 with a connection migration protocol. The connection migration protocol is active during and after migration (see Figure 5). This protocol is part of the low-level packet transmission protocol and is typically implemented entirely within the NIC. We also add a new negative acknowledgement type, NAK_STOPPED. If a stopped QP receives a packet, it replies with NAK_STOPPED and drops the packet. When the partner QP receives this negative acknowledgement, it transitions to the Paused (P) state and refrains from sending further packets until receiving a resume message.

Figure 5: To migrate from host N0 to host N2, the state of the QP changes from RTS to Stopped; finally, the QP is destroyed. If the partner QP at host N1 sends a message during migration, it gets paused. Both QPs resume normal operation once the migration is complete.

After migration completes, the new host of the migrated process restores all QPs to their original state. Once a QP reaches the RTS state, the new host executes the REFILL command. This command restores the driver-specific internal QP state and sends a newly introduced resume message to the partner QP. Resume messages are sent unconditionally, even if the partner QP was not paused before. This way, we also address the second issue: the recipient of the resume message updates its internal address information to point to the new location of the migrated QP, namely the source address of the resume message.

Each pause and resume message carries source and destination information. Thus, if multiple QPs migrate at the same time, there can be no confusion about which QPs must be paused or resumed. If at any point the migration process fails, the paused QPs will remain stuck and will not resume communication. This scenario is completely analogous to a failure during a TCP connection migration. In both cases, MigrOS is responsible for cleaning up the resources.

4 Implementation

To provide transparent live migration, MigrOS incorporates changes to CRIU, the IB verbs library, an RDMA device driver (SoftRoCE), and the packet-level RoCEv2 protocol. To migrate an application, the container runtime invokes CRIU, which checkpoints the target container. CRIU stops active RDMA connections and saves the state of IB verbs objects (see Section 4.1). SoftRoCE then pauses communication using our extensions to the packet-level protocol. After transferring the checkpoint to the destination node, the container runtime at that node invokes CRIU to recover the IB verbs objects and restores the application. SoftRoCE then resumes all paused communication to complete the migration process.

SoftRoCE is a Linux kernel-level software implementation (not an emulation [48]) of the RoCEv2 protocol [8].
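To make the protocol concrete, the following sketch outlines the pause/resume handling described in Section 3.4. All types, field names, and helper functions are hypothetical; the actual logic belongs to the RoCEv2 packet processing in the NIC or, in a software implementation, the SoftRoCE driver.

    /* Illustrative sketch only (not SoftRoCE code). */
    #include <stdint.h>

    enum mig_state   { QP_RUNNING, QP_STOPPED, QP_PAUSED };
    enum pkt_verdict { PROCESS_PACKET, DROP_PACKET };

    struct peer_addr  { uint8_t gid[16]; uint32_t qpn; };   /* partner address */
    struct resume_msg { struct peer_addr src; uint32_t first_unacked_psn; };
    struct mig_qp     { enum mig_state state; struct peer_addr remote; };

    /* Provided by the transport; names are hypothetical. */
    void send_nak_stopped(struct mig_qp *qp);
    void ack_last_received_psn(struct mig_qp *qp);

    /* Receive path of a stopped QP: refuse traffic while being checkpointed. */
    enum pkt_verdict on_packet(struct mig_qp *qp)
    {
        if (qp->state == QP_STOPPED) {
            send_nak_stopped(qp);          /* newly introduced NAK type        */
            return DROP_PACKET;
        }
        return PROCESS_PACKET;
    }

    /* Sender side: NAK_STOPPED pauses the QP until a resume message arrives. */
    void on_nak_stopped(struct mig_qp *qp)
    {
        qp->state = QP_PAUSED;             /* keep state, send nothing further */
    }

    /* Resume message from the restored partner QP: adopt its new address. */
    void on_resume(struct mig_qp *qp, const struct resume_msg *msg)
    {
        qp->remote = msg->src;             /* new physical location of partner */
        qp->state  = QP_RUNNING;
        ack_last_received_psn(qp);         /* lets the peer detect lost packets */
    }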

RoCEv2 runs RDMA communication by tunnelling InfiniBand packets through a well-known UDP port. In contrast to other RDMA device drivers, SoftRoCE allows the OS to inspect, modify, and control the state of IB verbs objects completely.

As a performance-critical component of RDMA communication, RoCEv2 usually runs in NIC hardware, so changes to the protocol require hardware changes. We implement MigrOS with the focus on minimising these protocol changes. The key part of MigrOS is the addition of connection migration capabilities to the existing RoCEv2 protocol (see Section 4.2).

4.1 State Extraction and Recovery

State extraction begins when CRIU discovers that its target process opened an IB verbs device. We modified CRIU to use the API presented in Section 3.2 to extract the state of all available IB verbs objects. CRIU stores this state together with other process data in an image. Later, CRIU recovers the image on another node using the new API.

When CRIU recovers the MRs and QPs of the migrated application, the recovered objects must maintain their original unique identifiers. These identifiers are system-global and assigned by the NIC (in our case the SoftRoCE driver) in a sequential manner. We augmented the SoftRoCE driver to expose the IDs of the last assigned MR and QP to MigrOS in userspace. These IDs are the memory region number (MRN) and the queue pair number (QPN), respectively. Before recreating an MR or QP, CRIU configures the last ID appropriately. If no other MR or QP occupies this ID, the newly created object will maintain its original ID. This approach is analogous to the way CRIU maintains the process ID of a restored process using the ns_last_pid mechanism in Linux, which exposes the last process ID assigned by the kernel.

It is possible for some other process to occupy an MRN or QPN which CRIU wants to restore. Two processes cannot use the same MRN or QPN on the same node, resulting in a conflict. In the current scheme, we avoid these conflicts by partitioning QP and MR addresses globally among all nodes in the system before application startup. CRIU faces the very same problem with process ID collisions. This problem has only been solved with the introduction of process ID namespaces. To remedy the collision problem for IB verbs objects, a similar namespace-based mechanism would be required. We leave this issue for future work.

Additionally, recovered MRs have to maintain their original memory protection keys. The protection keys are pseudo-random numbers provided by the NIC and are used by a remote communication partner when sending a packet. An RDMA operation succeeds only if the provided key matches the expected key of a given MR. Other than that, the key's value does not carry any additional semantics; thus, no collision problems exist for protection keys. CRIU sets all protection keys to their original values before communication restarts by making an ibv_restore_object call with the IBV_RESTORE_MR_KEYS command.

4.2 Resuming Connections

The connection migration protocol ensures that connections are terminated gracefully and recovered to a consistent state. The implementation of this protocol is device- and driver-specific. In this work, we modify the SoftRoCE driver to make it compliant with the connection migration protocol (Section 3.4) by providing an implementation of the checkpoint/restore API (Section 3.2).

Figure 6: Resuming a connection in SoftRoCE. Packets 8 and 9 are still to be processed by the requester. Packets 5–7 have been sent but not yet acknowledged. Packet 4 is already acknowledged. QPb expects packet 7 next. The resume packet carries the PSN of the first unacknowledged packet (5). QPb replies with an acknowledgement of the last received packet.

Figure 6 outlines the basic operation of the SoftRoCE driver. The driver creates three kernel tasks for each QP: requester, responder, and completer. When an application posts send (SR) and receive (RR) work requests to a QP, they are processed by the requester and the responder, respectively. A work request may be split into multiple packets, depending on the MTU size. When the whole work request is complete, the responder or completer notifies the application by posting a work completion to the completion queue.

The kernel tasks process all requests packet by packet. Each task maintains the packet sequence number (PSN) of the next packet. A packet sent by a requester is processed by the responder of the partner QP. The responder replies with an acknowledgement that is processed by the completer. The completer generates a work completion (WC) after receiving the acknowledgement for the last packet in an SR. Similarly, the responder generates a WC after receiving all packets of an RR.
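The resume handshake described below reconciles the packet sequence numbers of the two sides. As an illustration, the following small computation reproduces the example of Figure 6; it is not taken from the SoftRoCE sources.

    /* Sketch of the PSN bookkeeping on resume (Section 4.2, Figure 6).
     * PSNs wrap around at 2^24. */
    #include <stdint.h>
    #include <stdio.h>

    #define PSN_MASK 0xFFFFFFu

    /* After the resume handshake the requester knows the PSN the responder
     * expects next (from the responder's acknowledgement) and its own next
     * PSN; the difference is the number of packets lost during migration. */
    static uint32_t lost_packets(uint32_t requester_next_psn,
                                 uint32_t responder_expected_psn)
    {
        return (requester_next_psn - responder_expected_psn) & PSN_MASK;
    }

    int main(void)
    {
        /* Numbers from Figure 6: the requester is about to send PSN 8,
         * the responder still expects PSN 7, packet 4 was the last one
         * acknowledged before migration. */
        uint32_t requester_next   = 8;
        uint32_t responder_expects = 7;

        printf("retransmit %u packet(s), starting at PSN %u\n",
               lost_packets(requester_next, responder_expects),
               responder_expects);   /* prints: retransmit 1 packet(s), ... PSN 7 */
        return 0;
    }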

After migration, when the recovered QPa is ready to communicate again, it sends a resume message to QPb with the new address. This way, QPb learns the new location of QPa. On receiving this resume message, the responder of QPb replies with an acknowledgement of the last successfully received packet. If some packets were lost during the migration, the next PSN at the responder of QPb is smaller than the next PSN at the requester of QPa. The difference corresponds to the lost packets, which must be retransmitted. Simultaneously, the requester of QPb can already start sending messages. At this point, the connection between QPa and QPb is fully recovered.

The presented protocol ensures that both QPs recover the connection without losing packets irrecoverably. If packets were lost during migration, the QPs can determine which packets were lost and retransmit them. This retransmission is part of the normal RoCEv2 protocol. The whole connection migration protocol runs transparently for the user applications.

5 Evaluation

We evaluate MigrOS from three main aspects. First, we analyse the implementation effort, with a specific focus on changes to the RoCEv2 protocol. Second, we study the overhead of adding migration capability outside of the migration phase. Third, we estimate the fine-grained cost of migration for individual IB verbs objects, as well as the full latency of migration in realistic RDMA-applications.

For most experiments, we use a system with two machines. Each machine is equipped with an Intel i7-4790 CPU, 16 GiB RAM, an on-board Intel 1 Gb Ethernet adapter, a Mellanox ConnectX-3 VPI adapter, and a Mellanox Connect-IB 56 Gb adapter. The Mellanox VPI adapters are set to 40 Gb Ethernet mode and connected to a Cisco C93128TX 40 Gb Ethernet switch. The SoftRoCE driver communicates over this adapter. The machines run Debian 11 with a custom Linux 5.7-based kernel. We refer to this setup as local.

We conduct further measurements on a cluster comprising nodes with two-socket Intel E5-2680 v3 CPUs and Connect-IB 56 Gb NICs, deployed by Bull. We refer to this setup as cluster. Two nodes similar to those in the cluster were used in a local setup and equipped with Mellanox ConnectX-3 VPI NICs configured to 56 Gb InfiniBand mode.

5.1 Magnitude of Changes

MigrOS requires few changes to the low-level RoCEv2 protocol, as shown in Table 1. We count newly added or modified source lines of code in different components of the software stack. Only around 10% of all the changes apply to the kernel-level SoftRoCE driver. These changes mostly focus on saving and restoring the state of IB verbs objects. We counted separately the changes to the requester, responder, and completer QP tasks, which are responsible for the active phase of communication (see Figure 6). Such QP tasks are often implemented in NIC hardware in other RDMA implementations. Therefore it is important to minimise changes specifically to the QP tasks, as changes there directly translate to hardware changes. In our implementation, changes to QP tasks accounted for only around 6% of the overall changes.

Table 1: Development effort in SLOC. We specifically show the magnitude of changes made to the QP tasks (see Figure 6).

  Level    Component   Original       ∆
  Kernel   IB verbs      30 565     719
           SoftRoCE       9 446     872
           QP tasks       1 112     249
  User     IB verbs      12 431     339
           SoftRoCE       1 004     332
           CRIU          61 616   1 845
  Total                           4 137

We used gprof to record the coverage of the connection migration support code outside of the migration phase. Out of all changes done to the QP tasks, only 28 lines were touched while the application communication was active. Among them, 3 lines are variable assignments, one is an unconditional jump, and the rest are newly introduced if-else conditions that occur at most once per packet sent or received. The rest of the code changes to the QP tasks run only during the connection migration phase.

Table 2: Additional features implemented in the kernel-level SoftRoCE driver to enable recovery of IB verbs objects, and the size each object occupies in the dump.

  Object      Features required                 State (bytes)
  PD          None                                         12
  MR          Set MR keys and MRN                          48
  CQ          Set ring buffer state                        64
  SRQ         Set ring buffer state                        68
  QP          + QP tasks state, set QPN                   271
  QP w/ SRQ   + current WQE state                         823

Besides additional logic in the QP tasks, saving and restoring IB verbs objects requires manipulation of implementation-specific attributes. Some of these attributes cannot be set through the original IB verbs API. For example, recovery of an MR requires the additional ability to restore the original values of the memory keys and the MRN. Some other attributes are not visible in the original IB verbs API at all. The queues (CQ, SRQ, QP) implemented in SoftRoCE require the ability to save and restore the metadata of the ring buffers backing the queues. If a QP uses a shared receive queue (SRQ), the dump of the QP additionally includes the full state of the current work queue entry (WQE). We identified all required attributes for SoftRoCE, calculated their memory footprint (see Table 2), and implemented the features required by these attributes.

We show the analysis of the required changes for the RoCEv2 implementation in SoftRoCE. We claim that similar changes are required in other low-level implementations of the RoCEv2 protocol residing in RDMA-capable NICs. We demonstrate that the changes to the communication path are minimal outside of the migration phase. We reasonably expect that, once mapped to hardware, the proposed changes will remain minimal.

5.2 Overhead of Migratability

Just adding the capability for transparent container migration may incur overhead even when no migration occurs. For example, DMTCP (see Section 6) intercepts all IB verbs library calls and rewrites both work requests and completions before forwarding them to the NIC. The interception happens persistently, even when the process running under DMTCP never migrates. In contrast to this, MigrOS does not intercept communication operations on the critical path, thereby introducing no measurable overhead. This subsection explores the overhead added to normal communication operations without migrations.

First, we reaffirm that the proposed low-level protocol changes are minimal. For that, we need to compare the performance of migratable and non-migratable versions of the SoftRoCE driver. Unfortunately, the original version (vanilla kernel, without any modifications from our side) of the SoftRoCE driver turned out to be notoriously unstable.² The original driver contained a multitude of concurrency bugs and required significant restructuring. We therefore ended up with three versions of the driver: the original buggy version, a non-migratable fixed version, and a migratable fixed version (see Figure 7). The original version turned out to be faster; nevertheless, for the scope of our paper, correctness was of higher priority than performance. The performance of the two fixed versions of the SoftRoCE driver is practically indistinguishable. Therefore, we conclude that MigrOS introduces no runtime overhead outside of the migration phase.

²Sending SIGINT to a user-level RDMA-application caused the kernel to panic.

Figure 7: Performance comparison of different SoftRoCE drivers: (a) communication throughput, (b) communication latency. The original version shows better performance, whereas adding connection migration support to the fixed version makes practically no impact.

Next, we show the overhead added by DMTCP, which intercepts all IB verbs calls. This way, we study the cost of adding migration capability at the user level. We use the latency and bandwidth benchmarks from the OSU 5.6.1 benchmark suite [4] running on top of Open MPI 4.0 [29]. We ran the experiment on the previously described cluster with Connect-IB NICs. As shown above, adding support for migration does not incur a performance penalty; thus, running without DMTCP is similar to having native migration support. To be able to extract the state of IB verbs objects, DMTCP maintains shadow objects, which act as proxies between the user process and the NIC [24]. Figure 8 shows that maintaining these shadow objects incurs a non-negligible runtime overhead for RDMA networks.

Table 3: RDMA-capable NICs used for the evaluation.

  Short    Full name                       Location
  SR       SoftRoCE                        local
  CX3/40   ConnectX-3 40 Gb Ethernet       local
  CX3/56   ConnectX-3 56 Gb InfiniBand     cluster
  CIB      Connect-IB                      local
  BIB      Bull Connect-IB                 cluster

5.3 Migration Costs

With added support for migrating IB verbs objects, the container migration time will increase proportionally to the time required to recreate these objects.

Our goal is to estimate the additional latency for migrating RDMA-enabled applications. This subsection shows the cost of migrating connections created by SoftRoCE, as well as the cost of connection creation with hardware-based IB verbs implementations.

Figure 8: DMTCP adds substantial communication overhead, even when migration is not used: (a) communication throughput — DMTCP reduces the bandwidth by up to 70% for small messages; (b) communication latency — DMTCP increases the latency by up to 23%, or 0.34 µs for messages smaller than 16 KiB and 1.3 µs otherwise.

Several IB verbs objects are required before a reliable connection (RC) can be established (see Section 2.2). Usually, an application creates a single PD, one or two CQs, multiple memory regions, and one QP per communication partner.

To measure the cost of creating individual IB verbs objects, we modified ib_send_bw from the perftest [14] benchmark suite to create additional MR objects. We created one CQ, one PD, 64 QPs, and 64 1 MiB-sized MRs per run. Figure 9 shows the average time required to create each object across 50 runs. Each tested NIC is represented by a bar.

Figure 9: Object creation time for different RDMA devices. Before being able to send a message, a QP needs to be in the RTS state, which requires the traversal of three intermediate states (Reset, Init, RTR). We show the interval of the standard deviation around the mean.

We draw two conclusions from this experiment. First, there is substantial variation for all operations across different NICs. Second, the time required for most operations is in the range of milliseconds.

The exact time required for migrating RDMA connections depends on two factors: the number of QPs and the total amount of memory assigned to MRs [53]. Both of these factors are application-specific and can vary greatly. Therefore, we next show how the migration time is influenced by the application's usage of MRs and QPs.

Figure 10 shows the MR registration time depending on the region's size. MR registration costs are split between the OS and the NIC: the OS pins the memory and the NIC learns about the virtual memory mapping

of the registered region. SoftRoCE does not incur the "NIC part" of the cost; therefore MR registration with SoftRoCE is faster than with RDMA-enabled NICs. For this experiment, we do not consider the costs of transferring the contents of the MRs during migration.

Figure 10: MR registration time depending on the region size.

The number of QPs is the second variable influencing the migration time. Figure 11 shows the time for migrating a container running the ib_send_bw benchmark. The benchmark consists of two single-process containers running on two different nodes. Three seconds after the communication starts, the container runtime migrates one of the containers to another node. The migration time is measured as the maximum message latency as seen by the container that did not move. The checkpoint is transferred over the same network link used by the benchmarks for communication. With a growing number of QPs, the benchmark consumes more memory, ranging from 8 MiB to 20 MiB. To put things into perspective, we estimated the migration time for real devices by calculating the time to recreate IB verbs objects for RDMA-enabled NICs: we subtracted the time to create IB verbs objects with SoftRoCE from the measured migration time and added the time to create IB verbs objects with RDMA NICs (from Figure 9). We show these estimations with dashed lines.

Figure 11: Migration speed with different numbers of QPs.

Figure 12: Migration speed comparison of Docker against CR-X.

5.4 MPI Application Migration

For evaluating transparent live migration of real-world applications, we chose to migrate NPB 3.4.1 [18], an MPI benchmark suite. The MPI applications run on top of Open MPI 4.0 [29], which in turn uses OpenUCX 1.6.1 [67] for point-to-point communication. We configured UCX to use IB verbs communication over reliable connections (RC).

This setup corresponds to Figure 3. We containerised the applications using the self-developed runtime CR-X, based on libcontainer [16]. Unlike Docker, our container runtime facilitates faster live migration by sending the image directly to the destination node, instead of to local storage, during the checkpoint process. Moreover, our container runtime stores the checkpoint in RAM, reducing migration latency even further. The remaining description of our container runtime is out of the scope of this paper.

To measure the latency of application migration, we start each MPI application with four processes (ranks). We migrate one of the ranks to another node approximately in the middle of the application's progress. Each benchmark has a size parameter (A to F). We chose sizes such that the different benchmarks run between 10 and 300 seconds. For this reason, we excluded the dt benchmark, because it runs for only around a second. Figure 13 shows container migration latency and the standard deviation around the mean, averaged over 20 runs of each benchmark.

Figure 13: MPI application migration, broken down into checkpoint, transfer, and restore phases for the benchmarks ep (C, 77 MB), lu (B, 132 MB), sp (B, 177 MB), cg (B, 197 MB), bt (C, 496 MB), is (C, 507 MB), ft (B, 570 MB), and mg (C, 995 MB).

We break down the migration latency into three parts: checkpoint, transfer, and restore. MigrOS stops the target container at the beginning of the checkpoint phase. A large part of the checkpoint arrives at the destination node already during the checkpoint phase. After the transfer phase is over, MigrOS recovers the container on the destination node. Overall, we observe the migration time to be proportional to the checkpoint size. The benchmarks experience a runtime delay proportional to the migration latency.

To show interoperability with other container runtimes, we measured migration costs when using Docker 19.03 (see Figure 12). We had to implement the full end-to-end migration flow ourselves, because Docker supports only the checkpoint and restore features. To our disappointment, Docker does not employ some important optimisations and takes a long time to complete the migration. Nevertheless, this demonstrates our claim that MigrOS is readily interoperable with other container runtimes.

6 Related Work

Checkpoint/Restart Techniques. Transparent live migration of processes [19, 54, 69], containers [51, 55, 58], or virtual machines [25, 28, 38, 59, 63] has long been a topic of active research. The key challenge of this technique lies in the checkpoint/restart operation. For processes and containers, this operation can be implemented at three levels: application runtime, user-level system, or kernel-level system. Table 4 compares a selection of existing checkpoint/restart systems.

Table 4: Selected checkpoint/restart systems handle either VMs, processes (P), containers (C), or objects (O). The table compares Legion [22], Nomad [39], MPI PS [65], DMTCP [17], MOSIX-3 [21], MOSIX-4 [20], and MigrOS (ours) with respect to the units they handle, the level they operate at, RDMA support, and runtime overhead. Runtime-based systems naturally introduce no additional communication overhead for migration support.

Runtime-based systems expect the user application to access all external resources through the API of the runtime system. This restriction resolves two important issues of resource migratability: first, the runtime controls exactly when a resource is used and can easily stop the application; in doing so, the runtime can maintain enough information about the state of the underlying resource to serialise it. Second, such interception is cheap because it happens within the application's address space. Some runtime systems operate on application-defined objects (tasks, agents, lightweight threads) for even more efficient state serialisation and deserialisation [22, 45, 74]. All runtime-based approaches bind the application to a particular runtime system. Almost all attempts to provide transparent live migration together with RDMA networks rely on modifications of the runtime system [17, 32, 34, 39, 43, 65].

Kernel OS-level checkpoint/restart systems [21, 36, 41, 44, 62] either do interposition at the kernel level or extract the application state from the kernel's internal data structures. Although these systems support a significantly wider spectrum of user applications, they incur a higher maintenance burden. BLCR [36] has eventually been abandoned. CRIU [9], currently the most successful OS-level tool for checkpoint/restart, keeps the necessary kernel modifications at a minimum and does not require interposing the user-kernel API. We describe CRIU in more detail in Section 2.3.

Finally, user OS-level systems interpose the user-kernel API, providing the same transparency and generality as the kernel-based implementations. Such systems use the LD_PRELOAD mechanism to intercept system calls from applications and virtualise system resources, like file descriptors, process IDs, and sockets. In version 4, MOSIX has been redesigned to work entirely at the user level [20]. DMTCP [17] is a tool for transparent fault-tolerance for distributed applications with support for IB verbs. To be able to extract the state of IB verbs objects, DMTCP maintains shadow objects, which act as proxies between the user process and the NIC [24]. In Section 5.2, we show that maintaining these shadow objects has a non-negligible runtime overhead for RDMA networks.

Network Virtualisation. TCP/IP network virtualisation is an essential tool for isolating distributed applications from the underlying physical network topology. Even though it enables live migration, it introduces overhead due to the additional encapsulation of network packets [60, 75]. Several new approaches try to address these performance problems [23, 60, 64, 75]. However, these approaches do not consider RDMA networks.

Other work focuses on virtualising RDMA networks. FreeFlow [46] intercepts IB verbs communication in containers via software to implement connection control policies, but does not address container live migration. Nomad [39] uses InfiniBand address virtualisation for VM migration but implements the connection migration protocol inside an application-level runtime. LITE [71] virtualises RDMA networks but offers no migration support and requires application rewrite. MigrOS uses traditional network virtualisation for TCP/IP networks, which is not on the performance-critical path for RDMA-applications. However, MigrOS avoids unnecessary interception of RDMA communication; instead, it silently replaces addressing information during migration.

RDMA Implementations. There are multiple open-source RDMA implementations. SoftRoCE [48] and SoftiWarp [70] are pure software implementations of RoCEv2 [8] and iWarp [31], respectively. Both provide no performance advantage over socket-based communication, but are compatible with their hardware counterparts and facilitate the development and testing of RDMA-based applications. We chose to base our work on SoftRoCE because RoCEv2 found wider adoption than iWarp.

There are also open-source FPGA-based implementations of network stacks. NetFPGA [76] does not support RDMA communication. StRoM [68] provides a proof-of-concept RoCEv2 implementation; however, we found it unfit to run real-world applications (for example, MPI) without further significant implementation effort.

7 Discussion

Hardware Modifications and Software Implementation. Propositions to modify hardware often meet criticism because they tend to be hard to validate in practice. We believe that limited hardware changes are worthy of consideration, as the deployment of custom [27, 47], programmable [47, 57], or software-augmented NICs [35] has already been proven feasible. IB verbs has routinely been extended with additional features [40, 49] as well. Deploying MigrOS to real data centres would require hardware changes. We believe this trade-off is justified because MigrOS provides tangible performance benefits in comparison to other approaches.

To find out whether our proposed changes have any effect on the critical path of the communication, we integrated them into a software implementation of RoCEv2. Our measurements show no performance difference after adding support for migration. Given the nature of these changes, we are confident this observation applies to hardware as well. Moreover, we provide our open-source software implementation to the research community for validating our findings and further study.

Compatibility with Existing Infrastructure. MigrOS ensures backwards compatibility at the IB verbs API and RoCEv2 protocol level by design. Moreover, MigrOS allows container runtimes to be used interchangeably. By enabling migratability through MigrOS, a data centre provider does not have to make the hard choice of punishing applications that do not benefit from migration. We believe these features are crucial for successful integration into existing data centre management infrastructure.

Unreliable Datagram Communication. MigrOS provides live migration for reliable communication (RC), but omits unreliable datagram (UD) communication for two reasons. First, every message received over UD exposes the address of its sender; when this sender migrates, its address will change and currently MigrOS cannot conceal this fact from the receiver. Second, a UD QP can receive messages from anywhere. This means that a UD QP does not know where to send resume messages after migration. We leave migration support for unreliable datagrams for future work.

8 Conclusion

We introduce MigrOS, an OS-level architecture enabling transparent live container migration. Our architecture design maintains full backwards-compatibility and interoperability with the existing RDMA network infrastructure at every level. We demonstrate an end-to-end migration flow of MPI applications using different container runtimes and study the cost of migration. MigrOS provides live migration without sacrificing RDMA network performance, yet at the cost of changes to the RDMA communication protocol.

To validate our solution, we integrated the proposed RDMA communication protocol changes into an open-source implementation of the RoCEv2 protocol, SoftRoCE. For real-world deployment, these protocol changes must be implemented in NIC hardware. Finally, we provide a detailed analysis of all changes we make to SoftRoCE to show that they are small.

We are convinced the architecture of MigrOS can be useful for dynamic load balancing, efficient prepared fail-over, and live software updates in data centres or HPC clusters.

Acknowledgments

The research and the work presented in this paper has been supported by the German priority program 1648 "Software for Exascale Computing" via the research project FFMK [11]. This work was supported in part by the German Research Foundation (DFG) within the Collaborative Research Center HAEC and the Center for Advancing Electronics Dresden (cfaed). The authors are grateful to the Centre for Information Services and High Performance Computing (ZIH) TU Dresden for providing its facilities for high throughput calculations. In particular, we would like to thank Dr. Ulf Markwardt and Sebastian Schrader for their support with the experimental setup. The authors acknowledge support from the AWS Cloud Credits for Research program for providing cloud computing resources.

Availability

The anonymised version of the code is available here: dropbox.com/s/clych73kxmuwjrt.

References

[1] Elastic Fabric Adapter — Amazon Web Services. URL https://aws.amazon.com/hpc/efa/.

[2] High performance computing VM sizes. URL https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-hpc.

[3] Linux Containers. URL https://linuxcontainers.org/lxc/introduction/.

[4] MVAPICH :: Benchmarks. URL http://mvapich.cse.ohio-state.edu/benchmarks/.

12 [5] proc(5) - Linux manual page. URL http://man7. [20] Amnon Barak and Shiloh, Amnon. The MOSIX org/linux/man-pages/man5/proc.5.html. Cluster Management System for Distributed Com- puting on Linux Clusters and Multi-Cluster Private [6] ptrace(2) - Linux manual page. URL Clouds. http://man7.org/linux/man-pages/man2/ ptrace.2.html. [21] Amnon Barak, Shai Guday, and Richard G. Wheeler. The MOSIX Distributed Operating Sys- [7] Supplement to InfiniBand Architecture Specifica- tem: Load Balancing for UNIX. Springer-Verlag. tion: RoCE, volume 1. InfiniBand TA, 1.2.1 edi- ISBN 978-0-387-56663-4. tion, . [22] Michael Bauer, Sean Treichler, Elliott Slaughter, [8] Supplement to InfiniBand Architecture Specifica- and Alex Aiken. Legion: Expressing locality and tion: RoCEv2. InfiniBand TA, . URL https: independence with logical regions. SC ’12, pages //cw.infinibandta.org/document/dl/7781. 1–11. IEEE. ISBN 978-1-4673-0805-2 978-1-4673- 0806-9. doi:10.1109/SC.2012.71. [9] Checkpoint/Restore In Userspace. URL https:// criu.org/Main_Page. [23] Adam Belay, George Prekas, Christos Kozyrakis, Ana Klimovic, Samuel Grossman, and Edouard [10] Data Plane Development Kit. URL https://www. Bugnion. IX: A Protected Dataplane Operating dpdk.org/. System for High Throughput and Low Latency. OSDI ’14, pages 49–65. ISBN 978-1-931971-16-4. [11] FFMK Website. URL https://ffmk.tudos.org. [24] Jiajun Cao, Gregory Kerr, Kapil Arya, and Gene [12] InfiniBand Architecture Specification, volume 1. Cooperman. Transparent checkpoint-restart over Infiniband TA, 1.3 edition. URL https://cw. Infiniband. In Proceedings of the 23rd International infinibandta.org/document/dl/8567. Symposium on High-performance parallel and dis- tributed computing - HPDC ’14, pages 13–24. ACM [13] Messaging Accelerator (VMA) Documentation. Press. ISBN 978-1-4503-2749-7. doi:10/ggnfr4. URL https://docs.mellanox.com/display/ VMAv883. [25] Christopher Clark, Keir Fraser, Steven Hand, Ja- cob Gorm Hansen, Eric Jul, Christian Limpach, [14] OFED performance tests. URL https://github. Ian Pratt, and Andrew Warfield. Live Migra- com/linux-rdma/perftest. tion of Virtual Machines. In Proceedings of [15] Podman: daemonless container engine. URL the 2nd Conference on Symposium on Networked https://podman.io/. Systems Design & Implementation - Volume 2, NSDI ’05, pages 273–286. USENIX Association. [16] runc: CLI tool for spawning and running containers doi:10.5555/1251203.1251223. according to the OCI specification. URL https: //github.com/opencontainers/runc. [26] Connor, Patrick, Hearn, James R., Dubal, Scott P., Herdrich, Andrew J., and Sood, Kapil. Techniques [17] Jason Ansel, Kapil Arya, and Gene Cooperman. to migrate a virtual machine using disaggregated DMTCP: Transparent Checkpointing for Cluster computing resources. Computations and the Desktop. URL http:// arxiv.org/abs/cs/0701037. [27] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike [18] D.H. Bailey, E. Barszcz, J.T. Barton, D.S. Brown- Andrewartha, Vivek Bhanu, Eric Chung, Harish ing, R.L. Carter, L. Dagum, R.A. Fatoohi, P.O. Kumar Chandrappa, Somesh Chaturmohta, Matt Frederickson, T.A. Lasinski, R.S. Schreiber, H.D. Humphrey, Jack Lavier, Norman Lam, Fengfen Simon, V. Venkatakrishnan, and S.K. Weeratunga. Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Pop- The Nas Parallel Benchmarks. 5(3):63–73. ISSN uri, Shachar Raindel, Tejas Sapre, Mark Shaw, 0890-2720. doi:10/cgsfnm. 
Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. Azure Accelerated Networking: SmartNICs in the Public Cloud. USENIX Association. ISBN 978-1-931971-43-0.

[19] Amnon Barak and Amnon Shiloh. A distributed load-balancing policy for a multicomputer. 15(9):901–913. ISSN 0038-0644, 1097-024X. doi:10/c8r7m6.

[28] Umesh Deshpande, Yang You, Danny Chan, Nilton Bila, and Kartik Gopalan. Fast Server Deprovisioning through Scatter-Gather Live Migration of Virtual Machines. pages 376–383. IEEE. ISBN 978-1-4799-5063-8. doi:10.1109/CLOUD.2014.58.

[29] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In Dieter Kranzlmüller, Péter Kacsuk, and Jack Dongarra, editors, Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 3241, pages 97–104. Springer Berlin Heidelberg. ISBN 978-3-540-30218-6. doi:10.1007/978-3-540-30218-6_19.

[30] Peter X. Gao, Akshay Narayan, Rachit Agarwal, Sagar Karandikar, Sylvia Ratnasamy, Joao Carreira, Sangjin Han, and Scott Shenker. Network Requirements for Resource Disaggregation. OSDI ’16, pages 249–264. USENIX Association. ISBN 978-1-931971-33-1. doi:10.5555/3026877.3026897.

[31] D. Garcia, P. Culley, R. Recio, J. Hilland, and B. Metzler. A Remote Direct Memory Access Protocol Specification. URL https://tools.ietf.org/html/rfc5040.

[32] Rohan Garg, Gregory Price, and Gene Cooperman. MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing - HPDC ’19, pages 49–60. ACM Press. ISBN 978-1-4503-6670-0. doi:10/ggnd38.

[33] Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G. Shin. Efficient Memory Disaggregation with INFINISWAP. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation, NSDI ’17, page 21. USENIX Association. ISBN 978-1-931971-37-9. doi:10.5555/3154630.3154683.

[34] Wei Lin Guay, Sven-Arne Reinemo, Bjørn Dag Johnsen, Chien-Hua Yen, Tor Skeie, Olav Lysne, and Ola Tørudbakken. Early experiences with live migration of SR-IOV enabled InfiniBand. 78:39–52. ISSN 0743-7315. doi:10/f68twd.

[35] Sangjin Han, Keon Jang, Aurojit Panda, Shoumik Palkar, Dongsu Han, and Sylvia Ratnasamy. SoftNIC: A Software NIC to Augment Hardware.

[36] Paul H. Hargrove and Jason C. Duell. Berkeley lab checkpoint/restart (BLCR) for Linux clusters. 46:494–499. ISSN 1742-6588, 1742-6596. doi:10/d33sc5.

[37] Berk Hess, Carsten Kutzner, David van der Spoel, and Erik Lindahl. GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation. 4(3):435–447. ISSN 1549-9618, 1549-9626. doi:10/b7nkp6.

[38] Michael R. Hines, Umesh Deshpande, and Kartik Gopalan. Post-copy live migration of virtual machines. 43(3):14–26. ISSN 0163-5980. doi:10/ccwrpt.

[39] Wei Huang, Jiuxing Liu, Matthew Koop, Bulent Abali, and Dhabaleswar Panda. Nomad: migrating OS-bypass networks in virtual machines. In Proceedings of the 3rd international conference on Virtual execution environments - VEE ’07, page 158. ACM Press. ISBN 978-1-59593-630-1. doi:10/frgqz4.

[40] InfiniBand Trade Association. Supplement to InfiniBand Architecture Specification: XRC. URL https://cw.infinibandta.org/document/dl/7146.

[41] Jake Edge. Checkpoint/restart tries to head towards the mainline. URL https://lwn.net/Articles/320508/.

[42] Jonathan Corbet. TCP connection repair. URL https://lwn.net/Articles/495304/.

[43] J. Jose, Mingzhe Li, Xiaoyi Lu, K. C. Kandalla, M. D. Arnold, and D. K. Panda. SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience. In 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing. IEEE. ISBN 978-0-7695-4996-5. doi:10/ggm53b.

[44] Asim Kadav and Michael M. Swift. Live migration of direct-access devices. 43(3):95. ISSN 0163-5980. doi:10/b9j36z.

[45] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. 28(10):91–108. ISSN 0362-1340. doi:10/cgnqf7.

[46] Daehyeok Kim, Tianlong Yu, Hongqiang Harry Liu, Yibo Zhu, Jitu Padhye, Shachar Raindel, Chuanxiong Guo, Vyas Sekar, and Srinivasan Seshan. FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds. In Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation, NSDI ’19, pages 113–125. USENIX Association. ISBN 978-1-931971-49-2. doi:10.5555/3323234.3323245.

[47] Alec Kochevar-Cureton, Somesh Chaturmohta, Norman Lam, Sambhrama Mundkur, and Daniel Firestone. Remote direct memory access in computing systems. URL https://patents.google.com/patent/US10437775B2/en.

[48] Liran Liss. The Linux SoftRoCE Driver. URL https://youtu.be/NumH5YeVjHU?t=45.

[49] Liran Liss. On Demand Paging for User-level Networking.

[50] Jiuxing Liu, Jiesheng Wu, Sushmitha P. Kini, Pete Wyckoff, and Dhabaleswar K. Panda. High performance RDMA-based MPI implementation over InfiniBand. In Proceedings of the 17th annual international conference on Supercomputing, ICS ’03, pages 295–304. Association for Computing Machinery. ISBN 978-1-58113-733-0. doi:10/c4knj6.

[51] Lele Ma, Shanhe Yi, and Qun Li. Efficient service handoff across edge servers via Docker container migration. In Proceedings of the Second ACM/IEEE Symposium on Edge Computing, SEC ’17, pages 1–13. ACM Press. ISBN 978-1-4503-5087-7. doi:10/gf9x9r.

[52] Dirk Merkel. Docker: Lightweight Linux Containers for Consistent Development and Deployment. 2014(239):5. ISSN 1075-3583. doi:10.5555/2600239.2600241.

[53] Frank Mietke, Robert Rex, Robert Baumgartl, Torsten Mehlan, Torsten Hoefler, and Wolfgang Rehm. Analysis of the Memory Registration Process in the Mellanox InfiniBand Software Stack. In Wolfgang E. Nagel, Wolfgang V. Walter, and Wolfgang Lehner, editors, Euro-Par 2006 Parallel Processing, volume 4128 of Lecture Notes in Computer Science, pages 124–133. Springer Berlin Heidelberg. ISBN 978-3-540-37783-2 978-3-540-37784-9. doi:10.1007/11823285_13.

[54] Dejan Milojičić, Frederick Douglis, and Richard Wheeler. Mobility: processes, computers, and agents. ACM Press/Addison-Wesley Publishing Co. ISBN 978-0-201-37928-0.

[55] Andrey Mirkin, Alexey Kuznetsov, and Kir Kolyshkin. Containers checkpointing and live migration.

[56] Christopher Mitchell, Yifeng Geng, and Jinyang Li. Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store. pages 103–114. ISBN 978-1-931971-01-0. URL https://www.usenix.org/conference/atc13/technical-sessions/presentation/mitchell.

[57] YoungGyoun Moon, SeungEon Lee, Muhammad Asim Jamshed, and KyoungSoo Park. AccelTCP: Accelerating Network Applications with Stateful TCP Offloading. NSDI ’20, pages 77–92. USENIX Association. ISBN 978-1-939133-13-7.

[58] Shripad Nadgowda, Sahil Suneja, Nilton Bila, and Canturk Isci. Voyager: Complete Container State Migration. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pages 2137–2142. IEEE. ISBN 978-1-5386-1792-2. doi:10/ggnhq5.

[59] Michael Nelson, Beng-Hong Lim, and Greg Hutchins. Fast Transparent Migration for Virtual Machines. In Proceedings of the USENIX Annual Technical Conference, ATEC ’05, pages 391–394. USENIX Association. doi:10.5555/1247360.1247385.

[60] Zhixiong Niu, Hong Xu, Peng Cheng, Yongqiang Xiong, Tao Wang, Dongsu Han, and Keith Winstein. NetKernel: Making Network Stack Part of the Virtualized Infrastructure. URL http://arxiv.org/abs/1903.07119.

[61] Opeyemi Osanaiye, Shuo Chen, Zheng Yan, Rongxing Lu, Kim-Kwang Raymond Choo, and Mqhele Dlodlo. From Cloud to Fog Computing: A Review and a Conceptual Live VM Migration Framework. 5:8284–8300. ISSN 2169-3536. doi:10/ggnfkt.

[62] Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh. The design and implementation of Zap: a system for migrating computing environments. 36:361–376. ISSN 0163-5980. doi:10/fbg7vq.

[63] Zhenhao Pan, Yaozu Dong, Yu Chen, Lei Zhang, and Zhijiao Zhang. CompSC: live migration with pass-through devices. 47(7):109. ISSN 0362-1340. doi:10/f3887q.

[64] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The Operating System is the Control Plane.

[65] S. Pickartz, C. Clauss, S. Lankes, S. Krempel, T. Moschny, and A. Monti. Non-intrusive Migration of MPI Processes in OS-Bypass Networks. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 1728–1735. doi:10/ggscxh.

[66] Marius Poke and Torsten Hoefler. DARE: High-Performance State Machine Replication on RDMA Networks. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing - HPDC ’15, pages 107–118. ACM Press. ISBN 978-1-4503-3550-8. doi:10/ggm3sf.

[67] Pavel Shamis, Manjunath Gorentla Venkata, M. Graham Lopez, Matthew B. Baker, Oscar Hernandez, Yossi Itigin, Mike Dubman, Gilad Shainer, Richard L. Graham, Liran Liss, Yiftah Shahar, Sreeram Potluri, Davide Rossetti, Donald Becker, Duncan Poole, Christopher Lamb, Sameer Kumar, Craig Stunkel, George Bosilca, and Aurelien Bouteiller. UCX: An Open Source Framework for HPC Network APIs and Beyond. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, pages 40–43. doi:10/ggmx8k.

[68] David Sidler, Zeke Wang, Monica Chiosa, Amit Kulkarni, and Gustavo Alonso. StRoM: Smart Remote Memory. page 16. doi:10/gg8qq7.

[69] Jonathan M. Smith. A survey of process migration mechanisms. 22(3):28–40. ISSN 0163-5980. doi:10/bjp787.

[70] Animesh Trivedi, Bernard Metzler, and Patrick Stuedi. A case for RDMA in clouds: turning supercomputer networking into commodity. In Proceedings of the Second Asia-Pacific Workshop on Systems - APSys ’11, page 1. ACM Press. ISBN 978-1-4503-1179-3. doi:10/fzv576.

[71] Shin-Yeh Tsai and Yiying Zhang. LITE Kernel RDMA Support for Datacenter Applications. In Proceedings of the 26th Symposium on Operating Systems Principles - SOSP ’17, pages 306–324. ACM Press. ISBN 978-1-4503-5085-3. doi:10/ggscxn.

[72] Dongyang Wang, Binzhang Fu, Gang Lu, Kun Tan, and Bei Hua. vSocket: virtual socket interface for RDMA in public clouds. In Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments - VEE 2019, pages 179–192. ACM Press. ISBN 978-1-4503-6020-3. doi:10/ggscxg.

[73] Kai-Ting Amy Wang, Rayson Ho, and Peng Wu. Replayable Execution Optimized for Page Sharing for a Managed Runtime Environment. In Proceedings of the Fourteenth EuroSys Conference 2019, EuroSys ’19, pages 1–16. Association for Computing Machinery. ISBN 978-1-4503-6281-8. doi:10/ggnq76.

[74] David Wong, Noemi Paciorek, and Dana Moore. Java-based mobile agents. 42(3):92–ff. ISSN 0001-0782. doi:10/btg3k7.

[75] Danyang Zhuo, Kaiyuan Zhang, Yibo Zhu, Hongqiang Harry Liu, Matthew Rockett, Arvind Krishnamurthy, and Thomas Anderson. Slim: OS Kernel Support for a Low-Overhead Container Overlay Network. In Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation, NSDI ’19, pages 331–344. USENIX Association. ISBN 978-1-931971-49-2. doi:10.5555/3323234.3323263.

[76] Noa Zilberman, Yury Audzevich, Georgina Kalogeridou, Neelakandan Manihatty-Bojan, Jingyun Zhang, and Andrew Moore. NetFPGA: Rapid Prototyping of Networking Devices in Open Source. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication - SIGCOMM ’15, pages 363–364. ACM Press. ISBN 978-1-4503-3542-3. doi:10/ggz9k3.
