MigrOS: Transparent Live-Migration Support for Containerised RDMA Applications Maksym Planeta and Jan Bierbaum, TU Dresden; Leo Sahaya Daphne Antony, AMOLF; Torsten Hoefler, ETH Zürich; Hermann Härtig, TU Dresden https://www.usenix.org/conference/atc21/presentation/planeta

This paper is included in the Proceedings of the 2021 USENIX Annual Technical Conference. July 14–16, 2021 978-1-939133-23-6

Open access to the Proceedings of the 2021 USENIX Annual Technical Conference is sponsored by USENIX. MigrOS: Transparent Live-Migration Support for Containerised RDMA Applications

Maksym Planeta Jan Bierbaum Leo Sahaya Daphne Antony Torsten Hoefler TU Dresden TU Dresden AMOLF ETH Zürich Hermann Härtig TU Dresden

Abstract App Kern App Kern App Kern App Kern

RDMA networks offload packet processing onto specialised 1 2 circuitry of the network interface controllers (NICs) and by- 1 2 3 4 pass the OS to improve network latency and bandwidth. As a NIC NIC NIC NIC consequence, the OS forfeits control over active RDMA con- nections and loses the possibility to migrate RDMA applica- Figure 1: Traditional (left) and RDMA (right) network stacks. tions transparently. This paper presents MigrOS, an OS-level In traditional networks, the user application triggers the NIC architecture for transparent live migration of containerised via kernel (1). After receiving a packet, the NIC notifies the RDMA applications. MigrOS shows that a set of minimal application back through the kernel (2). In RDMA networks changes to the RDMA communication protocol reenables the application communicates directly to the NICs (3) and live migration without interposing the critical path operations. vice-versa (4) without kernel intervention. Traditional net- Our approach requires no changes to the user applications and works require copy between application buffers ( ) and NIC maintains backwards compatibility at all levels of the network accessible kernel buffers ( ). RDMA NICs (right) access stack. Overall, MigrOS can achieve up to 33% lower network message buffers in the application memory directly ( ). latency in comparison to software-only techniques.

Container orchestration systems, like Kubernetes, stop con- 1 Introduction tainers and restart them on other hosts for the purpose of load balancing, resiliency, and administration. To maintain Major cloud providers increasingly offer RDMA network efficiency, containerised applications are expected to restart connectivity [1, 55] and high-performance network stacks [8, quickly [9]. Unfortunately, fast application restart requires 18, 37, 53, 69, 85, 89] to the end-users. RDMA networks in-application support and can be very costly, as RDMA ap- lower communication latency and increase network band- plications tend to have large state. In contrast, live migration width by offloading packet processing to the RDMA NICs moves application state transparently and imposes no addi- and by removing the OS from the communication criti- tional requirements for the application. cal path. To remove the OS, user applications employ spe- In the context of live migration, much of the container state cialised RDMA APIs, which access RDMA NICs directly, resides in the host OS kernel. It can be extracted and recovered when sending and receiving messages [54]. These perfor- elsewhere later on. This recoverable state includes open TCP mance benefits made RDMA networks ubiquitous both in the connections, shell sessions, file locks [13, 40]. However, as HPC [15, 30, 43, 50, 77] and in the cloud settings [22, 59, 72]. we elaborate in Section 3, the state of RDMA communication At the same time, containers have become a popular tool for channels is not recoverable by existing systems, and hence lightweight virtualisation. Containerised applications, being RDMA applications cannot be checkpointed or migrated. independent of the host’s (libraries, applications, To outline the conceptual difficulties involved in saving configuration files), greatly simplify distributed application the state of RDMA communication channels, we compare deployment and administration. However, RDMA networks a traditional TCP/IP-based network stack and the IB verbs and containers follow contradicting architectural principles: API1 (see Figure 1). First, with a traditional network stack, Containerisation enforces stricter isolation between applica- the kernel fully controls when the communication happens: tions and the host, whereas RDMA networks try to bring applications and underlying hardware “closer” to each other. 1IB verbs is the most common low-level API for RDMA networks.

USENIX Association 2021 USENIX Annual Technical Conference 47 applications need to perform system calls to send or receive messages. In IB verbs, because of the direct communication PD between the NIC and the application, the OS cannot alter QP GUID QP GUID GID GID the connection state, except for tearing the connection down. MR SQ SR SR ... SQ LID LID Although the OS can stop a process from sending further mes- . RQ RR RR ... QPN QPN sages, the NIC may still silently change the application state. . Second, part of the connection state resides at the NIC and MR CQ WC WC ... SRQ is inaccessible for the OS. Creating a consistent checkpoint transparently is impossible in this situation. Figure 2: Primitives of the IB verbs library. Each QP com- The contribution of our paper is a way to overcome the prises a send and a receive queue and has multiple IDs; node- described limitations of current RDMA networks and the sup- global IDs (grey) are shared by all QPs on the same node. porting operating systems with a novel architecture. The root of the problem to be solved is the impossibility for the OS to intercept packets to manipulate and repair disrupted connec- a given container. A distributed application may comprise tions, as it happens in transparent live application migration. multiple containers across a network: a Spark application, The core point of the architecture is a small change in the for example, can isolate each worker in a container and an RDMA protocol, that is usually implemented in RDMA NICs: MPI [21] application can containerise each rank. we propose to add two new states and two new message types to RoCEv2, a popular RDMA communication protocol, while paying careful attention to backwards compatibility. 2.2 Infiniband verbs To make the proposed changes credible, we show that they The IB verbs API is a de-facto standard for high-performance are 1. sufficient for migrations, as an example for disrupted RDMA communication today. RDMA applications achieve connections, and 2. indeed small. To prove the first point, we high throughput and low latency by accessing the NIC directly design and implement a transparent end-to-end live migration (OS-bypass), reducing memory movement (zero-copy), and architecture for containerised RDMA applications. To prove delegating packet processing to the NIC (offloading). the second point, and to enable the software architecture, we Figure 2 shows the following IB verbs objects involved implement the new protocol by modifying SoftRoCE [48], a in communication. Memory regions (MRs) represent pinned software implementation of the RoCEv2 protocol. memory shared between the application and the NIC. Queue Providing a credible evaluation is hard for us, since we can- pairs (QPs), comprising a send queue (SQ) and a receive not modify the state machine of an RDMA NIC. Instead, we queue (RQ), represent connections. To reduce the memory substantiate our claims by comparing the latency and band- footprint, the individual RQs of multiple QPs can be re- width of the original and our modified protocol using Soft- placed with a single shared receive queue (SRQ). Completion RoCE implementations and show that outside of the migra- queues (CQs) inform the application about completed commu- tion phase network performance is not affected. To justify the nication requests. 
A protection domain (PD) groups all these proposed protocol changes and the container-based software objects together and represents the process address space to architecture, we evaluate software-level support for transpar- the NIC. ent live migration [2, 44]. We show that these approaches add To establish a connection, the IB verbs specification [54] significant overhead to the normal operation. requires the applications to exchange the following address- ing information: Memory protection keys to enable access 2 Background to remote MRs, the global vendor-assigned address (GUID), the routable address (GID), the non-routable address (LID), This section gives a short introduction to containerisation and and the node-specific QP number (QPN). One way to per- RDMA networking. We further outline live migration and form this exchange is over an out-of-band network, e.g. Eth- how RDMA networking obstructs this process. ernet. During the connection setup, each QP is configured for a specific type of service. MigrOS supports only Reliable 2.1 Containers Connections (RC), which provide reliable in-order message delivery between two communication partners. Another pop- In , processes can be logically separated from the rest ular type of service is Unreliable Datagram (UD), which does of the system using namespaces. This way processes can not provide these guarantees. have an isolated view on the file system, network devices, The application sends or receives messages by posting send users, etc. Container runtimes leverage namespaces and other requests (SR) or receive requests (RR) to a QP as work queue low-level kernel mechanisms [17, 47] to create a complete entries (WQE). These requests describe the message and refer system view inside a container with one or multiple processes. to the memory buffers within previously created MRs. The Migration of a container moves all processes running inside application checks for the completion of outstanding work

48 2021 USENIX Annual Technical Conference USENIX Association requests by polling the CQ for work completions (WC). r e MPI app S docker n i There are various implementations of the IB verbs API for O r a t

Open MPI g CRIU i different hardware, including Infiniband [37], iWarp [23], and n o ibv-user M m-ibv-user Modified RoCE [35, 36]. InfiniBand is generally the fastest among these networks but requires specialised NICs and switches. RoCE m-ibv-kern Kernel and iWarp are easier to deploy, because they provide RDMA capabilities in Ethernet networks. They still require hardware support in the NIC, but do not depend on specialised switches. Figure 3: MigrOS architecture. Software inside the container, This work focuses on RoCEv2, an increasingly more popular including the user-level driver (ibv-user, grey), is unmodified. version of the RoCE protocol [15, 60, 88]. At the same time, The host runs CRIU, kernel- (m-ibv-kern) and user-level (m- we change only parts of RoCEv2 that are defined in the same ibv-user, green) drivers modified for migratability. way for Infiniband [37] and RoCEv1 [35], hence MigrOS is also compatible with other RDMA protocols. To enable RDMA-application migration, it is important to tion in the naive hope that the application will be able to consider the following challenges: recover is failure-prone: once an application runs into an er- 1. User applications have to use physical network addresses roneous IB verbs object, in most cases, the application will (QPN, LID, GID, GUID) but the IB verbs API does not hang or crash. Thus, MigrOS provides explicit support for specify a way for virtualising these. IB verbs objects in CRIU (see Section 3). 2. The NIC can write to any memory it shares with the application without the OS noticing. 3 Design 3. The OS cannot instruct the NIC to pause communication, except for abruptly terminating it. MigrOS is based on modern container runtimes and reuses 4. User applications do not handle changes in a connection much of the existing infrastructure with minimal changes destination address and will go into an erroneous state. (see Section 3.1). Most importantly, we require no modifi- As a result, the application will terminate abruptly. cation of the software running inside the container. Instead, 5. Although the OS resides on the control path and is there- MigrOS includes the following changes concerning RoCEv2: fore aware of all IB verbs objects created by the appli- two new QP states to enable the creation of consistent check- cation, the OS does not control the whole state of these points (see Section 3.2), a connection migration protocol objects, as the state partially resides on the NIC. (Section 3.3), and modifications to the packet-level RoCEv2 We address all of these challenges in Section 3. protocol (Section 3.4). Finally, Section 3.5 describes our mod- ifications to the IB verbs API and how CRIU uses them. It is 2.3 CRIU sufficient to only extend CRIU with IB verbs support, exist- ing container runtimes use CRIU for their checkpoint/restore CRIU is a software framework for transparent checkpoint- functionality [17, 47, 71, 76]. ing and restoring of Linux processes [13]. It enables live migration, snapshots, or accelerated start-up of processes and 3.1 Software Stack containers. To extract the user-space application state, CRIU uses conventional debugging mechanisms, like ptrace [73, 74]. Typically, access to the RDMA network is hidden deep However, to extract the state of process-specific kernel objects, inside the software stack. Figure 3 gives an example of CRIU depends on special interfaces. a containerised RDMA application. 
The container image During recovery, CRIU runs inside the target process and comes with all library dependencies, like the libc, but not the recreates all OS objects on behalf of the target. This way, kernel-level drivers. In this example, the application uses a CRIU utilises the available OS mechanisms to run most of the stack of communication libraries, comprising Open MPI [21], recovery without the need for significant kernel modifications. Open UCX [77] (not shown), and IB verbs. Normally, to Finally, CRIU removes any traces of itself from the process. migrate, a container runtime would require the application CRIU can also restore the state of TCP connections, a cru- inside the container to terminate and later recover all IB verbs cial feature for live migration of distributed applications [40]. objects. This removes transparency from live migration. The Linux kernel introduced a new TCP connection state, MigrOS runs alongside the container comprising a con- TCP_REPAIR, for that purpose. In this state, a user-level pro- tainer runtime (e.g., Docker [17]), CRIU, and the IB verbs cess can modify the send and receive message queues, get library. We modified CRIU to make it aware of IB verbs, so and set the message sequence numbers and timestamps, or it can save IB verbs objects when traversing the kernel ob- open and close a connection without notifying the other side. jects that belong to the container. We extend the IB verbs As of now, CRIU does not support saving and restoring library (m-ibv-user and m-ibv-kern) to enable serialisation IB verbs objects. Discarding IB verbs objects during migra- and deserialisation of the IB verbs objects. Importantly, the

USENIX Association 2021 USENIX Annual Technical Conference 49 Create R E SQE S N0 R S nack D send ... SQD N1 R P R ack

N ... R resume Init RTR RTS P 2 t

Figure 4: QP State Diagram. Normal states and state transi- Figure 5: To migrate from host N0 to host N2, the state of tions ( , ) are controlled by the user application. A QP the QP changes from RTS ( R ) to Stopped ( S ). Finally, the is put into error states ( , ) either by the OS or the NIC. QP is destroyed ( D ). If the partner QP at host N1 sends a New states ( , ) are used for connection migration. message during migration, this QP gets paused ( P ). Both QPs resume normal operation once the migration is complete.

API extension is backwards compatible with the IB verbs library running inside the container. Thus, both m-ibv-user once the migration is complete, all partners of the migrated and ibv-user use the same kernel version of IB verbs. All of container need to learn its new address. the IB verbs components (ibv-user, m-ibv-user, m-ibv-kern) We address the first issue by extending RoCEv2 with a comprise a generic and a device-specific part. MigrOS relies connection migration protocol. The connection migration on modified kernel and user parts (m-ibv-user and m-ibv-kern), protocol is active during and after migration (see Figure 5). but requires no modifications of the software inside the con- This protocol is part of the low-level packet transmission tainer. protocol and is typically implemented entirely within the NIC. Also, we add a new negative acknowledgement type 3.2 Queue Pair States NAK_STOPPED. If a stopped QP receives a packet, it replies with NAK_STOPPED and drops the packet. When the partner Before explaining QP migration, we first recapitulate how a QP receives this negative acknowledgement, it transitions to QP functions in general. Communication can start, when an the Paused (P) state and refrains from sending further packets application establishes a connection by taking a QP through a until receiving a message of the new resume type. If N0 de- sequence of states (depicted in Figure 4). Each newly-created stroys the QP before N2 sends the resume message, packets QP is in the Reset (R) state. To send and receive messages, coming from N1 will be dropped and no negative acknowl- a QP must reach its final Ready-to-Send (RTS) state. Be- edgement will be sent. It is possible to synchronise the de- fore reaching RTS, the QP traverses the Init and Ready-to- struction of old QPs and the transfer of resume messages, but Receive (RTR) states. In case of an error, the QP goes into we did not implement this feature in our prototype, because one of the error states; Error (E) or Send Queue Error (SQE). such packet loss does not impair correctness. In the Send Queue Drain (SQD) state, a QP does not accept After migration completes, the new host of the migrated new send requests. Apart from that, SQD is equivalent to the process restores all QPs to their original state and sends re- RTS state and we do not describe it further in this paper. sume messages. Resume messages are sent unconditionally, To migrate connections safely, we add two new states in- even if the partner QP was not paused before. Any recipient visible to the user application (see Figure 4): Stopped (S) and of a resume message updates its QP’s destination address to Paused (P). When the kernel checkpoints a process, all of its the source address of the resume message, i.e., to the new QPs go into the Stopped state. A stopped QP does not send or location of the migrated QP. receive any messages. The QPs remain stopped until they are Each pause and resume message carries source and desti- destroyed together with the checkpointed process. nation information. Thus, if multiple QPs migrate at the same A QP becomes Paused when it learns that its destination QP time, there can be no confusion which QPs must be paused or has transitioned to the Stopped state. A paused QP does not resumed. On the other hand, MigrOS must prevent concurrent send messages, but also has no other QP to receive messages migration of QPs at both ends of the same connection, be- from. 
A QP remains paused, until the migrated destination cause the senders of resume messages may be aware only of QP restores at a new location and sends a message with the old, outdated locations. If at any point the migration process new location address. The paused QP retains the new location fails, the paused QPs will remain stuck and will not resume of the destination QP and returns to the RTS state. After that, communication. This scenario is completely analogous to a the communication can continue. failure during a TCP-connection migration. In both cases, MigrOS will be responsible for cleaning up the resources. 3.3 Connection Migration 3.4 Protocol Changes There are two considerations when migrating a connection. First, the communication partner of the migrating container Our protocol changes must be reflected in an RDMA NIC, must not confuse migration with a network failure. Second, because most packet-level RoCEv2 protocol implementations

50 2021 USENIX Annual Technical Conference USENIX Association ƻ Send{PSN: Ack}   * * RESUME RESUME Addr: PKT ACK 9 Type{ACK} 9 Type{SR} Ù Run tasks RESUME 9 Type{PKT} L QP Paused: no PAUSE * ¬ Next SR exists? SR * ƻ Send{PSN: Sent} RESUME S P or S P or Ǚ Drop P Ǚ Exit Ù Run tasks Drop L QP{Paused: yes} Ë Complete PKT *

S (c) A stopped or paused QP ignores all acknowledgements. (a) A QP in the Stopped ( P ) A PAUSE acknowledgement pauses the QP. Otherwise, the ƻ Send{NACK: Pause} ƻ Send{ACK} or Paused ( S ) state does not QP checks outstanding send requests (SR) and completes process requests and goes (b) A stopped QP replies only with PAUSE them. After completion of an SR, there is a check for more to the idle state (). A re- acknowledgements. A paused QP drops all outstanding SRs with a lower PSN, therefore a single ACK sume message carries the messages until receiving RESUME, and then may complete multiple SRs one by one. A recently recovered PSN from the last acknowl- updates its destination from the RESUME’s QP waits specifically for a RESUME to restart the communi- edgement. source and runs its QP tasks to fully resume. cation fully.

Figure 6: MigrOS changes processing of incoming packets (PKT and ACK) and outgoing request (SR) by introducing new QP states ( P and S ), new messages (PAUSE and RESUME), new operations (Ñ and Ù), and new transitions ( ). Solid arrows ( ) represent previously existing transitions, “*” represents the “else” branch, 9 Type checks the SR or packet type. run in hardware. We proceed from the fact that a typical the NIC through a doorbell register, when a message arrives, RDMA NIC does all packet-level work, including transmit- or by a timeout. Unless a QP is paused or stopped, the NIC ting packets and acknowledgements by itself. As a conse- will try to send or complete multiple messages at once (Fig- quence, the OS cannot itself compose and process arbitrary ure 6a and Figure 6c). As part of resume (Ù), the NIC also packets, including the packets that MigrOS introduces. triggers these workflows. Therefore, for migration to work, we need to modify Figure 6 does not include, for example, additional timeout RDMA NICs. The NIC adds different packet headers, depend- reaction logic for the sake of brevity. Overall, the changes ing on the service (RC, UD, etc.) and message (read, send, in logic are simple and mostly reuse existing functionality. etc.) type. Our changes touch only two headers: Base Trans- Because we change only the AETH and BTH headers, our port Header (BTH) and ACK Extended Transport Header changes are equally applicable to other RDMA protocols (e.g. (AETH). BTH is the first RDMA-specific header, immedi- Infiniband or RoCEv1) that use these headers in the same way. ately following IP and UDP headers. Additionally, the NIC We believe neither new logic nor new states incur prohibitive needs to maintain two new flags per QP for pause and resume design or implementation cost. states. The NIC must be able to transfer a QP into the Stopped A real RDMA NIC would need hardware support for the or Paused state, process a send request with a resume message new QP states and follow the corresponding protocol. Full mi- issued by the OS, and handle pause/resume packets. gration support would also require the NIC to extract the state Our changes (see Figure 6) cover three existing workflows of Infiniband objects and handle the new message types. We in the underlying RoCEv2 protocol: 1. Processing a send re- believe all of this can be implemented in the NIC’s firmware2. quest WQE (Figure 6a), 2. receiving a packet (Figure 6b), and Section 4 presents a proof-of-concept software implementa- 3. receiving an acknowledgement (Figure 6c). The NIC must tion of the proposed changes. consider the two new states of the QP in all these workflows. This change of logic is small and does not change the packet layout, as it is only part of the internal state of the QP. 3.5 Checkpoint/Restore API The NIC must compose the new resume message or let the To enable checkpoint/restore for processes and containers, OS do so. In contrast to a normal message, resume takes the we extend the IB verbs API with two new calls (see Listing 1): packet sequence number (PSN) from the last acknowledged ibv_dump_context and ibv_restore_object. CRIU relies on message instead of the last sent message. A receiver recog- the normal IB verbs API supplemented by the two new calls nises the resume message by a new opcode in the BTH header. to save and restore the IB verbs state of applications. 
Creating a new message type does not require changes in the The ibv_dump_context call returns a dump of all IB verbs message layout due to an abundance of unused opcodes. objects within a specific IB verbs context, an object represent- Similarly, pause, sent as a new negative acknowledgement ing the connection between a process and an RDMA NIC. type, employs an unused value of the syndrome field in the The creation of a dump runs almost entirely inside the kernel AETH header. Therefore, the new pause NACK also requires for two reasons: First, some links between the objects are no change to the existing packet layout. The workflows in the Figure 6 run when the user triggers 2The hardware/firmware-boundary will differ for FPGA and ASIC.

USENIX Association 2021 USENIX Annual Technical Conference 51 int ibv_dump_context(struct ibv_context *ctx, restores the application. SoftRoCE then resumes all paused int *count , void *dump, size_t length); communication to complete the migration process. int ibv_restore_object(struct ibv_context *ctx, SoftRoCE is a Linux kernel-level software implementation void **object , int object_type, int cmd , void *args, size_t length); (not an emulation [48]) of the RoCEv2 protocol [36]. RoCEv2 runs RDMA communication by tunnelling Infiniband packets Listing 1: Checkpoint/Restore extension for the IB verbs API. through a well-known UDP port. In contrast to other RDMA- ibv_dump_context creates an image of the IB verbs context device drivers, SoftRoCE allows the OS to inspect, modify, ctx with count objects and stores it in the caller-provided and control the state of IB verbs objects completely. memory region dump of size length. ibv_restore_object ex- As a performance-critical component of RDMA com- ecutes the restore command cmd for an individual object (QP, munication, RoCEv2 usually runs inside the hardware and CQ, etc.) of type object_type. The call expects a list of argu- firmware of a NIC. We focus on minimising these protocol ments specific to the object type and recovery command. args changes. The key part of MigrOS is the addition of connec- is an opaque pointer to the argument buffer of size length.A tion migration capabilities to the existing RoCEv2 protocol pointer to the restored object is returned via object. (see Section 4.2). only visible at the kernel level. Second, to get a consistent 4.1 State Extraction and Recovery checkpoint it is crucial to ensure the dump is atomic. Although the existing IB verbs API allows to create new State extraction begins when CRIU discovers that its target objects, it is not expressive enough for restoring them. For ex- process has opened an IB verbs device. We modified CRIU ample, when restoring a completion queue (CQ), the current to use the API presented in Section 3.5 to extract the state of API does not allow to specify the address of the shared mem- all available IB verbs objects. CRIU stores this state together ory region for this queue, instead this address is assigned by with other process data in an image. Later, CRIU recovers the the kernel. It is also not possible to recreate a queue pair (QP) image on another node using the new API. directly in its original state, like Ready-to-Send (RTS). In- When CRIU recovers MRs and QPs of the migrated ap- stead, the QP has to traverse all intermediate states to reach plication, the recovered objects must maintain their original the desired state. unique identifiers. These identifiers are system-global and We introduce the fine-grained ibv_restore_object call to assigned by the NIC (in our case the SoftRoCE driver) in a restore IB verbs objects one by one, for situations when the sequential manner. We augmented the SoftRoCE driver to existing API is not sufficient. During recovery, CRIU reads expose the IDs of the last assigned MR and QP to MigrOS in the object dump and applies a specific recovery procedure for userspace. These IDs are memory region number (MRN) and each object type. For example, to recover a QP, CRIU calls queue pair number (QPN) correspondingly. Before recreating ibv_restore_object with the command CREATE and transi- an MR or QP, CRIU configures the last ID appropriately. 
If tions the QP through the Init, RTR, and RTS states using no other MR or QP occupies this ID, the newly created object ibv_modify_qp. The memory regions or QP buffers are recov- will maintain its original ID. This approach is analogous to ered using the standard file and memory operations. Finally, the way CRIU maintains the process ID of a restored process when a QP reaches the RTS state (representing an active using the ns_last_pid mechanism of Linux, which exposes connection), the new host executes the REFILL command us- the last process ID assigned by the kernel. ing the ibv_restore_object call. This command restores the It is possible for some other process to occupy an MRN or driver-specific internal QP state and sends a resume message QPN, which CRIU wants to restore. Two processes cannot to the partner QP. use the same MRN or QPN on the same node, resulting in a conflict. In the current scheme, we avoid these conflicts by partitioning QP and MR addresses globally among all 4 Implementation nodes in the system before application startup. CRIU faces the very same problem with process ID collisions. This problem To provide transparent live migration, MigrOS makes changes has only been solved with the introduction of process ID to CRIU, the IB verbs library, the RDMA- (Soft- namespaces. To remedy the collision problem for IB verbs RoCE), and the packet-level RoCEv2-protocol. To migrate objects, a similar namespace-based mechanism, together with an application, the container runtime invokes CRIU which a virtual RDMA network [29], would be required. We leave checkpoints the target container. CRIU stops active RDMA this issue for future work. connections and saves the state of IB verbs objects (see Sec- Additionally, recovered MRs have to maintain their origi- tion 4.1). SoftRoCE then pauses communication using our nal memory protection keys. The protection keys are pseudo- extensions to the packet-level protocol. After transfering the random numbers [75] provided by the NIC and are used by checkpoint to the destination node, the container runtime at a remote communication partner when sending a packet. An that node invokes CRIU to recover the IB verbs objects and RDMA operation succeeds only if the provided key matches

52 2021 USENIX Annual Technical Conference USENIX Association QP QP sponder of QP learns the new location of QP . Then, the SR3 4 5 6 SR4 7 8 9 … a b b a Req.: 8 Req. responder replies with an acknowledgement of the last suc- RR RR … Resp. Resp.: 7 cessfully received packet. If some packets were lost during

Application WC WC … Comp.: 5 Comp. the migration, the next PSN at the responder of QPb is smaller than the next PSN at the requester of QPa. The difference Figure 7: Resuming a connection in SoftRoCE. The figure corresponds to the lost packets. Simultaneously, the requester depicts a snapshot of an immediate state, whereas the arrows of QPb can already start sending messages. At this point, the indicate the flow of data over time. A send queue comprises connection between QPa and QPb is fully recovered. multiple SRs, each expected to send multiple packets. Pack- The presented protocol ensures that both QPs recover the ets 8 and 9 ( ) are to be processed by the requester. Pack- connection without losing packets irrecoverably. If packets ets 5 – 7 ( ) are yet to be acknowledged. Packet 4 ( ) is were lost during migration, the QPs can determine which already acknowledged. A receive queue contains RRs with packets were lost and retransmit them, as part of the normal RoCEv2 protocol. The whole connection migration protocol received ( ) and not yet received ( ) packets. QPb expects the next packet to be 7. A resume packet has the PSN of the first runs transparently for the user applications. unacknowledged packet ( ). QPb replies with an acknowl- edgement of the last received packet. 5 Evaluation the expected key of a given MR. Other than that, the key’s We evaluate MigrOS from three main aspects. First, we anal- value does not carry any additional semantics. Thus, no colli- yse the implementation effort, with a specific focus on the sion problems exist for protection keys. magnitude of changes to the RoCEv2 protocol. Second, we CRIU sets all protection keys to their original values before study the overhead of adding migration capability, outside the communication restarts by making an ibv_restore_object migration phase. Third, we estimate the fine-grained cost of call with the IBV_RESTORE_MR_KEYS command. migration for individual IB verbs objects, as well as the full latency of migration in realistic RDMA applications. For most experiments, we use a system with two ma- 4.2 Resuming Connections chines: Each machine is equipped with an Intel i7-4790 CPU, The connection migration protocol ensures that connections 16 GiB RAM, an on-board Intel 1 Gb Ethernet adapter, a Mel- are terminated gracefully and recovered to a consistent state. lanox ConnectX-3 VPI adapter, and a Mellanox Connect-IB The implementation of this protocol is device- and driver- 56 Gb adapter. The Mellanox VPI adapters are set to 40 Gb specific. In this work, we modify the SoftRoCE driver to Ethernet mode. The SoftRoCE driver communicates over this make it compliant with the connection migration protocol adapter. The machines run Debian 11 with a custom Linux 5.7- (Section 3.3) by providing an implementation of the check- based kernel. We refer to this setup as local. When comparing point/restore API (Section 3.5). against DMTCP and FreeFlow, we use Ubuntu 14.04. Figure 7 outlines the basic operation of the SoftRoCE We conduct further measurements on a cluster comprising driver, which creates three concurrent tasks for each QP: two-socket Intel E5-2680 v3 CPUs nodes with Connect-IB requester, responder, and completer. When an application 56 Gb NICs deployed by Bull. We refer to this setup as cluster. posts send (SR) and receive (RR) work requests to a QP, they Two nodes similar to those in the cluster were used in a local are processed by requester and responder correspondingly. 
A setup and equipped with Mellanox ConnectX-3 VPI NICs work request may be split into multiple packets, depending configured to 56 Gb InfiniBand mode. on the MTU size. When the whole work request is complete, responder or completer notify the application by posting a 5.1 Size of Changes work completion to the completion queue. The tasks process all requests packet by packet. Each task To quantify the changes MigrOS requires, we count the newly maintains the packet sequence number (PSN) of the next added or modified source lines of code (SLOC) in different packet. A requester sends packets for processing at the respon- components of the software stack. Out of around 4 kSLOC der of its partner QP. The responder replies with an acknowl- only around 10% apply to the kernel-level SoftRoCE driver. edgement sent to the completer. The completer generates a These changes mostly focus on saving and restoring the state work completion (WC) after receiving an acknowledgement of IB verbs objects. We separately counted changes to the for the last packet in an SR. Similarly, the responder generates requester, responder, and completer QP tasks responsible for a WC after receiving all packets of an RR. the active phase of communication (see Figure 7). These After migration, when the recovered QPa is ready to com- tasks would be implemented in the NICs, for hardware-based municate again, it sends a resume message to QPb with the RDMA implementations. Therefore, we keep the changes to new address. Upon receiving the resume message, the re- QP tasks simple and small, as these changes must be reflected

USENIX Association 2021 USENIX Annual Technical Conference 53 Object Features required State (bytes) Latency, µs Bandwidth, Gb/s

PD None 12 Size, B Unmod. FF DMTCP Unmod. FF DMTCP MR Set memory keys and MRN 48 20 0.8∗ 1.2∗ 1.4∗ 0.09 0.02 0.01 CQ Restore ring buffer metadata 64 24 0.8∗ 1.2∗ 1.4∗ 1.41 0.24 0.20 SRQ Restore ring buffer metadata 68 28 1.1∗ 1.6∗ 1.8∗ 22.31 3.95 3.25 QP + QP tasks state, set QPN 271 212 2.3 2.7 2.9 36.50 36.57 36.49 QP w/ SRQ + Current WQE state 823 216 15.8 16.2 16.5 36.59 36.59 36.59 220 230.8 231.2 231.4 36.59 36.59 36.59 Table 1: Additional features implemented in the kernel-level SoftRoCE driver to enable recovery of IB verbs objects. We Table 2: Performance of CX3/40. Comparing execution with- provide the size that each object occupies in the dump. out modifications against DMTCP and FreeFlow. The varia- tion over 30 runs was small, except,∗ when 0.05 < σ/µ < 0.1. in firmware or hardware. In our implementation, changes to QP tasks accounted only for around 6% of overall changes. Latency, µ ± σ µs Using gcov [20], we verified that most of the changes to Size, B Plain Migratable DMTCP the QP tasks do not affect the critical path of communication. During the active phase of communication, 1213 lines were 20 25.3 ± 0.2 25.0 ± 0.2 26.2 ± 0.6 touched out of 4808 lines of the SoftRoCE driver identified 24 25.3 ± 0.3 24.9 ± 0.4 25.5 ± 0.5 by gcov as executable. Among the touched lines only 28 lines 28 26.9 ± 0.6 25.7 ± 0.7 28.0 ± 0.5 correspond to code we added for migration, with 3 lines cor- 212 35.7 ± 0.4 36.8 ± 0.7 35.5 ± 0.5 responding to variable assignments and the rest being jumps 216 93.2 ± 1.9 93.8 ± 1.9 94.4 ± 1.9 related to checking for the Paused or Stopped state. The re- 220 793.9 ± 12.8 802.0 ± 11.3 802.1 ± 13.5 maining code changes to the QP tasks run only during the connection migration phase. Table 3: Communication latency with SoftRoCE. Comparison Besides additional logic in the QP tasks, saving and of migratable and plain (non-migratable) SoftRoCE against restoring IB verbs objects requires the manipulation of DMTCP using plain SoftRoCE. implementation-specific attributes. Some of these attributes cannot be set through the original IB verbs API. For example, recovering an MR requires the additional ability to restore the the process never migrates. In contrast to this, MigrOS does original values of memory keys and the MRN. Some other not intercept communication operations on the critical path, attributes are not visible in the original IB verbs API at all. thereby introducing no measurable overhead. This subsec- The queues (CQ, SRQ, QP) implemented in SoftRoCE re- tion explores the overhead added for normal communication quire the ability to save and restore metadata of ring buffers operations without migrations. backing up the queues. If a QP uses a shared receive queue DMTCP [2] and FreeFlow [44] do not offer live migration. (SRQ), the dump of the QP additionally includes the full state Nevertheless, they could be extended to provide it. Therefore, of the current work queue entry (WQE). We identified all we start by estimating the cost of adding migration capability required attributes for SoftRoCE, calculated their memory at the user level. We use the latency and bandwidth bench- footprint (see Table 1), and implemented all features required marks from the perftest benchmark suite [68]. We ran each by these attributes. experiment 30 times with 10000 iterations each at the local The changes to RoCEv2 implemented in SoftRoCE are setup, with CX3/40 NICs. 
small and affect the critical path of communication only Both frameworks do additional processing for each marginally outside of the migration phase. We believe that IB verbs work request, which results into near-constant over- for RDMA NICs the same changes to RoCEv2 will remain head to latency (see Table 2). Each work request corresponds equally small. to a single message, not a single packet, therefore the overhead diminishes for larger message sizes. Table 2 demonstrates 5.2 Overhead of Migratability that bandwidth is directly affected by the increased latency and thus is lower only for small messages. We expect such Just adding the capability for transparent container migration bandwidth reduction to be a minor disadvantage for realistic already may incur overhead even when no migration occurs. applications, whereas a near 50% increase in latency may be For example, DMTCP (see Section 6) intercepts all IB verbs critical for many latency-sensitive applications [30, 78]. library calls and rewrites both work requests and comple- It is possible, that despite our best effort to minimise the tions before forwarding them to the NIC. Both with DMTCP changes to the NIC, changes proposed by MigrOS introduce and FreeFlow, this interception happens persistently, even if measurable overhead. Therefore, we need to show that Mi-

54 2021 USENIX Annual Technical Conference USENIX Association Short Full name Location To RTS To RTR To Init Create CQ MR SR SoftRoCE local 1.25 CX3/40 ConnectX-3 40 Gb Ethernet local 1.00 0.75 CX3/56 ConnectX-3 56 Gb InfiniBand cluster 0.75 CIB ConnectIB local 0.50 BIB Bull Connect-IB cluster 0.50 0.25 0.25

Table 4: RDMA-capable NICs used for the evaluation. 0.00 0.00 PD QP 3

Time, ms 4 grOS adds no additional cycles during message transmission. 3 2 For that, we compare the performance of migratable and non- 8 µs 17 µs migratable versions of the SoftRoCE driver3 against DMTCP 2 1 running with the non-migratable version of SoftRoCE. Run- 1 ning ib_send_lat [68] benchmarks in three different configu- rations shows (see Table 3) that migratability adds no visible 0 0 overhead. Simultaneously, the latency increase for DMTCP SR CX3/40CX3/56 CIB BIB SR CX3/40CX3/56 CIB BIB is also minute, making it hard to distinguish all three con- figurations. The reason for such small difference between Figure 8: Object creation time for different RDMA devices different configurations is the large communication overhead (see Table 4). To send a message, a QP needs to be in the introduced by SoftRoCE. As a result, our experiment only state RTS, which requires the traversal of three intermediate shows that migratability does not add a large overhead, but states (Reset, Init, RTR). Error bars show the interval of the does not allow to make a confident judgement regarding a standard deviation (σ) around the mean (µ), if σ/µ ≥ 0.05. potential microsecond-scale overhead. This situation is an unfortunate limitation of SoftRoCE. Considering how tiny we expect the overhead to be, even an FPGA-implementation each object across 50 runs. Each tested NIC is represented by may not be a precise reflection of a commercial RDMA NIC, a bar. We draw two conclusions from this experiment. First, because of the significant differences between an FPGA and there is substantial variation for all operations across different an actual hardware implementation. We are convinced, due NICs. Second, the time required for most operations is in the to the arguments given in Section 3, that MigrOS does not range of milliseconds. introduce even microsecond-scale overhead. The exact time required for migrating RDMA connections depends on two factors: the number of QPs and the total 5.3 Migration Costs amount of memory assigned to MRs [56]. Both of these fac- tors are application-specific and can vary greatly. Therefore, With added support for migrating IB verbs objects, the con- next, we show how the migration time is influenced by the tainer migration time will increase proportionally to the time application’s usage of MRs and QPs. required to recreate these objects. Our goal is to estimate the additional latency for migrating RDMA-enabled applications. Figure 9 shows the MR registration time, depending on the This subsection shows the cost for migrating connections cre- region’s size. MR registration costs are split between the OS ated by SoftRoCE, as well as the cost for connection creation and the NIC: The OS pins the memory and the NIC learns with hardware-based IB verbs implementations. about the virtual memory mapping of the registered region. Several IB verbs objects are required before a reliable con- SoftRoCE does not incur the “NIC-part” of the cost, so MR nection (RC) can be established, see Section 2.2. Usually, an registration with SoftRoCE is faster than for RDMA-enabled application creates a single PD, one or two CQs, multiple NICs. For this experiment, we do not consider the cost of memory regions, and one QP per communication partner. transferring the contents of the MRs during migration. To measure the cost of creating individual IB verbs objects, The number of QPs is the second variable influencing the we modified ib_send_bw [68] to create additional MR objects. migration time. 
Figure 10 shows the time for migrating a con- We created one CQ, one PD, 64 QPs, and 64 1 MiB-sized MRs tainer running the ib_send_bw benchmark. This benchmark per run. Figure 8 shows the average time required to create consists of two single-process containers running on two dif- ferent nodes. Three seconds after communication starts, the 3Both versions are modified by us, because the original version (vanilla container runtime migrates one of the containers to another kernel) of the SoftRoCE driver is extremely unstable. It contained a multi- node. The migration time is measured as the maximum mes- tude of concurrency bugs and could not be used in migration experiments. Unfortunately, fixing the race conditions required significant restructuring sage latency as seen by the container that did not move. The and resulted in a performance loss of around 20%. checkpoint is transferred over the same network link used by

USENIX Association 2021 USENIX Annual Technical Conference 55 BIB CIB CX3/40 CX3/56 SoftRoCE ms 1000 Time = 87.8µs + 218.43 GiB · Size Time = 0.2s + 4.28ms · QPs Checkpoint Transfer Restore 100 konyk 750 10 konyk cg (B) 1 500 Docker

Time, ms konyk 0.1 250 konyk Migration time, ms Container Runtime 0.01 Docker mg (C) 0 20 24 28 212 216 220 1 2 8 32 128 0 10 20 30 40 50 Size, KiB QPs, count Time (s) Figure 9: MR registration time Figure 10: Migration speed Figure 11: Migration speed of Docker and konyk depending on the region size. with different numbers of QPs. with optimisations (konyk) and without (konyk)

Checkpoint Transfer Restore runtime, konyk internally uses CRIU for checkpointing and Ckpt Size restoring containers. Further description of our container run- ep (C) 77MB time is out of scope of this paper. lu (B) 132MB sp (B) 177MB To measure the application migration latency, we start each cg (B) 197MB MPI application with four processes (ranks). Approximately bt (C) 496MB in the middle of the application progress, one of the ranks is (C) 507MB migrates to another node. Each benchmark has a size (A to F) Benchmark (Size) ft (B) 570MB parameter. We chose the size such that all benchmarks run be- mg (C) 995MB tween 10 and 300 seconds. We excluded the “dt” benchmark, 0.0 0.5 1.0 1.5 because it runs only for around a second. Figure 12 shows Time (s) container migration latency and standard deviation around the Figure 12: MPI application migration. mean, averaged over 20 runs of each benchmark. We break down the migration latency into three parts: the benchmarks for communication. With a growing num- checkpoint, transfer, and restore. MigrOS first stops the target ber of QPs, the benchmark consumes more memory, ranging container and prepares the checkpoint. Almost immediately, from 8 MiB to 20 MiB. To put things into perspective, we es- and in parallel with checkpointing, MigrOS starts to transfer timated the migration time for real devices by calculating the checkpoint data to the destination node. The transfer happens time to recreate IB verbs object for RDMA-enabled NICs. We over the network link used by the benchmarks for communi- subtracted the time to create IB verbs objects with SoftRoCE cation. This overlap of checkpointing and data transfer min- from the measured migration time and added the time needed imises the time of exclusive data transfer. After the transfer is to create IB verbs objects with RDMA NICs (from Figure 8). complete, MigrOS recovers the container at the destination We show our estimations with the dashed lines. node. Overall, the benchmarks experience a runtime delay proportional to the migration latency, which is proportional to the checkpoint size. 5.4 MPI Application Migration MPI applications (Figure 12) migrate slower than mi- For evaluating transparent live-migration of real-world ap- crobenchmarks (Figure 10), even after accounting for the plications, we chose to migrate NPB 3.4.1 [3], an MPI checkpoint size, because of the difference in measurement benchmark suite. The MPI applications run on top of methodology. For the microbenchmark, we calculate the mi- Open MPI 4.0 [21], which in turn uses Open UCX 1.6.1 [77] gration time based on the maximum message latency observed for point-to-point communication. We configured UCX to use by the non-migrating process. For the MPI benchmarks, we IB verbs communication over reliable connections (RC). calculate the migration time from the increase in total execu- This setup corresponds to Figure 3. We containerised the tion time of the whole benchmark. This discrepancy indicates applications using our self-developed runtime konyk, based on that the migration of parallel applications may cause a larger libcontainer [76]. Unlike Docker, our runtime facilitates faster disruption to the application performance than the simple live migration by sending the image directly to the destina- state transfer time can explain. tion node, instead of the local storage, during the checkpoint To show the interoperability of MigrOS with other con- process. 
Additionally, konyk stores checkpoints in RAM, re- tainer runtimes, we measured the migration costs when using ducing migration latency even more. As any other container Docker 19.03 (see Figure 11). We had to provide the end-to-

56 2021 USENIX Annual Technical Conference USENIX Association a otase h hcpit l fti a o enough not was this of All checkpoint. the transfer to way (O). objects application or (C), containers (P), processes VMs, al :Slce hcpitrsoesseshnl either handle systems checkpoint/restore Selected 5: Table hsrsrcinrsle w motn suswt resource with issues important two resolves restriction This USENIX Association runtime. the of API the through resources external all cess systems. checkpoint/restore existing of 5 selection Table a system. compares kernel-level application or levels: system, three user-level at runtime, implemented be can operation containers, this and the in processes lies For technique operation. this checkpoint/restore of challenge key The research. active [ machines tual [ processes of tion 70] Techniques [32, Checkpoint/Restore VMs migration- the on inside relied drivers NICs paravirtualised RDMA aware, with migration VM live for [ and computing disaggregated like fog paradigms, the computing with new popular of more growth even become to migration [ live computing expect cloud in history usage VMs Work Related 6 Docker. using containerised migration of possibility RDMA-application principle the explicit Neverthe- demonstrate runtime. we of less, container importance a within the support proving migration live system, file the images checkpoint across moving investigation unnecessarily Further Docker. Docker of revealed performance the match to optimised an using disk), and dumping, hard during of checkpoints (instead sending memory main to konyk: checkpoints in saving optimisations some disabled performance we better, the difference understand To migration. much complete requires to and time optimisations important some does employ Docker not disappointment, our To restore. only and features Docker checkpoint because ourselves, flow migration end support. migration for overhead munication com- additional no introduce naturally systems Runtime-based Runtime-based nt MPPPPC P P P P VM N O Y Reference Y Units Y N NIC Kernel-OS N User-OS N Runtime Overhead RDMA iemgaino ita ahns(M)hsalong a has (VMs) machines virtual of migration Live 12 11 Ours [ 6] [5] [2] [70] [32] [7]           , , ytm xetteue plcto oac- to application user the expect systems 4 Legion 25 16 , 57 , , 65 31    , 81

, Nomad , 86 63 ,cnanr [ containers ], .Nvrhls,ps techniques past Nevertheless, ]. ,

67 PS MPI a ogbe oi of topic a been long has ] 11   rnprn iemigra- live Transparent DMTCP , 16 51 , MOSIX-4 31 , 58 , 63 , MOSIX-3 62 , 67 ,o vir- or ], 

.We ]. MigrOS Scin2.3. Section in detail more in tool this describe We ihnteapiainsadesspace. address application’s the within oketrl tteue ee [ level user the at entirely work kernel- as generality and transparency same the providing API, itaiainealslv irto,i nrdcsoverhead introduces it migration, live enables virtualisation IDs, process descriptors, file like resources, system virtualise urnl h otscesu Slvlto o checkpoint/re- for tool OS-level successful most the currently bur- [ maintenance BLCR ap- higher den. user significantly of a spectrum incur wider they plications, a support Al- systems structures. these data though internal kernel’s the from state plication 66 system. runtime particular a to application deseri- and serialisation [ alisation state efficient more even lightweight for agents, threads) (tasks, objects application-defined on ate the of [ modifications system on runtime rely networks RDMA with together happens it because cheap and is serialisation interception resource Such facilitate to deserialisation. resource the about of information state resource. enough the the maintain of can ap- state runtime the the the serialise stop Second, to easily so can doing and from used plication is resource when underlying exactly controls the system runtime the First, migratability: oa [ Nomad [ not migration do container but live software, support in policies control connection menting networks. RDMA performance consider these address [ to problems try approaches [ new packets network Several of encapsulation additional to due network though Even topology. the network from physical applications underlying distributed isolating for tool essential an Virtualisation the Network improve applications. to RDMA technique of our time migration with migration from combined downtime be could the and reduce to allow [ transfer techniques checkpoint the of speed the improve to networks. has RDMA objects for shadow overhead [ these runtime NIC maintaining non-negligible that the show and we 5.2, process tion user a verbs between IB proxies of state maintains the extract DMTCP to able objects, be To verbs. with IB for applications support distributed for tool to fault-tolerance redesigned parent been has MOSIX 4, version In sockets. and and applications from calls system the intercept to use mechanism systems Such implementations. based API. user-kernel the at interposition mini- require a not does at and modification mum kernel Linux necessary keeps store, ihrd nepsto ttekre ee retatap- extract or level kernel the at interposition do either ] OS-level Kernel migration live transparent provide to attempts all Almost DAntokvrulsto prahsfcso imple- on focus approaches virtualisation RDMA-network networks RDMA employ may migration live Furthermore, Finally, 32 7 2021 USENIX Annual Technical Conference 57 8 srOS-level user , ssIfiiadadesvrulsto o VM for virtualisation address InfiniBand uses ] , 43 28 64 , a enaadndeetal.CI [ CRIU eventually. abandoned been has ] , 87 2 69 , .Alrniebsdapoce idthe bind approaches runtime-based All ]. hcpitrsoesses[ systems checkpoint/restore 24 , 89 , 26 .Hwvr hs prahsd not do approaches these However, ]. ytm neps h user-kernel the interpose systems , 32 C/Pntokvrulsto is virtualisation network TCP/IP hdwobjects shadow , 41 29 5 , .DTP[ DMTCP ]. 70 , 44 .Sm utmsoper- runtimes Some ]. , 84 .A nexception, an As ]. hc c as act which , 33 2 6 10 satrans- a is ] , LD_PRELOAD , 28 39 Sec- In ]. , .These ]. 64 38 , , 13 89 42 ], ]. 
migration, but implements the connection migration protocol inside an application-level runtime. MigrOS uses traditional network virtualisation for TCP/IP networks, which is not on the performance-critical path for RDMA applications. However, MigrOS avoids unnecessary interception of RDMA communication. Instead, MigrOS silently replaces addressing information during migration.

RDMA Implementations SoftRoCE [48] and SoftiWarp [83] are open-source software implementations of RoCEv2 [36] and iWarp [23], respectively. Both provide no performance advantage over socket-based communication, but are compatible with their hardware counterparts and facilitate the development and testing of RDMA-based applications. We chose to base our work on SoftRoCE because RoCEv2 found wider adoption than iWarp.

There are also open-source FPGA-based implementations of network stacks. NetFPGA [90] does not support RDMA communication. StRoM [79] provides a proof-of-concept RoCEv2 implementation. However, we found it unfit to run real-world applications (for example, MPI) without further significant implementation efforts.

7 Discussion and Conclusion

MigrOS is an OS-level architecture enabling transparent live container migration without sacrificing RDMA network performance. We are convinced the architecture of MigrOS can be useful for dynamic load balancing, efficient prepared fail-over, and live software updates in cloud or HPC settings.

Hardware Modifications and Software Implementation We believe that limited hardware changes are worthy of consideration and have already been proven feasible [14, 27, 45, 61], even for RDMA protocols [34, 46, 49]. Nevertheless, propositions to modify hardware often meet criticism because they are hard to validate and evaluate for an OS designer. To overcome this difficulty, we have modified SoftRoCE. It turned out that only a few additional states in the state machine and two new message types were necessary. As a result, we enabled transparent live migration of containerised RDMA applications without affecting the critical path of the communication. Lesokhin et al. [46] have demonstrated that RDMA protocol changes can be achieved just through firmware updates, barring the need to replace NICs. Furthermore, our design maintains full backwards compatibility with the existing RDMA network infrastructure at every level and can be adopted by other RDMA protocols (e.g. InfiniBand and RoCEv1) verbatim.
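To make the scope of such changes concrete, the sketch below shows how a software RoCE implementation could handle the two new message types; the type, state, and field names are illustrative assumptions of ours, not the actual SoftRoCE or MigrOS code.

/* Sketch of pause/resume handling in a software RDMA transport.
 * All identifiers are hypothetical; they only illustrate the idea of
 * one extra QP state and two extra message types.                     */
#include <stdint.h>
#include <stdbool.h>

enum qp_state { QPS_RTS, QPS_PAUSED /* new state */, QPS_ERR };

struct qp {
    enum qp_state state;
    uint32_t dest_ip;     /* address of the communication partner */
    uint32_t dest_qpn;    /* partner queue pair number */
};

struct migr_msg {
    enum { MSG_PAUSE, MSG_RESUME } type;  /* the two new message types */
    uint32_t new_ip;                      /* valid for MSG_RESUME */
    uint32_t new_qpn;                     /* valid for MSG_RESUME */
};

/* Called from the receive path only for migration messages; regular RDMA
 * traffic never takes this branch, so the critical path stays unchanged. */
static bool handle_migration_msg(struct qp *qp, const struct migr_msg *m)
{
    switch (m->type) {
    case MSG_PAUSE:
        /* Partner is being checkpointed: stop transmitting until resumed. */
        qp->state = QPS_PAUSED;
        return true;
    case MSG_RESUME:
        /* Partner finished restoring elsewhere: silently rewrite the
         * addressing information and continue the connection. */
        qp->dest_ip  = m->new_ip;
        qp->dest_qpn = m->new_qpn;
        qp->state    = QPS_RTS;
        return true;
    }
    return false;
}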
Unreliable Datagram Communication MigrOS provides live migration for reliable communication (RC), but not for unreliable datagrams (UD), because, first, every message received over UD exposes the address of its sender. When this sender migrates, its address will change and, currently, MigrOS cannot conceal the change from the receiver. Second, a UD QP does not know where to send resume messages after migration, because it can receive messages from anywhere. Consequently, to support UD, a NIC would need to maintain an additional simple table to translate between user-visible and actual addresses. We leave this issue for future work.

Security As of today, lack of authentication and integrity control is a general problem in RDMA networks [52, 75, 80, 82]. Therefore, modern RDMA networks rely on trusted NICs and hosts. In the context of this paper, we require the host OS to ensure that pause and resume messages may only be sent by authorised hosts. If either NICs or hosts cannot be trusted, additional protocols must prevent the sending of unauthorised (e.g. by spoofing) pause and resume messages. In this regard, we do not degrade the security of the RDMA network.

White-box migration Previous live migration techniques require cooperation on the application's behalf because they see RDMA NICs as black boxes. To our knowledge, our work is the first to consider an RDMA NIC as a white box for the purpose of live migration. We categorise the device state as 1. public state, observable through the IB verbs API, 2. device-driver-visible state, visible to kernel- and user-level drivers, and 3. internal state, invisible outside the device. The device must expose its internal state to the OS at the time of migration. We hope these findings can be useful when implementing live migration for other devices, e.g. GPUs.

Beyond migration We believe that the pause/resume protocol can find other uses, like efficient fail-over, congestion control, or load balancing. As an example, consider MasQ [29], a virtual RDMA network with firewall capabilities. MasQ can shut down RDMA connections but, unlike TCP/IP firewalls, cannot block them temporarily. Our protocol could return control over RDMA connections to the OS and bring TCP/IP-like behaviour to RDMA firewalls as well.

Availability github.com/TUD-OS/migros-atc-2021.

Acknowledgments

We thank our shepherd, Vasily Tarasov, and the anonymous reviewers for their helpful suggestions. The research and work presented in this paper has been supported by the German priority program 1648 "Software for Exascale Computing" via the research project FFMK [19]. This work was supported in part by the German Research Foundation (DFG) within the Collaborative Research Center HAEC and the Center for Advancing Electronics Dresden (cfaed). The authors are grateful to the Centre for Information Services and High Performance Computing (ZIH) TU Dresden for providing its facilities. In particular, we would like to thank Dr. Ulf Markwardt and Sebastian Schrader for their support with the experimental setup. The authors thank Robert Wittig for sharing his expertise in FPGA architectures. The authors acknowledge the support from AWS Cloud Credits for Research for providing cloud computing resources.

References

[1] Amazon Web Services, Inc. Elastic Fabric Adapter. https://aws.amazon.com/hpc/efa/, 2020.
[2] Jason Ansel, Kapil Arya, and Gene Cooperman. DMTCP: Transparent Checkpointing for Cluster Computations and the Desktop. In 2009 IEEE International Symposium on Parallel & Distributed Processing, IPDPS, pages 1–12, May 2009. doi:10/d62csg.
[3] D.H. Bailey, E. Barszcz, J.T. Barton, D.S. Browning, R.L. Carter, L. Dagum, R.A. Fatoohi, P.O. Frederickson, T.A. Lasinski, R.S. Schreiber, H.D. Simon, V. Venkatakrishnan, and S.K. Weeratunga. The NAS Parallel Benchmarks. The International Journal of Supercomputing Applications, 5(3):63–73, September 1991. ISSN 0890-2720. doi:10/cgsfnm.
[4] Amnon Barak and Amnon Shiloh. A distributed load-balancing policy for a multicomputer. Software: Practice and Experience, 15(9):901–913, September 1985. ISSN 00380644, 1097024X. doi:10/c8r7m6.
[5] Amnon Barak and Amnon Shiloh. The MOSIX Cluster Management System for Distributed Computing on Linux Clusters and Multi-Cluster Private Clouds. Technical report, Springer-Verlag, 2016.
[6] Amnon Barak, Shai Guday, and Richard G. Wheeler. The MOSIX Distributed Operating System: Load Balancing for UNIX. Springer-Verlag, Berlin, Heidelberg, 1993. ISBN 978-0-387-56663-4.
[7] Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. Legion: Expressing locality and independence with logical regions. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC, pages 1–11, Salt Lake City, UT, November 2012. IEEE. ISBN 978-1-4673-0806-9. doi:10/gf9t42.
[8] Adam Belay, George Prekas, Christos Kozyrakis, Ana Klimovic, Samuel Grossman, and Edouard Bugnion. IX: A Protected Dataplane Operating System for High Throughput and Low Latency. In 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI '14, pages 49–65, October 2014. ISBN 978-1-931971-16-4.
[9] Brian Grant. Pod migration · Issue #3949 · kubernetes/kubernetes. https://github.com/kubernetes/kubernetes/issues/3949, January 2015.
[10] Jiajun Cao, Gregory Kerr, Kapil Arya, and Gene Cooperman. Transparent checkpoint-restart over Infiniband. In Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing - HPDC '14, pages 13–24, Vancouver, BC, Canada, 2014. ACM Press. ISBN 978-1-4503-2749-7. doi:10/ggnfr4.
[11] Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. Live Migration of Virtual Machines. In Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation - Volume 2, NSDI '05, pages 273–286. USENIX Association, May 2005. doi:10.5555/1251203.1251223.
[12] Connor, Patrick, Hearn, James R., Dubal, Scott P., Herdrich, Andrew J., and Sood, Kapil. Techniques to migrate a virtual machine using disaggregated computing resources, 2018.
[13] CRIU. Checkpoint/Restore In Userspace. https://criu.org/Main_Page, 2011.
[14] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Vivek Bhanu, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. Azure Accelerated Networking: SmartNICs in the Public Cloud. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), NSDI '18, Berkeley, Calif, 2018. ISBN 978-1-931971-43-0.
[15] Daniele De Sensi, Salvatore Di Girolamo, Kim H. McMahon, Duncan Roweth, and Torsten Hoefler. An In-Depth Analysis of the Slingshot Interconnect. arXiv:2008.08886 [cs], August 2020.
[16] Umesh Deshpande, Yang You, Danny Chan, Nilton Bila, and Kartik Gopalan. Fast Server Deprovisioning through Scatter-Gather Live Migration of Virtual Machines. In 2014 IEEE 7th International Conference on Cloud Computing, pages 376–383, Anchorage, AK, USA, June 2014. IEEE. ISBN 978-1-4799-5063-8. doi:10/ggnfmn.
[17] Dirk Merkel. Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux Journal, March 2014.
[18] DPDK. Data Plane Development Kit. https://www.dpdk.org/, 2013.
[19] FFMK. FFMK Website. https://ffmk.tudos.org, 2013.
[20] Free Software Foundation, Inc. gcov—a Test Coverage Program. https://gcc.gnu.org/onlinedocs/gcc/Gcov.html, 2021.
[21] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, and Timothy S. Woodall. Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, volume 3241, pages 97–104, Berlin, Heidelberg, 2004. Springer. ISBN 978-3-540-30218-6. doi:10/bxhsv5.
[22] Peter X. Gao, Akshay Narayan, Rachit Agarwal, Sagar Karandikar, Sylvia Ratnasamy, Joao Carreira, Sangjin Han, and Scott Shenker. Network Requirements for Resource Disaggregation. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI '16, pages 249–264, Savannah, GA, USA, November 2016. USENIX Association. ISBN 978-1-931971-33-1. doi:10.5555/3026877.3026897.
[23] D. Garcia, P. Culley, R. Recio, J. Hilland, and B. Metzler. A Remote Direct Memory Access Protocol Specification. https://tools.ietf.org/html/rfc5040, October 2007.
[24] Rohan Garg, Gregory Price, and Gene Cooperman. MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '19, HPDC, pages 49–60, Phoenix, AZ, USA, 2019. ACM Press. ISBN 978-1-4503-6670-0. doi:10/ggnd38.
[25] Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G. Shin. Efficient Memory Disaggregation with INFINISWAP. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation, NSDI '17, page 21, Boston, MA, USA, 2017. USENIX Association. ISBN 978-1-931971-37-9. doi:10.5555/3154630.3154683.
[26] Wei Lin Guay, Sven-Arne Reinemo, Bjørn Dag Johnsen, Chien-Hua Yen, Tor Skeie, Olav Lysne, and Ola Tørudbakken. Early experiences with live migration of SR-IOV enabled InfiniBand. Journal of Parallel and Distributed Computing, 78:39–52, April 2015. ISSN 07437315. doi:10/f68twd.
[27] Sangjin Han, Keon Jang, Aurojit Panda, Shoumik Palkar, Dongsu Han, and Sylvia Ratnasamy. SoftNIC: A Software NIC to Augment Hardware. Technical Report UCB/EECS-2015-155, EECS Department, University of California, Berkeley, 2015.
[28] Paul H. Hargrove and Jason C. Duell. Berkeley lab checkpoint/restart (BLCR) for Linux clusters. Journal of Physics: Conference Series, 46:494–499, September 2006. ISSN 1742-6588, 1742-6596. doi:10/d33sc5.
[29] Zhiqiang He, Dongyang Wang, Binzhang Fu, Kun Tan, Bei Hua, Zhi-Li Zhang, and Kai Zheng. MasQ: RDMA for Virtual Private Cloud. In Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication, SIGCOMM '20, pages 1–14, New York, NY, USA, July 2020. Association for Computing Machinery. ISBN 978-1-4503-7955-7. doi:10/gg9rjq.
[30] Berk Hess, Carsten Kutzner, David van der Spoel, and Erik Lindahl. GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation. Journal of Chemical Theory and Computation, 4(3):435–447, March 2008. ISSN 1549-9618, 1549-9626. doi:10/b7nkp6.
[31] Michael R. Hines, Umesh Deshpande, and Kartik Gopalan. Post-copy live migration of virtual machines. ACM SIGOPS Operating Systems Review, 43(3):14–26, July 2009. ISSN 0163-5980. doi:10/ccwrpt.
[32] Wei Huang, Jiuxing Liu, Matthew Koop, Bulent Abali, and Dhabaleswar Panda. Nomad: migrating OS-bypass networks in virtual machines. In Proceedings of the 3rd International Conference on Virtual Execution Environments - VEE '07, page 158, San Diego, California, USA, 2007. ACM Press. ISBN 978-1-59593-630-1. doi:10/frgqz4.
[33] Khaled Z. Ibrahim, Steven Hofmeyr, Costin Iancu, and Eric Roman. Optimized pre-copy live migration for memory intensive applications. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis - SC '11, page 1, Seattle, Washington, 2011. ACM Press. ISBN 978-1-4503-0771-0. doi:10/bsgrkf.
[34] InfiniBand Trade Association. Supplement to InfiniBand Architecture Specification: XRC. InfiniBand Trade Association, February 2009.
[35] InfiniBand Trade Association. Supplement to InfiniBand Architecture Specification: RoCE, volume 1. InfiniBand Trade Association, 1.2.1 edition, 2010.
[36] InfiniBand Trade Association. Supplement to InfiniBand Architecture Specification: RoCEv2. InfiniBand Trade Association, September 2014.
[37] InfiniBand Trade Association. InfiniBand Architecture Specification, volume 1. InfiniBand Trade Association, 1.3 edition, March 2015.
[38] Jake Edge. Checkpoint/restart tries to head towards the mainline. https://lwn.net/Articles/320508/, February 2009.
[39] Joel Nider and Mike Rapoport. News from academia: FatELF, RDMA and CRIU. In Linux Plumbers Conference, Vancouver, BC, Canada, November 2018.
[40] Jonathan Corbet. TCP connection repair. https://lwn.net/Articles/495304/, May 2012.
[41] J. Jose, Mingzhe Li, Xiaoyi Lu, K. C. Kandalla, M. D. Arnold, and D. K. Panda. SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience. In 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, Delft, May 2013. IEEE. ISBN 978-0-7695-4996-5. doi:10/ggm53b.
[42] Asim Kadav and Michael M. Swift. Live migration of direct-access devices. ACM SIGOPS Operating Systems Review, 43(3):95, July 2009. ISSN 01635980. doi:10/b9j36z.
[43] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++: a portable concurrent object oriented system based on C++. ACM SIGPLAN Notices, 28(10):91–108, October 1993. ISSN 0362-1340. doi:10/cgnqf7.
[44] Daehyeok Kim, Tianlong Yu, Hongqiang Harry Liu, Yibo Zhu, Jitu Padhye, Shachar Raindel, Chuanxiong Guo, Vyas Sekar, and Srinivasan Seshan. FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds. In Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation, NSDI, pages 113–125, Boston, MA, USA, 2019. ISBN 978-1-931971-49-2. doi:10.5555/3323234.3323245.
[45] Alec Kochevar-Cureton, Somesh Chaturmohta, Norman Lam, Sambhrama Mundkur, and Daniel Firestone. Remote direct memory access in computing systems, October 2019.
[46] Ilya Lesokhin, Haggai Eran, Shachar Raindel, Guy Shapiro, Sagi Grimberg, Liran Liss, Muli Ben-Yehuda, Nadav Amit, and Dan Tsafrir. Page Fault Support for Network Controllers. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pages 449–466, Xi'an, China, April 2017. ACM. ISBN 978-1-4503-4465-4. doi:10/ghm5x7.
[47] Linux Containers (LXC). Linux Containers - LXC - Introduction. https://linuxcontainers.org/lxc/, 2008.
[48] Liran Liss. The Linux SoftRoCE Driver, March 2017.
[49] Liran Liss. On Demand Paging for User-level Networking, 2013.
[50] Jiuxing Liu, Jiesheng Wu, Sushmitha P. Kini, Pete Wyckoff, and Dhabaleswar K. Panda. High performance RDMA-based MPI implementation over InfiniBand. In Proceedings of the 17th Annual International Conference on Supercomputing, ICS '03, pages 295–304, San Francisco, CA, USA, June 2003. Association for Computing Machinery. ISBN 978-1-58113-733-0. doi:10/c4knj6.
[51] Lele Ma, Shanhe Yi, and Qun Li. Efficient service handoff across edge servers via Docker container migration. In Proceedings of the Second ACM/IEEE Symposium on Edge Computing, SEC '17, pages 1–13, San Jose, California, 2017. ACM Press. ISBN 978-1-4503-5087-7. doi:10/gf9x9r.
[52] Manhee Lee, Eun Jung Kim, and M. Yousif. Security enhancement in InfiniBand architecture. In 19th IEEE International Parallel and Distributed Processing Symposium, pages 10 pp.–, April 2005. doi:10/c4xpq5.
[53] Mellanox. Messaging Accelerator (VMA) Documentation. Technical report, Mellanox, April 2019.
[54] Mellanox Technologies. RDMA Aware Networks Programming User Manual. Technical Report 1.7, Mellanox Technologies, 2015.
[55] Microsoft. High performance computing VM sizes. https://docs.microsoft.com/en-us/azure/virtual-machines/sizes-hpc, August 2020.
[56] Frank Mietke, Robert Rex, Robert Baumgartl, Torsten Mehlan, Torsten Hoefler, and Wolfgang Rehm. Analysis of the Memory Registration Process in the Mellanox InfiniBand Software Stack. In David Hutchison, Takeo Kanade, Josef Kittler, Jon M. Kleinberg, Friedemann Mattern, John C. Mitchell, Moni Naor, Oscar Nierstrasz, C. Pandu Rangan, Bernhard Steffen, Madhu Sudan, Demetri Terzopoulos, Doug Tygar, Moshe Y. Vardi, Gerhard Weikum, Wolfgang E. Nagel, Wolfgang V. Walter, and Wolfgang Lehner, editors, Euro-Par 2006 Parallel Processing, volume 4128, pages 124–133. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006. ISBN 978-3-540-37783-2, 978-3-540-37784-9. doi:10.1007/11823285_13.
[57] Dejan Milojičić, Frederick Douglis, and Richard Wheeler. Mobility: processes, computers, and agents. ACM Press/Addison-Wesley Publishing Co., USA, 1999. ISBN 978-0-201-37928-0.
[58] Andrey Mirkin, Alexey Kuznetsov, and Kir Kolyshkin. Containers checkpointing and live migration. In Linux Symposium, Ottawa, 2008.
[59] Christopher Mitchell, Yifeng Geng, and Jinyang Li. Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store. In USENIX Annual Technical Conference (USENIX ATC 13), pages 103–114, 2013. ISBN 978-1-931971-01-0.
[60] Radhika Mittal, Alexander Shpiner, Aurojit Panda, Eitan Zahavi, Arvind Krishnamurthy, Sylvia Ratnasamy, and Scott Shenker. Revisiting network support for RDMA. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, SIGCOMM '18, pages 313–326, New York, NY, USA, August 2018. Association for Computing Machinery. ISBN 978-1-4503-5567-4. doi:10/ghf6mq.
[61] YoungGyoun Moon, SeungEon Lee, Muhammad Asim Jamshed, and KyoungSoo Park. AccelTCP: Accelerating Network Applications with Stateful TCP Offloading. In 17th USENIX Symposium on Networked Systems Design and Implementation, NSDI '20, pages 77–92, Santa Clara, CA, USA, February 2020. USENIX Association. ISBN 978-1-939133-13-7.
[62] Shripad Nadgowda, Sahil Suneja, Nilton Bila, and Canturk Isci. Voyager: Complete Container State Migration. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pages 2137–2142, Atlanta, GA, USA, June 2017. IEEE. ISBN 978-1-5386-1792-2. doi:10/ggnhq5.
[63] Michael Nelson, Beng-Hong Lim, and Greg Hutchins. Fast Transparent Migration for Virtual Machines. In Proceedings of the USENIX Annual Technical Conference, ATC '05, pages 391–394, Anaheim, CA, USA, April 2005. USENIX Association. doi:10.5555/1247360.1247385.
[64] Zhixiong Niu, Hong Xu, Peng Cheng, Yongqiang Xiong, Tao Wang, Dongsu Han, and Keith Winstein. NetKernel: Making Network Stack Part of the Virtualized Infrastructure. arXiv:1903.07119 [cs], March 2019.
[65] Opeyemi Osanaiye, Shuo Chen, Zheng Yan, Rongxing Lu, Kim-Kwang Raymond Choo, and Mqhele Dlodlo. From Cloud to Fog Computing: A Review and a Conceptual Live VM Migration Framework. IEEE Access, 5:8284–8300, 2017. ISSN 2169-3536. doi:10/ggnfkt.
[66] Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh. The design and implementation of Zap: a system for migrating computing environments. ACM SIGOPS Operating Systems Review, 36(SI):361–376, December 2003. ISSN 0163-5980. doi:10/fbg7vq.
[67] Zhenhao Pan, Yaozu Dong, Yu Chen, Lei Zhang, and Zhijiao Zhang. CompSC: live migration with pass-through devices. ACM SIGPLAN Notices, 47(7):109, September 2012. ISSN 03621340. doi:10/f3887q.
[68] perftest. perftest. https://github.com/linux-rdma/perftest, April 2020.
[69] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The Operating System is the Control Plane. In Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI '14, 2014.
[70] S. Pickartz, C. Clauss, S. Lankes, S. Krempel, T. Moschny, and A. Monti. Non-intrusive Migration of MPI Processes in OS-Bypass Networks. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 1728–1735, May 2016. doi:10/ggscxh.
[71] podman.io. Podman: daemonless container engine. https://podman.io/, 2018.
[72] Marius Poke and Torsten Hoefler. DARE: High-Performance State Machine Replication on RDMA Networks. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing - HPDC '15, pages 107–118, Portland, Oregon, USA, 2015. ACM Press. ISBN 978-1-4503-3550-8. doi:10/ggm3sf.
[73] proc - process information pseudo-filesystem. proc(5) - Linux manual page. http://man7.org/linux/man-pages/man5/proc.5.html, 2020.
[74] ptrace - process trace. ptrace(2) - Linux manual page. http://man7.org/linux/man-pages/man2/ptrace.2.html, 2020.
[75] Benjamin Rothenberger, Konstantin Taranov, Adrian Perrig, and Torsten Hoefler. ReDMArk: Bypassing RDMA Security Mechanisms. In 30th USENIX Security Symposium, page 16, Vancouver, B.C., 2021.
[76] runc. runc: CLI tool for spawning and running containers according to the OCI specification. https://github.com/opencontainers/runc, 2020.
[77] Pavel Shamis, Manjunath Gorentla Venkata, M. Graham Lopez, Matthew B. Baker, Oscar Hernandez, Yossi Itigin, Mike Dubman, Gilad Shainer, Richard L. Graham, Liran Liss, Yiftah Shahar, Sreeram Potluri, Davide Rossetti, Donald Becker, Duncan Poole, Christopher Lamb, Sameer Kumar, Craig Stunkel, George Bosilca, and Aurelien Bouteiller. UCX: An Open Source Framework for HPC Network APIs and Beyond. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, HotI, pages 40–43, August 2015. doi:10/ggmx8k.
[78] Jiaxin Shi, Youyang Yao, Rong Chen, Haibo Chen, and Feifei Li. Fast and Concurrent RDF Queries with RDMA-based Distributed Graph Exploration. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI '16, page 17, 2016.
[79] David Sidler, Zeke Wang, Monica Chiosa, Amit Kulkarni, and Gustavo Alonso. StRoM: Smart Remote Memory. In Proceedings of the Fifteenth European Conference on Computer Systems, EuroSys '20, page 16, Heraklion, Greece, 2020. Association for Computing Machinery. doi:10.1145/3342195.3387519.
[80] Anna Kornfeld Simpson, Adriana Szekeres, Jacob Nelson, and Irene Zhang. Securing RDMA for High-Performance Datacenter Storage Systems. In 12th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud 20, page 8. USENIX Association, 2020.
[81] Jonathan M. Smith. A survey of process migration mechanisms. ACM SIGOPS Operating Systems Review, 22(3):28–40, July 1988. ISSN 0163-5980. doi:10/bjp787.
[82] Konstantin Taranov, Benjamin Rothenberger, Adrian Perrig, and Torsten Hoefler. sRDMA – Efficient NIC-based Authentication and Encryption for Remote Direct Memory Access. In 2020 USENIX Annual Technical Conference (USENIX ATC 20), pages 691–704, 2020. ISBN 978-1-939133-14-4.
[83] Animesh Trivedi, Bernard Metzler, and Patrick Stuedi. A case for RDMA in clouds: turning networking into commodity. In Proceedings of the Second Asia-Pacific Workshop on Systems - APSys '11, page 1, Shanghai, China, 2011. ACM Press. ISBN 978-1-4503-1179-3. doi:10/fzv576.
[84] Shin-Yeh Tsai and Yiying Zhang. LITE Kernel RDMA Support for Datacenter Applications. In Proceedings of the 26th Symposium on Operating Systems Principles - SOSP '17, pages 306–324, Shanghai, China, 2017. ACM Press. ISBN 978-1-4503-5085-3. doi:10/ggscxn.
[85] Dongyang Wang, Binzhang Fu, Gang Lu, Kun Tan, and Bei Hua. vSocket: virtual socket interface for RDMA in public clouds. In Proceedings of the 15th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE, pages 179–192, Providence, RI, USA, 2019. ACM Press. ISBN 978-1-4503-6020-3. doi:10/ggscxg.
[86] Kai-Ting Amy Wang, Rayson Ho, and Peng Wu. Replayable Execution Optimized for Page Sharing for a Managed Runtime Environment. In Proceedings of the Fourteenth EuroSys Conference 2019, EuroSys '19, pages 1–16, Dresden, Germany, March 2019. Association for Computing Machinery. ISBN 978-1-4503-6281-8. doi:10/ggnq76.
[87] David Wong, Noemi Paciorek, and Dana Moore. Java-based mobile agents. Communications of the ACM, 42(3):92–ff., March 1999. ISSN 0001-0782. doi:10/btg3k7.
[88] Yibo Zhu, Ming Zhang, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, and Mohamad Haj Yahia. Congestion Control for Large-Scale RDMA Deployments. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, SIGCOMM, pages 523–536, London, UK, 2015. ACM. ISBN 978-1-4503-3542-3. doi:10.1145/2785956.2787484.
[89] Danyang Zhuo, Kaiyuan Zhang, Yibo Zhu, Hongqiang Harry Liu, Matthew Rockett, Arvind Krishnamurthy, and Thomas Anderson. Slim: OS Kernel Support for a Low-Overhead Container Overlay Network. In Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation, NSDI, Boston, MA, USA, February 2019. USENIX Association. ISBN 978-1-931971-49-2.
[90] Noa Zilberman, Yury Audzevich, G. Adam Covington, and Andrew W. Moore. NetFPGA SUME: Toward 100 Gbps as Research Commodity. IEEE Micro, 34(5):32–41, September 2014. ISSN 1937-4143. doi:10/gg8qsd.