The nanoPU: A Nanosecond Network Stack for Datacenters

Stephen Ibanez, Alex Mallery, Serhat Arslan, and Theo Jepsen, Stanford University; Muhammad Shahbaz, Purdue University; Changhoon Kim and Nick McKeown, Stanford University

15th USENIX Symposium on Operating Systems Design and Implementation (OSDI '21), July 14–16, 2021
https://www.usenix.org/conference/osdi21/presentation/ibanez

Abstract

We present the nanoPU, a new NIC-CPU co-design to accelerate an increasingly pervasive class of datacenter applications: those that utilize many small Remote Procedure Calls (RPCs) with very short (µs-scale) processing times. The novel aspect of the nanoPU is the design of a fast path between the network and applications—bypassing the cache and memory hierarchy, and placing arriving messages directly into the CPU register file. This fast path contains programmable hardware support for low latency transport and congestion control as well as hardware support for efficient load balancing of RPCs to cores. A hardware-accelerated thread scheduler makes sub-nanosecond decisions, leading to high CPU utilization and low tail response time for RPCs.

We built an FPGA prototype of the nanoPU fast path by modifying an open-source RISC-V CPU, and evaluated its performance using cycle-accurate simulations on AWS FPGAs. The wire-to-wire RPC response time through the nanoPU is just 69ns, an order of magnitude quicker than the best-of-breed, low latency, commercial NICs. We demonstrate that the hardware thread scheduler is able to lower RPC tail response time by about 5× while enabling the system to sustain 20% higher load, relative to traditional thread scheduling techniques. We implement and evaluate a suite of applications, including MICA, Raft and Set Algebra for document retrieval; and we demonstrate that the nanoPU can be used as a high performance, programmable alternative for one-sided RDMA operations.

1 Introduction

Today, large online services are typically deployed as multiple tiers of software running in data centers. Tiers communicate with each other using Remote Procedure Calls (RPCs) of varying size and complexity [7, 28, 57]. Some RPCs call upon microservices lasting many milliseconds, while others call remote (serverless) functions, or retrieve a single piece of data and last only a few microseconds. These are important workloads, and so it seems feasible that small messages with microsecond (and possibly nanosecond) service times will become more common in future data centers [7, 28]. For example, it is reported that a large fraction of messages communicated in Facebook data centers are for a single key-value memory reference [4, 7], and a growing number of papers describe fine-grained (typically cache-resident) computation based on very small RPCs [22, 23, 28, 57].

Three main metrics are useful when evaluating an RPC system's performance: (1) the median response time (i.e., the time from when a client issues an RPC request until it receives a response) for applications invoking many sequential RPCs; (2) the tail response time (i.e., the longest or 99th %ile RPC response time) for applications with large fanouts (e.g., map-reduce jobs), because they must wait for all RPCs to complete before continuing [17]; and (3) the communication overhead (i.e., the communication-to-computation ratio). When communication overhead is high, it may not be worth farming out the request to a remote CPU at all [57]. We will sometimes need more specific metrics for portions of the processing pipeline, such as the median wire-to-wire latency, the time from when the first bit of an RPC request arrives at the server NIC until the last bit of the response departs.

Many authors have proposed exciting ways to accelerate RPCs by reducing the message processing overhead. These include specialized networking stacks, both in software (e.g., DPDK [18], ZygOS [51], Shinjuku [27], and Shenango [49]) and hardware (e.g., RSS [43], RDMA [9], Tonic [2], NeBuLa [57], and Optimus Prime [50]). Each proposal tackles one or more components of the RPC stack (i.e., network transport, congestion control, core selection, thread scheduling, and data marshalling). For example, DPDK removes the memory copying and network transport overhead of an OS and lets a developer handle them manually in user space. ZygOS implements a scheme to efficiently load balance messages across multiple cores. Shenango efficiently shares CPUs among services requiring RPC messages to be processed. eRPC [28] cleverly combines many software techniques to reduce median RPC response times by optimizing for the common case (i.e., small messages with short RPC handlers). These systems have successfully reduced the message-processing overhead from 100s of microseconds to 1–2 microseconds.

NeBuLa [57] is a radical hardware design that tries to further minimize response time by integrating the NIC with the CPU (bypassing PCIe), and dispatching RPC requests directly into the L1 cache. The approach effectively reduces the minimum wire-to-wire response time below 100ns.

Put another way, these results suggest that with the right hardware and software optimizations, it is practical and useful to remotely dispatch functions as small as a few microseconds. The goal of our work is to enable even smaller functions, with computation lasting less than 1µs, for which we need to minimize communication overhead. We call these very short RPCs nanoRequests.

The nanoPU, presented and evaluated here, is a combined NIC-CPU optimized to process nanoRequests very quickly. When designing nanoPU, we set out to answer two questions.
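To make the three metrics above concrete, the following is a minimal sketch, not taken from the paper, of how the median response time, the 99th %ile tail, and the communication overhead might be computed from a trace of RPC timings. The sample values, variable names, and the crude nearest-index percentile are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative only: given per-RPC response times and per-RPC server compute
 * times (both hypothetical), report the median, the 99th %ile tail, and the
 * communication overhead (communication-to-computation ratio). */

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Crude nearest-index percentile; fine for a sketch, not a measurement tool. */
static double percentile(const double *sorted, size_t n, double p) {
    return sorted[(size_t)(p * (double)(n - 1))];
}

int main(void) {
    /* Hypothetical measurements, in microseconds. */
    double response_us[] = { 1.2, 0.9, 1.1, 5.4, 1.0, 1.3, 0.95, 1.05 };
    double compute_us[]  = { 0.6, 0.5, 0.6, 0.7, 0.5, 0.6, 0.50, 0.55 };
    size_t n = sizeof(response_us) / sizeof(response_us[0]);

    double total_resp = 0.0, total_comp = 0.0;
    for (size_t i = 0; i < n; i++) {
        total_resp += response_us[i];
        total_comp += compute_us[i];
    }

    qsort(response_us, n, sizeof(double), cmp_double);

    printf("median response time:   %.2f us\n", percentile(response_us, n, 0.50));
    printf("99th %%ile response:     %.2f us\n", percentile(response_us, n, 0.99));
    /* Overhead: time spent communicating relative to time spent computing. */
    printf("communication overhead: %.2f\n", (total_resp - total_comp) / total_comp);
    return 0;
}

Note how a single slow RPC in such a trace barely moves the median but dominates the tail, which is why large-fanout workloads care about the 99th %ile; and how, once compute times shrink toward a microsecond, the overhead ratio is governed almost entirely by the network stack.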
Figure 1: The nanoPU design. The NIC includes ingress and egress PISA pipelines as well as a hardware-terminated transport and a core selector with global RX queues; each CPU core is augmented with a hardware thread scheduler and local RX/TX queues connected directly to the register file.

The first is, what is the absolute minimum communication overhead we can achieve for processing nanoRequests? NanoRequests are simply very short-lived RPCs marked by the client and server NICs for special treatment. In nanoPU, nanoRequests follow a new low-overhead path through the

1. The nanoPU's median wire-to-wire response time for nanoRequests, from the wire through the header-processing pipeline, transport layer, core selection, and thread scheduling, plus a simple loopback application and
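The contribution above mentions a simple loopback application running over the nanoPU fast path, whose defining property is that arriving message words are placed directly into the CPU register file (the netRX/netTX interfaces in Figure 1). As a rough illustration only, the following C sketch models that interface as two memory-mapped words; the addresses, the msg_len_words parameter, and the word framing are assumptions, not the nanoPU's actual register-file ABI.

#include <stdint.h>

/* Hypothetical stand-ins for the nanoPU's netRX/netTX interfaces. In the real
 * design these are CPU registers fed directly by the NIC; modeling them as
 * memory-mapped words here is purely for illustration. */
#define NET_RX (*(volatile uint64_t *)0x40000000UL)
#define NET_TX (*(volatile uint64_t *)0x40000008UL)

/* Loopback nanoRequest handler: copy each 8-byte word of the arriving request
 * straight from the receive interface to the transmit interface, so the
 * response echoes the request without touching the caches or DRAM. */
void loopback(uint64_t msg_len_words) {   /* message length: an assumed parameter */
    for (uint64_t i = 0; i < msg_len_words; i++) {
        uint64_t word = NET_RX;  /* next word of the request */
        NET_TX = word;           /* next word of the response */
    }
}

Even in this toy form, the handler's inner loop is a couple of register moves per word; how a handler binds to a port, learns the message length, and signals message completion is part of the nanoPU's actual interface and is not shown here.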