
PL2: Towards Predictable Low Latency in Rack-Scale Networks

Yanfang Le†, Radhika Niranjan Mysore††, Lalith Suresh††, Gerd Zellweger††, Sujata Banerjee††, Aditya Akella†, Michael Swift†
†University of Wisconsin-Madison, ††VMware Research

Abstract

High performance rack-scale offerings package disaggregated pools of compute, memory and storage hardware in a single rack to run diverse workloads with varying requirements, including applications that need low and predictable latency. The intra-rack network is typically high speed Ethernet, which can suffer from congestion leading to packet drops and may not satisfy the stringent tail latency requirements of some workloads (including remote memory/storage accesses). In this paper, we design a Predictable Low Latency (PL2) network architecture for rack-scale systems with Ethernet as the interconnecting fabric. PL2 leverages programmable Ethernet switches to carefully schedule packets such that they incur no loss, with NIC and switch queues maintained at small, near-zero levels. In our 100 Gbps rack prototype, PL2 keeps 99th-percentile memcached RPC latencies under 60 µs even when the RPCs compete with extreme offered loads of 400%, without losing traffic. Network transfers for a machine-learning training task complete 30% faster than a receiver-driven scheme implementation modelled after Homa (222 ms vs. 321 ms 99%ile latency per iteration).

1 Introduction

Rack-scale data center solutions like Dell-EMC VxRail [57] and Intel RSD [52] have emerged as a new building block for modern enterprise, cloud, and edge infrastructure. These rack-units have three key characteristics. First is the increasing use of resource disaggregation and hardware accelerators such as GPUs and FPGAs within these rack-units [52], in addition to high-density compute and storage units. Second, Ethernet is by far the dominant interconnect of choice within such racks, even for communication between compute units, storage units and accelerators (e.g., Ethernet-pooled FPGA and NVMe in Intel RSD [33]). Third, these racks are deployed in a wide range of enterprise and cloud customer environments, running a heterogeneous mix of modern (e.g., machine learning, graph processing) and legacy applications (e.g., monolithic web applications), making it impractical to anticipate traffic and workload patterns.

Rack-scale networks¹ need to satisfy the key requirements of uniform low latency and high utilization, irrespective of where applications reside and which accelerators they access (e.g., FPGA vs. CPU vs. GPU). However, three key obstacles stand in the way of achieving these goals because of the above-mentioned characteristics. First, the rack-scale network must be transport-agnostic, a necessity in environments with (a) heterogeneous accelerator devices² that have different characteristics than CPU network stacks [25, 38, 39], and (b) increasing use of CPU-bypass networking [32, 35, 37]. Second, Ethernet is not a lossless fabric, and yet, our experiments (§2) on a 100G testbed confirm that drops, not queueing, are the largest contributor to tail latency pathologies. Third, the design must be workload-oblivious: given that we cannot anticipate traffic and workload patterns across a broad range of customer environments, it is impractical to rely on state-of-the-art proposals (§3) that hinge on configuring rate limits or priorities using a-priori knowledge of the workload.

In this paper, we present Predictable Low Latency or PL2, a rack-scale lossless network architecture that achieves low latency and high throughput in a transport-agnostic and workload-oblivious manner. PL2 reduces NIC-to-NIC latencies by proactively avoiding losses. PL2 supports a variety of message transport protocols and gracefully accommodates increasing numbers of flows, even at 100G line rates. It neither requires a-priori knowledge of workload characteristics nor depends on rate-limits or traffic priorities to be set based on workload characteristics.

To achieve these goals, senders in PL2 explicitly request a switch buffer reservation for a given number of packets, a packet burst, and receive notification as to when that burst can be transmitted without facing any cross traffic from other senders. PL2 achieves this form of centralized scheduling even at 100G line rates by overcoming the key challenge of carefully partitioning the scheduling responsibility between hosts in the rack and the Top-of-Rack (ToR) switch. In particular, the end-host protocol is kept simple enough to accommodate accelerator devices and implementations within NICs (§4.4), whereas the timeslot allocation itself is performed in the ToR switch at line rate (as opposed to doing so on a host, which is prone to software overheads).

In summary, our contributions are:
• The PL2 design that embodies novel yet simple algorithms for lossless transmissions and near-zero queuing within a rack.
• A PL2 implementation using a P4 programmable switch and an end-host stack that leverages Mellanox's state-of-the-art VMA message acceleration library [7].
• A comprehensive PL2 evaluation on a 100 Gbps prototype, supporting three different transports (TCP, UDP and raw Ethernet), all benefiting from near-zero queueing in the network. Compared to VMA, we demonstrate up to 2.2x improvement in the 99th percentile latency for the memcached application; a 20% improvement to run a VGG16 machine learning workload; and near-optimal latency and throughput in experiments using trace-based workload generators.

¹The network extending between NICs of such rack-units across the top-of-rack (ToR) switch.
²For example, FPGA stacks will not be connection-oriented due to scaling issues [54] and GPUs will not have a single receiver stack [39].

2 Motivation

The primary goal of PL2 is to provide uniformly low latency across Ethernet rack-fabrics, while achieving high utilization. We take inspiration from prior work around low and predictable latency within data center networks [13, 15, 16, 18, 20, 22, 26, 27, 29–31, 34, 36, 41–44, 47, 48, 56, 58, 60], but find that rack-scale networks provide a rich set of new challenges.

Rack-scale characteristics and implications

1. Even as line-rates increase, intra-rack RTTs are not getting smaller. [44] measured 5 µs end-to-end RTT on a 10 Gbps testbed with a single ToR switch, inclusive of software delays on the servers. A 64B packet still has an RTT of 5 µs in our 100 Gbps rack-scale PL2 prototype. Even though network transmission times reduce proportionally with increasing line-rates, switching delays, NIC hardware delays, and DMA transfer delays at end-hosts have remained about the same, and these delays together dominate transmission delays in a rack. Forward-error-correction delays increase as line-rates increase, and can add variability of up to 1-2 µs in a 100 Gbps network. This implies that rack-scale networks are able to transfer more data in the same RTT as interconnect speeds increase. Flows up to 61 kB can complete within 1 RTT with a 100 Gbps backplane as opposed to 6 kB for a 10 Gbps backplane. For congestion control protocols and predictable latency schemes to be effective for flows below these sizes, they will need to converge in sub-RTT timeframes.

2. Even as rack-densities go up, network buffering in ToR switches is not getting bigger. Shallow buffers are even more critical to a disaggregated rack, because buffering adds latency to network transfers. However, the implication of this trend is that microbursts can over-run shared output switch buffers and cause drops. For instance, a 2 MB buffer in a ToR switch with 100 Gbps ports provides buffering for just 160 µs, which means only 8 simultaneous transfers of 2 Mbits can be sustained before the switch starts dropping packets. Today's rack-scale networks support up to 64 rack-units [4], where each end-system can have tens of thousands of ongoing transfers. At that scale a 2 MB buffer can be overrun with only 6 simultaneous 5 µs (1 RTT) network transfers per rack-scale unit. In short, as rack-densities go up, drops due to microbursts can be frequent. Therefore, the assumption made by congestion protocols like [30, 44] that the network core (the ToR switch in the case of racks) is lossless no longer holds.

3. Rack-scale traffic is ON/OFF traffic [19]. We believe this trend will continue with network traffic generated by accelerators. Measuring traffic demands in such environments is hard, let alone learning about workload characteristics; traffic demands at ns-timescales will be different compared to µs-timescales and ms-timescales [59]. Workload churn and different application mixes add to the unpredictability. Coming up with effective rate-limits [18, 27, 34, 36, 41], and readjusting these rate-limits with changing traffic conditions in time (i.e., in less than an RTT), is impossible; so is setting relative packet priorities [44] effectively [40] so that important traffic is never lost or delayed. In our experience neither rate-limits nor priority prescription is an answer to congestion control within a rack.

Drops cause the most noticeable tails. Based on the above three observations, we believe that new techniques are required to ensure low latency and high utilization within a rack-scale network. We hinge the PL2 design on the observation that drops, not queuing, cause the most noticeable tails.

We illustrate this point with an experiment that introduces microbursts in our 100 Gbps testbed even when the network is partially loaded, by using 5-way incast of roughly 18 Gbps traffic per sender. All messages transmitted in our experiment can be transmitted within a single RTT. As shown in Figure 1a, the 99%ile latencies experienced by a receiver-driven scheme (RDS-baseline) based on Homa (described in detail in Section 6) correspond to the maximum output buffering available in the ToR switch (around 120 µs in Figure 1b), while the 99.9%ile latencies correspond to delays due to drops, and are two orders of magnitude higher. Reducing the drop-timeout in our implementation only increases drops, while only slightly reducing the 99.9%ile latencies. In contrast, PL2's 99%ile and 99.9%ile latencies are similar to its median latencies, and it does not experience drops. PL2 is not impacted by microbursts; it keeps buffer occupancy low (maximum of 200 KiB). Figure 1c shows the drops (around 0.1%) experienced by RDS-baseline over time.
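To make the buffer-sizing arithmetic in observation 2 above concrete, the short calculation below (ours, not from the paper's artifact) reproduces the 160 µs and 8-transfer figures quoted there.

# Illustrative arithmetic for the shared ToR buffer discussion above
# (values mirror the text: 2 MB shared buffer, 100 Gbps ports).
BUFFER_BYTES = 2_000_000        # 2 MB of shared output buffering
LINE_RATE_BPS = 100e9           # 100 Gbps port speed

drain_time_us = BUFFER_BYTES * 8 / LINE_RATE_BPS * 1e6
print(f"buffer absorbs ~{drain_time_us:.0f} us of line-rate traffic")   # ~160 us

TRANSFER_BITS = 2_000_000       # one 2 Mbit transfer
print(f"2 Mbit transfers sustainable before drops: "
      f"{BUFFER_BYTES * 8 // TRANSFER_BITS}")                           # 8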

Figure 1: Message latency, switch queuing delay, and packet drops during microbursts: (a) message latency, (b) switch queuing delay, (c) packet drops per 10 ms.
3 Related Work

Priority Flow Control (PFC) [49]. PFC is a closely related L2 mechanism that can also counter loss in a rack-scale network. Configuring PFC for correct operation is notoriously hard, even at rack-scale (see PXE booting issues in [28]), and turning on PFC in a rack that is part of a larger data center can be disruptive. Even within a single rack, PFC's coarse-grained mechanism of providing losslessness across fewer than 8 traffic classes requires operators to choose between high utilization and lossless behavior, because congestion in one downstream class can result in multiple unrelated senders receiving PAUSE frames due to HOL blocking [45].

Rack-scale interconnects. Several custom designs for rack-scale interconnects have been proposed: [23] proposes direct-connect topologies, and [53] proposes circuit-switched connectivity within a rack. [5, 6, 9] propose cache-coherent interconnects at rack scale to enable new computation models. PL2 differs fundamentally from all of these in vision. Even though other designs might perform better, PL2's goal is to allow traditional and commercially available rack-scale architectures to continue to avail the cost and operational-ease benefits of Ethernet interconnects, but also to get predictable latency benefits.

Predictable latency. [18, 27, 34, 36, 41] provide predictable latency by resource isolation among tenants or applications of a data center. All of these use rate-limit based network admission control to ensure isolation, and require a-priori knowledge of traffic characteristics (application mixes, demands). They typically readjust rate-limits dynamically based on new demand, but require considerable time to do so. For example, [41] requires a few RTTs for convergence. Often these systems rely on inputs from applications, tenant requirements or systems like BwE [40] to determine rate-limits. As described in Section 2, PL2 cannot leverage these ideas in the rack-scale context.

Congestion control. Our work follows the long history of low-latency, high-throughput congestion-control mechanisms. [22, 26, 30, 44] propose software-based receiver-side scheduling targeted towards 10 Gbps data center networks. Some of these schemes [30, 44] rely on the assumption that the network core does not experience loss, an assumption that is invalid in the rack-scale context (Section 2).

Recent proposals suggest that starting new flows at line-rates [30, 44], or over-committing downlinks [44], could speed up network transfers; the experiment described in Figure 1 verifies that this idea trades off tail latency for improved minimum and median latencies, which may not be beneficial [24].

Homa, pFabric, and others [15, 16, 27, 44, 56, 58] depend on prior knowledge of application traffic characteristics for providing benefits over other schemes; something that may be difficult due to shifting or short-lived workloads [50].

Most congestion control schemes proposed [13, 20, 29, 31, 42, 43, 60] rely on layer 3 and above to remedy congestion after it is observed, either by way of delay, ECN marks, buffer occupancy or packet loss [21]. They require at least an RTT to respond to congestion and are too slow to prevent drops in a rack-scale network.

Fastpass [48] and Flowtune [47] proactively request permission to send packets and are the closest prior schemes to PL2. Fastpass decides when to send a burst of packets while Flowtune decides the rate at which to send a burst of packets. Their centralized arbiters, however, are host-based and cannot keep up with 100 Gbps line rates because they are bottlenecked both by scheduling software latencies and by the downlink to the arbiter. Control packets to the centralized arbiter can also be dropped arbitrarily depending on NIC polling rates; these issues make centralized scheduling at an end host too fallible to be efficient and effective at 100 Gbps.

Most of these schemes (except for Fastpass and Flowtune) do not tackle the problem of being transport-agnostic; they rely on all traffic using the same end-host-based congestion control³. These schemes do not interact well with traffic that does not have congestion control. In PL2, the rack-scale interconnect intercepts traffic from all higher layers, and is therefore well suited to offer the properties we seek.

³This is true of Homa also, which reorders traffic based on message size and would not interact well with TCP.

4 PL2 Design

PL2 transforms the rack-scale network into a high-performance lossless low-latency interconnect. At the heart of PL2 is an algorithm for scheduling packet bursts at line rate using the switch dataplane, where a packet burst is simply a bounded number of Ethernet frames.

PL2 is designed to be lossless via proactive congestion control, while at the same time being both transport-agnostic and workload-oblivious.

We achieve losslessness via the scheduling algorithm, which implements a timeslot reservation scheme where each sender transmits only at its assigned timeslot, reducing cross-traffic. Since the scheduling function is placed in the switch dataplane in the network layer, it can schedule packet bursts corresponding to all transports. The switch is one hop away from all hosts; therefore hosts can access the scheduling function in less than an end-to-end RTT; we further eliminate scheduling overheads where possible. PL2 does not assume a-priori knowledge of traffic patterns or workloads.

In the following sections, we explain our scheduling algorithm first, followed by the practical hardware constraints that inform its design. It is worth mentioning, for readers familiar with Fastpass, that we are unable to borrow the timeslot allocation scheme in Fastpass due to these hardware constraints; the PL2 switch scheduling algorithm trades off optimal scheduling for speed. We elaborate on this towards the end of the next subsection.

4.1 PL2 scheduling algorithm

Timeslot reservation overview. Conceptually, our scheme maintains a list of timeslots per input and output buffer for each port. In our current implementation, we define a timeslot to be the time it takes to transmit an MTU-sized packet out of a buffer⁴. To transmit a packet burst from host h to h′, we seek to reserve a timeslot t on the switch input port corresponding to h, and a timeslot t′ on the output port corresponding to h′. Host h then transmits at a 'chosen timeslot', which is the greater of timeslots t and t′, to avoid a collision⁵. The astute reader will observe that we could instead let the switch choose the transmission time in a centralized manner rather than pairwise, but hardware constraints prevent us from doing so (§4.2).

Note, with the hosts choosing the transmission times, there is a risk of collision. With perfect scheduling, we would have no collisions, and would need near-zero buffering at the switch. However, PL2 uses a small amount of buffer space (less than 200 KB in our 100 Gbps testbed) to accommodate occasional collisions.

Figure 2: Scheduling example in PL2, with host H1 sending a packet burst to H2. (1) H1 sends an RSV to the switch to make a reservation. (2) The switch maintains timeslot reservations for the input and output ports connected to every host. It reserves the earliest available timeslots on H1's input port (T = 4) and H2's output port (T = 5). (3) The switch notifies H1 of these timeslots through a GRT message. (4) To avoid queuing, H1 then transmits at the maximum of the two timeslots indicated in the GRT, which is T = 5.

To run the above-mentioned scheme at line rate and within the constraints of switching hardware (outlined in §4.1.1), we designed an algorithm with the following division of logic between the switch and the hosts: (i) to transmit a packet burst of size K⁶, hosts send reserve or RSV packets to the switch to reserve timeslots, (ii) the switch grants a reservation of size K and responds to the host with a grant or GRT packet, which specifies the earliest available timeslot for the corresponding input port and output port, (iii) the host then picks the maximum of the two timeslots to avoid a collision, (iv) finally, the host converts the timeslot into a waiting time after which it transmits the packet burst. In doing so, the timeslot reservation logic is divided between the switch and the host. Importantly, both the switch and host logic stay simple and in line with switching hardware constraints, which we describe below.

⁴The ideal duration for a timeslot is system and workload dependent. It should be chosen as a function of common (or minimum) message sizes and link speeds to ensure high utilization; transmitting small messages in large timeslots is wasteful. Its choice is also determined by the transmission granularity in a system; e.g., a timeslot that is in picoseconds is useless because current hardware cannot transmit at such a fine granularity.
⁵Our design choice to always choose the greater of timeslots t and t′ can create gaps in time when no packet is scheduled.
⁶K is technically the number of timeslots a sender is allowed to reserve at a time.
4.1.1 Switch logic. Algorithm 1 shows the scheduling logic at the switch. To stay ahead of packet transmissions, despite the delay of each RSV-GRT exchange, the switch creates a schedule of input and output reservations for every port, in terms of timeslots. Each highlighted block in Algorithm 1 represents logic that can be implemented with a single P4 operation.

Algorithm 1 Switch Scheduling Algorithm. Each highlighted block is a single P4 operation.
1: INIT:
2:   inReservation[port 1..port n] ← {0}
3:   outReservation[port 1..port n] ← {0}
4: INPUT: packet
5: src ← source port of RSV
6: dst ← destination port requested
7: if packet is a RSV then
8:   packet.sendTimeslot ← inReservation[src]
9:   inReservation[src] += packet.demand
10:  packet.recvTimeslot ← outReservation[dst]
11:  outReservation[dst] += packet.demand
12:  send GRT
13: else
14:  outReservation[dst] -= 1
15:  inReservation[src] -= 1
16: end if

At switch start-up, the input and output reservations are initialized to zero (lines 2-3). In response to RSV packets, the switch sends back the next available input and output timeslots for the requested transmissions (lines 8, 10, 12). It also reserves enough timeslots for each RSV request (lines 9, 11). In Figure 2, the switch has reserved timeslot 4 at the source port (connected to Host 1) and timeslot 5 at the destination port (connected to Host 2) for the transmission. Note that the reservation timeslots have not lined up exactly at the two ports. We describe how the sending host uses these timeslots next. Lines 13-15 describe the switch logic for regular packets, in which the timeslot reservations for the packet are removed.

4.2 Hardware constraints

There are two key hardware constraints that Algorithm 1 satisfies. First, currently available programmable switching hardware cannot access more than one stateful memory object (SMO) at a time in an operation, per packet, at line-rates. This is why the inReservation and outReservation timeslot counters in Algorithm 1 are accessed and updated in two separate operations. Second, all pipelined network hardware can only read/modify/write SMOs once during the processing of packets. If an SMO is updated or accessed twice during the same pipeline, it results in race conditions across packets. This is why inReservation and outReservation have to be read and modified in a single atomic operation in Algorithm 1.

In the absence of the first limitation, the switch could update both inReservation and outReservation to max(inReservation, outReservation). If it could also cache the computed timeslot in packet metadata, it could send this timeslot information in response to a RSV message. Since the chosen timeslot increases monotonically, this scheme removes any possibility of collisions. In the current implementation, the switch relays inReservation and outReservation to the host, which computes the chosen timeslot to send at.

Why not implement Fastpass in a switch instead? Fastpass [48] is able to perform timeslot allocation with maximal matching because the scheduler processes a list of all demands in the network at once; implementing such a scheme is impossible in the switch dataplane at line rate, because switch pipelines cannot compare multiple RSV packets in flight. In addition, Fastpass performs timeslot allocation using a bitmap table that has a sender and a receiver bitmap to track multiple timeslots. Allocation requires a bitwise AND of the sender and receiver bitmaps, followed by a 'set' on the first available timeslot. Supporting such an algorithm would require hardware to be able to access and set at least two stateful memory objects in a single operation. Maintaining a sliding window of timeslots is similarly hard to achieve with dataplanes today because they expose minimal timing APIs that are restricted to timestamping; converting these timestamps to sliding windows requires accessing multiple stateful memory objects, one that maintains timestamp information and another that maintains timeslot information, and updating them at line rates for every RSV packet.

4.2.1 Host logic. On the host side, for lossless transmission, senders must ensure that their packet bursts do not collide with other transmissions at both the corresponding input and output ports on the switch. In fact, input and output ports at a switch might be shared by multiple hosts (for example, when a 100G link is divided into four 25G links and connected to four different hosts). End hosts in PL2 therefore conservatively choose a timeslot to transmit that is available on both the relevant switch input and output ports, using the equation chosenTimeslot ← max(sendTimeslot, recvTimeslot).

The sender then transmits packets at the chosenTimeslot by waiting for a period waitingTime calculated using the equation

waitingTime = timeslotWait − rsvGrtDelay − pendingDataDelay,    (1)

where timeslotWait = chosenTimeslot × MTU / linerate and pendingDataDelay = bytes / linerate.

The explanation for this calculation is as follows. The timeslot chosen for transmission is timeslotWait seconds into the future from when the reservation is made at the switch. This reservation is conveyed back to the sender after rsvGrtDelay/2 seconds. The first packet of the transmission would again take approximately rsvGrtDelay/2 seconds to reach the switch, after transmitting all the previously scheduled packets at the sender NIC (which would take pendingDataDelay seconds)⁷. The sender therefore waits only for the remaining fraction of time to ensure that the first packet of the burst arrives in time at the switch.

The aforementioned logic has some leeway with regard to where on the host it runs. In our current userspace stack implementation, one instance of the PL2 host logic runs per application thread and each thread maintains at most one outstanding RSV-GRT exchange. We believe a better implementation would be one where the host logic runs inside a NIC. In that case, the NIC can allow sending threads to have one outstanding RSV-GRT exchange per destination MAC address, so that a thread sending messages simultaneously to multiple destinations can do so without encountering head-of-line blocking for messages to unrelated destinations.

⁷We ensure that RSV-GRT packets do not wait behind data packets (if any) by prioritizing them using 802.1Q.
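To make the hand-shake and equation (1) concrete, below is a minimal Python reference model of the reservation logic; it is our own illustrative sketch (the real switch logic is a P4 program and the host logic lives in the VMA stack), and names such as SwitchScheduler, Grt, and waiting_time are ours, not PL2's.

# Minimal sketch of PL2's timeslot reservation, assuming one timeslot
# equals one MTU transmission time; not the actual P4/VMA implementation.
from dataclasses import dataclass

MTU_BYTES = 1500
LINE_RATE_BPS = 100e9
TIMESLOT_S = MTU_BYTES * 8 / LINE_RATE_BPS   # time to drain one MTU (one timeslot)

@dataclass
class Grt:
    send_timeslot: int   # earliest free timeslot on the switch input port
    recv_timeslot: int   # earliest free timeslot on the switch output port

class SwitchScheduler:
    """Per-port reservation counters, mirroring Algorithm 1."""
    def __init__(self, num_ports):
        self.in_res = [0] * num_ports    # plays the role of inReservation
        self.out_res = [0] * num_ports   # plays the role of outReservation

    def on_rsv(self, src_port, dst_port, demand):
        grt = Grt(self.in_res[src_port], self.out_res[dst_port])
        self.in_res[src_port] += demand
        self.out_res[dst_port] += demand
        return grt

    def on_data(self, src_port, dst_port):
        self.in_res[src_port] -= 1       # a data packet consumes one slot per port
        self.out_res[dst_port] -= 1

def waiting_time(grt, rsv_grt_delay_s, pending_bytes):
    """Host-side computation of equation (1)."""
    chosen = max(grt.send_timeslot, grt.recv_timeslot)
    timeslot_wait = chosen * TIMESLOT_S
    pending_data_delay = pending_bytes * 8 / LINE_RATE_BPS
    # A negative result means the granted slot is already due; send immediately
    # (clamping to zero is our simplification, not spelled out in the paper).
    return max(0.0, timeslot_wait - rsv_grt_delay_s - pending_data_delay)

# Example: H1 -> H2 burst of K = 4 packets on a freshly initialized switch.
switch = SwitchScheduler(num_ports=64)
grt = switch.on_rsv(src_port=1, dst_port=2, demand=4)
print(grt, waiting_time(grt, rsv_grt_delay_s=2e-6, pending_bytes=0))

In this model the two per-port counter arrays stand in for inReservation and outReservation, each updated independently, which is exactly the constraint that forces the max() computation onto the host.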
4.3 Setting the packet-burst size, K

A key parameter in PL2 is the packet-burst size, K. PL2 ensures stable queuing by proactively scheduling transmissions such that when these packet bursts are transmitted, they encounter almost no queuing. However, the timeslot reservations for the relevant input and output ports for a transmission determine when packet bursts are transmitted. The switch reserves as many timeslots as needed for a transmission based on the demand from the host, which is capped by K.

A small value of K limits the amount of head-of-line blocking a burst introduces at its input and output ports and ensures that PL2 supports a large number of concurrent transmissions at any point in time. However, when K is too small, the overhead of the RSV-GRT exchange dominates, lowering throughput and effective utilization. Similarly, large values of K help amortize the cost of the RSV-GRT exchange delay, but increase head-of-line blocking, because they cause the switch to reserve a burst of timeslots for the same host, potentially starving other hosts.

We find that K = 4 works best for our 100G testbed prototype and in our simulations for all the workloads we tested, based on a parameter sweep. §5 details the configurations in our testbed and simulations.

4.4 Reducing scheduling overheads

Each RSV-GRT exchange enables senders to determine the waiting time before transmitting on a given input/output port pair. Under light loads, the waiting times for a sender will mostly be 0 ns (or nominal at best), providing an opportunity to reduce the number of exchanges required to send messages. Such a reduction has the potential to reduce the minimum end-to-end message latencies in PL2, which are otherwise impacted by RSV-GRT exchange overheads.

We therefore design an optimization that enables senders to send unsolicited packet bursts immediately after sending a RSV packet to the switch, without waiting for the GRT. Unsolicited bursts are allowed only when the following two conditions are satisfied: (i) the timeslots at the input and output port known from a previous GRT are both below a threshold t, and (ii) the information about the reservation is deemed to be recent, i.e., obtained within a certain interval of time (e.g., comparable to the time for a RSV-GRT exchange). Condition (ii) is met when senders send consecutive bursts to the same destination. Packets that are sent unsolicited are marked by using a spare packet header field.

Algorithm 2 Scheduling at sender
1: PARAMETERS: t, K
2: INIT:
3:   lastChosenTimeslot ← {−1}
4:   lastResponseTime ← {0}
5: Function scheduleBurst
6: INPUT: packet burst with K or fewer packets
7: Send RSV for burst
8: if lastChosenTimeslot < t and lastResponseTime is current then
9:   Send unsolicited packet burst
10: else
11:  Call receiveGRT
12:  Send packet burst after waitingTime
13: end if
14: Function receiveGRT
15: INPUT: GRT
16: if chosenTimeslot > t then
17:   if lastChosenTimeslot < t then
18:     Resend packet burst corresponding to GRT
19:   end if
20: end if
21: lastChosenTimeslot ← chosenTimeslot
22: lastResponseTime ← now()

Algorithm 2 shows the sender-side scheduling logic. As usual, before sending a burst, RSV packets are sent (line 7). However, if lastChosenTimeslot was within t and the previous GRT was recent, an unsolicited packet burst is sent even before the next GRT arrives (lines 8-9). The switch schedule is modified to pass through unsolicited bursts, unless the reservation queues are larger than a threshold T >> t, in which case unsolicited bursts are dropped (not shown in Algorithm 1). When the sender receives a GRT back, it determines whether the new chosenTimeslot > t (where chosenTimeslot is calculated using the equation from Section 4.2.1) and, if so, resends the packet burst corresponding to the GRT (lines 16-20) at the right timeslot. This ensures that the packet burst arrives at the time it is expected at the switch, even if the unsolicited burst was dropped. However, since T >> t, it is also possible that both the unsolicited and the scheduled packets arrive at the receiver, and the receiver has to deal with a small number of duplicate packets.

In general, the threshold t is set to a sufficiently low value, so that the optimization only applies at very low loads.
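The sketch below renders Algorithm 2 in Python for readability; it is our own approximation, with RSV transmission and GRT delivery stubbed out as callables (get_grt, transmit) that are not part of PL2's actual interfaces, and with the "recent" window picked as an assumed value.

import time

T = 15            # threshold t: unsolicited bursts allowed below this timeslot
RECENT_S = 2e-6   # GRT considered current within ~one RSV-GRT exchange (assumed)

class Sender:
    """Rough rendering of Algorithm 2 (sender-side scheduling)."""
    def __init__(self, transmit):
        self.transmit = transmit             # callable(burst, delay_s)
        self.last_chosen_timeslot = -1
        self.last_response_time = 0.0

    def schedule_burst(self, burst, get_grt):
        # An RSV for the burst is assumed to have been sent already (stubbed out).
        recent = (time.monotonic() - self.last_response_time) < RECENT_S
        if self.last_chosen_timeslot < T and recent:
            self.transmit(burst, 0.0)                   # unsolicited fast path
            self.receive_grt(get_grt(), burst)          # the GRT still arrives later
        else:
            self.receive_grt(get_grt(), burst, already_sent=False)

    def receive_grt(self, grt, burst, already_sent=True):
        send_ts, recv_ts, waiting_s = grt   # granted timeslots + waitingTime (eq. 1)
        chosen = max(send_ts, recv_ts)
        if not already_sent:
            self.transmit(burst, waiting_s)             # normal scheduled path
        elif chosen > T and self.last_chosen_timeslot < T:
            self.transmit(burst, waiting_s)             # unsolicited copy may be dropped
        self.last_chosen_timeslot = chosen
        self.last_response_time = time.monotonic()

# Toy usage with stubs standing in for the NIC and the switch.
s = Sender(transmit=lambda b, d: print(f"send {len(b)} packets after {d * 1e6:.1f} us"))
s.schedule_burst(["p0", "p1", "p2", "p3"], get_grt=lambda: (4, 5, 0.0))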
In our testbed implementation, we found t to be robust across a broad range of values and workloads. All our testbed experiments are run with a setting of t = 15. Sweeping from t = 10 up to t = 25 does not show statistically significant changes to loss rates or latency (of course, setting t to large values like 35 does lead to loss).

This optimization results in some transmissions arriving at the switch at the same time under low load. We find that when there are only a few senders to a port, and there are continuous transmissions at a low rate, the optimization helps reduce the minimum message latencies; this is because the queuing caused by such simultaneous transmissions is smaller than the RSV-GRT message exchange delay under low loads.

Tonic [12] and Sidler et al. [54] have demonstrated that it is possible to place complex congestion control logic into NIC hardware. Since Algorithm 2 requires only a minimal subset of the supported logic (timed transmit), we believe it will be easier to implement in NIC hardware. Such an implementation will allow GPUs and FPGAs (apart from CPUs) to access PL2 logic directly.

4.5 Other design considerations

Implications for intra-rack transports. We find that PL2 is able to provide congestion control to all traffic within the rack. One key advantage with TCP over PL2 is that transmissions with PL2 rely on current knowledge of network demand (using RSV-GRT), rather than TCP's window estimate, which might become stale depending on the time of the last transmission within a flow. We are able to completely turn off TCP congestion control when using PL2 underneath in our 100G rack prototype (§5).
Handling failures. When a PL2 ToR fails, the entire rack fails, as is the case with a non-PL2 rack. When a PL2 sender fails, its reservations on the ToR switch (up to K outstanding for each connection) will need to be removed using external detection and recovery logic.

One way to detect a failed sender is to have GRT packets additionally sent to receivers, and have receivers track pending transmissions. This adds a small overhead, comparable to an ACK packet for every K packets, and therefore trades off a small amount of receiver bandwidth for better failure tolerance. When a sender crashes or misbehaves, the receiver can detect the failure and reset the pending reservations at the switch. This scheme has the additional advantage of informing the receiver of upcoming transmissions; receivers can use this information to schedule the receiving process in time to receive the transmission, further driving down the end-to-end message delay with systems like [46]. We aim to look at this issue in the future and study its overheads.

5 Implementation

We have built a 100 Gbps solution that uses PL2 in a rack. The switch scheduling algorithm (Algorithm 1) is implemented using a P4 program and runs on a ToR switch's programmable dataplane ASIC [3]. The switching delay with PL2 enabled is measured to be between 346 ns (min) and 508 ns (99.99%ile), with a median delay of 347 ns and a standard deviation of 3 ns. inReservation and outReservation are 32-bit register arrays updated in one switch pipeline stage; since the switch has 64 ports, each array consists of 64 registers.

We use the Mellanox Messaging Accelerator (VMA) [7] to prototype the sender-side scheduling support; we choose VMA over DPDK [2] and the network stack because VMA provides lower latencies on Mellanox NICs. The TCP/IP library integrated in VMA allows us to compare TCP and UDP performance with and without PL2. We are able to turn off congestion control support in TCP when using PL2 underneath. We also augment RDMA raw Ethernet with PL2 scheduling to mimic PL2 support for traffic generated by accelerators that might lack congestion control.

PL2 prototype topology and configuration. The PL2 hardware prototype is a rack with 6 servers, each equipped with a Mellanox ConnectX-5 100 Gbps NIC. Each server has two 28-core Intel Xeon Gold 5120 2.20 GHz CPUs and 196 GiB of memory, and runs Ubuntu 18.04 with Linux kernel version 4.15 and Mellanox OFED version 4.4. The servers connect to a 6.5 Tbps programmable dataplane switch [3] with 64 physical 100 Gbps ports. The switch runs OpenNetworkLinux [10] with kernel version 3.16. The network MTU is 1500 bytes.

Our servers also connect to a Mellanox SN2700 switch using a second port on the Mellanox ConnectX-5 NIC. The servers synchronize over this out-of-band network using the IEEE 1588 Precision Time Protocol (PTP) [1], using hardware NIC timestamps and the boundary clock function at the Mellanox switch. Note, the time synchronization setup is only used for our experiment infrastructure and is not required to run PL2.

RSV-GRT exchange delay. We implement RSV and GRT packets as 64-byte Ethernet control packets. PL2 continuously measures RSV-GRT delays using hardware and software timestamping and uses these measures to correctly estimate packet transmission times (equation 1). We find that RSV-GRT exchanges can have variable delay even in an unloaded network: in an idle network the exchanges take between 1 µs (min), 1.06 µs (median) and 14 µs (max) NIC-to-NIC using hardware timestamping, and between 1.98 µs (min), 2.05 µs (median) and 22.25 µs (max) using software timestamping. Interestingly, we see similar variance and tails when we transmit data close to line rate using PL2. We anticipate that the performance that PL2 offers will get better if this variability is remedied.

Interfacing with application send. PL2 is prototyped on a user-space stack, where application send and receive are implemented in the same thread (as opposed to send and receive being handled in separate threads). We intercept function calls within the VMA library [7] that execute the actual send (send_lwip_buffer and send_ring_buffer) and send RSV packets for every K or fewer packets. Because the receive executes in the same thread, we also wait for the GRT before sending each packet burst; i.e., the application send call blocks until the transmission completes. When employing the low-load optimization discussed in §4.4, even though we send packet bursts together with the RSV, we wait to receive the GRT before transmitting the next burst. As such, we have not been able to eliminate the overheads due to RSV-GRT exchange to the degree that the optimization design permits in our current implementation. A more favorable implementation of PL2 would execute RSV-GRT receives in a separate thread or offload RSV-GRT exchanges into the NIC.
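To illustrate the per-burst RSV batching described above, here is a small Python sketch; the helper callables stand in for the intercepted VMA send path and the RSV/GRT exchange and are hypothetical names, not VMA or PL2 APIs.

K = 4       # packet-burst size used by the prototype
MTU = 1500  # bytes

def send_message(payload, reserve_burst, transmit_burst):
    """Cut a message into bursts of at most K MTU-sized packets and issue one
    RSV per burst, blocking for the GRT before transmitting (as the prototype's
    send path does). reserve_burst/transmit_burst are placeholder callables,
    not VMA's real entry points."""
    packets = [payload[i:i + MTU] for i in range(0, len(payload), MTU)]
    for i in range(0, len(packets), K):
        burst = packets[i:i + K]
        waiting_s = reserve_burst(len(burst))   # RSV sent, blocks until GRT
        transmit_burst(burst, waiting_s)        # send after waitingTime

# Toy usage: a 6000-byte message becomes one burst of 4 packets.
send_message(b"x" * 6000,
             reserve_burst=lambda n: 0.0,
             transmit_burst=lambda b, w: print(f"burst of {len(b)} packets after {w} s"))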
6 Evaluation

We evaluate the PL2 100 Gbps prototype for the desired properties of losslessness, low latency and high utilization across various transports.

6.1 Experiment setup and methodology

We use burst size K = 4 and threshold t = 15 in all experiments, configured as described in §4.3 and §4.4. We next describe the applications, transports, loads and traffic patterns we have evaluated.

memcached [8]. We evaluate the impact of worst-case background loads on the latencies and throughput of single memcached client-server instances that reside on separate hosts. The keys are 64 B, with 1 KiB and 4 KiB values. memcached uses TCP communication. The client executes reads (GET) continuously and the responses compete with background traffic. We use reads rather than a mix of reads and writes because reads stress both the forward and reverse paths. Without background traffic, memcached introduces 2-3 Gbps of network load.

Machine learning with VGG16 [11]. VGG16 is a popular convolutional neural network model used for large-scale image recognition. We emulate the network communication for VGG16 training, where gradients from each neural network layer are transferred over the network. Each parameter server in our setup receives messages from 4 workers at a total load of 70-89 Gbps.

Workload traces. We evaluate PL2 with traces generated from message-size distributions collected from various real-world modern application environments (W1-W5) [14, 17, 51, 55], also used in Homa [44]. W1 is generated from a collection of memcached servers at Facebook, W2 from a search application at Google, W3 from aggregated RPC workloads from a Google datacenter, W4 from a Hadoop cluster at Facebook, and W5 is the web search workload used for DCTCP. The traces are generated in an open loop with Poisson arrivals. These workloads represent modern database and analytics workloads that we expect rack-scale networks to run. The workload generator that replays these traces does not prioritize messages from one thread over another. We believe this accurately mimics several user-space applications competing for the network independently across several cores.

Transport protocols. We evaluate the performance of raw Ethernet, UDP and VMA TCP over PL2. Raw Ethernet and UDP represent accelerator transports that do not have end-host-based congestion control support. When we use PL2 underneath VMA TCP, we turn off VMA's congestion control support⁸. We compare PL2 congestion control with VMA TCP Cubic and a receiver-driven congestion control scheme (RDS) based on Homa.

⁸VMA only provides Cubic and Reno congestion control options, of which we have chosen Cubic.

Server-to-server raw Ethernet provides an optimal baseline for comparison with PL2. Raw Ethernet is unhindered by congestion control, and transmits messages as fast as they arrive. No congestion control scheme can achieve smaller delays than raw Ethernet. Recently, [30, 44] showed that RDS-like mechanisms can provide significantly smaller latencies than available congestion control schemes. Therefore we choose to implement and compare PL2 with RDS on the testbed. We also compare PL2 with VMA TCP Cubic, because VMA touts impressive latency improvements.

Our RDS implementation achieves close to 100 Gbps line rates by separating receiver-driven scheduling from data processing; we schedule up to 4 packets when possible and use a lock-free mechanism for communication between the scheduling and data-processing threads. When packets are lost, the receiver can detect these losses by monitoring out-of-order packet arrivals. In these cases, the receiver notifies the sender of the lost packets immediately (rather than waiting for a timeout [44]). This helps reduce delays for the majority of lost packets. Sometimes an entire burst of packets is lost, and in these cases the receiver cannot effectively determine whether the packets are lost or delayed; instead it times out after 1 ms and sends a lost-packet notification to the sender. Senders send up to 61 KiB, the bandwidth-delay product in our system, blindly as fast as they can (similar to RTTbytes in [44]). If these initial packets are lost, the sender times out after 1 ms and resends the packets.
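As a concrete example of the open-loop Poisson arrival process used to replay W1-W5 above, the sketch below generates (arrival time, message size) pairs; the uniform size sampler is a stand-in for illustration only, not the published W1-W5 distributions.

import random

def generate_trace(load_gbps, mean_msg_bytes, duration_s, sample_size):
    """Open-loop trace with Poisson (exponential inter-arrival) message arrivals.
    sample_size() draws a message size; the uniform stand-in used below is NOT
    the real W1-W5 distributions, which come from the published traces."""
    rate = load_gbps * 1e9 / 8 / mean_msg_bytes      # messages per second
    t, trace = 0.0, []
    while t < duration_s:
        t += random.expovariate(rate)
        trace.append((t, sample_size()))
    return trace

# ~30.5 kB is roughly the mean of the stand-in sampler below.
trace = generate_trace(load_gbps=18, mean_msg_bytes=30_500, duration_s=0.01,
                       sample_size=lambda: random.randint(64, 61_000))
print(len(trace), "messages, first at",
      f"{trace[0][0] * 1e6:.1f} us,", trace[0][1], "bytes")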
We are unable to make a fair comparison with DCTCP [13] and other congestion control schemes available in the Linux kernel, because PL2 is only implemented at user-space (using VMA) and does not experience the overheads of the Linux kernel.

Traffic patterns, link loads and latency evaluation. We evaluate PL2 under incast, outcast, and shuffle traffic patterns. Incast helps demonstrate the lossless behavior of PL2. Outcast is PL2's worst-case traffic pattern, because it stresses senders that not only transfer data but also have to do RSV-GRT signalling.

Figure 3: Memcached throughput with competing TCP traffic.

We control the network load at each server by injecting traffic using multiple threads pinned to different cores. We experiment both with persistent congestion caused by multiple long-running background flows, and with microbursts caused by replaying W1-W5 traces from multiple threads.

We present results for 99%ile and median latencies measured in user-space; for memcached, we measure two-way request-response delays; for other workloads, we measure one-way latencies, i.e., the time from message arrival at the sending thread to the time when the message is delivered at the receiving thread.

The rest of the evaluation section summarizes our findings.

Figure 4: Memcached latencies with competing TCP traffic.

6.2 End-to-end application performance

We experiment with two real-world applications, memcached and VGG16 training. When memcached competes with heavy (incast) background traffic that has congestion control support, PL2 can ensure up to 2.3x lower RPC latencies, and a corresponding 1.8x improvement in application throughput, compared to VMA TCP Cubic. When it competes with traffic without congestion control (the kind of traffic we expect from accelerators), memcached over PL2 still sees similar latencies and throughput, whereas without PL2, memcached RPCs fail to complete due to severe losses. This demonstrates that PL2 is able to provide fabric-level congestion control for all transports that use it.

Experiments with VGG16 training show that PL2 can reduce network transfer times per iteration by 30% compared to RDS (222 ms vs. 321 ms 99%ile latency per iteration); we achieve a 20% speed-up for 100 iterations (9.5 s versus 12 s).

6.2.1 PL2 keeps memcached 99%ile latencies below 60 µs even with 400% offered load. Figures 3-4 show memcached transaction throughput and response latencies with TCP background traffic. The graphs shown are with 4 KiB values; the results for the same experiment with 1 KiB values show the same trend. Figure 3 shows how memcached throughput (in transactions per second) reduces with increasing TCP background traffic. The red bar shows throughput with memcached over TCP over PL2 (TCP VMA PL2), and the green bar shows throughput with memcached over TCP with Cubic congestion control (TCP VMA Cubic).

When there is no background traffic (bars for 0 hosts sending background traffic), PL2 slows down memcached by a small amount (8%), even though the RPC latencies are similar to TCP VMA Cubic (Figure 4). PL2 RSV-GRT exchange overheads contribute to the reduced throughput seen with TCP VMA PL2. The same effect is seen in the case where the memcached host also sends background traffic that utilizes around 50% of the downlink to the memcached client (50 Gbps). Once we add background traffic load from additional servers, the load on the receiving link increases to 80-90% and the memcached latencies go up much faster for TCP VMA Cubic than for TCP VMA PL2, as seen in Figure 4; consequently memcached has 0.5x lower throughput with TCP VMA Cubic. This is despite the fact that the competing background traffic that runs over TCP VMA PL2 has 1-5 Gbps more load than the background traffic that runs on VMA TCP Cubic. The 99th-percentile RPC latency with VMA TCP Cubic is 127 µs, as opposed to 56 µs with PL2, with 4-way TCP incast background traffic.

Figure 5: Memcached throughput with competing UDP traffic.
Figure 6: Memcached latencies with competing UDP traffic.

6.2.2 PL2 keeps memcached latencies low (25 µs) even when competing with traffic with no end-host-based congestion control. Figures 5-6 show memcached transaction throughput and response latencies with UDP background traffic (the memcached traffic still uses TCP). Since UDP has no congestion control, it presents particularly severe competition to memcached traffic. Our goal with this experiment is to show how PL2 enables multiple transports, with or without congestion control, to co-exist within a rack.

Figure 5 shows the throughput of memcached as we increase the UDP background load. Without PL2, the memcached benchmark does not complete due to severe losses. Figure 6 shows the 99%ile and median memcached RPC latencies with VMA TCP Cubic and PL2. Memcached sees 5 s 99th-percentile latency without PL2 (due to drops) as opposed to 25 µs with PL2, with 4-way UDP incast background traffic.

Once we introduce background traffic from 2 or more hosts, we find that memcached incurs severe losses without PL2, and its throughput drops to 3-7 transactions per second, while the tail latency shoots up to several seconds (Figure 6). However, with PL2, the memcached RPC latencies do not degrade with such severity (even the tail latencies remain steady, with some increase in the median), and memcached continues to have throughput similar to our experiment with TCP background traffic. This is because PL2 keeps the UDP background traffic from 2-4 hosts within 75-90 Gbps.

6.2.3 PL2 improves training latencies for VGG16 by 30%. The communication pattern for exchanging gradient updates in VGG16 is inherently a shuffle process. We have 4 hosts sending data as workers and 2 hosts as parameter servers, which receive gradients from all the workers. The gradient data (500 MiB) is partitioned according to the VGG16 architecture. The aggregate incoming rate to each parameter server is less than the line rate, to ensure the network is not a bottleneck. We repeat this process 100x and measure the finishing time of transferring the entire gradient set at each iteration.

Figure 7: CDF of latency for all iterations in VGG16 model training using PL2 and RDS.

Figure 7 shows the CDF of iteration times using the receiver-driven scheme and PL2. We did not prioritize small messages with the receiver-driven scheme like Homa [44] does, because that interferes with the ordering of messages that the training setup expects. We find that the 99th-percentile finishing time of each iteration in the receiver-driven scheme is 1.45x that of PL2. To finish 100 gradient sets from each worker, PL2 takes 9.5 s in total while the receiver-driven scheme takes 12 s. The receiver-driven approach is slower because the receiver is unaware of traffic from the sender to other hosts: it frequently allocates timeslots when the sender is already sending to another receiver. This leads to lower link utilization. PL2 does not have this issue because it accounts for queuing at both the sender and the receiver.

6.3 W1-W5: Near-optimal p99 latencies

We compare how close message latencies for W1-W5 are to optimal with 3-way incast using raw Ethernet over PL2. We do this by using a baseline where W1-W5 are run using just raw Ethernet between two servers. Since the baseline encounters no contention, and raw Ethernet transmits messages as soon as they arrive (with no congestion control), the message latencies it encounters are close to optimal. We call this scheme raw Ethernet (optimal). When comparing 3-way incast results for raw Ethernet over PL2 to raw Ethernet (optimal), we try to keep the network load equivalent. We do so by using the same number of threads in total to run the workload across the two schemes (11 threads per server, 33 threads in total).

Figure 8: Message latencies of workloads W3-W5 for raw Ethernet and TCP: (a) W3 All RPC (raw Ethernet), (b) W4 Hadoop (raw Ethernet), (c) W5 Search (raw Ethernet), (d) W3 All RPC (TCP), (e) W4 Hadoop (TCP), (f) W5 Search (TCP).

Figure 9: Incast evaluation comparing against raw Ethernet and TCP: (a) throughput over Ethernet, (b) switch queuing delays for raw Ethernet, (c) throughput over TCP and PL2.

Figures 8a-8c present the message latencies for workload traces W3-W5 over raw Ethernet. For these workloads, 3-way incast achieves 40-90 Gbps throughput. PL2 over Ethernet has 99%ile latencies close to raw Ethernet (optimal), and the throughputs obtained with and without PL2 are similar (except in the case of Figure 8b, where PL2 over Ethernet gets slightly lower throughput compared to raw Ethernet (optimal) for W4, 86 Gbps vs. 89 Gbps, due to RSV-GRT exchange overhead). The impact of RSV-GRT exchanges is more prominent in the median latencies in Figures 8b and 8c.

We see latency spikes in Figures 8a and 8c; these are outliers that occur when message arrivals exceeded the capacity of our workload generator and it ran out of available worker threads to transmit the next message.

6.3.1 W1-W5 over TCP: 10x lower 99%ile latencies. We also compare 3-way incast of W1-W5 with TCP over PL2 to 3-way incast with TCP Cubic. Figures 8d, 8e and 8f show that TCP Cubic has worse median and 99%ile latencies than TCP over PL2, except in the case of Figure 8d. VMA TCP Cubic has better latencies with W3 because it undergoes a congestion collapse and achieves only 43 Gbps, as compared to 81 Gbps with TCP over PL2. In all other cases, we find that VMA TCP Cubic achieves throughputs close to TCP with PL2 while having retransmissions due to message loss (TCP with PL2 has no losses). We believe these graphs show the stable-queuing effect of proactive congestion control.

Figure 10: Outcast throughput comparison, raw Ethernet vs. PL2 over Ethernet.

Fairness Like Homa [44], PL2 is unfair, but not to large messages. PL2 is unfair because the switch scheduler schedules bursts in FIFO order of receiving RSV packets. The scheduling size K limits this unfairness; under loaded conditions, until the K scheduled packets are sent, the next RSV packet is not sent (a minimal sketch of this FIFO scheduling appears at the end of this section). We believe that the PL2 scheduler design should change to eliminate this unfairness once switch hardware is able to handle richer reservation logic.

In-network priorities PL2 does not implement in-network priorities; this design also stems from the limitation of switch hardware processing RSV packets in FIFO order. However, given that PL2 does not drop packets and ensures that at any point only a few (K) packets from different senders are in flight to a receiver, we anticipate that in-network priorities will not have a big role to play in further reducing latencies with PL2. Rather, the order in which RSV packets are admitted into the network would play a bigger role, which in turn is determined by (i) external policies that govern the rate at which RSV packets can be sent by an application; (ii) process scheduling prioritization within the rack, e.g., how many cores the communicating processes are given and how often applications are scheduled in comparison to others; and (iii) the application's internal logic. We aim to study these implications in our future work.

Handling inter-rack traffic To the PL2 scheduler, traffic leaving the rack through a port on the ToR is no different from intra-rack traffic; the exiting traffic will also be scheduled using the timeslot reservation scheme. However, this traffic will not be coordinated with other inter-rack traffic destined to an external rack. With PL2's current design, static bandwidth reservations will be required to ensure that traffic entering a PL2 rack will not disrupt PL2 guarantees for intra-rack traffic, or be dropped altogether; e.g., reserving 10% of the rack bandwidth for ingress traffic. PL2 can reflect such reservations either in the length of the timeslot or in the reservation increments.
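To make the FIFO scheduling described under Fairness concrete, the sketch below (ours, not PL2's programmable-switch implementation; the class, line rate, and timestamps are invented for illustration) models a per-egress-port scheduler that grants burst reservations strictly in RSV arrival order:

from collections import namedtuple

K = 4                      # maximum packets per reserved burst
MTU = 1500                 # bytes
LINE_RATE_GBPS = 100       # 100 Gbps = 100 bits per nanosecond

Grant = namedtuple("Grant", "sender start_ns duration_ns")

class FifoPortScheduler:
    """Tracks the next free transmission time of a single egress port."""
    def __init__(self):
        self.next_free_ns = 0.0

    def reserve(self, sender, now_ns, num_pkts, pkt_bytes=MTU):
        num_pkts = min(num_pkts, K)                            # bursts are capped at K packets
        duration = num_pkts * pkt_bytes * 8 / LINE_RATE_GBPS   # ns to drain the burst at line rate
        start = max(now_ns, self.next_free_ns)                 # FIFO: wait behind all earlier grants
        self.next_free_ns = start + duration                   # port is reserved until the burst drains
        return Grant(sender, start, duration)

# Three senders RSV the same receiver port almost simultaneously; the earliest
# RSV wins the earliest slot regardless of message size or priority.
port = FifoPortScheduler()
for sender, rsv_arrival_ns in [("A", 0.0), ("B", 5.0), ("C", 6.0)]:
    print(port.reserve(sender, rsv_arrival_ns, num_pkts=4))

Because grants are handed out purely in arrival order, a latency-sensitive RSV that lands just behind a train of earlier bursts waits for all of them; the K-packet cap on each burst is what bounds that wait.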

8 Conclusion
In this paper, we present the PL2 rack-scale network architecture, designed to convert Ethernet into a reliable, predictable-latency, high-speed interconnect for high-density racks with accelerators by leveraging new capabilities in programmable switches. Our hardware prototype demonstrates that PL2 does so by providing practical and tightly coordinated congestion control. Further, we achieve our goals without any knowledge of workload characteristics or any assumptions about future hardware stacks. We believe that the design presented in this paper will spur new ideas around switch and NIC hardware, and how they interface with and simplify the network and application stacks.

This work does not raise any ethical issues.

References
[1] Precision Time Protocol (PTP). http://linuxptp.sourceforge.net/.
[2] Workloads Traces Source. https://github.com/PlatformLab/HomaSimulation/tree/omnet_simulations/RpcTransportDesign/OMNeT%2B%2BSimulation/homatransport/sizeDistributions.
[3] Barefoot Tofino. https://www.barefootnetworks.com/technology/#tofino.
[4] Dell EMC VxRail. https://www.dellemc.com/resources/en-us/asset/data-sheets/products/converged-infrastructure/vxrail-datasheet.pdf.
[5] Gen-Z. https://genzconsortium.org/.
[6] An introduction to CCIX white paper. https://www.ccixconsortium.com/wp-content/uploads/2019/11/CCIX-White-Paper-Rev111219.pdf.
[7] Mellanox Messaging Accelerator (VMA). http://www.mellanox.com/page/software_vma.
[8] Memcached. https://memcached.org/.
[9] Omni-Path. https://www.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-driving-exascale-computing.html.
[10] Open Network Linux. https://opennetlinux.org/.
[11] Very deep convolutional networks for large-scale image recognition. http://www.image-net.org/challenges/LSVRC/2014/results.
[12] Enabling programmable transport protocols in high-speed NICs. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), Santa Clara, CA, February 2020. USENIX Association.
[13] Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. Data center TCP (DCTCP). In Proceedings of the ACM SIGCOMM 2010 Conference, SIGCOMM '10, 2010.
[14] Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. Data center TCP (DCTCP). In Proceedings of the ACM SIGCOMM 2010 Conference, SIGCOMM '10, pages 63–74, New York, NY, USA, 2010. Association for Computing Machinery.
[15] Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and Masato Yasuda. Less is more: Trading a little bandwidth for ultra-low latency in the data center. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI '12, 2012.
[16] Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker. pFabric: Minimal near-optimal datacenter transport. In Proceedings of the ACM SIGCOMM 2013 Conference, SIGCOMM '13, 2013.
[17] Berk Atikoglu, Yuehai Xu, Eitan Frachtenberg, Song Jiang, and Mike Paleczny. Workload analysis of a large-scale key-value store. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '12, pages 53–64, New York, NY, USA, 2012. Association for Computing Machinery.
[18] Hitesh Ballani, Paolo Costa, Thomas Karagiannis, and Ant Rowstron. Towards predictable datacenter networks. In Proceedings of the ACM SIGCOMM 2011 Conference, SIGCOMM '11, 2011.
[19] Theophilus Benson, Aditya Akella, and David A. Maltz. Network traffic characteristics of data centers in the wild. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, IMC '10, pages 267–280, New York, NY, USA, 2010. Association for Computing Machinery.
[20] Lawrence S. Brakmo, Sean W. O'Malley, and Larry L. Peterson. TCP Vegas: New techniques for congestion detection and avoidance. In Proceedings of the ACM SIGCOMM 1994 Conference, SIGCOMM '94, 1994.
[21] Peng Cheng, Fengyuan Ren, Ran Shu, and Chuang Lin. Catch the whole lot in an action: Rapid precise packet loss notification in data centers. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation, NSDI '14, 2014.
[22] Inho Cho, Keon Jang, and Dongsu Han. Credit-scheduled delay-bounded congestion control for datacenters. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, SIGCOMM '17, 2017.
[23] Paolo Costa, Hitesh Ballani, Kaveh Razavi, and Ian Kash. R2C2: A network stack for rack-scale computers. SIGCOMM Comput. Commun. Rev., 45(4):551–564, August 2015.
[24] Jeffrey Dean and Luiz André Barroso. The tail at scale. Communications of the ACM, 56:74–80, 2013.
[25] Field programmable gate array over fabric. https://cdrdv2.intel.com/v1/dl/getContent/608298.
[26] Peter X. Gao, Akshay Narayan, Gautam Kumar, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. pHost: Distributed near-optimal datacenter transport over commodity network fabric. In Proceedings of the 11th ACM Conference on Emerging Networking Experiments and Technologies, CoNEXT '15, 2015.
[27] Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog, Robert N. M. Watson, Andrew W. Moore, Steven Hand, and Jon Crowcroft. Queues don't matter when you can JUMP them! In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), 2015.
[28] Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, and Marina Lipshteyn. RDMA over commodity Ethernet at scale. In Proceedings of the 2016 ACM SIGCOMM Conference, SIGCOMM '16, pages 202–215, New York, NY, USA, 2016. Association for Computing Machinery.
[29] Sangtae Ha, Injong Rhee, and Lisong Xu. CUBIC: A new TCP-friendly high-speed TCP variant. SIGOPS Oper. Syst. Rev., 42(5), July 2008.
[30] Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, Andrew W. Moore, Gianni Antichi, and Marcin Wójcik. Re-architecting datacenter networks and stacks for low latency and high performance. In Proceedings of the ACM SIGCOMM 2017 Conference, SIGCOMM '17, 2017.
[31] Keqiang He, Eric Rozner, Kanak Agarwal, Yu (Jason) Gu, Wes Felter, John Carter, and Aditya Akella. AC/DC TCP: Virtual congestion control enforcement for datacenter networks. In SIGCOMM, 2016.
[32] InfiniBand Trade Association. Supplement to InfiniBand Architecture Specification Volume 1 Release 1.2.1 Annex A17: RoCEv2. https://cw.infinibandta.org/document/dl/7781, 2014.
[33] Intel Rack Scale Design v2.5: Architecture specification. https://www.intel.com/content/www/us/en/architecture-and-technology/rack-scale-design/architecture-spec-v2-5.html.
[34] Keon Jang, Justine Sherry, Hitesh Ballani, and Toby Moncaster. Silo: Predictable message latency in the cloud. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, SIGCOMM '15, 2015.
[35] EunYoung Jeong, Shinae Wood, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, and KyoungSoo Park. mTCP: A highly scalable user-level TCP stack for multicore systems. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14). USENIX Association, 2014.
[36] Vimalkumar Jeyakumar, Mohammad Alizadeh, David Mazières, Balaji Prabhakar, Albert Greenberg, and Changhoon Kim. EyeQ: Practical network performance isolation at the edge. In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), 2013.
[37] Anuj Kalia, Michael Kaminsky, and David Andersen. Datacenter RPCs can be general and fast. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), Boston, MA, 2019.
[38] Antoine Kaufmann, Simon Peter, Naveen Kr. Sharma, Thomas Anderson, and Arvind Krishnamurthy. High performance packet processing with FlexNIC. In ASPLOS, 2016.
[39] Sangman Kim, Seonggu Huh, Xinya Zhang, Yige Hu, Amir Wated, Emmett Witchel, and Mark Silberstein. GPUnet: Networking abstractions for GPU programs. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 201–216, Broomfield, CO, October 2014. USENIX Association.
[40] Alok Kumar, Sushant Jain, Uday Naik, Anand Raghuraman, Nikhil Kasinadhuni, Enrique Cauich Zermeno, C. Stephen Gunn, Jing Ai, Björn Carlin, Mihai Amarandei-Stavila, et al. BwE: Flexible, hierarchical bandwidth allocation for WAN distributed computing. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, SIGCOMM '15, pages 1–14, New York, NY, USA, 2015. Association for Computing Machinery.
[41] Praveen Kumar, Nandita Dukkipati, Nathan Lewis, Yi Cui, Yaogong Wang, Chonggang Li, Valas Valancius, Jake Adriaens, Steve Gribble, Nate Foster, and Amin Vahdat. PicNIC: Predictable virtualized NIC. In Proceedings of the ACM SIGCOMM 2019 Conference, SIGCOMM '19, 2019.
[42] Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, and Minlan Yu. HPCC: High precision congestion control. In Proceedings of the ACM SIGCOMM 2019 Conference, SIGCOMM '19, 2019.
[43] Radhika Mittal, Terry Lam, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, and David Zats. TIMELY: RTT-based congestion control for the datacenter. In SIGCOMM, 2015.
[44] Behnam Montazeri, Yilong Li, Mohammad Alizadeh, and John Ousterhout. Homa: A receiver-driven low-latency transport protocol using network priorities. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, SIGCOMM '18, 2018.
[45] Max Noormohammadpour and Cauligi Raghavendra. Datacenter traffic control: Understanding techniques and trade-offs. IEEE Communications Surveys & Tutorials, 20:1492–1525, 2018.
[46] Amy Ousterhout, Joshua Fried, Jonathan Behrens, Adam Belay, and Hari Balakrishnan. Shenango: Achieving high CPU efficiency for latency-sensitive datacenter workloads. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), Boston, MA, 2019.
[47] Jonathan Perry, Hari Balakrishnan, and Devavrat Shah. Flowtune: Flowlet control for datacenter networks. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), Boston, MA, 2017.
[48] Jonathan Perry, Amy Ousterhout, Hari Balakrishnan, Devavrat Shah, and Hans Fugal. Fastpass: A centralized "zero-queue" datacenter network. In SIGCOMM, 2014.
[49] IEEE 802.1Qbb. http://1.ieee802.org/dcb/802-1qbb/.
[50] Charles Reiss, Alexey Tumanov, Gregory R. Ganger, Randy H. Katz, and Michael A. Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC '12, 2012.
[51] Arjun Roy, Hongyi Zeng, Jasmeet Bagga, George Porter, and Alex C. Snoeren. Inside the social network's (datacenter) network. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, SIGCOMM '15, pages 123–137, New York, NY, USA, 2015. Association for Computing Machinery.
[52] Intel Rack Scale Design Architecture. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/rack-scale-design-architecture-white-paper.pdf.
[53] Vishal Shrivastav, Asaf Valadarsky, Hitesh Ballani, Paolo Costa, Ki Suh Lee, Han Wang, Rachit Agarwal, and Hakim Weatherspoon. Shoal: A network architecture for disaggregated racks. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pages 255–270, Boston, MA, February 2019. USENIX Association.
[54] David Sidler, Zsolt István, and Gustavo Alonso. Low-latency TCP/IP stack for data center applications. In 2016 26th International Conference on Field Programmable Logic and Applications (FPL), 2016.
[55] R. Sivaram. Some measured Google flow sizes (2008). Google internal memo, available on request.
[56] Balajee Vamanan, Jahangir Hasan, and T. N. Vijaykumar. Deadline-aware datacenter TCP (D2TCP). In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, SIGCOMM '12, 2012.
[57] Dell-EMC VxRail. https://www.dellemc.com/en-us/converged-infrastructure/vxrail/index.htm.
[58] Christo Wilson, Hitesh Ballani, Thomas Karagiannis, and Ant Rowtron. Better never than late: Meeting deadlines in datacenter networks. In Proceedings of the ACM SIGCOMM 2011 Conference, SIGCOMM '11, 2011.
[59] Jackson Woodruff, Andrew W. Moore, and Noa Zilberman. Measuring burstiness in data center applications. In Proceedings of the 2019 Workshop on Buffer Sizing, BS '19, New York, NY, USA, 2019. Association for Computing Machinery.
[60] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. Congestion control for large-scale RDMA deployments. In Proceedings of the ACM SIGCOMM 2015 Conference. ACM, August 2015.
