P4-CoDel: Experiences on Programmable Data Plane Hardware

Ralf Kundel∗, Amr Rizk¶, Jeremias Blendin‡, Boris Koldehofe§, Rhaban Hark∗, Ralf Steinmetz∗
∗ Multimedia Communications Lab, Technical University of Darmstadt, Germany: {ralf.kundel, rhaban.hark, ralf.steinmetz}@kom.tu-darmstadt.de
¶ University of Ulm, Germany: [email protected]
‡ Barefoot Networks, an Intel Company, CA, US: [email protected]
§ University of Groningen, Netherlands: [email protected]

Abstract—Fixed buffer sizing in computer networks, especially the Internet, is a compromise between latency and bandwidth. A decision in favor of high bandwidth, implying larger buffers, subordinates the latency as a consequence of constantly filled buffers. This phenomenon is called Bufferbloat. Active Queue Management (AQM) algorithms such as CoDel or PIE, designed for the use on software-based hosts, offer a flow-agnostic remedy to Bufferbloat by controlling the queue filling and hence the latency through subtle packet drops. In previous work, we have shown that the data plane programming language P4 is powerful enough to implement the CoDel algorithm. While legacy software algorithms can be easily compiled onto almost any processing architecture, this is not generally true for AQM on programmable data plane hardware, i.e., programmable packet processors. In this work, we highlight corresponding challenges, demonstrate how to tackle them, and provide techniques enabling the implementation of such AQM algorithms on different high-speed P4-programmable data plane hardware targets. In addition, we provide measurement results created on different P4-programmable data plane targets. The resulting latency measurements reveal the feasibility and the constraints to be considered to perform AQM within these devices. Finally, we release the source code and instructions to reproduce the results in this paper as open source to the research community.

Index Terms—CoDel, AQM, Bufferbloat, P4, ASIC, FPGA, NPU

Fig. 1: Functional building blocks of P4-programmable ASICs. The light-red parts are P4 programmable. Packet queueing is not part of P4 and its configuration is vendor dependent. A programmable Match-Action Pipeline after the packet buffer is not given in all currently existing architectures.

I. INTRODUCTION

Bufferbloat describes a phenomenon of high latencies observed in networks configured with large static buffer sizes [1]. For a single TCP traffic flow, it is known that the buffer size that maximizes the TCP throughput is directly proportional to the round trip time (RTT). This is also true for the case of multiplexing many homogeneous flows having the same RTT, except for a correction prefactor [2].
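The buffer sizing rule behind this statement can be made explicit: for a bottleneck link of capacity C and a single long-lived TCP flow, the throughput-maximizing buffer equals the bandwidth-delay product, while Appenzeller et al. [2] show that for n homogeneous flows a fraction of it suffices:

    B = C * RTT (single flow),        B ≈ C * RTT / sqrt(n) (n homogeneous flows)

For example, with C = 100 Mbit/s and RTT = 20 ms this amounts to roughly 250 kB of buffer for one flow, but only about 25 kB for 100 multiplexed flows.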
As traffic flows usually have widely different RTTs, and throughput maximization is not the main goal for many contemporary networking applications, but rather latency minimization, the idea of controlling flow delays through subtle packet drops is experiencing a renaissance [3]. The algorithmic version of this idea is denoted Active Queue Management (AQM) and is based on the sensitive reaction of the transport protocol congestion control, typically a TCP variant, to packet drops. By dropping packets earlier than at a full buffer, the sender congestion control algorithm receives an early signal to reduce the sending data rate. In turn, this leads the buffer filling and, hence, the queueing delay to remain relatively small. In recent years, two stateful and self-tuning AQM algorithms, i.e., CoDel (RFC 8289) [4] and PIE (RFC 8033), have been presented and widely adopted. In addition to those two approaches, there exist many variants of AQM algorithms with different goals and assumptions [5].

In a previous work, we have shown that such AQM algorithms can be expressed with programming languages aimed at controlling network switch data plane behavior [6]. However, it remained open how feasible it is to realize such AQM algorithms on packet processors, as these algorithms were not conceptualized to run on data plane network hardware but rather on software-based consumer edge devices. Indeed, for many networking applications packet processing with high throughput is required and can only be ensured by algorithms realized directly within the data plane. Further, it remained open how different programmable networking hardware affects the algorithms' performance and which additional challenges arise in different architectures.

To understand the problem of running AQM algorithms on programmable data plane hardware, a deeper look into the pipeline of modern packet processors is required. The internal pipeline of packet processors, including network switches, is typically similar to the architecture depicted in Figure 1. Packets enter on one of the n ports on the left side and are multiplexed into a single match-action pipeline. Within this pipeline, operations, e.g. lookups, on packet header fields and metadata can be performed. After that, the traffic manager is responsible for packet queueing and scheduling. Optionally, depending on the architecture, a second match-action pipeline can be applied on the packets before demultiplexing and sending the packets out on the specified egress port. Note that packets can flow only from left to right and algorithmic loops within the pipeline are not possible. By that, a high and deterministic processing performance can be guaranteed, which is crucial for network functions running at line rate [7], including AQM algorithms running within the network.

In case of programmable packet processors built on top of the Protocol Independent Switch Architecture (PISA) [8], these ingress and egress match-action pipelines are programmable within limitations. The programming language P4 [9] represents the currently most widely accepted approach of programming the depicted ingress and egress match-action pipelines. However, the configuration of the packet buffer in the middle of the switch architecture (see Figure 1) as well as the specific algorithms used by this engine for queueing, scheduling and buffer management are out of the scope and purpose of this language. Nevertheless, useful metadata produced inside the traffic manager, such as the packet queueing delay, can be passed alongside other packet metadata into the egress pipeline. Alternatively, depending on the actual architecture, the current buffer utilization can be passed asynchronously to the ingress pipeline.

The work at hand focuses on the feasibility of realizing AQM algorithms on programmable hardware devices, which creates a variety of challenges compared to implementations on classical commodity processors. Here, we analyze the required alterations of an established stateful AQM algorithm (CoDel) that make it possible to implement it. Note that our key findings are not dependent on the specific AQM algorithm at hand. We evaluate P4-CoDel for P4-NetFPGA [10], Netronome SmartNICs and Intel Tofino switches.

The main contributions of this paper are:
• the analysis of CoDel's AQM implementation control flow, particularly considering information dependencies,
• how to transform an algorithm designed for CPUs for packet processing pipelines and a detailed implementation description of CoDel for existing P4 hardware,
• evaluation results of characteristic properties for these different hardware implementations,
• an open source implementation of the presented algorithm for two different hardware architectures including reproduction instructions.

II. THE CODEL AQM ALGORITHM

In the following, we use the CoDel AQM algorithm presentation given in Listing 1 to illustrate the stateful data dependencies within the algorithm. The effectiveness of the CoDel algorithm, presented in 2018 by Nichols et al. [4], will not be discussed further. Figure 2 shows the control flow graph of the algorithm that includes four stateful variables: 1) dropping, 2) count, 3) last_count and 4) drp_next. The value of a stateful variable persists beyond the processing of a single packet.

    #define TARGET 5      // ms
    #define INTERVAL 100  // ms
    Queue q; State s;

    upon receive packet p:
        // no target delay violation? (if 1)
        if (p.queue_delay < TARGET || q.byte < IFACE_MTU):
            s.dropping = false
            continue
        // first packet which violates delay target? (if 2)
        if (s.dropping == false):
            s.dropping = true
            tmp = s.count
            // drop frequency steady state? (if 3)
            if ((s.count - s.last_count > 1) &&
                (now - s.drp_next_packet < 16*INTERVAL)):
                // optimization for faster packet dropping
                s.count = s.count - s.last_count
            else:
                s.count = 1
            s.drp_next_packet = now + INTERVAL/sqrt(s.count)
            s.last_count = tmp
            continue
        // drop state for at least x time units? (if 4)
        if (s.dropping && s.drp_next_packet <= now):
            p.drop()
            s.count++
            s.drp_next_packet = now + INTERVAL/sqrt(s.count)

Listing 1: CoDel pseudocode, based on RFC 8289 [4].

The algorithm can be regarded as a state machine consisting of two states: 1) dropping and 2) not dropping. As soon as the observed queueing delay exceeds the TARGET of 5 ms, the state of the state machine is changed to dropping which, however, does not imply an immediate packet drop (if 2). After waiting for the time INTERVAL until time drp_next within the dropping state, the first following packet is dropped and the counter of dropped packets is increased (if 4). From this point on, the interval between dropped packets is continuously decreased as INTERVAL/sqrt(count) until the queueing delay falls below the TARGET delay. Then, the state changes back to not dropping (if 1). In case of a recently left dropping state, the new dropping rate is initialized higher than 1 (if 3).

The CoDel algorithm is conceptualized for sequential packet processing, i.e., no parallel processing of multiple packets. Thus, the processing of packet_n is expected to be completed before the processing of packet_n+1 starts. Otherwise, e.g., the read operation on the stateful variable drp_next for packet_n+1 is performed before the write operation on this variable by packet_n is completed and, by that, unexpected behavior occurs. Note that from an algorithmic perspective a partially overlapping processing of multiple packets is possible. If for each stateful variable the write operation is performed before the read operation of the subsequent packet, the algorithm is executed correctly.
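For illustration, the sequential control law of Listing 1 can be mirrored by a small software model. The following Python sketch is our simplified rendering of the pseudocode (time values in milliseconds); it is not the code of the released P4 implementation.

    import math

    TARGET = 5        # ms, acceptable standing queueing delay
    INTERVAL = 100    # ms, initial spacing between packet drops
    IFACE_MTU = 1500  # bytes, do not drop if less than one MTU is queued

    class CoDelState:
        """The four stateful variables of Listing 1, persisting across packets."""
        def __init__(self):
            self.dropping = False   # state machine: dropping / not dropping
            self.count = 0          # drops in the current dropping phase
            self.last_count = 0     # count value when the last phase ended
            self.drp_next = 0.0     # time of the next scheduled drop

    def on_packet(s, now, queue_delay, queue_bytes):
        """Return True if the current packet should be dropped (sequential model)."""
        # (if 1) no target delay violation: leave the dropping state
        if queue_delay < TARGET or queue_bytes < IFACE_MTU:
            s.dropping = False
            return False
        # (if 2) first packet violating the delay target: enter the dropping state
        if not s.dropping:
            s.dropping = True
            tmp = s.count
            # (if 3) re-entering dropping shortly after leaving it:
            # resume with a drop frequency higher than 1/INTERVAL
            if s.count - s.last_count > 1 and now - s.drp_next < 16 * INTERVAL:
                s.count = s.count - s.last_count
            else:
                s.count = 1
            s.drp_next = now + INTERVAL / math.sqrt(s.count)
            s.last_count = tmp
            return False
        # (if 4) in the dropping state and the scheduled drop time has passed
        if s.drp_next <= now:
            s.count += 1
            s.drp_next = now + INTERVAL / math.sqrt(s.count)
            return True
        return False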

Fig. 2: General stateful data-centric program flow of the CoDel algorithm. Read access on stateful variables is indicated by r, write operations by w.

In the concrete case of CoDel, operations on the stateful dropping variable can be considered independently of the other three state variables and, by that, executed in parallel for consecutive packets. As noticeable from Figure 2, the other three stateful variables cannot be isolated, as the operations depend on each other. Thus, all operations on these three variables must be executed as an atomic block.

The structure of programmable packet pipelines, depicted in Figure 3, does not allow cyclic data dependencies over multiple stages. This means a stateful variable of the algorithm located in a register of stage n-1 can only be read there and used for further computations in stages n-1, n and n+1. In this case, however, a read-compute-write cycle is only possible within stage n-1, as the result cannot be fed back from a later stage to stage n-1, which is then already processing the subsequent packet. As a consequence, each atomic stateful operation must be executed within one single pipeline stage, including reading and writing all required variables. We observe that other AQM algorithms such as PIE can be analyzed in a similar way in order to obtain the stateful data dependencies.

Fig. 3: Generic acyclic P4-programmable Match-Action Pipeline including packet (de)parsing. Periodically, after each pipeline clock cycle, all packets are forwarded by one stage. Backwards information flow is not possible. Each stateful information is attached to a register in one dedicated stage.

III. HARDWARE IMPLEMENTATION

In the following, we describe three efforts to migrate and run the CoDel algorithm on significantly different P4-programmable hardware platforms. Recall that P4 is designed to describe packet header processing and not queueing.

A. PISA-architecture switches

The PISA platform allows the ingress and egress pipeline, each consisting of N stages, to be programmed with P4. As the queueing delay information is only available within the egress pipeline, we decided to implement CoDel there. In addition, the traffic manager needs to be configured accordingly to a constant rate, e.g., 100 Mbit/s, in order to build up a queue.

As stated before, the algorithm must be loop- and cycle-free over different pipeline stages, and information can only flow from the left to the right as shown in Figure 3. In order to achieve this, we adapt the algorithm as shown in Listing 2. This adaptation targets the mapping of each atomic operation onto a single match-action pipeline stage. Note that Intel Tofino, following the PISA architecture, allows the P4 program to execute a small number of different operations, such as fx, on the contents of a register within a single stage using a special ALU, called stateful ALU (see Figure 3).

First, we determine whether the TARGET delay is violated or not in function f1. This operation is stateless and can be performed for each packet independently. Second, we update the dropping state and deviate if the current delay violation is not preceded by a delay violation, which means the state is changing from non-dropping to dropping. Third, the decision whether a packet should be dropped, the computation of the next drop time (drp_next) and incrementing the drop counter are performed within another stateful ALU. The output of this ALU, either 0 or 1, indicates if the current packet should be dropped if the TARGET delay is violated as well (codel_drop). The square root function can be approximated very well by the math unit within the stateful ALU. Last, the final dropping decision is performed by checking if the TARGET delay is violated (output of f1) and the second stateful ALU marks this packet to be dropped.

The mapping of these functional blocks on the PISA pipeline is depicted in Figure 4. In total, the algorithm requires four pipeline stages: 1) performing the f1 computation, 2, 3) stateful ALU 1 and 2, and 4) the final drop decision.

    #define TARGET 5      // ms
    #define INTERVAL 100  // ms
    Packet p; Queue q;

    // check for target delay violation? (f1)
    if (p.queue_delay < TARGET || q.byte < IFACE_MTU):
        delay_violation = false
    else:
        delay_violation = true
    first_violation = S_ALU1.exec(delay_violation)
    codel_drop = S_ALU2.exec(first_violation)
    if (delay_violation && codel_drop):
        p.drop()

Listing 2: Rewritten CoDel algorithm fitting into loop- and cycle-free P4-programmable PISA pipelines. This code does not contain all implementation details and just gives, together with Figure 4, an overview.

Compared to the algorithm in Listing 1, the optimization for more frequent packet dropping in the beginning, which is marked as if 3, could not be implemented on this PISA hardware. The reason for that is that the number of registers and arithmetic operations that can be performed on them by a single stateful ALU is limited in order to provide real-time guarantees and high performance up to the line rate of 100 Gbit/s per port for many ports in parallel. Furthermore, for newer hardware generations we expect even more flexible stateful ALUs and, as shown later in the evaluation section, the impact of this limitation on the performance is very limited.
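To make the stage mapping of Listing 2 and Figure 4 concrete, the following Python sketch models each stateful ALU as one atomic read-modify-write on the register(s) it owns; the function and register names are our own illustration and not part of any vendor API. As on the PISA target, the if-3 optimization of Listing 1 is omitted.

    import math

    TARGET, INTERVAL = 5, 100   # ms, as in Listing 2
    IFACE_MTU = 1500            # bytes

    # Register contents, each owned by exactly one pipeline stage.
    reg_dropping = {"dropping": False}
    reg_drop = {"count": 1, "drp_next": 0.0}

    def f1(queue_delay, queue_bytes):
        """Stateless check: is the TARGET delay violated for this packet?"""
        return not (queue_delay < TARGET or queue_bytes < IFACE_MTU)

    def s_alu1(delay_violation):
        """Atomic update of the dropping register; True only for the first
        packet of a new violation phase (non-dropping -> dropping)."""
        first_violation = delay_violation and not reg_dropping["dropping"]
        reg_dropping["dropping"] = delay_violation
        return first_violation

    def s_alu2(first_violation, now):
        """Atomic update of count and drp_next; returns the codel_drop flag."""
        if first_violation:
            # entering the dropping state: schedule the first drop one INTERVAL ahead
            reg_drop["count"] = 1
            reg_drop["drp_next"] = now + INTERVAL / math.sqrt(reg_drop["count"])
            return False
        if reg_drop["drp_next"] <= now:
            # scheduled drop time reached: drop and tighten the drop interval
            reg_drop["count"] += 1
            reg_drop["drp_next"] = now + INTERVAL / math.sqrt(reg_drop["count"])
            return True
        return False

    def egress_pipeline(now, queue_delay, queue_bytes):
        """Four conceptual stages: f1, S_ALU 1, S_ALU 2, final drop decision."""
        delay_violation = f1(queue_delay, queue_bytes)
        first_violation = s_alu1(delay_violation)
        codel_drop = s_alu2(first_violation, now)
        return delay_violation and codel_drop   # True = drop this packet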

Fig. 4: Stateful data-centric program flow of the CoDel algorithm rewritten without cyclic data dependencies, optimized for the PISA architecture. Each gray box represents a stateful ALU which guarantees atomic execution within one single pipeline stage.

Fq_CoDel: An evolution of CoDel is given by fq_codel, which can be realized on PISA platforms as well. This algorithm distributes incoming packets in the ingress pipeline over multiple queues for one single egress port. These queues are served by a (weighted) round robin scheduler, and for each queue a CoDel state is required in the egress pipeline. As the stateful ALUs provide an addressable register set instead of a single register, this can be realized as well.

B. NPU-based packet processors

The internal structure of so-called Network Processing Units (NPU) is similar to a many-core processor or GPU. Incoming packets are divided over a set of processors that execute in parallel the compiled program code of the P4-programmable ingress pipeline. After that, the packets are enqueued in per-port egress queues with a certain rate limit, e.g., 100 Mbit/s. After these queues, a processing by a P4-programmable block is not possible any more. However, as shown in Figure 1, an asynchronous feedback from the packet queues to the P4-programmable ingress pipeline is possible. Thus, the AQM has to be implemented within the ingress pipeline, and the input to the algorithm is the current delay of the queue and not the queueing delay of the current packet, which are only subtly different. We noticed that access to this information on the queue level is only possible within a microC sandbox, executed as an external P4 function, and not directly in P4. Due to further complexities while handing over data from and to this sandbox, we decided to realize the CoDel AQM as a fixed P4-external function within a sandbox.

The approximation of the square root function was realized by an exact lookup table for the values count = 1...16 and by an approximation for all other values. This approximation, already presented in [6], uses a longest prefix matcher in order to count the leading zeros of the stateful count variable.
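This table-plus-leading-zeros approximation can be sketched in software as follows: the exact table for count = 1...16 follows the description above, while larger counts are rounded down to the nearest power of two, i.e., only the position of the most significant bit (the information obtained via the longest prefix match) is used. The rounding details are our simplification and may differ from the released NPU code.

    import math

    INTERVAL = 100  # ms

    # Exact drop intervals for small counts, precomputed once (count = 1..16).
    EXACT = {c: INTERVAL / math.sqrt(c) for c in range(1, 17)}

    def drop_interval(count):
        """Approximate INTERVAL / sqrt(count) without a hardware divider or sqrt."""
        if count <= 16:
            return EXACT[count]
        msb = count.bit_length() - 1      # floor(log2(count)), via leading zeros
        approx_sqrt = 2 ** (msb / 2)      # sqrt of the nearest lower power of two
        return INTERVAL / approx_sqrt

    # Example: count = 100 -> msb = 6, approx_sqrt = 8, interval = 12.5 ms,
    # whereas the exact value would be 10 ms; CoDel tolerates this coarseness.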
On this hardware architecture we observed two main issues: First, as multiple parallel processing units are running within the SmartNIC, race conditions between multiple accesses on stateful variables can occur. Due to the internal processor architecture, there is no trivial solution to this challenge. Second, to the best of our knowledge, the queue depth can only be measured by the number of enqueued packets, which is, in case of heterogeneous packet sizes, a major constraint.

C. P4-NetFPGA

Lastly, we investigated the P4-NetFPGA architecture [10]. This architecture provides only tiny queues per egress port. In addition, these queues are not rate limited and always send packets out at the link rate, i.e., in this case 10 Gbit/s. However, we analyzed possibilities of getting CoDel to work within the P4 pipeline. As the stateful operations turned out to be challenging in this context, we decided to provide the AQM algorithm as a P4 external function to the pipeline, similar to the NPU-based SmartNIC approach given above in Sect. III-B. Such an external AQM algorithm, however, has to be designed in a low-level programming language such as Verilog or VHDL. Finally, our running CoDel prototype for P4-NetFPGA avoids all components of the P4 pipeline and all AQM-crucial components are realized as standalone modules. Consequently, our implementation would run even without a P4 framework and therefore we do not show results for P4-NetFPGA in the following.

IV. EVALUATION

The following evaluation experiments are performed using a testbed built upon the P4STA load aggregation and measurement framework [11] as depicted in Figure 6. The Device Under Test (DUT), encircled by the P4STA-Stamper for high accuracy time measurements and loss detection, represents the CoDel implementation on a P4 pipeline realized by (i) the Intel Tofino ASIC, (ii) Netronome NPU-based SmartNICs and (iii) the Linux kernel as reference implementation. All scripts for reproducing the results are available as open source software together with the CoDel source code (https://github.com/ralfkundel/p4-codel).

Considering Figure 6, each packet sent by any of the three TCP senders on the left side is timestamped and counted before entering the CoDel implementation of the DUT. The same is performed for each packet leaving on the right side to the receivers, and by that very accurate queueing delay measurements can be performed. TCP acknowledgments traverse the setup back smoothly without any packet loss.
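How queueing delay and loss are derived from the two stamping points can be illustrated with a small sketch. The record format below is purely illustrative (packet identifiers mapped to timestamps in microseconds) and does not reflect the actual P4STA output format.

    # Illustrative only: toy ingress/egress timestamp records keyed by a
    # per-packet identifier; the real P4STA stamper format differs.
    ingress = {1: 100, 2: 350, 3: 600}   # packet id -> timestamp [us]
    egress = {1: 5800, 3: 6900}          # packet 2 was dropped by the AQM

    def latency_and_loss(ingress, egress):
        """Per-packet delay across the DUT plus the observed loss ratio."""
        delays = {pid: egress[pid] - ingress[pid]
                  for pid in ingress if pid in egress}
        loss = 1 - len(egress) / len(ingress)
        return delays, loss

    delays, loss = latency_and_loss(ingress, egress)
    print(delays)          # {1: 5700, 3: 6300}
    print(f"{loss:.0%}")   # 33%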

Fig. 5: Measured latency for CoDel realizations on P4-programmable ASICs, NPU-based P4-SmartNICs, vs. the Linux kernel reference implementation for (a) one and (b) three parallel TCP flows (cf. Fig. 6 for measurement setup).

Fig. 6: Data plane testbed setup based on the P4STA load aggregation framework [11]. Each server can create multiple parallel TCP flows. Each server can be assigned an additional link latency for heterogeneous RTTs.

Fig. 7: Average latency for isochronous shaped UDP traffic (no congestion control) with a varying rate and a CoDel rate limit of 100 Mbit/s.

In a first run, a single TCP sender and receiver, using iPerf3, transfer as much data as possible over the bottleneck link. The results for the three investigated CoDel AQM DUTs are shown in Figure 5a. In all three scenarios an initial burst of packets fills the buffer to a latency slightly above 10 ms until CoDel reacts and reduces the latency by subtle packet drops, which results in a reduced TCP sending rate. By that, the queueing delay falls periodically below the configured TARGET delay of 5 ms. On the rising edge one may notice an oscillation of the latency. This behavior can be explained by a microburstiness of the TCP sender [12]. In Figure 5b we consider three senders in parallel where this pattern becomes noisy. For the NPU-based Netronome SmartNICs this noise is slightly higher; this could be caused by the internal many-core processor architecture, however, we could neither confirm nor refute this.

In addition to considering only one single TCP flow, we will focus next on the case of multiplexing multiple flows. Table I shows the number of dropped packets by the CoDel implementation and the average latency for 1, 2 and 3 parallel TCP flows. First, we notice that the number of dropped packets is strongly increasing with the number of parallel TCP flows. This is caused by the fact that CoDel drops only one single packet and by that the corresponding TCP flow is reducing its sending rate, but the other flows benefit from the released free link capacity. As a consequence, an AQM in general has to drop more packets in order to control all flows and, by that, the latency. Second, the average latency increase on the P4-ASIC when going from 1 to 3 TCP flows is significant. This can be explained by the missing optimization (if 3) for this P4 target. In Figure 5b we show the latency for the three TCP flows where we observe a similar behavior for the P4-ASIC and the Linux kernel. The results for the NPU-based SmartNICs suggest a TCP flow synchronization which is usually avoided by the CoDel algorithm. Lastly, note that the drop rate of the Linux kernel is slightly lower. The reason behind this could be the difference between the kernel implementation and the RFC implementation we build upon.

             Linux            P4-NPU           P4-ASIC
  #flows   loss    latency   loss    latency   loss    latency
  1        0.05 %  5.7 ms    0.17 %  5.51 ms   0.16 %  5.59 ms
  2        0.17 %  6.5 ms    0.24 %  6.25 ms   0.36 %  6.86 ms
  3        0.33 %  7.2 ms    0.45 %  6.42 ms   0.44 %  8.17 ms

TABLE I: Observed packet drops and latency for the different investigated targets. Each run is 4 s with a rate of 100 Mbit/s and 33 * 10^3 packets.

Figure 7 depicts the observed average latency (mainly due to CoDel processing on different hardware) for a rate scan with shaped constant-rate UDP streams. Note that the CoDel shaping rate is 100 Mbit/s, i.e., the latency is expected to start increasing around that point due to queue filling within the DUT. The Linux implementation shows an increased average latency for small data rates due to the Linux kernel interrupt behavior. The latency of the Netronome SmartNICs starts increasing at 97 Mbit/s; however, a persistent queue buildup starts, as expected, at rates higher than the CoDel shaping rate.

V. RELATED WORK

Programmable queueing and scheduling algorithms have been discussed, e.g., in Sharma et al. [13], who proposed in 2018 a queueing mechanism, called "Approximate Fair Queueing", prioritizing packets in the programmable data plane in order to achieve shorter flow completion times. In a follow-up work they proposed the idea of programmable queues based on the construct of calendar queues [14], which provide high flexibility in programming schedulers. However, this approach relies on queues which can be re-parameterized from within the data plane, which is not supported by existing switches.

Another approach of programming queues within switches is the approximation of Push-In-First-Out (PIFO) queues by conventional FIFO queues. Alcoz et al. [15] presented a P4-based implementation, called "SP-PIFO", of this approach running in existing data planes with a behavior that is very close to ideal PIFO queues. The authors of "ConQuest" [16] tackled the impact of short-lived packet surges by monitoring the network traffic from within the data plane and applying queue management. Specifically, similar to our work, they present a prototype that is based on a P4-programmable Intel Tofino switch which is able to identify these surges and avoid congestion by early marking or dropping certain packets.
In a previous work we have shown the feasibility of the CoDel AQM algorithm in the programming language P4 [6]. Based on that, Papagianni et al. have introduced "PI2 for P4" [17], which provides a P4 implementation of the AQM algorithm PI2 [18] for programmable data planes.

VI. CONCLUSION

Active Queue Management (AQM) algorithms enable high throughput and low latency simultaneously and provide a practical solution to the Bufferbloat phenomenon. In this work, we have shown how recent self-tuning AQM algorithms can be implemented on P4-programmable hardware without built-in queue management capabilities, achieving much better performance than in software systems. Further, we discussed different challenges that arise depending on the chosen hardware platform to run the AQM algorithm. While the programming language P4 itself is quite powerful, writing compilable and correctly behaving AQM algorithms leads to challenges due to hardware constraints. With this in mind, we note that correct hardware implementations are far superior in terms of speed and deterministic behavior to software implementations.

We investigated multiple P4 target platforms with quite diverse internal architectures: The PISA architecture, concretely the Intel Tofino ASIC, is very restrictive on the one hand, but turned out to be the most powerful one in terms of performance on the other hand. The investigated NPU-based SmartNICs can be programmed in a very flexible way but make it easy to introduce race conditions within the developed AQM program. Further, the P4-NetFPGA architecture does not support large packet buffers and hinders a successful implementation from an algorithmic point of view. Only by strong modifications outside the P4 pipeline is this architecture able to perform AQM, subverting the idea of this work.

Finally, our results show that programmable data plane hardware is not only suitable for header processing but also for flexible queue management. We noticed that advanced queue management algorithms require stateful processing capabilities in the underlying hardware. To enable a seamless integration of such algorithms, we hope with this paper to start a discussion on AQM interface definitions for P4-programmable pipelines.

ACKNOWLEDGMENT

This work has been supported by Deutsche Telekom through the Dynamic Networks 8 project, and in parts by the German Research Foundation (DFG) as part of the projects B4 and C2 within the Collaborative Research Center (CRC) 1053 MAKI as well as the DFG project SPINE. Furthermore, we thank our colleagues and reviewers for their valuable input and feedback.

REFERENCES

[1] J. Gettys and K. Nichols, "Bufferbloat: dark buffers in the internet," Communications of the ACM, vol. 55, no. 1, pp. 57–65, 2012.
[2] G. Appenzeller, I. Keslassy, and N. McKeown, "Sizing router buffers," in Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM). ACM, 2004, pp. 281–292.
[3] S. Floyd and V. Jacobson, "Random early detection gateways for congestion avoidance," IEEE/ACM Transactions on Networking, vol. 1, no. 4, pp. 397–413, 1993.
[4] K. Nichols, V. Jacobson, A. McGregor, and A. Iyengar, "Controlled Delay Active Queue Management," IETF, RFC 8289, 2018.
[5] A. Sivaraman, K. Winstein, S. Subramanian, and H. Balakrishnan, "No silver bullet: extending SDN to the data plane," in Proceedings of the Twelfth ACM Workshop on Hot Topics in Networks. ACM, 2013, p. 19.
[6] R. Kundel, J. Blendin, T. Viernickel, B. Koldehofe, and R. Steinmetz, "P4-CoDel: Active queue management in programmable data planes," in Proceedings of the Conference on Network Function Virtualization and Software Defined Networks (NFV-SDN). IEEE, 2018, pp. 1–4.
[7] R. Kundel, L. Nobach, J. Blendin, W. Maas, A. Zimber, H.-J. Kolbe, G. Schyguda, V. Gurevich, R. Hark, B. Koldehofe, and R. Steinmetz, "OpenBNG: Central office network functions on programmable data plane hardware," International Journal of Network Management, 2021.
[8] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz, "Forwarding metamorphosis: Fast programmable match-action processing in hardware for SDN," in ACM SIGCOMM Computer Communication Review, vol. 43, no. 4, 2013.
[9] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese et al., "P4: Programming protocol-independent packet processors," ACM SIGCOMM Computer Communication Review, vol. 44, no. 3, pp. 87–95, 2014.
[10] S. Ibanez, G. Brebner, N. McKeown, and N. Zilberman, "The P4→NetFPGA workflow for line-rate packet processing," in Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2019, pp. 1–9.
[11] R. Kundel, F. Siegmund, J. Blendin, A. Rizk, and B. Koldehofe, "P4STA: High performance packet timestamping with programmable packet processors," in Network Operations and Management Symposium (NOMS). IEEE/IFIP, 2020, pp. 1–9.
[12] R. Kundel, A. Rizk, and B. Koldehofe, "Microbursts in software and hardware-based traffic load generation," in Network Operations and Management Symposium (NOMS). IEEE/IFIP, 2020, pp. 1–2.
[13] N. K. Sharma, M. Liu, K. Atreya, and A. Krishnamurthy, "Approximating fair queueing on reconfigurable switches," in Symposium on Networked Systems Design and Implementation (NSDI). USENIX, 2018.
[14] N. K. Sharma, C. Zhao, M. Liu, P. G. Kannan, C. Kim, A. Krishnamurthy, and A. Sivaraman, "Programmable calendar queues for high-speed packet scheduling," in 17th Symposium on Networked Systems Design and Implementation (NSDI). USENIX, 2020, pp. 685–699.
[15] A. G. Alcoz, A. Dietmüller, and L. Vanbever, "SP-PIFO: Approximating push-in first-out behaviors using strict-priority queues," in 17th Symposium on Networked Systems Design and Implementation (NSDI). USENIX, 2020, pp. 59–76.
[16] X. Chen, S. L. Feibish, Y. Koral, J. Rexford, O. Rottenstreich, S. A. Monetti, and T.-Y. Wang, "Fine-grained queue measurement in the data plane," in Proceedings of the International Conference on Emerging Networking EXperiments and Technologies (CoNEXT). ACM, 2019, pp. 15–29.
[17] C. Papagianni and K. De Schepper, "PI2 for P4: An active queue management scheme for programmable data planes," in Proceedings of the 15th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT). ACM, 2019, pp. 84–86.
[18] K. De Schepper, O. Bondarenko, I.-J. Tsang, and B. Briscoe, "PI2: A linearized AQM for both classic and scalable TCP," in Proceedings of the 12th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT). ACM, 2016, pp. 105–119.