Detecting Fair Queuing for Better Congestion Control

Maximilian Bachl, Joachim Fabini, Tanja Zseby
Technische Universität Wien
fi[email protected]

arXiv:2010.08362v3 [cs.NI] 19 Feb 2021

Abstract—Low delay is an explicit requirement for applications such as cloud gaming and video conferencing. Delay-based congestion control can achieve the same throughput but significantly smaller delay than loss-based one and is thus ideal for these applications. However, when a delay- and a loss-based flow compete for a bottleneck, the loss-based one can monopolize all the bandwidth and starve the delay-based one. Fair queuing at the bottleneck link solves this problem by assigning an equal share of the available bandwidth to each flow. However, so far no end-host-based algorithm to detect fair queuing exists. Our contribution is the development of an algorithm that detects fair queuing at flow startup and chooses delay-based congestion control if there is fair queuing. Otherwise, loss-based congestion control can be used as a backup option. Results show that our algorithm reliably detects fair queuing and can achieve low delay and high throughput in case fair queuing is detected.

I. INTRODUCTION

Emerging applications such as cloud gaming [1] and remote virtual reality [2] applications require high throughput and low delay. Furthermore, ultra-low-latency communications have been made a priority for 5G [3]. We argue that for achieving high throughput as well as low delay, congestion control must be taken into account.

Delay-based Congestion Control Algorithms (CCAs) were proposed to provide high throughput and low delay for network flows. They use the queuing delay as an indication of congestion and lower their throughput as soon as the delay increases. Such approaches have been proposed in the last century [4] and recently have seen a surge in popularity [5, 6, 7, 8]. While delay-based CCAs fulfill their goal of low delay and high throughput, they cannot handle competing flows with different CCAs well [9, 10]. This is especially prevalent when they compete against loss-based CCAs, the latter being more “aggressive”. While the delay-based CCAs back off as soon as queuing delay increases, the loss-based ones continue increasing their throughput until the buffer is full and packet loss occurs. This results in the loss-based flows monopolizing the available bandwidth and starving the delay-based ones [11, 12, 13].

This unfairness can be mitigated by using Fair Queuing (FQ) at the bottleneck link, isolating each flow from all other flows and assigning each flow an equal share of bandwidth [14]. Consequently, loss-based flows can no longer acquire bandwidth that delay-based flows have released due to their network-friendly behavior. In this case, delay-based CCAs achieve their goal of high bandwidth and low delay.

While FQ solves many problems regarding Congestion Control (CC), it is still not ubiquitously deployed. Flows can significantly benefit from knowledge of FQ-enabled bottleneck links on the path to dynamically adapt their CC. In the case of FQ they can use a delay-based CCA and otherwise revert to a loss-based one. Our contribution is the design and evaluation of such a mechanism that determines the presence of FQ during the startup phase of a flow’s CCA and sets the CCA accordingly at the end of the startup.

We show that this solution reliably determines the presence of fair queuing and that by using this approach, queuing delay can be considerably lowered while keeping throughput high if FQ is detected. In case no FQ is enabled, our algorithm detects this as well and a loss-based CC is used, which does not starve when facing competition from other loss-based flows.

While our mechanism is beneficial for network flows in general, we argue that it is most useful for long-running delay-sensitive flows, such as cloud gaming, video conferencing etc. To make our work reproducible and to encourage further research, we make the code, the results and the figures publicly available¹.

¹https://github.com/CN-TU/PCC-Uspace

II. RELATED WORK

Our work depends on active measurements for the startup phase of CC and proposes a new measurement technique to detect Active Queue Management (AQM). This section discusses preliminary work related to these fields.

A. CC using Active Measurements

Several approaches have been recently proposed which aim to incorporate active measurements into CCAs. [15, 16] introduced the CCA PCC, which uses active measurements to find the sending rate which optimizes a given reward function. [8] introduce the BBR CCA, which uses active measurements to create a model of the network connection. This model is then used to determine the optimal sending rate. An approach proposed by [17] aims to differentiate bulk transfer flows, which they deem “elastic” because they try to grab as much bandwidth as available, from other flows that send at a constant rate, which they call “unelastic”. They use this elasticity detection to determine if delay-based CC can be used or not. If cross traffic is deemed elastic then they use classic loss-based CC and otherwise delay-based CC.

B. Flow startup techniques

Besides the classical TCP slow start algorithm [18], some other proposals were made, such as Quick Start [19], which aims to speed up flow start by routers informing the end point which rate they deem appropriate. [20] investigate the impact of AQM and CC during flow startup. They inspect a variety of different AQM algorithms and also investigate FQ. [21] propose “chirping” for flow startup to estimate the available bandwidth. This works by sending packets with different gaps between them to verify what the actual link speed is. [22] propose speeding up flow startup by using packets of different priority. The link is flooded with low-priority packets and in case the link is not fully utilized during startup, these low-priority packets are going to be transmitted, allowing the starting flow to achieve high throughput. If the link is already saturated, the low-priority packets are going to be discarded and no other flow suffers.

C. Detecting AQM mechanisms

There are few publications that investigate methods to detect the presence of an AQM. [23] propose a method to detect and distinguish certain popular AQM mechanisms. However, unlike our work, it doesn’t consider FQ. [24] propose a machine learning based approach to fingerprint AQM algorithms. FQ is not considered either.

D. Detecting the presence of a shared bottleneck

Several approaches have been proposed to detect whether, in a set of flows, some of them share a common bottleneck [25, 26, 27]. These approaches typically operate by comparing some statistics of flows to check whether they correlate between flows. For example, it can be checked if throughput correlates between two flows. If throughput correlates, it can mean that these flows share a common bottleneck link: if additional flows join at the bottleneck link, all other flows’ throughputs go down, which is the reason why they are correlated. While approaches to detect a shared bottleneck link appear to be similar to what this paper is aiming to do, there are differences: Two flows can share a common bottleneck link but the bottleneck link can have fair queuing. Thus the aims of shared bottleneck and fair queuing detection are different.

III. CONCEPT

The overall concept is the following:
1) During flow startup, determine whether the bottleneck link deploys FQ or not.
2) If FQ is used, change to a delay-based CCA, or revert to a loss-based CCA otherwise.

A. Algorithm to determine the presence of FQ

The algorithm to determine the presence of FQ works as follows (seen from the sender’s point of view):
1) Launch two concurrent flows, flow 1 and flow 2, from the same source host to the same destination host. Initialize flow 1 with some starting sending rate and flow 2 with two times that rate.
2) After every Round Trip Time (RTT), if no packet loss occurred, double the sending rate of both flows.
3) If packet loss occurred in both flows in the previous RTT, calculate the following metric:

   loss ratio = (receiving rate of flow 1 / sending rate of flow 1) / (receiving rate of flow 2 / sending rate of flow 2)   (1)

   This loss ratio metric indicates if flow 2, which has a higher sending rate, achieves a higher receiving rate (goodput) than flow 1. If this is the case, there’s no FQ; otherwise there is. It is necessary to wait for packet loss in both connections because only then it is certain that the link is saturated; our algorithm assumes that the link is saturated.
4) If FQ is used, the loss ratio is 2, otherwise it is 1. We choose 1.5 as the cutoff value.
5) Since the presence or absence of FQ is now known, the appropriate CCA can be launched.

(a) Sending rate. (b) Receiving rate (no FQ). (c) Receiving rate (FQ).
Fig. 1: An example illustrating our proposed flow startup mechanism. Figure 1a shows the sending rate, 1b the receiving rate in case there’s no FQ and 1c the receiving rate if there is FQ.

Figures 1a, 1b and 1c schematically illustrate this mechanism. We assume a sender being connected to a receiver, with a bottleneck link between them. We also assume that all flows between the source and the destination follow the same route. On this bottleneck link, the queue is either managed by FQ or by a shared queue (no FQ).

Figure 1a shows how the sending rate (observed before the bottleneck link) is doubled after each RTT. Furthermore, it shows that flow 2 has double the sending rate of flow 1. Figure 1b shows the receiving rate at the receiver after the bottleneck link, if this bottleneck link has a shared queue without FQ. In the third RTT the sending rate exceeds the link speed of the bottleneck and thus the receiving rate is smaller than the sending rate. Because no FQ is employed, flow 2 manages to have a receiving rate two times the one of flow 1. Figure 1c shows the receiving rate of the two flows measured after the bottleneck link in case FQ is deployed. In the third RTT, flow 1 and flow 2 achieve the same receiving rate because of the FQ, which makes sure that each flow gets an equal share of the total capacity.
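As an illustration, the decision in steps 3) to 5) above can be sketched as follows (a minimal sketch; the 1.5 cutoff and the ratio follow Equation (1), while the function and parameter names are our own):

```python
def detect_fq(send_rate_1, recv_rate_1, send_rate_2, recv_rate_2, cutoff=1.5):
    """Decide whether the bottleneck uses fair queuing.

    Called once packet loss has occurred in both flows, i.e. once the
    link is saturated. Flow 2 was started with twice the sending rate
    of flow 1. Implements Equation (1): with FQ the loss ratio is
    close to 2, without FQ close to 1, so 1.5 serves as the cutoff.
    """
    loss_ratio = (recv_rate_1 / send_rate_1) / (recv_rate_2 / send_rate_2)
    return loss_ratio >= cutoff

# Shared queue on a 15 Mbit/s link: receiving rates stay
# proportional to the sending rates, so the ratio is 1.
assert detect_fq(10e6, 5e6, 20e6, 10e6) is False
# FQ on the same link: both flows are capped at an equal
# share (7.5 Mbit/s), so the ratio is 2.
assert detect_fq(10e6, 7.5e6, 20e6, 7.5e6) is True
```

The example numbers correspond to the third RTT of Figure 1, where the combined sending rate first exceeds the link speed.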

B. Simple delay-based CC

If FQ is detected, a simple delay-based CC is used from then on. Our algorithm is based on the two following simple rules:
1) If current RTT ≤ smallest RTT ever measured on the connection + 5 ms, then increase the sending rate by 1%. We choose 5 ms since a delay of 5 ms is barely noticeable by humans. This design choice was first established in [28].
2) Otherwise (current RTT > smallest RTT ever measured on the connection + 5 ms), decrease the sending rate by 5%.

This CCA is meant as a simple proof-of-concept to demonstrate the efficacy of our FQ detection mechanism but we do not claim it to be optimal in any way.

C. Fallback loss-based CC

If the absence of FQ is detected, a loss-based CC is used as a fallback. We use the algorithm PCC proposed by [15]. In simplified terms, this CCA aims to maximize a utility function, which strives for high throughput and low packet loss. However, it is reasonably aggressive and does not starve when competing with other loss-based CCAs such as Cubic [29].

IV. IMPLEMENTATION

We base our implementation on the code of PCC Vivace [16], which in turn is an improved version of PCC. The PCC code is based on the UDT library [30], which is a library that provides a reliable transport protocol similar to TCP on top of UDP. This approach is similar to QUIC [31]. An advantage of implementing our approach on top of UDP as opposed to TCP is that it is easier to develop and modify, since no kernel module needs to be created, and that it is more secure, since the code is isolated in a user space process and most attacks can only influence this process and not the entire kernel.

A. Deployment

Our approach of detecting FQ relies on using two concurrent flows between the sender and the receiver. During flow startup, user data is transmitted in two flows that the receiver must detect and reassemble. This may be a challenge for deployments since solutions that require cooperation of several instances are more difficult to deploy at a wide scale in the Internet. However, multipath TCP [32] and QUIC, for which there are proposals to enable recombining data of several flows [33], are becoming increasingly prevalent. We thus argue that the actual obstacles to deployment of our solution are constantly decreasing since QUIC (mostly Google traffic) already accounts for around 10% of all Internet traffic [34] and is growing. Moreover, Apple’s macOS and iOS support multipath TCP [35].

V. EXPERIMENT SETUP

For the evaluation we use the network emulation library py-virtnet². All experiments run on virtual hosts on one physical host, on which the necessary network components and virtual hosts are emulated using Linux network namespaces. The setup consists of two virtual hosts connected through a virtual switch. One virtual host acts as a server and one as a client and a bulk transfer is initiated from the client to the server.

²https://pypi.org/project/py-virtnet/

For the experiments with FQ we use the fq module for Linux. This module keeps one drop-tail queue for each flow with a configurable maximum size of packets. However, the design of our FQ detection algorithm does not assume any particular queue management algorithm being used for the queue of each flow and thus is also compatible with more sophisticated AQM mechanisms which were explicitly developed to be used in conjunction with FQ [36, 37, 38, 39]. It only assumes that each flow gets the same share of bandwidth and that for a flow, whose queue is growing faster because more packets of this flow arrive at the bottleneck, packets are dropped faster than for a flow of which fewer packets appear at the bottleneck link. For the experiments with a shared queue we use pfifo, which is a simple drop-tail queue which can contain up to a fixed number of packets.

VI. RESULTS

A. Delay- vs. loss-based CC

Fig. 2: Throughput and delay of a flow controlled by the loss-based CCA Cubic on a link with a speed of 10 Mbps, a delay of 10 ms and a buffer size of 100 packets (pfifo’s default value). (a) Throughput (receiving rate). (b) RTT.

First, we demonstrate empirically that delay-based CC actually has a benefit over loss-based one. On one side, Figure 2 shows that the loss-based CC achieves a throughput of almost 10 Mbps on a 10 Mbps link, which means full utilization, and an RTT that is higher than 100 ms, which is not desirable. On the other side, Figure 3 shows an example flow using our simple delay-based CCA and a slow start algorithm similar to the one used by the classic TCP CCA New Reno [40]. While the delay-based flow also achieves very high utilization, delay rarely exceeds 20 ms, meaning that it is more than 5 times lower than the one of the loss-based CCA of Figure 2.
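The two rate-update rules of Section III-B, which produce the low-delay behavior of this flow, can be sketched as follows (a minimal sketch; the function and variable names are our own, while the 5 ms threshold and the 1%/5% adjustments are taken from Section III-B):

```python
def update_sending_rate(rate, current_rtt, min_rtt, threshold=0.005):
    """One control step of the simple delay-based CCA (Section III-B).

    min_rtt is the smallest RTT ever measured on the connection (in
    seconds). If the current RTT is within 5 ms of it, the queue is
    short, so the rate is increased by 1%; otherwise queuing delay is
    building up and the rate is decreased by 5%.
    """
    if current_rtt <= min_rtt + threshold:
        return rate * 1.01  # low queuing delay: probe for more bandwidth
    return rate * 0.95      # delay too high: back off

# Example: a 10 Mbit/s flow with a 10 ms base RTT.
rate = update_sending_rate(10e6, current_rtt=0.012, min_rtt=0.010)
assert rate == 10e6 * 1.01  # within 5 ms of the minimum: increase
rate2 = update_sending_rate(rate, current_rtt=0.030, min_rtt=0.010)
assert rate2 == rate * 0.95  # queuing delay detected: decrease
```

Because the decrease step is only 5%, this controller drains queues gently instead of halving its rate as classic loss-based CCAs do.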

Fig. 3: Throughput and delay of a flow controlled by our delay-based CCA on a link with a speed of 10 Mbps, a delay of 10 ms and a buffer size of 100 packets. (a) Throughput (receiving rate). (b) RTT.

The initial peak in Figure 3 is caused by the exponential increase of the sending rate in the beginning, which is commonly used by transport protocols and called “slow start” [18]. Our delay-based CC also uses a mechanism similar to slow start in the beginning.

B. Comparing flows with different Congestion Controls and AQM mechanisms

In this scenario, we show an example of a flow with our mechanism that shares a bottleneck link with a flow using a loss-based congestion control.

1) No FQ: The examples in Figure 4 and in Figure 5 show the loss-based CCA Cubic competing against our delay-based CCA on a link without FQ. It is important to note that we do not use our FQ detection mechanism here but instead choose delay-based CC from the beginning to demonstrate the starvation of delay-based flows that happens when they compete against loss-based ones without FQ. We start the loss-based flow first to make sure that it has sufficient time to fill up the link: the loss-based flow starts 5 seconds earlier and after 5 seconds, the delay-based flow joins. We choose these 5 seconds so that the first flow can enter its stable state, during which it has the link all for itself: we want to show how the delay-based flow loses to the loss-based CC. While the delay-based flow gains some share of the link during its startup, it is pushed away by the loss-based flow later on and “starves”.

Fig. 4: Throughput and delay of a flow controlled by the loss-based CCA Cubic on a link with a speed of 50 Mbps, a delay of 10 ms and a buffer size of 100 packets. The bottleneck is controlled by a shared queue (no FQ) and is shared with the flow of Figure 5. (a) Throughput (receiving rate). (b) RTT.

Fig. 5: Throughput and delay of a flow controlled by our delay-based CCA on a link with a speed of 50 Mbps, a delay of 10 ms and a buffer size of 100 packets. The bottleneck is controlled by a shared queue (no FQ) and is shared with the flow of Figure 4. (a) Throughput (receiving rate). (b) RTT.

2) FQ: For a link with fair queuing, the following example results show that throughput is identical for our flow (Figure 7) and the Cubic flow (Figure 6), just that the Cubic flow’s delay is more than two times higher than the delay-based one’s. Thus, our mechanism detects that the bottleneck uses FQ and uses a delay-based congestion control because it is better suited in a scenario with FQ.

C. Accuracy of the FQ detection mechanism

To evaluate if our proposed mechanism correctly recognizes FQ in a systematic manner, we perform experiments using a wide range of network configurations: we vary bandwidth from 5 to 50 Mbps, delay from 10 to 100 ms and the buffer size from 1 to 100 packets, using a grid of these parameters. For each parameter, we take 5 values, evenly spaced. Thus, we run experiments with 125 different configurations. We run each configuration once with FQ and once without. Results show that for experiments without FQ (using a shared queue), in over 99% of cases, the absence of FQ was correctly detected and in less than 1% of cases, our algorithm detected FQ even though there was no FQ. For experiments during which FQ was actually deployed on the bottleneck link, this was correctly detected in 97% of cases and wrongly in 3% of the time. The overall accuracy is thus 98%, which we consider sufficient.

To verify the behavior of our algorithm under the presence of cross traffic, we repeated the above experiments but with one bulk traffic flow using Cubic running at the same time, which we start 5 seconds before so that it has sufficient time to gain full bandwidth. Also under these circumstances we achieve 98% detection accuracy using our algorithm.

We investigated the reason for the detection accuracy not being 100% and found the following: The PCC code relies on packet scheduling using timers, which is very computationally expensive because it needs a timer for every packet and thus the program has to be put to sleep/woken up repeatedly by the operating system.

Thus, if processor load is high, the timers fail and PCC can’t send as many packets as it would want to. Our algorithm relies on flow 1 sending at half the rate of flow 2, but this cannot always be guaranteed if CPU load is high. We tried to minimize all tasks which run in the background and achieved the accuracy we got. When we ran the experiments while we were browsing the web, we got lower accuracy. Solutions to the problem are to not schedule packets using timers or to implement the algorithm not in user space but in kernel space. Then the computational cost could decrease because fewer context switches between the PCC user space application and the kernel are necessary.

D. Systematic evaluation

Besides the examples we provided, we also conducted a more systematic comparison of the throughput and delay that our algorithm achieves: we use the same scenario as above and again vary bandwidth from 5 to 50 Mbps, delay from 10 to 100 ms and the buffer size from 1 to 100 packets. For each parameter, we take 5 values, evenly spaced, totaling 125 distinct scenarios. We use FQ at the bottleneck link. We run each scenario both using Cubic as well as our algorithm for 30 seconds each. We average the achieved throughputs and average delays from all experiments for both Cubic as well as our algorithm. For Cubic, the mean throughput is 21 Mbit/s and the mean RTT is 96 ms while for our algorithm it is 26 Mbit/s and 64 ms. This shows that our algorithm can correctly determine that there is FQ and switch to the delay-based CC and that our algorithm achieves significantly lower delay (loss-based Cubic’s delay is 50% higher) and, surprisingly, higher throughput. This is astonishing considering that loss-based CCAs such as Cubic should be more aggressive in filling the link. When analyzing the results we identified that Cubic performs very poorly in scenarios with small buffers because the multiplicative decrease of Cubic is 0.7, meaning that the bandwidth is reduced by 30% after every packet loss, while our algorithm reduces it by only 5%. Thus Cubic reduces bandwidth too much after packet loss, which results in underutilization of the link.

Fig. 6: Throughput and delay of a flow controlled by the loss-based CCA Cubic on a link with a speed of 50 Mbps, a delay of 10 ms and a buffer size of 100 packets. The bottleneck is controlled by FQ and is shared with the flow of Figure 7. (a) Throughput (receiving rate). (b) RTT.

Fig. 7: Throughput and delay of a flow controlled by our delay-based CCA on a link with a speed of 50 Mbps, a delay of 10 ms and a buffer size of 100 packets. The bottleneck is controlled by FQ and is shared with the flow of Figure 6. (a) Throughput (receiving rate). (b) RTT.

VII. DISCUSSION

During the implementation of our algorithm we noticed peculiar behavior when not using FQ: During startup, we make flow 1 have half the sending rate of flow 2. We would expect the receiving rates to mirror the sending rates: flow 1 achieving around half the rate of flow 2. However, we noticed that usually flow 1 would have a receiving rate of 0. After investigating the issue we noticed that the reason was pacing: PCC (and other CCAs such as BBR) send packets in regular intervals, for example every millisecond, because sending regularly keeps the queues shorter. This behavior is called pacing. However, when there are two flows where one flow’s sending rate is a multiple of the other’s, an interleaving effect occurs: flow 1 would always send each packet slightly after flow 2 sends a packet. Then, when the packets arrive at the bottleneck, flow 2’s packet, which arrives a bit earlier, fills up the buffer and flow 1’s packet is consequently always dropped. As a solution, we do not send packets at regular intervals during startup but always add a small random value: For example, if the sending interval were 1 ms, we would send the packet randomly in the interval from 0.5 to 1.5 ms. This solves the aforementioned problem. The insight we gained from this is that while pacing is generally beneficial, it is actually harmful if several flows compete at a bottleneck without FQ and one flow’s sending rate is a multiple of the other’s.

One problem of our proposed solution occurs when the bottleneck link changes while a flow is running. For example, the bottleneck link could be the home router in the beginning, while later the bottleneck link could be another router deeper in the core network. The problem occurs if the home router supports FQ and our algorithm detects this while the router in the core network, which becomes the bottleneck later on, doesn’t support FQ. Then, our algorithm is going to use a delay-based CC because that’s what it determined to be correct in the beginning of the flow. However, later when the bottleneck changes, our algorithm is still using delay-based CC even though the current bottleneck doesn’t support FQ anymore. As a result, it might be possible that our algorithm uses the wrong CC. One way to prevent this problem from occurring is to use periodic measurements. For example, our algorithm to determine FQ could be used every 5 or 10 seconds to measure again if the current bottleneck supports FQ.

Another scenario that is worth noting is how our proposed solution deals with load balancing: In this research we assume that both of our flows, which we use during startup to determine the presence of FQ, take the same path to the host on the other side. However, with load balancing it could happen that flow 1 takes a different path than flow 2. Fortunately, several load balancing algorithms have been proposed whose aim is to make sure that connections which use multiple paths using Multipath TCP or QUIC stay on the same path [41, 42, 43]. Since we envision our solution to be based on Multipath TCP or QUIC, load balancing should not be a problem. Apple Music, for example, successfully load balances multipath connections [44].

An alternative to our algorithm could be that all routers inform the endpoints if they support FQ or not. Such a solution could, for example, use an IP or a TCP extension, which every router uses to indicate if it is capable of FQ. Then, if every device on a path supports FQ, the client would use a delay-based CCA. However, while this solution would achieve good results, the problem is that solutions which require participation of every device in the Internet generally suffer from insufficient deployment, because not every hardware vendor and network operator can be forced to implement the proposed solution for informing about FQ capability.

Concluding, we argue that our new flow startup method can help to increase the deployment of delay-based CC. Especially for applications that are very delay-sensitive, such as video conferencing and cloud gaming, the adoption of delay-based CC can result in a significant increase of Quality of Experience (QoE). We demonstrated that our method works using a prototype implementation based on PCC but we are also confident that a similar flow startup method can be integrated with other CCAs such as BBR.

REFERENCES

[1] M. Jarschel, D. Schlosser, S. Scheuring, and T. Hoßfeld, “An Evaluation of QoE in Cloud Gaming Based on Subjective Tests,” in IMIS, June 2011.
[2] M. S. Elbamby, C. Perfecto, M. Bennis, and K. Doppler, “Toward Low-Latency and Ultra-Reliable Virtual Reality,” IEEE Network, 2018.
[3] Z. Li, M. A. Uusitalo, H. Shariatmadari, and B. Singh, “5G URLLC: Design Challenges and System Concepts,” in ISWCS, 2018.
[4] L. S. Brakmo and L. L. Peterson, “TCP Vegas: End to end congestion avoidance on a global Internet,” IEEE Journal on Selected Areas in Communications, vol. 13, no. 8, pp. 1465–1480, 1995.
[5] V. Arun and H. Balakrishnan, “Copa: Practical delay-based congestion control for the internet,” in 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pp. 329–342, 2018.
[6] M. Hock, F. Neumeister, M. Zitterbart, and R. Bless, “TCP LoLa: Congestion Control for Low Latencies and High Throughput,” in LCN, Oct. 2017.
[7] R. Mittal, V. T. Lam, N. Dukkipati, E. Blem, H. Wassel, M. Ghobadi, A. Vahdat, Y. Wang, D. Wetherall, and D. Zats, “TIMELY: RTT-based Congestion Control for the Datacenter,” ACM SIGCOMM Computer Communication Review, vol. 45, pp. 537–550, Aug. 2015.
[8] N. Cardwell, Y. Cheng, C. S. Gunn, S. H. Yeganeh, and V. Jacobson, “BBR: Congestion-Based Congestion Control,” ACM Queue, 2016.
[9] B. Turkovic, F. A. Kuipers, and S. Uhlig, “Fifty Shades of Congestion Control: A Performance and Interactions Evaluation,” arXiv:1903.03852 [cs], 2019.
[10] B. Turkovic, F. A. Kuipers, and S. Uhlig, “Interactions between Congestion Control Algorithms,” in TMA, 2019.
[11] M. Hock, R. Bless, and M. Zitterbart, “Toward Coexistence of Different Congestion Control Mechanisms,” in 2016 IEEE 41st Conference on Local Computer Networks (LCN), pp. 567–570, Nov. 2016.
[12] Y.-C. Lai, “Improving the performance of TCP Vegas in a heterogeneous environment,” in Proceedings of the Eighth International Conference on Parallel and Distributed Systems (ICPADS 2001), pp. 581–587, June 2001.
[13] R. Awdeh and S. Akhtar, “Comparing TCP variants in presence of reverse traffic,” Electronics Letters, vol. 40, pp. 706–708, May 2004.
[14] E. Dumazet, “pkt_sched: fq: Fair Queue packet scheduler [LWN.net],” 2013.
[15] M. Dong, Q. Li, D. Zarchy, P. B. Godfrey, and M. Schapira, “PCC: Re-architecting congestion control for consistent high performance,” in 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), pp. 395–408, 2015.
[16] M. Dong, T. Meng, D. Zarchy, E. Arslan, Y. Gilad, P. B. Godfrey, and M. Schapira, “PCC Vivace: online-learning congestion control,” in NSDI, 2018.
[17] P. Goyal, A. Narayan, F. Cangialosi, D. Raghavan, S. Narayana, M. Alizadeh, and H. Balakrishnan, “Elasticity Detection: A Building Block for Delay-Sensitive Congestion Control,” in Proceedings of the Applied Networking Research Workshop (ANRW ’18), New York, NY, USA, p. 75, Association for Computing Machinery, July 2018.
[18] W. Stevens et al., “TCP slow start, congestion avoidance, fast retransmit, and fast recovery algorithms,” RFC 2001, Jan. 1997.
[19] A. Jain, S. Floyd, M. Allman, and P. Sarolahti, “Quick-Start for TCP and IP,” 2019.
[20] I. Järvinen et al., “Congestion Control and Active Queue Management During Flow Startup,” Series of Publications A, Department of Computer Science, University of Helsinki, 2019.
[21] M. Kühlewind and B. Briscoe, “Chirping for congestion control - implementation feasibility,” in Proceedings of PFLDNeT’10, 2010.
[22] R. Mittal, J. Sherry, S. Ratnasamy, and S. Shenker, “Recursively cautious congestion control,” in Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI ’14), USA, pp. 373–385, USENIX Association, Apr. 2014.
[23] M. Kargar Bideh, A. Petlund, C. Griwodz, I. Ahmed, R. Behjati, A. Brunstrom, and S. Alfredsson, “TADA: An Active Measurement Tool for Automatic Detection of AQM,” in Proceedings of the 9th EAI International Conference on Performance Evaluation Methodologies and Tools (VALUETOOLS ’15), Brussels, BEL, pp. 55–60, ICST, Jan. 2016.
[24] C. Baykal, W. Schwarting, and A. Wallar, “Detection of AQM on Paths using Machine Learning Methods,” arXiv:1707.02386 [cs], July 2017.
[25] D. A. Hayes, S. Ferlin, and M. Welzl, “Practical passive shared bottleneck detection using shape summary statistics,” in 39th Annual IEEE Conference on Local Computer Networks, pp. 150–158, Sept. 2014.

[26] S. Ferlin, O. Alay, T. Dreibholz, D. A. Hayes, and M. Welzl, “Revisiting congestion control for multipath TCP with shared bottleneck detection,” in IEEE INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, pp. 1–9, Apr. 2016.
[27] D. A. Hayes, M. Welzl, S. Ferlin, D. Ros, and S. Islam, “Online Identification of Groups of Flows Sharing a Network Bottleneck,” IEEE/ACM Transactions on Networking, vol. 28, pp. 2229–2242, Oct. 2020.
[28] K. Nichols and V. Jacobson, “Controlling Queue Delay,” Queue, 2012.
[29] S. Ha, I. Rhee, and L. Xu, “CUBIC: a new TCP-friendly high-speed TCP variant,” ACM SIGOPS Operating Systems Review, vol. 42, no. 5, pp. 64–74, 2008.
[30] Y. Gu and R. L. Grossman, “UDT: UDP-based data transfer for high-speed wide area networks,” Computer Networks, vol. 51, no. 7, pp. 1777–1799, 2007.
[31] J. Iyengar and M. Thomson, “QUIC: A UDP-based multiplexed and secure transport,” Internet Engineering Task Force, Internet-Draft draft-ietf-quic-transport-17, 2018.
[32] M. Handley, O. Bonaventure, C. Raiciu, and A. Ford, “TCP Extensions for Multipath Operation with Multiple Addresses,” 2012.
[33] Q. De Coninck and O. Bonaventure, “Multipath QUIC: Design and Evaluation,” in Proceedings of the 13th International Conference on emerging Networking EXperiments and Technologies (CoNEXT ’17), New York, NY, USA, pp. 160–166, Association for Computing Machinery, Nov. 2017.
[34] J. Rüth, I. Poese, C. Dietzel, and O. Hohlfeld, “A First Look at QUIC in the Wild,” in Passive and Active Measurement (R. Beverly, G. Smaragdakis, and A. Feldmann, eds.), Lecture Notes in Computer Science, Cham, pp. 255–268, Springer International Publishing, 2018.
[35] Apple, “About Wi-Fi Assist,” 2018.
[36] D. Taht, T. Hoeiland-Joergensen, J. Gettys, E. Dumazet, and P. McKenney, “The Flow Queue CoDel Packet Scheduler and Active Queue Management Algorithm,” 2018.
[37] T. Høiland-Jørgensen, D. Täht, and J. Morton, “Piece of CAKE: A Comprehensive Queue Management Solution for Home Gateways,” in 2018 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN), pp. 37–42, June 2018.
[38] M. Bachl, J. Fabini, and T. Zseby, “Cocoa: Congestion Control Aware Queuing,” in Proceedings of the 2019 Workshop on Buffer Sizing (BS ’19), New York, NY, USA, pp. 1–7, Association for Computing Machinery, Dec. 2019.
[39] M. Bachl, J. Fabini, and T. Zseby, “LFQ: Online Learning of Per-flow Queuing Policies using Deep Reinforcement Learning,” in 2020 IEEE 45th Conference on Local Computer Networks (LCN), pp. 417–420, Nov. 2020.
[40] M. Allman, V. Paxson, W. Stevens, et al., “TCP congestion control,” RFC 2581, Apr. 1999.
[41] O. Bonaventure, “Multipath TCP and load balancers — MPTCP,” 2018.
[42] V. Olteanu and C. Raiciu, “Datacenter Scale Load Balancing for Multipath Transport,” in Proceedings of the 2016 Workshop on Hot Topics in Middleboxes and Network Function Virtualization (HotMiddlebox ’16), Florianopolis, Brazil, pp. 20–25, ACM Press, 2016.
[43] S. Liénardy and B. Donnet, “Towards a Multipath TCP Aware Load Balancer,” in Proceedings of the 2016 Applied Networking Research Workshop (ANRW ’16), Berlin, Germany, pp. 13–15, ACM Press, 2016.
[44] O. Bonaventure, “Apple Music on iOS13 uses Multipath TCP through load-balancers — MPTCP,” 2019.
