ABSTRACT

GANJI, ANIRUDH. Network Performance Analysis of IoT, Cloud and Mobile systems. (Under the direction of Muhammad Shahzad.)

Today’s end-user and enterprise applications heavily rely on large-scale network infrastructure and novel network protocol implementations to provide their intended functionality. An ever-increasing demand for these applications has outpaced the rate of innovation and testing of the protocols and infrastructure necessary to support those demands. Hence, existing network protocols and systems are being repurposed to facilitate these new application paradigms. But this repurposing can sometimes have unintended consequences on network performance and can create performance bottlenecks in the system. Therefore, it is important to study such scenarios to not only understand what these bottlenecks are but also to find ways to alleviate these performance issues. In this work, I look at three such network scenarios, namely Internet of Things (IoT), Cloud and Datacenter networks, and Network protocol stacks in mobile systems. For the IoT scenario, I first provide a novel measurement framework to accurately measure fine-grained performance metrics in resource-constrained IoT nodes. Next, I look at the performance issues encountered when using an existing commercial wireless solution, like WiFi, to facilitate dense IoT deployments. Contrary to common intuition, our measurements showed that at higher IoT traffic rates the rate control mechanism of the Transmission Control Protocol (TCP) is a larger performance bottleneck than frame collisions due to wireless medium access. Next, for Cloud and Datacenter networks, I mainly focus on the impact of the TCP protocol on the performance of typical distributed workloads found in datacenter networks. Specifically, I look at how the choice of TCP in a cloud network environment can impact the application-level metrics of a tenant's workload. In another study, I evaluate the repercussions of coexisting TCP traffic from multiple different senders in a datacenter network in terms of its effect on workload performance. Our evaluations have shown us that choosing a TCP variant in datacenter networks is not a trivial problem and heavily depends on the network infrastructure, the type of workload, and the coexisting traffic. Moreover, I also demonstrated that low-level throughput unfairness caused by the coexistence of traffic from senders using different TCP variants can result in proportional application-level performance unfairness in datacenter networks. Finally, for mobile systems, I look at how QUIC, a recent application-level transport protocol, integrates with the Android networking stack and the functional issues encountered during typical mobile traffic scenarios. I found that, unlike what is usually seen in its server deployments, the benefits of QUIC are not universal on mobile devices. Moreover, I also found that the current QUIC implementation in Android can result in severe performance issues during mobile roaming scenarios.

© Copyright 2021 by Anirudh Ganji

All Rights Reserved

Network Performance Analysis of IoT, Cloud and Mobile systems

by Anirudh Ganji

A dissertation submitted to the Graduate Faculty of North Carolina State University in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Computer Science

Raleigh, North Carolina

2021

APPROVED BY:

Rudra Dutta Khaled Harfoush

Ismail Guvenc Anand Singh (External Member)

Muhammad Shahzad
Chair of Advisory Committee

DEDICATION

This dissertation is dedicated to my parents!

ACKNOWLEDGEMENTS

Working towards my Ph.D. has been a wonderful experience and I would like to thank everyone who has been part of this journey. First and foremost, I would like to thank Dr. Shahzad for his continuous support, guidance and faith in my work. He is a wonderful teacher and has had a profound impact in shaping my research career and my personal life. In many ways, I feel lucky to have had him during this journey. I am also deeply grateful to Dr. Anand Singh for being a valuable support during my Ph.D. His guidance and wit helped me both in my academic and personal life. I thank the members of the advisory committee, Dr. Rudra Dutta, Dr. Khaled Harfoush and Dr. Ismail Guvenc for their valuable feedback, insights and encouragement! Many thanks to Dr. Christina Vlachou at HPE. She was an incredible resource during my internship. I am extremely lucky to have had a very supportive group of lab members during my Ph.D. Raghav, Shaohu, Hassan Iqbal, Hassan Ali and Usman have not only been my lab-mates, but are also good friends. I feel I found good friends by the end of my program. I enjoyed every bit of my time at the Department of Computer Science at NC State. Not only did I form new friendships here, but I owe a lot of my professional development to this department. I would also especially like to thank NSF and other funding agencies which funded my Ph.D. My numerous friends have been very helpful during these five years in keeping me happy and entertained whenever I felt stressed. They were my pillar of strength during these stressful times. Thanks guys! Thank you Amulya for being the best friend you are. Many thanks to my family! They have been very patient, kind and understanding when I moved to a different country to do my Ph.D. Thanks mom and dad! Finally, I do not know how to say this, but woof woof (thank you) Milo. Adopting you was one of the best decisions in my life. Thanks for being the bundle of joy you are!

TABLE OF CONTENTS

LIST OF TABLES ...... vii

LIST OF FIGURES ...... viii

Chapter 1 Introduction ...... 1

Chapter 2 IoTm: A Lightweight Framework for Fine-grained Measurements of IoT Performance Metrics ...... 5
2.1 Introduction ...... 5
2.2 Data Structure ...... 8
2.2.1 Construction ...... 8
2.2.2 Management ...... 10
2.2.3 Analysis ...... 10
2.3 Query Processing Engine ...... 12
2.4 Applications and Evaluation ...... 15
2.4.1 Disk IO Operations per Process ...... 15
2.4.2 Round Trip Latency per Flow ...... 20
2.5 Related Work ...... 23
2.5.1 Sketches ...... 24
2.5.2 Data Center Monitoring Framework ...... 24
2.5.3 Other Measurement Schemes ...... 24
2.6 Conclusion ...... 25

Chapter 3 Characterizing the Performance of WiFi in Dense IoT Deployments ...... 27
3.1 Introduction ...... 27
3.2 Related work ...... 30
3.3 Overview of IEEE 802.11’s MAC & Aggregation ...... 31
3.3.1 Regular DCF ...... 31
3.3.2 RTS/CTS-based DCF ...... 32
3.3.3 Frame Aggregation ...... 32
3.4 Experimental Setup ...... 33
3.4.1 Testbed Setup ...... 33
3.4.2 IoT Traffics ...... 33
3.4.3 Experiment Execution ...... 35
3.5 Characterization of the IEEE 802.11n/ac’s MAC ...... 36
3.5.1 Throughput ...... 36
3.5.2 RTS/CTS Bandwidth and Block Acknowledgments ...... 38
3.5.3 Frame Aggregation ...... 39
3.5.4 TCP Pipelining ...... 40
3.6 Key Take-aways ...... 41

Chapter 4 Choosing TCP Variants for Cloud Tenants - A Measurement based Approach ...... 43
4.1 Introduction ...... 43
4.2 Related Work ...... 46
4.3 Measurement Setup ...... 46

4.3.1 Network and System Parameters ...... 46
4.3.2 Measurement Metrics ...... 47
4.4 Measurement Methodology ...... 48
4.4.1 Data Collection ...... 49
4.4.2 Evaluations ...... 49
4.4.3 Implications ...... 51
4.5 Application Traffic Scenarios ...... 52
4.5.1 Topology and Data Collection ...... 52
4.5.2 Streaming Traffic ...... 53
4.5.3 Distributed IO ...... 54
4.5.4 Sort Workload ...... 55
4.6 Implications ...... 56

Chapter 5 Characterizing the Impact of TCP Coexistence in Data Center Networks ...... 58
5.1 Introduction ...... 58
5.2 Related Work ...... 60
5.3 Measurement Setup ...... 61
5.3.1 Testbed Topologies ...... 61
5.3.2 Testbed Network Parameters ...... 62
5.3.3 TCP Variants ...... 62
5.4 Benchmarking Workloads ...... 63
5.4.1 Equal Number of Flows of All Variants ...... 63
5.4.2 One Flow of One Variant vs. Multiple Flows of Another ...... 67
5.4.3 Impact of Path RTTs ...... 69
5.4.4 Unsaturated Traffic ...... 70
5.4.5 The Fat-Tree Topology ...... 71
5.5 Data Center Workloads ...... 73
5.5.1 Experimental Setup ...... 74
5.5.2 Distributed Data Center Workloads ...... 75
5.5.3 Homogeneous Workloads ...... 76
5.5.4 Heterogeneous Workloads ...... 78
5.6 Predicting Coexisting TCP variants ...... 80
5.6.1 Methodology ...... 80
5.6.2 Prediction Accuracy ...... 83
5.6.3 Online Prediction ...... 84
5.7 Conclusion ...... 87

Chapter 6 Characterizing the Performance of QUIC on Android and Wear OS Devices ...... 88
6.1 Introduction ...... 88
6.2 Related Work ...... 90
6.3 Background ...... 91
6.3.1 QUIC Protocol ...... 91
6.3.2 Cronet Library ...... 92
6.4 Data Collection ...... 92
6.4.1 Testbed ...... 92
6.4.2 Network Environment ...... 92
6.4.3 Metrics and Experimental Scenarios ...... 93

6.5 Request Completion Time ...... 94
6.5.1 Methodology ...... 94
6.5.2 Evaluations ...... 95
6.6 Connection Migration ...... 100
6.6.1 Background ...... 100
6.6.2 Data Collection ...... 100
6.6.3 Evaluations ...... 102
6.7 Transport Protocol Selection ...... 103
6.7.1 Design Goals ...... 104
6.7.2 DTS Architecture ...... 104
6.7.3 Implementation ...... 105
6.7.4 Evaluations ...... 109
6.7.5 Limitations ...... 110
6.8 Conclusion ...... 110

Chapter 7 Conclusion ...... 112

BIBLIOGRAPHY ...... 113

LIST OF TABLES

Table 3.1 Bit rates for different classes of IoT traffic...... 34

Table 6.1 Testbed and experiment parameters ...... 94

LIST OF FIGURES

Figure 2.1 Illustration of bucket array and bucket subarrays ...... 9
Figure 2.2 The number of disk IOs stored in bucket epochs ...... 17
Figure 2.3 Actual # of disk IOs vs. estimated # of disk IOs ...... 18
Figure 2.4 Effect of n on average relative error (m = 5) ...... 18
Figure 2.5 Effect of m on average relative error (n = 20) ...... 18
Figure 2.6 Effect of bucket epochs spanned by a query on avg. rel. error ...... 19
Figure 2.7 Memory required as a function of the number of buckets ...... 19
Figure 2.8 CDF of packet counts of flows ...... 21
Figure 2.9 CDF of RTTs of all packets ...... 21
Figure 2.10 Actual average RTT per flow vs. estimated average RTT per flow ...... 22
Figure 2.11 Effect of the number of instances on average relative error ...... 23

Figure 3.1 Number of files published for class 1,2,3 IoT traffic ...... 37
Figure 3.2 Number of files published for class 4 IoT traffic ...... 37
Figure 3.3 Number of files published for class 5 IoT traffic ...... 37
Figure 3.4 Average throughput of each client ...... 37
Figure 3.5 System throughput ...... 37
Figure 3.6 RTS data ...... 38
Figure 3.7 CTS data ...... 38
Figure 3.8 Number of block ACKs ...... 39
Figure 3.9 Average aggregate lengths of AMPDUs ...... 39
Figure 3.10 CDFs of the number of packets TCP hands down to lower layers in each release ...... 40

Figure 4.1 Throughput vs. Time ...... 49
Figure 4.2 Tput (N BBR flows, GCP) ...... 50
Figure 4.3 Packet-loss per second ...... 50
Figure 4.4 Observed latency for different TCP variants ...... 51
Figure 4.5 Topology of dist. apps ...... 52
Figure 4.6 Box plot of average application latency for streaming workload in AWS ...... 54
Figure 4.7 Box plot of average application latency for streaming workload in GCP ...... 54
Figure 4.8 Box plots of average throughput per node for DFSIOe workloads ...... 55
Figure 4.9 Box plots of average J.C.T. for sort workload in AWS and GCP ...... 56

Figure 5.1 Leaf-spine data center topology ...... 62
Figure 5.2 Fat-Tree data center topology ...... 62
Figure 5.3 Throughputs achieved by the set of flows f1 and f2 while using the TCP variants T1/T2 (red is T1, blue is T2) for N=1,10, and 100 respectively ...... 64
Figure 5.4 Avg. congestion window size per flow (red circle: T1, blue cross: T2) ...... 65
Figure 5.5 Avg. retransmissions per flow per RTT (red: T1, blue: T2) ...... 65
Figure 5.6 Instantaneous per-flow throughputs of New Reno and Cubic (N = 10) ...... 66
Figure 5.7 When T2’s flow is added after 75 sec ...... 67
Figure 5.8 Poincaré plots. In any T1/T2 figure, red is T1, blue is T2 ...... 67
Figure 5.9 Average throughput of a single flow of T1 (T1’s name in subfigure caption) when coexisting with multiple flows of T2 (T2 names in legend) ...... 68
Figure 5.10 Throughput w.r.t. RTT. Red is T1, blue is T2, and green is normalized ratio of throughput between T1 and T2 ...... 69

Figure 5.11 Network Utilization ...... 70
Figure 5.12 Normalized throughput per TCP as a function of total network load ...... 71
Figure 5.13 Combined throughputs achieved by the set of flows from S1 and S3 while using TCP variant T1 (Red) and from S2 and S4 while using TCP variant T2 (Blue) for N=1,8,16, and 32 respectively ...... 72
Figure 5.14 Relative difference between throughputs of traffic passing through different number of congestion points ...... 74
Figure 5.15 Job completion times for homogeneous workloads. T1 is represented with red, T2 with blue, and average with green markers ...... 76
Figure 5.16 Observed Throughput vs. Time ...... 78
Figure 5.17 Heterogeneous workload. Sort uses T1 (red), other uses T2 (blue) ...... 79
Figure 5.18 Model definition of the convolution neural network used for predicting remote TCP variants ...... 82
Figure 5.19 Prediction accuracy for the three traffic scenarios ...... 82
Figure 5.20 Effect of number of parallel flows and traffic saturation on prediction accuracy for leaf-spine, fat-tree and unsaturated traffic scenarios ...... 83
Figure 5.21 Effect of TCP variants on prediction accuracy for window slice = 10 seconds ...... 84
Figure 5.22 Effect of TCP variants on prediction accuracy for window slice = 30 seconds ...... 85
Figure 5.23 Testbed for testing the online prediction tool ...... 85
Figure 5.24 Throughput vs. Time for a local CUBIC flow against different background flows ...... 86
Figure 5.25 Error in remote TCP variant prediction on first go ...... 87

Figure 6.1 Testbed ...... 93
Figure 6.2 p.d.f. of WiFi RSSI on the two devices ...... 93
Figure 6.3 p.d.f. of ping values to GCP server ...... 93
Figure 6.4 RCTs: Red (Blue) lines represent QUIC (TCP) and solid (dotted) lines represent smart-phone (smart-watch). Each figure plots observations for a given traffic direction and link type pair ...... 95
Figure 6.5 Ratio of RCT_TCP to RCT_QUIC to study the effect of request size and direction. Lighter shade implies QUIC had lower RCT ...... 96
Figure 6.6 Ratio of RCT_LTE to RCT_WiFi for different request sizes. Lighter shades imply LTE network had larger RCT ...... 97
Figure 6.7 RCT ratios when extra network delay is introduced for QUIC requests ...... 98
Figure 6.8 RCT ratios when extra network delay is introduced for TCP requests ...... 98
Figure 6.9 Ratio of RCTs observed for cold connections with warmed-up connections. Light colors imply larger improvement in RCT when using warmed-up connections ...... 99
Figure 6.10 Link state change at t=5 for one second ...... 101
Figure 6.11 Agg. RCTs during connection migration while using TCP ...... 102
Figure 6.12 Agg. RCTs during connection migration while using QUIC ...... 102
Figure 6.13 RCTs of individual chunks for Disconnection Time=1s and LTE Window=5s ...... 103
Figure 6.14 DTS Architecture and individual modules ...... 105
Figure 6.15 Information flow across different modules of DTS ...... 106
Figure 6.16 Cumulative RCTs when only QUIC (yellow), TCP (blue), or DTS (red) are used for HTTP requests ...... 109

Figure 6.17 Conditional probabilities of predicting QUIC. Red (light red) bar belongs to GET request over WiFi (LTE) and blue (light blue) bar belongs to POST request over WiFi (LTE) ...... 110

CHAPTER

1

INTRODUCTION

The networking landscape has evolved rapidly in the past decade and has paved the way for many interesting technologies and systems. Evolving business requirements and the demand for improved user experience have specifically fueled the growth of paradigms like the Internet of Things (IoT) and the Cloud, and of novel network protocols like QUIC. Since then, these systems and infrastructure have grown exponentially and have been getting increasingly complex. While providing new and creative technologies to end-users is vital for the survival of businesses, it is also important to deliver predictable performance to the users of such new technologies. Sometimes, performance degradation (particularly network performance degradation) is non-negotiable for an application and can severely hamper its functionality. Therefore, performance measurements of these systems are important for benchmarking and fixing network bottlenecks. With this in view, in this thesis, I present my work on the measurements and characterization of the network performance in IoT, Cloud and mobile systems. Due to the evolving needs of network applications, the networking community has been repurposing existing protocols for use in settings that are different from what they were originally designed for. These changes are increasingly influenced by external factors and business needs. For instance, in this thesis, I have identified three such scenarios:

1. With a ubiquitous goal of sensing and remote data processing, IoT applications have resulted in uplink-biased networks, in contrast to downlink-heavy web traffic

2. Transport protocols in cloud and datacenter networks are being increasingly used for east-west traffic or server-to-server communication compared to traditional north-south or server-to-client communication

3. The emergence of QUIC has led to a departure from the modular nature of the networking stack because the transport functionality is now implemented in user space as part of the application rather than in the operating system kernel

For IoT traffic, I present my work on a framework to measure low-level network and compute metrics in a resource-constrained node. I also evaluate how IoT deployments experience congestion in dense wireless deployments. For transport protocols, I focus on characterizing TCP behavior in cloud and datacenter networks using saturated TCP flows and application traffic. Finally, I study the performance and implementation issues of QUIC in mobile operating systems. I present this work in the subsequent five chapters. Next, I give a quick overview of the topic covered in each of these five chapters (Chapters 2 through 6).

Chapter 2: Framework to collect performance measurements in IoT networks

IoT networks have become a reality, with applications in a wide variety of verticals such as healthcare, industry automation, and smart lifestyle. Depending on the use-case, IoT deployments have to adhere to strict service level agreements set by the business to fulfill their intended functionality. Therefore, it is important to have a framework to quantitatively measure low-level metrics to detect and troubleshoot any performance bottlenecks. But compared to traditional compute devices, IoT nodes are resource constrained and lack complex software and hardware stacks. This limits the scale and complexity of performance monitoring applications that can be deployed in an IoT node. In this chapter, we address this problem by designing and evaluating a measurement framework with low overhead that can be deployed in a resource constrained IoT node to collect fine-grained measurements. The key idea is to achieve a tradeoff between accuracy and resource consumption by storing metrics for different measurement instances in overlapping counters. When we want to query the stored metrics, we use statistical methods to denoise the measurements, with an acceptable error, and present them to the user.

Chapter 3: Performance measurements in dense wireless IoT networks

The rise of IoT applications and use-cases has led to an increase in the number of devices connected to the Internet. The ubiquitous availability of IEEE 802.11, commonly known as WiFi, has further lowered the barrier for IoT applications to enter smart-home and enterprise locations. This has led to the creation of dense wireless environments, where the traffic characteristics of IoT and traditional web applications are at odds with each other. Conventional devices like smartphones and laptops have a significantly lower uplink traffic footprint compared to IoT nodes. Therefore, in dense wireless IoT environments, IoT nodes contend for the shared medium to upload their data. In this work, we present a comprehensive measurement study of various aspects of the IEEE 802.11 MAC, such as RTS/CTS and frame aggregation, along with the impact of TCP on wireless bandwidth utilization. We conduct this study using a real testbed with a large number of Raspberry Pis that emulate a variety of IoT workloads. This study demonstrates the various bottlenecks that the current IEEE 802.11 protocol exhibits in dense IoT deployments.

Chapter 4: Characterizing TCP in cloud networks

The low cost and high reliability offered by today’s cloud platforms have attracted many businesses and services to migrate their IT and business logic to the cloud. This has resulted in a rise in deployments of multi-tiered and distributed applications in the cloud. Although many solutions have been proposed to improve the networking performance of these distributed applications, most of them are from the perspective of the cloud providers and not from the perspective of the tenants. In this chapter, we investigate whether it is possible to improve the performance of a cloud workload by choosing the best-fit TCP for that workload. We determine the best-fit TCP for a workload using a measurement-based methodology (by evaluating low-level performance metrics like throughput and latency in a cloud platform) and from traffic heuristics. Two aspects of this study distinguish it from prior works. First, we conduct our studies in real cloud platforms with real workloads. Second, our approach is designed from the tenant’s perspective, where the cloud networking infrastructure is just a black-box network.

Chapter 5: Understanding TCP coexistence in datacenter networks

Prior research has resulted in the development of multiple TCP variants addressing specific performance needs of data center workloads. Therefore, a large-scale heterogeneous data center hosting many workloads and applications generates traffic using multiple different TCP variants, which coexist in the same switch fabric. Consequently, the coexisting TCP traffic can experience unfairness in terms of throughput, end-to-end delay, etc. In order to evaluate this unfairness, we conducted a measurement study on a real testbed emulating common data center topologies and workloads. We follow a two-pronged approach in this study. First, we characterize TCP coexistence using saturating TCP flows and study how factors such as link propagation delay, the number of congestion points in the network, and the number of coexisting flows can affect throughput fairness. We then apply the insights learned from these experiments to evaluate the performance of real workloads under different TCP coexistence scenarios. One important takeaway of this study is that unfairness due to TCP coexistence in low-level metrics like throughput can proportionally translate to unfairness in high-level application metrics, even when the bottleneck link is not completely saturated with traffic.

Chapter 6: Performance measurement of QUIC in mobile systems

Mobile devices are different from high performance servers in two ways. First, they are battery powered and have less processing power compared to enterprise machines. Second, the operating systems that they use are comparatively more restrictive than general-purpose operating systems in terms of access to low-level hardware or software routines. Given this context, it is important to understand the behavior of QUIC, a user-space transport protocol, in mobile systems. In this work, we particularly focus on Google’s implementation of QUIC in their Android operating system and show that QUIC does not always deliver better network performance compared to TCP in mobile

systems. We use these insights to develop a probabilistic framework that chooses either QUIC or TCP to minimize long-term request completion times.

CHAPTER

2

IOTM: A LIGHTWEIGHT FRAMEWORK FOR FINE-GRAINED MEASUREMENTS OF IOT PERFORMANCE METRICS

2.1 Introduction

The Internet of Things (IoT) enables an exchange of data that wasn’t previously easily available, such as the temporal and spatial distribution of moisture in soil [25], the power consumption of electrical appliances in smart buildings [123], and the integrity of concrete structures [124]. Each IoT application scenario requires unique guarantees on certain performance metrics (such as latency, loss, power consumption, etc.) from the IoT infrastructure deployed in that scenario. For example, the values of vibration levels in the parts of an in-flight aircraft must be delivered to a server with as little latency as possible for real-time monitoring. Similarly, the CPU utilization of a certain process on an IoT node must not exceed a certain value to ensure CPU availability for other, perhaps more critical, processes. A small deterioration in these performance metrics can cause violations of service level agreements (SLAs), and result in significant revenue and functional losses. To ensure that the deployed IoT architecture delivers the guarantees on these performance metrics and that their values lie in a desired range, the first step is to measure these metrics. IoT system and network administrators need the measurements of these performance metrics to reactively perform troubleshooting tasks such as detecting and localizing offending flows that are responsible for causing delay bursts and throughput deterioration, and identifying processes that are hogging an IoT node’s CPU. They also

need these measurements to proactively locate and update any potential future bottlenecks. In this work, our objective is to take the first step towards developing a framework for measuring IoT performance metrics, which include both the IoT network’s Quality of Service (QoS) metrics, such as latency, loss, and throughput, and the IoT nodes’ Resource Utilization (RU) metrics, such as power consumption, disk accesses and utilization, and radio-on time. This framework must have two key properties. First, it should be lightweight for IoT nodes, i.e., it should require a very small amount of memory and compute resources on IoT nodes when performing measurements, and should keep all the complexity on the cloud side. This is important because most IoT nodes are resource constrained in terms of either the available power, or the compute and memory resources, or even both. Second, it should be able to perform fine-grained measurements, i.e., it should not only be able to obtain the aggregate measurements of the desired performance metric (such as the aggregate utilization of CPU by all processes on an IoT node or the average latency of all the packets going from an IoT node to a cloud platform), but it should also be able to obtain per-instance measurements of that metric (such as the CPU utilization of each process running on the IoT node, and the average latency of each flow going from the IoT node to the cloud platform, respectively). In the rest of this chapter, we will use the term instance to represent an individual entity for which the desired performance metric exists. Examples of instances include a network flow or a CPU process. The desired performance metrics for these instances can be flow size and CPU utilization, respectively. Fine-grained measurements are important because even if the aggregate value of a performance metric appears normal, its value for a particular instance may be wildly abnormal. For example, IoT gateways can experience microbursts (simultaneous short bursts of data from a large subset of IoT nodes) that cause packet losses for some flows, even when the average traffic at that gateway across all flows over a period of time is well within the limits of the gateway capacity. While solutions to such problems have been proposed in the past for conventional networks and for data centers, and it may even be possible to adapt those solutions for IoT architectures, it is still imperative to have a measurement scheme in place to rapidly detect and localize such problems. While measuring performance metrics in IoT is a new and largely unexplored area, a significant amount of work has been done on measuring performance metrics in conventional networks and data centers, which reiterates the importance of measuring network performance metrics in the emerging IoT networks. Although several schemes have been proposed to measure performance metrics in conventional networks and systems, they are not suitable for IoT architectures for two main reasons. First, the majority of the existing schemes are not passive, i.e., they perform active operations, such as using active probes, to obtain the measurements [83, 125, 58, 12, 105, 36, 101, 146].
While active schemes work well with conventional network devices, they are problematic for IoT nodes because they interfere with the regular operations of the IoT nodes, and the limited amount of resources on such nodes makes it hard for the nodes to perform such auxiliary active operations while ensuring that their regular operations proceed unaffected. Second, the existing passive measurement schemes for conventional networks and systems require a significant amount of computational and memory resources. For example, Moshref et al. leveraged the abundance of computational resources in the servers in data centers to develop a measurement framework to detect events of interest [96]. While computational resources are usually not a problem on conventional servers, they are not as abundant on IoT nodes. Consequently, it is not possible for IoT nodes to implement conventional measurement methods without impeding their regular operations. Thus, there is a need to develop a framework that is designed keeping IoT architectures in mind.

In this chapter, we present IoTm, a lightweight framework for fine-grained measurement of IoT performance metrics, which include both QoS as well as RU metrics. It is comprised of two components: an IoT node unit (INU), which resides in each IoT node, and a control and query unit (CQU), which resides in a logically centralized management server. The management server itself can either be deployed locally or in the cloud. The INU is comprised of two sub-components: 1) a data structure in which the IoT node stores appropriate information about the desired performance metric; and 2) a local query processing engine, which receives queries from the CQU (we will discuss the CQU shortly) and answers them using the information stored in the data structure. The INU sends the data structure to the CQU either periodically or when the data structure gets full, and the CQU stores it for analysis and long-term storage. The data structure has three properties that make it ideal for implementation in IoT nodes. First, it enables both fine-grained and aggregate measurements of a variety of performance metrics, including, but not limited to, latency, disk accesses, CPU utilization, and several others. Second, it is computationally very lightweight and requires only a single hash computation and no more than two memory updates per insertion. Third, it has a very small memory footprint.

The CQU is comprised of three sub-components: 1) a control unit, which, based on the high level measurement objective, is responsible for selecting the IoT nodes on which the measurements should be taken, the performance metrics that should be measured on those IoT nodes, and the granularity at which those performance metrics should be measured; 2) a storage, where the CQU stores the data structures sent by the INUs; and 3) a global query processing engine, which answers users’ queries from the stored data structures. Next, we give an example to demonstrate the use of these three sub-components. Consider a scenario where an IoT infrastructure provider deploys a large number of traffic monitoring sensors that measure and provide the extent of congestion on all roads and intersections throughout a city. Suppose an IoT service controls the timings for traffic lights at all intersections in that city and uses an algorithm to calculate the optimal durations for the red, yellow, and green lights based on the state of congestion on different roads and intersections in the city. Let’s assume that the algorithm that this IoT service uses requires that the latest state of congestion at city intersections be delivered to it within 100 ms, while the state of congestion at portions of roads farther away from the intersections can take longer to be delivered. This requirement gives rise to the need for measuring latencies of packets sent by traffic sensors at city intersections to ensure that they never experience delays greater than 100 ms.
With this high level latency measurement objective in view, the control unit instructs the INUs in the traffic sensors at the intersections (and not at the portions of roads away from the intersections) to start recording information in their data structures that can later be used for measuring latencies. The query processing engine in the CQU can later be used to estimate and analyze the latencies experienced by packets at any desired time from the data structures sent by the INUs to the CQU and stored in the CQU’s storage, to ensure that the latencies experienced by these packets were never over 100 ms, and if they were, appropriate actions can be taken to keep them under 100 ms.

From the discussion above, we see that our measurement framework has 5 sub-components: the data structure and the local query processing engine in the INU, and the control unit, storage, and global query processing engine in the CQU. In this work, we focus on the design of 3 of these 5 sub-components: the data structure and the local query processing engine in the INU, and the global query processing engine in the CQU. For the control unit, we assume that a network administrator configures it manually based on the high level measurement objectives. Automating the operations of the control unit will be part of our future work. For the storage, we assume that an appropriate database is in place that can store the data structures sent by the INUs to the CQU.

Key Contributions

In this work, we make three key contributions: 1) we present IoTm, a lightweight framework for fine-grained measurement of IoT performance metrics, which include both QoS and RU metrics; 2) we present a generic data structure that enables INUs to store information about IoT performance metrics in a compute- and memory-efficient way; and 3) we demonstrate the application of our framework using two arbitrarily chosen IoT performance metrics (disk accesses and round trip latencies) as examples, and present the accuracy of our framework through extensive experimental evaluations.

2.2 Data Structure

In this section, we present our data structure that can efficiently store measurements for a variety of different performance metrics. We first describe how the INU on any given IoT node inserts measurements into this data structure. After that, we discuss how the INU sends it to the CQU for long term storage and analysis. Last, we present its complexity analysis and theoretical modeling. To describe how the INU inserts measurements into this data structure, we use an arbitrary metric M that has to be measured for one or more distinct instances. We use I to represent an arbitrary instance. As an example, this data structure can be used to store the packet count (i.e., the metric M) of each network flow (i.e., the instance I) that the IoT node generates. As another example, this data structure can also be used to store the number of disk accesses (i.e., the metric M) of each process (i.e., the instance I) running in a given IoT node.

2.2.1 Construction

Our data structure is comprised of an array B of n buckets, where each bucket B[i] (1 ≤ i ≤ n) has b bits with initial value 0. The data structure requires instances to have unique IDs. An ID can be any information that distinguishes one instance from the others. For example, if the instance is a network flow, its ID can be the standard five tuple (i.e., source IP, destination IP, source port,

8 destination port, and protocol type). Similarly, if the instance is a process running in the IoT node, its ID can be the unique process ID assigned to it by the OS. For any arbitrary instance with ID I , our data structure maps it to a bucket subarray, which is a unique subset of the buckets in the bucket array. Each bucket subarray comprises m buckets, where m << n. To make the mapping unique and memoryless (i.e. , without requiring any lookup or hash tables), our data structure maps each instance to a random subarray such that the probability of two different instances being mapped to the same subarray is practically negligible. A bucket may belong to multiple bucket subarrays.

Figure 2.1 shows an example bucket array and its four bucket subarrays for four instances I1 through I4. We observe from this figure that buckets 5 and 7 belong to multiple bucket subarrays.


Figure 2.1 Illustration of bucket array and bucket subarrays

To insert each measurement of the desired metric M of an instance I, the INU first randomly selects one bucket in the bucket subarray corresponding to the instance I, and then adds the measured value of the metric to that bucket. Formally, to add a measurement for instance I, the INU chooses a random number j from a uniform distribution in the range [1, m], calculates the hash function H(I, j) whose output is uniformly distributed in the range [1, n], and increments the bucket B[H(I, j)] by the value of the measurement. Thus, the sum of all the measurements of the metric M of instance I will be uniformly distributed across m buckets: B[H(I,1)], B[H(I,2)], ..., B[H(I,m)]. These m buckets constitute the bucket subarray of the instance I, which is denoted by B^I, where B^I[j] = B[H(I, j)] for j ∈ [1, m]. Note that while storing the sum of all the measurements is appropriate in the case of some metrics, such as the total number of memory accesses, it is not appropriate for some other metrics, such as CPU utilization. The appropriate measure for such metrics is the average value of the metric. To calculate the average value of such a metric for any given instance I, in addition to the sum of all the measurements for I, the query processing engine will also need to know the number of times the measurements for I were inserted into I’s bucket subarray. For this, the INU maintains another array C containing the same number of buckets as the array B (i.e., n) with initial values of 0, and the same number of buckets per subarray (i.e., m). Every time the INU adds a measurement to some bucket B[i], it also increments the corresponding bucket C[i] by 1. This way, the bucket

9 array B stores the sum of the measurements of the desired metric for each instance and the bucket array C stores the number of times the measurements of the desired metric for each instance were inserted in the bucket array B. In Section 2.3, we will describe how a query processing engine applies appropriate statistical operations to extract the average and/or the sum of the measurements of the metric for any given instance I from the bucket array.
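To make the insertion path concrete, the following Python sketch shows one possible INU-side implementation of the bucket arrays described above. The class and function names, and the choice of BLAKE2 as the hash function H, are illustrative assumptions and are not taken from the dissertation's implementation.

```python
import hashlib
import random

class BucketArray:
    """Sketch of the INU-side data structure: an array B of n buckets, an
    m-bucket subarray per instance, and an optional array C for metrics
    whose per-instance average (rather than sum) is needed."""

    def __init__(self, n, m, track_counts=False):
        self.n, self.m = n, m
        self.B = [0] * n                              # sums of inserted measurements
        self.C = [0] * n if track_counts else None    # insertion counts

    def _bucket_index(self, instance_id, j):
        # H(I, j): hash the (instance ID, replica index) pair uniformly into [0, n)
        key = f"{instance_id}|{j}".encode()
        digest = hashlib.blake2b(key, digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.n

    def insert(self, instance_id, value):
        # One hash computation and at most two memory updates per insertion
        j = random.randrange(self.m)                  # uniform pick within the subarray
        idx = self._bucket_index(instance_id, j)
        self.B[idx] += value
        if self.C is not None:
            self.C[idx] += 1
```

A production INU would use b-bit saturating counters and a hash suited to its microcontroller rather than Python objects; the sketch only mirrors the insertion logic.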

2.2.2 Management

The INU on each IoT node sends the data structure, i.e., the bucket array B (and the bucket array C, if being used) to the CQU for storage either when any of the buckets becomes full or after a certain amount of time has passed since the bucket array B was initialized to 0. We call the bucket array B that is sent to the CQU a bucket epoch B. For each bucket epoch, the INU also sends time stamps of the first and the last recorded measurement, which the CQU uses to distinguish between several bucket epochs sent by the INU on any given node. After sending the data structure, the INU resets all buckets to zero and starts recording measurements again. Note that instead of using the data structure that we have proposed, the INU may be able to store the measurements of the desired metric for each instance by simply maintaining a separate counter for each instance. When the number of instances is small, this approach may even require less memory. However, this approach has 3 limitations. First, it does not scale: as the number of instances increases, the amount of lookup state that the IoT node needs to maintain increases to keep track of which counters are associated with which instances. This may not always be possible for resource constrained IoT nodes. In comparison, the data structure in our IoTm framework does not require the INU to maintain any lookup state, but rather maps the instances to appropriate counters using our memoryless hashing approach. Second, as the number of instances increases, separate counters for each instance consume a lot more memory compared to our data structure, as we will show in Section 2.4. Third, it is often not possible to anticipate the number of measurements of any given instance in advance. Thus, allocating the same number of bits to counters for each instance can result in overflow in some counters and under utilization in others. In comparison, in our data structure, each bucket is shared among many instances and each instance is mapped to many buckets, which mitigates the problems of under utilization and overflow, respectively.
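The epoch handling described above can be layered on the earlier BucketArray sketch; the following extension is again an illustrative assumption, with the actual transfer of the epoch to the CQU left abstract.

```python
import time

class EpochingBucketArray(BucketArray):
    """Adds bucket-epoch semantics: flush when any b-bit bucket fills up
    (an external timer could also call flush() periodically)."""

    def __init__(self, n, m, b_bits=16, track_counts=False):
        super().__init__(n, m, track_counts)
        self.bucket_capacity = 2 ** b_bits - 1
        self.first_ts = None                          # time of the first measurement

    def insert(self, instance_id, value):
        if self.first_ts is None:
            self.first_ts = time.time()
        super().insert(instance_id, value)
        if max(self.B) >= self.bucket_capacity:       # some bucket became full
            return self.flush()
        return None

    def flush(self):
        epoch = {"B": list(self.B),
                 "C": list(self.C) if self.C is not None else None,
                 "first_ts": self.first_ts,
                 "last_ts": time.time()}
        self.B = [0] * self.n                         # reset and start a new epoch
        if self.C is not None:
            self.C = [0] * self.n
        self.first_ts = None
        return epoch                                  # the INU would send this to the CQU
```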

2.2.3 Analysis

2.2.3.1 Complexity

When inserting information about a measurement into the bucket array, the IoT node has to perform only one hash computation (to identify the bucket) and at most two memory updates (one to add the value of the measurement to the bucket array B and another to keep the count of the number of measurements in the bucket array C, if used). Due to such minimal computation and memory access requirements, our data structure is very lightweight and amenable to implementation in IoT nodes. We will study the memory footprint of our data structure in Section 2.4 when we use the IoTm framework to measure various IoT performance metrics.

2.2.3.2 Modeling

Next, we derive the expression for the probability distribution of the values of the buckets in the bucket arrays, which the query processing engine will use to estimate the average and/or the sum of the measurements of the metric stored in the bucket arrays for any given instance. Let B_I be the random variable representing the amount contributed by instance I to a bucket B^I[j] (1 ≤ j ≤ m) in its bucket subarray. Let s_I be the sum of all the measurements of instance I that contributed to the current bucket epoch B. As each bucket in the bucket subarray of I has a probability 1/m of being selected for insertion of any measurement for the instance I, B_I follows a binomial distribution, i.e., B_I ∼ Binom(s_I, 1/m). Thus, the amount contributed by an instance to each bucket in its bucket subarray follows a binomial distribution.

The amount contributed by all instances other than the instance I to each bucket in the bucket subarray of instance I also follows a binomial distribution. Let B_r be the random variable representing the amount contributed by measurements of all instances other than the instance I to bucket B^I[j]. The probability that a measurement of an instance Ī ≠ I contributes an amount to bucket B^I[j] is the product of the probability that the hash function H maps the measurement to B^I[j] given that B^I[j] ∈ B^Ī, which is 1/m, and the probability that bucket B^I[j] is in the bucket subarray of Ī, which is denoted by P{B^I[j] ∈ B^Ī} and calculated as follows:

$$P\{B^I[j] \in B^{\bar{I}}\} = 1 - \binom{m}{0}\left(\frac{1}{n}\right)^{0}\left(1-\frac{1}{n}\right)^{m} = \frac{m}{n} - \frac{m(m-1)}{2!}\cdot\frac{1}{n^2} + \ldots \approx \frac{m}{n} \qquad (2.1)$$

Thus, the probability that a measurement for an instance Ī ≠ I contributes an amount to bucket B^I[j] is (1/m) × (m/n) = 1/n. Representing the sum of all the amounts contributed by the measurements of all instances in the given bucket epoch by s, the binomial distribution of B_r is B_r ∼ Binom(s − s_I, 1/n).

Next, we calculate the probability distribution of any given bucket in the bucket array. As B = B_I + B_r, and B_I and B_r are almost independent when s_I << s, the probability distribution function of B is calculated as follows.

$$P\{B = b\} \approx \sum_{b_I=0}^{b} P\{B_I = b_I\} \times P\{B_r = b - b_I\}$$

Note that in practice, s_I is indeed much smaller than s because s_I is the amount added by a single instance in the bucket epoch while s is the amount added by all the instances. Replacing P{B_I = b_I} and P{B_r = b − b_I} with their respective binomial distribution expressions, we get

$$P\{B = b\} \approx \sum_{b_I=0}^{b} \binom{s_I}{b_I}\left(\frac{1}{m}\right)^{b_I}\left(1-\frac{1}{m}\right)^{s_I-b_I} \times \binom{s-s_I}{b-b_I}\left(\frac{1}{n}\right)^{b-b_I}\left(1-\frac{1}{n}\right)^{s-s_I-b+b_I} \qquad (2.2)$$

Following the same steps, we can obtain the expression for the probability distribution of the values of the buckets in the bucket array C. Let t_I represent the number of times the measurements of instance I were inserted in B, and t represent the total number of insertions into the bucket array B. Let C be the random variable representing the value of any arbitrary bucket C^I[j] (1 ≤ j ≤ m) in the bucket subarray. The expression for the probability distribution of the values of the buckets in the bucket array C turns out to be the following.

$$P\{C = c\} \approx \sum_{c_I=0}^{c} \binom{t_I}{c_I}\left(\frac{1}{m}\right)^{c_I}\left(1-\frac{1}{m}\right)^{t_I-c_I} \times \binom{t-t_I}{c-c_I}\left(\frac{1}{n}\right)^{c-c_I}\left(1-\frac{1}{n}\right)^{t-t_I-c+c_I} \qquad (2.3)$$

2.3 Query Processing Engine

The control unit issues queries to the query processing engines. Queries can be provided manually by the administrator through the control unit, or the control unit can issue them automatically to measure these performance metrics and ensure that the IoT system is delivering on any required SLAs. A query comprises three attributes: 1) the instance ID; 2) the starting and ending times of the period during which the value of the desired metric is required; and 3) whether the response to the query should be the sum or the average of the values of the desired metric. Based on the starting and ending times, the control unit first determines whether the bucket epochs whose time frames overlap with these starting and ending times are all stored in the storage of the CQU, or whether the time frame also covers the data structure currently under construction by the INU in an IoT node. Next, the control unit instructs the appropriate query processing engines (i.e., either in the CQU, or the INU, or both) to estimate the sum of the value of the desired metric from each identified bucket epoch B, and send them back to the control unit. The control unit adds these values estimated from the bucket epochs to obtain the overall sum for the desired instance. If the average value of the metric is desired, the query processing engine further extracts the number of times the measurement of the desired metric was inserted into bucket epochs B from all corresponding bucket epochs C, and sends them to the control unit. The control unit adds them to get an estimate of the total number of times the measurements of the desired metric were inserted into all bucket epochs, and divides the sum from all bucket epochs by it to get the average value of the metric for the desired instance.

Next, we describe how the query processing engine extracts the value of the sum from any given bucket epoch B. The method to extract the value of the number of times a measurement is inserted from any given bucket epoch C is exactly the same. Furthermore, both local and global query processing engines use the exact same method to extract the values of the sum and the number of times the measurements are inserted from bucket epochs B and C, respectively. Our objective here is to estimate the value of s_I for any given instance with ID I from any given bucket epoch B. Let B̃^I[j] denote the observed value of bucket B^I[j] and let s̃_I denote the estimate that the query processing engine returns for the value of s_I. The probability or likelihood of getting the observed value B̃^I[j] of a bucket B^I[j] in the bucket subarray of instance I is given by Eq. (2.2). Thus, the likelihood of getting the observed values of all buckets in the bucket subarray of the instance is given by the following equation.

$$L = \prod_{j=1}^{m} P\{B = \tilde{B}^I[j]\} = \prod_{j=1}^{m}\left[\sum_{b_I=0}^{\tilde{B}^I[j]} \binom{s_I}{b_I}\left(\frac{1}{m}\right)^{b_I}\left(1-\frac{1}{m}\right)^{s_I-b_I} \times \binom{s-s_I}{\tilde{B}^I[j]-b_I}\left(\frac{1}{n}\right)^{\tilde{B}^I[j]-b_I}\left(1-\frac{1}{n}\right)^{s-s_I-\tilde{B}^I[j]+b_I}\right] \qquad (2.4)$$

Note that the right hand side of the equation above has only one unknown, i.e., s_I. We use the maximum likelihood estimation method to estimate the value of s_I using this equation. Formally, the estimated value s̃_I of the sum of measurements of the instance I is given by $\tilde{s}_I = \arg\max_{s_I}\{L\}$. Taking the natural log of L, differentiating $\ln\{L\}$ with respect to s_I, and equating $\frac{d}{d s_I}\ln\{L\}$ to 0, we get

$$\sum_{j=1}^{m} \frac{d}{d s_I}\ln\left\{\sum_{b_I=0}^{\tilde{B}^I[j]} \binom{s_I}{b_I}\left(\frac{1}{m}\right)^{b_I}\left(1-\frac{1}{m}\right)^{s_I-b_I} \times \binom{s-s_I}{\tilde{B}^I[j]-b_I}\left(\frac{1}{n}\right)^{\tilde{B}^I[j]-b_I}\left(1-\frac{1}{n}\right)^{s-s_I-\tilde{B}^I[j]+b_I}\right\} = 0 \qquad (2.5)$$

Let

$$X[j] = \left(\frac{1}{m}\right)^{b_I}\left(1-\frac{1}{m}\right)^{-b_I}\left(\frac{1}{n}\right)^{\tilde{B}^I[j]-b_I}\left(1-\frac{1}{n}\right)^{s-\tilde{B}^I[j]+b_I}$$

$$Y[j] = \binom{s_I}{b_I}\binom{s-s_I}{\tilde{B}^I[j]-b_I}\left(1-\frac{1}{m}\right)^{s_I}\left(1-\frac{1}{n}\right)^{-s_I}$$

Thus, Eq. (2.5) becomes

$$\sum_{j=1}^{m} \frac{d}{d s_I}\ln\left\{\sum_{b_I=0}^{\tilde{B}^I[j]} X[j] \times Y[j]\right\} = 0$$

As X[j] is not a function of s_I, the equation above becomes

$$\sum_{j=1}^{m} \frac{\sum_{b_I=0}^{\tilde{B}^I[j]} X[j] \times \frac{d}{d s_I}Y[j]}{\sum_{b_I=0}^{\tilde{B}^I[j]} X[j] \times Y[j]} = 0 \qquad (2.6)$$

To calculate $\frac{d}{d s_I}Y[j]$, we use the following identity.

$$\frac{d}{d w}\binom{v}{w} = \binom{v}{w}\left[\psi^{(0)}\{v-w+1\} - \psi^{(0)}\{w+1\}\right]$$

Using this identity and standard algebraic operations, we get the following expression for $\frac{d}{d s_I}Y[j]$. Let $Z[j] = \frac{d}{d s_I}Y[j]$. Thus,

$$Z[j] = Y[j]\Big[\log\Big\{1-\frac{1}{m}\Big\} - \log\Big\{1-\frac{1}{n}\Big\} - \psi^{(0)}\{1+s-s_I\} + \psi^{(0)}\{1+b_I-\tilde{B}^I[j]+s-s_I\} + \psi^{(0)}\{1+s_I\} - \psi^{(0)}\{1-b_I+s_I\}\Big]$$

Thus, the estimated value s̃_I of s_I, i.e., the sum of measurements of instance I, in the given bucket epoch B is given by the numerical solution of the following equation:

$$\sum_{j=1}^{m} \frac{\sum_{b_I=0}^{\tilde{B}^I[j]} X[j] \times Z[j]}{\sum_{b_I=0}^{\tilde{B}^I[j]} X[j] \times Y[j]} = 0 \qquad (2.7)$$

The estimated value t̃_I of t_I, i.e., the number of times the measurements are inserted in the given bucket epoch B, is also given by the numerical solution of Eq. (2.7) after replacing B̃^I[j], b_I, s, and s_I with C̃^I[j], c_I, t, and t_I, respectively.
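To make the estimation procedure concrete, the following Python sketch shows one way a query processing engine could recover s̃_I from the m observed values B̃^I[j] of an instance's bucket subarray. Rather than solving Eq. (2.7) with its digamma terms directly, the sketch maximizes the log of the likelihood in Eq. (2.4) with a bounded scalar search, which is equivalent at the optimum; all names (log_likelihood, estimate_sum, s_total, buckets) are illustrative assumptions and not taken from the IoTm implementation.

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

def log_likelihood(s_i, s_total, buckets, m, n):
    """log of Eq. (2.4): buckets holds the m observed subarray values B~I[j],
    s_total is the total amount inserted by all instances in the epoch."""
    ll = 0.0
    for b_obs in buckets:
        b_i = np.arange(0, min(b_obs, int(s_i)) + 1)   # only terms with C(s_i, b_i) > 0
        log_terms = (
            gammaln(s_i + 1) - gammaln(b_i + 1) - gammaln(s_i - b_i + 1)
            + gammaln(s_total - s_i + 1) - gammaln(b_obs - b_i + 1)
            - gammaln(s_total - s_i - b_obs + b_i + 1)
            + b_i * np.log(1.0 / m) + (s_i - b_i) * np.log(1.0 - 1.0 / m)
            + (b_obs - b_i) * np.log(1.0 / n)
            + (s_total - s_i - b_obs + b_i) * np.log(1.0 - 1.0 / n))
        mx = log_terms.max()
        ll += mx + np.log(np.exp(log_terms - mx).sum())  # log-sum-exp over b_i
    return ll

def estimate_sum(buckets, s_total, m, n):
    """Return the maximum-likelihood estimate of the instance's sum s_I."""
    res = minimize_scalar(lambda x: -log_likelihood(x, s_total, buckets, m, n),
                          bounds=(1.0, float(sum(buckets)) + 1.0),
                          method="bounded")
    return res.x
```

Maximizing the log-likelihood directly avoids having to bracket a root of Eq. (2.7), at the cost of a few extra likelihood evaluations per optimizer step; the same routine estimates t̃_I when given the C buckets and the total insertion count instead.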

Discussion

As described in Section 2.2.1, the INU spreads the measurements of any instance into m buckets. By distributing the measurements into multiple buckets and by keeping m << n, the INU significantly reduces the dependence of the accuracy of estimates on the distribution of the measurements. This is because when m << n, no two instances will have a large number of common buckets in their bucket subarrays. Thus, even if one instance with a large value shares a bucket with another instance with a small value, the remaining m − 1 buckets of the instance with the small value will not be shared with the same large instance. Consequently, the net estimate from the m buckets does not have a large error.

Our proposed framework finds applications in the majority of IoT deployment scenarios. For example, it can be deployed to keep track of the amount of time each node transmits, for subsequent analysis of the fairness in transmission workload of low-power nodes. As another example, it can be used to monitor the latency experienced by the packets of individual IoT nodes, which is a critical metric in time-sensitive IoT deployments such as autonomous machines on factory floors.

2.4 Applications and Evaluation

In this section, we demonstrate the use of our IoTm framework by applying it to store and estimate one RU metric, namely the number of disk input/output (IO) operations, and one QoS metric, namely round trip latency. Note that our framework is not limited to these two metrics, and can be used to store and estimate several other IoT metrics, such as CPU utilization, throughput, memory consumption, etc. For the number of disk IO operations, the appropriate measurement is the total number of IO operations per process, while for latency, the appropriate measurement is the average value per flow. Thus, the two IoT performance metrics that we have chosen cover both types of metrics: one that does not require the use of the supplementary bucket array C and one that does. In the rest of this section, for each of the two metrics, we first describe how the INU inserts the measurements of that metric into the data structure. After that, we describe the real world traces that we used to evaluate the performance of our framework for that metric. Last, we present the results on the accuracy of the estimates of that metric provided by the query processing engine, and on the physical memory required on the IoT nodes to maintain the data structure.

2.4.1 Disk IO Operations per Process

2.4.1.1 Method

When the control unit instructs INU on any IoT node to start recording disk IOs of processes, the INU initializes the bucket array B comprising n buckets. As we are interested in the total number of disk IOs per process and not the average, the INU does not initialize the bucket array C. Based on the requirements, the control unit can even specify exactly for which processes the disk IOs should be recorded. Every time a process on the given IoT node performs a disk IO operation, the INU on that node selects a random number j from a uniform distribution in the range [1,m], appends it with the ID p of the process, calculates the hash function H (p, j ) whose output is uniformly distributed in the range [1,n], and increments the bucket B[H (p, j )] by one. To estimate the number of disk IOs of any given process with ID p over a desired period of time, the control unit first identifies all bucket epochs whose time frames overlap with that desired period of time. Depending on where the identified bucket epochs are currently stored, the control unit asks the query processing engine either in CQU, or in INU, or in both, to estimate the number of disk IOs from their corresponding

bucket epochs. The query processing engine(s) use Eq. (2.7) to estimate the number of disk IOs for that process from the identified bucket epochs, and send the estimates back to the control unit. The control unit adds the estimate from each bucket epoch to get the final estimate of the total number of disk IOs of the process p over the desired period of time.
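As a concrete illustration, replaying a disk IO trace through the earlier BucketArray sketch would look roughly as follows; the process IDs and trace entries are hypothetical, with n = 20 and m = 5 matching the evaluation parameters used below.

```python
# Total-count metric: each disk IO adds 1 to one bucket of the process's
# subarray, so the supplementary array C is not needed.
ba = BucketArray(n=20, m=5)
trace = [(0.013, 1201), (0.021, 1201), (0.054, 1342)]   # (timestamp, pid): hypothetical entries
for _ts, pid in trace:
    ba.insert(pid, 1)
```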

2.4.1.2 Traces

To evaluate the accuracy of our framework in estimating the number of disk IOs per process, we collected disk IO traces from a Raspberry Pi executing the MQTT protocol (the commonly used application layer protocol for IoT [63]) and sending/receiving packets from an MQTT broker. Our trace file contains log entries, where each log entry corresponds to a disk IO operation and comprises the time stamp of that disk IO operation as well as the ID of the process that performed that disk IO. We captured disk IOs for 10 processes for a duration of 10 minutes.

2.4.1.3 Evaluation

Next, we first study the accuracy of the estimates using fixed values of the two parameters n (i.e. , the number of buckets in the bucket array) and m (i.e. , the number of buckets in each bucket subarray). After that, we study the effect of the change in the values of these two parameters on the accuracy of our framework. After that, we study the effect of the number of bucket epochs across which queries are spread, i.e. , the number of bucket epochs from which the query processing engine(s) have to estimate values in order for the control unit to be able to answer the query. Last, we discuss the memory requirements of our data structure in an IoT node. In all our experiments, we used b (i.e. , the size of each bucket) as 16 bits. To study the accuracy of IoTm, we implemented the INU emulator in Matlab that traverses the log file containing the traces and inserts the disk IO counts to the data structure using the desired values of n, m, and b . More specifically, the INU emulator simulates the time duration of 10 minutes. Every time the simulator time matches the time stamp of a log entry, the INU emulator increments a bucket in the bucket subarray corresponding to the process ID in that log entry, as described in Section 2.4.1.1. The INU transfers the data structure to CQU (also implemented in Matlab) every t minutes. This way, from the simulation with t minute wide bucket epochs, the CQU emulator receives 10/t bucket epochs. In different simulations, we used one of the following three values for t : {1, 5, 10}. The motivation behind performing simulations with different durations of bucket epochs (i.e. , using different values of t in different simulations) is twofold. First, it enables us to evaluate the accuracy of IoTm for queries that span a single bucket epoch as well as those that span multiple bucket epochs. Second, in evaluating the accuracy, the query processing engine is exposed to several ranges of the number of disk IOs per process. Figure 2.2 shows boxplot of the number of disk IOs of the 10 processes stored in each bucket epoch from three simulations corresponding to three values of t , i.e. , t = 1,5, and 10. For this figure, n = 20 and m = 5. Each boxplot is made from 10 values of the number of disk IOs, one for each process. We observe from this figure that indeed the bucket

epochs contain a variety of ranges of the number of disk IOs per process, with the smallest spans for 1-minute bucket epochs and the largest span for the 10-minute epoch.


Figure 2.2 The number of disk IOs stored in bucket epochs.

Accuracy

To study the overall accuracy, we used n = 20 and m = 5, and performed three simulations using t = 1, 5, and 10 minutes. From the bucket epochs resulting from each simulation, we queried the total number of disk IOs performed by each of the 10 processes throughout the duration of 10 minutes. Note that for the bucket epochs resulting from the simulation using t = 1, each query requires the query processing engine to estimate a value from each of the 10 bucket epochs. Similarly, for the bucket epochs resulting from the simulations using t = 5 and 10, each query requires the query processing engine to estimate a value from two and one bucket epochs, respectively. Figure 2.3 shows a scatter plot of the actual number of disk IOs vs the estimated number of disk IOs from the bucket epochs resulting from t = 1 minute. Due to space limitation, we do not show the scatter plots of the estimates obtained using the bucket epochs resulting from t = 5 and 10 minutes, rather only describe the observations. We observe from this figure that all estimates always lie within the ±5% error lines, i.e., our IoTm framework estimated the number of disk IOs for each process with less than 5% error. The estimates from the t = 5 and 10 minute wide bucket epochs have smaller error compared to the estimates from the t = 1 minute wide bucket epochs because for t = 1 minute wide bucket epochs, the query processing engine has to obtain 10 estimates from 10 bucket epochs. As each estimate has error, the more estimates involved in answering a query, the higher the overall error. We make two conclusions from the discussion above. First, IoTm accurately estimates the values of the number of disk IOs with less than 5% error. Second, increasing the bucket epoch duration reduces the error. However, increasing the bucket epoch duration may not always be feasible: a larger duration implies that more information is inserted into each bucket epoch, and thus the size b of each bucket needs to be increased, which resource constrained IoT nodes may not be able to afford.


Figure 2.3 Actual # of disk IOs vs. estimated # of disk IOs

Effect of the Number of Buckets in Array

To study the effect of n on the accuracy of the IoTm framework, we performed 7 sets of the same three simulations described above (i.e., using t = 1, 5, and 10), where each set of simulations used a unique value of n. For all these simulations, we kept m fixed at 5. From the bucket epochs resulting from each simulation, we queried the total number of disk IOs performed by each of the 10 processes throughout the duration of 10 minutes. Figure 2.4 plots the average relative error in the estimates across the 10 processes for different values of n and different bucket epoch durations. We define the relative error in an estimate as $|\text{actual value} - \text{estimated value}|/\text{actual value}$. We observe from this figure that as n increases, the relative error decreases. This is intuitive because when the value of n is large, the information of different processes is spread over more diverse sets of buckets, and each bucket stores information from a smaller number of processes. In other words, there is less noise in any given bucket when seen from the perspective of any process whose subarray the given bucket is in. We conclude from this discussion that the larger the n, the lower the error. However, very large values of n may not be possible for resource constrained IoT nodes.


Figure 2.4 Effect of n on average relative error (m = 5). Figure 2.5 Effect of m on average relative error (n = 20).


Figure 2.6 Effect of bucket epochs spanned by a query on avg. rel. error. Figure 2.7 Memory required as a function of the number of buckets.

Effect of the Number of Buckets in Subarray

To study the effect of m on the accuracy of IoTm, we performed 8 sets of the same three simulations (i.e., using t = 1, 5, and 10), where each set of simulations used a unique value of m. For all these simulations, we kept n fixed at 20. Figure 2.5 plots the average relative error in the estimates across the 10 processes for different values of m and different bucket epoch durations. We observe from this figure that the relative error is a convex function of m. The initial decrease in the relative error with the increase in m is because the increase in the number of buckets in a subarray increases the number of observations from which an estimate is obtained. It is a well-known result in estimation theory that the increase in the number of observations decreases the variance in the error of estimates [75]. The subsequent increase in the relative error is because as m increases, more and more processes start sharing the same bucket, which increases the amount of noise in each bucket and thus increases the error in estimates. We conclude from this discussion that given the values of other parameters, such as n and b, an optimal value of m exists that minimizes the error. In our future work, we plan to theoretically derive the expression to calculate this optimal value of m.

Effect of the Number of Bucket Epochs

To study the effect of the number of bucket epochs across which the query spans on the accuracy of our framework, we performed a simulation using t = 1, n = 20, and m = 5. This simulation resulted in ten 1-minute wide bucket epochs. We executed 10 sets of queries, where each set comprised 10 queries corresponding to the 10 processes to estimate their number of disk IOs. Each query for any given process in the i-th set of queries (1 ≤ i ≤ 10) was to estimate the number of disk IOs performed by that process in the first i minutes of the 10-minute trace. Thus, each query in the i-th set spans i bucket epochs. Figure 2.6 plots the average relative error across all queries in each set of queries. We observe from this figure that as the number of bucket epochs across which the query spans increases, the average relative error increases (although only slightly). This concurs with our earlier observation that using a larger duration for each bucket epoch is advisable as it decreases the number of bucket epochs from which any given estimate is obtained. However, as also discussed earlier, the larger duration may not always be feasible due to the need for increasing the

size of each bucket and the limited resources on IoT nodes. The system administrator should first determine how much memory is available for the data structure on the IoT nodes and how frequently the measurements will be taken. Based on this, he/she should decide the appropriate size of each bucket and the duration of bucket epochs.

Memory

Figure 2.7 plots the memory required by our data structure corresponding to the different values of n in Figure 2.4. Each point in Figure 2.7 is obtained by multiplying n with b = 16 bits. We observe from this figure that even for the highest accuracy configuration (i.e. , when n = 40), our data structure requires just 80 bytes of memory. Most IoT nodes, such as Raspberry Pi and Photon IO have much higher amounts of RAM available compared to 80 bytes. This shows that our proposed INU can easily be implemented on IoT nodes.

2.4.2 Round Trip Latency per Flow

Next, we describe how the IoTm framework can estimate the average round trip times (RTT) experienced by the packets in any flow. To measure RTT for any packet, we need two time stamps: one when the packet leaves the IoT node and another when its ACK arrives back. Consider an arbitrary flow with ID f that has l packets. The ID of a flow can be any flow identifier, such as the standard five tuple. Let $S_i$ represent the time stamp when the $i$th packet of this flow is sent and let $R_i$ represent the time stamp when the ACK of this $i$th packet arrives back. The average RTT of the packets in a flow f is:
\[
\text{RTT}_f = \frac{(R_1 - S_1) + \cdots + (R_l - S_l)}{l} = \frac{1}{l}\left(\sum_{i=1}^{l} R_i - \sum_{i=1}^{l} S_i\right)
\]
This equation shows that if we simply add the time stamp of each ACK in our data structure and subtract the time stamp of each sent packet from the data structure, then we will be storing the sum of RTT values of all packets in the data structure. This method, however, will work only if there are no packet losses. To demonstrate the use of our framework for measuring average RTTs per flow, we assume no packet losses. Handling packet losses is straightforward: maintain two bucket arrays for separately storing the time stamps of sent and received packets and another two bucket arrays to separately keep a count of the number of sent and received packets.

2.4.2.1 Method

When the control unit instructs INU on any IoT node to start recording RTT values of flows, the INU initializes the bucket arrays B and C, each comprising n buckets. To make sure that the time stamp of the ACK of a packet is added to the same bucket from which the time stamp of its transmission time was subtracted, every time a packet of a flow ID f is transmitted or an ACK of a packet of that flow ID f arrives, instead of randomly choosing a number j in the range [1,m] and appending it to flow ID before calculating the hash function, INU applies the modulo m operation on the packet

sequence number, adds 1 to it, and appends that to the flow ID. It then calculates the hash function H(f, Seq# % m + 1) whose output is uniformly distributed in the range [1,n], and adds the time stamp to the bucket B[H(f, Seq# % m + 1)] if processing a received ACK or subtracts the time stamp from the bucket B[H(f, Seq# % m + 1)] if processing a transmitted packet. For each sent packet of any given flow with ID f, it also increments the corresponding bucket C[H(f, Seq# % m + 1)] by one to keep count of the number of packets of that flow. To estimate the average RTT experienced by the packets of any given flow with ID f over a desired period of time, the control unit first identifies all bucket epochs whose time frames overlap with that desired period of time. Next, it asks the query processing engine(s) to estimate the sum of RTTs from each identified bucket epoch B and the count of the number of transmitted packets from each corresponding bucket epoch C. The query processing engine(s) use Eq. (2.7) to obtain these estimates and send them to the control unit. The control unit then divides the sum of estimates from all B bucket epochs by the sum of estimates from all C bucket epochs to estimate the average RTT experienced by the packets of the flow f.
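Below is a minimal Python sketch of the update path just described, assuming no packet losses and using the same SHA-256 stand-in for the hash function as before (the class and method names are illustrative). The per-epoch estimation via Eq. (2.7) and the final division of the summed B estimates by the summed C estimates happen at the query processing engine and are not shown.

```python
import hashlib

class RTTBucketArrays:
    """Minimal sketch of the B (time stamp) and C (packet count) bucket arrays."""

    def __init__(self, n: int, m: int):
        self.n, self.m = n, m
        self.B = [0] * n   # accumulates (ACK time stamps) - (send time stamps)
        self.C = [0] * n   # counts transmitted packets

    def _hash(self, flow_id: str, seq: int) -> int:
        # Stand-in for H(f, Seq# % m + 1): a packet and its ACK map to the same bucket.
        key = f"{flow_id}:{seq % self.m + 1}"
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.n

    def on_packet_sent(self, flow_id: str, seq: int, ts_ms: int) -> None:
        idx = self._hash(flow_id, seq)
        self.B[idx] -= ts_ms   # subtract the transmission time stamp
        self.C[idx] += 1       # count the transmitted packet

    def on_ack_received(self, flow_id: str, seq: int, ts_ms: int) -> None:
        self.B[self._hash(flow_id, seq)] += ts_ms   # add the ACK time stamp
```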

2.4.2.2 Traces

To evaluate the accuracy of our framework, we collected traces from a Raspberry Pi executing MQTT protocol and sending/receiving packets from an MQTT broker. The Raspberry Pi ran 50 different processes, each having a unique persistent TCP connection with the MQTT broker. This resulted in 50 flows with distinct IDs. The flow ID is the standard 5 tuple. Each process sent 100 byte messages to the MQTT server on up to 10 different topics. Furthermore, each process sent messages at a different rate: the process with flow ID $f_i$ (1 ≤ i ≤ 50) carried an average of i messages per second with exponentially distributed inter-arrival times. We used tcpdump to log all departing and arriving traffic at the Raspberry Pi for a duration of 10 minutes. The pcap files resulting from the tcpdump contain the flow ID for each packet as well as the TCP sequence numbers, which the INU uses to decide the bucket in which the time stamp of an ACK should be added and from which the time stamp of a sent packet should be subtracted. Figures 2.8 and 2.9 plot the CDFs of the packet counts of the 50 flows and the RTTs experienced by the packets across these 50 flows, respectively. We observe from these figures that we have flows ranging from very few packets to a very large number of packets and latencies ranging from a few milliseconds up to about 50 milliseconds.


Figure 2.8 CDF of packet counts of flows. Figure 2.9 CDF of RTTs of all packets.

2.4.2.3 Evaluation

In all our experiments, we used b = 32 bits because time stamp values at the granularity of milliseconds require larger memory compared to the number of disk IO values. Similar to Section 2.4.1.3, we used our INU emulator, which traverses the pcap file. Every time the simulation time matches the time stamp of a packet in the pcap file, the INU emulator increments a bucket in the C bucket subarray corresponding to the flow ID of that packet, and adds (subtracts) the time stamp of the received ACK (sent packet) to (from) the corresponding bucket in the B bucket subarray.

Accuracy

To study the accuracy, we used n = 20 and m = 5, and performed three simulations using t = 1, 5, and 10 minutes. From the bucket epochs resulting from each simulation, we queried the average RTTs experienced by each of the 50 flows. Figures 2.10(a), 2.10(b), and 2.10(c) show scatter plots of the actual average RTTs of each flow vs. the estimated average RTTs from the bucket epochs resulting from t = 1, 5, and 10 minutes, respectively. We observe from these figures that the estimates always lie within the ±7% error lines when t = 1 and within the ±5% error lines when t = 5 and 10.


(a) t = 1 minute (b) t = 5 minutes (c) t = 10 minutes

Figure 2.10 Actual average RTT per flow vs. estimated average RTT per flow

Effect of Number of Instances

To study the effect of the number of instances on the accuracy of IoTm, we performed 9 more simulations by varying the number of flows from 10 to 50 in steps of 5, and using n = 20, m = 5, and t = 10. In any simulation with i flows (where i = 5j and j ∈ [2,10]), we randomly selected the i flows out of our 50 flows. Figure 2.11 plots the average relative error in the estimates of the RTTs of the i flows averaged over 100 runs of the simulation with i flows. We observe from this figure that, as the number of flows increases, the error slowly increases. This happens because with the increase in the number of flows, more and more flows start sharing the same bucket, which increases the amount of noise in each bucket and thus increases the error in estimates. Nonetheless, the error

is still small. This slight loss in accuracy brings significant reduction in the memory requirements compared to the naive approach, as we describe next.

Memory

The memory required by our data structure for measuring average RTT is four times that required for measuring the number of disk IOs because average RTTs need bucket epoch Cs in addition to bucket epoch Bs and the size of each bucket is twice the size of each bucket used for disk IOs. Nonetheless, any low-end IoT node can still easily implement our INU and the data structure for measuring latencies. Note that if one were to use the naive approach of maintaining a unique counter for each flow, as discussed in Section 2.2.2, one would need 50 × 2 32-bit counters for the 50 flows compared to just 20 × 2 32-bit buckets used by our data structure. Thus, among the other advantages mentioned in Section 2.2.2 of using our data structure instead of the naive approach, in this particular example, our data structure requires 2.5 times less memory compared to the naive approach. The saving in memory increases further with the increase in the number of instances, which makes our data structure a lot more scalable compared to the naive approach.
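For concreteness, the numbers in this example work out as follows:
\[
\text{naive: } 50 \times 2 \times 32\ \text{bits} = 3200\ \text{bits} = 400\ \text{bytes}, \qquad
\text{IoTm: } 20 \times 2 \times 32\ \text{bits} = 1280\ \text{bits} = 160\ \text{bytes},
\]
so the naive approach requires $400/160 = 2.5$ times as much memory as our data structure.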

2.5 Related Work

While the problem of measuring IoT performance metrics has largely been unexplored, work has been done on measuring performance metrics in conventional networks and systems. Next, we first describe a relevant class of data structures, called sketches, which are frequently used to store performance metrics in conventional networks and systems. After that, we describe a recently proposed framework, namely Trumpet [96], designed for fine-grained network monitoring in data centers. Last, we present some representative prior work on measuring performance metrics in conventional networks.

0.05

0.04

0.03

0.02

Average relative error 0.01

0 10 20 30 40 50 # of flows in array

Figure 2.11 Effect of the number of instances on average relative error

23 2.5.1 Sketches

Count-Min (CM) sketch is the most relevant data structure that can be used to store and estimate the sums and counts of various performance metrics [30]. It has been extensively used in conventional networks and systems [22, 145]. Several other variants of CM-sketch also exist, such as Count sketch [21], conservative update sketch [40], and CM-log-sketch [108]. CM-sketch falls short from two perspectives when measuring IoT performance metrics. First, it requires d hash computations and d memory updates per insertion, which can be too much work for resource constrained IoT nodes. In comparison, our data structure requires just a single hash computation and memory update per insertion. Second, CM-sketch achieves a similar error bound as our proposed data structure only if we use a dedicated sketch for each instance. Let us demonstrate this using the number of disk IOs as an example. Let $s_p$ represent the number of disk IOs of a process with ID $p$ and let there be $N$ processes in total whose numbers of disk IOs need to be recorded. We can use a CM-sketch to store the number of disk IOs and obtain the estimate $\tilde{s}_p$ of the number of disk IOs of any process with ID $p$, as per the method described in [30]. The estimate $\tilde{s}_p$ obtained through CM-sketch satisfies the condition $\tilde{s}_p \le s_p + \epsilon \times \sum_{j=1}^{N} s_j$ with probability $\xi$ [30]. In comparison, the estimate $\tilde{s}_p$ obtained through our data structure satisfies the condition $\tilde{s}_p \le s_p + \epsilon \times s_p$ with probability $\xi$. To achieve a similar error bound using CM-sketch, we need to ensure that $\sum_{j=1}^{N} s_j \le s_p$, which is possible only if we do not add the number of disk IOs of any process other than $p$ to the CM-sketch. Thus, we need a CM-sketch for each process, leading to prohibitively large memory requirements that resource constrained IoT nodes cannot provision.
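To make the computational contrast concrete, the following is a minimal Python sketch of CM-sketch insertion (d hash computations and d memory updates per item) next to the single-hash update used by our data structure. The hash family and class names below are illustrative stand-ins, not the exact constructions of [30] or of our INU.

```python
import hashlib

def _h(seed: int, key: str, width: int) -> int:
    # Stand-in hash family; not the pairwise-independent hashes used in [30].
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % width

class CountMinSketch:
    """d rows of w counters; every insertion touches d counters."""
    def __init__(self, d: int, w: int):
        self.d, self.w = d, w
        self.rows = [[0] * w for _ in range(d)]

    def insert(self, key: str, count: int = 1) -> None:
        for i in range(self.d):                         # d hash computations
            self.rows[i][_h(i, key, self.w)] += count   # d memory updates

    def estimate(self, key: str) -> int:
        return min(self.rows[i][_h(i, key, self.w)] for i in range(self.d))

class SingleHashBuckets:
    """IoTm-style update: one hash computation and one memory update per insertion."""
    def __init__(self, n: int):
        self.B = [0] * n

    def insert(self, key: str, j: int, count: int = 1) -> None:
        self.B[_h(0, f"{key}:{j}", len(self.B))] += count   # single update
```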

2.5.2 Data Center Monitoring Framework

Trumpet is a framework similar to IoTm in spirit, but designed for measuring performance metrics and detecting events of interest in data centers [96]. It leverages abundance of compute & memory resources and programmability at end hosts to monitor every packet and to report events. Trumpet detects events of interest using triggers at end hosts. It evaluates triggers by inspecting packets at full line rate and reports events to a controller. The fundamental difference between Trumpet and IoTm is that Trumpet assumes abundance of compute & memory resources and remote programmability at end host, whereas IoTm assumes neither. To avoid requiring abundance of compute & memory resources, IoTm delegates decision making about the metrics to monitor and the nodes on which to monitor them to the control unit in CQU. To avoid requiring remote programmability at IoT nodes, IoTm employs a generic data structure in INU that can record measurements for a variety of metrics. Thus, IoTm is well suited for IoT implementations while Trumpet is well suited for data centers.

2.5.3 Other Measurement Schemes

Due to space limitation, we summarize the existing work only on the measurement of one QoS metric, namely latency, for conventional networks and describe how existing work is infeasible for measuring IoT metrics. Existing schemes for almost all QoS & RU metrics can be broadly divided

into two categories: active and passive. Active measurement schemes rely on performing active operations, such as injecting probe packets, to measure the performance metrics. Such schemes are usually easy to implement, but can alter the true value of the metrics due to active operations. Passive schemes do away with active operations but require more computational and memory resources to measure the performance metric.

Active Latency Measurement Schemes

Lee et al. proposed MAPLE, a scheme to efficiently store latencies of packets [83]. MAPLE attaches a time stamp to each packet at the sender, and the receiver calculates the latencies by subtracting the attached time stamp from its current time. Attaching time stamps is the key limitation of MAPLE because it not only requires modifications to standardized packet header formats and data forwarding paths of existing routers and middleboxes but also puts additional work on resource constrained IoT nodes. Furthermore, attaching time stamps can consume up to 10% of the bandwidth [78]. Lee et al. proposed RLI, which measures latency of any given flow by inserting time stamped probe packets into the flow [82]. To calculate the latency of the regular packets between two probe packets, RLI applies straight line interpolation. Inserting probe packets is the key limitation of RLI because for fine-grained measurement, the number of probe packets is large and the latency measured with a large number of probe packets significantly deviates from the real latency. The bandwidth that such probe packets waste could instead be utilized for in-band transportation of the measurement data collected by the IoT nodes.
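As an illustration of the straight-line interpolation step (a simplified sketch, not RLI's actual implementation), the latency of a regular packet can be interpolated from the latencies measured by the two surrounding probes:

```python
def interpolate_latency(t_probe1: float, d_probe1: float,
                        t_probe2: float, d_probe2: float, t_pkt: float) -> float:
    """Estimate the latency of a regular packet sent at time t_pkt by linearly
    interpolating between the measured latencies of the two surrounding probes."""
    if t_probe2 == t_probe1:
        return d_probe1
    slope = (d_probe2 - d_probe1) / (t_probe2 - t_probe1)
    return d_probe1 + slope * (t_pkt - t_probe1)

# Example: probes at t=0s (10 ms) and t=1s (14 ms); a packet sent at t=0.25s
# gets an interpolated latency estimate of 11 ms.
print(interpolate_latency(0.0, 10.0, 1.0, 14.0, 0.25))
```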

Passive Latency Measurement Schemes

LDA provides passive but aggregate latency measurement between a sender and a receiver [78]. As it does not provide fine-grained latency measurements, it cannot be used to understand the causes of sudden and short-lived deteriorations in the performance of IoT networks. In LDA, both the sender and the receiver maintain a counter vector where each element is a pair of counters: a time stamp counter for accumulating packet time stamps and a packet counter for counting the number of arriving/departing packets. For each arriving or departing packet, LDA randomly maps the packet to a counter pair in the counter vector, adds the time stamp of the packet to the time stamp counter, and increments the packet counter by one. To obtain the aggregate latency estimate, LDA checks, for each counter pair, whether the packet counter value is the same at the sender and the receiver, and selects all counter pairs for which it is. Finally, LDA obtains the aggregate average latency by subtracting the sum of time stamps at the sender side from that at the receiver side and dividing it by the total number of successfully delivered packets.
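The LDA bookkeeping described above can be sketched as follows (a simplified, loss-aware illustration with a stand-in hash and names of our own choosing, not the published implementation):

```python
import hashlib

class LDAVector:
    """One counter vector: each element is a (time stamp accumulator, packet count) pair."""
    def __init__(self, size: int):
        self.size = size
        self.ts = [0.0] * size    # time stamp counters
        self.cnt = [0] * size     # packet counters

    def _bucket(self, pkt_id: str) -> int:
        digest = hashlib.sha256(pkt_id.encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.size

    def record(self, pkt_id: str, timestamp: float) -> None:
        b = self._bucket(pkt_id)
        self.ts[b] += timestamp
        self.cnt[b] += 1

def aggregate_latency(sender: LDAVector, receiver: LDAVector) -> float:
    # Use only counter pairs whose packet counts match on both sides, then divide
    # the time stamp difference by the number of successfully delivered packets.
    ts_diff, delivered = 0.0, 0
    for b in range(sender.size):
        if sender.cnt[b] == receiver.cnt[b] and sender.cnt[b] > 0:
            ts_diff += receiver.ts[b] - sender.ts[b]
            delivered += sender.cnt[b]
    return ts_diff / delivered if delivered else float("nan")
```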

2.6 Conclusion

The key contribution of this work is in proposing IoTm, a framework for measuring IoT performance metrics, and demonstrating its use and accuracy by applying it to measure various IoT performance

metrics. The key technical depth of this work is in the design and analysis of our generic data structure as well as the estimation theory that enables the query processing engines to accurately estimate the values of the desired metrics. IoTm is lightweight in terms of both computational resources and physical memory. It requires just a single hash computation and memory update per measurement and only a few tens of bytes of memory to store several minutes worth of measurements. This makes IoTm amenable for implementation on resource constrained IoT nodes. Our experimental results showed that our framework can achieve high accuracy (over 95%) in estimating a variety of IoT performance metrics. In the future, we plan to develop theoretical models to calculate the optimal values of the parameters of the data structure used in IoTm.

CHAPTER 3

CHARACTERIZING THE PERFORMANCE OF WIFI IN DENSE IOT DEPLOYMENTS

3.1 Introduction

The number of devices connected to the internet is increasing at an unprecedented rate. It is estimated that by the year 2020, the number of devices connected to the internet will exceed 24 billion [51]. While these devices will include conventional devices such as laptops, mobile phones, and web servers, the major portion of these devices will be contributed by the Internet of Things (IoT). IEEE 802.11n/ac, more commonly known as WiFi, is the most popular and convenient wireless technology to connect devices to the internet, especially in home and enterprise networks. As more and more devices are connecting to the internet over WiFi, maintaining reliable WiFi connectivity is becoming problematic due to the distributed and uncoordinated nature of the medium access control (MAC) of IEEE 802.11n/ac. The direct effect of this limitation is that the usable share of the wireless bandwidth for individual clients decreases, which results in the degradation of performance in the form of lower throughput and higher latency. This, in turn, leads to poor quality of experience for users in home networks and financial and functional losses for users in enterprise networks. For example, IoT applications need varying amounts of performance guarantees in order to meet their service level agreements. Low latency is a key requirement in applications such as connected vehicle monitoring and aircraft vibration level monitoring. Similarly,

high throughput is a key requirement in enterprise applications that involve a large number of IoT devices such as smart grid and industrial equipment. As the wireless environments are becoming increasingly dense due to the proliferation of IoT devices and as IEEE 802.11n/ac's current MAC is facing problems in dense IoT deployments, a need for a MAC protocol that is tailored for IoT devices is rising. In order to identify the aspects of the existing MAC that should be updated, it is imperative to first identify how various aspects of the current MAC are holding up in the new dense IoT environments, and what the root causes are behind the deterioration of the aspects that do not perform well in these environments. To study these aspects, and to identify the causes for any deterioration, in this work, we conduct a comprehensive measurement study of various aspects of IEEE 802.11n/ac's MAC using real IoT devices and realistic IoT workloads. More specifically, we characterize the throughput of individual IoT devices as well as of the aggregate wireless system for five different emulated classes of IoT devices, which range from simple sensors that generate as little as 10kbps of data to complex CCTV security cameras that generate data at the rate of 5Mbps. We also study other parameters of IEEE 802.11n/ac's MAC that include RTS/CTS bandwidth, block acknowledgments, and frame aggregation. We further study the impact of TCP's congestion control on the performance of IEEE 802.11n/ac's MAC. Protocols like IEEE 802.15.4 and Bluetooth Low Energy (BLE) were proposed specifically for the purpose of solving interconnection issues in IoT networks. But we focus our study only on IEEE 802.11n/ac for the following two reasons. First, the ubiquity of IEEE 802.11n/ac infrastructure. WiFi access points are commonplace in home and enterprise deployments and, therefore, newer devices can connect to WiFi networks easily and without any extra infrastructure. This is evident from a growing number of household appliances, such as smart TVs, smart refrigerators, and personal assistants, that depend on WiFi to provide their services. On the other hand, devices operating under the IEEE 802.15.4 specification need separate bridges to facilitate communication between those devices and the internet. Second, continuous access to the internet. An essential component of the IoT paradigm, apart from data sensing, lies in providing meaningful services based on the sensed data. There is an increasing trend of business applications that provide remote access and control to users to manage home devices and appliances. For example, applications like remote home surveillance relay a continuous video and audio feed of the house to a smartphone. A primary requirement of these applications is sustained internet access to the monitoring devices and a high throughput link. This cannot be achieved with BLE as it is a short range protocol and works by querying information from locally connected devices [113]. Keeping these constraints in mind, we base our measurement study solely on IEEE 802.11n/ac networks as they provide high throughput and sustain large network sizes. We leave the study of the coexistence of BLE and other low power protocols with WiFi to future work. Although several prior measurement studies have been conducted on congested wireless networks, they were conducted using IEEE 802.11a, b, or g, which are all now considered legacy compared to IEEE 802.11n/ac. In [71, 70], Jardosh et al.
showed that the link quality in congested networks is dependent on the data rates and frame sizes. Aguayo et al. studied the relation of signal to

noise ratio with packet losses in a large network [2]. Balachandran et al. characterized the wireless link performance under different user behaviors such as session lengths and types of applications [13]. Other studies such as [107, 18, 114] focused on the implications of large scale wireless deployments or mesh networks on user level metrics. For example, Brik et al. showed that a major bottleneck in a multi-hop network is the last hop wireless link [18]. Raghavendra et al. showed that a significant amount of traffic is wasted on just ensuring that the clients are still connected to the WiFi access point (AP) [114]. While all of these studies are very interesting and useful, they do not provide much insight on the efficiency of IEEE 802.11n/ac's MAC in dense IoT networks. One of the key reasons behind this is that the traffic model in IoT deployments is very different from the traffic models in conventional networks. In conventional wireless networks, the devices use conventional applications such as web and file downloads, which are well-known to give rise to significantly larger downlink traffic compared to the uplink traffic. In contrast, in IoT networks, IoT nodes primarily function as sensors and transmit sensed data to some remote location, such as a cloud service, for processing, which gives rise to significantly larger uplink traffic compared to the downlink traffic. This creates a fundamental difference: in the case of conventional networks, the majority of the data is going from the access point to the clients connected to it, and the access point silences the clients when it has to transmit. Consequently, there are very few collisions on the wireless channel during most of the data transmission. Contrary to that, in the case of IoT networks, the majority of the data is going from the clients to the access point. As there are many clients in a dense IoT deployment, the number of collisions on the wireless channel increases manyfold, which results in the deterioration of the overall throughput of the wireless network. In this work, we make the following three key contributions. First, we construct an IoT testbed using Raspberry Pis that emulate a dense IoT deployment, and study the performance of IEEE 802.11n/ac's MAC by emulating five different classes of IoT devices on the Raspberry Pis in a real world wireless environment. Second, we analyze the inter-dependencies between the link layer and the transport layer in regulating the system throughput in dense IoT deployments and show that the congestion control methods in TCP have detrimental impacts on the performance of IEEE 802.11n/ac's MAC, and must be updated to enable IoT devices to efficiently communicate in dense IoT environments. Third, we have collected an extensive number of wireless traffic traces using real IoT devices. These traces emulate various dense as well as sparse IoT deployments.

Chapter organization

In Section 3.2, we briefly discuss prior work on studying the performance of IEEE 802.11 and contrast our study with the prior studies. In Section 3.3, we present a brief overview of the MAC protocol of IEEE 802.11n/ac, which will be useful in understanding various observations that we will present from our wireless traffic traces. In Section 3.4, we describe our testbed, how we conducted the experiments, and how we collected the wireless traffic traces. In Section 3.5, we present our observations from the traces that we collected and highlight the aspects of IEEE 802.11n/ac’s MAC and of TCP that should be revised to enable IoT devices to efficiently communicate using IEEE

802.11n/ac in dense environments. Finally, in Section 3.6, we describe the key take-aways from the observations that we present in this chapter and discuss the possible research directions to address the limitations highlighted by these observations.

3.2 Related work

Researchers have previously performed several measurement studies that evaluate WiFi's performance in legacy networks and in large scale deployments. In this section, we discuss some of the representative prior studies and contrast them with the measurement study that we present in this chapter. Jardosh et al. studied the effects of congestion in a wireless network by analyzing traffic traces obtained from a network deployed during the 62nd IETF conference [71, 70]. They showed that several link layer parameters such as frame retransmission, frame size, and data rates directly affect a wireless link's reliability and wireless channel's congestion. An important result from their study was that the use of rate adaptation in response to congestion has detrimental effects on network performance. In [2], Aguayo et al. explored the possible causes for packet losses in a 38-node urban multi-hop IEEE 802.11b network. Their measurement study showed that link distance and the signal to noise ratio only have a weak correlation with the packet loss rates, whereas multi-path fading is strongly correlated. In [13], Balachandran et al. studied how WiFi networks perform under a variety of user behaviors. The authors characterized user behavior in terms of connection session length, user distribution across APs, mobility, application mix, and bandwidth requirements. Their key observation was that load distribution is not correlated with the number of clients, i.e., load balancing solely based on the number of associated users may be detrimental to the overall performance of the WiFi network. While these prior studies developed useful insights, they were conducted on legacy versions of IEEE 802.11, which, unlike the more recent IEEE 802.11n/ac, do not benefit from advancements such as frame aggregation, MIMO, and beam-forming. In our study, we worked with the most recent version of IEEE 802.11, i.e., the n/ac version. Another key difference between our study and the prior studies is that prior studies used downlink-heavy traffic, which is appropriate only for conventional networks. In IoT networks, the traffic is uplink-heavy. In [107], Patro et al. distributed access points containing sniffers to home owners and collected data over a period of 6 months. Using this data, they analyzed user-space quality of service of WiFi and developed a generic link quality metric that was based on throughput, downtimes, and interference from other APs over an extended period of time. Raghavendra et al. studied the effects of unwanted link layer traffic on the overall link quality [114]. Unwanted traffic results from protocol operations such as initiating, maintaining, and changing client-AP relationship. The authors showed that such unwanted traffic can result in significant overhead and throughput deterioration in the WiFi network. In [18], Brik et al. presented a measurement study of a large-scale urban WiFi mesh network consisting of more than 250 Mesh Access Points. They concluded that the last hop 2.4GHz wireless link between the mesh and the client is the major bottleneck that limits the throughput

achieved by the client. In addition to using IoT-specific uplink-heavy traffic, the key difference between our study and such prior studies is that the prior studies focused only on the application layer performance, whereas we not only measure the application layer performance using IoT traffic, but also analyze the link layer parameters to understand exactly which aspects of IEEE 802.11n/ac's MAC get impacted in different IoT deployment scenarios.

3.3 Overview of IEEE 802.11’s MAC & Aggregation

To make the chapter self-contained and to facilitate the readers to better understand the observations that we will present from our measurement study, in this section, we briefly summarize the distributed coordination function (DCF): the MAC protocol of IEEE 802.11. We also summarize the frame aggregation technique, leveraged by the n/ac version to improve the throughput of clients. For a more complete and detailed description, we refer the readers to the standard document [65].

3.3.1 Regular DCF

When a new frame arrives in the link layer transmit buffer of a client, the client starts to monitor the channel activity. If it finds the channel to be idle for a period of distributed inter frame space (DIFS), it transmits the frame at the end of this DIFS period. If it finds the channel to be busy at the start of or during the DIFS period, it continues to monitor the channel until the channel becomes idle for a DIFS period. After DIFS, the client selects a random number of discretized time slots, "backs off" for that many time slots, and then transmits. This random backoff functionality of the DCF minimizes the probability of a collision when multiple clients operating in the same channel have frames to transmit because even when two clients sense the channel to be idle for a period of DIFS, with high probability, they select different numbers of backoff time slots. The exact duration of the time slot depends on the PHY layer that the client is using, and accounts for the propagation delay, the time needed to switch from the receiving to the transmitting state, and the time needed to signal the MAC layer about the channel state. The client decrements its backoff counter at the end of each idle time slot. If it detects a transmission during any time slot, it freezes its backoff counter and resumes decrementing it after the channel is sensed idle again for a DIFS period. The number of time slots in the backoff interval is uniformly chosen from the range [0, W-1], where W is called the initial contention window size. If two clients choose the same backoff interval after the DIFS period, then they would start their transmission in the same time slot, causing a collision. After every unsuccessful transmission attempt, a client doubles its contention window size up to a maximum value. If a frame is transmitted correctly, the receiver sends back a small acknowledgment (ACK) frame after waiting for a short inter frame space (SIFS). The transmitting client considers its frame transmission successful only if it receives this acknowledgment within a certain amount of time after the transmission of the frame. The receiver also has the option of sending a cumulative acknowledgment for multiple frames. Such cumulative acknowledgments are called block ACKs.

Each block ACK contains a sequence number of the starting frame that it is acknowledging as well as a bitmap that describes which other frames, counting from the starting frame, it is acknowledging. For example, in a block ACK, if the starting sequence number is 5 and the bitmap is 0xD (i.e., 1101), it means that this block ACK is acknowledging three frames that have sequence numbers 5, 7, and 8.
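The bitmap interpretation in the example above can be checked with a few lines of code (a toy decoder that follows the bitmap convention used in the example, not a parser for actual 802.11 block ACK frames):

```python
def acked_sequence_numbers(start_seq: int, bitmap: int) -> list:
    """Sequence numbers acknowledged by a block ACK: bit i of the bitmap
    (least significant bit first) covers the frame start_seq + i."""
    return [start_seq + i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Starting sequence number 5 with bitmap 0xD (binary 1101) acknowledges frames 5, 7, and 8.
print(acked_sequence_numbers(5, 0xD))  # [5, 7, 8]
```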

3.3.2 RTS/CTS-based DCF

The clients also have the option of using an additional layer of handshaking in the form of request to send (RTS) and clear to send (CTS) frames. More specifically, when a frame arrives in the buffer, instead of transmitting the frame after waiting for the DIFS period and the backoff interval (if any), the client transmits an RTS frame. If the AP receives the RTS frame correctly, it transmits back a CTS frame after the SIFS duration. On receiving the CTS frame, the client transmits the actual frame after another SIFS duration. The motivation behind using the RTS/CTS based method is twofold. First, it helps eliminate the hidden terminal problem [130]. Second, in the case of large frames, the impact of collisions on the wireless channel is minimized, because a collision happens either at the time of transmission of the RTS frame or at the time of the CTS frame, which are much shorter compared to the regular frames. If the transmission of the RTS and CTS frames is successful, then ideally, the remaining clients in the network should stay quiet during the transmission of the actual frame. If the RTS/CTS approach is not used, and a collision happens while a large frame is being transmitted, the collision lasts throughout the duration of the transmission of this large frame because in wireless communication, after a client starts to transmit, it has no way of detecting a collision. Consequently, the wireless bandwidth is wasted for a considerably longer duration of time compared to when the RTS/CTS-based method is employed. In typical implementations, clients use a threshold to determine whether they will use the RTS/CTS method or the regular method. If the size of a frame to be transmitted exceeds that threshold, they use RTS/CTS-based DCF; otherwise, they use the regular DCF. This threshold is usually set at 2347 bytes.
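As a rough illustration of the contention behavior and the frame-size threshold described in Sections 3.3.1 and 3.3.2, the toy model below estimates how often saturated clients pick the same backoff slot. The contention window value, trial count, and function names are illustrative assumptions; this is not a faithful 802.11 simulator.

```python
import random

RTS_CTS_THRESHOLD = 2347   # bytes; the typical default mentioned above

def use_rts_cts(frame_size_bytes: int) -> bool:
    # Frames larger than the threshold are sent with the RTS/CTS handshake.
    return frame_size_bytes > RTS_CTS_THRESHOLD

def collision_probability(num_clients: int, cw: int = 16, trials: int = 100_000) -> float:
    """Estimate how often two or more saturated clients pick the same minimum
    backoff slot from [0, cw-1] and therefore start transmitting together."""
    collisions = 0
    for _ in range(trials):
        slots = [random.randrange(cw) for _ in range(num_clients)]
        if slots.count(min(slots)) > 1:
            collisions += 1
    return collisions / trials

# More contending clients -> more collisions -> doubled contention windows and wasted airtime.
for n in (3, 10, 30, 45):
    print(n, round(collision_probability(n), 3))
```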

3.3.3 Frame Aggregation

The transmission of each frame incurs a set of overheads, which include MAC and PHY headers, inter frame spaces, and backoff delays. To reduce the wastage of bandwidth due to these overheads, the IEEE 802.11n introduced a method called frame aggregation. Frame aggregation combines multiple regular frames, which would otherwise have been transmitted individually incurring the set of overheads with each transmission, into a single big frame that is transmitted incurring only a single set of these overheads. Kim et al. showed that larger aggregation sizes improve saturation throughput of WiFi networks because the aggregation reduces the frame overheads [76]. IEEE 802.11n/ac employs two types of frame aggregation methods: aggregation of MAC service data units called AMSDUs, and aggregation of MAC protocol data units called AMPDUs. MSDUs are frames forwarded from the logical link control layer to the MAC while MPDUs are the frames forwarded by MAC to the physical layer. In this work, our focus will be on the AMPDUs because the aggregation

of MPDUs is the default and the most commonly used aggregation method employed by the IEEE 802.11n/ac devices.

3.4 Experimental Setup

In this section, we first describe the testbed on which we conducted all our experiments. After that we describe the five classes of traffic that the devices in our testbed generated. Finally, we describe the experiments that we conducted.

3.4.1 Testbed Setup

Our testbed is comprised of 45 Raspberry Pi 3 (RPi) [117] IoT prototype devices, deployed inside a 25ft × 16ft room. The motivation behind putting them in a single, albeit large, room was to create an environment that is very densely populated with IoT devices. The RPis ran the default Raspbian Linux with kernel version 4.4.21-v7+. The TCP variant that Raspbian Linux uses is TCP Cubic, which is also the default version of TCP in the latest releases of almost all flavors of Linux [26]. While RPis come equipped with built-in WiFi radios, we used an external Edimax N150 WiFi adapter [38] on each RPi to increase the wireless communication range and reliability. We used a NETGEAR R6700 [98] as the access point to which all RPis connect and communicate within the 2.4GHz band on the 20 MHz wide channel (channel # 11). The AP was connected over the Ethernet to a server machine powered by an Intel i7-6700HQ processor and 16GB RAM. The motivation behind using channel number 11 was to avoid interference from the university network that our testbed coexists with. The university network does not use channel number 11. The default drivers for the Edimax WiFi adapter use a 400ns short guard interval [27]. Consequently, the maximum achievable channel bit rate in our setup is 72.2 Mbps. The clients used the default threshold for deciding whether to use the RTS/CTS-based DCF or the regular DCF. In Linux, this default value is 2347 bytes. Our testbed also contains a sniffer made using a dedicated RPi with the Nexmon patch [121]. The patch enables the RPi to operate in monitor mode and sniff any WiFi frames that are being transmitted over the wireless medium. The sniffed frames contain the MAC and PHY headers along with the payloads.
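Assuming single-stream IEEE 802.11n MCS 7 on a 20 MHz channel (an assumption consistent with, but not explicitly configured for, the figure above), the 72.2 Mbps rate follows from 52 data subcarriers carrying 64-QAM symbols with rate-5/6 coding, i.e., 260 bits per OFDM symbol, and a symbol duration of 3.2 µs plus the 400 ns short guard interval:
\[
\frac{52 \times 6 \times \tfrac{5}{6}\ \text{bits}}{3.2\,\mu\text{s} + 0.4\,\mu\text{s}} = \frac{260\ \text{bits}}{3.6\,\mu\text{s}} \approx 72.2\ \text{Mbps}.
\]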

3.4.2 IoT Traffics

Real world network applications transmit information across a spectrum of data sizes to and from the server. Since the purpose of this study is to observe the performance of MAC in dense IoT networks, we have used only uplink traffic, where multiple clients send data to the server via the single AP. Typical IoT deployments are sensor networks that sense various physical parameters and relay that data to a central processing engine. While IoT deployments often also contain actuators, the actuator traffic from the AP to clients is minimal as it just contains simple instructions to change the states of the actuators. Furthermore, the study of the downlink traffic is not of much interest because there is no contention for the medium when the traffic originates from the access point, as the access point silences all clients when it has to transmit.

Depending on the nature of the IoT application, clients can transmit data at rates from a few kbps to several Mbps. For example, moisture sensing nodes transmit data with a larger time period and in smaller quantities, whereas surveillance devices that continuously stream video data transmit continuously at a very high bit rate. To simulate various IoT application scenarios, we have abstracted the data transfer in the form of a file transfer. For example, if we want to generate 10Kbps traffic from each RPi to the server, we send a file of size 10Kbits every second. This is a simple yet effective approximation of real world network applications. For the purpose of our experiments, we have defined five IoT traffic classes that emulate IoT applications/devices that generate as little traffic as 10kbps up to as much traffic as 5Mbps. Table 3.1 lists all the traffic classes and their corresponding application layer bit rates that we used in our study, along with a few example scenarios/devices for each traffic class. We have also assumed non Quality of Service (QoS) differentiated traffic, i.e., all traffic is delivered using the best effort access category. The motivation behind this choice is that it enables us to restrict any performance variations observed in the study to variations only in the network size and traffic type, and not to different access categories.

Table 3.1 Bit rates for different classes of IoT traffic

Class #   App. Bit Rate   Example IoT Devices
1         10 kbps         Stove, Lights, Dishwasher, Garage, HVAC
2         50 kbps         Smoke detector, Smart refrigerator
3         200 kbps        Audio streaming
4         2.5 Mbps        Smartphone, Laptop, Video conferencing
5         5 Mbps          Video Streaming, CCTV monitoring

Traffic is generated by an application layer process running on the RPis using the Eclipse-Paho [37] implementation of the MQTT protocol. We did not use packet generator applications because packet generators do not consider the network overhead incurred in real world scenarios due to application layer messages. MQTT follows the publish/subscribe model, where information producing clients publish messages on a particular topic to a central broker. In our implementation, the broker runs on the server machine described in Section 3.4.1. Consumer clients can subscribe to one or more topics and any new data published to those topics is relayed to the subscribed clients by the broker. We chose MQTT as the application layer protocol instead of other protocols like HTTP or FTP because it is designed specifically for IoT applications and because its overheads are minimal and thus have negligible impact on the throughput performance of the underlying IEEE 802.11 protocol. Each RPi MQTT client publishes messages to the broker with its own IP address as the topic. The server side of the application running on the server machine has two components: 1) the MQTT broker implemented using Mosquitto [85], and 2) the consumer client that is subscribed to all RPi IP address topics. Consequently, whenever an RPi publishes a message to the broker with its own IP address as the topic, the consumer application receives it and logs the data.
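A minimal sketch of this publish/subscribe arrangement using the Paho Python client is shown below. The broker address, the use of the '#' wildcard subscription, and the function names are illustrative assumptions rather than our exact scripts.

```python
import paho.mqtt.client as mqtt

BROKER_HOST = "192.168.1.2"   # assumed address of the server running the Mosquitto broker

def publish_one_file(own_ip: str, payload: bytes) -> None:
    # Each RPi publishes to the broker using its own IP address as the topic.
    client = mqtt.Client()
    client.connect(BROKER_HOST, 1883)
    client.publish(topic=own_ip, payload=payload, qos=0)
    client.disconnect()

def run_consumer() -> None:
    # The consumer on the server subscribes to every topic ('#' wildcard here,
    # standing in for the list of RPi IP address topics) and logs what it receives.
    client = mqtt.Client()
    client.on_message = lambda c, userdata, msg: print(msg.topic, len(msg.payload))
    client.connect(BROKER_HOST, 1883)
    client.subscribe("#")
    client.loop_forever()
```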

3.4.3 Experiment Execution

Every experiment contains a setup phase, an IoT device communication phase, and a log transfer phase. In the setup phase, we configure the experiment parameters, which include specifying exactly which RPis will participate in the communication phase, how many times the communication phase will be repeated (we repeated 5 times), which of the 5 traffic classes specified in Table 3.1 the RPis will emulate while communicating, and the duration of each communication phase (we used a duration of 120 seconds). During this phase, the server also performs an ARP-scan [61] on the network to obtain the list of IP addresses of all RPis connected with the AP. During the communication phase, we want all IoT devices to start and end the communication at the same time so that we can correctly study the impact of the number of IoT devices in a given network on their collective as well as individual throughputs. Therefore, in the setup phase, we also synchronize the time across all RPis using [91]. In the communication phase, each RPi client waits for an explicit command from the server application to begin transferring data to the server using the rate corresponding to the traffic class it is emulating. This command essentially contains the time stamp when the RPi clients should start the transmissions. As the times of all RPis were synchronized during the setup phase, the application layer process in each RPi client starts passing the messages of sizes commensurate with the traffic class they are emulating to the lower layers. Consequently, each RPi starts attempting to transmit at the same time and IEEE 802.11n/ac's MAC comes into play. We use a non-persistent TCP connection for data transfer, i.e., a new TCP connection is made for the data transfer every second. The motivation behind using the non-persistent TCP is that depending on the application scenario, IoT nodes transmit data at varying periodicities. Therefore, it is better to use non-persistent TCP in this case as it reduces the burden on the server to keep track of various state variables of each node. Furthermore, non-persistent TCP connections are also more desirable if the IoT devices are resource constrained because then such IoT devices also do not have to keep track of any TCP state variables between data transfers. During each communication phase, we collected the following information: 1) tcpdumps captured at layer 2 of the network stack on every RPi client as well as on the server machine, 2) an over-the-air packet capture trace obtained using the sniffer, and 3) the number of times each RPi client published a complete file. We conducted experiments using the five different traffic classes mentioned in Table 3.1 on networks of sizes 3,5,10,15,...,45. For class 4 and class 5 IoT traffic, IEEE 802.11n/ac faced consistent disconnections during the experiments for networks of size greater than 30 RPis (because of socket timeouts resulting from heavy traffic). As IEEE 802.11n/ac did not yield any fruitful communication for class 4 and class 5 IoT traffic in network sizes greater than 30 RPi nodes, it is not possible to derive any statistically significant conclusions from the corresponding traces. Therefore, in our measurement study, we do not present any observations for class 4 and class 5 IoT traffic for network sizes greater than 30. Finally, in the log transfer phase, tcpdumps from all RPis involved in the communication phase as well as the sniffer RPi are transferred to the server machine for subsequent analysis. The process

described in this section is completely automated; user input is required only in the setup phase.
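The synchronized start and the per-second non-persistent transfers can be sketched as follows (an illustrative sketch of the client-side loop under the assumptions already stated for the previous listing, not our actual automation scripts):

```python
import time
import paho.mqtt.client as mqtt

BROKER_HOST = "192.168.1.2"    # assumed server address
PHASE_SECONDS = 120            # duration of one communication phase
FILE_BYTES = 10_000 // 8       # class 1 example: a 10 Kbit file every second

def communication_phase(own_ip: str, start_ts: float) -> None:
    # Wait for the start time stamp sent by the server so that all
    # time-synchronized RPis begin transmitting at the same moment.
    time.sleep(max(0.0, start_ts - time.time()))
    for second in range(PHASE_SECONDS):
        # Non-persistent transfer: a fresh connection is made for every file.
        client = mqtt.Client()
        client.connect(BROKER_HOST, 1883)
        client.publish(own_ip, b"x" * FILE_BYTES)
        client.disconnect()
        # Sleep until the next one-second publishing slot.
        time.sleep(max(0.0, start_ts + second + 1 - time.time()))
```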

3.5 Characterization of the IEEE 802.11n/ac’s MAC

In this section, we present a comprehensive set of observations that we made from the traces that we collected from our experiments. We first study the throughput achieved by individual RPi clients as well as the throughput of the wireless system as a whole in a variety of scenarios. After that, we study the bandwidths consumed by RTS and CTS frames and the number of block acknowledgments generated by the AP. The analysis of RTS/CTS bandwidth in conjunction with the block acknowledgments will provide us useful insights about the factors that deteriorate the throughput of IEEE 802.11n/ac's MAC in dense and high traffic IoT environments. After this, we will study the aggregation sizes of the frames and the impact of using aggregation on the performance of IEEE 802.11n/ac's MAC. Last, we will present our observations on how the default version of TCP that comes with Linux impacts the performance of IEEE 802.11n/ac's MAC and keeps it from achieving the high throughput that it could otherwise achieve.

3.5.1 Throughput

Figures 3.1 through 3.3 plot the cumulative number of files published by each RPi client to the broker as a function of the time elapsed for IoT traffic classes 1 through 5, respectively. As the client application tries to publish one file every second, an ideal plot of the cumulative number of files published with respect to time should be a straight line with slope 1. This is indeed what we observe in Figure 3.1 for traffic classes 1 through 3. The overlapping lines in this figure corresponding to various network sizes show that IEEE 802.11n/ac's MAC does not face any performance degradation up to 45 IoT devices in the environment, which is a reasonable estimate of the number of IoT devices that a single access point would need to serve in home and enterprise networks. Contrary to this, for traffic classes 4 and 5, we observe from Figures 3.2 and 3.3, respectively, that the total number of files published is less than 120 even for a small network of just 3 RPi clients. This shows that IEEE 802.11n/ac's MAC running on the RPi clients could not sustain the rate at which the application layer provided data to the lower layers. We further observe that for any given network size, the rate of file transfer is higher for class 4 traffic compared to class 5 traffic. This happens because a file transferred in class 5 traffic is twice the size of a class 4 file and takes more time to get published to the broker. Figure 3.4 plots the average of the throughputs achieved by each client obtained from each execution of the communication phase, described in Section 3.4.3. We calculated these throughput values from the tcpdumps that we collected from each RPi client. We observe from this figure that the per client throughput stays constant across different network sizes for class 1, 2, and 3 traffic. This happens because the combined application data rate of all clients for these traffic classes is much less than the channel capacity of 72.2 Mbps. Thus, each client is able to empty its


Figure 3.1 Number of files published for class 1,2,3 IoT traffic. Figure 3.2 Number of files published for class 4 IoT traffic. Figure 3.3 Number of files published for class 5 IoT traffic.

queue before the next file arrives. On the contrary, for class 4 and 5 traffic, the average per-client throughput degrades drastically with the increase in network size. This shows that with an increasing amount of offered traffic, IEEE 802.11n/ac’s MAC starts to waste more and more of its available bandwidth, which can prove to be a major bottleneck in the near future as consumers adopt more and more IoT devices that generate high network traffic. This last observation empirically validates the analytical prediction that Bianchi made in [15]. Figure 3.5 plots the average system throughput as a function of the network size. We calculated the average system throughput by simply taking the sum of the average individual throughputs of all RPi clients obtained from each execution of the communication phase. Ideally, in a network where all clients transmit at the same application rate, the average system throughput should be equal to the product of that common rate and the total network size, and should thus be a linear function of the network size. For class 1, 2, and 3 traffics, we indeed observe this linear relationship, as one would expect from the observations presented above. However, for class 4 and class 5 traffics, the system throughput first increases with the network size but then gradually drops as the network size increases further. This reaffirms the earlier observation that IEEE 802.11n/ac’s MAC starts to waste more and more of its available bandwidth as the network size increases, because a larger network size increases the net amount of offered traffic. The key reason behind this increased wastage of bandwidth is that as the network size increases, the number of collisions on the wireless channel increases, which leads to larger and larger backoff

Figure 3.4 Average throughput of each client (Kbps) and Figure 3.5 System throughput (Mbps), each plotted against network size for traffic classes 1 through 5.

intervals, thus wasting valuable bandwidth. This behavior is explained by the observations on the RTS/CTS data and the block acknowledgment data presented in the next sections.

3.5.2 RTS/CTS Bandwidth and Block Acknowledgments

To study the bandwidth consumed by the RTS/CTS frames and the trends in the block acknowledgments, we utilized the traces captured over-the-air by our sniffer RPi that was operating in monitor mode. Figures 3.6 and 3.7 plot the average amount of data generated as RTS and CTS frames across all executions of the communication phase for any given network size. We calculated the bandwidth consumed by the RTS frames from any given execution of the communication phase by first filtering out all the RTS requests from clients to the AP and then summing up their frame lengths. We calculated the CTS bandwidth in the same way. We observe from these figures that under low traffic conditions, i.e., for class 1, 2, and 3 traffics, the RTS/CTS data increases linearly with network size, which is intuitive, as a larger number of RPi clients have more data to send and thus generate a greater number of RTS/CTS frames. For class 4 and class 5 traffics, however, we see three distinct regions in each of these two figures. The first region extends from network size 3 to network size 15, the second from network size 15 to network size 25, and the third from network size 25 to network size 30. We next evaluate this behaviour w.r.t. the observations made from studying block acknowledgments. Figure 3.8 plots the number of block ACKs averaged across all executions of the communication phase for various network sizes. If we compare the plots of block ACKs in Figure 3.8 with the corresponding plots of the system throughput in Figure 3.5, we observe that the curves of the corresponding traffic classes closely follow each other in shape. This can be explained by the fact that for a transmission to be perceived as successful, the transmitting client must receive the acknowledgment frame. Therefore, the throughput observed by a client directly depends on the number of acknowledgment frames it receives. However, if we compare the plots of block ACKs in Figure 3.8 with the plots of CTS bandwidth in Figure 3.7, we make two contrasting observations. First, the total amount of CTS data for network size 15 is more than that for network size 10, whereas the number of block ACKs for network size 15 is smaller than that for network size 10. Second, the number of block ACKs continues to decrease as the network size increases beyond 10, whereas the amount of CTS data

Figure 3.6 RTS data and Figure 3.7 CTS data, in MB, plotted against network size for traffic classes 1 through 5.

remains saturated until network size 25 and drops only when the network size grows to 30. These observations reveal an interesting insight: the saturation of RTS/CTS frames implies that the clients were consistently able to access the channel despite the increase in network size and thus were able to initiate transmissions (as they were receiving CTS frames from the AP), but were unable to successfully complete many of those transmissions due to collisions during the transmission of the actual frames. This shows that as the number of clients and the net amount of traffic they offer increase, delays caused by frame collisions become more prominent than the backoff delays, resulting in an overall decrease in throughput. But as the total network traffic becomes very dense, consistent retransmissions result in decreased overall transmission opportunity and increased backoff delays (as seen from the low RTS/CTS data for network size 30).
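As an illustration of this step of the trace analysis, the following minimal sketch (not our exact processing script) sums the lengths of RTS and CTS control frames in a monitor-mode capture using Scapy; the capture file name is a placeholder.

```python
# Minimal sketch: estimate the bandwidth consumed by RTS/CTS frames by summing
# their frame lengths in a monitor-mode capture (IEEE 802.11 control frames:
# type 1, subtype 11 = RTS, subtype 12 = CTS).
from scapy.all import rdpcap
from scapy.layers.dot11 import Dot11

def rts_cts_bytes(pcap_path):
    rts_bytes = cts_bytes = 0
    for pkt in rdpcap(pcap_path):
        if not pkt.haslayer(Dot11):
            continue
        dot11 = pkt[Dot11]
        if dot11.type == 1 and dot11.subtype == 11:    # RTS
            rts_bytes += len(dot11)
        elif dot11.type == 1 and dot11.subtype == 12:  # CTS
            cts_bytes += len(dot11)
    return rts_bytes, cts_bytes

# Example with a hypothetical capture file:
# rts, cts = rts_cts_bytes("sniffer_run.pcap")
```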

3.5.3 Frame Aggregation

Next, we study how the network size and the net amount of traffic offered by the clients impact the sizes of the AMPDU frames. Figure 3.9 plots the average aggregate length of the AMPDU frames that the RPi clients transmitted for each traffic class and different network sizes. We calculated the average aggregate length of the AMPDUs from the frame bitmap in block ACKs. We make three important observations from this figure. First, for any given network size, aggregate lengths increase as the traffic class increases. This happens because higher traffic classes put more frames into the transmit buffer, and therefore, upon medium access, a client has more frames to aggregate. Second, for lightly loaded networks, such as for class 1, 2, and 3 traffics, the average aggregate length slightly increases with the network size. This happens because an increase in the number of clients in the network induces larger backoff delays. This leads to an increase in the expected amount of time that any given client has to wait before it gets access to the medium. Due to this increase in the wait time, the number of frames that queue into its transmit buffer while it waits also increases, which results in a net increase in the number of frames it has to transmit in a single aggregate. Third, for heavily loaded networks, such as for class 4 and 5 traffics, we see the opposite behavior: the average aggregate length decreases with the network size. This is counterintuitive because the increase in the medium access delay for heavily loaded networks is even larger

Figure 3.8 Number of block ACKs and Figure 3.9 Average aggregate lengths of AMPDUs, each plotted against network size for traffic classes 1 through 5.

Figure 3.10 CDFs of the number of packets TCP hands down to the lower layers in each release (number of pipelined segments), for (a) class 3, (b) class 4, and (c) class 5 traffic.

compared to the increase for the lightly loaded networks, which should give rise to even larger AMPDU sizes. On further investigation, we found that because MQTT uses TCP as the transport layer protocol, the TCP on the RPi clients started perceiving the medium access delay as congestion in the network and kicked in its congestion control mechanism to decongest the network, which significantly reduced the rate at which it hands data to the lower layers in the network stack. As less data arrives in the transmit buffer at the link layer, the sizes of the aggregates continue to diminish, which results in a lost opportunity when the client finally gets access to the medium, and thus in low throughput. Had TCP offered more packets to the lower layers, the average aggregate size would have increased, resulting in higher throughput. This shows that while IEEE 802.11n/ac’s MAC benefits from frame aggregation in all scenarios, pairing it with TCP reduces the usefulness of frame aggregation. This, in turn, implies that conventional TCPs are not well suited for dense IoT networks with heavy traffic.
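The aggregate lengths plotted in Figure 3.9 come from the Block Ack bitmaps; as a minimal, illustrative sketch of the idea (assuming the 8-byte compressed Block Ack bitmap has already been extracted from each Block Ack frame in the capture), the aggregate length is simply the number of set bits in the bitmap:

```python
def ampdu_length_from_ba_bitmap(bitmap: bytes) -> int:
    """Count the MPDUs acknowledged by a compressed Block Ack bitmap
    (8 bytes, one bit per MPDU of the A-MPDU)."""
    return sum(bin(b).count("1") for b in bitmap)

# Example: a Block Ack covering the first 10 MPDUs of an aggregate.
assert ampdu_length_from_ba_bitmap(bytes.fromhex("ff03000000000000")) == 10
```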

3.5.4 TCP Pipelining

Motivated by the last observation, in this section, we study how TCP forwards data to the lower layers for different IoT traffic classes and network sizes. Due to its congestion control functionality, TCP does not immediately forward all the data it receives from the application layer to the lower layers. Instead, it uses a sliding window to control the amount of data that it hands over to the lower layers. This sliding window is usually referred to as the congestion window in the case of TCP. Every time a new TCP connection is established, the congestion window of that connection starts with a fixed size, which is usually a multiple of the maximum segment size (MSS). It is then increased either exponentially or linearly depending on whether TCP is operating in the slow start or congestion avoidance mode, respectively. During the exponential increase, i.e., the slow start phase, every acknowledged TCP segment results in an increase in the size of the congestion window by 1 MSS. During the linear increase, i.e., the congestion avoidance phase, every acknowledged TCP segment results in an increase in the size of the congestion window by 1/W MSS, where W is the current size of the congestion window. For a more detailed description of TCP and its congestion control mechanisms, we refer readers to [7, 118].
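Written compactly, with the window W measured in units of MSS, the two growth rules described above are:

\[
W \leftarrow
\begin{cases}
W + 1, & \text{slow start (per acknowledged segment)},\\
W + \tfrac{1}{W}, & \text{congestion avoidance (per acknowledged segment)}.
\end{cases}
\]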

Figures 3.10(a), 3.10(b), and 3.10(c) plot the CDFs of the number of pipelined TCP segments across all RPi clients, i.e., the number of MSS-sized segments that TCP releases to the lower layers in each batch at each RPi client. We obtained the information about the number of pipelined segments in each release from the tcpdumps. We have not included figures for class 1 and 2 traffics because the number of segments that TCP generates for each transmission in class 1 and 2 traffics is less than its initial window size of 10. Thus, all segments for each file always get handed to the lower layers within the first release. In Figure 3.10(a), we see three noticeable jumps at 2, 4, and 10, i.e., for class 3 traffic, most of the segments were released in batches of 10, 4, and 2. The reason behind this is that class 3 traffic transmits a 25 KB file every second. TCP encapsulates the 25 KB file in 18 segments (using the standard MSS of 1460 bytes). On establishing a new connection, TCP immediately hands over 10 segments to the lower layers as its initial window size is 10 segments. We observed from our traces that receivers employ delayed ACKs [16], i.e., as long as the packets arriving at the receiver are not separated by more than 500 ms, the receiver does not ACK individual packets, but rather ACKs pairs of packets. Therefore, upon receiving each delayed ACK, as TCP at the sender is in the exponential increase mode, it increases the size of the congestion window by 2 and sends 4 new segments (two due to the increase in the size of the congestion window by 2 and the other two because two previously sent segments were acknowledged by the latest delayed ACK). As a result, TCP should ideally forward segments in chunks of 10, 4, and 4 for class 3 traffic. However, the jump at 2 in Figure 3.10(a) shows that there are cases where only two segments were forwarded, i.e., TCP perceived congestion in the network for class 3 even when the physical layer was able to sustain the offered rate. Unlike class 3 traffic, where segments were released in three distinct chunk sizes of 10, 4, or 2, we see multiple values for class 4 and 5 traffics. In class 3 traffic, a new connection was established every 18 segments, bringing the congestion window back down to 10. However, in class 4 and 5 traffics, as each file is large (encapsulated in 225 and 450 TCP segments, respectively), TCP gets enough iterations to grow its congestion window large enough that it can forward a greater number of segments than the 10 it sent at the start of the connection. This shows that had MQTT used persistent TCP connections, i.e., instead of terminating the connection after every file transfer, kept the TCP connection alive (provided the IoT client device has enough resources to maintain the TCP state variables at all times), it would have achieved a higher throughput by handing a larger number of segments to the lower layers due to the larger congestion window sizes. With persistent TCP connections, the RPi clients could better benefit from the frame aggregation method of IEEE 802.11n/ac’s MAC.
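As a sanity check of the arithmetic above, the following minimal sketch reproduces the idealized release pattern for a single file transfer (initial window of 10 segments, slow start, and delayed ACKs that each cover two segments); it is an illustrative model rather than our analysis code.

```python
MSS = 1460          # bytes per segment
INIT_CWND = 10      # initial congestion window, in segments

def release_batches(file_bytes, mss=MSS, init_cwnd=INIT_CWND):
    """Idealized slow-start releases with delayed ACKs: each delayed ACK
    acknowledges 2 segments, which frees 2 window slots and grows the
    congestion window by 2, so up to 4 new segments are released."""
    total = -(-file_bytes // mss)            # ceil: number of MSS-sized segments
    batches = [min(init_cwnd, total)]
    sent = batches[0]
    while sent < total:
        batch = min(4, total - sent)         # 2 freed slots + 2 from window growth
        batches.append(batch)
        sent += batch
    return total, batches

# Class 3 traffic: a 25 KB file -> 18 segments, released as [10, 4, 4].
print(release_batches(25 * 1024))            # (18, [10, 4, 4])
```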

3.6 Key Take-aways

In this work, we extensively studied the performance of IEEE 802.11n/ac’s MAC on real IoT devices for various classes of IoT traffic and for various network sizes. While we presented several observations from our measurement study, we highlight the four key take-aways here. First, when the

net amount of offered IoT traffic increases, IEEE 802.11n/ac’s MAC does not fairly distribute the bandwidth across the clients (Section 3.5.1). It is imperative to address this limitation because, with a high influx of IoT devices, if the available bandwidth is not fairly shared, some devices may suffer from starvation, leading to poor quality of experience for the users of those IoT devices. Second, when the net amount of offered IoT traffic increases, IEEE 802.11n/ac’s MAC starts to waste more and more of the already scarce bandwidth (Section 3.5.1). Thus, there is a need to update IEEE 802.11n/ac’s MAC such that this wastage of bandwidth is minimized. One plausible approach to achieve this is to enable contention-free channel access to some degree for networks with larger sizes by polling each client for its data. The contention-free channel access will reduce the frame collision rate, which in turn will decrease the overall wastage of bandwidth. Such contention-free access can be enabled by leveraging the point coordination function, which has already been standardized in the IEEE 802.11n/ac standard. For this to work, the clients must be capable of receiving and processing CF-POLL frames [28]. Third, as the number of clients and the net amount of traffic they offer increase, delays caused by frame collisions become more prominent than the backoff delays, resulting in an overall decrease in throughput (Section 3.5.2). This limitation of the current version of IEEE 802.11n/ac’s MAC can also be addressed using the approach of contention-free channel access described above. Fourth, the current versions of TCP are detrimental to the throughput that the MAC of IEEE 802.11n/ac could otherwise achieve because they throttle the segment forwarding rate to the lower layers, resulting in smaller aggregation sizes of the AMPDUs (Sections 3.5.3 and 3.5.4). Consequently, there is a need to develop variants of TCP that are well suited for high-density IoT traffic. Such variants of TCP should respond less aggressively to the delays in data transfer caused by the delay in medium access. This will likely require the development of an information channel between the transport and the link layers, such that the link layer can inform TCP in the case of a delay in medium access, so that TCP can take this information into account while deciding how to react to the delay in data transfer. While developing such a TCP variant could take a long time, for now, we can partially address the detrimental impacts of TCP’s congestion control by using persistent TCP connections whenever possible.

CHAPTER 4

CHOOSING TCP VARIANTS FOR CLOUD TENANTS - A MEASUREMENT BASED APPROACH

4.1 Introduction

Background and Motivation: Today, cloud computing is a reliable, efficient, and cost-effective platform for many enterprises and their computing requirements, such as big data, streaming, Internet of Things, and web applications. With the increasing adoption of the cloud and the ever-increasing demand for data transfer and processing, improving cloud performance has emerged as a challenging problem. Due to the increasing number of multi-tiered and distributed application deployments on cloud platforms, improving server-to-server network performance has become more important than ever. To satisfy the networking performance requirements of such applications, we have seen a variety of approaches that include innovative data center architectures [4, 122, 48], network optimized instances [10, 90], network appliances [47, 100], efficient big data frameworks such as Hadoop [35, 128] and Spark [143], cloud based content distribution networks [9], and high throughput data center interconnects [79, 69]. A common theme among all these prior approaches is that they address the problem of improving cloud networking performance from the perspective of network administrators who have full visibility and control over the internal hardware and software of the network. Unfortunately, very little has been done to explore how cloud tenants can achieve the best performance for their

applications. From a tenant’s point of view, network performance can be tuned by improving application design, following the cloud provider’s best practices, and appropriately selecting compute instance related parameters, such as CPU, memory, and instance type (e.g., network optimized, bare metal, reserved instances). Among the available options for a tenant to improve the networking performance of their applications, the TCP variant in the deployed instances is important. Because TCP variants are packaged as loadable kernel modules in Linux systems, the congestion control algorithm used by all TCP flows from a system can be changed by simply inserting a different TCP variant module. The benefits of choosing an appropriate TCP variant for a given application and cloud environment are twofold. First, choosing the best-fit TCP for a particular application improves the overall resource utilization and job completion times. This translates to reduced execution times of virtual instances, leading to reduced costs. Second, better network level performance, such as a decrease in observed latency, means improved user satisfaction and quality of experience. Therefore, the fundamental question for tenants is which TCP variant would lead to the best performance of their applications. Challenges: Choosing the right TCP variant in cloud platforms is not straightforward and depends on the characteristics of the deployed application and the cloud itself. In many ways, the tenant’s view of the cloud provider’s network is a black box because compute instance placement is variable and can be in the same server, same rack, or in another rack altogether. Additionally, the cloud infrastructure is shared among multiple tenants and their traffic co-exists. Since different TCP variants have been developed for specific constraints on end-to-end latency, bandwidth delay product (BDP), switch fabric characteristics, and application traffic patterns, it becomes difficult to choose the right TCP variant in an unknown topology. Even if the best-fit TCP is known for a given application and cloud platform combination, it might not be the best fit in another cloud platform. Besides, even in the same cloud platform for which the best-fit TCP is known, the network infrastructure keeps evolving over time due to addition, deletion, and migration of nodes and traffic from other tenants. Proposed Approach: To assist tenants in choosing the best-fit TCP for their applications in a given cloud, we need a measurement based approach to characterize the performance of different TCP variants in that cloud environment. The best-fit TCP is not a one time decision and depends on the traffic characteristics of the application and the cloud itself. Therefore, in this work, we propose a generic measurement based methodology that characterizes low level performance metrics, such as throughput and latency, and uses the insights developed from these measurements along with the known knowledge about application traffic patterns to choose the best-fit TCP. Our measurement based approach uses traffic devoid of any application level overhead (such as the volume of traffic and its arrival rate) to create baseline benchmarks for comparison of performance metrics such as throughput and latency. We use iPerf [66] to generate the traffic that we use to measure baseline performance metrics.
Using the insights from the baseline performance metrics and prior well-known heuristics about the traffic patterns of a given application, we predict the best-fit TCP for

it. In the later part of this work, we demonstrate the effectiveness of our proposed methodology using three case studies. In each case study, we use a different application and empirically establish that the best-fit TCP that we predict using our methodology indeed produces the best performance for that application. Limitation of Prior Art: Many previous works have studied cloud networks from the perspective of improving virtualization performance, service scheduling, resource utilization, etc. [44, 134, 64, 84]. While these studies improve many aspects of cloud computing and specifically the networking performance in cloud networks, there are still several problems that have not been addressed by prior work. The two types of problems that remain unexplored from the perspective of cloud tenants are a) the impact of TCP variants on the performance of instance-to-instance traffic (i.e., the traffic that flows between the servers) without any application overhead, and b) the impact of TCP variants on the performance of real distributed applications in the cloud. We study the impact of TCP variants in both these scenarios in this work. There exists another set of studies in the literature that analyzes the effects of different data center TCPs on the performance of various data center applications [115, 133, 72]. The key difference, however, is that in data center environments, the authors know networking parameters like delay, bandwidth, switch fabric architecture, and the number of competing flows and their types, which are only available to the infrastructure providers in the case of cloud platforms. In our work, we acknowledge the black box nature of cloud networks from the tenant’s perspective and design our approach around this constraint to assist tenants in deciding the best-fit TCP variants for their applications. Key Contributions: In this work, we make two key contributions. First, we present a generic measurement based methodology to characterize the performance of TCP variants in black box cloud networks and, based on the insights from this characterization and the traffic patterns of any given application, predict the best-fit TCP for that application. To the best of our knowledge, this is the first study that characterizes the performance of different TCP variants in cloud environments while treating the cloud network as a black box. We implemented our methodology in two well known commercial cloud platforms and demonstrated the dependence of application performance on the choice of the TCP variant. From our observations, we found that using the best-fit TCP for a given application and cloud platform combination can increase throughput by up to 13.7% and reduce latency by more than 5 times. Second, we collected a large data set of network traces from our experiments comprising 910 billion packets and 362 hours of measurements from two public clouds. We will make this data set available to the research community after the publication of this work. Organization: Next, in Sec. 4.2, we discuss the related work. Sec. 4.3 explains the design choices and the measurement setup of our study. In Sec. 4.4, we implement our methodology to measure and compare baseline performance metrics between different TCP variants and different clouds. In Sec. 4.5, we carry out three different case studies of our methodology to empirically validate our observations from Sec. 4.4. Finally, we conclude the chapter in Sec. 4.6.

4.2 Related Work

Design of TCP Variants: The first body of prior art that relates to our work is on the design of TCP variants. Cubic [54], New Reno [41], BBR [19], and DCTCP [6] are the four TCP variants that we have used. Each variant was proposed to address specific network performance issues. Cubic was designed to improve throughput in high BDP networks. New Reno is an improvement to the classic Reno [tcp_reno] and is widely used in Linux systems. DCTCP decreases congestion in data centers by using explicit congestion notifications from the switch fabric to preemptively decrease the congestion window size. BBR keeps the buffer occupancy low and yet fully utilizes the bottleneck link by sending data into the network at a rate that is equal to the bottleneck link capacity. BBR is actively used in Google services, such as its search and video streaming applications. Our study is different from the evaluations carried out in these works in two ways. First, we treat the network as a black box, whereas prior works either explicitly configure the network or know the network topology beforehand. This is an important distinction as cloud tenants do not have visibility into the physical topology of their applications. Second, prior works do not consider the existence of traffics that have different patterns and use different TCP variants. In reality, multiple tenants share the same network infrastructure in cloud networks and thus, their traffic coexists in the same switch fabric. Comparison Between TCP Variants: The second line of related work is comparative studies between different TCP variants, such as [116], [95], [49], [57], [1]. The main focus of these works is to compare performance metrics, such as throughput, RTT, and fairness of different TCP variants, when their traffic coexists in a known testbed topology. In [116], Rao et al. did a comparative study of the throughput performance of Cubic, HTCP, and STCP in wide area data transfers. Mo et al. presented an analytical comparison of throughput for TCP Reno and Vegas in [95]. Similarly, [49, 57, 1] presented comparative studies that focused on RTT and throughput fairness. In contrast, our measurement based approach does not have prior knowledge about the topology and focuses on cloud networks, which are different from the testbeds used in the earlier works.

4.3 Measurement Setup

In this section, we describe our measurement setup and discuss the rationale behind the choice of several system parameters along with the key performance metrics that we have used in our evaluations.

4.3.1 Network and System Parameters

We controlled three network and system level parameters in this work: the choice of public clouds, the TCP variants, and the traffic patterns, described next.

4.3.1.1 Choice of Cloud:

To conduct our experiments, we used two popular public clouds, Amazon Web Services (AWS) and Google Cloud Platform (GCP), both of which control a significant amount of the market share. For the purpose of this study, all instances are spawned within the same availability zone. It is common for application deployments to be contained in the same availability zone to avoid large data transfer costs associated with inter availability zone transfers.

4.3.1.2 Choice of TCP Variants:

We use four popular TCP variants: Cubic, Reno, DCTCP, and BBR. Cubic and Reno [54, 41] are loss based variants and are available by default in Linux kernels. DCTCP [6] relies on explicit feedback from the network hardware in addition to perceived losses to estimate congestion. It has been optimized for data center environments and is popular because of its handling of the incast problem in data centers. BBR [19] is a delay based variant that Google developed recently and uses in its YouTube and search services.
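For reference, no application changes are needed to switch variants: system-wide, the variant is selected through the net.ipv4.tcp_congestion_control sysctl once the corresponding kernel module (e.g., tcp_bbr or tcp_dctcp) is loaded, and an individual application can also pin a variant per socket. A minimal sketch of the per-socket option on Linux (Python 3.6+; the host and port below are placeholders):

```python
import socket

def connect_with_cc(host, port, cc=b"bbr"):
    """Open a TCP connection that uses the requested congestion control
    variant; the corresponding kernel module must already be loaded."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, cc)
    s.connect((host, port))
    return s

# Example (placeholder endpoint):
# conn = connect_with_cc("10.0.0.2", 5201, cc=b"dctcp")
```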

4.3.1.3 Traffic Pattern:

In our approach, we use iPerf [66] to measure the baseline performance metrics of different TCP variants. We call the traffic generated by iPerf the Benchmarking Traffic. The motivation behind using iPerf is that it does not introduce any application specific behavior into the traffic it generates; rather, it pumps in as much data as TCP can handle to saturate the communication link. Since iPerf generates as much traffic as the underlying network can handle, the experiments with iPerf serve the purpose of providing the baseline comparison across different TCP variants in the cloud. In the later part of this work, we empirically validate the insights gained from the baseline performance metrics on real world applications. The majority of applications on cloud platforms serve web traffic or process compute intensive workloads for edge hosts. Application traffics are distinct from benchmarking traffic in the sense that they have unique traffic arrival rates and communication patterns between compute nodes. Therefore, we use a representative set of three distributed workloads, namely Streaming, Distributed IO, and Sort, where each of them experiences a distinct performance bottleneck. The streaming workload generates a stream of data, which is concurrently processed by a streaming platform. The distributed IO workload reads large volumes of data from distributed file systems. The sort workload uses the map-reduce paradigm to sort huge volumes of data. Each workload produces different traffic patterns depending on the application requirements. We will discuss these three workloads and their traffic types in more detail in Sec. 4.5.

4.3.2 Measurement Metrics

Next, we describe the performance metrics for both benchmarking and application traffics that we study to evaluate the impact of the choice of TCP variant.

4.3.2.1 Benchmarking Traffic:

In the benchmarking traffic, we study the following three performance metrics. 1) Throughput: Throughput is the primary metric that we use to compare the performance across different TCP variants. The iPerf traces provide the average throughput between the communicating hosts every 100ms. 2) Packet-Loss: The iPerf traces provide the number of retransmissions that occur in every one second interval. 3) Latency: To obtain per-packet RTT, we use TCPprobe, which captures the TCP state every time an acknowledgement is received at the sender.
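As an illustration of how the first two metrics can be extracted, the sketch below parses iPerf's JSON output (assuming an iPerf3 run started with --json and a 100 ms reporting interval); our actual log-processing scripts may differ in detail.

```python
import json

def parse_iperf3_json(path):
    """Return per-interval throughput (Mbps) and retransmission counts from an
    iperf3 run, e.g. `iperf3 -c <receiver> -t 600 -i 0.1 --json > run.json`."""
    with open(path) as f:
        run = json.load(f)
    throughput_mbps = [iv["sum"]["bits_per_second"] / 1e6 for iv in run["intervals"]]
    retransmits = [iv["sum"].get("retransmits", 0) for iv in run["intervals"]]
    return throughput_mbps, retransmits

# Example with a hypothetical log file:
# tput, losses = parse_iperf3_json("bbr_pair1_run.json")
```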

4.3.2.2 Application Traffic:

Due to the differences in the nature of the workloads, every workload has a unique metric that appropriately quantifies the performance achieved by the corresponding application in the given cloud network. 1. Streaming Workload: For streaming applications, application level latency is more important than throughput in over provisioned environments, such as clouds. Thus, to quantify the performance achieved by a streaming workload, we use the time elapsed between the generation of a request packet and the arrival of its corresponding response packet as the performance metric. 2. Distributed IO Workload: As distributed IO workloads involve the transfer of large amounts of data between the communicating nodes, we use the throughput achieved by each node as the performance metric. 3. Sort Workload: Unlike the two applications mentioned above, the amount of time spent in doing computations in the sort workload is usually more than the amount of time spent in data transfer. Thus, we use the total job completion time, which is the sum of the times taken by both the compute and the networking components, as the performance metric.

4.4 Measurement Methodology

In this section, we describe our measurement based methodology and implement it on two public cloud platforms. We start by first describing the testbed setup, followed by the design of the benchmarking experiments. After that, we study the measurement metrics mentioned in Sec. 4.3.2 for benchmarking traffic. While we occasionally compare these performance metrics between GCP and AWS, our objective is not to compare the networking performance of the cloud platforms, but rather to characterize the impact of the choice of TCP on the performance of the applications. For each performance metric, we also discuss the implications of our observations for real-world applications.

4.4.1 Data Collection

Our implementation consists of 4 sender-receiver pairs for each TCP variant (16 pairs in total) distributed within a single availability zone in the cloud. We used t2.large instances in AWS and n1-standard-2 instances in GCP. Both these instance types have 2 vCPU cores and over 7 GB memory, enough to avoid any computational bottlenecks. A single iteration of the experiment involves choosing a TCP variant in each sender-receiver pair, generating traffic using the iPerf application between the pair for 10 minutes, and collecting traffic logs for analysis. From the iPerf traces, we obtain information about the throughput between the sender-receiver pair as a function of time as well as the number of retransmissions. We obtain the RTT values directly from the Linux kernel using TCPprobe. We repeated each of our experiments 5 times a day at different times of day for 10 consecutive days.

4.4.2 Evaluations

4.4.2.1 Throughput Measurements:

iPerf reports the average throughput values between the sender-receiver pair every 100 ms in our experiments. Figs. 4.1(a) and 4.1(b) plot the averaged throughput values as a function of time for all iterations of each TCP variant. We average the throughput values observed at every 100 ms interval across all iterations for every TCP variant to create these plots. We did not observe considerable differences in measured throughput values across different days and different times of day. Therefore, we present only the aggregate results in this section. Three important observations can be made from these two figures. First, the throughput obtained by a single TCP flow in GCP is at least 4 times the maximum throughput observed in AWS, for the same dollar amount and system configurations. Second, the AWS network implements rate-limiting on its traffic and we did not observe any impact of the TCP variant on the overall average throughput obtained by a single flow. Third, GCP does not use rate-limiting, and a single BBR flow achieves only a third of the throughput achieved by the

Figure 4.1 Throughput (Gbps) vs. time for BBR, Cubic, Reno, and DCTCP: (a) AWS, (b) GCP.

other variants. However, this difference vanishes as the number of parallel flows from a single host increases, as seen in Fig. 4.2. As we treat the cloud network as a black box, we cannot pinpoint the exact reason for BBR’s low throughput. But we can speculate that the small RTTs between hosts hurt the throughput performance, as BBR uses link RTTs to estimate the link BDP. A smaller RTT implies a lower BDP and therefore smaller congestion window sizes.
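For intuition, BBR caps its congestion window at a small multiple of the estimated bandwidth-delay product, so with very small intra-zone RTTs the window itself stays small. As a rough, purely illustrative calculation (the numbers below are hypothetical rather than measured values):

\[
\text{BDP} = C_{\text{bottleneck}} \times \text{RTT}_{\min},
\qquad
\text{cwnd}_{\text{BBR}} \approx \text{cwnd\_gain} \times \text{BDP}.
\]

For example, with a 1 Gbps bottleneck and an RTT of 100 microseconds, the BDP is about 12.5 KB, i.e., only roughly 8 to 9 MSS-sized segments in flight.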

4.4.2.2 Packet-Loss Measurements:

While the overall average throughput values between different TCP variants can be similar, observing the packet loss tells us whether that throughput was steady or oscillatory in saturated traffic scenarios. Fig. 4.3 plots the average number of packet losses observed in one second intervals for all iterations and TCP variants. The main difference between the packet-loss observations and the throughput observations of AWS in Fig. 4.1(a) is that, even when the overall throughput values are similar between different TCP variants, the packet-loss measurements vary widely. In both AWS and GCP, BBR has negligible packet loss compared to the other variants. Low packet losses imply under-utilization of the bottleneck buffer during bursty traffic scenarios, which can negatively impact the sender-side throughput.

4.4.2.3 Latency Measurements:

To measure per-packet latency from the iPerf traffic, we use the TCPprobe kernel module on the sending host to log the incoming acknowledgment packets. When a host receives an acknowledgment, TCPprobe logs the current time-stamp, the amount of outstanding bytes, and the smoothed RTT value at that particular instant. Figs. 4.4(a) and 4.4(b) plot the mean and the 90th, 99th, and 99.9th percentile values of the smoothed RTT for a single iPerf flow. These values are calculated from all iterations over the 10 days and each data point represents the mean of the corresponding metric (mean, 90th, 99th, 99.9th) across iterations. We make two observations from these figures. First, the 99th percentile latency values are lower in GCP than in AWS, giving a better latency performance for the same dollar value. Second, in both

Figure 4.2 Tput (N BBR flows, GCP): average throughput (Gbps) vs. number of parallel flows. Figure 4.3 Packet-loss per second (retries/sec) for BBR, Reno, Cubic, and DCTCP in AWS and GCP.

Figure 4.4 Observed latency for different TCP variants: RTT (ms) at the mean, 90th, 99th, and 99.9th percentiles for (a) AWS and (b) GCP.

AWS and GCP, we observed that BBR has the lowest latency when compared to the other variants. This is because BBR uses observed latency as its congestion signal and paces its sending rate to maintain low latency. Another interesting observation is that both cloud providers have the same relative order of TCP variants with respect to mean latency. In both cloud platforms, BBR has the smallest and DCTCP has the largest mean latency. In fact, the mean latency of DCTCP flows is 5 times more than that of BBR in AWS and 7 times more in GCP.

4.4.3 Implications

The specific implications of the choice of TCP variant on application performance depend on both the traffic size and the communication patterns of the cloud application. Recall that in Sec. 4.3.1.3, we mentioned that we will carry out case studies for three different traffic scenarios to empirically evaluate the impact of the TCP variant on application performance. Before we carry out these case studies, let us first predict the preliminary best-fit TCP for each of these three scenarios based on our observations from the baseline measurements described above.

4.4.3.1 Streaming Traffic:

Since user experience in streaming applications is sensitive to application level latency, the TCP variant that produces the lowest per-packet RTT should be preferred over others. The latency measurements from benchmarking traffic show that BBR traffic has the least per-packet RTT in both GCP and AWS, and should be the preferred TCP variant for streaming traffic. We will later see in Sec. 4.5.2 that a detailed end-to-end analysis of the performance metric is required to choose the best-fit TCP.

4.4.3.2 Distributed IO:

For these applications, achieving the highest throughput per node is important for faster reads. But observations from throughput measurements alone do not justify the choice of best-fit TCP. We should also look at the ability of the TCP variant to completely utilize the bottleneck buffers to absorb frequent bursts of application traffic. As BBR implements packet pacing to avoid queue buildup

Figure 4.5 Topology of distributed applications: the master node Mi of each cluster communicates with its workers Wi1, Wi2, and Wi3 through the cloud network.

at bottleneck links, it cannot utilize the additional bandwidth provided by those buffers, unlike the loss based variants. Thus, BBR should not be the choice of TCP variant for distributed, throughput sensitive applications.

4.4.3.3 Sort Workload:

Sorting is a compute intensive workload with occasional communication between nodes. Since the data transfer is just a small part of the overall job, the choice of TCP should not impact job completion times.

4.5 Application Traffic Scenarios

In this section, we carry out three different case studies using three different distributed applications as mentioned in Sec. 4.3.1.3, to empirically validate our predictions in Sec. 4.4 about the choice of best-fit TCP for each application.

4.5.1 Topology and Data Collection

To evaluate our four TCP variants, we deployed 4 four-node clusters with three worker nodes and one master node in both AWS and GCP. We used t2.xlarge instances in AWS and n1-standard-4 instances in GCP. Both these instance types have 4 vCPU cores and over 16 GB memory, enough to hold the largest data set from our experiments in memory. All nodes of any given cluster use the same TCP variant and each cluster was configured to use a different TCP variant. Having four separate clusters enabled us to execute our experiments at the same time for all TCP variants and eliminate any performance variations due to the time of the day. The resultant topology is shown in Fig. 4.5, where Mi is the master node of the i-th cluster and Wij is the j-th worker of the i-th cluster. Along with Hadoop, we used HiBench [60], an open source big data benchmarking suite, to execute our application traffic. HiBench has function prototypes to instantiate and execute several common

distributed workloads and reports appropriate application performance metrics for the workloads. We executed 15 iterations of each workload at different times of day on several different days using each TCP variant.

4.5.2 Streaming Traffic

The streaming workload involves reading data packets from a work queue, processing them in a stream processor, and writing the processed response packets to another work queue. This pipeline of events represents many real life workloads, such as the arrival of purchase orders in an e-commerce store. As we are interested only in the network traffic generated during this process, we used an identity function (i.e., the input data is returned as-is at the output) at the stream processor to minimize the computation overhead. Our metric of interest is the total turnaround time of the response packet from the arrival of the request packet, which is essentially the sum of the buffering time of the packet after arrival in the work queue, the per-packet RTT between communicating hosts after the packet has been transmitted, and the record processing time in the work queues of the producer and consumer. Note that the higher the average throughput between the hosts, the sooner a packet is transmitted on the link, and vice versa. We saw in Sec. 4.4.2.1 that BBR has lower average throughput than the other TCP variants in GCP, i.e., BBR takes more time to successfully transmit the same amount of data when compared to the other three variants. This means that, on average, data packets in BBR nodes spend more time in the transmit buffer of the sender before being transmitted onto the communication link.
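In other words, the metric decomposes as

\[
T_{\text{turnaround}} = T_{\text{queue}} + \text{RTT}_{\text{pkt}} + T_{\text{proc}},
\]

where \(T_{\text{queue}}\) is the buffering time in the work queue (which grows as the achievable throughput between the hosts shrinks), \(\text{RTT}_{\text{pkt}}\) is the per-packet round-trip time, and \(T_{\text{proc}}\) is the record processing time at the producer and consumer.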

4.5.2.1 Experiments:

To deploy the streaming workload, we used a Kafka [45] broker to store the incoming and processed packets and a Flink [42] cluster as the streaming platform to execute the identity stream processor.

The Kafka broker was instantiated on Wi1 in each cluster, while Wi2 and Wi3 were configured as the Flink worker nodes. Each iteration of an experiment involves generating a stream of 1000 byte data packets at a constant rate and writing them to a topic called “identity” in the Kafka broker. The Flink worker nodes were configured to read data from this topic and write it back to a different topic. At the end of the experiment, the time difference between the read and write of each data packet to the Kafka topics is calculated to get the application level latency. In our study, we experimented with three different rates at which data arrives into the Kafka broker: 10K, 20K, and 40K packets/sec.

4.5.2.2 Observations:

Figs. 4.6(a)–4.7(c) show the box plots of the average application level latencies calculated from the Kafka broker for each TCP variant in the two clouds. Each figure corresponds to a different data generation rate to the Kafka broker as mentioned with the figure. When the average throughput between communicating hosts is constant, turnaround time is directly proportional to the per packet RTT. We have already seen in Sec. 4.4.2.1 that the average throughput between hosts in AWS

Figure 4.6 Box plots of average application latency for the streaming workload in AWS at data rates of (a) 10K, (b) 20K, and (c) 40K packets/sec. Figure 4.7 Box plots of average application latency for the streaming workload in GCP at the same three rates.

is constant. Therefore, the application level latency should follow the trend of the observed latency in Sec. 4.4.2.3 for the AWS clusters, and the empirical evaluations prove the same. In Figs. 4.6(a)-4.6(c), the cluster using BBR has the least latency when compared to DCTCP, with the BBR cluster having 5 times less application latency than the DCTCP cluster. When it comes to GCP, recall that the average throughput between hosts is not the same for all TCP variants. Because of the small size of the Flink cluster, the number of parallel flows between the Kafka broker and the Flink worker nodes is close to one. Thus, the average throughput share between the hosts also influences the application level latency. Figs. 4.7(a)-4.7(c) plot the box plots of the average application latencies for the GCP clusters. We see an opposite trend from AWS (despite the lower per-packet latencies for BBR from Sec. 4.4.2.3) because BBR flows have a smaller throughput share compared to the other variants and thus experience more application level latency. On average, the application latency in the BBR cluster is 2.4 times that observed in the DCTCP cluster. In conclusion, the empirical evaluations follow the predictions that we made in Sec. 4.4.3 for streaming traffic. When there is no perceivable throughput difference between TCP variants, per-packet RTT measurements from iPerf experiments are enough to predict the best-fit TCP for streaming traffic. Otherwise, an end-to-end analysis is required considering the effects of the throughput differences.

4.5.3 Distributed IO

Distributed IO workloads are sensitive to available throughput between the nodes in the cluster. To stress test the average throughput capacity of our clusters, we used the Distributed File System Input Output (DFSIOe) workload available in HiBench. The DFSIOe workload is a map-reduce job, where each map operation reads a 1GB file and transfers its contents to the master node.

Figure 4.8 Box plots of average throughput per node (Mbps) for the DFSIOe workloads: (a) AWS-8GB, (b) AWS-16GB, (c) GCP-8GB, (d) GCP-16GB.

4.5.3.1 Experiments:

We conducted multiple iterations of the DFSIOe workload on each cluster with two workload sizes: 8 GB and 16 GB. Each iteration comprised creating the workload data and storing it in the Hadoop Distributed File System (HDFS), starting a map-reduce job to read and transfer that data to the master node, and tracking the amount of data transferred across the network from each node. At the end of the job, HiBench provides the average throughput per node.

4.5.3.2 Observations:

In contrast to the saturated traffic of Sec. 4.4, traffic in distributed reads seldom saturates the link. Thus, we cannot directly translate the observed throughput values from Sec. 4.4.2.1 to distributed IO workloads. But we can use the knowledge about the congestion control algorithms along with the empirical observations about packet loss from Sec. 4.4.2.2 to understand the throughput per node in a distributed cluster. Figs. 4.8(a)-4.8(d) show the box plots of the average throughput per node from executing the DFSIOe workload on each cluster. In both AWS and GCP, the cluster using BBR has the smallest throughput. In AWS, clusters using Cubic and DCTCP achieve 35.4% and 16.2% more throughput per node, respectively, compared to BBR. In GCP, these values are 26.3% and 13.7%, respectively. The difference between BBR and the other three variants is that BBR uses the increase in packet RTTs as the congestion signal, whereas the other variants use packet loss as the congestion signal. Thus, BBR throttles its sending rate before completely utilizing the switch buffer because the delay due to queuing in the buffer acts as a congestion signal. These observations are in line with our prediction in Sec. 4.4.3 that BBR will achieve inferior throughput compared to the other TCP variants.

4.5.4 Sort Workload

We have seen two workloads with a heavy network traffic component. Now we study a compute intensive application, the sort workload. We implemented the sort workload using the map-reduce paradigm on a Hadoop cluster as shown in Fig. 4.5. We are interested in measuring the Job Completion Time (JCT) for this workload, which is the total amount of time required to complete the sort operation on data of size N.

Figure 4.9 Box plots of average job completion time (J.C.T.) for the sort workload: (a) AWS-8GB, (b) AWS-16GB, (c) GCP-8GB, (d) GCP-16GB.

4.5.4.1 Experiments:

In every iteration, the HiBench tool generates a list of random data records (as the seed data) for the sort operation and stores them in the HDFS. We varied the size of the seed data between 8 and 16 GB in our experiments. HiBench also provides the map-reduce application to sort the seed data. The sort application has two phases, the map phase and the reduce phase. In the map phase, the mapper nodes read chunks of data records from the HDFS and sort them. Next, the sorted chunks are merged in the reduce phase by the reduce nodes to create a final sorted list. At the end of the sort function, HiBench provides detailed metrics about the job along with the total job completion time.

4.5.4.2 Observations

Figs. 4.9(a)-4.9(d) show the box plots of the total job completion times on each cluster for different seed data sizes. Unlike the previous workloads, we do not see a particular trend in the job completion times as a function of the TCP variant. It is difficult to accurately extract the network component in these values because of the lack of network traces and the unknown cloud topology, but two reasons could contribute to the lack of any trends in the observations. First, being a compute intensive workload, the time required to complete the map and reduce phases alone would be more than the network traffic component. Second, not all data would need to be transferred between the worker nodes, as the map or reduce operations could work on local data, thereby reducing the effects of the choice of TCP variant. Thus, in an over provisioned cloud network, applications that are compute intensive with a limited networking component are not significantly affected by the choice of TCP, as we predicted in Sec. 4.4.3.

4.6 Implications

In this work, we have empirically evaluated the effect of the choice of TCP variants on distributed applications in cloud environments. We developed a methodology to analyze the performance of different TCP variants in the network infrastructure of any given cloud provider and used those insights to infer the trends in application level metrics for distributed applications. Any tenant, or a third party, can perform such a study before deploying their application(s) to ensure that they

efficiently utilize the cloud resources that they are paying for. Our key takeaways and the implications of this work are as follows. Best-fit TCP: The best-fit TCP variant for cloud applications depends on both their traffic patterns and the baseline performance of that TCP variant in the given cloud. Latency: Per-packet RTTs play a major role in determining the application level latency in streaming workloads when the average throughputs of all TCPs are the same. However, when the throughput share between different TCP variants is not equal, the application latency is determined by both the per-packet latency and the average throughput share of the TCP variant. In general, if the applications are not limited by the available bandwidth, then BBR provides the least per-packet RTT. Throughput: For workloads that require high throughput and have a bursty traffic model, the delay based TCP variants, such as BBR, provide inferior performance compared to the loss based TCP variants. Although this work has experimented with common representative workloads found in cloud environments, there are many other types of applications that we did not include due to the monetary cost of conducting the experiments on the cloud platforms and space constraints. Similarly, there also exist other cloud providers on which one could conduct such experiments. Nonetheless, the approach that we have presented is generic and is usable on any cloud platform and any workload.

CHAPTER 5

CHARACTERIZING THE IMPACT OF TCP COEXISTENCE IN DATA CENTER NETWORKS

5.1 Introduction

Today’s data centers host a variety of applications with different network performance needs, such as bounded jitter, predictable average throughput, and low latency. To satisfy the performance needs of such applications, which predominantly generate server-to-server traffic, we have seen a multi-faceted effort, with solutions from both the network perspective, such as efficient switch fabrics [3, 48, 102] and network performance appliances [99, 100], and the end-to-end perspective, such as optimizing application layer frameworks [137, 144] and developing data-center focused variants of TCP [19, 6, 94]. Regarding TCP in particular, we have witnessed the development of several new TCP variants, such as BBR [19], DCTCP [6], ICTCP [139], D2TCP [133], and Timely [94], that propose a variety of different congestion control methods to improve data center network performance for different types of server-to-server workloads. As no single TCP variant can satisfy the needs of the wide variety of diverse applications, and as some applications that are not sensitive to the choice of TCP variant just use the default variants that come with the operating system, the switch fabrics of today’s data centers inevitably carry traffics that are controlled by multiple TCP variants. Several studies, such as [31, 59], have thoroughly demonstrated the existence of traffic controlled by multiple TCP variants

through measurements from real data centers. Different TCP variants employ different approaches to adjust their sending rates in response to packet losses and/or changes in RTT. Therefore, the flows of some TCP variants capture more bandwidth compared to the others, which, in turn, translates to longer job completion times and degradation of the performance achieved by the network applications that are using the disadvantaged TCP variants. This motivates us to ask a fundamental question: how do the performance metrics (which we will define shortly) achieved by the network flows controlled by one TCP variant get impacted by the simultaneous coexistence of network flows controlled by other variants? We say that two flows simultaneously coexist if the paths they traverse have at least one common network link and they traverse that common link at the same time. The answers to this fundamental question contain critical insights that network administrators can proactively use to place their workloads in the data center in such a way that the negative impacts of the coexistence of TCP variants can be avoided and the positive impacts can be leveraged. This, in turn, will not only lead to optimal utilization of the data center network but will also improve the performance achieved by individual applications running in the data center. In this work, we extensively study the impact of coexisting TCP variants on each other’s performance using two common data center topologies, namely Leaf-Spine [5] and Fat-Tree [3]. We conduct our measurement study in two phases. In the first phase, we purely study how the congestion control mechanisms of different coexisting TCP variants impact each other’s performance. For this, we used iPerf [66] because iPerf offers as much traffic to TCP as TCP can transfer and does not introduce any application induced temporal variability in the amount of offered traffic. In the second phase, we execute three representative real data center workloads, namely streaming, MapReduce, and storage workloads, such that they coexist while using different TCP variants. In this phase, we study how the insights and lessons from the first phase translate to real application workload performance. We conducted our experiments using four TCP variants commonly used in data centers, namely BBR [19], DCTCP [6], CUBIC [55], and New Reno [41]. We will provide more details and motivation about our choice of topologies and TCP variants in Sec. 5.3 when we describe our measurement setup. To quantify the impact of coexisting TCPs on each other’s performance, we will use one or more of the following five commonly used metrics in a variety of different scenarios: 1) Throughput achieved by each coexisting flow; 2) Fairness demonstrated by the coexisting flows in sharing the available bandwidth; 3) Total Network Utilization, i.e., the fraction of the available bandwidth that all flows collectively utilized; 4) Stability of the throughput achieved over time by the flows; and 5) Job Completion Time, i.e., the time any given data center application took to complete its basic work unit. Organization: Next, we discuss the related work, and then describe our measurement setup in Sec. 5.3. In Sec. 5.4, we present the first phase of our study and extensively discuss how the congestion control mechanisms of various coexisting TCP variants impact each other’s performance. In Sec. 5.5, we present the second phase where we discuss the impact of TCP coexistence on real data

center workloads. In Sec. 5.6, we demonstrate how to predict the TCP variant of coexisting traffic and evaluate its benefits. Finally, we conclude the chapter in Sec. 5.7.
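As a concrete reference for the fairness and network utilization metrics listed above, the following is a minimal Python sketch. The chapter does not commit to a specific fairness formula, so Jain's fairness index is used here purely as one standard choice; treat this as an illustration rather than the exact computation used in the measurements.

```python
# Illustrative helpers for two of the metrics above (assumes Jain's index for fairness).
def jain_fairness(throughputs):
    """1.0 when all coexisting flows get equal throughput; approaches 1/n when one flow dominates."""
    n = len(throughputs)
    total = sum(throughputs)
    return (total ** 2) / (n * sum(t ** 2 for t in throughputs)) if total > 0 else 0.0

def network_utilization(throughputs, link_capacity_gbps=1.0):
    """Fraction of the bottleneck bandwidth collectively utilized by all coexisting flows."""
    return sum(throughputs) / link_capacity_gbps

# Example: a flow capturing 0.8 Gbps coexisting with a flow capturing 0.1 Gbps on a 1 Gbps link
print(jain_fairness([0.8, 0.1]))        # ~0.62, i.e., noticeably unfair
print(network_utilization([0.8, 0.1]))  # 0.9
```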

5.2 Related Work

Some previous studies have indeed explored the impacts of coexisting TCP variants on each other's performance. For example, [62] studied the coexistence of BBR and CUBIC on a dumbbell topology with a single congestion point. Similarly, [72] studied the coexistence of DCTCP and CUBIC flows through a single congestion point. [50] presented a simulation-based analysis of throughput fairness of TCP New Reno, Vegas, and Westwood+. Another class of work focused on developing active queue management approaches to improve throughput fairness among coexisting flows of different variants [68, 67, 80]. To guide these approaches, they performed some explorations of coexisting TCP variants, albeit much less extensive and often simulation-based ones. Although these studies provided some interesting insights, there are several important perspectives from which we characterize the impacts of coexisting TCPs in this work that the prior studies did not cover. These perspectives include:

1. Switch Fabric Topologies: prior studies used simple topologies, such as the dumbbell topology, that are rarely found in real-world data centers. It is important to characterize the impact of coexisting TCP variants using modern data center topologies because they give rise to multiple congestion points that do not appear in simpler topologies.

2. Link RTTs: prior studies did not analyze the impact of link RTTs on the coexistence of TCP variants. This is important because: 1) the switch fabrics of modern data centers offer a wide range of link RTTs depending on the locations of the hosts; 2) the impacts of coexisting TCP variants on each other's performance change with changes in link RTTs because some variants adjust their sending rates based on RTT (e.g., Vegas [17] & Timely) while others do not (e.g., CUBIC & DCTCP).

3. Workloads: prior studies did not present observations from realistic data center workloads, rather only from saturated flows that pump as much data into the network as their congestion control algorithms allow. In addition to saturated flows, it is important to characterize the impacts of coexisting TCP variants using unsaturated flows and using modern data center workloads because they have different network characteristics compared to the saturated flows.

4. TCP Variants: prior studies have not characterized several important combinations of TCP variants that are now commonly found in today's data centers, such as BBR, DCTCP, Timely, and several more.

5. Stability: prior studies did not analyze how the coexisting TCP variants impact the stability of each other’s throughput. Throughput stability is important because many applications,

such as live-streaming and cloud-gaming, cannot tolerate oscillations or frequent variations in throughput.

An orthogonal class of measurement studies, such as [14, 74, 39, 135], characterized various other networking aspects of data centers, such as the nature and characteristics of the traffic on the switch fabric, the frequency of congestion episodes, etc. Yet another class of work focused on developing variants of TCP for data centers, such as DCTCP [6], ICTCP [139], D2TCP [133], Timely [94], and BBR [19]. We mentioned these classes of work only for the sake of completeness; their objectives, otherwise, are orthogonal to the work we present in this chapter. Several prior studies [92, 141, 103, 23, 93] have engaged in modelling and predicting TCP congestion control algorithms using heuristic and machine learning approaches. [92] uses a Support Vector Regression approach to predict throughput in a lab environment using an in-house developed tool called PathPerf. Yang et al. proposed a tool to actively identify the TCP congestion avoidance algorithm in remote web servers using features extracted from gathered window size traces. [23] uses passive methods to identify TCP variants using a four-layer long short-term memory model. Sander et al. in [120] use deep learning approaches to identify TCP variants from packet arrival data. In contrast to prior work, our goal is not to identify the TCP variant of the local flow because we already know the local congestion control variant. Our goal is to predict the TCP variant of the remote flows with which our local flow coexists at the bottleneck link.

5.3 Measurement Setup

5.3.1 Testbed Topologies

Researchers have proposed a wide variety of data center topologies such as Leaf-Spine [5], Fat-Tree [3], DCell [52], BCube [53], etc. Among these, Leaf-Spine and Fat-Tree have emerged as popular topologies in modern data centers [140, 5, 3]. This is because they are well-suited for server–server traffic as they provide predictable latency, enhanced bandwidth through massive multi-pathing, and high scalability. Thus, we used these two topologies in our experiments. Figs. 5.1 and 5.2 show our two testbeds configured in Leaf-Spine and Fat-Tree topologies. The nodes at the bottom in each figure represent the servers that execute the workloads while the nodes at all other levels are switches. The text beside each node is the name that we will use when referring to that node. For some servers, we have provided two names to ease our discussion later. These figures also provide names for the congestion points that will come up during the discussion in this chapter. A congestion point is an outgoing port of a switch where flows from more than one incoming port have to exit, and thus contend for the bandwidth of the outgoing link. The servers in our testbeds are equipped with 8 processors, 96 GB RAM, and gigabit ethernet. All switches are ECN-enabled gigabit switches that simultaneously support a 1 Gbps data rate between any pair of ports. Similar to typical data center switches [132], our switches employ round-robin ECN queues with buffers of 60 packets.

Figure 5.1 Leaf-spine data center topology (servers S1–S3, D1–D3, W11–W23, M1, M2; leaf switches LS1–LS3; spine switches SS1, SS2; congestion points CP1–CP6)

Figure 5.2 Fat-Tree data center topology (servers S1–S4 and D1–D4; switches R1–R4, A1–A4, E1, E2; congestion points CP1–CP3)

5.3.2 Testbed Network Parameters

The network links in today's data centers boast large bandwidths, such as 10 or even 40 Gbps. However, the bandwidths that individual application instances usually receive in multi-tenant and multi-application data centers are far less than 10 Gbps [135]. This is because of the commonplace use of servers that support virtualization, and thus share the bandwidth among the flows of multiple virtual machines. As a single server only runs a single application in our experiments, we selected the bandwidth of each link to be 1 Gbps. Due to the large link bandwidths, the minimum RTTs that the server-server traffic experiences in today's data centers are small (≪ 1 ms) [94]. However, the average RTTs are often much larger due to switch fabric congestion and processing delays introduced by server virtualization [135]. In our testbeds, the minimum RTT is 500 µs, which is dictated by our testbed topologies and by our link bandwidths. To cover both the minimum and average RTT cases of real data centers, we used the Linux traffic control (TC) tool [126] to increase the RTT of our testbed up to 2 ms as needed for each experiment.
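The exact tc configuration is not specified in the text; as an illustration, a netem qdisc with a fixed added delay is one way the path RTT can be raised, sketched below with a placeholder interface name and delay value.

```python
# A hedged sketch of raising path RTT with Linux tc/netem; interface and delay are placeholders.
import subprocess

def add_one_way_delay(iface: str, delay: str) -> None:
    """Attach a netem qdisc that adds a fixed one-way delay on the given interface (requires root)."""
    subprocess.run(["tc", "qdisc", "add", "dev", iface, "root", "netem", "delay", delay],
                   check=True)

# e.g., adding ~750us of delay in each direction at both end hosts moves a 500us base RTT
# toward the 2 ms upper end used in the experiments
add_one_way_delay("eth0", "750us")
```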

5.3.3 TCP Variants

As mentioned earlier, we chose four TCP variants, namely BBR [19], DCTCP [6], CUBIC [55], and New Reno [41]. The motivation behind the choice of BBR and DCTCP is that they have been designed for

and frequently used in data centers [6, 19]. The motivation behind the choice of CUBIC and New Reno is that they come as the default TCP variants in most Linux kernels today and many network administrators run virtual machines in data centers without making any changes to the TCP stack. Our choice of these four TCP variants covers all three types of TCP congestion control mechanisms: RTT-based (BBR), network-feedback-based (DCTCP), and loss-based (DCTCP, New Reno, & CUBIC). To log the TCP parameters and traffic, we used the TCP kernel probe [127].
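The chapter does not state whether the variants were selected per socket or system-wide; the sketch below shows one common way to do it on Linux, using the TCP_CONGESTION socket option. The host, port, and variant strings are placeholders, and the corresponding kernel modules (e.g., tcp_bbr, tcp_dctcp) must be available.

```python
# Illustrative only: select the congestion control variant for a single TCP connection on Linux.
import socket

def connect_with_cc(host: str, port: int, variant: bytes = b"cubic") -> socket.socket:
    """Open a TCP connection that uses the given congestion control variant."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, variant)  # e.g., b"bbr", b"dctcp"
    s.connect((host, port))
    return s

# sock = connect_with_cc("10.0.0.2", 5201, b"bbr")  # a sender behaving as a BBR flow
```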

5.4 Benchmarking Workloads

To study how the congestion control mechanisms of different coexisting TCP variants impact each other's performance, we use iPerf traffic as it does not incorporate any application-induced temporal variabilities in the amount of offered traffic. The iPerf traffic enables us to establish the baseline behaviors of how different coexisting TCP variants impact each other's performance, and thus, we name this traffic benchmarking workloads. We start with a simple scenario where two senders in a leaf-spine topology send equal numbers of saturated TCP flows through a single congestion point (Sec. 5.4.1). After that, we study the scenario where a single flow of one variant coexists with multiple flows of another variant (Sec. 5.4.2). Next, we study how the impacts of TCP coexistence evolve with the RTTs of the paths (Sec. 5.4.3). After that, we study TCP coexistence using unsaturated flows (Sec. 5.4.4). Finally, we repeat our experiments on the more complex fat-tree topology, which has multiple congestion points and several sender-receiver pairs, and study how our observations evolve when going from the leaf-spine to the fat-tree topology (Sec. 5.4.5).

5.4.1 Equal Number of Flows of All Variants

To study how two sets of coexisting flows controlled by two different variants impact each other's performance, we placed iPerf source instances on servers S1 and S2 and the corresponding destination instances on servers D1 and D2, shown in Fig. 5.1. Their flows traversed the path Si−LS1−SS1−LS3−Di. Next, we simultaneously started two sets of iPerf flows, one from S1 represented with f1 and another from S2 represented with f2, and let them run for 30 seconds. Each set of flows comprised N parallel flows, where we experimented with N = 1, 10, and 100. We represent the TCP variants controlling f1 and f2 with T1 and T2, respectively. We repeated this experiment 50 times for each of the six possible combinations of T1 and T2, where T1 ≠ T2. The six combinations include BBR/DCTCP (B/D), BBR/CUBIC (B/C), BBR/New Reno (B/R), DCTCP/CUBIC (D/C), DCTCP/New Reno (D/R), and CUBIC/New Reno (C/R). The two sets of flows faced a single congestion point, CP1.
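The chapter cites iPerf [66] without specifying the version; assuming iperf3, starting a set of N parallel flows that all use a chosen TCP variant can be sketched as follows (the destination names are placeholders).

```python
# Hedged sketch: launch one set of N parallel saturated flows with a chosen congestion control.
import subprocess

def start_flow_set(dest: str, n_flows: int, variant: str, duration_s: int = 30):
    """Start N parallel iperf3 flows to `dest` that all use the given TCP variant (Linux only)."""
    return subprocess.Popen([
        "iperf3", "-c", dest,
        "-P", str(n_flows),     # N parallel flows in this set
        "-t", str(duration_s),  # run for 30 seconds, as in the experiments
        "-C", variant,          # congestion control, e.g., "bbr", "dctcp", "cubic", "reno"
    ])

# e.g., the B/C combination with N = 10, started simultaneously from S1 and S2:
# f1 = start_flow_set("d1", 10, "bbr"); f2 = start_flow_set("d2", 10, "cubic")
```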

5.4.1.1 Throughput & Fairness

Fig. 5.3 shows the average throughput over 50 repetitions achieved by each of the two sets of flows. The height of each bar further shows the average of the combined total throughput achieved by the two sets of flows corresponding to that bar. The error bars represent the 95th percentile confidence interval.

Figure 5.3 Throughputs achieved by the set of flows f1 and f2 while using the TCP variants T1/T2 (red is T1, blue is T2) for N = 1, 10, and 100, respectively.

The flows of the newer TCP variants, i.e., DCTCP and BBR, unfairly capture a significantly larger bandwidth compared to the flows of the legacy variants, i.e., CUBIC and New Reno. This was observed by the authors of prior studies [72] and [62] as well, albeit using only a few flows per variant and without describing why this behavior is observed. We further observe from Fig. 5.3 that this unfairness persists even when the number of parallel flows of each variant increases all the way to N = 100. The cause of unfairness between throughputs of different TCP variants is the difference in the way different variants interpret congestion and react to network events. For example, on experiencing a packet loss, CUBIC reduces its congestion window size as it treats the packet loss as a sign of ongoing or impending congestion. However, BBR simply retransmits the lost packet without changing its congestion window size as long as the RTT values are stable. Thus, the same loss event penalizes one variant while it does not affect or has a diminutive effect on the other. To understand the unfairness when a newer variant coexists with a legacy variant, consider Figs. 5.4 and 5.5, which plot the average congestion window sizes and the average number of retransmissions. We observe from these figures that for the B/C and B/R combinations, although BBR flows experienced a significantly larger number of retransmissions, they still kept their congestion window sizes significantly larger compared to CUBIC and New Reno flows, which reduce their congestion window sizes on experiencing packet losses. This phenomenon induces a negative feedback for legacy variants and results in the BBR flows capturing a significantly larger bandwidth compared to the CUBIC and New Reno flows. For the D/C and D/R combinations, the knowledge of impending loss due to the ECN signals helps DCTCP flows keep their packet losses close to zero even when N = 100, as shown by Fig. 5.5. As DCTCP flows experience significantly fewer losses compared to CUBIC and New Reno, they reduce their congestion window sizes less frequently. Consequently, their average congestion window sizes are much larger than those of the CUBIC and New Reno flows, as shown by Fig. 5.4, and they thus capture larger bandwidth. Another interesting observation from Fig. 5.3 is that when the number of parallel flows is small, BBR captures significantly larger bandwidth compared to DCTCP. However, as the number of parallel flows increases, DCTCP captures more and more bandwidth, such that at N = 100, DCTCP captures significantly more bandwidth compared to BBR. This can be explained by the phenomenon demonstrated in [111] that TCP flows become asynchronous as the number of flows increases. As

Figure 5.4 Avg. congestion window size per flow (red circle: T1, blue cross: T2)

Figure 5.5 Avg. retransmissions per flow per RTT (red: T1, blue: T2)

the number of BBR flows increases, the times when they probe for link RTTs go out of sync, resulting in persistent buffer usage, which increases the value of the minimum RTT but decreases the value of available bandwidth, resulting in a net decrease in the bandwidth delay product (BDP). This results in each BBR flow making its congestion window size smaller and smaller relative to coexisting DCTCP flows. Fig. 5.4 empirically demonstrates this, where we can see that the congestion window size of BBR decreases with respect to DCTCP as N increases from 1 to 100. This, along with DCTCP's fewer retransmissions, as seen in Fig. 5.5, results in DCTCP capturing larger bandwidth at larger N. We further observe from Fig. 5.3 that as N increases, New Reno unfairly starts capturing more and more bandwidth compared to CUBIC. This happens for two reasons: 1) after a loss event, New Reno decreases its congestion window by 50%, whereas CUBIC decreases by only 20%; 2) CUBIC increases its congestion window size more aggressively compared to New Reno after loss. When there are multiple CUBIC flows, they compete among themselves and cause further congestion that leads to a significant reduction in the sizes of their individual congestion windows. In comparison, New Reno linearly increases its congestion window sizes, whose rate of increase reduces further with the number of flows due to its ACK-based self-clocking. As a result, CUBIC gets a larger share of bandwidth initially but eventually collapses due to the congestion caused by CUBIC flows, whereas New Reno flows start with a smaller throughput but eventually dominate the bottleneck link. This phenomenon can be seen in Fig. 5.6, where we plot the instantaneous throughput per flow for both CUBIC and New Reno flows when N = 10. Transient flows also experience the throughput unfairness discussed above. To demonstrate this, we conducted new experiments where in each experiment, we started a flow of one TCP variant, let it run for 75 seconds, and then started the flow of the second TCP variant while the first flow was still running. We observed from our experiments that as soon as the second flow starts, the

65 coexisting pair of flows behave exactly the same as in the experiments we have discussed until now where we started both flows simultaneously. For example, Figs. 5.7(a) and 5.7(b) show the instantaneous throughput of each flow for the B/C and D/C combinations, respectively. Similar to previous observations, we observe from these figures that the newer TCP variants starve the legacy variant.

5.4.1.2 Throughput Stability

As several common applications, such as live-streaming and cloud-gaming, are sensitive to variations in throughput, an important metric to study when flows of multiple TCP variants coexist is the stability of the throughput achieved by the flows of each variant. We visualize the throughput stability of flows using Poincaré maps [106]. To explain what a Poincaré map is, let us represent the instantaneous throughput of any given flow measured at time instant i with θi. A Poincaré map of any given

flow is a two-dimensional scatter plot comprised of the points (θi, θi+1) for all values of i. For a flow with stable throughput, the points in the Poincaré map appear in a tight cluster along the 45° line, whereas, for a flow with unstable throughput, the points appear dispersed. Fig. 5.8 plots the Poincaré maps using normalized throughput values for the various combinations of TCP variants when N = 1. As we saw above, CUBIC and New Reno behave very similarly when N = 1. Therefore, we have not shown the Poincaré maps for combinations with New Reno. Fig. 5.8 reveals that throughput stability heavily depends on which TCP variants coexist. We observe from this figure that a BBR flow sharing the bottleneck link with another BBR flow exhibits unstable behavior. This happens because BBR periodically reduces its congestion window size to 4 packets to probe for the minimum link RTT [19]. This was seen in Fig. 5.7(a) as well where we observed that BBR's throughput dropped periodically, resulting in an oscillatory pattern. When BBR reduces its congestion window size, the buffer in the switch frees up and thus, the flows of other variants see a momentary surge in their throughputs. Due to this, the flows of coexisting TCP variants also experience an unstable throughput, as shown by the Poincaré maps for the B/C and B/D combinations in Fig. 5.8. In comparison, loss-based TCPs, i.e., DCTCP and CUBIC, achieve stable throughputs as shown by the Poincaré maps for D/D, C/C, and D/C, even though the throughput achieved by

Figure 5.6 Instantaneous per-flow throughputs of New Reno and CUBIC (N = 10)

Figure 5.7 When T2's flow is added after 75 sec. (left: B/C, right: D/C)

CUBIC is much smaller compared to the throughput achieved by DCTCP in the D/C combination.
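The Poincaré construction described above is straightforward to reproduce; the sketch below uses synthetic throughput samples purely to illustrate how a stable and an oscillating flow separate visually (the sampling rate, normalization, and synthetic traces are assumptions, not measured data).

```python
# Minimal sketch of a Poincaré map: scatter consecutive throughput samples (theta_i, theta_{i+1}).
import numpy as np
import matplotlib.pyplot as plt

def poincare_plot(throughput, label, ax):
    """Tight clustering along the 45-degree line indicates stable throughput; dispersion, instability."""
    theta = np.asarray(throughput) / np.max(throughput)   # normalize to [0, 1]
    ax.scatter(theta[:-1], theta[1:], s=4, label=label)
    ax.plot([0, 1], [0, 1], linewidth=0.5)                 # the 45-degree reference line
    ax.set_xlabel(r"$\theta_i$"); ax.set_ylabel(r"$\theta_{i+1}$"); ax.legend()

# e.g., compare a stable flow against an oscillating one (synthetic traces)
fig, ax = plt.subplots()
t = np.arange(300)
poincare_plot(0.5 + 0.02 * np.random.randn(300), "stable", ax)
poincare_plot(0.5 + 0.4 * np.sin(t / 3) + 0.05 * np.random.randn(300), "oscillating", ax)
plt.show()
```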

Figure 5.8 Poincaré plots (B/B, D/D, C/C, B/C, B/D, D/C). In any T1/T2 figure, red is T1, blue is T2.

5.4.2 One Flow of One Variant vs. Multiple Flows of Another

As mentioned in Sec. 5.1, different TCP variants coexist because today's data centers host a variety of different applications. These applications often have diverse traffic patterns, and thus, the number of flows of different coexisting variants is often not the same. To study how uneven numbers of flows of different TCP variants coexist, we conducted new experiments in the same way as in Sec.

5.4.1, where this time, we started one flow of T1 and N parallel flows of T2, and experimented using N = 10 to 100. Each subfigure in Fig. 5.9 plots the average throughput achieved by the single flow of T1 against the N flows of T2. The dashed line in each subfigure corresponds to 1/(N+1) × (bottleneck link capacity), which is the throughput that each flow would achieve had all flows shared the available bandwidth fairly.

We observe from Fig. 5.9 that when different TCP variants coexist, the single flow of T1 often does not capture its fair share, but rather captures more or less depending on which variants the T1/T2 combination is comprised of. For example, Fig. 5.9(a) shows that a single BBR flow completely starves the CUBIC or New Reno flows even when N = 100 for these legacy variants. We further observe that although a single BBR flow takes an unfair share of bandwidth when the other variant is DCTCP, the severity of unfairness is comparatively less. The reasons behind these observations are the same: BBR does not reduce its congestion window size on packet losses, but the other variants do; DCTCP performs better than CUBIC and New Reno as it experiences fewer packet losses due to ECN and thus sustains a larger average congestion window size. In the case of a single CUBIC or a single New Reno flow, shown in Figs. 5.9(b) and 5.9(c), respectively, we observe convex profiles. This happens because as the number of flows from S2 increases, the per-flow throughput of S2 increases relative to S1 due to port-blackout [109]. Port blackout is a phenomenon where, when a larger set of flows competes with a smaller set of flows, an almost full output port with a tail-drop policy results in consecutive packet drops of the smaller set of flows. This phenomenon ceases to exist when the outstanding packets are less than 3 per connection. As N increases, the number of outstanding packets per flow decreases. Any loss from there on is

Figure 5.9 Average throughput of a single flow of T1 (T1's name in subfigure caption) when coexisting with multiple flows of T2 (T2 names in legend): (a) BBR flow, (b) CUBIC flow, (c) New Reno flow, (d) DCTCP flow

68 1 1 1 0.8 0.8 0.8 0.6 0.6 0.6 0.4 0.4 0.4 0.2 B/D 0.2 B/C 0.2 B/R 0 0 0

Throughput (Gbps) 0.5 0.6 0.8 1.2 2.0 Throughput (Gbps) 0.5 0.6 0.8 1.2 2.0 Throughput (Gbps) 0.5 0.6 0.8 1.2 2.0 Minimum RTT (ms) Minimum RTT (ms) Minimum RTT (ms)

1 1 1 0.8 0.8 0.8 0.6 0.6 0.6 0.4 0.4 0.4 0.2 D/C 0.2 D/R 0.2 C/R 0 0 0

Throughput (Gbps) 0.5 0.6 0.8 1.2 2.0 Throughput (Gbps) 0.5 0.6 0.8 1.2 2.0 Throughput (Gbps) 0.5 0.6 0.8 1.2 2.0 Minimum RTT (ms) Minimum RTT (ms) Minimum RTT (ms)

Figure 5.10 Throughput w.r.t. RTT. Red is T1, blue is T2, and green is normalized ratio of throughput be- tween T1 and T2

rectified using timeouts. As timeouts are more costly than fast-retransmits, the bandwidth captured by the larger set of flows decreases, while that captured by the smaller set (in our case, a single flow) increases. For a single DCTCP flow, as shown in Fig. 5.9(d), unfairness exists just as with BBR in Fig. 5.9(a), but the severity reduces as N increases for T2. DCTCP does not show a convex profile because, unlike CUBIC and New Reno, it maintains stable congestion window sizes. As the number of CUBIC and New Reno flows increases, the available buffer capacity decreases proportionally for DCTCP, and so does its captured bandwidth.

5.4.3 Impact of Path RTTs

To study how TCP variants coexist on paths of different RTTs, we conducted experiments in the same way as in Secs. 5.4.1 and 5.4.2, this time using a single flow of each variant and paths with RTTs ranging from 500 µs to 2 ms. The path RTT here means the minimum RTT of the path between sender and receiver in the absence of any traffic. Fig. 5.10 plots the average throughput achieved by the flows of both variants for all six combinations of TCP variants. We observe from this figure that, across all RTTs, the individual bandwidths captured by the flows of the two variants follow the same order as in Fig. 5.3 for the various T1/T2 combinations. However, depending on the variants constituting any given T1/T2 combination, the extent of unfairness between coexisting flows changes differently as the path RTT changes. For example, as shown by Figs. 5.10(a), 5.10(b), and 5.10(c), BBR, being a delay-based TCP variant and thus inherently biased towards higher-RTT paths [62], captures more and more bandwidth as RTT increases and starves the coexisting variant even more. DCTCP,

CUBIC, and New Reno, which do not adjust their congestion window sizes based on path RTTs, see a minimal change in their throughputs relative to each other as path RTTs change, as shown by Figs. 5.10(d), 5.10(e), and 5.10(f). Fig. 5.11 shows the combined throughput achieved by the flows of the coexisting TCP pairs for different path RTTs. We see from this figure that when T1 ≠ T2 and one of the variants is BBR, the overall network utilization is the highest. However, if one of the variants is DCTCP and the other is CUBIC or New Reno, the network utilization is the lowest. As discussed earlier as well, this is because BBR maintains a much larger average congestion window size compared to DCTCP, while CUBIC and New Reno maintain even smaller congestion window sizes.

5.4.4 Unsaturated Traffic

We have seen that when flows of different variants coexist, they mostly do not share the available bandwidth fairly. In this section, we investigate whether this unfairness occurs even when the links are not saturated. To study this, we conducted the same experiments as before using N = 1 for each pair of TCP variants, this time limiting the maximum traffic offered by any given flow to β times the capacity of the bottleneck link. We experimented with different values of β from 0.1 to 0.5, which translates to combined offered traffic ranging from 20% to 100% of the bottleneck link's capacity. Fig. 5.12 plots the normalized throughputs of each flow (normalized w.r.t. each flow's offered traffic) as a function of the total offered traffic (normalized with the bottleneck link capacity). We observe from these figures that as long as the total offered traffic is less than 60% of the bottleneck link capacity, flows of different variants do not negatively impact each other. However, beyond 60%, the effects of coexistence on throughput fairness start appearing and become more and more severe as the amount of offered traffic approaches 100%. CUBIC and New Reno, when coexisting with the newer TCP variants, start experiencing a drop in throughput soon after the offered traffic exceeds 60% as they reduce their congestion windows rapidly on packet drops. DCTCP, when coexisting with BBR, also experiences a throughput drop, but after the offered traffic exceeds 80%. This is due to the ECN signals that enable it to keep packet losses close to zero even at such high link utilizations where CUBIC and New Reno experience rapid packet drops. Thus, DCTCP sustains larger average

Figure 5.11 Network utilization vs. minimum link RTT for the B/D, B/C, B/R, D/C, D/R, and C/R combinations

Figure 5.12 Normalized throughput per TCP as a function of total network load

congestion window sizes and is able to capture its fair share of bandwidth even at larger values of link utilization. In summary, our observations about TCP coexistence are similar when the links are unsaturated to when they are saturated (Sec. 5.4.1). The primary difference is that as the offered traffic decreases, the extent of unfairness decreases, and disappears completely when the combined offered traffic falls below 60%.

5.4.5 The Fat-Tree Topology

The observations presented until now are from the leaf-spine topology, shown in Fig. 5.1, with one congestion point. To study how a change in the number of congestion points impacts throughput fairness among coexisting variants, we conducted experiments on the Fat-Tree topology, shown in Fig. 5.2.

5.4.5.1 The Experiments

We placed four iPerf source-destination pairs on server pairs S1–D1 through S4–D4 and simultaneously started four sets of iPerf flows from the four servers, and let them run for 30 seconds. Each set was comprised of N flows. The four sets of flows used two different TCP variants, where all N flows within the same set used the same variant. We used N = 1, 8, 16, and 32 in these experiments. The two sets of flows from S1 and S3 used the same variant and the two sets from S2 and S4 used another variant. Due to the space constraints, we cannot present observations from the experiments where the four sets of flows used up to four different variants. We repeated each experiment 50 times.

5.4.5.2 Traffic Routes

We represent the set of N flows generated from the server Si to Di with FSiTj, where i ∈ [1,4] and j ∈ [1,2]. We configured our switch fabric so that the paths that the four sets of flows traverse lead to different numbers of congestion points for different sets of flows. In the switch fabric of Fig. 5.2, the flow sets FS1T1 and FS2T2 follow the path R1−A1−E1−A3−R3. Flow sets FS3T1 and FS4T2 follow the paths R2−A1−E1−A3−R4 and R2−A2−E1−A3−R4, respectively. The traffic of flow sets FS1T1 and FS2T2 shares a common bottleneck link (R1−A1). The aggregate traffic of flow sets FS1T1 and FS2T2 shares a common bottleneck link (A1−E1) with the traffic from flow set FS3T1. Finally, the aggregate traffic of flow sets FS1T1, FS2T2, and FS3T1 shares a common bottleneck link (E1−A3) with the traffic from flow set FS4T2. Consequently, the flows in sets FS1T1 and FS2T2 face three congestion points, namely CP1, CP2, and CP3; flows in set FS3T1 face two congestion points, namely CP2 and CP3; and flows in set FS4T2 face only one congestion point, namely CP3. The two flow sets, FS3T1 and FS4T2, do not create congestion at the switch R4 because they do not share the same outgoing link.

5.4.5.3 Observations

Fig. 5.13 shows the average throughput over 50 repetitions achieved by both sets of flows of each of the two TCP variants in any given combination. Despite differences in the number of congestion points encountered by each set of flows, we observe many similarities between the observations from fat-tree topology in Fig. 5.13 and the leaf-spine topology in Fig. 5.3. For example, the newer TCP variants are unfair to the legacy TCP variants irrespective of the number of parallel flows of each. Similarly, in the B/D combination, BBR captures more bandwidth when N is small, but as N increases, DCTCP captures more and more bandwidth. The similarity in these observations emphasizes that on the aggregate, the impact of TCP variants on each other’s performance is only loosely related to the underlying network topology, and depends more significantly on the number of flows and the variants that coexist.

Figure 5.13 Combined throughputs achieved by the set of flows from S1 and S3 while using TCP variant T1 (red) and from S2 and S4 while using TCP variant T2 (blue) for N = 1, 8, 16, and 32, respectively

Next, we study how the flows controlled by the same variant get impacted differently when traversing different numbers of congestion points while coexisting with the flows of other variants. For example, S1 and S3 use the same TCP variant, but traffic from S1 traverses 3 congestion points while that from S3 traverses only 2. Intuitively, one would expect that the set of flows traversing fewer congestion points will capture more bandwidth, but this is not always true, as we discuss next.

Let α(Si) represent the combined throughput of all flows originating from sender Si. Let µT1 represent the average of the throughputs achieved by the senders S1 and S3, both of which use TCP variant T1. Similarly, let µT2 represent the average of the throughputs achieved by S2 and S4, which use T2. γT1 = (α(S3) − α(S1)) / µT1 quantifies the relative difference in the throughputs achieved by S1 and S3 with respect to their average throughput. γT2 is defined similarly for S2 and S4. The value of γT1 lies in the range [−2, +2], where γT1 = 0 when the two senders achieve equal throughput, γT1 < 0 when S3 achieves lower throughput compared to S1, and γT1 > 0 otherwise. The range of γT2 is defined similarly.
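As a small concreteness aid, the γ metric defined above can be computed as follows (the example numbers are illustrative, not measured values).

```python
# Helper computing the gamma metric for one TCP variant; alpha_s1 and alpha_s3 are the
# combined throughputs of the two senders that use that variant (S1/S3 for T1, S2/S4 for T2).
def gamma(alpha_s1: float, alpha_s3: float) -> float:
    """Relative throughput difference, in [-2, +2]; 0 means equal throughput, negative means
    the sender with fewer congestion points (S3) achieved less than the one with more (S1)."""
    mu = (alpha_s1 + alpha_s3) / 2.0
    return (alpha_s3 - alpha_s1) / mu

# e.g., S1 (3 congestion points) at 0.6 Gbps and S3 (2 congestion points) at 0.3 Gbps
print(gamma(0.6, 0.3))  # about -0.67: S3 achieves less despite traversing fewer congestion points
```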

Fig. 5.14 plots the values of γTi for the six combinations of TCP variants. We observe from these figures that γ > 0 for all TCP variants when N = 1, i.e., the sender whose flows traverse fewer congestion points captures more bandwidth, which is what we intuitively inferred earlier. However, contrary to this, as N grows, γ decreases for almost all variants, i.e., the sender whose flows traverse more congestion points starts achieving increasingly more bandwidth. BBR flows behave this way because BBR sets its congestion window as a multiple of the BDP of the path. As N increases, the paths with more congestion points have larger queuing delays due to increasing traffic and thus larger BDPs compared to the paths with fewer congestion points. This leads the BBR sender whose flows traverse more congestion points to set larger congestion windows and thus send more data. CUBIC, New Reno, and DCTCP behave this way due to port blackout [109] at CP3 that causes the flows traversing more congestion points to capture more bandwidth. More specifically, at E1, N flows from S4 compete with 3×N flows coming from A1. Thus, the flows from S4 lose their bandwidth share to the remaining three senders as N increases, which results in decreasing γ values. The rate of decrease in γ with increasing N is different for different variants and depends on which variants constitute the coexisting pair. An aggressive T1 results in a steeper drop in γ, as seen for the D/C and D/R combinations. An exception to this observation is when T1 is BBR. Instead of a steep drop in the γ of CUBIC or New Reno, S4 captures the entire bandwidth share compared to S2.

The reason is that the traffic of the loss-based variant from S2 has to compete with two BBR senders at R1 and A1. Paths with higher RTTs increase the congestion window sizes of the BBR senders, which, in turn, results in throughput starvation of S2 at both R1 and A1.

5.5 Data Center Workloads

Now we turn our attention to studying how the impacts of TCP coexistence manifest when the traffic is generated by the data communication between the nodes of distributed applications deployed in data center networks. We start by describing our experimental setup, where we also discuss the placement of the two compute clusters in a data center topology that we used to execute

Figure 5.14 Relative difference between throughputs of traffic passing through different numbers of congestion points (one panel per combination: B/D, B/C, B/R, D/C, D/R, C/R)

our coexisting distributed data center workloads (Sec. 5.5.1). Next, we describe the data center workloads that we experimented with (Sec. 5.5.2). After that, we present our observations about the impact of TCP coexistence on the performance of our data center workloads for the homogeneous case, where the two compute clusters execute the same workload (Sec. 5.5.3). Last, we present our observations for the heterogeneous case, where the two clusters execute different workloads (Sec. 5.5.4).

5.5.1 Experimental Setup

We studied coexisting data center workloads with different TCP variants on both leaf-spine and fat-tree topologies. However, due to space constraints and due to many similarities in the observations from the two topologies (as we also saw in Sec. 5.4.5), we present our observations only from the leaf-spine topology. To execute any pair of distributed data center workloads, we created two Hadoop clusters and used HiBench [60] to execute workloads on these clusters. HiBench is a well-known benchmarking framework that contains several standardized Hadoop-based test-suites for a variety of workloads. Each cluster has one master and three worker nodes, as shown in Fig. 5.1. Each node Mi in this figure is the master node of the i-th Hadoop cluster, where i ∈ {1,2}, and each node Wij is the j-th worker node of the i-th cluster, where j ∈ {1,2,3}. All nodes belonging to the same cluster use the same TCP variant. The reason behind placing the master and worker nodes in the way shown in Fig. 5.1 is that with this placement, the traffic generated by the workloads creates congestion points between TCP variants at both leaf and spine switches. For example, when the worker nodes

W11 and W12 of cluster 1 communicate with each other while at the same time W21 and W22 from cluster 2 also communicate with each other, congestion occurs at CP1 and CP2 on the two leaf switches LS1 and LS2, respectively. Similarly, when W11 and W12 send traffic to W13, congestion occurs at CP4 on the spine switch SS1. During these experiments, we configured the topology to

not use the spine switch SS2 in order to ensure that congestion indeed occurred at CP1, CP2, and

CP4. We repeated each experiment 10 times, which was sufficient due to very small variations in the observed values. Finally, due to the similarity of observations between CUBIC and New Reno, we omit the observations obtained using New Reno.

5.5.2 Distributed Data Center Workloads

We selected three common types of data center workloads: streaming, MapReduce, and storage. Next, we provide essential details about these workloads. For more extensive details, we refer interested readers to HiBench documentation [60].

5.5.2.1 Streaming Workload

A typical streaming workload reads a record from a queue, processes it, and writes the processed record back to storage. To execute a streaming workload, in addition to HiBench, we also needed a streaming solution. For this, we used Kafka [45] to queue and dequeue messages generated by HiBench and Flink to process the data stream. In each cluster i, we used Wi2 and Wi3 as the

Flink nodes and Wi1 as the Kafka node. As streaming workloads do not have a job completion state, each experiment consisted of starting the streaming workload instances of the two clusters simultaneously and stopping them after 300 seconds. We quantify a cluster's job completion time as the average time to process the records. HiBench provides a variety of different streaming functions. We chose the identity function, which simply reads input records from the storage node and writes them back to it. We chose this identity function as it does not incur any noticeable processing delay. The delay between a read and the corresponding write operation incurred by a streaming instance is, therefore, entirely dictated by how fast the underlying TCP moves data from the storage node to the processing node and back to the storage node. This makes it straightforward to translate the record processing time of the streaming application to the application layer performance it achieves.
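The actual measurement uses HiBench with Flink and Kafka; purely as a hedged illustration of the identity read–process–write pattern, and of why the record processing time reduces to TCP transfer time, a simplified kafka-python loop might look as follows (the topic names and broker address are placeholders, not the ones used in the study).

```python
# Not the actual HiBench/Flink pipeline: an illustrative identity streaming loop with timing.
import time
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("hibench-in", bootstrap_servers="w11:9092")
producer = KafkaProducer(bootstrap_servers="w11:9092")

for msg in consumer:
    start = time.time()
    record = msg.value                 # identity function: no processing delay
    producer.send("hibench-out", record)
    producer.flush()                   # block until the write reaches the Kafka node
    # the measured time is dominated by how fast TCP moves the record between
    # the storage (Kafka) node and the processing node and back
    print(f"record processed in {(time.time() - start) * 1e3:.1f} ms")
```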

5.5.2.2 MapReduce Workload

MapReduce workloads are among the most commonly executed workloads in modern data centers [110]. In selecting a MapReduce workload, our goal was to choose a workload that would be network-intensive so that the impacts of the coexistence of TCP variants on the application layer performance of the workload can be readily observed. Therefore, we selected the Sort workload in the HiBench framework because it generates a large amount of traffic. In executing the sort workload, we used 40 GB of seed data as input, which sufficed to congest the network links on our testbed. The job completion time here is the time it took the cluster to sort this 40 GB of data.

Figure 5.15 Job completion times for homogeneous workloads: (a) Streaming, (b) Sort, (c) DFSIOe. T1 is represented with red, T2 with blue, and the average with green markers.

5.5.2.3 Storage Workload

A large number of user-facing applications read data from distributed storage systems to fulfil user requests [46], which motivates us to study the impact of the coexistence of TCP variants on the performance of storage workloads. For this, we selected the extended distributed file system input/output (DFSIOe) workload in HiBench, which reads multiple files from the Hadoop distributed file system (HDFS). The cumulative file size of the reads was 80GB, which sufficed to congest the links. When this workload executes, the traffic that traverses the links is synchronous, i.e. , the flows from multiple HDFS nodes originate synchronously in response to the read requests. The job completion time here is the time it took the cluster to read the 80GB data.

5.5.3 Homogeneous Workloads

The homogeneous scenarios consist of executing two instances of the same workload in the two clusters, where the differentiating factor between the two instances is the different TCP variants.

5.5.3.1 Streaming Workload

Fig. 5.15(a) plots the average record processing times of the two clusters for various combinations of TCP variants. We observe from this figure that when both clusters used the same variant, they achieved very similar record processing times. This is expected because traffics controlled by the same TCP variant are usually fair to each other. We further observe from this figure that when both clusters used BBR, they experienced at least 6% higher record processing times compared to when both clusters used DCTCP or CUBIC. This is explained by the Poincaré plot in Fig. 5.8(a), where we observed that BBR flows, even when coexisting with each other, demonstrate highly unstable throughput, which leads to lower network utilization, and thus higher record processing time. We make these two observations for our two other workloads as well and will thus not repeat them. We also observe from Fig. 5.15(a) that the relative order of record processing times for any given combination of TCP variants is very similar to the relative order of throughput achieved by that combination in Figs. 5.3, 5.12, and 5.13, which were made from benchmarking workloads. This was

expected because, like iPerf, the Kafka node offered as many streaming records for processing as the underlying TCP could carry to the Flink nodes.

5.5.3.2 Sort - MapReduce Workload

Fig. 5.15(b) plots the average job completion times of the two clusters for the sort workload using various combinations of TCP variants. We again observe that the relative order of job completion times for any given combination of TCP variants is very similar to the relative order of throughputs achieved by that combination in Figs. 5.3, 5.12, and 5.13. This is an interesting observation given that the traffic characteristics of the sort workload are considerably more complex than those of the streaming workload because the map, shuffle, and reduce phases of the MapReduce paradigm lead to unpredictable amounts and timings of data transfer across the worker nodes. From these figures, we further observe that the fairness among distributed workloads is higher than among iPerf workloads. This happens because, with iPerf workloads, the senders generate as much data as the underlying network can handle and thus the effects of TCP coexistence are at their maximum. In contrast, distributed applications generate traffic according to application logic, where traffic could be unavailable at times due to reasons such as unavailability of data, computation overhead, user input, etc. When this happens, the traffic from the other variant gets to use the entire bandwidth.

5.5.3.3 DFSIOe - Distributed Read Workload

Fig. 5.15(c) plots the average job completion times for the DFSIOe workload using various combinations of TCP variants. We again observe that the relative order of job completion times for any given combination of TCP variants is very similar to the streaming, sort, and iPerf workloads. From Fig. 5.15(c), we make another observation that when the two clusters used different TCP variants, the average job completion times of the cluster using the aggressive variant (i.e., BBR in the B/C and B/D combinations, and DCTCP in the D/C combination) were lower compared to when both clusters used that same aggressive TCP variant, while the non-aggressive variant had the same values in both cases. This shows that the use of different TCP variants in the two clusters actually improved the overall network utilization in the case of the DFSIOe workload. We did not make similar observations from Fig. 5.15(b) for the sort workload because the sort workload has two distinct phases, the map phase and the reduce phase. When executing the sort workload, most of the traffic appears on the network during the data shuffle that happens between the map and the reduce phases. Due to the inherent randomness in the map-reduce workloads in terms of which workloads are assigned to which worker nodes and which data is shuffled between which worker nodes, over time, the map and the reduce phases of the two clusters go out of sync, and thus send the traffic on the network at different times. When executing the DFSIOe workload, however, the traffic from both clusters appears immediately and simultaneously as soon as the experiment starts. Fig. 5.16 plots the amount of data sent per second by a worker node for the DFSIOe workload and for the sort workload. We can see in this figure that the traffic from the DFSIOe workload

Figure 5.16 Observed throughput vs. time for the DFSIOe and Sort workloads

starts immediately after the experiment starts while the traffic from the sort workload is unevenly distributed over time.

5.5.4 Heterogeneous Workloads

The heterogeneous scenarios consist of executing two different workloads in the two clusters using various combinations of TCP variants. We conducted experiments for two scenarios of heterogeneous workloads, where one cluster always executed the sort workload while the other cluster executed the DFSIOe workload in one scenario and the streaming workload in the other. In both these scenarios, the sort workload served as the source of unpredictable traffic while the DFSIOe and streaming workloads served as the continuous stream of data. Next, we present our observations from the experiments conducted for the two scenarios.

5.5.4.1 Sort – DFSIOe Scenario

Fig. 5.17(a) shows the average job completion times of the two workloads. We observe from this figure that the choice of TCP variant has very little impact on the job completion times of the sort workload when coexisting with the DFSIOe workload. This happens because the DFSIOe workload offers a relatively stable amount of traffic to the network while network activity in the sort workload is relatively infrequent and happens only during the data shuffle phases whose start times are random. This leads to the choice of TCP variant having a more pronounced impact on the DFSIOe workload than on the sort workload. We further observe from Fig. 5.17(a) that the choice of TCP variant does, however, phenomenally impact the DFSIOe workload. When the cluster executing the DFSIOe workload uses the relatively less aggressive variant of the combination, i.e., CUBIC in the B/C and D/C combinations, and DCTCP in the B/D combination, it takes longer to complete the workload compared to when the cluster uses the more aggressive variant. This observation has important implications. For example, it shows that when traffic from a workload that requires high throughput, such as DFSIOe, competes at the bottleneck link with traffic from a MapReduce type of workload that does not send traffic all the time, a network administrator can improve the throughput

performance of the workload that requires high throughput by assigning it a more aggressive TCP variant without significantly deteriorating the performance of the MapReduce workload.

Figure 5.17 Heterogeneous workloads: (a) Sort–DFSIOe, (b) Sort–Streaming. Sort uses T1 (red), the other workload uses T2 (blue).

5.5.4.2 Sort – Streaming Scenario

Fig. 5.17(b) plots the average job completion times of the sort workload and the average record processing times of the streaming workload. Similar to the sort–DFSIOe scenario, the choice of TCP variant has very little impact on the sort workload when coexisting with the streaming workload, but has a significant impact on the record processing times of the streaming workload. The record processing times were the worst when the cluster executing the streaming workload used BBR. The processing times improved when the streaming workload cluster used CUBIC and were the best when it used DCTCP. There are two reasons behind this. First, with a smaller number of flows, BBR results in a large number of retransmissions, as we saw in Fig. 5.5, when the BBR traffic dominates the output buffer. This results in an increased delay at the application level due to network-level buffering and retransmissions. Second, BBR periodically probes the network for the BDP, during which it cuts its congestion window size to just 4 packets. This mechanism can persist for up to 200 ms [19], which severely hampers the record processing times. We further observe from Fig. 5.17(b) that when the streaming workload cluster used BBR, it achieved its best record processing times when the sort cluster did not use BBR, i.e., coexisting TCP variants lead to better overall network performance than if both clusters used BBR. This happens because BBR has an unstable throughput, as we saw in the Poincaré plots in Fig. 5.8. While BBR impacts the throughput stability of the coexisting TCP variants as well, when the sort cluster uses any other TCP variant, the overall throughput instability of the network reduces. This, in turn, leads to a reduction in the record processing times of the streaming workload. When the streaming workload cluster used DCTCP, we observed the opposite behavior: record processing times were the lowest when both clusters used DCTCP but deteriorated when the sort workload cluster used a different TCP variant. This happens due to the highly stable throughput and near

zero packet losses achieved by DCTCP.

5.6 Predicting Coexisting TCP variants

Our analysis of the TCP coexistence problem in earlier sections revealed that traffic corresponding to a non-homogeneous mix of TCP variants suffers from throughput unfairness among the individual flows. Therefore, if throughput fairness is an important requirement in a given deployment, then it is crucial to homogenize the TCP traffic as much as possible. There are two ways to approach this problem: a centralized approach, where a central entity controls the TCP parameters of every host in the cluster, or a decentralized end-to-end approach, where hosts change their TCP parameters based on local measurements. In other words, if a local host knows the congestion control variant of the coexisting (remote) TCP flow, then it can change its TCP variant to that of the remote flow to achieve fair bandwidth allocation. In this work, we pursue the end-to-end rather than the centralized approach as the former is scalable, convenient to deploy, and does not need knowledge of the entire network. In the following sections, we will describe our methodology to predict the congestion control variant of a remote TCP flow using the traffic characteristics observed in the local TCP flow and how it differs from prior work. We further demonstrate the effectiveness of our modeling techniques by deploying the trained model to predict remote TCP variants in real time and illustrate how this prediction can lead to fair bandwidth allocation.
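The end-to-end reaction described above can be sketched as follows: once the remote variant is predicted, the local host switches its own default congestion control to match it. Writing the sysctl below requires root, per-socket switching via TCP_CONGESTION is an alternative, and the predict_remote_variant() call is a hypothetical placeholder for the CNN model described in the following sections.

```python
# Hedged sketch of homogenizing the local TCP variant with the predicted remote variant (Linux).
VARIANT_NAMES = {0: "bbr", 1: "dctcp", 2: "cubic", 3: "reno"}

def homogenize(local_variant: str, predicted_class: int) -> None:
    """Switch the host's default congestion control to the predicted remote variant."""
    remote_variant = VARIANT_NAMES[predicted_class]
    if remote_variant != local_variant:
        # new flows on this host will now use the same variant as the coexisting remote flows
        with open("/proc/sys/net/ipv4/tcp_congestion_control", "w") as f:
            f.write(remote_variant)

# homogenize("cubic", predict_remote_variant(local_flow_features))  # hypothetical prediction call
```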

5.6.1 Methodology

Before we describe our methodology, it is important that we establish the objective of this section. Our objective in this study is the following: given the traffic characteristics measured from a local TCP flow (which are modeled as a set of features such as throughput, loss, congestion window size, etc.), is it possible to predict the congestion control variant of a remote TCP flow with which the local TCP flow is coexisting at a bottleneck link?

5.6.1.1 Available Features

We use the same data that was already recorded in our earlier measurements for characterizing the TCP coexistence problem in various scenarios. From the available features for each time instant at every host, we chose normalized throughput (normalized w.r.t. the maximum link bandwidth) and packet loss per sampling period as the characteristic feature set to predict the remote TCP variant. The reason we chose only these two features (and why they were sufficient) is that the effects of the coexisting TCP variant were mainly felt as throughput starvation and high packet loss. For example, in the case where a BBR flow coexisted with a CUBIC flow, we saw that the CUBIC flow was severely starved of a fair bandwidth allocation and also faced high packet losses. Moreover, the shape of the throughput-vs.-time profile of the CUBIC flow depended heavily on the remote TCP variant (an unstable profile with a remote BBR flow and a stable profile with DCTCP). These two features can

also be recorded at the flow level without any modifications to the process or application that generates them.

5.6.1.2 Data modeling

We first transform the recorded measurements into a time series of throughput and packet loss values for each experiment iteration. Since our sampling rate was 10 samples per second and all our experiments ran for 300 seconds, the resulting vector at the end of this step was of length 3000 for throughput and packet loss. This vector is further supplemented with encoded information about the local and remote TCP variants. The encodings for BBR, DCTCP, CUBIC, and New Reno are 0, 1, 2, and 3, respectively. We will illustrate the final format of each input feature set with the following example. Consider the case where a local host is using CUBIC and its TCP flow is competing at a bottleneck link with a BBR flow. Here the local TCP variant is CUBIC (which is encoded as 2) and the remote variant is BBR (encoded as 0). For this iteration, the preliminary feature set consists of a vector of length 3000 whose fields are the time step, the throughput, and the packet loss in that time step. Next, we add an additional field, called local TCP, and fill this field with the encoded value of the local TCP variant. Finally, the output value for this feature set will be the encoded value of the remote TCP variant, i.e., 0. We follow the exact same process for the BBR flow, but the local TCP variant now becomes BBR and the remote variant becomes CUBIC. So each iteration results in two feature sets. The above three steps complete the data modeling for a single iteration of our earlier measurements. We repeat this process for all iterations and scenarios to compile a corpus of all the available feature sets for training.
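A minimal sketch of this feature-set construction is shown below, assuming the raw per-flow logs are already available as arrays of per-sample throughput and packet loss (10 Hz over 300 s, i.e., 3000 samples each). The exact column layout, array types, and the synthetic placeholder logs are illustrative, not the precise format used in the study.

```python
# Illustrative feature-set construction for one experiment iteration.
import numpy as np

TCP_ENCODING = {"bbr": 0, "dctcp": 1, "cubic": 2, "reno": 3}

def build_feature_set(throughput, loss, local_tcp, remote_tcp, link_gbps=1.0):
    """Return (features, label): per-time-step [normalized throughput, packet loss, local TCP code];
    the label is the encoded remote TCP variant."""
    thr = np.asarray(throughput, dtype=float) / link_gbps      # normalize w.r.t. link bandwidth
    loss = np.asarray(loss, dtype=float)
    local = np.full_like(thr, TCP_ENCODING[local_tcp])
    features = np.stack([thr, loss, local], axis=1)            # shape: (3000, 3)
    return features, TCP_ENCODING[remote_tcp]

# synthetic placeholder logs (3000 samples each) just to make the sketch runnable
thr_cubic, loss_cubic = np.random.rand(3000), np.random.poisson(0.1, 3000)
thr_bbr, loss_bbr = np.random.rand(3000), np.random.poisson(0.1, 3000)
# each iteration yields two feature sets, one per side of the coexisting pair
f_cubic, y_cubic = build_feature_set(thr_cubic, loss_cubic, "cubic", "bbr")
f_bbr, y_bbr = build_feature_set(thr_bbr, loss_bbr, "bbr", "cubic")
```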

5.6.1.3 System Definition

There exists a variety of machine learning techniques in the literature to classify time series data into their corresponding labels. Out of these techniques, numerous prior works have shown that Convolutional Neural Networks (CNNs) perform extremely well when it comes to predicting labels for time series data (activity recognition, gesture recognition, etc.). Since our objective is to predict the remote TCP variant only from measurements on the local TCP flow, we are essentially creating a time series profile of the local TCP flow and predicting the label of the remote TCP variant. Given the similarity between our problem definition and the problems that CNNs are suited to solve, we have used a CNN model to train on our data corpus. Figure 5.18 provides the pipeline and definition of the CNN model we have used in our training. It first consists of the transformation of the recorded measurements into their corresponding feature sets. This feature set is then fed to the CNN model. The CNN model itself consists of two convolution layers, each containing 100 filters. The output of these two convolution layers is fed to a maximum pooling layer of window size 2. We repeat the convolution process one more time, now with 160 filters and an average pooling layer at the end. Finally, we add a dropout layer to reduce overfitting before generating the output classes.
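A minimal Keras sketch of the described architecture follows. The filter counts (100, 100, 160), the max-pooling window of 2, and the 50% dropout come from the description and Fig. 5.18; the kernel size, activations, global average pooling placement, optimizer, and softmax output layer are assumptions added only to make the sketch runnable.

```python
# Hedged sketch of the 1-D CNN used to predict the remote TCP variant from windowed local-flow features.
import tensorflow as tf
from tensorflow.keras import layers, models

WINDOW = 100    # samples per window slice, e.g., a 10 s slice at the 10 Hz sampling rate
N_FEATURES = 3  # normalized throughput, packet loss, encoded local TCP variant
N_CLASSES = 4   # BBR, DCTCP, CUBIC, New Reno

model = models.Sequential([
    layers.Input(shape=(WINDOW, N_FEATURES)),
    layers.Conv1D(100, kernel_size=10, activation="relu"),  # first convolution layer, 100 filters
    layers.Conv1D(100, kernel_size=10, activation="relu"),  # second convolution layer, 100 filters
    layers.MaxPooling1D(pool_size=2),                       # max pooling with window size 2
    layers.Conv1D(160, kernel_size=10, activation="relu"),  # third convolution layer, 160 filters
    layers.GlobalAveragePooling1D(),                        # average pooling
    layers.Dropout(0.5),                                    # 50% dropout to reduce overfitting
    layers.Dense(N_CLASSES, activation="softmax"),          # remote-variant output classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```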

Figure 5.18 Model definition of the convolutional neural network used for predicting remote TCP variants: logs → 3-column feature set → two convolution layers (100 filters each) → max pooling (window length 2) → convolution layer (160 filters) with average pooling → dropout (50% of neurons set to 0 to reduce overfitting) → output.

5.6.1.4 Model Training

In Sec. 5.6.1.2 we mentioned that we transform the measurement logs into feature sets of length 3000. In practice, however, feeding the CNN with vectors of length 3000 can lead to overfitting and might not provide high accuracy for unknown samples. Moreover, recording 3000 samples to predict the remote TCP variant implies a long data collection duration in real-world implementations. Therefore, we further divide our feature vector into multiple slices. The length of each slice is determined by a parameter called the window slice (W). Given the 10 Hz sampling rate, each 1-second slice contains 10 data points. In training our CNN model, we fixed the batch size and the number of epochs to 16 and 100, respectively. All model trainings were repeated 10 times to generate average accuracies. The training corpus is split into 70% for training and 30% for testing the model. After each training iteration, we recorded the prediction accuracy on the test data and the confusion matrix of prediction classes.
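The sketch below illustrates the slicing and training setup under these settings; build_model() refers to the sketch in Sec. 5.6.1.3, load_corpus() is a hypothetical loader for the corpus assembled in Sec. 5.6.1.2, and non-overlapping slices are an assumption.

import numpy as np
from sklearn.model_selection import train_test_split

SAMPLING_HZ = 10

def slice_series(X, y, w_seconds):
    """Cut each full-length series into windows of W seconds."""
    w = w_seconds * SAMPLING_HZ
    Xs, ys = [], []
    for series, label in zip(X, y):
        for start in range(0, len(series) - w + 1, w):
            Xs.append(series[start:start + w])
            ys.append(label)
    return np.array(Xs), np.array(ys)

X, y = load_corpus()                                   # hypothetical corpus loader
X_w, y_w = slice_series(X, y, w_seconds=20)            # e.g., W = 20 s
X_tr, X_te, y_tr, y_te = train_test_split(X_w, y_w, test_size=0.3)
model = build_model(window_len=X_w.shape[1], n_features=X_w.shape[2])
model.fit(X_tr, y_tr, batch_size=16, epochs=100, validation_data=(X_te, y_te))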

Figure 5.19 Prediction accuracy (%) vs. window slice (s) for the three traffic scenarios (Unsaturated, Fat-Tree, and Leaf-Spine).

Figure 5.20 Effect of the number of parallel flows and traffic saturation on prediction accuracy for the leaf-spine, fat-tree, and unsaturated traffic scenarios: (a) Leaf-Spine and (b) Fat-Tree plot accuracy (%) vs. number of flows (N), and (c) Unsaturated plots accuracy (%) vs. maximum flow throughput (Mbps), each for window slices of 10 s and 30 s.

5.6.2 Prediction Accuracy

We trained the model described in the earlier section on data collected for three different traffic scenarios, namely, a single congestion point (leaf-spine topology), multiple congestion points (fat-tree topology), and unsaturated traffic. Fig. 5.19 shows the prediction accuracies of these three scenarios as a function of increasing window slices of the feature set fed to the CNN model. We found that unsaturated traffic had the highest prediction accuracy, followed by saturated traffic passing through a single congestion point. Prediction accuracies for the remote TCP variant were lowest for traffic that had to traverse multiple congestion points, as seen in the case of the fat-tree topology. Larger window slices only seemed to increase the prediction accuracy for unsaturated traffic. The other two scenarios exhibited a non-linear response, where a window slice of length 20 s showed the highest accuracy.

The reason why unsaturated traffic showed higher prediction accuracies is that throughput unfairness manifested in individual flows even when their individual flow demand was less than the link bandwidth capacity. Moreover, BBR and DCTCP showed different amounts of throughput unfairness at the same un-saturation point, as seen in Figs. 5.12(a)-5.12(f), further decreasing the ambiguity in resolving the remote TCP variant. In the case of saturated traffic, both BBR and DCTCP flows captured roughly the same proportion of the available bandwidth, thus starving the CUBIC flows. This resulted in occasional erroneous resolutions, mainly between BBR and DCTCP. But feeding feature sets with larger window slices decreased the error rate before increasing it again, implying that the window slice is a tunable knob for the prediction accuracy.

We further resolved the prediction accuracies to see the effect of local factors such as the number of parallel flows or the amount of unsaturated traffic. Figure 5.20 plots these prediction accuracies for window slices of 10 s and 30 s. In the case of saturated traffic through single and multiple congestion points, an increase in the number of parallel flows increased the prediction error, whereas for unsaturated traffic, the prediction accuracy increased with more traffic on the link. When the number of flows is increased, the variance in the captured bandwidth of each individual flow decreases because of the smaller fair-share bandwidth, which further increases the ambiguity in distinguishing the remote TCP variant.

5.6.2.1 Effect of TCP Variants

Figures 5.21 and 5.22 plot the confusion matrices for the three different traffic scenarios discussed earlier in this section. There are two sets of matrices, for window slices of 10 seconds and 30 seconds, respectively. These matrices demonstrate that some congestion control variants are more difficult to predict than others when they are used by the remote TCP flows. In the case of unsaturated traffic, the ambiguity in predicting the remote TCP variant is approximately the same among all three variants, and increasing the window slice further decreased this ambiguity, as seen in the results discussed earlier in this section.

Saturated traffic, however, displayed interesting results, particularly between BBR and DCTCP. To understand this ambiguity, we have to revisit one of the features of BBR congestion control, namely RTT probing. BBR periodically limits its packets in flight to a low value, typically 4, to drain the bottleneck buffers and probe the minimum RTT of the link. In doing so, it momentarily loses bandwidth to the competing flow (which it recaptures after coming out of the probing phase). This periodic probing for the minimum link RTT creates an unstable throughput vs. time profile in the coexisting TCP flows. Therefore, the model learns to associate unstable throughput in the local flow with a remote BBR flow. Increasing the window slice also increases the probability of finding throughput oscillations in the local flow, which is why DCTCP has lower prediction accuracy for larger window slices in the case of leaf-spine traffic. The fat-tree topology, however, results in lower accuracies for predicting a remote BBR flow because of unfair bandwidth sharing among the BBR flows themselves. Since BBR regulates its congestion window size as a function of the link BDP, higher RTT links (or links with more congestion points in our case) have higher BDPs and are unfair to BBR flows on lower BDP links, as already demonstrated in an earlier study [62]. Increasing the number of parallel flows in this traffic scenario only increased the ambiguity and resulted in overall higher error rates for BBR when compared to DCTCP or CUBIC.

5.6.3 Online Prediction

The earlier section demonstrated that it is possible to predict the congestion control variant of the remote TCP flow from measurements on the local flow. An advantage of this prediction is that the congestion control of the local flow can be changed when its traffic coexists with a variant that starves the local flow of its fair share of the link bandwidth. Therefore, it is important to do this prediction in real time rather than on a corpus of cold data. In this section, we present a proof-of-concept of an online system that predicts the remote congestion control variant and adapts the local variant to it to achieve fair bandwidth sharing. In the following sections, we describe the methodology and show the results of this online system.

Figure 5.21 Effect of TCP variants on prediction accuracy (confusion matrices of true vs. predicted class, in %, for BBR, DCTCP, and Cubic) for window slice = 10 seconds: (a) Leaf-spine, (b) Fat-tree, (c) Unsaturated.

Figure 5.22 Effect of TCP variants on prediction accuracy (confusion matrices of true vs. predicted class, in %, for BBR, DCTCP, and Cubic) for window slice = 30 seconds: (a) Leaf-spine, (b) Fat-tree, (c) Unsaturated.

Figure 5.23 Testbed (dumbbell topology) for testing the online prediction tool: local sender H1 → local receiver H3 and remote sender H2 → remote receiver H4 share the bottleneck link.

5.6.3.1 Methodology

The design of our current CNN model simplifies the process of transforming it into an online system. The only change we need to make is to source the feature sets in real time rather than from a corpus of pre-existing data. We achieved this by recording the instantaneous throughput and retransmissions every 100 milliseconds (the sampling period for the data collected in the Benchmarking section) and storing them in a log buffer. The contents of the buffer are transformed into the feature set format after a set amount of time has elapsed (which we called a window slice in the earlier section). The transformed feature set is then fed to the trained model to predict the remote TCP variant. If the remote variant is not the same as the current variant, and if we could increase the fair bandwidth share of the local flow by changing the local variant (based on our observations from the Benchmarking section), then the application changes the congestion control variant of the local host.

Although the above system definition seems simple, we encountered multiple implementation issues during our experiments. First, how do we know that our prediction was right? Next, when do we schedule the next prediction? We provide simple solutions to these concerns in our implementation, but we acknowledge that better and more robust solutions can be implemented. The scope of this section is to provide a proof-of-concept of achieving a fair bandwidth share in the presence of unfair TCP flows; the optimal implementation of such a system is left to future work. In order to judge the current prediction, we keep a running window average and variance of the last 20 valid measured throughput values. If the current average exceeds the average observed under the previous congestion control variant by more than 4 variance units, then we are satisfied with our prediction. If the prediction does not appear to change the average measured throughput, we continue to make more predictions until we see an increase in the windowed average. We stop our predictions (but continue to collect data) when this increase happens, and we only re-enter the prediction phase if we see a drop in the running average by more than 4 variance units.
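A sketch of this decision logic is shown below. The reading of "4 variance units" as four times the running variance, the once-per-window-slice prediction cadence, and the predict_and_switch() helper are assumptions rather than the dissertation's implementation.

from collections import deque
import statistics

WINDOW = 20        # running stats over the last 20 valid throughput samples
MARGIN = 4         # "variance units" used for the accept / re-enter tests
SLICE = 200        # samples per window slice (20 s at 10 Hz)

def predict_and_switch():
    """Placeholder: feed the buffered window slice to the trained CNN and,
    if beneficial, change the local congestion control variant."""

class PredictionGate:
    def __init__(self):
        self.recent = deque(maxlen=WINDOW)
        self.prev_mean = None        # throughput level under the old variant
        self.accepted_mean = None    # level after an accepted prediction
        self.predicting = True
        self.since_prediction = 0

    def on_sample(self, throughput):
        self.recent.append(throughput)
        self.since_prediction += 1
        if len(self.recent) < WINDOW:
            return
        mean = statistics.mean(self.recent)
        var = statistics.pvariance(self.recent)
        if self.predicting:
            if self.prev_mean is None:
                self.prev_mean = mean
            if mean > self.prev_mean + MARGIN * var:
                self.predicting = False          # satisfied with the prediction
                self.accepted_mean = mean
            elif self.since_prediction >= SLICE:
                predict_and_switch()             # one prediction per window slice
                self.since_prediction = 0
        elif mean < self.accepted_mean - MARGIN * var:
            self.predicting = True               # throughput dropped: re-enter
            self.prev_mean = None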

Figure 5.24 Throughput vs. time for a local CUBIC flow against different background flows: (a) background is BBR, (b) background is DCTCP.

5.6.3.2 Results

We implemented the proof-of-concept as a Python application running on an Ubuntu 18.04 host. We simplified our testing environment to a dumbbell topology, as shown in Fig. 5.23. Here, a background remote flow shares the same bottleneck link with our local TCP flow. Our prediction system runs on host H1. Since ongoing TCP flows cannot reflect the updated congestion control variant in the OS, we also start a new TCP flow right after we make a prediction.
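The dissertation does not spell out the mechanism used to switch variants; the following hedged sketch shows two standard Linux mechanisms that could serve this purpose, a per-socket option for a newly started flow and the system-wide default.

import socket
import subprocess

def open_flow_with_cc(host, port, cc=b"bbr"):
    """Open a new TCP connection that uses the given congestion control
    (Linux-only; the variant's kernel module must be loaded)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, cc)
    s.connect((host, port))
    return s

def set_default_cc(cc="bbr"):
    """Change the system-wide default congestion control (requires root);
    affects newly created sockets only."""
    subprocess.run(["sysctl", "-w", f"net.ipv4.tcp_congestion_control={cc}"],
                   check=True)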

Figure 5.25 Error in remote TCP variant prediction on the first attempt

Figures 5.24(a) and 5.24(b) plot throughput vs. time where the local flow was initially using CUBIC and the remote flows were using BBR and DCTCP, respectively. In these plots, the first 5-second period is considered warm-up and this data is not fed to the CNN model. Data buffering starts after this 5-second period (purple region) for a total of 20 seconds. The resulting feature set is then fed to the trained model. In both figures, the predictions were correct and the subsequent local TCP flow achieved a fair bandwidth share. We also observed that in some cases the predictions were not correct. Figure 5.25 plots a case where the background flow was BBR and the local flow was using CUBIC. There was an incorrect prediction in the first round (red region), followed by the correct prediction (green region), at which point the two flows shared the link bandwidth.

5.7 Conclusion

In this work, we have extensively studied the impact of coexisting TCP variants on each other's performance using real testbeds and data center topologies. We conducted our experiments using iPerf benchmarking traffic as well as real data center workloads. We studied four variants commonly used in data centers: BBR, DCTCP, CUBIC, and New Reno. Our study revealed several observations that have important implications for data centers from the perspectives of throughput fairness, network utilization, throughput stability, and job completion times. For example, a single BBR flow can significantly disrupt almost any CUBIC workload, irrespective of how large and long-standing it is, but not always a DCTCP workload; and coexisting flows of different TCP variants achieve higher throughputs when the paths they traverse have more congestion points. We hope that this work will motivate researchers to develop automated workload placement approaches for coexisting variants and to further explore innovative aspects of the impact of TCP coexistence.

CHAPTER

6

CHARACTERIZING THE PERFORMANCE OF QUIC ON ANDROID AND WEAR OS DEVICES

6.1 Introduction

Since its release to the wider public in 2013, QUIC has gained a lot of popularity, both in industry and within the research community. It improved the performance of many key functionalities offered by the existing TCP+TLS+HTTP/2 stack, such as reducing connection establishment latency and improving congestion control, along with introducing new features like connection migration and multiplexing without head-of-line blocking. Consequently, it attracted a lot of attention from the research community to study [81, 119], measure [29, 73, 20], and augment [32, 34] the current QUIC implementations to assist in its wider deployment. Although a number of measurement studies have been carried out to evaluate the performance of QUIC when compared to TCP [73, 89, 81], one particular area that has been overlooked so far is its impact on low-powered and low-resource devices like smart-phones and smart-watches. Evaluating QUIC's performance on smart-phones and smart-watches is important for two reasons. First, QUIC is an application-level protocol. It implements all the necessary transport-level logic, congestion control, and flow control at the application level. Since every byte of application data now needs to be processed by an additional software layer, the computation demand that QUIC exerts on the operating system is greater than that of TCP (which is implemented at the kernel level
in most general-purpose OSes). Thus, studying the performance of QUIC in devices with limited resources, like smart-watches and smart-phones, is important in determining the feasibility of incorporating QUIC in their corresponding application designs. Second, the nature of the operating systems of smart-phones and smart-watches is different. Server machines and desktop PCs use general-purpose operating systems that are flexible and offer fine-grained access to the underlying software and network stacks. Therefore, server applications can have complete control over the entire network stack, and individual applications can implement their own logic to service their network requests. For example, the Google Chrome browser implements its own engine to generate web requests over QUIC [8]. In contrast, the operating systems in smart-watches and smart-phones are usually restrictive w.r.t. allowing fine-grained software access to applications. Mobile applications use the application programming interfaces (APIs) of services managed by the underlying operating system to fulfill their networking operations. Therefore, evaluating QUIC in these environments provides insights about its implementation in mobile operating systems and its interactions with other OS services.

Approach: In this work, we conduct a comprehensive measurement study of the performance of QUIC when compared to TCP on Android (smart-phone) and Wear OS (smart-watch) devices. We chose Android and Wear OS because of their widespread popularity and similarity in software architecture [11, 136]. The measurement study particularly focuses on the amount of time, called the Request Completion Time (RCT), that QUIC and TCP take to completely service an HTTP request from these devices. We measure RCTs for various object sizes, in different network conditions, and for both up-link and down-link traffic. We also focus on the interaction between Cronet (the networking library used in this work) and the OS services during connection migration, i.e., roaming from one connectivity type to another due to connection issues (for example, from WiFi to LTE). We conducted several experiments to investigate the role of QUIC and TCP in the overall RCTs for this aspect. An important observation from our measurement study is that the configuration of a request (the set of parameters such as request size, traffic direction, etc.) can determine whether using QUIC improves or deteriorates the RCTs. Therefore, we can minimize the long-term RCT distribution of an application if we can automatically choose the appropriate transport protocol for all the different request configurations it generates. With this purpose in mind, we propose Dynamic Transport Selection (DTS), a framework built on top of the existing Cronet library that probabilistically chooses the best-fit transport protocol for a specific request configuration. We implemented and deployed DTS as a Java library and evaluated it on both the Android and Wear OS devices. We found that opportunistically choosing the transport protocol improves the overall RCT by as much as 41.76% when compared to using only QUIC or only TCP for all requests.

Key Observations: To the best of our knowledge, this is the first comprehensive evaluation of QUIC's performance on smart-phones and the first measurement study of the performance of QUIC on smart-watches. We enumerate the following key findings from our measurement study:

• We found that the performance improvements from using QUIC are not uniform across all request sizes, but depend on factors like the traffic size, traffic direction, connection type, device hardware specifications (smart-phone or smart-watch), etc. In general, when all other variables are kept constant, we found that smaller requests over QUIC have lower RCTs when compared to larger requests.

• QUIC is more resilient to changes in link RTTs than TCP. This behavior becomes more prominent for up-link traffic, where the RCTs diverge increasingly when using TCP and increase only linearly when using QUIC.

• Both QUIC and TCP have shown improved performance when storing the connection state for smaller request sizes compared to when terminating and reestablishing connections, particularly in high latency networks, such as LTE.

• We identified a bug in the Cronet library that, depending on the choice of transport protocol, prevents it from automatically reverting to the un-metered link after a connection migration. We found that communications using QUIC do not revert back to using the WiFi link even when it becomes available again, whereas TCP connections did not have this issue.

Chapter Organization: The chapter is organized as follows. In Sec. 6.2, we compare and contrast our work with prior art. Sec. 6.3 gives a brief overview of some of the QUIC and Cronet features that we frequently refer back to in this work. Sec. 6.4 presents the measurement methodology and the testbed setup, which is followed by the measurement study itself in Secs. 6.5 and 6.6. Sec. 6.7 introduces and evaluates our DTS framework. We conclude the chapter in Sec. 6.8.

6.2 Related Work

Performance characterization of QUIC: Several measurement studies [73, 119, 29, 20] have characterized the performance of QUIC in terms of both low-level metrics such as throughput and latency and high-level metrics like page load time (PLT). [73] proposes a methodology for the analysis of QUIC in desktop environments and compares the PLTs of web pages requested using QUIC and TCP. They also uncovered a bug in Chromium's implementation of the QUIC server that led to its poor performance because of improper configuration of the initial congestion window size and slow start threshold. Another study [119] measures the adoption of QUIC by analyzing the entire IPv4 address space. The authors of this study found that Google pushes about half of its traffic using QUIC. Other studies [29, 20] compared QUIC with other application-level protocols like HTTP and SPDY. While most of the prior art focuses on using QUIC in web servers and desktop environments, our work is the first to comprehensively investigate QUIC's performance on smart-phones and wearable devices. Design of QUIC-based frameworks: There is a growing body of work [89, 131, 32, 81, 34] that tries to maximize the performance of web applications by improving several functional aspects of
QUIC. [32] and [34] augment the current QUIC implementation by extending multi-path support and the ability to add custom plugins (such as forward error correction, connection monitoring, etc.) to it. [89] constructs a decision tree from an analysis of traffic traces to simplify the process of choosing among the HTTP, SPDY, and QUIC protocols depending on network conditions. While our work does not make any modification to the current QUIC implementation, we propose a dynamic framework for the Android platform that decides whether to use QUIC or not depending on the application's request configuration. Performance analysis and improvements in wearable and smart-phone devices: Numerous studies in the past have characterized and proposed solutions to improve the performance of various aspects of the smart-phone and smart-watch ecosystem, such as energy usage [24], background data transfer [142], network usage [77, 43, 56, 87, 33], and CPU usage and software architecture [86]. In [88], the authors propose a framework that adds network capacity to smart-phones by leveraging multi-path communication with an accompanying smart-watch. Here, the smart-watch acts as an external agent that downloads data on behalf of the smart-phone and relays it back over BT or WiFi. Another study [147] did a thorough analysis of the networking stack in Wear OS devices, which resulted in several interesting observations, such as delayed connection hand-off between the phone and the wearable device and inflated end-to-end latency due to phone-side bufferbloat, to name a few. Although prior art has extensively studied network performance as a whole in smart-phone and wearable devices, our work supplements the existing literature by exclusively focusing on QUIC.

6.3 Background

In this section, we provide an overview of some of the important features of the QUIC protocol along with some implementation details of Android's Cronet library.

6.3.1 QUIC Protocol

QUIC is an application-layer transport protocol built on top of UDP to address several important performance issues which plague TCP. We list some of the important improvements below.

0-RTT Connection: One of the most important features of QUIC is 0-RTT connection establishment. QUIC achieves this by storing the cryptographic credentials received from the server during the initial connection, and using these cached credentials in subsequent requests to encrypt payload data in the first packet itself.

Improved Congestion Control: Unlike TCP, QUIC uses unique sequence numbers for original and retransmitted frames. This eliminates TCP's retransmission ambiguity problem, where the original and retransmitted frames share the same sequence number. Unique sequence numbers improve round-trip-time estimation and provide richer information about network congestion to the congestion control module.

Rapid Deployment: Unlike TCP, any changes to the QUIC protocol can be packaged with the application code and be rapidly distributed, deployed, and experimented with across the globe. This allows
faster update cycles, better security, and innovative use of transport protocols to improve existing services [97, 104].

6.3.2 Cronet Library

Cronet is an adaptation of the open-source Chromium networking stack for the Android platform and is implemented as a library for its mobile applications. It imports several important features from the Chromium networking stack, such as native support for the HTTP, HTTP/2, and QUIC protocols, asynchronous requests for non-blocking network operations, and resource caching for faster request completions. A simple HTTP request using Cronet first involves instantiating a "CronetEngine" object, which contains the configuration that all requests have to follow (such as the caching policy, compression scheme, etc.). Individual HTTP requests are then built from this "CronetEngine" object with the "newUrlRequestBuilder" method. Each request is associated with a mandatory set of callback functions that implement the logic to handle the request's response, failure, completion, and redirection.

6.4 Data Collection

In this section, we present details about the testbed, the network environment, and the experimental scenarios that we have evaluated in this work.

6.4.1 Testbed

The testbed consists of an Android smart-phone, a Wear OS smart-watch, and a desktop computer serving as a control machine. All of them are connected to the same enterprise WiFi network. Additionally, both the smart-phone and the smart-watch are also connected to an LTE network, as shown in Fig. 6.1. At the other end of the testbed is a server machine instantiated in the Google Cloud Platform (GCP). The server machine is a dual-core virtual machine with 8 GB of memory running the Ubuntu 16.04 LTS operating system. We installed a QUIC server and a simple HTTP server in the GCP VM to respond to GET and POST requests. The QUIC server is the Chromium implementation [112] of QUIC version 46. For TCP, we used CUBIC, the default variant that comes with the OSes. The control machine is a quad-core Ubuntu desktop with 32 GB of memory. The main functionality of the control machine is to orchestrate the data collection process and also provide an additional view of the network conditions by logging the ping times to the server.

6.4.2 Network Environment

We experimented with the two common connectivity types found in smart-phones and smart-watches, namely WiFi and LTE. The WiFi network is an IEEE 802.11ac capable enterprise-scale network serving tens of devices. For the LTE connection, we used Verizon, a major LTE connectivity provider in the US, on both devices. By conducting our experiments over the public Internet and not in an isolated testbed, we ensured that we observed real-world connection quality and ping times.


Figure 6.1 Testbed: the smart-phone, smart-watch, and control machine connect over the Internet to the GCP server.

The WiFi network quality and the ping time distribution to the remote server are presented in Fig. 6.2 and Fig. 6.3. Note that the WiFi network has an average ping time of 22 ms to the GCP server, while the LTE network has 49.5 ms. Moreover, despite being co-located and connected to the same WiFi network, the smart-phone reported a better RSSI than the smart-watch. To study the impact of link RTTs, we also emulated additional network delay on the WiFi network using the TC tool [126]. The additional delay values that we experimented with are listed in Table 6.1, along with other important parameters used in this study.

Figure 6.2 p.d.f. of WiFi RSSI (dBm) on the two devices.
Figure 6.3 p.d.f. of ping times (ms) to the GCP server.

6.4.3 Metrics and Experimental Scenarios

The primary metric for performance evaluation in this work is the Request Completion Time (RCT). As the name suggests, it is the total amount of time elapsed between calling the GET/POST method and the arrival of the complete response/acknowledgement at the application. We developed a
custom suite of Android and Wear OS apps to conduct our experiments. While conducting our experiments, we collected logcat (RCT, WiFi RSSI) and dmtrace (function tracing) data on the devices, as well as ping data from the control machine to the GCP server. We tested two scenarios in this work. First, we evaluate the RCTs for requests of different sizes in various network and link connectivity environments for both the Android and Wear OS devices. Next, we evaluate the link connection migration functionality in Android and Wear OS devices for both the QUIC and TCP protocols. To ease the burden of referring back and forth, we describe the specific methodology of the experiments in their respective sections. Unless otherwise stated, we repeated all our experiments at least 30 times. To eliminate differences due to external factors, we carried out the experiments with QUIC and TCP back-to-back. Additionally, when conducting experiments with WiFi, the LTE radio was switched off, and vice versa.

Table 6.1 Testbed and experiment parameters

Parameter      Value(s)
Object Sizes   1KB, 10KB, 500KB, 2MB, 25MB
Extra Delay    50ms, 100ms, 150ms
Devices        Pixel 2 [138], TicWatch Pro [129]

6.5 Request Completion Time

In this section, we first introduce our methodology of data collection to assess RCTs in various scenarios. After that, we present our observations and analyses for each of those scenarios, such as requesting a single object with or without network emulation, requesting data for the first time from the server, etc.

6.5.1 Methodology

The entire data collection process is automated and controlled by the control machine. Each experimental scenario starts with the control machine installing a set of custom Android or Wear OS applications on the respective devices. The custom Java applications implement the logic to generate requests of different sizes and store the corresponding RCTs and link quality in log files. Next, the end-to-end delay is controlled by the TC tool installed on the GCP server machine. The control machine then executes the application to collect data for a particular scenario. The sequence of steps performed by the Android or Wear OS application includes turning the appropriate wireless radio on or off, requesting an object using QUIC and waiting until the response is received, and then requesting another object of the same size using TCP, while recording the RCTs and function calls in both cases. The server machine in GCP responds to GET requests from the Android and Wear OS devices by sending an HTML object of the requested size. The responses are static HTML pages that are cached in the server machine in order to minimize the differences in RCTs caused by processing and computation times in the server. Similarly, the POST body is pre-computed and cached in the application memory to minimize the computation overhead in the RCTs.

Note that in this work, we evaluate QUIC and TCP as a whole and as an integral constituent of the hardware/software combination in the Android smart-phone and smart-watch ecosystem. Detangling the impact and performance of individual components of the protocols and the respective OSes for root-cause analysis involves substantial changes to the code and is an interesting topic for future work. The purpose of this work is rather to emphasize that the benefits of QUIC do not vary uniformly and monotonically (either increasing or decreasing) w.r.t. request size, link type, or choice of hardware/software, such that a simple hard-coded decision tree could decide the transport protocol for realizing smaller RCTs. Rather, the priority of this work is to establish that non-uniform differences in RCTs exist due to contextual parameters such as the device hardware and software, traffic size, traffic direction, etc. We use our empirical observations as evidence and intuition to develop a framework later in the chapter that intelligently chooses the transport service at run time to decrease RCTs.

Figure 6.4 RCTs (ms) for each request size: red (blue) lines represent QUIC (TCP) and solid (dotted) lines represent smart-phone (smart-watch). Each panel plots observations for a given traffic direction and link type pair: (a) GET, WiFi; (b) GET, LTE; (c) POST, WiFi; (d) POST, LTE.

6.5.2 Evaluations

We first present the absolute RCT values in various scenarios and for both the smart-phone and smart-watch in Figs. 6.4(a) to 6.4(d). Following that, we analyze the RCT values w.r.t. how they react
to changes in traffic size, direction, link type, minimum RTT, and the effect of cold-starting. The RCT values are averaged over 30 iterations for each scenario. Figs. 6.4(a) to 6.4(d) plot these average values for the smart-phone and smart-watch for GET and POST requests over WiFi and LTE connections. An immediate and interesting observation from these figures is the impact that the choice of hardware/software has on the RCT values. We observe that the RCTs of the smart-watch are always greater than or at least equal to the corresponding RCT values of the smart-phone when every other parameter is kept constant. RCT is an application-level metric and encapsulates various software and hardware delays encountered while serving a request, such as buffering delays when copying data between software layers, lower CPU speed, etc. Since smart-watches are relatively low on resources (in terms of processing speed, memory, battery, etc.) compared to smart-phones, they tend to have higher RCTs.

6.5.2.1 Effect of Traffic Size and Direction

Figs. 6.5(a) and 6.5(b) plot heatmaps of the ratio of RCTs using TCP to RCTs using QUIC for the smart-phone and the smart-watch, respectively. The Y axis represents the traffic direction and link type combination, and the X axis represents the request sizes. Lighter shades in the heatmap imply that the requests over QUIC had lower RCTs than over TCP.

Observations: For the smart-phone, we see a uniform gradient w.r.t. performance improvements when QUIC was used to serve both GET and POST requests. Smaller objects, irrespective of the traffic direction and link type, saw reduced RCTs when using QUIC rather than TCP. But these performance gains diminish as the request size increases, and for larger objects, requests using TCP showed smaller RCTs than QUIC. We do not see this uniform transition of RCT performance for the smart-watch. Here, GET requests have a similar gradient profile as observed on the smart-phone, meaning smaller downlink traffic over WiFi and LTE benefits from using QUIC, whereas uplink traffic displayed an inverse trend w.r.t. request sizes, i.e., smaller requests over TCP and larger requests over QUIC had lower RCTs.

Discussion: There are two perspectives from which we need to look at the above observations to comprehend them. First is the effect of the request sizes. Data transmission for smaller request sizes can be completed both in less time and with fewer packets when compared to larger requests. When the RCTs of smaller requests are comparable to the link RTT, a saving of even one round trip for connection establishment significantly improves the overall performance. QUIC has faster connection establishment times than TCP because, when making a request to an already visited QUIC server, the QUIC protocol allows payload data to be sent in the very first packet using cryptographic details from the earlier connection.

Figure 6.5 Ratio of RCT_TCP to RCT_QUIC for (a) smart-phone and (b) smart-watch, to study the effect of request size and direction. Lighter shades imply QUIC had lower RCT.

The second perspective is the dissimilarities arising due to the traffic direction. Hardware specifications play an important role when evaluating the performance of a transport protocol, especially QUIC because of its user-space implementation. An important difference to highlight between the downlink and uplink scenarios is the hardware and software of the traffic generator. For POST requests, the traffic is generated by resource-constrained hardware and shaped by network protocols abstracted behind the Android or Wear OS networking APIs. Contrast this with GET requests, where the traffic is generated by a commodity GCP server machine, which has high processing power and uses a general-purpose operating system. In the case of QUIC, the additional software layer that each byte of data now has to traverse increases the RCTs compared to TCP (note that this computation cost builds up for larger requests because there are more packets to process). A decrease in processor speed (comparing the smart-phone to the smart-watch) further inflates the RCTs due to this additional traversal. But optimizations in the software design of Wear OS that are transparent to application developers could also affect the RCTs, because TCP suffered more than QUIC on the smart-watch for larger uplink traffic in spite of the slower processor. Further root-cause analysis to pinpoint the differences between the Android and Wear OS networking logic requires substantial modifications to the software stack because of the tight coupling between the hardware type (smart-phone or smart-watch) and the corresponding software (Android or Wear OS). But these observations further emphasize our earlier argument that the performance variations of QUIC on smart-phone and smart-watch platforms are not intuitive, which motivates the need for a trained model to choose the right transport protocol.

6.5.2.2 Effect of Link Type

Smart-phones and smart-watches usually alternate between WiFi and LTE for their data communication. Therefore, observing the impact of switching from a WiFi interface to an LTE interface on the RCTs is important from the application designer's perspective. A protocol which shows larger RCT changes due to changing link types is generally undesirable compared to one that shows smaller changes. The goal is to minimize the disruption to the quality of experience felt by the users. For example, a video application would need to downgrade its playback resolution when a user migrates from a WiFi to an LTE network if the protocol over which the application operates deteriorates its performance excessively because of this switch.

Figure 6.6 Ratio of RCT_LTE to RCT_WiFi for different request sizes for (a) smart-phone and (b) smart-watch. Lighter shades imply the LTE network had a larger RCT.

In this section, we evaluate the impact that the link type has on the RCTs when using QUIC and TCP. Figs. 6.6(a) and 6.6(b) plot the ratio of RCT_LTE to RCT_WiFi for each of the scenarios described by the X and Y axis labels. Lighter shades imply larger ratios. Note that changing the link type from WiFi to LTE always increased the RCT in our experiments because of the higher minimum RTTs observed over the LTE network.

Observations: The observations from Figs. 6.6(a) and 6.6(b) can be divided into downlink and uplink categories. For downlink traffic, both the smart-phone and the smart-watch show a similar change in RCT behavior when the link type changes. The downlink TCP traffic is hardly affected by the link type, whereas the RCT_LTE to RCT_WiFi ratio for downlink QUIC traffic increases linearly with the traffic size. For uplink traffic, there is a contrasting trend in RCT behavior between the smart-phone and the smart-watch. In the case of the smart-phone, the observations follow our intuition: the RCT ratios increase with the request sizes. This effect is observed for both QUIC and TCP connections. But we see the opposite trend between RCT ratios and request sizes on the smart-watch.

Discussion: As per common intuition, the RCT_LTE to RCT_WiFi ratio should increase with the request size, as the per-packet rise in RCT from using the LTE network compounds with the increase in the number of packets. But the effect of request size was only minimal for downlink traffic because the smart-phone or smart-watch was only sending acknowledgments and not the data packets. In contrast, when the devices are responsible for transmitting the data packets, the link type becomes relevant in the first hop for the uplink scenario. In this case, the overall RCT can be broken down into the time it takes for the packet to leave the device after being queued by the application and the round-trip propagation time for the packet (assuming the processing delay is minimal in an over-provisioned server). Since the round-trip propagation time would be the same for both the smart-phone and the smart-watch, the differentiating factor is the buffering delay. Considering the above breakdown of RCTs, our observations show that a smart-watch running Wear OS has a higher buffering delay than an Android smart-phone, although both operating systems are derived from the same design platform. The differences in packet scheduling could be to conserve battery, in which case packets would be buffered to amortize the costs of using an expensive radio.

Figure 6.7 RCT ratios when extra network delay (50 ms, 100 ms, 150 ms) is introduced, for QUIC GET and POST requests of different sizes.
Figure 6.8 RCT ratios when extra network delay (50 ms, 100 ms, 150 ms) is introduced, for TCP GET and POST requests of different sizes.

6.5.2.3 Effect of Network Delay

The observations in the previous section were made from experiments conducted in a high-throughput and low-latency enterprise WiFi network. But real-world residential and commercial wireless networks usually have much higher latencies for several reasons, such as throttling by the service provider or crowded public access points. Therefore, to evaluate RCTs on high RTT links, we extended our experiments in Sec. 6.5.2 for the WiFi network by emulating extra delay along the communication path. Similar to the experiments in Sec. 6.5.2, all responses to GET and POST requests are cached on their respective end-hosts to minimize the computation overhead included in the RCTs. Fig. 6.7 plots the ratio of the RCT of QUIC in the network with additional delay (emulated using the TC tool) to the RCT recorded without any network emulation. Fig. 6.8 plots the same ratio for TCP. To avoid duplicate evaluations and keep the results concise, we omit the observations for the smart-watch and present the RCT ratios for the smart-phone only.

Observations: First, we observed that extra delay in the communication path has a proportional effect on the RCT values. An increase in the link RTT increased the observed RCT for both uplink and downlink traffic. This observation holds true for both the QUIC and TCP protocols. Second, among the transport protocols, QUIC is less penalized by the increase in RTTs than TCP. This observation emphasizes the improvements in the congestion control module of QUIC that were noted in previous performance studies as well [73, 81]. Third, the improvements in the congestion control module of QUIC become even more noticeable for uplink traffic. Note that TCP traffic exhibits diverging RCT ratios for larger uploads, whereas QUIC traffic displayed a stable increase of RCTs with the uploaded object size. An important takeaway from the above observations is that QUIC improves the RCT of both downlink and uplink traffic when compared to TCP in high-latency networks, and this improvement is more significant for uplink than for downlink traffic.

6.5.2.4 Effect of Cold-Starts

The evaluations so far have considered RCTs on warmed-up connections only, i.e., requests which are sent to a previously visited server. In these cases, the cached connection details are utilized to immediately start the data transfer. Both TCP and QUIC enjoy these benefits because of the faster handshake and other improvements introduced in TLS 1.3 (TCP still has to incur a 1-RTT delay to complete the TCP handshake, though).

Figure 6.9 Ratio of RCTs observed for cold connections to warmed-up connections for (a) QUIC and (b) TCP, for each device, traffic direction, and link type. Lighter colors imply a larger improvement in RCT when using warmed-up connections.

However, smart-phones and smart-watches have memory and energy constraints, and operating systems routinely employ mechanisms to close, replace, or evict dormant applications from memory to free up resources. Application developers themselves can also do this, which raises an important question about the trade-off between holding on to system resources and improving the connection RCT. We answer these questions by studying the improvement in RCT observed between cold and warmed-up connections.

Observations: Fig. 6.9(a) plots the average ratios of RCTs of cold connections using QUIC to warmed-up connections using QUIC. Fig. 6.9(b) plots the same for TCP. We present the following two takeaways from these figures. First, both QUIC and TCP benefit from using warmed-up connections on high delay links. Since the RCTs for smaller requests are comparable with the link RTTs, even a difference of one RTT in connection establishment time considerably improves the connection performance. For example, warmed-up requests over LTE showed more improvement than the same requests over WiFi because the LTE link has higher RTTs. Second, in contrast to the common practice of freeing up resources in compute- and energy-constrained devices like smart-watches, retaining the connection details actually improved the RCT for small request sizes. We suspect that the cost of spawning a new Cronet engine when the application is brought to the foreground on these low-powered devices inflates the overall RCT compared to storing and retrieving it from memory.

6.6 Connection Migration

This section explains the connection migration scenario that we studied in this work along with the data collection process and our observations.

6.6.1 Background

One of the most desirable qualities of smartphones and smartwatches, apart from their small form factor, is their mobility. During the course of a typical day, the smartphone or smartwatch of an average user connects to many different networks and connectivity link types [43]. But these network migrations and connection discontinuities are rarely felt by the user applications because of the transparent handling of these migrations by the networking APIs of Android OS and Wear OS. In this section, we study the efficiency of the Cronet API in handling such connection discontinuities and transparently migrating from WiFi to LTE and vice versa for QUIC and TCP connections.

6.6.2 Data Collection

Application Logic: Developers usually make use of custom logic in their applications to handle link-layer connection migrations. The actual logic to handle such scenarios varies with the implementation, such as periodic polling or interrupts, and is therefore different for every application. The goal of our application is to successfully upload a 5MB object to the remote server during a
link-level connection migration. To achieve this, we make two design choices in our application logic (a sketch of this chunked-upload logic is shown below). First, we break down the upload object into 10KB chunks, for a total of 500 chunks, and upload each of them to the remote server while keeping track of which chunk is currently being uploaded. Second, in the event of a connection migration (notified by a broadcast intent from the OS), we pause the upload and resume it as soon as the link is ready. At no point do we directly select the connectivity link; the Cronet module handles the selection transparently for us. The application only controls the amount of data passed to Cronet's request handler API, when to send that data to the server, and which transport protocol to use for the upload.
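The actual application is a Java app built on Cronet; the following is an illustrative Python sketch of the chunking and pause/resume logic only. The endpoint URL, the chunk-index header, and the threading.Event standing in for Android's broadcast intent are assumptions.

import threading
import requests

CHUNK_SIZE = 10 * 1024                        # 10 KB chunks
UPLOAD_URL = "https://example.com/upload"     # hypothetical endpoint

link_ready = threading.Event()
link_ready.set()                              # link starts out connected (WiFi)

def upload_object(data: bytes):
    """Upload `data` in 10 KB chunks, pausing while the link is migrating."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    for idx, chunk in enumerate(chunks):
        link_ready.wait()                     # pause here during a migration
        requests.post(UPLOAD_URL, data=chunk,
                      headers={"X-Chunk-Index": str(idx)})

# A connectivity callback (a broadcast receiver on Android) would toggle the
# event: link_ready.clear() on disconnect, link_ready.set() when ready again.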

Methodology: We reuse the same automation framework used in Sec. 6.5 to carry out the experiments for the connection migration scenarios. Each experimental run has two steps. The first step involves starting a 5MB data upload to the GCP server using either QUIC or TCP. The second step is the connection migration step, where we programmatically turn off the WiFi interface. The LTE interface is left turned on for the entire duration of the experiment. Since the default choice of wireless interface in both Android and Wear OS is the WiFi link (when available), each data upload starts off using the WiFi interface first, and then the LTE interface when WiFi is not available. The routing of data between these two interfaces is handled by the Cronet module. There are two parameters that we control during the connection migration scenario. The first parameter is the amount of time after the beginning of the data upload at which the WiFi interface is turned off, denoted as the Disconnection Time. The second parameter is the duration of the time window for which the WiFi interface stays turned off before it is programmatically switched on again. This window duration is denoted as the LTE Window. An example link state change is shown in Fig. 6.10, where the link changes from WiFi to LTE at t=5 sec for a duration of one second.

Figure 6.10 Link state (WiFi/LTE) vs. time (sec): link state change at t=5 for one second (Disconnection Time = 5 s, LTE Window = 1 s), after which the device is connected to WiFi again.

At the end of every iteration, we record the total time it took to complete the 5MB upload along with the RCTs of the individual chunks. Similar to the data collection process in Sec. 6.5, we alternated between QUIC and TCP while collecting data to reduce environmental and network noise. All the observations in this section are averaged over ten iterations.

6.6.3 Evaluations

Before we analyze the aggregate RCTs of the upload data with a connection migration in between, let us first get an intuition of how the aggregate RCTs should vary with increasing Disconnection Time and LTE Window. Recall from Sec. 6.5 that the RCTs of upload data over the LTE network were higher than those over WiFi. Considering this takeaway, the RCTs of upload chunks sent over the LTE network should be higher than the RCTs of the chunks transmitted over WiFi. In other words, the aggregate RCT values should increase with increasing LTE Window size. Note that the Disconnection Time should not influence the aggregate RCT as long as there is enough data left to send after the LTE Window duration has elapsed.

Observations: Figs. 6.11 and 6.12 plot the heatmaps of the aggregate RCTs (in seconds) for TCP and QUIC, respectively. We present two observations from these plots. First, the RCT gradient in the heatmap for TCP exhibits the behavior discussed earlier in this section. The aggregate RCTs only increased with increasing LTE Window size and are fairly independent of when the link state changes (when Cronet was configured to use TCP). Second, we see the opposite behavior in the RCT gradients when we used QUIC. Here, the aggregate RCTs depended only on when the link state change happened and not on the duration of the LTE Window. With QUIC, the relation between the aggregate RCTs and the Disconnection Time is inverse, i.e., the later the disconnect from WiFi happened, the lower the aggregate RCT.

Figure 6.11 Aggregate RCTs (s) during connection migration while using TCP, for different Disconnection Times and LTE Windows.
Figure 6.12 Aggregate RCTs (s) during connection migration while using QUIC, for different Disconnection Times and LTE Windows.

To investigate why we observe this unexpected behavior with QUIC connections, we look at the RCT values of the individual chunks for both the TCP and QUIC uploads and plot them in Fig. 6.13 for Disconnection Time = 1 s and LTE Window = 5 s. The RCT vs. N-th chunk plots for other configurations are omitted because of space constraints and similar observations. We can immediately see from this figure that the RCT values for the TCP upload closely follow the pattern of the link state plot in Fig. 6.10. This implies that the Cronet module is able to transparently route data from the application to the WiFi interface when it becomes online again. In contrast, judging from the RCT values of QUIC, we can infer that the application continues to use the LTE interface even when the WiFi interface is available, because Cronet did not route those chunks to the WiFi interface even when it was turned
back on. We further confirmed this observation with an over-the-air wireless signal capture using a WiFi sniffer, where we found that traffic from the device does not resume on the WiFi interface after the WiFi interface is switched back on. This discussion reveals that the current implementation of the Cronet module in Android and Wear OS has a serious functional limitation for uplink QUIC traffic. We found that when the link state changes from LTE to WiFi, the Cronet module is unable to re-route the data to the WiFi interface. This leads both to performance issues, where uploads take more time because they use a high-latency network instead of WiFi, and to unnecessary data charges for the user. However, we did not observe the same limitation when connections used TCP with Cronet.

6.7 Transport Protocol Selection

Measurements in Sec. 6.5 reveal that the improvements in RCTs due to QUIC are not felt for all request sizes and in all network scenarios. Therefore, hard-coding the choice of transport protocol in the application code will not result in the best performance in all scenarios. The obvious question to ask at this point is whether we can construct a framework that opportunistically chooses the transport protocol based on the request type and network configuration without any user intervention. In this section, we present a lightweight framework, called Dynamic Transport Selection (DTS), that is built on top of the Cronet library and whose function is to choose the best-fit transport protocol for a given request context. We first define the desired characteristics of the framework and then describe its overall architecture and its position in the network stack. Next, we present details about the individual components of the framework. Finally, we implement a lightweight Java Android library and evaluate the improvements in RCT when using this framework. Before proceeding, we emphasize that our objective is only to demonstrate that the correct choice of transport protocol can significantly improve the RCTs for resource-constrained devices. We acknowledge that solutions better than what we present next can be developed to improve RCTs even further.

Figure 6.13 RCTs (ms) of individual chunks (vs. N-th chunk) for Disconnection Time = 1 s and LTE Window = 5 s; the switch to LTE is marked.

6.7.1 Design Goals

1. Minimize RCTs: The primary function of the framework is to minimize the RCTs by choosing the appropriate transport protocol (QUIC/TCP). The framework should maintain a model that transparently does this selection based on the request characteristics (discussed later in this section).
2. Adapt to changing network conditions: The framework should be flexible enough to adapt its model of selecting the appropriate transport protocol based on the changing network environment.
3. Lightweight: The framework should be lightweight in terms of memory, compute, and network usage so that it can be implemented even in devices with resource-constrained hardware and/or software, like low-end Android smartphones or Wear OS smartwatches.
4. Minimal effort to integrate into applications: The framework should not introduce additional declarations or modifications other than the already used Cronet APIs. All the state maintenance logic should be handled within the framework itself, and application developers should be able to integrate the framework's functionality with as few code changes as possible. The application developer also should not have to implement any additional logic to evaluate the accuracy of the predictions or manually trigger retraining of the model.

6.7.2 DTS Architecture

DTS is located between the application code base and the Cronet library, and leverages many of Cronet's APIs to implement its request-processing functionality. Applications submit network requests to DTS, which in turn extracts various features from the request and predicts the appropriate transport protocol. DTS then utilizes Cronet's builder methods to construct and maintain a CronetEngine for each transport choice (QUIC/TCP) to service the request. Since the requests are ultimately handled by the Cronet library, the corresponding responses are asynchronously read by the callback methods implemented in the application logic to handle those responses. In other words, the framework introduces no additional delay due to response buffering and does not impede the responsiveness of the applications. The DTS framework comprises five modules, shown in Fig. 6.14 and described next (an illustrative sketch of these module boundaries follows the list).

1. Feature Extraction: The feature extraction module takes the HTTP request as its input and outputs a set of features, Fi, which are used by the DTS framework to select the appropriate transport protocol.

2. Transport Selection: The transport selection module is the main decision-making body of the DTS framework and feeds the features extracted by the Feature Extraction module to its prediction model. The output of this module is the best-fit transport protocol to use for the current HTTP request.

3. Accuracy Predictor: After an HTTP request is serviced using the predicted best-fit transport protocol, the accuracy predictor decides whether the prediction was accurate. The input to this module is the RCT of the current HTTP request, and it outputs whether the corresponding transport protocol prediction is valid. If the accuracy predictor deems the current prediction valid, it updates the model with this new observation; otherwise, it drops the current observation.

4. Model Database: We store all the necessary state and model descriptions in this module. It contains a set of data structures for easy and fast data retrieval and acts as the centralized data storage for all the other modules in the DTS framework.

5. Request Processor: This module implements the networking functionality of the DTS framework using Cronet's APIs. The original request from the application, along with the best-fit transport prediction from the Transport Selection module, are the inputs to this module. Since the received data is directly read by the application-defined callback functions, this module does not need to maintain any receive-side buffers. However, it does measure all the RCTs and communicates them to the Accuracy Predictor.
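To make the division of responsibilities concrete, the Java sketch below shows one possible set of interfaces for the five modules. The interface and method names are our own illustrative assumptions, not the actual DTS source.

import java.util.Map;

// Illustrative interfaces only; the real DTS module boundaries may differ.
final class DtsModules {

    /** Transport choices that DTS can make for a request. */
    enum Transport { QUIC, TCP }

    /** 1. Extracts the feature set Fi from an HTTP request. */
    interface FeatureExtraction {
        Map<String, String> extract(String url, String method, boolean firstConnection);
    }

    /** 2. Predicts the best-fit transport for a feature set. */
    interface TransportSelection {
        Transport predict(Map<String, String> features);
    }

    /** 3. Decides whether an observed RCT validates the prediction and updates the model. */
    interface AccuracyPredictor {
        void observe(Map<String, String> features, Transport used, long rctMillis);
    }

    /** 4. Central storage for counts, probabilities, and smoothed RCT statistics. */
    interface ModelDatabase {
        long quicCount(Map<String, String> features);
        long tcpCount(Map<String, String> features);
    }

    /** 5. Services the request over the chosen transport using Cronet. */
    interface RequestProcessor {
        void submit(String url, Transport transport);
    }

    private DtsModules() {}
}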

Figure 6.14 DTS architecture and its individual modules, positioned between the application and the Cronet engine.

6.7.3 Implementation

In this section, we describe the implementation details of all the DTS modules and provide the rationale behind them. Fig. 6.15 shows how information flows across the different modules of DTS in our implementation.

6.7.3.1 Feature Extraction

The first step in predicting the appropriate transport protocol for a request is to determine the relevant features. The features can be extracted from the request itself, such as from the headers of the HTTP request (direction of connection, estimated size of data transfer), from the application state-space (number of connections, time since last request, etc.), or from the current configuration of the device (phone/watch, WiFi/LTE, etc.). Real-world applications can use many more and altogether different features, but for this work, our DTS framework uses the following five features to predict the best-fit transport protocol: estimated request size for GET and upload size for POST, wireless connectivity type, device type, traffic direction, and whether this is the first connection. Designing and implementing a system to estimate the response data size for any given request is beyond the scope of this work. Rather, we use a simple lookup table that transforms the request URL to its corresponding response size. We acknowledge that there exist better, dynamic solutions that can predict the response data size based on the application's network and data history, but since that is not the motivation of this work, we make use of a simple solution to demonstrate the effectiveness of the DTS framework.
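As a concrete illustration of this simple approach, the sketch below extracts the five features and resolves the estimated response size through a static URL-to-size lookup table. The table contents, class names, and field names are placeholders introduced for illustration, not part of the original implementation.

import java.util.HashMap;
import java.util.Map;

// Illustrative feature extractor; the lookup table and names are assumptions.
final class FeatureExtractor {

    // Simple URL -> expected response size (bytes) lookup table, as described in the text.
    private static final Map<String, Long> SIZE_TABLE = new HashMap<>();
    static {
        SIZE_TABLE.put("/api/thumbnail", 10L * 1024);          // ~10 KB
        SIZE_TABLE.put("/api/photo", 500L * 1024);             // ~500 KB
        SIZE_TABLE.put("/api/video-chunk", 2L * 1024 * 1024);  // ~2 MB
    }

    /** The five features used by DTS, bundled into one value object. */
    static final class Features {
        final long estimatedSizeBytes;   // GET response size or POST upload size
        final String linkType;           // "WIFI" or "LTE"
        final String deviceType;         // "PHONE" or "WATCH"
        final String direction;          // "GET" (download) or "POST" (upload)
        final boolean firstConnection;   // first request to this server?

        Features(long size, String link, String device, String direction, boolean first) {
            this.estimatedSizeBytes = size;
            this.linkType = link;
            this.deviceType = device;
            this.direction = direction;
            this.firstConnection = first;
        }
    }

    static Features extract(String path, String method, long uploadBytes,
                            String linkType, String deviceType, boolean firstConnection) {
        long size = "POST".equals(method)
                ? uploadBytes                              // upload size is known to the app
                : SIZE_TABLE.getOrDefault(path, 1024L);    // fall back to ~1 KB if unknown
        return new Features(size, linkType, deviceType, method, firstConnection);
    }
}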

6.7.3.2 Transport Selection

Selecting the appropriate transport protocol from prior knowledge of RCTs is a classification problem. Among the plethora of machine learning algorithms available in the literature, we choose to apply Naive Bayes classification to the request features to select the best-fit transport protocol. There are two reasons for selecting the Naive Bayes approach over other off-the-shelf approaches such as neural networks. First, the Naive Bayes classifier is a simple probabilistic model with minimal computation overhead. Recall from Sec. 6.7.1 that one of our design goals is to have a lightweight framework with a very small compute, network, and memory footprint. This ensures that DTS can be deployed even on devices with low resource availability, like smart-watches. Second, the model should be flexible and adapt to changing network environments, which is our second design goal. Since the Bayes classifier is only a function of the conditional probabilities of the individual events, the output of the model can be controlled by adjusting the probabilities of those events (the Accuracy Predictor module is responsible for keeping the model up to date). This eliminates the need to retrain the model for minor changes in the network.

Figure 6.15 Information flow across the different modules of DTS: requests flow from Feature Extraction through Transport Selection to the Request Processor and the Cronet Engine, and the measured RCT of each (request, transport) pair is reported to the Accuracy Predictor, which updates the data store.

Let Fi denote the feature set output by the Feature Extraction module, and let αQ (αT) denote the scenarios in the training data where QUIC (TCP) was the best-fit transport protocol. Then the conditional probability that QUIC (TCP) is the best-fit transport protocol for the given feature set is given by

P(QUIC | Fi) = P(αQ) × P(Fi | αQ) / P(Fi)

P(TCP | Fi) = P(αT) × P(Fi | αT) / P(Fi)

where P(Fi | αQ) and P(Fi | αT) are the conditional probabilities that the feature set Fi yielded lower RCTs over all cases where QUIC (respectively, TCP) was the better protocol to use. Since P(Fi) is a common denominator in both predictions, taking the ratio of P(QUIC | Fi) to P(TCP | Fi), which we denote as β, yields

β = θ × P(Fi | αQ) / P(Fi | αT)

where θ = P(αQ)/P(αT) is a constant multiplicative factor (constant because αQ and αT are fixed for a given training set). We choose QUIC when β > 1 and TCP otherwise.
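A minimal sketch of this decision rule is shown below, assuming the per-feature-set counters described in the next subsection are available. The add-one smoothing that guards against empty counters is our own assumption; the text does not specify how zero counts are handled.

// Illustrative Naive Bayes decision; counter names mirror the text but the class is an assumption.
final class TransportSelector {

    // Global counts of requests serviced with each protocol (numObsQuic / numObsTcp in the text).
    private long numObsQuic;
    private long numObsTcp;

    /** Decide the transport for a feature set given its per-protocol counts from Prob_Table. */
    boolean chooseQuic(long quicCountForFi, long tcpCountForFi) {
        // theta = P(alpha_Q) / P(alpha_T); add-one smoothing is our assumption.
        double theta = (numObsQuic + 1.0) / (numObsTcp + 1.0);

        // P(Fi | alpha_Q) and P(Fi | alpha_T), again with add-one smoothing.
        double pFiGivenQuic = (quicCountForFi + 1.0) / (numObsQuic + 1.0);
        double pFiGivenTcp  = (tcpCountForFi + 1.0) / (numObsTcp + 1.0);

        double beta = theta * pFiGivenQuic / pFiGivenTcp;
        return beta > 1.0;   // QUIC if beta > 1, TCP otherwise
    }

    void recordObservation(boolean usedQuic) {
        if (usedQuic) numObsQuic++; else numObsTcp++;
    }
}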

6.7.3.3 Model Database

To store and retrieve the data required to compute β and the relevant conditional probabilities, the storage module hosts a set of hash tables and other global variables to keep track of different quantities. The DTS framework stores the total number of requests it has serviced using QUIC and TCP in the global variables numObsQuic and numObsTcp. These values are used to calculate the θ used in the Transport Selection module. There are two hash tables, called Prob_Table and RCT_Table. The feature set Fi is used as the key in both hash tables. Since the constituents of Fi are nominal values, there exists a finite set of keys in both tables. The Prob_Table keeps a count of the number of times that a particular Fi used QUIC or TCP. These values are used to calculate the conditional probabilities P(Fi | αQ) and P(Fi | αT) for the predicted Fi. The RCT_Table stores the smoothed RCT and the RCT variance values for every Fi. The Accuracy Predictor determines whether a particular prediction resulted in the expected RCT behavior, and consequently, the smoothed RCT and variance are computed only over those observations deemed valid by the Accuracy Predictor.
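For concreteness, a minimal sketch of the storage layout implied above is shown below, using a string-encoded feature set as the hash-table key. The entry classes and field names are our own illustrative choices.

import java.util.concurrent.ConcurrentHashMap;

// Illustrative storage module; class and field names are assumptions, not the DTS source.
final class ModelDatabase {

    /** Per-feature-set counts of how often QUIC/TCP was used and won (Prob_Table). */
    static final class ProbEntry {
        long quicCount;
        long tcpCount;
    }

    /** Per-feature-set smoothed RCT statistics (RCT_Table). */
    static final class RctEntry {
        double smoothedRct;      // sRCT
        double rctVariance;      // vRCT
        int nonConfObs;          // consecutive non-conforming observations
    }

    // Global counters used to compute theta in the Transport Selection module.
    long numObsQuic;
    long numObsTcp;

    // Both tables are keyed by the (nominal-valued) feature set Fi, encoded as a string.
    final ConcurrentHashMap<String, ProbEntry> probTable = new ConcurrentHashMap<>();
    final ConcurrentHashMap<String, RctEntry> rctTable = new ConcurrentHashMap<>();

    ProbEntry probFor(String featureKey) {
        return probTable.computeIfAbsent(featureKey, k -> new ProbEntry());
    }

    RctEntry rctFor(String featureKey) {
        return rctTable.computeIfAbsent(featureKey, k -> new RctEntry());
    }
}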

6.7.3.4 Request Handler

This module uses Cronet’s APIs to complete network requests raised by the application. Specifically, it maintains two CronetEngine instances, one for QUIC and one for TCP, and transparently multiplexes the data communication over either engine depending on the prediction of the Transport Selection module. This module exposes all the publicly available APIs of the Cronet library under the prefix "DTS". For example, the following syntax would be used to build a CronetEngine object using the Cronet library.

myBuilder = new CronetEngine.Builder(context);

With the DTS framework, the developers need not change any of their application logic; they only rename Cronet’s APIs with the "DTS" prefix. Therefore, the new code becomes

myBuilder = new DTSCronetEngine.Builder(context);
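A rough sketch of how such a drop-in wrapper could be structured internally is shown below: it keeps one CronetEngine with QUIC enabled and one with QUIC disabled, and hands back whichever the Transport Selection module picks. The wrapper class body is our assumption; only CronetEngine.Builder and its enableQuic/addQuicHint options are actual Cronet APIs, and the host and ports are placeholders.

import android.content.Context;
import org.chromium.net.CronetEngine;

// Illustrative wrapper; DTSCronetEngine internals are assumptions, not the actual DTS code.
final class DtsEngines {

    private final CronetEngine quicEngine;
    private final CronetEngine tcpEngine;

    DtsEngines(Context context, String host) {
        // Engine used when the model predicts QUIC as the best-fit transport.
        quicEngine = new CronetEngine.Builder(context)
                .enableQuic(true)
                .addQuicHint(host, 443, 443)   // let Cronet use QUIC toward this origin
                .build();

        // Engine used when the model predicts TCP (QUIC disabled, so HTTP/2 or HTTP/1.1 over TCP).
        tcpEngine = new CronetEngine.Builder(context)
                .enableQuic(false)
                .build();
    }

    /** Return the engine matching the Transport Selection prediction. */
    CronetEngine engineFor(boolean useQuic) {
        return useQuic ? quicEngine : tcpEngine;
    }
}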

6.7.3.5 Accuracy Predictor

This module has two responsibilities. First, it needs to classify whether the prediction made by the Transport Selection module is valid. Second, if the RCTs start to change due to changes in network conditions, this module has to detect these changes and retrain the model accordingly. We explain both processes below.
Classify Predictions: To classify the predictions as valid or not, we keep track of two quantities, the smoothed RCT and the smoothed variance in RCTs. The smoothed RCT is the moving average of the RCTs of all the observations that the Accuracy Predictor classifies as valid. If λ and µ are the smoothing factors for the smoothed RCT and variance, respectively, and rct is the RCT of the current valid observation, then the smoothed RCT, sRCT, and variance, vRCT, are computed as:

sRCT = λ × sRCT + (1 − λ) × rct

vRCT = µ × vRCT + (1 − µ) × |sRCT − rct|

As the RCT values are a direct function of the choice of transport protocol, they form a close cluster when the underlying network conditions and request features stay constant. Therefore, based on this intuition from the empirical findings in Sec. 6.5, we classify any rct value that lies within a bounded interval around sRCT as a valid observation. Specifically, if we define K as the bounding constant, then the expression

sRCT − K × vRCT < rct < sRCT + K × vRCT

states the classification condition for the current observation. If the current observation does not satisfy this condition, then we categorize it as a non-conforming observation.
Non-Conforming Observations: Transient changes in the network conditions or changes in the server load can sometimes affect the observed RCT values and cause the boundary condition to fail. This does not necessarily mean that the model is making inaccurate predictions; it may just be a short-lived phenomenon. In this case, we discard the current observation and increment the nonConfObs count for the feature set Fi. Note that this counter is reset to zero immediately upon encountering a valid rct observation. But if consecutive out-of-bound rct values are observed by the Accuracy Predictor, then we determine that the current model for the feature set Fi is no longer valid and void all the observations made for Fi. In other words, we revert to the base-case scenario for Fi, in which there is no training data available, and restart the training process.
Base-Case Scenario: The DTS framework will not have any training data when the application is

accessed for the very first time after installation. In this case, the framework generates its own training data until a desired number of samples is gathered to train the Naive Bayes classifier. The DTS framework starts the training process by alternating between QUIC and TCP for every request categorized as Fi, until it has collected S samples for both protocols (for the Fi category). We then apply the Bayesian technique to calculate θ and the conditional probabilities for Fi and update the Prob_Table with these values.
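Putting the pieces of this subsection together, a minimal sketch of the per-observation update (smoothed RCT and variance, bounds check with K, non-conforming counter, and fall-back to the base case) might look as follows, reusing the illustrative ModelDatabase sketch from Sec. 6.7.3.3. The constant values and the reset policy are our own assumptions where the text leaves them unspecified.

// Illustrative Accuracy Predictor update; constants and reset policy are assumptions.
final class AccuracyPredictorImpl {

    private static final double LAMBDA = 0.875;   // smoothing factor for sRCT (assumed value)
    private static final double MU = 0.75;        // smoothing factor for vRCT (assumed value)
    private static final double K = 4.0;          // bounding constant (assumed value)
    private static final int MAX_NON_CONF = 5;    // consecutive misses before retraining (assumed)

    private final ModelDatabase db;

    AccuracyPredictorImpl(ModelDatabase db) {
        this.db = db;
    }

    /** Returns true if the observation is valid and was folded into the model. */
    boolean observe(String featureKey, double rct) {
        ModelDatabase.RctEntry e = db.rctFor(featureKey);

        double lower = e.smoothedRct - K * e.rctVariance;
        double upper = e.smoothedRct + K * e.rctVariance;

        // Treat the very first observation for Fi as valid (assumption).
        if (e.smoothedRct == 0.0 || (rct > lower && rct < upper)) {
            // Valid observation: update sRCT and vRCT and reset the miss counter.
            e.rctVariance = MU * e.rctVariance + (1 - MU) * Math.abs(e.smoothedRct - rct);
            e.smoothedRct = LAMBDA * e.smoothedRct + (1 - LAMBDA) * rct;
            e.nonConfObs = 0;
            return true;
        }

        // Non-conforming observation: drop it; too many in a row voids the model for Fi.
        if (++e.nonConfObs >= MAX_NON_CONF) {
            db.rctTable.remove(featureKey);
            db.probTable.remove(featureKey);   // revert to the base-case scenario for Fi
        }
        return false;
    }
}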

6.7.4 Evaluations

We implemented the DTS framework as an Android Java module and integrated it with the custom measurement application described in Sec. 6.5. We generated different evaluation configurations using all the permutations of the features described in Sec. 6.7.3.1, i.e., request size, traffic direction, link type, device type, and first connection. Next, we uniformly sampled multiple request configurations from the above permutation set and generated HTTP requests from both the phone and the smart-watch using only QUIC, only TCP, and our DTS framework. We evaluate the observations from these experiments below.

Figure 6.16 Cumulative RCTs (versus the Nth observation) when only QUIC (yellow), only TCP (blue), or DTS (red) is used for HTTP requests: (a) Phone, (b) Watch.

Cumulative RCT: Figs. 6.16(a) and 6.16(b) plot the cumulative RCTs of all the requests from the smart-phone and smart-watch, respectively. The DTS framework improved the RCT performance on both device types over the long run. On the smart-phone, DTS improved the cumulative RCT by 20.68% when compared to using only QUIC and by 34.26% when compared to using only TCP for all requests. The corresponding gains on the smart-watch are 41.76% over QUIC and 14.45% over TCP. The necessity of adaptively selecting the transport protocol is further strengthened by the difference in the magnitude of the cumulative RCTs between the smart-phone and the smart-watch: migrating the application to use only QUIC marginally improved the overall RCT performance on the smart-phone, while the same action drastically deteriorated the RCTs on the smart-watch. Hard-coding the choice of transport protocol would therefore significantly deteriorate the networking performance of the application.

Figure 6.17 Conditional probabilities of predicting QUIC for the different request sizes (1 KB, 10 KB, 25 KB, 500 KB, 2 MB): (a) Phone, (b) Watch. Red (light red) bars correspond to GET requests over WiFi (LTE) and blue (light blue) bars to POST requests over WiFi (LTE).

Characteristics of the model: To further understand how DTS achieves these gains, we examine the final state of the DTS Transport Selection model at the end of our experiments. The final conditional probabilities of choosing QUIC for all the different request configurations are shown in Figs. 6.17(a) and 6.17(b); the corresponding probabilities for TCP are obtained by subtracting these values from 1. Notice that the classifier on the smart-phone selects QUIC almost exclusively for smaller request sizes, which is in agreement with the RCT observations in Sec. 6.5. In contrast, the conditional probabilities observed on the smart-watch are more diverse, with no particular protocol being exclusively favored.

6.7.5 Limitations

Despite demonstrating an overall improvement in RCT performance, the current implementation of the DTS framework has the following three limitations. First, the Feature Extraction module in the present framework is a simple lookup table. Real-world applications deal with a continuous scale of data sizes and require an adaptive mechanism to classify the requests at a finer granularity. Designing such a mechanism is a separate research effort in itself and beyond the scope of this work. Second, DTS is limited to the scope of the application into which it is integrated. Incorporating the DTS framework system-wide would require instrumenting the Android source code, which is considerably more difficult than instrumenting the application layer. Third, the existing framework has empirically derived model parameters such as λ, µ, etc. As a result, devices connected to extremely dynamic networks can experience frequent retraining if appropriate model parameters are not selected. We plan to address this limitation in future work, where our primary objective will be the selection of optimal parameters and a mechanism to adapt the model parameters to changing network conditions and request configurations.

6.8 Conclusion

In this work, we have conducted a performance comparison of QUIC with TCP on smart-phone and smart-watch devices in real-world network environments. Although there are some similarities between the performance of QUIC on smart-phone and wearable devices and its performance on desktop machines, there are many differences too, because of the hardware and OS constraints. A key observation from our study is that the integration of the networking stack with the smart-phone and smart-watch OS plays a key role in determining QUIC's overall performance. We also uncovered a bug that prevents migration between LTE and WiFi when using the Cronet library to send QUIC requests. We leveraged our observations from this study to develop and deploy a lightweight framework that learns and chooses the best-fit transport protocol among TCP and QUIC to minimize the long-term RCTs.

CHAPTER

7

CONCLUSION

In this thesis, I presented a network performance analysis of IoT, Cloud, and Mobile systems.

For IoT networks, I first presented a lightweight framework called IoTm to measure fine-grained performance metrics in resource-constrained IoT nodes. The key novelty of IoTm is that it deliberately introduces noise in its measurements to reuse existing memory for multiple measurement instances. IoTm uses probabilistic modelling of the stored measurement data to reconstruct the original measurement values, while incurring a minor loss in retrieval accuracy. Our evaluations show that IoTm achieves an accuracy between 5% and 7% while only using 2 bytes per measurement counter. Next, I characterized the performance of WiFi in dense IoT deployments by emulating multiple IoT traffic patterns in a real-world WiFi deployment. One important observation from this study is that, at higher IoT traffic rates, the performance degradation due to rate limiting by TCP impacts the overall performance more than frame collisions do.

For cloud and datacenter networks, I conducted two studies primarily examining the impact of TCP variants in these networks. First, I developed a methodology to choose the best-fit TCP variant in a public cloud environment based on simple traffic measurements between cloud virtual machine instances. I verified this methodology by deploying typical distributed workloads in real public cloud networks and evaluating the application-level performance resulting from the choice of TCP variant in the cluster. Next, I characterized the problem of TCP coexistence in datacenter networks. For this, I deployed multiple datacenter topologies and emulated real-world workload traffic. A key takeaway of this study was that coexisting traffic from multiple TCP variants can translate into actual application-level performance degradation. Based on these findings, I implemented a machine learning model that learns the TCP variant of a remote flow from only local measurements and changes the local TCP variant accordingly.

For mobile systems, I evaluated the performance of QUIC on Android and Wear OS systems. A key observation from this measurement study is that the performance benefits of QUIC are not felt in all network scenarios for mobile systems (unlike previous measurement studies of QUIC deployments on general-purpose OSes). Moreover, the current implementation of the QUIC library in Android has functional limitations during mobile roaming scenarios that severely affect the request completion times after a network migration. We used the observations from this study to develop a probabilistic framework that proactively chooses the appropriate transport protocol according to the request characteristics and minimizes the long-term request completion times. We found that our framework can reduce the long-term request completion times by up to 34.26% on smart-phones and up to 41.76% on smart-watches compared to using only TCP or only QUIC for all requests.
