Comparing ISP Performance using Big Data from M-Lab

Xiaohong Deng, Yun Feng, Thanchanok Sutjarittham, Hassan Habibi Gharakheili, Blanca Gallego, and Vijay Sivaraman

X. Deng was with the School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, NSW 2052, Australia (e-mail: [email protected]). T. Sutjarittham, H. Habibi Gharakheili, and V. Sivaraman are with the School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, NSW 2052, Australia (e-mails: [email protected], [email protected], [email protected]). Y. Feng is with Shanghai Technologies, Pudong, China, ZIP 201206 (e-mail: [email protected]). B. Gallego is with the Centre for Big Data Research in Health, University of New South Wales, Sydney, NSW 2052, Australia (e-mail: [email protected]). This submission is an extended and improved version of our paper presented at the ITNAC 2015 conference [1].

Abstract—Comparing ISPs on broadband speed is challenging, since measurements can vary due to subscriber attributes such as operating system, and test conditions such as access capacity, server distance, TCP window size, time-of-day, and network segment size. In this paper, we draw inspiration from observational studies in medicine, which face a similar challenge in comparing the effect of treatments on patients with diverse characteristics, and have successfully tackled it using "causal inference" techniques for post facto analysis of medical records. Our first contribution is to develop a tool to pre-process and visualize the millions of data points in M-Lab at various time- and space-granularities to get preliminary insights on factors affecting broadband performance. Next, we analyze 24 months of data pertaining to twelve ISPs across three countries, and demonstrate that there is observational bias in the data due to disparities amongst ISPs in their attribute distributions. For our third contribution, we apply a multi-variate matching method to identify suitable cohorts that can be compared without bias, which reveals that ISPs are closer in performance than previously thought. Our final contribution is to refine our model by developing a method for estimating speed-tier, and to re-apply matching for comparison of ISP performance. Our results challenge conventional rankings of ISPs, and pave the way towards data-driven approaches for unbiased comparisons of ISPs world-wide.

Index Terms—Broadband performance, Big data, Data analytics, Measurement lab

I. INTRODUCTION

This paper asks the question: how should we compare Internet Service Providers (ISPs) in terms of the broadband speeds they provide to consumers? (Aspects such as pricing plans, quotas, and reliability are not considered in this paper.) On the face of it, determining the answer may seem simple: a subscriber's speed can be measured directly (say via a speed-test tool or an adaptive bit-rate video stream), allowing ISPs to be compared based on the average (or median) measured speed across their subscriber base. However, this approach has deep conceptual problems: an ISP-A who has many subscribers in remote areas served by low-capacity (wired or wireless) infrastructure will compare poorly to an ISP-B whose subscribers are predominantly city dwellers connected by fiber; yet, it could well be that ISP-A can provide higher speeds than ISP-B to every subscriber covered by ISP-B! The comparison bias illustrated above, arising from disparity in access capacity, is but one example of many potential confounding factors, such as latency to content servers, host and server TCP window size settings, maximum segment size in the network, and time-of-day, that directly bias measurement test results. Observational studies therefore need to understand and correct for such biases to ensure that the comparisons are fair.

In this study we draw inspiration from the field of medicine, which has grappled for decades with appropriate methods to compare new drugs/treatments. "Patients" (in our case broadband subscribers) with "attributes" such as gender, age, medical conditions, and prior medications (in our case access capacity, server latency, host settings, etc.) are given "treatments" (in our case ISPs), and their efficacy needs to be compared. The dilemma is that any given patient can only be measured taking treatment-A or treatment-B, but not both at the same time; similarly, a subscriber in our case can only be observed when connected to one ISP, so the "ground truth" of that customer's broadband performance served by other ISPs is never observed. To overcome this issue, the gold standard for medical treatment comparisons is a "randomized control trial" (RCT), wherein each patient in the cohort is randomly assigned to one of the multiple treatments (one of which could be a placebo). The randomization is crucial here, in the expectation that known as well as unknown attributes that could confound the experiment outcome get evenly distributed across the groups being compared, so that statistically meaningful inferences can be drawn.

Alas, "randomized" assignment of ISPs to subscribers is not a viable option in the real world, so we have to instead rely on "observational" studies that analyze performance data given a priori assignment of ISPs to subscribers. Fortunately for us, techniques for observational studies are maturing rapidly, particularly in medicine, where analyzing big data from electronic health records is much cheaper than running controlled clinical trials and can yield valuable insights on the causal relationship between patient attributes and treatment outcomes. In this work we have collaborated closely with a medical informatics specialist to apply "causal inference" techniques to analyzing ISP performance data – unlike a classic supervised learning problem, causal inference works by estimating how things might look under different conditions, thereby differentiating the influence of A versus B, instead of trying to predict the outcome. We apply this method to the wealth of broadband data available from the open M-Lab platform, which holds over 40 million measurement results world-wide for the year 2016. Though no data-driven approach can guarantee that causal relationships are deduced correctly, as there could be unknown attributes that affect the outcome (the "unknown unknowns", to use the Rumsfeld phrase), we believe that the M-Lab data set captures most, if not all, of the important attributes that are likely to affect the speed measurements.

Our objective in this paper is to apply emerging data-analysis techniques to the big data from M-Lab to get new insights into ISP broadband performance comparison. Our first contribution is somewhat incidental – we develop a tool that allows researchers to easily and quickly process and depict M-Lab data to visualize performance metrics (speed, latency, loss, congestion) at various spatial (per-house, per-ISP, per-country) and temporal (hourly, monthly, yearly) granularities. Our second contribution applies our tool to over 17 million data samples taken in 2015 and 2016 spanning 12 ISPs in 3 countries, to identify the relative impact of various attributes (access speed-tier, host settings, server distance, etc.) on broadband performance. We reveal, both visually and analytically, that dominant attributes can vary significantly across ISPs, corroborating our earlier assertion that subscriber cohorts have disparate characteristics and ISP comparisons are therefore riddled with bias. Our third contribution is to apply a causal inference technique, called multi-variate matching, to filter the data sets by identifying cohorts with similar attributes across ISPs. Our final contribution is to refine our method to make M-Lab data more useful by mapping measurements to households and estimating their speed-tier, so that meaningful performance comparisons across households can be conducted. Our results indicate that the ISPs are closer in speed performance than previously thought, and their relative ranking can be quite different to what the raw aggregates indicate.

The rest of this paper is organized as follows: §II recaps prior work in broadband performance, and gives relevant background on causal inference techniques. In §III we describe our measurement data set, the attributes it contains, and preliminary insights gleaned from our visualization tool. The attribute distributions and underlying biases are discussed in §IV, while in §V we apply multi-variate matching to reduce bias and compare ISPs in a fair manner. §VI presents our systematic approach to estimate household access capacity from M-Lab data. The paper is concluded in §VII with pointers to future work.

II. BACKGROUND AND PRIOR WORK

A. Broadband Measurement and Reporting

Measuring and ranking broadband ISPs has been ongoing (and contentious) for several years – Netflix publishes a monthly ISP speed index [2] that ranks ISPs based on their measured prime-time Netflix performance, and Youtube graphs for each ISP the break-down of video streaming quality (low vs standard vs high definition) by time-of-day averaged over a 30-day period [3]. While these large content providers undoubtedly have a wealth of measurement data, these are specific to their services, and neither their data nor their precise comparison methods are available in the public domain (to be fair, Google does outline a methodology on its video quality report page, but it fails to mention important elements such as whether it only considers video streams of a certain minimum duration, whether a house that watches more video streams contributes more to the aggregate rating, and how it accounts for various factors such as browser type, server latency, etc. that vary across subscribers and can affect the measurement outcome). Governments are also under increasing pressure to compile consumer reports on broadband performance – for example, the FCC in the US [4] directs consumers to various speed test tools to make their own assessment, and the ACCC in Australia [5] is running a pilot program to instrument volunteers' homes with hardware probes to measure their connection speeds. Additionally, various national regulators in Europe employ their own methods of measuring broadband speed and publish white papers, as surveyed in [6] – for example, Ofcom in the UK uses a hardware measurement unit (developed by SamKnows), several other national regulators such as those in Italy, Austria, Germany, Portugal, and Slovenia use specialized software solutions (developed in-house), while the regulator in Greece adopted M-Lab's NDT tool.

While there is a commendable amount of effort being expended on collecting data, via either passive measurement of video traffic or active probing using hardware devices (we refer the reader to a recent survey [7] that gives an overview of measurement platforms and standardization efforts), less effort has been expended on a systematic analysis of the collected data. This matters, because early works such as [8] have demonstrated that broadband speed measurements can exhibit high variability, and that these differences arise from a complex set of factors, including test methodology and test conditions (home networks and end users' computers among them), that make it very challenging to attribute performance bottlenecks to the constituent parts, specifically the ISP network. While their work acknowledges that broadband benchmarking needs to look beyond statistical averages to identify factors exogenous to the ISP, they do not offer any specific approaches for doing so. NANO [9] developed a system that infers the effect of an ISP's policy on a service's performance, and it also compares service performance across multiple ISPs. NANO establishes a causal relationship between an ISP and its observed performance by adjusting for confounding factors such as client-based, network-based, and time-based confounders; however, it does not consider TCP throughput performance comparison across ISPs. NetDiff [10] designed a system that offers a fair performance comparison of ISP networks by taking into account the size and geographic spread of each ISP. It helps customers determine which ISP offers the best performance for their specific workload, but their work considers only one confounding factor. A separate body of work [11]–[13] explores model-driven and data-driven methods to estimate or predict end-to-end available bandwidth; however, they operate at short time-scales, their data-sets are small, and their focus is not specific to broadband networks.

We believe our work is among the first to combine causal inference techniques for observational studies with the big data openly available from the M-Lab measurement platform to attempt a fair comparison of ISP broadband performance.

B. Causal Inference Analysis

As mentioned earlier, the gold standard for comparisons is a randomized control trial, which is not feasible in our case. We therefore have to use observational data with a priori assignments of ISPs to subscribers, and use causal inference methods [14]–[16] that can control for differences in the covariate distributions between the groups being compared so as to minimize confounding. One of the most popular methods is "matching" [17], which selects subsets of observations in one group (the treatment group) for comparison with observations having similar covariate distributions in the comparator group (the control group) – balancing the distribution of covariates in the two groups gives the effect of a randomized experiment. Matching has been used extensively in epidemiological, social, and economic research studies, and has been proven to reduce confounding bias very effectively. The most common approaches to perform matching are propensity score matching, multivariate matching based on Mahalanobis distance, and, more recently, genetic matching algorithms. In this paper we chose multivariate matching, which is easier to tune by interpreting results when the number of attributes is not too large, and is well supported in R [18].

Once the covariates of the observations have been matched between the groups, the difference in outcome is averaged to estimate the average treatment effect (ATE) – in medicine, this could quantify the average effect of a drug compared to a placebo, while in our case it estimates the average difference in download speed between the two ISPs being compared. Certain pre-conditions are needed for our approach: we assume that the group assignment (i.e., choice of ISP for a subscriber) has been made independent of the outcome, conditional on the observed covariates; that the baseline covariates, although measured post facto, are not affected by the treatment (ISP); and that there are sufficient observations for which the probability of assignment is bounded away from zero and one. In simple words, this states that households (patients) did not make their ISP (treatment) choice based on known outcomes or attributes, and that there are a reasonable number of samples from the two groups being compared that have similar covariate distributions (our results in §V will capture this via p-values).

III. DATA-SET SELECTION, ATTRIBUTES, VISUALIZATION

In this section we briefly introduce M-Lab, its measurement tools and data repositories. We then describe the data we have selected and pre-processed, the attributes we have extracted, and the visualization tool we have built.

M-Lab [19] was founded in 2008 as a consortium of research, industry, and public-interest partners, backed by Google, to create a global open measurement platform and data repository that researchers can use for deeper studies of Internet performance. M-Lab has built a platform on which test servers are well distributed across continents and ISPs, and any interested party can design, implement, and deploy new Internet measurement tools under an open license. This gives a significant advantage to M-Lab over other platforms such as PerfSONAR [20] – M-Lab provides a much larger data collection, generated by tens of millions of tests from clients connected to hundreds of ISPs across the globe every year. All data collected on the M-Lab platform are open access (as opposed to commercial platforms such as Ookla [21]), and are available either in raw format as file archives over Google cloud storage, or in SQL-friendly parsed format accessible using BigQuery. In terms of diversity, M-Lab covers a wider range of users compared to hardware-based platforms such as SamKnows [22] and BISMark [23], since deploying hardware-based measurement at users' premises is constrained by the distribution of devices, and is thus limited to selected populations.

Fig. 1. CCDF of household test counts in AU, UK, and US.

A. Data Set Selection and Pre-Processing

In this paper we use the data collected by the Network Diagnostic Test (NDT) tool, since it has by far the largest number of speed test samples (over 40 million for the year 2016), and captures a rich set of attributes for each test (discussed later in this section). In order to evaluate the generality of our methods, we apply them to data from three countries: Australia (AU), the United Kingdom (UK), and the United States (US). We select four of the largest ISPs from each country for comparison: Telstra, Optus, iiNet, and TPG from AU; BT, Virgin, Sky, and TalkTalk from the UK; and Comcast, Verizon, AT&T, and Cox from the US. For these ISPs, we analyze the NDT speed test measurements taken over two years (2015 and 2016), comprising 1.3m samples for AU, 1.4m for UK, and 14.5m for US – the latter is an order of magnitude larger since Google searches in the US got linked with NDT as of July 2016.

Fig. 2. Monthly median download speeds: (a) Australia (Optus, TPG, Telstra, iiNet), (b) UK (BT, Sky, TalkTalk, Virgin), (c) US (AT&T, Comcast, Cox, Verizon).

Determining household speed: Our first objective is to set a baseline for ISP speed comparison by computing their mean/median values. However, we found in the NDT data-set that some IP addresses were conducting many more tests than others (for convenience of exposition we will henceforth refer to each unique IP address as a "household"). Our data set was found to have around 565k, 464k, and 2.81m households for AU, UK, and US respectively, indicating that the average household contributes only 2-4 samples each month. There is, however, a significant skew in monthly test frequency amongst households, as shown in the complementary CDF (CCDF) of Fig. 1 – in AU and UK, for example, the bottom 50% of households contribute 5 or fewer samples, while the top 10% contribute 50 or more samples each. We eliminate this bias by aggregating (averaging) the results to obtain a single value per household per month, and plot the resulting month-by-month median download speed across households in Fig. 2. The rankings shown in the figure are broadly consistent with the ones published by Netflix – like us, the Netflix ISP speed index ranks Optus, Virgin, and Verizon highest in AU, UK, and US respectively for most months of 2016. Although there is no reference point for how Netflix computes its ISP index, the fact that this aggregation produces a similar ranking leads us to treat it as a fair, if naive, baseline for our discussions and comparisons against the other methods developed later in this paper. Therefore, when we refer to the naive arithmetic average later on, we mean ISP performance aggregated by household, so that each house has only one vote in the ISP average.

TABLE I
NUMBER OF SAMPLED NDT MEASUREMENTS FROM EACH ISP

year | ISP      | raw test count | annual test count > 20 | % used
2015 | Telstra  | 117,019   | 11,180    | 9.6%
2015 | Optus    | 46,138    | 8,232     | 17.8%
2015 | iiNet    | 42,917    | 5,144     | 12.0%
2015 | TPG      | 52,186    | 14,928    | 28.6%
2015 | BT       | 238,134   | 13,844    | 5.8%
2015 | Virgin   | 205,149   | 54,371    | 26.5%
2015 | Sky      | 235,271   | 27,623    | 11.7%
2015 | TalkTalk | 7,450     | 652       | 8.8%
2015 | AT&T     | 460,482   | 148,512   | 32.3%
2015 | Cox      | 215,499   | 56,282    | 26.1%
2015 | Verizon  | 291,421   | 88,010    | 30.2%
2015 | Comcast  | 769,728   | 216,252   | 28.1%
2016 | Telstra  | 478,469   | 65,839    | 13.8%
2016 | Optus    | 161,303   | 26,421    | 16.4%
2016 | iiNet    | 199,764   | 20,525    | 10.3%
2016 | TPG      | 219,547   | 53,879    | 24.5%
2016 | BT       | 126,013   | 10,890    | 8.6%
2016 | Virgin   | 71,947    | 20,405    | 28.4%
2016 | Sky      | 95,964    | 10,292    | 10.7%
2016 | TalkTalk | 17,543    | 3,666     | 20.9%
2016 | AT&T     | 2,841,976 | 1,280,814 | 45.1%
2016 | Cox      | 1,391,983 | 580,976   | 41.7%
2016 | Verizon  | 1,333,403 | 537,319   | 40.3%
2016 | Comcast  | 5,442,720 | 2,164,122 | 39.8%

Determining access speed-tier: The download speed for a household will be limited by the capacity of its access link, which in turn is dictated by physical attributes such as medium (fiber, copper, wireless) and distance from the local exchange. It may further be constrained if the subscriber has chosen a plan with a lower advertised speed. We term this maximum possible speed available to the household its "access speed-tier". As we will see in the next section, this attribute is important when comparing ISPs, but is not explicitly present in the data since M-Lab is not privy to advertised speeds and subscriber plans. We therefore have to infer a household's access speed-tier from the measured data. We take an approximate approach of using the largest value of measured speed as the access speed-tier for that household, provided: (a) the household has conducted a minimum threshold number of tests, and (b) at least one test was conducted during off-peak hours (i.e., outside of 7pm-11pm local time). Filtering by a higher test-count threshold estimates the access speed-tier more accurately, but reduces the data-set by eliminating households that conduct fewer tests (see the CCDF of test-counts in Fig. 1). We chose a threshold of 20 for AU and UK, and 50 for US, so as to get reasonable confidence in our estimates of access speed-tier. As we can see in Table I, the test-count thresholds we chose retain 10-30%, 6-27% and 26-33% of samples for AU, UK and US respectively in 2015's data, and 10-24%, 9-30% and 40-45% of samples for AU, UK and the US respectively in 2016's data. As the number of samples in the US has increased greatly since July 2016, we believe that using a test-count threshold to select measurement samples will not be a constraint when applying our method to future data.
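As a concrete illustration of the two pre-processing steps above (one vote per household per month, and a first-cut speed-tier estimate from the maximum off-peak-qualified speed), the following R sketch shows one possible implementation. It is not the authors' released code: the data frame ndt and its column names (ip, isp, month, download_mbps, local_hour) are hypothetical stand-ins for the pre-processed NDT fields.

```r
library(dplyr)

# ndt: one row per NDT test, with assumed columns
#   ip (client IP, our "household"), isp, month (e.g. "2016-03"),
#   download_mbps (measured speed), local_hour (0-23, client local time)

# Step 1: one value per household per month, then the monthly median
# across households for each ISP (the "naive" baseline of Fig. 2).
monthly_median <- ndt %>%
  group_by(isp, month, ip) %>%
  summarise(house_speed = mean(download_mbps), .groups = "drop") %>%
  group_by(isp, month) %>%
  summarise(median_speed = median(house_speed), .groups = "drop")

# Step 2: approximate access speed-tier per household: the maximum
# measured speed, kept only if the household ran enough tests and at
# least one test fell outside the 7pm-11pm peak window.
speed_tier <- ndt %>%
  group_by(ip, isp) %>%
  summarise(n_tests   = n(),
            n_offpeak = sum(local_hour < 19 | local_hour >= 23),
            tier      = max(download_mbps), .groups = "drop") %>%
  filter(n_tests >= 20, n_offpeak >= 1)   # threshold 20 (AU/UK); 50 for US
```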

B. Attribute Selection

The NDT speed-test client connects to its nearest NDT server for the speed test. The server records the client information (IP address, geographic location, OS, client version), network attributes (RTT, MSS), server-side statistics (TCP buffers, maximum congestion window sizes, and other Web100 variables [24] for TCP tuning), and detailed run-time measurements (speed, loss, congestion signal counts). While nearly 50 attributes are included, we found that many were sparsely recorded, due to different NDT client versions embedded in different applications on different operating systems choosing to record different subsets of attributes.

The TCP window size of the client is expected to affect measured speeds. There are a number of Web100 performance statistics related to the TCP window [24]: the TCP receiver's announced maximum window size (Rwnd) is determined by the available head room in the ReceiveBuf, and the Average Receiver Window is generally close to Rwnd. We use the maximum window size (Rwnd), as it is solely dependent on client conditions, whereas ReceiveBuf can also reflect network conditions. The distance between client and server is recorded in the form of the minimum round-trip-time (min-RTT) over the duration of each test, as is the maximum segment size (MSS). The client region is recorded, which allows conversion to local time-of-day to determine whether the test is done during peak or off-peak times. The Operating System and version attribute can tell us about the default host settings (e.g., TCP auto-tuning, Nagle option) that affect speed performance. The broadband ISP for the client is deduced from a Whois lookup of the client IP address, and the speed-tier for the household is computed as explained earlier. The Client Limited Time is the percentage of time a test was constrained by the client itself. This attribute is inherently correlated with speed-tier, as we found that higher speed-tier clients are more likely to be constrained by client-side limitations. Since this attribute is not directly associated with the speed, we do not process it when we quantify the correlation between attributes and speeds, but only use it when applying the causal inference model later on. The impact of these attributes, reflective of test conditions, will be discussed in the next section.

C. Visualization Tool

Since visualization is key to human comprehension and interpretation of results, we built a tool to ease the generation of plots of various performance measures (speed, RTT, congestion signals) filtered by country, ISP, or specific household, at time-scales of hours, days, and months. A data-extraction script in Python queries the M-Lab NDT store to extract data, and an R script filters fields of interest and annotates them with extra attributes (such as speed-tier and local time) into country-specific local files. A set of analytics scripts in R then performs the various algorithmic operations, ranging from simple aggregation and normalization to the more complex causal inference models discussed later. A JavaScript front-end provides user interaction to input plotting options and display the resulting graphs. Our UI is openly accessible at https://mlab-vis.sdn.unsw.edu.au/ and we encourage the reader to try the various plot options, such as: (a) aggregated plots that show monthly/hourly ISP (raw or normalized) speeds, (b) scatter plots that show download speeds by time-of-day or day-of-month, (c) distribution plots of speed-tier for a specific ISP, (d) correlation plots showing how the download speed relates to the various attributes, and (e) household plots that show speed, RTT, etc., specific to a client IP address.

Fig. 3. Comcast speed-tests faceted by day-of-month for December 2016.

Though we will see several plots generated by our tool throughout this paper, here we would like to illustrate some visual insights from a facet plot, in this case speed measurements taken at different times of day. Fig. 3 shows a panel of 31 plots, each of which depicts all speed-tests done during that day of the month, over the month of December 2016 for Comcast. The top-10 contributing households are each given their own color, with their IP addresses (and, in parentheses, the number of test points contributed for that month) shown in the legend. One can immediately see the temporal skew in testing patterns: the dark green household (IP address 67.180.193.135 with 1281 tests) does its speed testing exclusively on three days (22-Dec, 29-Dec, and 30-Dec), the light green household (IP address 24.147.127.89 with 672 tests) is concentrated on 18 December, while other households such as red (IP address 50.248.236.185 with 1809 tests) and purple (IP address 76.114.35.144 with 1055 tests) are spread across every day of the month. The plot also gives a visual representation of the variability in the number of tests conducted across houses, as well as the variability in speed experienced by the same household. It is often useful to corroborate the numerical results presented in this paper with their visual depictions made possible by our tool.

IV. TEST DATA ATTRIBUTE DISTRIBUTIONS AND BIASES

In this section we study how test condition attributes (affecting speed) can vary in measurements across different ISPs, and how this can bias the comparison results. We begin by feeding the measurement test results, along with the associated test condition attributes, to the Random Forest method in R to compute the "importance rank" (other machine learning methods for variable selection, such as Bayesian Additive Regression Trees [25], provided similar results).
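The variable-importance computation can be reproduced along the following lines. This is a hedged sketch rather than the authors' exact script: the data frame tests and its column names (speed, speed_tier, rwnd, distance, os, isp, mss, time_of_day) are assumptions.

```r
library(randomForest)

# tests: one row per measurement with the outcome (speed) and the
# test-condition attributes studied in this section (assumed names).
set.seed(1)
rf <- randomForest(speed ~ speed_tier + rwnd + distance + os + isp +
                     mss + time_of_day,
                   data = tests, ntree = 500, importance = TRUE)

# Permutation importance (%IncMSE), sorted as in Fig. 4.
imp <- importance(rf, type = 1)
imp[order(imp, decreasing = TRUE), , drop = FALSE]
```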

Fig. 4. Attribute importance (vimp) computed from Random Forest for (a) AU, (b) UK, and (c) US. Attributes shown: speed-tier, receiver window size, distance, OS, ISP, MSS, and time-of-day.

Fig. 5. Comparing access speed-tier for (a) Optus, and (b) Telstra, in AU.

Fig. 4 shows that the access speed-tier, host buffers, and distance attributes have the highest impact on measured download speed, across all the countries studied. This by itself is not very surprising, since these factors directly determine TCP dynamics and hence measured speeds. What is surprising to observe is that the ISP per se has a relatively lower weight, typically ranking fourth or fifth in importance. Of course, in countries such as the US, each ISP often runs its own physical broadband infrastructure, and therefore wields a much larger influence via the access speed-tier attribute, whereas in countries such as AU and the UK the ISPs typically share (nationalized) broadband infrastructure, and hence do not dictate the access speed-tier attribute. In either case, the ISP attribute is found to be no more significant than the operating system of the client running the test.

Even with a shared/nationalized broadband infrastructure, some ISPs may be serving customers with lower access speed-tiers, which can drag their averages down. In fact, in AU, Telstra claims that it serves more rural/regional customers than other ISPs such as Optus, which is used as a reason why it ranks lower on the Netflix ISP speed index (and in our monthly median plot shown in Fig. 2(a)). To check the plausibility of this claim, we use our tool to plot in Fig. 5 the distribution of speed-tiers of households served by Telstra and Optus in 2016. The disparity is evident – Optus has only 10% of subscribers at a speed-tier below 8 Mbps and 63% above 20 Mbps, while Telstra has 21% of subscribers below 8 Mbps and only 46% above 20 Mbps. Since much of the access infrastructure in Australia is open and can be shared by all ISPs, the disparity in access speed-tier is attributable to the different proportions of metropolitan versus regional customers served by the two ISPs.

Fig. 6. Comparing download speeds by OS (Linux 3.4+, Windows Vista+, Windows XP and older) in AU.

Another illustration of the impact of an attribute, in this case the client OS, is shown visually in Fig. 6. This plot is drawn from M-Lab test data for AU in 2016, and shows the density distribution of measured download speed (x-axis on log scale) separated by OS type. The bias is again evident here – clients using flavors of Linux version 3.4 or higher (solid red curve) are clustered in the range of 10-100 Mbps, while Windows clients running XP or older (dashed blue curve) are concentrated in 5-20 Mbps – this could be attributed to the lack of TCP auto-tuning in older versions of the Windows OS [26].


Fig. 7. Distribution of various attributes for: (a) AU, and (b) US.

For completeness, we show the distributions of all the major attributes (host TCP window size, speed-tier, distance, MSS, and OS) in Fig. 7 for AU and US (UK has similar characteristics to AU, and is omitted here for space reasons). It can be seen that in almost every attribute the ISPs differ: for example, in AU, TPG subscribers are more skewed towards lower Rwnd (TCP window), larger client-server distance, and more Windows XP (or older) OS, compared to subscribers of other ISPs, while in the US, AT&T subscribers have larger distance and more widely spread Rwnd than others. As discussed earlier, such differences in attributes (reflective of test conditions) can bias the test outcomes in multiple ways; the next section develops a method to eliminate this bias and undertake a fair comparison of ISPs.

V. DEBIASING USING MULTI-VARIATE MATCHING

We use the causal inference technique called multi-variate matching, as briefly introduced in §II-B, to balance the covariate distributions and thereby reduce bias. For attributes that take continuous values (speed-tier, host buffers, client-server distance, MSS), we use the Mahalanobis distance [18] to compute the "closeness" between measurement samples pertaining to ISP-1 (the "treatment group") and ISP-2 (the "control group"). Samples that are within a "caliper" (specified in units of the standard deviation of each attribute) distance from each other are deemed to be matched, and constitute the "common support" between the two groups, while all other samples are dropped. A larger caliper allows more samples to be matched for greater common support, reducing error (variance) in the comparison of the average treatment effect (ATE), whereas a smaller caliper makes the matching more exact for improved unit homogeneity, yielding lower bias [15]. The caliper therefore has to be tuned to achieve the desired balance between error and bias.
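For reference, the closeness measure mentioned above is the standard Mahalanobis distance between the covariate vectors of two samples; the following is the textbook formulation rather than a detail spelled out in the paper, with S taken to be the pooled sample covariance matrix of the covariates:

d(x_i, x_j) = \sqrt{(x_i - x_j)^\top S^{-1} (x_i - x_j)}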


Fig. 8. Impact of matching with/without replacement (r/nr) and caliper (0.1/0.2) on the percentage of discarded samples for: (a) AU, 2016 ISP pairs in speed-tier bin [0, 8], and (b) US, 2015 ISP pairs in speed-tier bin [30, 50].

Fig. 9. Covariate distribution (download speed, distance, Rwnd, MSS, receiver limited time, time of week, OS) before and after matching with replacement, Optus vs. Telstra (in AU) measurements, with caliper 0.2.

Fig. 10. Covariate distribution (download speed, distance, Rwnd, MSS, receiver limited time, time of week, OS) before and after matching with replacement, Comcast vs. AT&T (in US) measurements, with caliper 0.2.

Another key factor is whether matching should be done "with replacement" (r) or "without replacement" (nr). Matching with replacement can often decrease bias because controls that look similar to many treated individuals can be used multiple times [27], but this makes the inference more complex since matched controls become dependent. We employ both of these methods (i.e., r and nr). For the caliper, we use the two values 0.1 and 0.2 to evaluate the impact of tight and relaxed matching respectively. We show our results in Fig. 8 for six pairs of ISPs within a specific speed-tier in AU and US.

It is seen that when matching without replacement is employed, for both AU and US, a fairly large fraction of samples (typically between 50% and 75%) falls outside the common support and is thus discarded, the exception being the "Verizon vs. Cox" pair in the US with caliper value 0.1. Matching with replacement instead discards a smaller fraction of samples (i.e., keeps a larger portion of samples in the common support). Focusing on AU in Fig. 8(a), the discard rate is less than 25% for four pairs of ISPs when matching with replacement is employed and the caliper value is set to 0.2 (purple bars), while two pairs experience a relatively higher discard rate of about 45%. For the US, on the other hand, the same settings give a fairly consistent discard rate of less than 25% across the four pairs of ISPs, as shown by the purple bars in Fig. 8(b). As a result, we use matching with replacement and caliper 0.2, retaining a sufficient number of samples in the common support to keep error low.
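A minimal sketch of this matching step, using the Match() function from the R "Matching" package (one R implementation of the multivariate matching mentioned in §II-B), is given below. The column names and the choice of Optus-vs-Telstra are illustrative assumptions, not the authors' released code.

```r
library(Matching)

# df: measurements for one ISP pair within one speed-tier bin, with
# assumed columns: isp, speed (outcome), and the continuous covariates.
treated <- df$isp == "Optus"            # ISP-1 = treatment, ISP-2 = control
X <- df[, c("speed_tier", "rwnd", "distance", "mss", "recv_limited_time")]

m <- Match(Y = df$speed, Tr = treated, X = X,
           estimand = "ATE",            # average difference in download speed
           Weight   = 2,                # Mahalanobis-distance matching
           caliper  = 0.2,              # in std-dev units of each covariate
           replace  = TRUE)             # matching with replacement

summary(m)                              # estimate, std. error, p-value
MatchBalance(treated ~ speed_tier + rwnd + distance + mss +
               recv_limited_time, data = df, match.out = m)
```

MatchBalance() reports covariate balance before and after matching, which is how plots like Fig. 9 and Fig. 10 can be sanity-checked numerically.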

A. Matching on Selected ISP Pairs

Let us begin by looking at Optus vs. Telstra in AU for speed-tier [0, 8] Mbps. In Fig. 9, we show how the covariate measures and download speeds change when matching with replacement and caliper 0.2 is applied. It is seen that the attributes differ substantially before and after matching: matching discards 20.8% of samples and makes the covariate distributions more similar (thereby reducing bias). Note that, before matching, Telstra is disadvantaged in terms of longer distance and a larger number of users with old OS (not supporting TCP auto-tuning), and hence sees a smaller download speed on average, as suggested by naive aggregation (e.g., Fig. 2(a)). When the confounding factors (attributes) are balanced using the matching method, interestingly Telstra gives better performance than Optus on average – a result contrary to the one seen in Fig. 2(a).

We now compare Comcast and AT&T in the US for speed-tier bin [30, 50] Mbps. We can observe in Fig. 10 that the attributes are quite disparate between Comcast and AT&T when all samples are considered (before matching). Comcast is disadvantaged against AT&T by several factors including: a) a higher distance between clients and server; b) a larger number of measurements from old OS (i.e., window scale disabled); and c) a larger number of measurements with a high value (i.e., 0.1 to 1) of Receiver-Limited-Time. Therefore, a naive average difference suggests that Comcast is about 8 Mbps slower than AT&T on average. But matching balances the attributes (confounding factors), as shown by the plots on the right in Fig. 10, and thus reduces the speed difference to 3-4 Mbps.

B. Matching on a Larger Set of ISP Pairs and Various Speed-Tiers

We now extend our evaluation to a larger set of ISP pairs with a diversity of speed-tiers, for data from the two years 2015 and 2016. Table II shows 83 ISP pairs (in AU and US) along with their corresponding speed-tier (in Mbps) and the year of data. ISP pairs are sorted in ascending order of their speed difference (computed from naive aggregation), and each pair is assigned a unique ID (order) for ease of reference in Table II. We show the results of matching with caliper 0.2 in Fig. 11. The speed differences of each ISP pair resulting from naive arithmetic means, matching without replacement (nr), and matching with replacement (r) are shown by black dots, red segments, and green segments respectively. Matching-estimated differences are represented by error bars of 95% confidence interval.

Fig. 11. Comparing ISP speeds for 83 pairs in AU and US.

We make the following observations: (a) when the raw difference is close to zero, for pair IDs between 40 and 70, the estimated differences after matching are also close to zero, i.e., the matching method does not change the inference; (b) when the raw difference has a large negative value, for pair IDs between 1 and 20 (or a large positive value, for pair IDs between 75 and 83), matching estimates the difference to be much smaller (i.e., 0 to -3 Mbps) than what naive arithmetic means indicate (i.e., -3 to -8 Mbps); and (c) in some cases the raw difference is positive but the estimated value becomes negative with statistical significance, indicating that a simple use of raw averages for ranking ISPs could be misleading, for example Optus vs. Telstra in AU (as discussed earlier).

VI. REFINING SPEED-TIER ESTIMATION

We developed a causal inference model for a fair comparison of ISP performance in the previous section. In this section we refine our model by undertaking a more detailed analysis to estimate household speed-tier (also referred to as "broadband access capacity") from the M-Lab data, allowing us to further improve the fidelity of broadband speed comparisons.

A. Isolating Households

M-Lab data-points are indexed by the IP address of home gateways. ISPs allocate IP addresses based on their resource pool, subscriber base, or business policy. In some cases, an ISP (often a large one) may have a fairly large pool of public IP addresses and can assign every subscriber a unique public IP, though the one-to-one address lease may change dynamically over time. In other cases, the ISP (often a small one) will instead assign a public IP address to a group of subscribers, and then employ NAT to multiplex their traffic. Consequently, it becomes challenging to extract the broadband capacity from M-Lab data, as an IP address does not necessarily represent a single household. Thus, we need a method to isolate data-points corresponding to single households.

The congestion signal of each NDT data-point indicates how the TCP congestion window (cwnd) is being affected by congestion, and is incremented by any form of congestion notification, including Fast Retransmit, ECN, and timeouts. Theoretically, a large value of the congestion signal (congestion count) should correspond to a low TCP throughput (download speed), and vice versa.

TABLE II
ISP PAIRS WITH COMMON SUPPORT IN AU AND US.
(For each of the 83 pair IDs referenced in Fig. 11, the table lists the two ISPs compared (ISP1 vs. ISP2), the speed-tier bin in Mbps, and the year of data; pairs are ordered by their naive speed difference.)

Fig. 12. Two samples of correlation between download-speed and congestion-count: (a) negative correlation (Cox, 458 tests from 98.174.39.22) – high speed during un-congested periods, and low speed during fairly congested periods; (b) positive correlation (City of Thomasville Utilities, 896 tests from 64.39.155.194) – high speed even during highly congested periods, and low speed even during uncongested periods.

Fig. 13. Negative/positive correlation across large/small ISPs in (a) AU, and (b) US.

Fig. 14. Consistency of correlation between download-speed and congestion-count across four months: (a) negative correlation (Cox, 458 tests from 98.174.39.22); (b) positive correlation (City, 896 tests from 64.39.155.194).

We denote by ρ the Pearson correlation coefficient between the measured download speed and the recorded congestion count. This parameter is computed across all tests corresponding to a given client IP address.

We expect ρ to be negative for any given household, as higher broadband speed should correlate with lower congestion, and this is indeed the case for a majority of client IP addresses contained in the M-Lab data. However, for some IP addresses we observe strong positive correlations (i.e., ρ > 0). Our hypothesis for this unexpected phenomenon is that when multiple houses of an ISP network share an IP address, the speed measurements can vary over a wide range depending on the broadband capacity of the individual households, whereas the congestion counts have smaller variations reflecting the condition of the network. Thus, mixing measurements (speed and congestion-count) from multiple households will likely result in imbalanced data pairs, causing an unexpected positive correlation between speed and congestion-count.

To better visualize our discussion and hypothesis, we present in Fig. 12 samples of the correlation between download-speed and congestion-count observed over a 12-month period for two IP addresses. In each plot, the normalized density distribution of download-speed measurements is depicted by solid black lines. We overlay it with a scatter plot of download-speed (x-axis) and its corresponding congestion-counts (y-axis), shown by square/circle markers. Note that for a given IP address, we unit-scale (normalize) the measured download-speed and congestion-count separately by dividing each data point by the corresponding maximum value (i.e., Xi/Xmax and Yi/Ymax, where [Xi, Yi] is the pair of download-speed and congestion-count for a client IP). In our plots, the scaled value of the congestion count for each test-point is proportional to the size of the corresponding marker, tiered in two colors – low/medium (i.e., < 0.5) congestion counts are in green, and high/very-high (i.e., ≥ 0.5) congestion counts are in red.

Fig. 12(a) shows a negative correlation (ρ = −0.83) for 458 test-points obtained from an IP address served by Cox ISP in the US – the smaller green squares are mainly skewed towards the bottom right of the plot (low congestion and high speed values), and the larger red circles are grouped in the top left region of the plot (i.e., high congestion and low speed values). On the other hand, Fig. 12(b) shows a positive correlation (ρ = 0.69) for 896 test-points from City ISP in the US – the smaller green squares are mainly spread from the left to the middle bottom of the plot (low congestion and low/medium speed values), and the larger red circles are clustered at the top middle of the plot (high congestion and medium/high speed values).

B. Large-Scale Consistency Validation

We now go back to our M-Lab data to analyze the ρ parameter across ISPs of various sizes, as well as across months, checking whether a consistent pattern of correlation is observed.

1) Across ISPs: Large ISPs such as AT&T and Comcast in the US own a wealth of public IP addresses (i.e., 91 million and 51 million respectively). Smaller ISPs, who own a smaller pool of IPv4 addresses (e.g., class C blocks), are more likely forced to employ NAT (or dynamic leases in the best case) for better management of their limited address resources, whereas larger ISPs who were assigned class A address blocks have the discretion to statically allocate one public IP address to each of their clients.

We therefore start by examining the aggregate ρ parameter for each ISP in AU and US. We select two large and two small ISPs from each country for comparison: in Australia, Telstra and Optus as large providers, and Harbour and CEnet as small providers; in the US, Comcast and AT&T as large providers, and Hurricane and Lightower as small providers. We present in Fig. 13 the normalized density distribution of the ρ value across the unique IP addresses of each ISP. We find ∼12K, ∼4K, 79, and 24 unique addresses from the networks of the Australian ISPs Telstra, Optus, Harbour and CEnet respectively, conducting a total of ∼453K, ∼182K, 5273, and 4638 NDT tests over a 12-month period (Aug'16 - Jul'17). Fig. 13(a) shows the ρ distribution for our selected operators in Australia. It is seen that the ρ parameter is predominantly negative in the large ISPs (shown by solid red lines for Telstra and dashed green lines for Optus in Fig. 13(a)), suggesting that the majority of IP addresses present in the M-Lab data (from these two large ISPs) are consistently assigned to single households. Moreover, the ρ distribution is fairly biased towards positive values for the smaller ISPs – average ρ = 0.31 for Harbour (dotted blue lines) and average ρ = 0.58 for CEnet (dash-dotted purple lines) in Fig. 13(a) – meaning that IP addresses are mainly shared by multiple households of varied broadband capacity.

TABLE III
ISP PAIRS WITH COMMON SUPPORT IN AU AND US (REFINED DATASET).
(For each of the 81 pair IDs referenced in Fig. 16, the table lists the two ISPs compared (ISP1 vs. ISP2), the speed-tier bin in Mbps, and the year of data.)

Fig. 15. Outliers in speed measurements.

Similarly, we observe aggregate negative correlation values for the large ISPs in the US, along with neutral/positive correlations for the smaller ISPs, as shown in Fig. 13(b). For our selected US ISPs Comcast, AT&T, Hurricane, and Lightower, we have ∼4.3m, ∼2.8m, ∼46K, and ∼14K NDT test-points respectively, indexed by ∼98K, ∼53K, 424, and 176 unique addresses. The average ρ for the large operators Comcast and AT&T is −0.39 and −0.43 respectively, whereas the smaller operators Hurricane and Lightower exhibit positive average correlations of 0.21 and 0.10 respectively.

We thus see, on average, a negative correlation between measured download-speed and congestion-count across large network operators (with large pools of IP addresses), and positive correlation values across small network operators (with small pools of IP addresses), in both Australia and the US.

Fig. 16. Comparing ISP speeds for 81 pairs in AU and US (refined dataset).

2) Across Months: We now track the correlation value within a network operator across various months to check whether changes in network conditions affect the ρ value; this verifies the validity of our hypothesis over time. We therefore compute the ρ value for a given IP address on a monthly basis, using the data points observed within a month, e.g., April 2017.

In Fig. 12, we saw the speed, the congestion, and the ρ value computed on aggregate data over a 12-month period for one sample IP address in each network (large and small, separately). We visualize in Fig. 14 the monthly data along with the corresponding ρ values for the same IP addresses and their respective networks. We observe a strong negative correlation for data of address 98.174.39.22 from Cox (one of the top ten large ISPs in the US), consistent across four months in 2017, as shown in Fig. 14(a). Individual monthly speed density curves (narrow single hump) and congestion clusters are fairly similar to the plot in Fig. 12(a), and the ρ value is −0.90, −0.88, −0.82, and −0.76 for the successive months April, May, June, and July respectively.

Considering the IP address from a smaller operator in Fig. 14(b), a strong positive correlation is observed consistently across four successive months in 2016-2017. In each plot, the download-speed density curve depicts two humps, and the congestion markers are aligned (green squares on the left and red circles on the right) in the opposite direction to what is expected, similar to the aggregate performance measurements in Fig. 12(b). We again see strong positive ρ values of 0.61, 0.62, 0.69, and 0.62 respectively for November and December in 2016, and January and February in 2017.

Our analysis of M-Lab data across various network operators and across various months validates that our hypothesis holds true.

C. Estimating Household Speed-Tier

We now filter out measurements corresponding to those IP addresses that exhibit positive correlation between their download-speed and congestion-count (i.e., ρ > 0). We note that a large fraction of IP addresses from small ISPs are filtered due to positive ρ values; for example, no data from CEnet (in AU) is considered as a single house [28].

After removing data of multiple households, we estimate the speed-tier as a proxy for the broadband capacity of each house; we term this maximum possible speed available to the household its "speed-tier". As far as the maximum download speed is concerned, in some cases we observe very large values in measurements that are more likely to be outliers. Fig. 15 exemplifies measured download speeds from a sample household: the green solid line shows the density distribution of speed, overlaid by black circles stacked along the x-axis representing actual data points. We can see several outliers around 60 Mbps in Fig. 15, while the rest of the measurements fall under 30 Mbps – the maximum non-outlier speed is about half of the outlier value. The dashed vertical red line depicts the cut-off point used to filter outliers.

In order to detect and exclude outlier data in our study, we employ the standard modified Thompson Tau technique [29] to statistically determine the rejection zone. This method eliminates outliers that are more than two standard deviations away from the mean value. After filtering outliers from our dataset, we pick the maximum value of the remaining data points as the estimated speed-tier of the corresponding house (i.e., IP address) [28].
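To make this procedure concrete, here is a minimal sketch for a single household (IP address) that has already passed the single-household filter (ρ ≤ 0): it drops download-speed samples lying more than two standard deviations from the mean – the simplified description the text gives of the modified Thompson Tau test [29] – and takes the maximum of the surviving samples as the speed-tier. Function and variable names are ours, and the real pipeline may apply the full iterative Tau procedure rather than this one-pass rule.

```python
import numpy as np

def estimate_speed_tier(download_mbps):
    """Estimate a household's speed-tier from its download-speed samples.

    Discards samples more than two standard deviations from the mean (a
    one-pass simplification of the modified Thompson Tau rejection rule),
    then returns the maximum of the remaining samples.
    """
    samples = np.asarray(download_mbps, dtype=float)
    mean, std = samples.mean(), samples.std(ddof=1)
    kept = samples[np.abs(samples - mean) <= 2 * std]
    return kept.max() if kept.size else samples.max()

# Example: most samples sit below 30 Mbps, with a few ~60 Mbps outliers
# (as in the Fig. 15 example); the outliers are rejected before taking the max.
rng = np.random.default_rng(1)
obs = np.concatenate([rng.normal(25, 2, 200), [58.0, 60.0, 61.0]])
print(round(estimate_speed_tier(obs), 1))  # roughly 30 Mbps
```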
D. Multi-Variate Matching on Refined Dataset

Lastly, we apply the multi-variate matching technique to the refined dataset (i.e., after removing data-points of households with positive ρ and then eliminating outliers) to re-evaluate our comparison of ISP performance. As in the previous section, we begin by sorting ISP pairs in ascending order of their average difference of download speed. Note that we now have 81 ISP pairs (i.e., 2 pairs fewer than in the original dataset of the previous section) with sufficient common support – the new set of pairs is listed in Table III. Our matching results are shown in Fig. 16. We observe that for the majority of ISP pairs, the matching-estimated differences are closer to zero than the naive averages – though we see a few exceptions (e.g., pair IDs 22, 64, 72). This reiterates our view that when ISPs are compared fairly by adjusting for test conditions, they are not so different.
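To illustrate what a matched comparison looks like in its simplest form, the sketch below pairs each test from one ISP with the most similar test from the other ISP in standardized covariate space (speed-tier, server latency, and TCP window size are used as example covariates) and averages the matched differences in download speed. This is a toy nearest-neighbour version under our own assumptions; the study itself relies on established multi-variate matching methods (cf. [16]-[18]) and a richer covariate set.

```python
import numpy as np

def matched_speed_difference(cov_a, speed_a, cov_b, speed_b):
    """One-to-one nearest-neighbour matching on standardized covariates.

    For every test of ISP-A, find the ISP-B test closest in covariate
    space and return the average difference in download speed over the
    matched pairs.
    """
    cov_a, cov_b = np.asarray(cov_a, float), np.asarray(cov_b, float)
    pooled = np.vstack([cov_a, cov_b])
    mean, std = pooled.mean(axis=0), pooled.std(axis=0) + 1e-9
    za, zb = (cov_a - mean) / std, (cov_b - mean) / std

    # Pairwise Euclidean distances between standardized covariate vectors.
    dist = np.linalg.norm(za[:, None, :] - zb[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)  # index of best ISP-B match per ISP-A test
    return float(np.mean(np.asarray(speed_a) - np.asarray(speed_b)[nearest]))

# Toy covariates: [speed-tier (Mbps), server RTT (ms), TCP window (KB)].
cov_a = [[12, 20, 64], [12, 45, 64], [20, 20, 256]]
speed_a = [10.5, 8.0, 17.0]
cov_b = [[12, 22, 64], [20, 18, 256], [12, 50, 64]]
speed_b = [10.0, 16.5, 7.6]
print(matched_speed_difference(cov_a, speed_a, cov_b, speed_b))  # small matched difference
```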

VII. CONCLUSION

This paper is a first step towards a fair comparison of speed performance across broadband ISPs, by applying emerging causal inference techniques widely used in medicine to the large volume of measurement data from M-Lab. We first built a tool to pre-process and visualize M-Lab data, giving preliminary insights into the factors affecting speed performance. We then demonstrated that test attributes such as access speed-tier, host TCP window size, and server distance vary in distribution across ISPs, and further that these attributes affect measurement outcomes. We then applied multi-variate matching to reduce the confounding bias, and our fair comparison between pairs revealed that the difference between ISPs is much lower than what naive aggregates may suggest. Our future work will expand this study by estimating the comparative performance of ISPs for individual customers rather than just aggregates. This will be achieved using more sophisticated methods, such as machine learning based Targeted Maximum Likelihood (TML) algorithms, which can deal with both confounding as well as differential causal effects.

REFERENCES

[1] X. Deng, J. Hamilton, J. Thorne, and V. Sivaraman, "Measuring Broadband Performance using M-Lab: Why Averages Tell a Poor Tale," in Proc. International Networks and Applications Conference (ITNAC), Sydney, Australia, Nov 2015.
[2] Netflix. Netflix ISP Speed Index. http://ispspeedindex.netflix.com/.
[3] YouTube. YouTube Video Quality Report. https://www.google.com/get/videoqualityreport/.
[4] FCC. Broadband Speed. https://www.fcc.gov/general/broadband-speed.
[5] Australian Competition and Consumer Commission. Australia's broadband speeds.
[6] European Commission DG Communications Networks. EU analysis of broadband speed.
[7] V. Bajpai and J. Schonwalder, "A Survey on Internet Performance Measurement Platforms and Related Standardization Efforts," IEEE Communication Surveys and Tutorials, vol. 17, no. 3, pp. 1313–1341, 2015.
[8] S. Bauer, D. Clark, and W. Lehr, "Understanding Broadband Speed Measurements," in Proc. 38th Research Conference on Communication, Information and Internet Policy, Sep 2010.
[9] M. B. Tariq, M. Motiwala, N. Feamster, and M. Ammar, "Detecting Network Neutrality Violations with Causal Inference," in Proc. ACM CoNEXT, Rome, Italy, Dec 2009.
[10] R. Mahajan, M. Zhang, L. Poole, and V. Pai, "Uncovering Performance Differences Among Backbone ISPs with Netdiff," in Proc. USENIX NSDI, San Francisco, California, USA, Apr 2008.
[11] M. Jain and C. Dovrolis, "End-to-End Available Bandwidth: Measurement Methodology, Dynamics, and Relation with TCP Throughput," in Proc. ACM SIGCOMM, Pittsburgh, PA, USA, Aug 2002.
[12] ——, "Ten Fallacies and Pitfalls on End-to-End Available Bandwidth Estimation," in Proc. ACM Internet Measurement Conference, Taormina, Sicily, Italy, Oct 2004.
[13] M. Mirza, J. Sommers, P. Barford, and X. Zhu, "A Machine Learning Approach to TCP Throughput Prediction," in Proc. ACM SIGMETRICS, San Diego, CA, USA, Jun 2007.
[14] J. Pearl, "Causal inference in statistics: An overview," Statistics Surveys, vol. 3, pp. 96–146, 2009.
[15] D. B. Rubin, "Causal Inference Using Potential Outcomes: Design, Modeling, Decisions," Journal of the American Statistical Association, vol. 100, no. 468, pp. 322–331, 2005.
[16] J. Sekhon, "The Neyman-Rubin Model of Causal Inference and Estimation via Matching Methods," in The Oxford Handbook of Political Methodology, 2006.
[17] E. A. Stuart, "Matching Methods for Causal Inference: A Review and a Look Forward," in Institute of Mathematical Statistics, 2010.
[18] J. Sekhon, "Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching Package for R," Journal of Statistical Software, vol. 42, no. 7, 2011.
[19] M-Lab. Measurement Lab. http://www.measurementlab.net/.
[20] Various Authors. perfSONAR: PERFormance Service Oriented Network monitoring ARchitecture. https://www.perfsonar.net/.
[21] Ookla. Speed-Test. http://www.speedtest.net/.
[22] Samknows. Samknows internet performance platform.
[23] M-Lab. BISmark. http://www.measurementlab.net/tools/bismark/.
[24] M. Mathis, J. Heffner, and R. Raghunarayan. TCP Extended Statistics MIB. https://www.ietf.org/rfc/rfc4898.txt.
[25] J. Bleich, A. Kapelner, E. George, and S. Jensen, "Variable Selection for BART: An Application to Gene Regulation," The Annals of Applied Statistics, vol. 8, no. 3, pp. 1750–1781, 2014.
[26] J. Semke, J. Mahdavi, and M. Mathis, "Automatic TCP Buffer Tuning," in Proc. ACM SIGCOMM, Vancouver, British Columbia, Canada, Aug 1998.
[27] E. A. Stuart, "Matching methods for causal inference: A review and a look forward," Statistical Science, vol. 25, pp. 1–21, 2010.
[28] X. Deng, Y. Feng, H. Habibi Gharakheili, and V. Sivaraman, "Estimating Residential Broadband Capacity using Big Data from M-Lab," arXiv preprint, arXiv:1901.07059v1, Jan 2019.
[29] Modified Thompson Tau Outliers Detection. http://www.statisticshowto.com/modified-thompson-tau-test/.

Thanchanok Sutjarittham is currently pursuing her Ph.D. in Electrical Engineering and Telecommunications at the University of New South Wales (UNSW Sydney), where she also received her B.Eng. in Electrical Engineering and Telecommunications in 2016. Her primary research interests include the Internet of Things, sensor data analytics, and applied machine learning.

Hassan Habibi Gharakheili received his B.Sc. and M.Sc. degrees in Electrical Engineering from the Sharif University of Technology in Tehran, Iran, in 2001 and 2004 respectively, and his Ph.D. in Electrical Engineering and Telecommunications from the University of New South Wales in Sydney, Australia, in 2015. He is currently a Senior Lecturer at the University of New South Wales in Sydney, Australia. His current research interests include programmable networks, learning-based networked systems, and data analytics in computer systems.

Blanca Gallego Luxan received the B.S. degree from the Universidad Autónoma Metropolitana, and the Ph.D. degree from the University of California, Los Angeles. She is currently an Associate Professor at the Centre for Big Data Research in Health, UNSW. She has extensive international research experience in data analysis and computational modeling, and has made significant and innovative contributions to the design, analysis, and development of models derived from complex empirical data for a wide range of applications, such as patient safety, biosurveillance, corporate sustainability reporting, ecological footprint analysis, and climate variability.

Xiaohong Deng received her B.Sc. and M.Sc. degrees in Computer Science from Chongqing University of Posts and Telecommunications in 2004 and Beijing University of Posts and Telecommunications in 2007 respectively, and her Ph.D. in Electrical Engineering and Telecommunications from the University of New South Wales in Sydney, Australia, in 2019. She served France Telecom from 2008 to 2013 as a Network Architect and Project Lead. Her research interests include broadband networks and big data analytics of network performance data.

Vijay Sivaraman received his B.Tech. from the Indian Institute of Technology in Delhi, India, in 1994, his M.S. from North Carolina State University in 1996, and his Ph.D. from the University of California at Los Angeles in 2000. He has worked at Bell Labs as a student Fellow, in a Silicon Valley start-up manufacturing optical switch-routers, and as a Senior Research Engineer at the CSIRO in Australia. He is now a Professor at the University of New South Wales in Sydney, Australia. His research interests include Software Defined Networking, network architectures, and cyber-security, particularly for IoT networks.

Yun Feng received his B.Sc. and M.Sc. degrees in Telecommunications from Xidian University in China and the University of New South Wales in Sydney, in 2014 and 2018 respectively. He was a research assistant in the School of Electrical Engineering and Telecommunications at the University of New South Wales. He is currently working at Shanghai Huawei Technologies. His research interests include big data, machine learning, and applications of embedded systems.