Learning in situ: a randomized experiment in video streaming

Francis Y. Yan, Hudson Ayers, Chenzhi Zhu†, Sadjad Fouladi, James Hong, Keyi Zhang, Philip Levis, Keith Winstein

Stanford University, †Tsinghua University

Abstract

We describe the results of a randomized controlled trial of video-streaming algorithms for bitrate selection and network prediction. Over the last year, we have streamed 38.6 years of video to 63,508 users across the Internet. Sessions are randomized in blinded fashion among algorithms.

We found that in this real-world setting, it is difficult for sophisticated or machine-learned control schemes to outperform a "simple" scheme (buffer-based control), notwithstanding good performance in network emulators or simulators. We performed a statistical analysis and found that the heavy-tailed nature of network and user behavior, as well as the challenges of emulating diverse Internet paths during training, present obstacles for learned algorithms in this setting.

We then developed an ABR algorithm that robustly outperformed other schemes, by leveraging data from its deployment and limiting the scope of machine learning only to making predictions that can be checked soon after. The system uses supervised learning in situ, with data from the real deployment environment, to train a probabilistic predictor of upcoming chunk transmission times. This module then informs a classical control policy (model predictive control).

To support further investigation, we are publishing an archive of data and results each week, and will open our ongoing study to the community. We welcome other researchers to use this platform to develop and validate new algorithms for bitrate selection, network prediction, and congestion control.

1 Introduction

Video streaming is the predominant Internet application, making up almost three quarters of all traffic [41]. One key algorithmic question in video streaming is adaptive bitrate selection, or ABR, which decides the compression level selected for each "chunk," or segment, of the video. ABR algorithms optimize the user's quality of experience (QoE): more-compressed chunks reduce quality, but larger chunks may stall playback if the client cannot download them in time.

In the academic literature, many recent ABR algorithms use statistical and machine-learning methods [4, 25, 38–40, 46], which allow algorithms to consider many input signals and try to perform well for a wide variety of clients. An ABR decision can depend on recent throughput, client-side buffer occupancy, delay, the experience of clients on similar ISPs or types of connectivity, etc. Machine learning can find patterns in seas of data and is a natural fit for this problem domain.

However, it is a perennial lesson that the performance of learned algorithms depends on the data or environments used to train them. ML approaches to video streaming and other wide-area networking challenges are often hampered in their access to good and representative training data. The Internet is complex and diverse, individual nodes only observe a noisy sliver of the system dynamics, and behavior is often heavy-tailed and changes with time. Even with representative throughput traces, accurately simulating or emulating the diversity of Internet paths requires more than replaying such traces and is beyond current capabilities [15, 16, 31, 45].

As a result, the performance of algorithms in emulated environments may not generalize to the Internet [7]. For example, CS2P's gains were more modest over real networks than in simulation [40]. Measurements of Pensieve [25] saw narrower benefits on similar paths [11] and on a large-scale streaming service [24]. Other learned algorithms, such as the Remy congestion-control schemes, have also seen inconsistent results on real networks, despite good results in simulation [45].

This paper seeks to answer: what does it take to create a learned ABR algorithm that robustly performs well over the wild Internet? We report the design and findings of Puffer (https://puffer.stanford.edu), an ongoing research study that operates a video-streaming website open to the public. Over the past year, Puffer has streamed 38.6 years of video to 63,508 distinct users, while recording client telemetry for analysis (current load is about 60 stream-days of data per day). Puffer randomly assigns each session to one of a set of ABR algorithms; users are blinded to the assignment. We find:

In our real-world setting, sophisticated algorithms based on control theory [46] or reinforcement learning [25] did not outperform simple buffer-based control [18]. We found that more-sophisticated algorithms do not necessarily beat a simpler, older algorithm. The newer algorithms were developed and evaluated using throughput traces that may not have captured enough of the Internet's heavy tails and other dynamics when replayed in simulation or emulation. Training them on more-representative traces doesn't necessarily reverse this: we retrained one algorithm using throughput traces drawn from Puffer (instead of its original set of traces) and evaluated it also on Puffer, but the results were similar (§5.3).

Statistical margins of error in quantifying algorithm performance are considerable. Prior work on ABR algorithms has claimed benefits of 10–15% [46], 3.2–14% [40], or 12–25% [25], based on throughput traces or real-world experiments lasting hours or days. However, we found that the empirical variability and heavy tails of throughput evolution and rebuffering create statistical margins of uncertainty that make it challenging to detect real effects of this magnitude. Even with a year of experience per scheme, a 20% improvement in rebuffering ratio would be statistically indistinguishable, i.e., below the threshold of detection with 95% confidence. These uncertainties affect the design space of machine-learning approaches that can practically be deployed [13, 26].

It is possible to robustly outperform existing schemes by combining classical control with an ML predictor trained in situ on real data. We describe Fugu, a data-driven ABR algorithm that combines several techniques. Fugu is based on MPC (model predictive control) [46], a classical control policy, but replaces its throughput predictor with a deep neural network trained using supervised learning on data recorded in situ (in place), meaning from Fugu's actual deployment environment, Puffer. The predictor has some uncommon features: it predicts transmission time given a chunk's file size (vs. estimating throughput), it outputs a probability distribution (vs. a point estimate), and it considers low-level congestion-control statistics among its input signals. Ablation studies (§4.2) find each of these features to be necessary to Fugu's performance.

Results of primary experiment (Jan. 26–Aug. 7 & Aug. 30–Oct. 16, 2019):

Algorithm     | Time stalled (lower is better) | Mean SSIM (higher is better) | SSIM variation (lower is better) | Mean duration (time on site)
Fugu          | 0.13%                          | 16.64 dB                     | 0.74 dB                          | 33.6 min
MPC-HM [46]   | 0.22%                          | 16.61 dB                     | 0.79 dB                          | 30.8 min
BBA [18]      | 0.19%                          | 16.56 dB                     | 1.11 dB                          | 32.1 min
Pensieve [25] | 0.17%                          | 16.26 dB                     | 1.05 dB                          | 31.6 min
RobustMPC-HM  | 0.12%                          | 16.01 dB                     | 0.98 dB                          | 31.0 min

Figure 1: In an eight-month randomized controlled trial with blinded assignment, the Fugu scheme outperformed other ABR algorithms. The primary analysis includes 637,189 streams played by 54,612 client IP addresses (13.1 client-years in total). Uncertainties are shown in Figures 9 and 11.

In a controlled experiment during most of 2019, Fugu outperformed existing techniques—including the simple algorithm—in stall ratio (with one exception), video quality, and the variability of video quality (Figure 1). The improvements were significant both statistically and, perhaps, practically: users who were randomly assigned to Fugu (in blinded fashion) chose to continue streaming for 5–9% longer, on average, than users assigned to the other ABR algorithms. (This effect was driven solely by users streaming more than 3 hours of video; we do not fully understand it.)

Our results suggest that, as in other domains, good and representative training is the key challenge for robust performance of learned networking algorithms, a somewhat different point of view from the generalizability arguments in prior work [25, 37, 44]. One way to achieve representative training is to learn in place (in situ) on the actual deployment environment, assuming the scheme can be feasibly trained this way and the deployment is widely enough used to exercise a broad range of scenarios. (Even collecting traces from a deployment environment and replaying them in a simulator or emulator to train a control policy—as is typically necessary in reinforcement learning—is not what we mean by "in situ.") The approach we describe here is only a step in this direction, but we believe Puffer's results suggest that learned systems will benefit by addressing the challenge of "how will we get enough representative scenarios for training—what is enough, and how do we keep them representative over time?" as a first-class consideration.

We intend to operate Puffer as an "open research" project as long as feasible. We invite the research community to train and test new algorithms on randomized subsets of its traffic, gaining feedback on real-world performance with quantified uncertainty. Along with this paper, we are publishing an archive of data and results back to the beginning of 2019 on the Puffer website, with new data and results posted weekly.

In the next few sections, we discuss the background and related work on this problem (§2), the design of our blinded randomized experiment (§3) and the Fugu algorithm (§4), with experimental results in Section 5, and a discussion of results and limitations in Section 6. In the appendices, we provide a standardized diagram of the experimental flow for the primary analysis and describe the data we are releasing.

2 Background and related work

The basic problem of adaptive video streaming has been the subject of much academic work; for a good overview, we refer the reader to Yin et al. [46]. We briefly outline the problem here. A service wishes to serve a pre-recorded or live video stream to a broad array of clients over the Internet. Each client's connection has a different and unpredictable time-varying performance. Because there are many clients, it is not feasible for the service to adjust the encoder configuration in real time to accommodate any one client.


Algorithm    | Control                    | Predictor      | Optimization goal            | How trained
BBA          | classical (linear control) | n/a            | +SSIM s.t. bitrate < limit   | n/a
MPC-HM       | classical (MPC)            | classical (HM) | +SSIM, −stalls, −∆SSIM       | n/a
RobustMPC-HM | classical (robust MPC)     | classical (HM) | +SSIM, −stalls, −∆SSIM       | n/a
Pensieve     | learned (DNN)              | n/a            | +bitrate, −stalls, −∆bitrate | reinforcement learning in simulation
Fugu         | classical (MPC)            | learned (DNN)  | +SSIM, −stalls, −∆SSIM       | supervised learning in situ

Figure 5: Distinguishing features of algorithms used in the primary experiment. HM = harmonic mean of last five throughput samples. MPC = model predictive control. DNN = deep neural network.

Sessions are randomly assigned to serving daemons. Users can switch channels without breaking their TCP connection and may have many "streams" within each session.

Puffer is not a client-side DASH [28] (Dynamic Adaptive Streaming over HTTP) system. Like DASH, though, Puffer is an ABR system streaming chunked video over a TCP connection, and runs the same ABR algorithms that DASH systems can run. We don't expect this architecture to replace client-side ABR (which can be served by CDN edge nodes), but we expect its conclusions to translate to ABR schemes broadly.

The Puffer website works in the Chrome, Firefox, Edge, and Opera browsers, including on Android phones, but does not play in the Safari browser or on iOS (which lack support for the Media Source Extensions or Opus audio).

3.3 Hosting arbitrary ABR schemes

We implemented buffer-based control (BBA), MPC, RobustMPC, and Fugu in back-end daemons that serve video chunks over the WebSocket. We use SSIM in the objective functions for each of these schemes. For BBA, we use the formula in the original paper [18] to decide the maximum chunk size; subject to this constraint, the chunk with the highest SSIM is selected to stream. We also choose reservoir values consistent with our 15-second maximum buffer.
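To make the selection rule concrete, here is a minimal sketch of the SSIM-aware BBA variant just described. The function `bba_max_chunk_size` stands in for the reservoir-based formula of the original paper [18], and the candidate list is hypothetical; this is an illustration, not our production code.

```python
def select_bba_chunk(candidates, buffer_secs, bba_max_chunk_size):
    """candidates: list of (size_bytes, ssim_db) encodings of one chunk.

    BBA's buffer-based formula yields a maximum chunk size; subject to
    that constraint, we stream the encoding with the highest SSIM."""
    limit = bba_max_chunk_size(buffer_secs)
    feasible = [c for c in candidates if c[0] <= limit]
    if not feasible:
        # Nothing fits under the limit: fall back to the smallest encoding.
        return min(candidates, key=lambda c: c[0])
    return max(feasible, key=lambda c: c[1])
```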
Deploying Pensieve for live streaming. We use the released Pensieve code (written in Python with TensorFlow) directly. When a client is assigned to Pensieve, Puffer spawns a Python subprocess running Pensieve's multi-video model.

We contacted the Pensieve authors to request advice on deploying the algorithm in a live, multi-video, real-world setting. The authors recommended that we use a longer-running training and that we tune the entropy parameter when training the multi-video neural network. We wrote an automated tool to train 6 different models with various entropy reduction schemes. We tested these manually over a few real networks, then selected the model with the best performance. We modified the Pensieve code (and confirmed with the authors) so that it does not expect the video to end before a user's session completes. We were not able to modify Pensieve to optimize SSIM; it considers the average bitrate of each Puffer stream. We adjusted the video chunk length to 2.002 seconds and the buffer threshold to 15 seconds to reflect our parameters. For training data, we used the authors' provided script to generate 1000 simulated videos as training videos, and a combination of the FCC and Norway traces linked to in the Pensieve codebase as training traces.

3.4 The Puffer experiment

To recruit participants, we purchased Google and Reddit ads for keywords such as "live tv" and "tv streaming" and paid people on Amazon Mechanical Turk to use Puffer. We were also featured in press articles. Popular programs (e.g. the 2019 and 2020 Super Bowls, the Oscars, World Cup, and "Bachelor in Paradise") brought large spikes (> 20×) over baseline load. Our current average load is about 60 concurrent streams.

Between Jan. 26, 2019 and Feb. 2, 2020, we have streamed 38.6 years of video to 63,508 registered study participants using 111,231 unique IP addresses. About eight months of that period was spent on the "primary experiment," a randomized trial comparing Fugu with other algorithms: MPC, RobustMPC, Pensieve, and BBA (a summary of features is in Figure 5). This period saw a total of 314,577 streaming sessions, and 1,904,316 individual streams. An experimental-flow diagram in the standardized CONSORT format [35] is in the appendix (Figure A1).

We record client telemetry as time-series data, detailing the size and SSIM of every video chunk, the time to deliver each chunk to the client, the buffer size and rebuffering events at the client, the TCP statistics on the server, and the identity of the ABR and congestion-control schemes. A full description of the data is in Appendix B.

Metrics and statistical uncertainty. We group the time series by user stream to calculate a set of summary figures: the total time between the first and last recorded events of the stream, the startup time, the total watch time between the first and last successfully played portion of the stream, the total time the video is stalled for rebuffering, the average SSIM, and the chunk-by-chunk variation in SSIM. The ratio between "total time stalled" and "total watch time" is known as the rebuffering ratio or stall ratio, and is widely used to summarize the performance of streaming video systems [22].

We observe considerable heavy-tailed behavior in most of these statistics. Watch times are skewed (Figure 11), and while the risk of rebuffering is important to any ABR algorithm, actual rebuffering is a rare phenomenon. Of the 637,189 eligible streams considered for the primary analysis across all five ABR schemes, only 24,328 (4%) of those streams had any stalls, mirroring commercial services [22].

These skewed distributions create more room for the play of chance to corrupt the bottom-line statistics summarizing a scheme's performance—even two identical schemes will see considerable variation in average performance until a substantial amount of data is assembled. In this study, we worked to quantify the statistical uncertainty that can be attributed to the play of chance in assigning sessions to ABR algorithms. We calculate confidence intervals on rebuffering ratio with the bootstrap method [14], simulating streams drawn empirically from each scheme's observed distribution of rebuffering ratio as a function of stream duration. We calculate confidence intervals on average SSIM using the formula for weighted standard error, weighting each stream by its duration.

These practices result in substantial confidence intervals: with at least 2.5 years of data for each scheme, the width of the 95% confidence interval on a scheme's stall ratio is between ±13% and ±21% of the mean value. This is comparable to the magnitude of the total benefit reported by some academic work that used much shorter real-world experiments. Even a recent study of a Pensieve-like scheme on Facebook [24], encompassing 30 million streams, did not detect a change in rebuffering ratio outside the level of statistical noise.

We conclude that considerations of uncertainty in real-world learning and experimentation, especially given uncontrolled data from the Internet with real users, deserve further study. Strategies to import real-world data into repeatable emulators [45] or reduce their variance [26] will likely be helpful in producing robust learned networking algorithms.
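As an illustration of the uncertainty quantification above, the following sketch computes a scheme's rebuffering ratio and a percentile-bootstrap confidence interval [14] by resampling whole streams. The actual analysis additionally models rebuffering ratio as a function of stream duration; names such as `streams` are illustrative.

```python
import random

def rebuffering_ratio(streams):
    """streams: list of (watch_time_secs, stall_time_secs), one per stream."""
    total_watch = sum(w for w, _ in streams)
    total_stall = sum(s for _, s in streams)
    return total_stall / total_watch

def bootstrap_ci(streams, n_resamples=10_000, alpha=0.05):
    """Percentile bootstrap: resample streams with replacement, then take
    the empirical alpha/2 and 1 - alpha/2 quantiles of the statistic."""
    estimates = sorted(
        rebuffering_ratio(random.choices(streams, k=len(streams)))
        for _ in range(n_resamples)
    )
    lo = estimates[int(alpha / 2 * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```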

4 Fugu: design and implementation

Fugu is a control algorithm for bitrate selection, designed to be feasibly trained in place (in situ) on a real deployment environment. It consists of a classical controller (model predictive control, the same as in MPC-HM), informed by a nonlinear predictor that can be trained with supervised learning.

Figure 6 shows Fugu's high-level design. Fugu runs on the server, making it easy to update its model and aggregate performance data across clients over time. Clients send necessary telemetry, such as buffer levels, to the server.

Figure 6: Overview of Fugu. (Diagram: the Puffer video server performs model-based control—an MPC controller, informed by the Transmission Time Predictor, makes bitrate selections; client state updates feed data aggregation, and daily training produces model updates.)

The controller, described in Section 4.4, makes decisions by following a classical control algorithm to optimize an objective QoE function (§4.1) based on predictions for how long each chunk would take to transmit. These predictions are provided by the Transmission Time Predictor (TTP) (§4.2), a neural network that estimates a probability distribution for the transmission time of a proposed chunk with given size.

4.1 Objective function

For each video chunk K_i, Fugu has a selection of versions of this chunk to choose from, K_i^s, each with a different size s. As with prior approaches, Fugu quantifies the QoE of each chunk as a linear combination of video quality, video quality variation, and stall time [46]. Unlike some prior approaches, which use the average compressed bitrate of each encoding setting as a proxy for image quality, Fugu optimizes a perceptual measure of picture quality—in our case, SSIM. This has been shown to correlate with human opinions of QoE [12]. We emphasize that we use the exact same objective function in our version of MPC and RobustMPC as well.

Let Q(K) be the video quality of a chunk K, T(K) be the uncertain transmission time of K, and B_i be the current playback buffer size. Following [46], Fugu defines the QoE obtained by sending K_i^s (given the previously sent chunk K_{i−1}) as

    QoE(K_i^s, K_{i−1}) = Q(K_i^s) − λ · |Q(K_i^s) − Q(K_{i−1})| − µ · max{T(K_i^s) − B_i, 0},    (1)

where max{T(K_i^s) − B_i, 0} describes the stall time experienced by sending K_i^s, and λ and µ are configuration constants for how much to weight video quality variation and rebuffering. Fugu plans a trajectory of sizes s of the future H chunks to maximize their expected total QoE.
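Equation 1 transcribes directly into code; a minimal sketch follows, with λ = 1 and µ = 100 as set in §4.5 (quality in SSIM dB, times in seconds):

```python
LAMBDA = 1.0  # weight on video quality variation (Section 4.5)
MU = 100.0    # weight on stall time (Section 4.5)

def qoe(quality, prev_quality, trans_time, buffer_secs):
    """Equation 1: QoE of sending a chunk with quality Q(K_i^s), given the
    previous chunk's quality Q(K_{i-1}), the chunk's transmission time
    T(K_i^s), and the current playback buffer B_i."""
    stall = max(trans_time - buffer_secs, 0.0)
    return quality - LAMBDA * abs(quality - prev_quality) - MU * stall
```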

4.2 Transmission Time Predictor (TTP)

Once Fugu decides which chunk K_i^s to send, two portions of the QoE become known: the video quality and video quality variation. The remaining uncertainty is the stall time. The server knows the current playback buffer size, so what it needs to know is the transmission time: how long will it take for the client to receive the chunk? Given an oracle that reports the transmission time of any chunk, the MPC controller can compute the optimal plan to maximize QoE.

Fugu uses a trained neural-network transmission-time predictor to approximate the oracle. For each chunk in the fixed H-step horizon, we train a separate predictor. E.g., if optimizing for the total QoE of the next five chunks, five neural networks are trained. This lets us parallelize training.

Each TTP network for the future step h ∈ {0,...,H−1} takes as input a vector of:

1. sizes of past t chunks K_{i−t}, ..., K_{i−1},
2. actual transmission times of past t chunks: T_{i−t}, ..., T_{i−1},
3. internal TCP statistics (tcp_info structure),
4. size s of a proposed chunk K_{i+h}^s.

The TCP statistics include the current congestion window size, the number of unacknowledged packets in flight, the smoothed RTT estimate, the minimum RTT, and the TCP estimated throughput (tcpi_delivery_rate).

Prior approaches have used the Harmonic Mean (HM) [46] or a Hidden Markov Model (HMM) [40] to predict a single throughput for the entire lookahead horizon, irrespective of the size of the chunk to send. In contrast, the TTP acknowledges the fact that observed throughput varies with chunk size [7, 32, 47] by taking the size of the proposed chunk K_{i+h}^s as an explicit input. In addition, it outputs a discretized probability distribution of the predicted transmission time T̂(K_{i+h}^s).
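A minimal PyTorch sketch of one TTP network, combining the input vector above with the architecture and discretization given in §4.5 (t = 8 past chunks, five tcp_info fields, two 64-neuron hidden layers, 21 output bins) and the cross-entropy training of §4.3; the production feature encoding may differ.

```python
import torch
import torch.nn as nn

T_PAST, TCP_FEATS, N_BINS = 8, 5, 21  # past chunks, tcp_info fields, time bins
IN_DIM = 2 * T_PAST + TCP_FEATS + 1   # past sizes + past times + TCP + size s

class TTP(nn.Module):
    """Outputs logits over 21 transmission-time bins for one horizon step h."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IN_DIM, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, N_BINS),
        )

    def forward(self, x):
        return self.net(x)

def time_to_bin(t):
    """Discretize: [0, 0.25) -> 0, [0.25, 0.75) -> 1, ..., [9.75, inf) -> 20."""
    return 0 if t < 0.25 else min(int((t - 0.25) / 0.5) + 1, N_BINS - 1)

# One supervised training step (cross-entropy on the binned transmission
# time, stochastic gradient descent), as described in Section 4.3.
model = TTP()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
features = torch.randn(32, IN_DIM)        # stand-in batch of input vectors
labels = torch.randint(0, N_BINS, (32,))  # binned actual transmission times
optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(features), labels)
loss.backward()
optimizer.step()
```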

4.3 Training the TTP

We sample from the real usage data collected by any scheme running on Puffer and feed individual user streams to the TTP as training input. For the TTP network at future step h, each user stream contributes a chunk-by-chunk series of (a) the input 4-vector, with the last element being the size of the actually sent chunk K_{i+h}, and (b) the actual transmission time T_{i+h} of chunk K_{i+h} as the desired output; the sequence is shuffled to remove correlation. It is worth noting that unlike prior work [25, 40] that learned from throughput traces, the TTP is trained directly on real chunk-by-chunk data.

We train the TTP with standard supervised learning: the training minimizes the cross-entropy loss between the output probability distribution and the discretized actual transmission time using stochastic gradient descent.

We retrain the TTP every day, using training data collected over the prior 14 days, to avoid the effects of dataset shift and catastrophic forgetting [33, 34]. Within the 14-day window, we weight more recent days more heavily. The weights from the previous day's model are loaded to warm-start the retraining.

4.4 Model-based controller

Our MPC controller (used for MPC-HM, RobustMPC-HM, and Fugu) is a stochastic optimal controller that maximizes the expected cumulative QoE in Equation 1 with replanning. It queries the TTP for predictions of transmission time and outputs a plan K_i^s, K_{i+1}^s, ..., K_{i+H−1}^s by value iteration [8]. After sending K_i^s, the controller observes and updates the input vector passed into the TTP, and replans again for the next chunk.

Given the current playback buffer level B_i and the last sent chunk K_{i−1}, let v_i^*(B_i, K_{i−1}) denote the maximum expected sum of QoE that can be achieved in the H-step lookahead horizon. We have value iteration as follows:

    v_i^*(B_i, K_{i−1}) = max_{K_i^s} { Σ_{t_i} Pr[T̂(K_i^s) = t_i] · ( QoE(K_i^s, K_{i−1}) + v_{i+1}^*(B_{i+1}, K_i^s) ) },

where Pr[T̂(K_i^s) = t_i] is the probability predicted by the TTP for the transmission time of K_i^s to be t_i, and B_{i+1} can be derived by system dynamics [46] given an enumerated (discretized) t_i. The controller computes the optimal trajectory by solving the above value iteration with dynamic programming (DP). To make the DP computationally feasible, it also discretizes B_i into bins and uses forward recursion with memoization to compute only the relevant states.
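The value iteration above can be sketched as a recursion over the horizon; `ttp_predict(h, size)` is assumed to return (probability, transmission-time) pairs from the TTP for step h, and `versions[h]` the (size, quality) options for chunk i+h. For clarity this sketch omits the buffer-bin discretization and memoization that make the production DP feasible, simplifies the buffer dynamics, and reuses `qoe` from §4.1.

```python
CHUNK_SECS, MAX_BUFFER, H = 2.002, 15.0, 5

def next_buffer(buffer_secs, trans_time):
    # Simplified system dynamics: the buffer drains while the chunk is in
    # flight, then gains one chunk's worth of playback on arrival.
    return min(max(buffer_secs - trans_time, 0.0) + CHUNK_SECS, MAX_BUFFER)

def v_star(h, buffer_secs, prev_quality, versions, ttp_predict):
    """Maximum expected sum of QoE over the remaining lookahead steps."""
    if h == H:
        return 0.0, None
    best_value, best_size = float("-inf"), None
    for size, quality in versions[h]:
        expected = 0.0
        for prob, t in ttp_predict(h, size):  # discretized T-hat distribution
            value, _ = v_star(h + 1, next_buffer(buffer_secs, t),
                              quality, versions, ttp_predict)
            expected += prob * (qoe(quality, prev_quality, t, buffer_secs)
                                + value)
        if expected > best_value:
            best_value, best_size = expected, size
    return best_value, best_size  # best_size: the plan's first decision
```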
4.5 Implementation

The TTP takes as input the past t = 8 chunks, and outputs a probability distribution over 21 bins of transmission time: [0, 0.25), [0.25, 0.75), [0.75, 1.25), ..., [9.75, ∞), with 0.5 seconds as the bin size except for the first and the last bins. The TTP is a fully connected neural network, with two hidden layers of 64 neurons each. We tested TTPs with various numbers of hidden layers and neurons, and found similar training losses across a range of conditions. We implemented the TTP and its training in PyTorch, but we load the trained model in C++ when running on the production server for performance. A forward pass of the TTP's neural network in C++ imposes minimal overhead per chunk (less than 0.3 ms on average on a recent x86-64 core). The MPC controller optimizes over H = 5 future steps (about 10 seconds). We set λ = 1 and µ = 100 to balance the conflicting goals in QoE. Each retraining takes about 6 hours on a 48-core server.

4.6 Ablation study of TTP features

We performed an ablation study to assess the impact of the TTP's features (Figure 7). The prediction accuracy is measured using the mean squared error (MSE) between the predicted transmission time and the actual (absolute, unbinned) value. For the TTP that outputs a probability distribution, we compute the expected transmission time by weighting the median value of each bin with the corresponding probability. Here are the more notable results:

Use of low-level congestion-control statistics. The TTP's nature as a DNN lets it consider a variety of noisy inputs, including low-level congestion-control statistics. We feed the kernel's tcp_info structure to the TTP, and find that several of these fields contribute positively to the TTP's accuracy, especially the RTT, CWND, and number of packets in flight (Figure 7). Although client-side ABR systems cannot typically access this structure directly, because the statistics live on the sender, these results should motivate the communication of richer data to ABR algorithms wherever they live.

Transmission-time prediction. The TTP explicitly considers the size of a proposed chunk, rather than predicting throughput and then modeling transmission time as scaling linearly with chunk size [7, 32, 47]. We compared the TTP with an equivalent throughput predictor that is agnostic to the


…about systematic uncertainties and the difficulties of running experiments and accumulating so much data.

Some of Fugu's performance (and that of MPC, RobustMPC, and BBA) relative to Pensieve may be due to the fact that these four schemes received more information as they ran—namely, the SSIM of each possible version of each future chunk—than did Pensieve. It is possible that an "SSIM-aware" Pensieve might perform better. The load of calculating SSIM for each encoded chunk is not insignificant—about an extra 40% on top of encoding the video.

6.2 Limitations of Fugu

There is a sense that data-driven algorithms that more "heavily" squeeze out performance gains may also put themselves at risk of brittleness when a deployment environment drifts from the one where the algorithm was trained. In that sense, it is hard to say whether Fugu's performance might decay catastrophically some day. We tried and failed to demonstrate a quantitative benefit from daily retraining over "out-of-date" vintages, but at the same time, we cannot be sure that some surprising detail tomorrow—e.g., a new user from an unfamiliar network—won't send Fugu into a tailspin before it can be retrained. A year of data on a growing userbase suggests, but doesn't guarantee, robustness to a changing environment.

Fugu does not consider several issues that other research has concerned itself with—e.g., being able to "replace" already-downloaded chunks in the buffer with higher-quality versions [38], or optimizing the joint QoE of multiple clients who share a congestion bottleneck [29].

Fugu is not tied as tightly to TCP or congestion control as it might be—for example, Fugu could wait to send a chunk until the TCP sender tells it that there is a sufficient congestion window for most of the chunk (or the whole chunk) to be sent immediately. Otherwise, it might choose to wait and make a better-informed decision later. Fugu does not schedule the transmission of chunks—it will always send the next chunk as long as the client has room in its playback buffer.

7 Conclusion

Machine-learned systems in computer networking sometimes describe themselves as achieving near-"optimal" performance, based on results in a contained or modeled version of the problem [25, 37, 39]. Such approaches are not limited to the academic community: in early 2020, a major video-streaming company announced a $5,000 prize for the best low-delay ABR scheme, in which candidates will be evaluated in a network simulator that follows a trace of varying throughput [2].

In this paper, we suggest that these efforts can benefit from considering a broader notion of performance and optimality. Good, or even near-optimal, performance in a simulator or emulator does not necessarily predict good performance over the wild Internet, with its variability and heavy-tailed distributions. It remains a challenging problem to gather the appropriate training data (or in the case of RL systems, training environments) to properly learn and validate such systems.

In this paper, we asked: what does it take to create a learned ABR algorithm that robustly performs well over the wild Internet? In effect, our best answer is to cheat: train the algorithm in situ on data from the real deployment environment, and use an algorithm whose structure is sophisticated enough (a neural network) and yet also simple enough (a predictor amenable to supervised learning on data, informing a classical controller) to benefit from that kind of training.

Over the last year, we have streamed 38.6 years of video to 63,508 users across the Internet. Sessions are randomized in blinded fashion among algorithms, and client telemetry is recorded for analysis. The Fugu algorithm robustly outperformed other schemes, both simple and sophisticated, on objective measures (SSIM, stall time, SSIM variability) and increased the duration that users chose to continue streaming.

We have found the Puffer approach a powerful tool for networking research—it is fulfilling to be able to "measure, then build" [5] and to iterate rapidly on new ideas and gain feedback. Accordingly, we are opening Puffer as an "open research" platform. Along with this paper, we are publishing our full archive of data and results on the Puffer website. The system posts new data each week, along with a summary of results from the ongoing experiments, with confidence intervals similar to those in this paper. (The format is described in Appendix B.) We redacted some fields from the public archive to protect participants' privacy (e.g., IP address) but are willing to work with researchers on access to these fields in an aggregated fashion. Puffer and Fugu are also open-source software, as are the analysis tools used to prepare the results in this paper.

We plan to operate Puffer as long as feasible and invite researchers to train and validate new algorithms for ABR control, network and throughput prediction, and congestion control on its traffic. We are eager to collaborate with and learn from the community's ideas on how to design and deploy robust learned systems for the Internet.

Acknowledgments

We are greatly indebted to Emily Marx, who joined this project after the original submission of this paper, found and corrected bugs in our analysis tools, and performed the final data analysis. We thank our shepherd, Vyas Sekar, and the ACM SIGCOMM and USENIX NSDI reviewers for their helpful feedback. We are grateful for conversations with and feedback from Danfei Xu, T.Y. Huang, Hongzi Mao, Michael Schapira, and Nils Krahnstoever, and we thank the participants in the Puffer research study, without whom these experiments could not have been conducted. This work was supported by NSF grant CNS-1909212 and by Google, Huawei, VMware, Dropbox, Facebook, and the Stanford Platform Lab.

References

[1] Locast: Non-profit retransmission of broadcast television, June 2018. https://news.locast.org/app/uploads/2018/11/Locast-White-Paper.pdf.
[2] MMSys'20 Grand Challenge on Adaptation Algorithms for Near-Second Latency, January 2020. https://2020.acmmmsys.org/lll_challenge.php.
[3] Alekh Agarwal, Nan Jiang, and Sham M. Kakade. Lecture notes on the theory of reinforcement learning. 2019.
[4] Zahaib Akhtar, Yun Seong Nam, Ramesh Govindan, Sanjay Rao, Jessica Chen, Ethan Katz-Bassett, Bruno Ribeiro, Jibin Zhan, and Hui Zhang. Oboe: Auto-tuning video ABR algorithms to network conditions. In Proceedings of the 2018 Conference of the ACM SIGCOMM, pages 44–58, 2018.
[5] Remzi Arpaci-Dusseau. Measure, then build (USENIX ATC 2019 keynote). Renton, WA, July 2019. USENIX Association.
[6] Athula Balachandran, Vyas Sekar, Aditya Akella, Srinivasan Seshan, Ion Stoica, and Hui Zhang. Developing a predictive model of quality of experience for Internet video. ACM SIGCOMM Computer Communication Review, 43(4):339–350, 2013.
[7] Mihovil Bartulovic, Junchen Jiang, Sivaraman Balakrishnan, Vyas Sekar, and Bruno Sinopoli. Biases in data-driven networking, and what to do about them. In Proceedings of the 16th ACM Workshop on Hot Topics in Networks, pages 192–198, 2017.
[8] Richard Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, pages 679–684, 1957.
[9] Neal Cardwell, Yuchung Cheng, C. Stephen Gunn, Soheil Hassas Yeganeh, and Van Jacobson. BBR: Congestion-based congestion control. ACM Queue, 14(5):20–53, 2016.
[10] Federal Communications Commission. Measuring Broadband America. https://www.fcc.gov/general/measuring-broadband-america.
[11] Paul Crews and Hudson Ayers. CS 244 '18: Recreating and extending Pensieve, 2018. https://reproducingnetworkresearch.wordpress.com/2018/07/16/cs-244-18-recreating-and-extending-pensieve/.
[12] Zhengfang Duanmu, Kai Zeng, Kede Ma, Abdul Rehman, and Zhou Wang. A quality-of-experience index for streaming video. IEEE Journal of Selected Topics in Signal Processing, 11(1):154–166, 2016.
[13] Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. In ICML 2019 Workshop RL4RealLife, 2019.
[14] Bradley Efron and Robert Tibshirani. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, pages 54–75, 1986.
[15] Sally Floyd and Eddie Kohler. Internet research needs better models. ACM SIGCOMM Computer Communication Review, 33(1):29–34, 2003.
[16] Sally Floyd and Vern Paxson. Difficulties in simulating the Internet. IEEE/ACM Transactions on Networking, 9(4):392–403, 2001.
[17] Sadjad Fouladi, John Emmons, Emre Orbay, Catherine Wu, Riad S. Wahby, and Keith Winstein. Salsify: Low-latency network video through tighter integration between a video codec and a transport protocol. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 267–282, 2018.
[18] Te-Yuan Huang, Ramesh Johari, Nick McKeown, Matthew Trunnell, and Mark Watson. A buffer-based approach to rate adaptation: Evidence from a large video streaming service. In Proceedings of the 2014 Conference of the ACM SIGCOMM, pages 187–198, 2014.
[19] Junchen Jiang, Vyas Sekar, Henry Milner, Davis Shepherd, Ion Stoica, and Hui Zhang. CFA: A practical prediction system for video QoE optimization. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), pages 137–150, 2016.
[20] Junchen Jiang, Vyas Sekar, and Hui Zhang. Improving fairness, efficiency, and stability in HTTP-based adaptive video streaming with FESTIVE. In Proceedings of the 8th International Conference on emerging Networking EXperiments and Technologies, pages 97–108, 2012.
[21] S. Shunmuga Krishnan and Ramesh K. Sitaraman. Video stream quality impacts viewer behavior: Inferring causality using quasi-experimental designs. IEEE/ACM Transactions on Networking, 21(6):2001–2014, 2013.
[22] Adam Langley, Alistair Riddoch, Alyssa Wilk, Antonio Vicente, Charles Krasic, Dan Zhang, Fan Yang, Fedor Kouranov, Ian Swett, Janardhan Iyengar, et al. The QUIC transport protocol: Design and Internet-scale deployment. In Proceedings of the 2017 Conference of the ACM SIGCOMM, pages 183–196, 2017.
[23] Zhi Li, Xiaoqing Zhu, Joshua Gahm, Rong Pan, Hao Hu, Ali C. Begen, and David Oran. Probe and adapt: Rate adaptation for HTTP video streaming at scale. IEEE Journal on Selected Areas in Communications, 32(4):719–733, 2014.
[24] Hongzi Mao, Shannon Chen, Drew Dimmery, Shaun Singh, Drew Blaisdell, Yuandong Tian, Mohammad Alizadeh, and Eytan Bakshy. Real-world video adaptation with reinforcement learning. In ICML 2019 Workshop RL4RealLife, 2019.
[25] Hongzi Mao, Ravi Netravali, and Mohammad Alizadeh. Neural adaptive video streaming with Pensieve. In Proceedings of the 2017 Conference of the ACM SIGCOMM, pages 197–210. ACM, 2017.
[26] Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, and Mohammad Alizadeh. Variance reduction for reinforcement learning in input-driven environments. In International Conference on Learning Representations, 2019.
[27] Ricky K.P. Mok, Xiapu Luo, Edmond W.W. Chan, and Rocky K.C. Chang. QDASH: a QoE-aware DASH system. In Proceedings of the 3rd Multimedia Systems Conference, pages 11–22, 2012.
[28] Dynamic adaptive streaming over HTTP (DASH) — Part 1: Media presentation description and segment formats, April 2012. ISO/IEC 23009-1 (http://standards.iso.org/ittf/PubliclyAvailableStandards).
[29] Vikram Nathan, Vibhaalakshmi Sivaraman, Ravichandra Addanki, Mehrdad Khani, Prateesh Goyal, and Mohammad Alizadeh. End-to-end transport for video QoE fairness. In Proceedings of the ACM Special Interest Group on Data Communication, SIGCOMM '19, pages 408–423, New York, NY, USA, 2019. Association for Computing Machinery.
[30] Ravi Netravali, Anirudh Sivaraman, Somak Das, Ameesh Goyal, Keith Winstein, James Mickens, and Hari Balakrishnan. Mahimahi: Accurate record-and-replay for HTTP. In 2015 USENIX Annual Technical Conference (USENIX ATC 15), pages 417–429, 2015.
[31] Vern Paxson and Sally Floyd. Why we don't know how to simulate the Internet. In Proceedings of the 29th Conference on Winter Simulation, pages 1037–1044, 1997.
[32] Yanyuan Qin, Shuai Hao, Krishna R. Pattipati, Feng Qian, Subhabrata Sen, Bing Wang, and Chaoqun Yue. ABR streaming of VBR-encoded videos: characterization, challenges, and solutions. In Proceedings of the 14th International Conference on emerging Networking EXperiments and Technologies, pages 366–378. ACM, 2018.
[33] Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
[34] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.
[35] Kenneth F. Schulz, Douglas G. Altman, and David Moher. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. BMC Medicine, 8(1):18, 2010.
[36] Alexander T. Schwarm and Michael Nikolaou. Chance-constrained model predictive control. AIChE Journal, 45(8):1743–1752, 1999.
[37] Anirudh Sivaraman, Keith Winstein, Pratiksha Thaker, and Hari Balakrishnan. An experimental study of the learnability of congestion control. In Proceedings of the 2014 Conference of the ACM SIGCOMM, pages 479–490, 2014.
[38] Kevin Spiteri, Ramesh Sitaraman, and Daniel Sparacio. From theory to practice: Improving bitrate adaptation in the DASH reference player. In Proceedings of the 9th ACM Multimedia Systems Conference, MMSys '18, pages 123–137, New York, NY, USA, 2018. ACM.
[39] Kevin Spiteri, Rahul Urgaonkar, and Ramesh K. Sitaraman. BOLA: Near-optimal bitrate adaptation for online videos. In INFOCOM 2016 - The 35th Annual IEEE International Conference on Computer Communications, pages 1–9. IEEE, 2016.
[40] Yi Sun, Xiaoqi Yin, Junchen Jiang, Vyas Sekar, Fuyuan Lin, Nanshu Wang, Tao Liu, and Bruno Sinopoli. CS2P: Improving video bitrate selection and adaptation with data-driven throughput prediction. In Proceedings of the 2016 Conference of the ACM SIGCOMM, pages 272–285, 2016.
[41] Cisco Systems. Cisco Visual Networking Index: Forecast and trends, 2017–2022, November 2018. https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-741490.pdf.
[42] Guibin Tian and Yong Liu. Towards agile and smooth video adaptation in dynamic HTTP streaming. In Proceedings of the 8th International Conference on emerging Networking EXperiments and Technologies, pages 109–120, 2012.
[43] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[44] Keith Winstein and Hari Balakrishnan. TCP ex Machina: Computer-generated congestion control. Proceedings of the 2013 Conference of the ACM SIGCOMM, 43(4):123–134, 2013.
[45] Francis Y. Yan, Jestin Ma, Greg D. Hill, Deepti Raghavan, Riad S. Wahby, Philip Levis, and Keith Winstein. Pantheon: the training ground for Internet congestion-control research. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pages 731–743, Boston, MA, 2018. USENIX Association.
[46] Xiaoqi Yin, Abhishek Jindal, Vyas Sekar, and Bruno Sinopoli. A control-theoretic approach for dynamic adaptive video streaming over HTTP. In Proceedings of the 2015 Conference of the ACM SIGCOMM, pages 325–338, 2015.
[47] Tong Zhang, Fengyuan Ren, Wenxue Cheng, Xiaohui Luo, Ran Shu, and Xiaolan Liu. Modeling and analyzing the influence of chunk size variation on bitrate adaptation in DASH. In IEEE INFOCOM 2017 - IEEE Conference on Computer Communications, pages 1–9. IEEE, 2017.

A Randomized trial flow diagram

314,577 sessions underwent randomization: 1,904,316 streams, 69,017 unique IPs, 17.2 client-years of data.

69,941 sessions were excluded (437,266 streams, 4.0 client-years of data):
◦ 102,994 streams were assigned CUBIC
◦ 334,272 streams were assigned experimental algorithms for portions of the study duration

                                    | Fugu    | MPC-HM  | RobustMPC-HM | Pensieve | BBA
Sessions assigned                   | 49,960  | 49,084  | 48,519       | 47,819   | 49,254
Streams                             | 303,250 | 294,541 | 293,323      | 283,683  | 292,253
Streams excluded                    | 170,629 | 166,186 | 166,792      | 158,879  | 167,375
◦ did not begin playing             | 385     | 527     | 213          | 380      | 330
◦ watch time less than 4 s          | 170,180 | 165,603 | 166,487      | 158,474  | 167,009
◦ stalled from a slow video decoder | 64      | 56      | 92           | 25       | 35
◦ sent contradictory data           | —       | —       | —            | —        | 1
Streams truncated (loss of contact) | 3,810   | 3,580   | 3,327        | 3,557    | 3,585
Streams considered                  | 132,621 | 128,355 | 126,531      | 124,804  | 124,878
Client-years of data considered     | 2.8     | 2.6     | 2.5          | 2.5      | 2.7

In total, 637,189 streams were considered (13.1 client-years of data):
◦ 1.2 client-days spent in startup
◦ 7.9 client-days spent stalled
◦ 13.1 client-years spent playing
Figure A1: CONSORT-style diagram [35] of experimental flow for the primary results (Figures 1 and 9), obtained during the period Jan. 26–Aug. 7, 2019, and Aug. 30–Oct. 16, 2019. A "session" represents one visit to the Puffer video player and may contain many "streams." Reloading starts a new session, but changing channels only starts a new stream and does not change TCP connections or ABR algorithms.

B Description of open data

The open data we are releasing comprise different "measurements"—each measurement contains a different set of time-series data collected on Puffer servers. Below we highlight the format of interesting fields in three measurements that are essential for analysis: video_sent, video_acked, and client_buffer.

video_sent collects a data point every time a Puffer server sends a video chunk to a client. Each data point contains:
• time: timestamp when the chunk is sent
• session_id: unique ID for the video session
• expt_id: unique ID to identify the experimental group; expt_id can be used as a key to retrieve the experimental setting (e.g., ABR, congestion control) when sending the chunk, in another file we are providing.
• channel: TV channel name
• video_ts: unique presentation timestamp of the chunk
• format: encoding settings of the chunk, including resolution and constant rate factor (CRF)
• size: size of the chunk
• ssim_index: SSIM of the chunk
• cwnd: congestion window size (tcpi_snd_cwnd)
• in_flight: number of unacknowledged packets in flight (tcpi_unacked - tcpi_sacked - tcpi_lost + tcpi_retrans)
• min_rtt: minimum RTT (tcpi_min_rtt)
• rtt: smoothed RTT estimate (tcpi_rtt)
• delivery_rate: estimate of TCP throughput (tcpi_delivery_rate)

video_acked collects a data point every time a Puffer server receives a video chunk acknowledgement from a client. Each data point can be matched to a data point in video_sent using video_ts (if the chunk is ever acknowledged) and used to calculate the transmission time of the chunk—the difference between the timestamps in the two data points. Specifically, each data point in video_acked contains:
• time: timestamp when the chunk is acknowledged
• session_id
• expt_id
• channel
• video_ts

client_buffer collects client-side information reported to Puffer servers on a regular interval and when certain events occur. Each data point contains:
• time: timestamp when the client message is received
• session_id
• expt_id
• channel
• event: event type, e.g., was this triggered by a regular report every quarter second, or because the client stalled or began playing.
• buffer: playback buffer size
• cum_rebuf: cumulative rebuffer time in the current stream

Between Jan. 26, 2019 and Feb. 2, 2020, we collected 675,839,652 data points in video_sent, 677,956,279 data points in video_acked, and 4,622,575,336 data points in client_buffer.
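As a usage example, a chunk's transmission time can be recovered by matching video_sent and video_acked rows on video_ts within a session, as described above. A sketch follows, with rows represented as dicts with the fields listed above; keying on channel as well is an assumption to keep video_ts unambiguous across channels.

```python
def transmission_times(video_sent, video_acked):
    """Map each acknowledged chunk to its transmission time in seconds:
    the difference between the video_acked and video_sent timestamps."""
    sent_at = {(r["session_id"], r["channel"], r["video_ts"]): r["time"]
               for r in video_sent}
    times = {}
    for r in video_acked:
        key = (r["session_id"], r["channel"], r["video_ts"])
        if key in sent_at:  # chunks never acknowledged have no match
            times[key] = r["time"] - sent_at[key]
    return times
```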
