Under anonymous submission (Sept. 19, 2019) , 1 what does it take to create a Sadjad Fouladi † We report the design and findings of Puffer https://puffer.stanford.edu 1 Unfortunately, ML approaches for video streaming and As a result, algorithms trained in emulated environments This paper seeks to answer: However, it is a perennial lesson that the performance an ongoing research study that operates a video-streaming has streamed 14.2 years of video tois 56,000 about distinct 50 users, stream-daysassigns of each data session per to day). one of Puffer a randomly set of ABR algorithms; users ronments used to trainized machine them. learning Internet in services partdatasets. because revolution- Generally they speaking, have the huge, machine-learningnity commu- rich has gravitated towards simplerhuge algorithms amounts trained of with representative data—these algorithmsto tend be more robust in novelsophisticated scenarios algorithms [16,37]—and trained away on from small datasets. other networking areas are oftengood hampered in and their representative access training to data.plex and The diverse, Internet individual is nodes com- onlyof observe the a system noisy dynamics, sliver and behavior is often heavy-tailed. paths remains beyond current capabilities [13, 14 , 28, 42]. may not generalize to thegains Internet [5]. were For more example, CS2P’s modestlation over [38]. real Measurements networks of thanbenefits Pensieve on in similar [23] simu- network saw pathsservices narrower [9] [22]. or large-scale Other streaming learnedcongestion-control algorithms, schemes, such have as also the seensults inconsistent Remy on re- real networks, despite good results in simulation [42]. learned ABR algorithm that robustly performswild well Internet? over the decision can depend on recentoccupancy, throughput, delay, client-side the buffer experience of clientstypes on of similar connectivity, etc. ISPs Machine or learningin can seas find of patterns data and is a natural fit for thisof problem learned domain. algorithms depends greatly on the data or envi- website open to the public. Over the past eight months, Puffer while recording client telemetry for analysis (current load Accurately simulating or emulating the diversity of Internet Tsinghua University † 1 adaptive bitrate : a randomized experiment in video streaming Stanford University, in situ Abstract James Hong Keyi Zhang Philip Levis Keith Winstein Francis Y. Yan Hudson Ayers Chenzhi Zhu Learning , or ABR, which decides the compression level se- on data from the real deployment environment.

In the academic literature, many recent ABR algorithms use To support further investigation, we are publishing an We developed an ABR algorithm that robustly outperforms We found that in this real-world setting, it is difficult for so-

statistical and machine-learning methods [2,23,35,36,38,43], try to perform well for a wide variety of clients. An ABR more-compressed chunks reduce quality, butmay larger stall chunks playback if the client cannot download them in time. selection lected for each “chunk,” orgorithms segment, optimize of the the video. user’s ABR quality al- of experience (QoE): ing up almost threegorithmic quarters of question all in traffic video [39]. streaming One key is al- 1 Introduction ing study to the community. Weuse welcome this other platform researchers to to developbitrate and selection, validate network new prediction, algorithms and for congestion control. in situ archive of traces and results each day, and will open our ongo- other schemes in practice, by combininga classical learned control with network predictor, trained with supervised learning good performance in network emulatorsperformed or a statistical simulators. analysis We andand found heavy-tailed that nature the of variability networkcreate and hurdles algorithm for behavior robust learned algorithms in this area. phisticated or machine-learned control schemes toa outperform “simple” scheme (buffer-based control), notwithstanding sions are randomized in blindedand fashion client among telemetry algorithms, is recorded for analysis. prediction. Over the last eight months, we have streamed video-streaming algorithms for bitrate selection and network which allow algorithms to consider many input signals and Video streaming is the predominant Internet application, mak- We describe the results of a randomized controlled trial of 14.2 years of video to 56,000 users across the Internet. Ses- arXiv:1906.01113v2 [cs.NI] 23 Sep 2019 Sep 23 [cs.NI] arXiv:1906.01113v2 Under anonymous submission (Sept. 19, 2019) 32.6 min 0.68 dB ” as a first-class consideration. 16.9 dB how will we get enough representative 0.12% 0.10% 16.2 dB 0.90 dB 27.4 min (lower is better) (higher is better) (lower is better) (time on site) In a seven-month randomized controlled trial with , on real data from the actual deployment environment, Instead, the service encodes the video into a handful of We intend to operate Puffer as an “open research” project In the next few sections, we discuss the background and Results of primary experiment (Jan. 19–Aug. 7 &Algorithm Aug. 30–Sept. 12, 2019) TimeFugu stalled Mean SSIMMPC-HM SSIM] [43 variation MeanBBA duration [17] 0.25%Pensieve [23]RobustMPC-HM 16.8 dB 0.17% 0.19% 16.5 0.72 dB 16.8 dB dB 0.97 dB 1.03 27.9 dB min 28.5 min 29.6 min client’s connection has a different and unpredictable time- feasible for the service toreal adjust time the to encoder accommodate configuration any in one client. alternative compressed versions. Each represents the original ing the challenge of “ scenarios for training—what is enough,them and representative how over do time? we keep for the next several years.to We invite train the research and community testits new traffic, algorithms gaining on feedback on randomizedquantified real-world subsets uncertainty. performance of Along with with this paper,ing we an are archive of publish- traces and2019, results with back new traces to and the results beginning of posted daily. related work on this problemrandomized (§2), the experiment design (§3) of and our blinded the Fugu algorithm (§4), results and limitations inprovide Section6. 
a In standardized diagram the of appendices,the the we primary experimental analysis flow for and describedata the we format are and releasing quantity alongside of this paper. 2 Background and Related Work subject of much academic work; forthe a reader good overview, to we Yin refer ethere. al. A [43]. service We briefly wishes outline tostream serve the a problem to pre-recorded or a live broad video array of clients over the Internet. Each Figure 1: blinded assignment, the Fugu scheme outperformed other streams played by 44,907 clientin IP addresses total). (8.5 Uncertainties client-years are shown in Figures8 and 10. situ assuming the scheme can bedeployment trained is on widely observed enough data used andof to the scenarios. exercise a The broad approach range wein describe this direction, here but is we onlymachine-learned believe a Puffer’s networking results step systems suggest will that benefit by address- varying performance. Because there are many clients, it is not video but at a different quality, target bitrate, or resolution. The basic problem of adaptive video streaming has been the with experimental results in5, Section and a discussion of ABR algorithms. The primary analysis includes 458,801 2 . in 2 We given a proposed chunk’s file- Prior work on ABR algorithms We describe Fugu, a data-driven ABR transmission time on real data. of accumulated experience (or representative traces) This effect was statistically significant but driven solely by users stream- 2 Our results suggest that, as in other domains, good and rep- In a rigorous controlled experiment during most of 2019, year ing more than 2.5 hours of video; we do not fully understand it. resentative training is the key distinguishingfor feature robust performance required of learnedplest ABR way algorithms. to The obtain sim- representative training data is to learn fashion) chose to continue streamingaverage, for than 10–20% users longer, assigned on to the other ABR algorithms and the variability ofments video were quality significant both (Fig.1). statistically Thecally: and, improve- users perhaps, who practi- were randomly assigned to Fugu (in blinded Fugu outperformed existing techniques—including the simple algorithm—in stall ratio (with one exception), video quality, congestion-control statistics among its inputthese signals. techniques Each has of been explored before, butthem Fugu in combines a new way. Ablationof studies these (Section techniques4.2 ) to find be each necessary to Fugu’s performance. it predicts size (vs. estimating throughput), ittribution outputs a (vs. probability a dis- point estimate), and it considers low-level network trained using supervised learning on data recorded ment, Puffer. The predictor has some uncommon features: algorithm that combines several techniques.on MPC Fugu (model-predictive is control) based [43], apolicy, but classical replaces control its throughput predictor with a deep neural detection with 95% confidence.the These design uncertainties space affect ofpractically machine-learning be approaches deployed in that this can setting [11, 24]. It is possible to robustlycombining outperform classical control existing with schemes an by ML predictorin trained situ lenging to detect reala effects of this magnitude.per Even scheme, with a 20% improvementbe in statistically rebuffering ratio indistinguishable, would i.e., below the threshold of hours or days. 
However, we foundity that and the heavy empirical tails variabil- ofcreate throughput statistical evolution and margins rebuffering of uncertainty that make it chal- has claimed benefits of 10–15%25% [43], [23], 3.2–14% based [38], on or traces 12– or real-world experiments lasting enough of the variability or heavy tails weStatistical see margins in of error practice. informance quantifying are algorithm considerable. per- found that more-sophisticated algorithms dobeat not simpler, necessarily older algorithms. The newer algorithms were de- are blinded to the assignment. We find: In our real-world setting, sophisticatedon algorithms based control theorydid [43] not outperform or simple buffer-based reinforcement control [17]. learning [23] veloped using collections of traces that may not have captured “in situ,” meaning from Fugu’s actual deployment environ- Under anonymous submission (Sept. 19, 2019) 200 about 150 100 Epoch decisions ; it predicts the 50 per se 0 Typical Puffer session with

2.8 2.4 2.0 1.6 (to predict chunk transmis- Throughput (Mbps) Throughput similar mean throughput (b) —it outputs not a single pre- 200 160 that respond to a series of decisions and probabilistic 120 Epoch 80 supervised learning Puffer has not observed CS2P’s discrete throughput environments 40 0

CS2P example session (Fig- 3

Pensieve [23] is an ABR scheme also based on a deep neu- Fugu fits in this same category of algorithms. It also uses 2.4 2.8 2.6 Throughput (Mbps) Throughput to train a scheme thattraining makes decisions; these schemes need judge their consequences. This islearning.” known Generally as speaking, “reinforcement reinforcement learning tech- niques need to be ableperformance to by observe slightly a varying detectable a difference controllarge in action; amounts this of requires training, and systems that are challenging to tor trained on real data. This component, which we callof the features, none of which can claim novelty on itstransmission own. time The of a proposed chunk with a given filesize. effect [5,29,44], although to our knowledge Fuguuse is this the fact first operationally to aspredictor part of is a also control policy. Fugu’s dicted transmission time, butpossible a outcomes. probability The use distribution of on certainty probabilistic in or model stochastic predictive un- controlbut has to a our long knowledge history Fugu [33], in is this the context. first Finally, to Fugu’s use predictor stochastic is MPC a neural network, transmission time, including raw congestion-controlfrom statistics the sender-side TCP implementationthat [15, 40]. several We of found these signals (RTT,benefit delivery_rate, ABR FlightSize) decisions (§5). ral network. Unlike Fugu, Pensieve usesnot the simply neural to network make predictions but to make to train the algorithm. Whiletrained CS2P with and Fugu’s TTP cansion be times recorded from past data), it takes more than “data” ure 4a from [38]) Figure 2: states. (Epochs are 6 seconds in both plots.) similarity and models their evolving throughput as a Marko- a similar model to detectstate. when In the our network dataset, path we hasobservation have changed of not discrete observed throughput CS2P states and (Figure2). Oboe’s MPC as the control strategy, informed by a network predic- vian process with a small number of discrete states; Oboe uses (a) Transmission Time Predictor (TTP), incorporates a number TTP is not a “throughput” predictor That observed throughput varies with filesize is a well-known which lets it consider an array of diverse signals that relate to which chunks to send. This affects the type of learning used 3 model adaptive Algorithms that select that informs a predictive Researchers have produced a lit- (ABR) schemes. These schemes fetch chunks, accu- throughput predictor , typically 2–6 seconds each, and encodes each version CS2P [38] and Oboe-tuned RobustMPC [2] are related to Model Predictive Control (MPC), FastMPC, and Robust- the same control strategy (MPC). These throughput predictors throughput over time within a session; CS2P clusters users by the harmonic mean of the last five throughput samples. MPC; they constitute better throughput predictors that inform of chunks over a limitedto horizon maximize (e.g., the the expected next QoE. 5–8RobustMPC We chunks) implemented for MPC Puffer, and using the same predictor as the paper: of what will happen tonear the future, buffer occupancy depending and on QoE whichquality in chunks the and it fetches, sizes. with MPC what uses the model to plan a sequence MPC [43] fall into theules: last a category. They comprise two mod- the duration of the playbacktheoretic buffer [17,35,36], schemes and that control- try toa maximize receding expected horizon, QoE given over theprediction of upcoming the chunk future sizes throughput. 
and a erature of ABR schemes, includingthat “rate-based” approaches focus onthroughput matching [18,21,25], “buffer-based” the algorithms that video steer bitrate to the network against human behavior or opinion [4, 10 , 19]. maximize the quality of chunks played back, and minimize quality). The importance tradeoff forin these a factors QoE is metric; captured several studies have calibrated QoE metrics framed as choosing the optimalor sequence replace of chunks [35], to given fetch recentthe experience future, and to guesses minimize about startup time and presence of stalls, size of each chunk. Ifstall the while buffer the underflows, client playback “rebuffers”: must fetchingresuming more playback. chunks The before goal of an ABR algorithm is typically the same time. The playhead advancesa and steady drains rate, the 1 buffer s/s, at tated but by chunks arrive the at varying irregular network intervals throughput dic- and the compressed given uncertain future throughput,bitrate are known as mulating them in a playback buffer, while playing the video at alternative stream, the chunks vary in compressed sizeChoosing [44]. which chunks to fetch. chunks of each chunk independently, soaccess it to can any be other decoded chunks.to without This switch gives between clients different the versions opportunity at each chunk boundary. Client sessions choose from thisencodes limited the menu. The different service versionsto in switch a midstream way as that necessary: allows it clients divides the video into variation in quality over time (especially abrupt changes in The different alternatives are generally referred to as different were trained on real datasets that recorded the evolution of Adaptive bitrate selection. which alternative version of each chunk to fetch and play, “bitrates,” although streaming services today“variable generally bitrate” use (VBR) encoding [29], where within each Under anonymous submission (Sept. 19, 2019) tag BBA to produce 4

16.5 Average SSIM (dB) SSIM Average Puffer encodes each video chunk in ten different H.264 Puffer then uses Encoding six channels in ten versions each (60 streams and the JavaScript MediaSource extensions) onand the places client side, the ABR schemeserver at with the 10 server. We Gbps have Ethernet a in 48-core a well-connected datacenter. a daemon on the server.different Each TCP daemon congestion is control configured (for with the a primary analysis, seconds long. Video is de-interlaced with a “canonical” 1080p60 or 720p60 source for compression. range from 240p60 video with26 constant (about rate 200 factor kbps) (CRF) to of 5,500 1080p60 kbps). video Audio with chunks CRF are of encoded 20 in (about the Opus format. SSIM [41], a measure ofcal video source. quality, This relative to information the isof canoni- used BBA, by MPC, the RobustMPC, objective function andtion. Fugu, In and practice, for the relationship our between evalua- bitrate and quality compressed chunk sizes directly—only whatscreen. is Our shown on results the indicate thattrate schemes directly that do maximize not reap bi- quality a (Figure4). commensurate benefit in picture total) with 64 2.7 GHz CPUeach in encoded steady chunk state. consumes Calculating an additional the 18 SSIM cores. of 3.2 Serving chunks to the browser Puffer uses a “dumb” player (using the HTML5 Figure 4: quality video per byte sent,directly vs. (Pensieve) or those the that SSIM maximize of bitrate each chunk (BBA). versions, using varies chunk-by-chunk (Figure3), and users cannot perceive The browser opens a WebSocket (TLS/TCP) connection to To make it feasible to deploy and test arbitrary ABR schemes, (MPC-HM, RobustMPC-HM, and Fugu) delivered higher we used BBR [7]) and ABR scheme. Some schemes are more 4 3 200 kbps 5500 kbps 2 Chunk number 1 Picture quality also varies

8 6

18 14 12 10 16 Average SSIM (dB) SSIM Average (b) with VBR encoding]. [29 3 2 Chunk number Variations in picture quality and chunk size within 5500 kbps 200 kbps 1 VBR encoding lets chunk

2 0 4 6 Our reasoning for streaming live television was to collect Size (MB) Size in the event ofstream. lost Video transport-stream chunks packets are 2.002 on seconds either long, sub- reflecting the 2 transport streams ina UDP. stream We wrote to chunks softwaretaining of to synchronization raw decode (by decoded inserting video black and fields audio, or main- silence) 3.1 Back-end: decoding, encoding,Puffer SSIM receives six televisionantenna channels and using an a ATSC demodulator, VHF/UHF which outputs MPEG- describe details of the system, experiment, and analysis. available for free on the Internet.from Our a study law that benefits, allows in nonprofitover-the-air part, television organizations signals to without retransmit charge [1]. Here, we robust conclusions about the performance of algorithms for evergreen source of popular content that had not been broadly constitute human subjects research. data from enough participants and network paths to draw participants include any member ofparticipate. the Users public are who blinded wishes to to algorithm assignment, and Institutional Review Board determined that Puffer does not commercial television channels. Puffer operatesized as controlled a trial; random- sessions areof randomly a assigned to set one of ABR or congestion-control schemes. The study sure the behavior of ABRpublicly accessible schemes, website we that built live-streams Puffer, six a over-the-air free, 3 Puffer: an ongoing live study of ABR difficulties, [11 24]. The authors ofsimilar Pensieve scheme recently on tested video a traffic at Facebook], [22 observing a simulate faithfully or that have too much variability present each stream suggest a benefitSSIM from and choosing size, chunks rather based than on average bitrate (legend). size vary within a stream [44]. Figure 3: (a) To understand the challenges of video streaming and mea- ABR control and network prediction. Live television is an we record client telemetry on video quality and playback. Our 1/1001 factor for NTSC frame rates. Audio chunks are 4.8 1.6% increase in bitrate in a large real-world test. Under anonymous submission (Sept. 19, 2019) in situ We record through- ) over baseline load. × 20 > n/a n/a n/a limit SSIM SSIM SSIMSSIM supervised learning in emulation supervised learning bitrate reinforcement learning in simulation < ∆ ∆ ∆ ∆ ∆ – – – – – bitrate stalls, stalls, stalls, stalls, stalls, – – – – – s.t. , We observe considerable heavy-tailed behavior in most Starting from the beginning of 2019, we have streamed SSIM SSIM, SSIM, bitrate SSIM, SSIM, SSIM, and the chunk-by-chunk variationbetween in “total SSIM. time The stalled” ratio andas “total watch the time” rebuffering is ratio known orsummarize the stall performance ratio, of and streaming is video widely systems used [20]. to of these statistics. Watch times arebuffering, skewed while (Fig. important10 ), to and QoE, re- is a rare phenomenon. Of for keywords such as “live tv”people and on “tv Amazon streaming,” and Mechanical paid Puffer. Turk We to were stream featured video in press from popular articles live events as (including a the way Super toand Bowl, watch other the World sporting Cup, events, “Bachelor incurrent Paradise,” average etc.). load Our is aboutevents 50 bring stream-days large per spikes day. ( Popular using 61,682 unique IP addresses. 
Aboutperiod seven was months spent of that on aother randomized algorithms trial comparing (MPC, Fugu RobustMPC, with Pensieve, and BBA); 337,170 streaming sessions, and aual total streams. of A 1,595,356 full individ- experimental-flow diagramized in CONSORT the format standard- [32] is in the appendix (Figure A1). Metrics and statistical uncertainty. put traces and clientappendix) telemetry and (a calculate full a description setstream: of the is total figures in time to between the the summarize firstof each and the last stream, recorded events the startupthe time, the first total and watch last timethe successfully between total played time portion the of video the is stalled stream, for rebuffering, the average the individual filesizes of each chunk;bitrate it of considers each the Puffer average stream.length We to adjusted 2.002 the seconds video andonds chunk to the reflect buffer our threshold parameters. toauthors’ For 15 provided training sec- script data, to we generate usedtraining 1000 the videos, simulated and videos a as combinationtraces of linked the to FCC in and the Norway Pensieve codebase as training traces. 3.4 The Puffer experiment To recruit participants, we purchased Google and Reddit ads we refer to this as the primary experiment. This period saw 14.2 years of video to 55,897 registered study participants + + + + + + 5 to 43200, n/a n/a We use the re- video_num_chunks classical (MPC) learned (DNN) Distinguishing features of algorithms used in the experiments. HM = harmonic mean of last five throughput samples. We contacted the Pensieve authors to request advice on Deploying Pensieve for live streaming. Puffer is not a client-side DASH [26] (Dynamic Adaptive Emulation-trained Fugu MPC-HM classical (MPC) classical (HM) AlgorithmBBA Control classical (prop. control) Predictor Optimization goal How trained RobustMPC-HMPensieve classical (robustFugu MPC) classical (HM) learned (DNN) classical (MPC) learned (DNN) indicating 24 hours of video,the so video that to Pensieve does end not beforenot expect a able user’s session to completes. modify We Pensieve were to optimize SSIM or to consider schemes. We tested these manually overthen a selected few the real model networks, withified the best the performance. Pensieve We mod- code to set training and that we tune thethe entropy multi-video parameter neural when network. training We wroteto an train automated 6 tool different models with various entropy reduction deploying the algorithm in ating. live, The multi-video, authors real-world set- recommended that we use a longer-running leased Pensieve code (written indirectly. When Python a with client TensorFlow) is assigneda to Python Pensieve, subprocess Puffer running spawns Pensieve’s multi-video model. formula in the original paperconsistent [17] with to a choose 15-second reservoir maximum values buffer. bustMPC, and Fugu in back-endchunks daemons over that the WebSocket. serve We video usefunctions SSIM for each in of the these objective schemes. For BBA, we used the 3.3 Hosting arbitrary ABR schemes browsers, including on Androidin phones, the but Safari does browser not orMediaSource on play extensions iOS or (which Opus lack audio). support for the side ABR (which can beexpect served its by conclusions CDN to edge translate nodes), to but ABR we schemes broadly. an ABR system streaming chunkedtion, video and over runs a the TCP same connec- can ABR run. 
algorithms We that don’t DASH expect systems this architecture to replace client- can switch channels without breakingand their may TCP have connection many “streams” within each session. Streaming over HTTP) system. Like DASH, though, Puffer is from serving client traffic (includingis TLS, about TCP, and 5% ABR) ofSessions are an randomly Intel assigned x86-64 to serving 2.7 daemons. GHz Users core per stream. efficiently implemented than others; on average the CPU load Figure 5: MPC = model-predictive control. DNN = deep neural network. The Puffer website works in the Chrome, , and Edge We implemented buffer-based control (BBA), MPC, Ro- Under anonymous submission (Sept. 19, 2019) . s H (1) | ) 1 , − i } K 0 be the uncer- ( , i ) of the future Q B K s ( − − T ) ) s , i (following [43]) as s i s K i K K ( ( K be the current playback Q T | i { λ B − , each with a different size s ) max i s i · K , and K µ ( K − Q , Fugu has a selection of versions of i K describes the stall time experienced by ) = 1 } − 0 i , i K are configuration constants for how much to , B s i µ be the quality of a chunk K − ( ) . Fugu plans a trajectory of sizes ) s s i i K and K ( K ( λ QoE Q T { Let Fugu uses a trained neural-network transmission-time pre- The controller, described in Section 4.4, makes decisions dictor to approximate the oracle.horizon, For we each train chunk a in separate thethe predictor. fixed total E.g., QoE if of optimizing the for nexttrained. five (Multiple chunks, networks five in neural parallel networksalent are are to functionally one equiv- that takeshave the observed future equivalent time performance step with as bothbut a approaches, training variable. multiple We DNNs lets us parallelize training.) been shown to correlate with human opinions of QoEin [10]. our version of MPC and RobustMPC as well. tain transmission time of buffer size. Fugu defines the QoE of max sending Once Fugu decides whichthe chunk QoE become to known: send, the two video quality portions and of video quality server knows the current playback buffer size,to so know what it is needs the transmissionthe client time: to how receive long the will chunk?the it Given transmission an time take oracle of for that any reports compute chunk, the the optimal MPC plan controller to can maximize QoE. server, making it easy to updateformance data its across model clients and over aggregate time.telemetry, per- Clients such send as necessary buffer levels, to the server. by following a classicalobjective QoE control function algorithm (§4.1) based to onlong optimize each predictions chunk for an would how take toprovided transmit. by These the predictions Transmission are Time Predictora (TTP) neural4.2), (§ network that estimatesthe a transmission probability time distribution of for a proposed chunk with given filesize. 4.1 Objective function For each video chunk this chunk to choose from, chunk as a linear combination of video quality, video quality setting as a proxy fortual image measure quality, Fugu of optimizes picture a quality—in percep- our case, SSIM. This has chunks to maximize their expected total QoE. 4.2 Transmission Time Predictor (TTP) variation. The remaining uncertainty is the stall time. The variation, and stall time [43]. Unlike some prior approaches, where weight video quality variation and rebuffering. 
The last term As with prior approaches, Fugu quantifies the QoE of each which use the average compressed bitrate of each encoding We emphasize that we use the exact same objective function 6 bitrate selection er ff Pu Video Server MPC Controller state update

model-based control daily training daily Overview of Fugu of the mean value. This is comparable to stalls, mirroring commercial services [20]. Figure 6: 17% any ± and Predictor model update Data Aggregation Figure6 shows Fugu’s high-level design. Fugu runs on the We conclude that considerations of uncertainty in real- These practices result in substantial confidence intervals: These skewed distributions create more room for the play Transmission Time 10% predictor that can be trained with supervised learning. be feasibly trained in place (inronment. situ) It on consists a of real a deployment classicalcontrol, envi- controller the (model same predictive as in MPC-HM), informed by a nonlinear 4 Fugu: design and implementation Fugu is a control algorithm for bitrate selection, designed to emulators [42] or reducehelpful their in variance producing [24] robust learned will networking likely algorithms. be trolled data from the Internetstudy. with real Strategies users, to deserve import further real-world data into repeatable collected data on 30 millionin streams, rebuffering did ratio not outside detect the a level change of statistical noise. ± the magnitude of the total benefit reported by some academic recent study of a Pensieve-like scheme on Facebook, which 95% confidence interval on a scheme’s stall ratio is between as a function ofintervals stream on duration. average SSIM We calculate usingstandard error, the confidence weighting formula each for stream weighted by its duration. calculate confidence intervals on rebufferingbootstrap ratio method with [12], simulating the streamsfrom drawn each empirically scheme’s observed distribution of rebuffering ratio considerable variation in average performance untiltial a amount substan- of data isquantify assembled. the In statistical this uncertainty study, that we canplay worked be of to attributed chance to in the assigning sessions to ABR algorithms. We of chance to corrupt thescheme’s bottom-line performance—even statistics two summarizing identical a schemes will see the 458,801 eligible streams considered for the primary anal- streams had ysis across all five ABR schemes, only 15,788 (3%) of those world learning and experimentation, especially given uncon- work that used much shorter real-world experiments. Even a with 1.75 years of data for each scheme, the width of the Under anonymous submission (Sept. 19, 2019) probability The TTP’s , with 0.5 sec- to balance the ) ∞ , 100 75 . = into bins and uses for- 9 [ i µ chunks, and outputs a B 8 ,..., and The TTP explicitly consid- ) = 1 The TTP outputs a t 25 . = 1 λ , is estimated. The controller com- i and outputs a predicted duration, a 75 T . i 0 K [ future steps (about 10 seconds) by solv- , 5 ) structure to the TTP, and find that several = 75 . 0 H , 25 . . ) 0 1 [ tcp_info − , i ) K , 25 i . B 0 ( , ∗ i 0 Use of low-level congestion-control statistics. control that transmission timefilesize does [5]. not We scale compared linearly thean with accuracy equivalent throughput of predictor the (keeping TTP everythingunchanged) else against and found the TTP’s predictionsaccurate were (Figure7). much more Prediction with uncertainty. 4.5 Implementation on average on aoptimizes recent over x86-64 core). Theing MPC value iteration. controller We set conflicting goals in QoE. Eachon retraining a takes 48-core about server. 6 hours 4.6 Ablation study of TTP features nature as a DNNincluding lets low-level congestion-control it statistics. 
consider We a feedkernel’s the variety of noisy inputs, of these fields contributeespecially positively the to RTT, CWND, the and TTP’s number accuracy, of packets in flight cally access this structure directoryon because the the sender, statistics these live results shouldtion motivate of the richer communica- data to ABR algorithms wherever they live. ers the chunk size of more powerful approach than a generatingestimate. a single It throughput is well known in ABR streaming and congestion system dynamics once putes the optimal trajectoryeration by with solving dynamic the programming above (DP).computational feasible, To value make it it- discretizes the DP v probability distribution over 21[ bins of transmission time: onds as the bin sizeis except a for fully-connected the neural first network, and with the two last hidden bins. layers TTP ous numbers of hidden layerstraining and losses neurons, across and a found range similar plemented of TTP conditions and for the each. training Wetrained im- in model PyTorch, but in we C++ load when the for running performance. on A the forward production pass server C++ of imposes TTP’s neural minimal network overhead per in chunk (less than 0.3 ms TTP’s features (Fig.7). Here are the more notable results: TTP takes as input the past (Figure7). Although client-side ABR systems cannot typi- Transmission-time prediction. ward recursion with memoization to only compute for relevant with 64 neurons each. We tested different TTPs with vari- We performed an ablation study to assess the impact of the 7 , s i , de- K o ) 1 )) , s i − i 1 K K − , i , 1 i T + B i ( B ∗ i ( v structure), ,..., 1 t + can be derived by − ∗ i i , v 1 T · 1 ] + i . i − s i i T ) + B K 1 K − tcp_info i ) = s K i chunks ). , ,..., , and t K s i t i ( T by saving client telemetry from − K ˆ i T ( [ K D Pr i QoE T ∑ ( is the probability for TTP to output a by value iteration [6]. After sending n ] i 1 chunks: T s i − t K H max + ) = s i s i K ) = tcpi_delivery_rate K 1 ( ˆ − T i [ -step lookahead horizon given the last sent chunk ,..., K 1 , over the transmission time of Pr H i with standard supervised learning: the training mini- + s ) i B . We have value iteration as follows: s i ( K 1 D ∗ i K , Given the current playback buffer level, let We retrain the TTP every day, using training data collected Prior approaches have used Harmonic Mean (HM) [43] Each TTP network takes as input a vector of: − 4. size of the chunk to be transmitted. 3. internal TCP statistics ( 2. transmission times of past 1. sizes of past s ( v i i ˆ discretized transmission time in the K into TTP, and replans again for the next chunk. note the maximum expected sum of QoE that can be achieved for predictions ofK transmission time andthe controller outputs observes and a updates the plan input vector passed 4.4 Model-based controller Our MPC controller (used forand MPC-HM, Fugu) RobustMPC-HM, is a stochasticthe optimal expected controller cumulative that QoE maximizes in Equation1. It queries TTP 4.3 Training the TTP loaded to warm-start the retraining. shuffle the sampled data to removeof correlation in inputs. the The sequence weights from the previous day’s model are on Puffer over the prior 14shift days, and to catastrophic avoid the forgetting effects [30,31]. of Within dataset the 14-day mizes the cross-entropy loss betweendistribution the and output the probability discretized actual transmissionstochastic time gradient using descent. 
Puffer collects training data real usage, aggregating pairs ofthe (a) true the transmission input 4-vector time and, foron (b) the chunk. We train the TTP throughput for the entire lookaheadtransmission-time horizon. predictor In outputs contrast, a the probabilityT distribution throughput ( or a Hidden Markov Model (HMM) [38] to predict a single size, the number ofsmoothed unacknowledged RTT estimate, packets the in minimum RTT, and flight, the the estimated The TCP statistics include the current congestion window where window, we weight more recent days more heavily, and we Under anonymous submission (Sept. 19, 2019) features of the all (Figure 11 shows catastrophic the “depth” of the neural network, in situ or the prediction of transmission time as a function of chunk We conclude that robustly beating “simple” algorithms We found that old-fashioned “buffer-based” control per- The only scheme to consistently outperform BBA in both allowed Fugu to begin at a higher quality (Figure9). standing promising results in containedsimulators environments and such emulators. as The gainshave that in learned optimization algorithms or smarter decisiona making tradeoff may in come brittleness at or sensitivity to heavy-tailed behavior. algorithms, counting all streams that playedof at video. least A 4 standardized seconds diagramavailable of in the the experimental appendix flow (Figure is A1). forms surprisingly well, despite itsperformed status research as baseline. a In frequently the out- left-hand overall dataset side), (Figure8, BBA outperformsis MPC-HM statistically on indistinguishable stalls on and videoforms Pensieve quality. on It quality outper- and isRobustMPC-HM indistinguishable has on an stalls. even lowerconsiderable stall rate, loss at of the quality. cost Among of “slow” network paths ure8, right-hand side), Pensieve showsducing an the advantage stall in rate, re- again at a considerable cost in quality. stalls and quality was Fugu, but only when of Fugu’s predictions, or size (and not simply throughput), Fugu forfeits its advantage SSIM variability (Figure1). The TTP’s usestatistics of was low-level helpful TCP on a cold start to a new session; this of-date” versions of the TTP on the Puffer website between between 106 and 131 stream-daysschemes. of Somewhat data to our for surprise, each we were ofa not these significant able difference to in detect performance between any of these one that was retrained eacha day benefit in from August. learning We certainly see behavior from a TTP learned inpractice a of network daily emulation), retraining but in our situ appears to be overkill. 5 Experimental Results study, including the evaluationare of shown Fugu. in Figure8. Our In maingroup, summary, blinded-assignment, we results randomized conducted controlled a trial parallel- of five The data include 8.5 stream-years of data split across five TTP were used. If we remove the probabilistic “fuzzy” nature TTP trained in February, March, April, and May, compared (those with average throughput less than 6 Mbit/s; see Fig- (§§4.6). Fugu also outperformed other schemes in terms of with machine learning may be surprisingly difficult, notwith- ABR schemes between Jan.Aug. 30 19 and and Sept. 12, Aug. 2019 7, (the cutoff and for between this submission). we conducted a randomized controlled trial of several “out- Aug. 7 and Aug. 30, 2019. Wewith compared the versions “live” TTP of that the is retrained each day. 
We collected ABR schemes—even comparing the “February” TTP against We now present findings from our experiments with the Puffer 8 Although it is often said that “data > To validate our practice of daily retraining, worse, without significant improvement in × of transmission times. This additional informa- worse (data not shown). Ablation study of Fugu’s Transmission Time Pre- × We also deployed this scheme on the Puffer website and col- In addition, to confirm the relationship between prediction Daily retraining. lected 107 stream-days of dataits in end-to-end September ABR 2019 performance. to Again, measure accuracy the was lower harmful prediction to the bottom line; its rebuffering ratio neural network), trained the same,prediction performs accuracy much (Figure7). worse on Use of neural network. algorithms” in machine learning [16], webenefit did find to a the significant usetion. of A a linear-regression deep model neural (equivalent to network a in single-layer this applica- SSIM (data not shown). ployed a point-estimate version of2019 Fugu and on collected Puffer 39 in stream-days August It of performed data much with worse this than scheme. ratio normal Fugu: was the 3–9 rebuffering in prediction accuracy with the former (Figure7). accuracy and performance of the entire ABR system, we de- point estimate without uncertainty. We evaluated theaccuracy expected of a probabilistic TTPlikelihood” vs. version, an and equivalent found “maximum a considerable improvement distribution tion allows for better decision making, compared with a single and one that predicts throughput without regard to chunk size dictor (TTP). Removing each offeatures the reduced TTP’s inputs, its outputs, ability or toa predict video the chunk. transmission A time non-probabilistic of TTP (“Point Estimate”) Figure 7: TCP-layer statistics (RTT, CWND) were also helpful. (“Throughput Predictor”) both performed markedly worse. was 2–5 Under anonymous submission (Sept. 19, 2019) 0.5 0.75 Fugu in situ RobustMPC-HM Pensieve 1 BBA

Time spent stalled (%) Better QoE Better MPC-HM 1.25 Slow network paths (N=100,500, duration 1.3y) 15 14

15.5 14.5 13.5 Average SSIM (dB) SSIM Average Time-on-site is a figure of merit in the video-streaming To investigate this further, we constructed an emulation Chrome clients inside mahimahithe [27] server. Each shells mahimahi to shell imposeddelay connect a on to 40 traffic originating ms inside end-to-end capacity it over and time limited to the match downlink FCC the broadband capacity network recorded traces in [8]. auation, As set in uplink of the speeds Pensieve in eval- all shells were capped at 12 Mbps. details of the underlying schememost (MPC of and their Fugu codebase). even share Thesolely average by difference the was upper driven 5%lasting tail more of than viewership 2.5 duration hours)—viewers (sessions much assigned more to likely Fugu to are keepas streaming the beyond distributions this are point, nearly even identical until then. industry and might be increased by delivering better-quality ing this phenomenon. 5.2 The benefits ofEach learning of the ABR algorithmsemulation in we prior evaluate works was [23,43]. evaluated Notably, the in results in those results we have seen here—foroutperforming example, MPC-HM buffer-based control and Pensieve. environment similar to that usedning in the [23]. Puffer This media involved server run- locally, and launching headless video, but we simply do not know enough about what is driv- we believe the experiment was carefully executed not to “leak” works are qualitatively different from some of the real world Within this test setup, we automated 12 clients to repeatedly 9 0.1 0.48 BBA Fugu RobustMPC-HM 0.5 RobustMPC-HM 0.2 Fugu Pensieve BBA 0.52 MPC-HM Startup delayStartup (s) In a blinded randomized controlled trial that included 8.5 years of video streamed to 44,907 client IP Time spent stalled (%) Pensieve

0.54 Better QoE Better Better QoE Better less than 6 Mbit/s; following prior work [23, 43 ], these paths are more likely to require nontrivial bitrate-adaptation MPC-HM On a cold start, Fugu’s ability to bootstrap ABR 0.3 0.56 Primary experiment (N=458,801, duration 8.5y) 10

9.9 16.5

10.1 10.2 10.3 16.25 16.75 rst-chunk SSIM (dB) SSIM rst-chunk fi Average Average SSIM (dB) SSIM Average player about 10–20% longer, onto average, than other those schemes. assigned Users were blinded to the assignment, and of users across algorithms (Figure 10 ). Users whose sessions 5.1 Fugu users streamed for longer Figure 9: decisions from congestion-control statistics (e.g., RTT) boosts initial quality. logic. Such streams accounted for 16% of overall viewing time and 82% of stalls. Error bars show 95% confidence intervals. addresses over a seven-month period, Fuguincreased SSIM, reduced and the reduced fraction SSIM of variationdelivery_rate time within spent each stalled stream (except (tabular with data in respect Figure1). to “Slow” RobustMPC-HM), network paths have mean TCP Figure 8: Main results. were assigned to Fugu chose to remain on the Puffer video We observed significant differences in the session durations Under anonymous submission (Sept. 19, 2019) environ- systematic conditions, eliminating same to respond to a sequence of control decisions and decide Finally, Pensieve optimizes a QoE metric centered around Second, most of the evaluation of Pensieve in the original First, we have found that emulation-based training and Reinforcement learning (RL) schemes such as Pensieve default ABR scheme, and no significant benefit in rebuffering. bitrate as a proxy forrequired video significant quality. Altering surgery to this would providesieve new have neural values network to over the time. Pen- Figure4 shows that Pensieve mary analysis. We emphasize that our findings do not indicate may be at a particularUnlike disadvantage supervised from learning this schemes phenomenon. thating can “data,” learn reinforcement from learning train- requiresment a training on the appropriate consequences andment reward. could That be environ- real lifestatistical instead noise of we a observe simulator, would butextremely make the slow this level or type of require of an learning algorithms extremely in broad training. deployment Reinforcement of learning reliesable on to being slightly vary athe control resulting action reward. and By detect a ourinputs change calculations, is in the such variability that of reliably it distinguish takes two about ABR 2 schemesperformance stream-years whose differs of innate by data “true” 15%. to To make RL practical, future ity [24] or construct morethat faithful model simulators tail and emulators behaviors [42]. paper focused on trainingsingle and test evaluating video. Pensieve As using ato a result, the explore state was space that inherentlyPensieve model more “multi-video had model”—which limited. we Evaluation have to ofexperimental use setting—was the more for limited. our Our resultsconsistent are with more a recent large-scale study of a Pensieve-multi- this study found a small benefit in bitrate over Facebook’s 5.3 Remarks on Pensieve and RL for ABR outperformed MPC-HM, RobustMPC-HM, and BBAsimulation-based in tests both and in videoand streaming high tests on speed a real-world low networks.believe Our the mismatch results may differ; have we occurred for several reasons. testing (or, at least,do mahimahi not tests capture with the thethe vagaries Puffer FCC of study. the dataset) Unlike real-world real-world randomized pathsbased trials, emulators seen trace- and in simulators allowtwo experimenters different to algorithms run on the the effect of the playa of chance different in distribution giving of differentetc. 
algorithms watch times, However, it network is behaviors, uncertainty difficult that comes to from characterize selectingomit a the the set variability or of heavy-tailed traces nature that ofexperience a may (both real deployment network behaviors as wellsuch as as user watch duration). behaviors, video-like scheme on 30 million streams at Facebook [22]; The original Pensieve paper [23] demonstrated that Pensieve was the #2 scheme in terms of bitrate (below BBA) in the pri- work may need to explore techniques to reduce this variabil- 10 1000 in situ Fugu—or in situ ± 1.0) ± 0.9) ± 0.9) ± 0.9) ± 1.1 min [95% CI]) 29.6 27.9 28.5 27.4 32.6 100 (mean (mean (mean (mean (mean BBA Fugu Pensieve MPC-HM MPC-HM RobustMPC-HM Total time on video player (minutes) Users randomly assigned to Fugu chose to remain 10 1

0.1

0.01 CCDF 0.001 We trained a version of Fugu in this emulation environ- 0.0001 in both quality and stallopportunities time. in These constructing results network suggest emulators research additional that dynamics capture of the real Internet. other algorithms lying somewhereexperiment in (middle between. of figure), In weture, the see with real a Fugu more apparently muddled outperforming pic- other algorithms side of figure), almost everyalong algorithm the tested SSIM/stall lies frontier, somewhere with Pensieveleast rebuffering and the MPC delivering the highest quality video, and the panel). Looking at theing other to ABR comparing schemes, Figure itresults is 11 differ illuminat- with markedly from Figure8—the the emulation real world. In emulation (left ment to compare its performance with a version trained of emulation-trained Fugu was horrible (Figure 11, middle Each client was assigned toand a CC different combination algorithms, of and ABR playededly the over 10 more minute than video 15 repeat- this hours of experiment are FCC depicted traces. in The results Figure from 11. connect to the media server,clip which recorded would on play NBC a over 10 each minute network trace in the dataset. the upper 5% tail (sessions lastingon-site more is than a 2.5 hours). figure Time- of merit in the industry and may correlate on the Puffer video player about 10%–20%than longer, those on assigned average, to otherthe schemes. assignment. Users This were average blinded difference to was driven solely by Figure 10: (on data from Puffer). Compared with the with every other ABR scheme—the real-world performance with QoE, but we do not fully understand this effect. Under anonymous submission (Sept. 19, 2019) in needs to situs 15% (95% CI). concurrent streams at any given time. We ± Fugu 0.1 million BBA works, but we don’t know how specific the 0.2 RobustMPC-HM Although we believe that past research papers may have By contrast, we understand YouTube has an average load Some of Fugu’s performance (and that of MPC, Ro- Pensieve bustMPC, and BBA) relative to Pensievefact may that be these four due schemes to received the moreran—namely, the information SSIM as of they each possible versionchunk—than of did each Pensieve. future It is possiblePensieve that an might “SSIM-aware” perform better. We triedsort to of approximate this scheme (anemulation) SSIM-aware with neural the network “Emulation-trained trained Fugu” in benchmark training a new Fugu based on data fromsitu that server would be. underestimated the uncertainties in real-world measurements of underestimating our own uncertaintiesemphasizing or—also uncertainties possible— that are onlymedium-sized relevant academic to studies, small such as or ours, andthe irrelevant industry. to The current load onstreams Puffer is on about 50 average, concurrent meaningdays we of data collect per about day. Ourstream-years 50 primary of analysis data stream- per covers scheme about collected 1.7 period, over and a was seven-month sufficient toto measure within its about performance metrics over 50 imagine the considerations of conductingments data-driven at experi- this level mayabout be statistical completely uncertainty, different—perhaps and less more abouttainties systematic and uncer- the difficulties ofmulating running so experiments and much accu- data. (We alsoonly able understand to that measure YouTube its is ownbecause performance once of every the 48 vastness hours of data that needs to be aggregated.) 
paths between a user onCDN an edge access network node. and We their don’tFugu nearest know model for would sure work if in the a pre-trained different location, or whether yield comparable results. Our results show that learning with realistic Internet paths and users, we also may be guilty MPC-HM 0.3 11 comparison of throughput distribution on FCC traces and Puffer, 0.4 Time spent stalled (%) Right:

0.5 Better QoE Better replicate the Emulation-trained Fugu 0.6

15 17 16 Average SSIM (dB) SSIM Average cannot 0 Fugu Pensieve 0.25 BBA RobustMPC-HM 0.5 0.75 performance in emulation, run in mahimahi [27] using the FCC traces8], [ following the method of Pensieve [ 23]. 1

Time spent stalled (%) Better QoE Better 1.25 experimental results from Puffer. During Jan.–April 2019, we randomized sessions to a set of algorithms including MPC-HM 1.5 13 15 14

[Figure 11: Left and Middle panels (x-axis: average SSIM, dB), with a third set of results estimated using the results from the left two figures. The comparison includes "emulation-trained Fugu" and shows results of 45,695 streams, 328 stream-days of data from this time period. Training on these traces did not generalize to the real-world experimental setting.]

These results do not show that Pensieve cannot be a useful ABR algorithm, especially in a scenario where similar, pre-recorded video is played over a familiar set of known networks.

6 Limitations

The design of the Puffer experiment and the Fugu system are subject to important limitations that may affect their performance and generalizability.

6.1 Limitations of the experiments

Our randomized controlled trial represents a rigorous, but necessarily "black box," study of ABR algorithms for video streaming. We don't know the true distribution of network paths and throughput-generating processes; we don't know the participants or why the distribution of watch times differs by assigned algorithm; and we don't know how to emulate these behaviors accurately in a controlled environment.

We have supplemented this black-box work with ablation analyses to relate the real-world performance of Fugu to the accuracy of its predictor, and have studied various ablated versions of Fugu in deployment (Figure 11). However, ultimately part of the reason for this paper is that we do not know how to validate experimental findings outside the real world—a real world whose behavior is noisy and takes lots of time to measure precisely. That may be an unsatisfying conclusion, and we doubt it will be the final word on this topic. Perhaps it will become possible to model enough of the Internet "in silico" to enable the development of robust control strategies without extensive real-world experiments.

It is also unknown to what degree Puffer's results—which are about a single server with 10 Gbps connectivity in a well-provisioned datacenter, sending to clients across our entire country over the wide-area Internet—generalize to a different server at a different institution, much less the more typical settings of commercial services.

6.2 Limitations of Fugu

The load of calculating SSIM for each encoded chunk is not insignificant—about an extra 40% on top of the cost of encoding the video.

Fugu does not consider several issues that other research has concerned itself with—e.g., being able to "replace" already-downloaded chunks in the buffer with higher-quality versions [35], or optimizing the joint QoE of multiple clients who share a congestion bottleneck. Fugu is also not tied as tightly to the TCP or congestion control as it might be—for example, Fugu could wait to send a chunk until the TCP sender tells it that there is a sufficient congestion window for most of the chunk (or the whole chunk) to be sent immediately. Otherwise, it might choose to wait and make a better-informed decision later. Fugu does not schedule the transmission of chunks—it will always send the next chunk as long as the client has room in its playback buffer.

There is a sense that data-driven algorithms that more "heavily" squeeze out performance gains may also put themselves at risk of brittleness when a deployment environment drifts from the one where the algorithm was trained. In that sense, it is hard to say whether Fugu's performance might decay catastrophically some day. We tried and failed to demonstrate a quantitative benefit from daily retraining over "every six months" retraining, but at the same time, we cannot be sure that some surprising detail tomorrow—e.g., a new user from an unfamiliar network—won't send Fugu into a tailspin before it can be retrained. Our eight months of data on a growing userbase suggests, but doesn't guarantee, robustness to a changing environment.

7 Conclusion

In this paper, we asked: what does it take to create a learned ABR algorithm that robustly performs well over the wild Internet? In effect, our best answer is to cheat: train the algorithm in situ, on data from the real deployment environment, and use an algorithm whose structure is sophisticated enough (a neural network) and yet also simple enough (a predictor amenable to supervised learning on data, feeding a classical controller) to benefit from that kind of training.

Machine-learned systems in computer networking sometimes describe themselves as achieving near-"optimal" performance, based on results in a contained or modeled version of the problem [23, 34, 36]. In this paper, we suggest that these efforts can benefit from considering a broader notion of performance and optimality. Good, or even near-optimal, performance in a simulator or emulator does not necessarily predict good performance over the wild Internet, with its variability and heavy-tailed distributions. It remains a challenging problem to gather appropriate training data (or the training environment, in the case of RL systems) to properly learn and validate such systems.

Over the last eight months, we have streamed 14.2 years of video to 56,000 users across the Internet (we do not claim this as an accomplishment per se). Sessions are randomized in blinded fashion among algorithms, and client telemetry is recorded for analysis. The Fugu algorithm robustly outperformed other schemes, both simple and sophisticated, on objective measures (SSIM, stall time, SSIM variability) and increased the duration that users chose to continue streaming.

We have found the Puffer approach an enormously powerful tool for networking research—it is very fulfilling to be able to "measure, then build" [3] to iterate rapidly on new ideas and gain feedback. Accordingly, we are opening Puffer as an "open research" platform. Along with this paper, we are publishing our full archive of traces and results on the Puffer website. The system posts new data each day, along with a summary of results from the ongoing experiments, with confidence intervals similar to those in this paper. (The format is described in the appendix.) We will redact some fields from the public archive (e.g., IP address and user ID), but are willing to work with researchers in the community on access to this data as appropriate. Puffer and Fugu are also open-source software (https://github.com/StanfordSNR/puffer).

We plan to operate Puffer for several years and invite researchers to train and validate new algorithms for ABR control, network and throughput prediction, and congestion control on its traffic. We believe that Puffer could serve as a helpful "medium-scale" stepping-stone for new algorithms, partway between the flexibility of network emulation and the vastness of data—but also conservatism about deploying new algorithms—of commercial services. We are eager to collaborate with and learn from the community's ideas on how to design and deploy robust learned systems for the Internet.

References

[1] Locast: Non-profit retransmission of broadcast television, June 2018. https://www.locast.org/app/uploads/2018/11/Locast-White-Paper.pdf.

[2] Zahaib Akhtar, Yun Seong Nam, Ramesh Govindan, Sanjay Rao, Jessica Chen, Ethan Katz-Bassett, Bruno Ribeiro, Jibin Zhan, and Hui Zhang. Oboe: Auto-tuning video ABR algorithms to network conditions. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pages 44–58. ACM, 2018.

[3] Remzi Arpaci-Dusseau. Measure, then build (USENIX ATC 2019 keynote). Renton, WA, July 2019. USENIX Association.

[4] Athula Balachandran, Vyas Sekar, Aditya Akella, Srinivasan Seshan, Ion Stoica, and Hui Zhang. Developing a predictive model of quality of experience for Internet video. SIGCOMM Computer Communication Review, 43(4):339–350, August 2013.

[5] Mihovil Bartulovic, Junchen Jiang, Sivaraman Balakrishnan, Vyas Sekar, and Bruno Sinopoli. Biases in data-driven networking, and what to do about them. In Proceedings of the 16th ACM Workshop on Hot Topics in Networks, HotNets-XVI, pages 192–198, New York, NY, USA, 2017. ACM.

[6] Richard Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, pages 679–684, 1957.

[7] Neal Cardwell, Yuchung Cheng, C. Stephen Gunn, Soheil Hassas Yeganeh, and Van Jacobson. BBR: Congestion-based congestion control. Queue, 14(5):50, 2016.

[8] Paul Crews and Hudson Ayers. CS 244 '18: Recreating and extending Pensieve, 2018. https://reproducingnetworkresearch.wordpress.com/2018/07/16/cs-244-18-recreating-and-extending-pensieve/.

[9] Z. Duanmu, K. Zeng, K. Ma, A. Rehman, and Z. Wang. A quality-of-experience index for streaming video. IEEE Journal of Selected Topics in Signal Processing, 11(1):154–166, February 2017.

[10] Gabriel Dulac-Arnold, Daniel J. Mankowitz, and Todd Hester. Challenges of real-world reinforcement learning. CoRR, abs/1904.12901, 2019.

[11] B. Efron and R. Tibshirani. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1(1):54–75, February 1986.

[12] Federal Communications Commission. Measuring Broadband America. https://www.fcc.gov/general/measuring-broadband-america.

[13] Sally Floyd and Eddie Kohler. Internet research needs better models. SIGCOMM Computer Communication Review, 33(1):29–34, January 2003.

[14] Sally Floyd and Vern Paxson. Difficulties in simulating the Internet. IEEE/ACM Transactions on Networking, 9(4):392–403, August 2001.

[15] Sadjad Fouladi, John Emmons, Emre Orbay, Catherine Wu, Riad S. Wahby, and Keith Winstein. Salsify: Low-latency network video through tighter integration between a video codec and a transport protocol. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 267–282, Renton, WA, 2018. USENIX Association.

[16] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8–12, March 2009.

[17] Te-Yuan Huang, Ramesh Johari, Nick McKeown, Matthew Trunnell, and Mark Watson. A buffer-based approach to rate adaptation: Evidence from a large video streaming service. SIGCOMM Computer Communication Review, 44(4):187–198, 2014.

[18] Junchen Jiang, Vyas Sekar, and Hui Zhang. Improving fairness, efficiency, and stability in HTTP-based adaptive video streaming with FESTIVE. IEEE/ACM Transactions on Networking, 22(1):326–340, 2014.

[19] S. Shunmuga Krishnan and Ramesh K. Sitaraman. Video stream quality impacts viewer behavior: Inferring causality using quasi-experimental designs. In Proceedings of the 2012 Internet Measurement Conference, IMC '12, pages 211–224, New York, NY, USA, 2012. ACM.

[20] Adam Langley, Alistair Riddoch, Alyssa Wilk, Antonio Vicente, Charles Krasic, Dan Zhang, Fan Yang, Fedor Kouranov, Ian Swett, Janardhan Iyengar, et al. The QUIC transport protocol: Design and Internet-scale deployment. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 183–196. ACM, 2017.

[21] Zhi Li, Xiaoqing Zhu, Joshua Gahm, Rong Pan, Hao Hu, Ali C. Begen, and David Oran. Probe and adapt: Rate adaptation for HTTP video streaming at scale. IEEE Journal on Selected Areas in Communications, 32(4):719–733, 2014.

[22] Hongzi Mao, Shannon Chen, Drew Blaisdell, Yuandong Tian, Shaun Singh, Drew Dimmery, Mohammad Alizadeh, and Eytan Bakshy. Real-world video adaptation with reinforcement learning. In ICML 2019 Workshop RL4RealLife, 2019.

[23] Hongzi Mao, Ravi Netravali, and Mohammad Alizadeh. Neural adaptive video streaming with Pensieve. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pages 197–210. ACM, 2017.

[24] Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, and Mohammad Alizadeh. Variance reduction for reinforcement learning in input-driven environments. CoRR, abs/1807.02264, 2018.

[25] Ricky K. P. Mok, Xiapu Luo, Edmond W. W. Chan, and Rocky K. C. Chang. QDASH: A QoE-aware DASH system. In Proceedings of the 3rd Multimedia Systems Conference, MMSys '12, pages 11–22, New York, NY, USA, 2012. ACM.

[26] Ravi Netravali, Anirudh Sivaraman, Somak Das, Ameesh Goyal, Keith Winstein, James Mickens, and Hari Balakrishnan. Mahimahi: Accurate record-and-replay for HTTP. In Proceedings of the 2015 USENIX Annual Technical Conference, USENIX ATC '15, pages 417–429, Berkeley, CA, USA, 2015. USENIX Association.

[27] Dynamic adaptive streaming over HTTP (DASH) — Part 1: Media presentation description and segment formats, April 2012. ISO/IEC 23009-1 (http://standards.iso.org/ittf/PubliclyAvailableStandards).

[28] Vern Paxson and Sally Floyd. Why we don't know how to simulate the Internet. In Proceedings of the 29th Conference on Winter Simulation, WSC '97, pages 1037–1044, Washington, DC, USA, 1997. IEEE Computer Society.

[29] Yanyuan Qin, Shuai Hao, Krishna R. Pattipati, Feng Qian, Subhabrata Sen, Bing Wang, and Chaoqun Yue. ABR streaming of VBR-encoded videos: Characterization, challenges, and solutions. In Proceedings of the 14th International Conference on emerging Networking EXperiments and Technologies, pages 366–378. ACM, 2018.

[30] Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.

[31] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635, 2011.

[32] Kenneth F. Schulz, Douglas G. Altman, and David Moher. CONSORT 2010 statement: Updated guidelines for reporting parallel group randomised trials. BMC Medicine, 8(1):18, 2010.

[33] Alexander T. Schwarm and Michael Nikolaou. Chance-constrained model predictive control. AIChE Journal, 45(8):1743–1752, 1999.

[34] Anirudh Sivaraman, Keith Winstein, Pratiksha Thaker, and Hari Balakrishnan. An experimental study of the learnability of congestion control. In Proceedings of the 2014 ACM Conference on SIGCOMM, SIGCOMM '14, pages 479–490, New York, NY, USA, 2014. ACM.

[35] Kevin Spiteri, Ramesh Sitaraman, and Daniel Sparacio. From theory to practice: Improving bitrate adaptation in the DASH reference player. In Proceedings of the 9th ACM Multimedia Systems Conference, MMSys '18, pages 123–137, New York, NY, USA, 2018. ACM.

[36] Kevin Spiteri, Rahul Urgaonkar, and Ramesh K. Sitaraman. BOLA: Near-optimal bitrate adaptation for online videos. In IEEE INFOCOM 2016—The 35th Annual IEEE International Conference on Computer Communications, pages 1–9. IEEE, 2016.

[37] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. CoRR, abs/1707.02968, 2017.

[38] Yi Sun, Xiaoqi Yin, Junchen Jiang, Vyas Sekar, Fuyuan Lin, Nanshu Wang, Tao Liu, and Bruno Sinopoli. CS2P: Improving video bitrate selection and adaptation with data-driven throughput prediction. In Proceedings of the 2016 ACM SIGCOMM Conference, pages 272–285. ACM, 2016.

[39] Cisco Systems. Cisco Visual Networking Index: Forecast and trends, 2017–2022, November 2018. https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white-paper-c11-741490.pdf.

[40] Guibin Tian and Yong Liu. Towards agile and smooth video adaptation in dynamic HTTP streaming. In Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies, pages 109–120. ACM, 2012.

[41] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

[42] Francis Y. Yan, Jestin Ma, Greg D. Hill, Deepti Raghavan, Riad S. Wahby, Philip Levis, and Keith Winstein. Pantheon: The training ground for Internet congestion-control research. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pages 731–743, Boston, MA, 2018. USENIX Association.

[43] Xiaoqi Yin, Abhishek Jindal, Vyas Sekar, and Bruno Sinopoli. A control-theoretic approach for dynamic adaptive video streaming over HTTP. In ACM SIGCOMM Computer Communication Review, volume 45, pages 325–338. ACM, 2015.

[44] Tong Zhang, Fengyuan Ren, Wenxue Cheng, Xiaohui Luo, Ran Shu, and Xiaolan Liu. Modeling and analyzing the influence of chunk size variation on bitrate adaptation in DASH. In IEEE INFOCOM 2017—IEEE Conference on Computer Communications, pages 1–9. IEEE, 2017.


A Randomized trial flow diagram

Figure A1: CONSORT-style diagram [32] of experimental flow for the primary results (Figures 1 and 8), obtained during the periods Jan. 1–Aug. 7, 2019, and Aug. 30–Sept. 12, 2019. A "session" represents one visit to the Puffer video player and may contain many "streams." Reloading starts a new session, but changing channels only starts a new stream and does not change TCP connections or ABR algorithms. A large number of streams never began playing; these were often users rapidly changing channels or using an incompatible browser.

[Flow diagram: 337,170 sessions underwent randomization (1,595,356 streams, 56,262 unique IPs, 26.8 client-years of data). 97,068 sessions were excluded (158,077 streams, 3.2 client-years of data): 53,631 streams were assigned CUBIC, and 103,446 streams were assigned experimental algorithms for portions of the study duration. The remaining 240,102 sessions were assigned to MPC-HM, Fugu, RobustMPC-HM, Pensieve, or BBA (47,584–48,703 sessions and 229,851–238,651 streams per algorithm). Per algorithm, 138,899–144,832 streams were excluded (did not begin playing: 55,182–59,450 streams; watch time less than 4 s: 79,435–87,958 streams; stalled from a slow video decoder: 14–41 streams; sent contradictory data: 1 stream), and 2,391–2,683 streams were truncated because of a loss of contact. In total, 458,801 streams were considered (89,287–93,819 per algorithm; 1.6–1.9 client-years of data each), comprising 8.5 client-years of data: 8.5 client-years spent playing, 5.1 client-days spent stalled, and 2.7 client-days spent in startup.]
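To make the stream-level filtering in Figure A1 concrete, the sketch below applies the same exclusion criteria to a hypothetical per-stream summary table. It is illustrative only: the column names (began_playing, watch_time, slow_decoder_stall, contradictory_data) are ours and are not part of the released schema; they stand in for quantities that can be derived from the measurements described in Appendix B.

# Minimal sketch (Python/pandas) of the stream-level filtering in Figure A1.
# The column names below are hypothetical, not part of the released schema.
import pandas as pd

def considered_streams(streams: pd.DataFrame) -> pd.DataFrame:
    """Keep only streams that the primary analysis would consider."""
    mask = (
        streams["began_playing"]                # drop streams that never began playing
        & (streams["watch_time"] >= 4.0)        # drop watch times under 4 seconds
        & ~streams["slow_decoder_stall"]        # drop stalls caused by a slow video decoder
        & ~streams["contradictory_data"]        # drop streams that sent contradictory data
    )
    return streams[mask]

# Toy example: only the first stream survives the filters.
toy = pd.DataFrame({
    "began_playing":      [True, True, False],
    "watch_time":         [120.0, 2.5, 0.0],
    "slow_decoder_stall": [False, False, False],
    "contradictory_data": [False, False, False],
})
print(considered_streams(toy))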

B Description of open data

The open data we release comprise different sets of data – "measurements" – with each measurement containing a piece of information from a video server or a client. Below we highlight the format of interesting fields in three measurements essential for analysis of this data. Between January 1 and September 12, 2019, we collected 264,374,831 data points in video_sent, 261,280,238 data points in video_acked, and 1,729,618,482 data points in client_buffer.

video_sent collects a data point every time a Puffer server sends a video chunk to a client. Each data point contains:

• time: epoch time when the chunk is sent.
• stream_id: a unique ID to identify a video stream.
• expt_id: a unique ID for each experimental group. expt_id can be used to retrieve the configuration (e.g., ABR, congestion control) tested in the corresponding experimental group.
• size: size of the chunk.
• ssim_index: SSIM of the chunk.
• cwnd: current congestion window size (tcpi_snd_cwnd in tcp_info).
• in_flight: number of unacknowledged packets in flight (tcpi_unacked − tcpi_sacked − tcpi_lost + tcpi_retrans).
• min_rtt: minimum RTT (tcpi_min_rtt).
• rtt: smoothed RTT estimate (tcpi_rtt).
• delivery_rate: estimate of TCP throughput (tcpi_delivery_rate).

video_acked collects a data point every time a Puffer server receives a video chunk acknowledgement from a client. Each data point can be matched to a data point in video_sent (if the chunk is ever acknowledged) and used to calculate the transmission time of the chunk.

client_buffer collects client-side buffer information on a regular interval and when certain events occur. Each data point contains:

• event: event type, e.g., was this data point triggered by a regular report every quarter second, or because the client stalled or began playing.
• buffer: playback buffer size.
• cum_rebuf: cumulative rebuffer time in the current stream.
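As an illustration of how video_sent and video_acked can be combined, the sketch below computes per-chunk transmission times with pandas. It is a minimal example under stated assumptions, not part of the released tooling: the file names, the CSV layout, and the presence of a shared per-chunk key (stream_id plus a video timestamp, called video_ts here) are assumptions on our part; consult the published archive's schema for the authoritative field names.

# Minimal sketch (Python/pandas): join video_sent with video_acked to estimate
# per-chunk transmission time. File names and columns are assumed, not authoritative.
import pandas as pd

sent = pd.read_csv("video_sent.csv")    # assumed columns: time, stream_id, video_ts, size, ...
acked = pd.read_csv("video_acked.csv")  # assumed columns: time, stream_id, video_ts

# Match each acknowledgement to the corresponding send record.
merged = sent.merge(
    acked,
    on=["stream_id", "video_ts"],
    suffixes=("_sent", "_acked"),
)

# Transmission time of a chunk = acknowledgement time minus send time.
merged["trans_time"] = merged["time_acked"] - merged["time_sent"]

# Example derived quantity: per-chunk throughput (size divided by transmission time).
merged["throughput"] = merged["size"] / merged["trans_time"]

print(merged[["stream_id", "video_ts", "trans_time", "throughput"]].head())

Chunks that are never acknowledged simply drop out of the inner join, matching the caveat above that a video_acked data point exists only if the chunk is ever acknowledged.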