Engineering for a Science-Centric Experimentation Platform

Nikos Diamantopoulos (Netflix, Inc., Los Gatos, California, USA)
Jeffrey Wong (Netflix, Inc., Los Gatos, California, USA)
David Issa Mattos (Chalmers University of Technology, Gothenburg, Sweden)
Ilias Gerostathopoulos (Technical University of Munich, Munich, Germany)
Matthew Wardrop (Netflix, Inc., Los Gatos, California, USA)
Tobias Mao (Netflix, Inc., Los Gatos, California, USA)
Colin McFarland (Netflix, Inc., Los Gatos, California, USA)

ABSTRACT

Netflix is an internet entertainment service that routinely employs experimentation to guide strategy around product innovations. As Netflix grew, it had the opportunity to explore increasingly specialized improvements to its service, which generated demand for deeper analyses supported by richer metrics and powered by more diverse statistical methodologies. To facilitate this, and to more fully harness the skill sets of both engineering and data science, Netflix engineers created a science-centric experimentation platform that leverages the expertise of data scientists from a wide range of backgrounds by allowing them to make direct code contributions in the languages used by scientists (Python and R). Moreover, the same code that runs in production can be run locally, making it straightforward to explore and graduate both metrics and causal inference methodologies directly into production services.

In this paper, we utilize a case-study research method to provide two main contributions. Firstly, we report on the architecture of this platform, with a special emphasis on its novel aspects: how it supports science-centric end-to-end workflows without compromising engineering requirements. Secondly, we describe its approach to causal inference, which leverages the potential outcomes conceptual framework to provide a unified abstraction layer for arbitrary statistical models and methodologies.

KEYWORDS

experimentation, A/B testing, software architecture, causal inference, science-centric

1 INTRODUCTION

Understanding the causal effects of product and business decisions via experimentation is a key enabler for innovation and improvement [13, 25, 29], and the gold-standard of experimentation is the randomized controlled trial design (also known as A/B testing) [5, 39, 41].

In this paper, we will be presenting a case-study of the A/B testing experimentation platform built by Netflix, a leading internet entertainment service. The innovations of this experimentation platform are interesting because they have resulted in a "technical symbiosis" of engineers and data scientists, each complementing the skill sets of the other, in order to create a platform that is robust and scalable, while also being readily extensible by data scientists.

Netflix routinely uses online A/B experiments to inform strategy and operation discussions (e.g. [4, 18, 26, 28]), as well as whether certain product changes should be launched. Over time these discussions grew to be increasingly specialized, generating demand for more and richer metrics powered by extensible statistical methodologies that are capable of answering diverse causal effects questions. For example, it was becoming more common for teams to require bespoke metrics to assist in the analysis of specific experiments, such as when changes to Netflix's UI architecture and player design caused extra hard-to-isolate latency in playback startup [18]; or to require bespoke statistical methodologies, such as when interleaving was used to garner additional statistical power when trying to compare two already highly-optimised personalization algorithms [4].

To support these ever-growing use-cases, Netflix made a strategic bet to make their experimentation science-centric; that is, to place a heavy emphasis on supporting arbitrary scientific analyses. To implement this science-centric vision, Netflix's experimentation platform, Netflix XP, was reimagined around three key tenets: trustworthiness, scalability, and inclusivity. Trustworthiness is essential since results that are untrustworthy are not actionable. Scalability is required to accommodate Netflix's growth. Inclusivity is a key tenet because it allows scientists from diverse backgrounds such as biology, psychology, economics, mathematics, physics, computer science and other disciplines to contribute to and enrich the experimentation platform.
The implications of these tenets on Netflix XP are wide-ranging, but perhaps chief among them are the resulting choices of language and computing paradigm. Python was chosen as the primary language of the platform, with some components in C++ and R as needed to support performance and/or statistical models. This was a natural choice because it is familiar to many data scientists, and has a comprehensive collection of standard libraries supporting both engineering and data science use-cases. The platform also adopted a non-distributed architecture in order to reduce the barrier of entry into the platform for new statistical methodologies. Since non-distributed architectures are not as trivially scaled, the techniques employed by the platform in order to ensure scalability, i.e. compression and numerical performance optimizations, are a significant contribution of this work.

The reimagined Netflix XP has also had implications for its stakeholders. Firstly, data science productivity has increased. It is straightforward for data scientists to reproduce and extend the standard analyses performed by the experimentation platform because they can run the production code in a local environment. The code also permits ad hoc extensions, allowing scientists to leverage their background and domain knowledge to easily deliver customized scorecards [12]; for example, by including explorations of heterogeneity or temporal effects. Secondly, data science workflows have been enriched with a more extensive toolkit. Since the platform was reimagined, new statistical methodologies (such as quantile bootstrapping and regression) have been contributed to the platform, which can then be used in combination with arbitrary metrics of the data scientists' choice. Thirdly, engineers have been freed up to focus on the platform itself.
Since data scientists are now responsible for contributing and maintaining their own metrics and methodologies, engineers are able to focus on aspects of the platform in which they specialize, leading to greater scalability and trustworthiness. The effect of these implications has compounded in rapid innovation cycles around ongoing strategy discussions, which has changed the face of experimentation at Netflix.

In this paper, we utilize a case-study research method to provide two main contributions. Firstly, we report on the architecture of this platform, with a special emphasis on its novel aspects: how it supports science-centric end-to-end workflows without compromising the engineering requirements laid out in subsequent sections. Secondly, we describe its approach to causal inference, which leverages the potential outcomes framework to provide a unified abstraction layer for arbitrary statistical models and methodologies.

The rest of this paper is organized as follows: Section 2 presents background information on online experiments and related work. Section 3 presents the research method and validity considerations. Section 4 presents the architectural requirements, the libraries, and the improvements made to Netflix XP to support science-centric experimentation. Section 5 discusses the causal inference framework used by Netflix XP that allows scientists to express their causal models in a unified way. Section 6 discusses the impact of the transition to science-centric experimentation at Netflix. Finally, Section 7 concludes the paper and discusses some future research directions.

2 BACKGROUND AND RELATED WORK

2.1 Online Experiments

Online experiments have been discussed in research for over 10 years [3]. The most common type of online experiment is the randomized controlled trial (RCT). An RCT consists of randomly assigning users to different experiences (control and treatments) of the product, while their behavior is gauged via logging a number of events. Based on this telemetry, several metrics are computed during and upon completion of an experiment. Statistical tests, such as the t-test, Mann-Whitney test, or CUPED [7], are used to identify statistically significant changes in the metrics and generate scorecards [12]. These scorecards help product managers, engineers, and data scientists to make informed decisions and identify a causal relationship between the product change and the observed effect. RCT in web systems is extensively discussed by Kohavi et al. [25]. The paper presents an in-depth guide on how to run controlled experiments on web systems, discussing types of experimental designs, statistical analysis, ramp-up, the effect of robots, and some architecture considerations, such as assignment methods and randomization algorithms.

Although most research in online experiments has focused on RCT, companies have been using other types of experimental designs to infer causal relations. For instance, Xu and Chen [40] describe the usage of quasi A/B tests to evaluate the mobile app of LinkedIn. The paper details the characteristics of the mobile infrastructure that contribute to the need for designing and running different experiment designs than RCT.

2.2 Experimentation Processes and Platforms

To support and democratize experimentation across multiple departments, products and use cases, Kaufman et al. [23] have identified the need for an experimentation platform to be generic and extensible enough to allow the design, implementation, and analysis of experiments with minimal ad hoc work. They describe, in the context of Booking.com, the usage of an extensible metric framework to provide flexibility for experiment owners to create new metrics. However, they do not describe the extensibility aspect in the context of different experimental designs and analyses as we do.

Twitter discusses its experimentation platform and how it is capable of measuring and analyzing a large number of flexible metrics [9]. The platform supports three types of metrics: built-in metrics that are tracked for all experiments, event-based metrics, and metrics that are owned and generated by engineers. One of the challenges is to scale the system with this flexibility. Scalability was achieved through several performance optimizations in their infrastructure, including profiling and monitoring the capabilities in Hadoop and making processing jobs more efficient.

The trustworthiness aspect of online experiments has been an active area of research [8, 14, 21, 24, 42]. Experiments that rely on violated assumptions or are susceptible to implementation or other design errors can lead to untrustworthy results that can compromise the conclusions and the value of the experiment. Kohavi et al. [24] discuss lessons learned from online controlled experiments and factors that can influence the experiment result, such as carryover effects, experiment duration, and statistical power. Fabijan et al. [14] provide essential checklists to prevent companies from overlooking critical trustworthiness aspects of online experiments.
In our work, we do not specifically focus on trustworthiness aspects of online experiments, but on how to make the experimentation process science-centric.

More similar to our work, the different software architecture parts and design decisions of an experimentation platform are presented in Gupta et al. [20]. The paper describes the core components of the Microsoft ExP Platform, focusing on trustworthiness and scalability. In summary, their platform can be divided into four main components: experimentation portal, experiment execution service, log processing service, and analysis service. In the platform, experiment owners can easily create, deploy, and analyze experiments reliably and at scale. The platform also supports deep-dive post-experiment analysis for understanding metric changes in specific segments. However, such analysis requires a deep understanding of the structure of the data, the computation of the metrics, and the way experiment information is stored in the data. In our work, we specifically focus on the analysis components of Netflix XP and describe how they have been re-designed to allow science-centric experimentation.

3 RESEARCH METHOD

This case study covers the main characteristics of Netflix XP that allow and support science-centric experimentation. Netflix is an entertainment media service provider and content producer with over 150 million subscribers. Within the scope of the online streaming platform, Netflix runs hundreds of experiments yearly. Netflix XP has been running and supporting experiments at Netflix for over 9 years. Over the last 3 years, an increased need for flexibility in the experimental design and analysis, as well as the need to optimize the usability of the platform for its scientists, has led Netflix to redesign its experimentation infrastructure as science-centric. We conducted this case study research following the guidelines proposed by Runeson and Höst [31].
Data Collection. This research comprises data collected from 2017-2019. The primary source of data consists of documentation from three regularly scheduled meetings: Experimentation Engineering with Experimentation Science leaders, Experimentation Science strategy meetings, and Experimentation Engineering with Experimentation Science verticals. Additionally, we collected data from the company-wide summit on forward thinking plans for experimentation, one-on-one interviews with data scientists, engineers, and product managers, as well as software documentation and product roadmap documents.

Data Analysis. We analyzed the collected data in three steps. In the first step, we gathered all the design changes of Netflix XP. In the second step, we coded these changes into common groups [6], such as requirements, architecture changes, software libraries, performance improvements, statistical methods, and causal inference modeling. Similar codes were merged, and grouped under the two main themes discussed in this paper: the software architecture and the causal inference framework. We classified within each theme the changes that produced, and are expected to produce, high impact for the platform. We then staged the changes in a way that made the foundations of Netflix XP strong.

Validity considerations. Construct validity: Netflix has an extensive culture of experimentation, and developers and scientists regularly run online experiments. All researchers and participants were aware and familiar with the concepts and the challenges. When needed, additional explanations and examples were given. External validity: This study is based on a single case company, and the results and decisions taken in the architecture are dependent on this specific context. However, the presented results can provide guidance to other organizations seeking to evolve a science-centric experimentation culture, since not all of the results and discussion are tied directly to Netflix XP or to the streaming service industry.
4 SOFTWARE ARCHITECTURE

Due to the ever increasing number of simultaneous experiments, experimentation platforms are often expected to derive conclusions without much human intervention. While automation brings a huge boost in velocity, Netflix's view is that it should not stand in the way of custom analyses that can leverage domain expertise in order to improve the understanding and context of the effects created by an experiment. Online experiments can easily become very complex and challenging to analyze [8]. In such cases, the custom designs and analyses made by the involved data scientists are of great importance.

The large number of data scientists running custom analyses required Netflix to redesign its experimentation platform. From Netflix's experience, when the stakeholders who need some changes are not empowered to make them, the results are sub-optimal. When facing engineering barriers to integrate with production systems, scientists might end up creating isolated solutions that may not be integrated with the production system. This can lead to multiple fragmented systems with different degrees of documentation and levels of support.

This section describes the architectural components in Netflix XP that support science-centric experimentation. These components give data scientists full autonomy to run their analyses end-to-end, and empower them with the necessary software tools to do deep-dive analyses. We describe first the requirements of the platform and then its main components.

4.1 Requirements for Experiment Analysis

Regarding the analysis of online experiments, Netflix XP has the following requirements:

• Scalable. Each experiment at Netflix XP may collect and analyse data from a large portion of its 150M subscribers. Since Netflix is a very fast growing business, this is just the starting point for the scalability requirement.
• Performant. Experiment results must be calculated within seconds or minutes to allow for exploratory analyses with different user segments and metrics.
• Cost efficient. The use of computational and storage resources for experiment analysis must be minimized to avoid unnecessary costs.
• Trustworthy. Netflix XP must offer reproducible results with accurate calculations on all statistics.
• Usable. Data analysts must be able to effortlessly specify standard and specialized analysis flows, view the results in an intuitive graphical interface, and perform custom deep-dive analyses.
• Extensible. Data scientists from different backgrounds must be able to easily extend Netflix XP by contributing new experimental designs and analyses.

The last two points are directly related to science-centric experimentation. They imply that scientists can easily set up a development environment so they can reproduce, debug, and extend analyses that happen in production. This development environment should in particular:

• support interfacing with existing scientific libraries, such as Pandas, Statsmodels, and R;
• support local and interactive computation, for example through Jupyter Notebooks.

Figure 1: Experiment analysis flow at Netflix XP.

4.2 Experiment Analysis Flow

At a high level, an experiment analysis at Netflix XP consists of three distinct phases: data collection, statistical analysis, and visualization of the results [27]. These phases, along with the related components of the platform, are depicted in Figure 1. As a first step, experiment log data are extracted, enhanced by user metadata, and stored as a table in S3. The resulting table is subsequently filtered, aggregated, and compressed based on the specified analysis configuration. Those first two tasks are achieved through the Metrics Repo component, which is responsible for generating the appropriate SQL expressions that will be run on top of Spark and Presto. Second, different statistical methods are run on the compressed data to calculate the specified metrics of interest for the experiment. This step is performed by the Causal Models component. Third, different graphs are plotted for visual analysis of the results, a task of the XP Viz component. All of the above are orchestrated via the XP Platform API, which is responsible for delivering the calculated metrics and produced graphs to Netflix XP's frontend, ABlaze. Alternatively, the same analysis can also run within a Jupyter Notebook; in this case, the analyst can customize the analysis further and view the results directly in the Notebook environment.

To enable science-centric experimentation, all three main components of Netflix XP's architecture can be extended by data scientists providing new metrics, statistical methods, and visualizations. These three components are described in detail next.

4.3 Metrics Repo

Metrics Repo is an in-house Python framework where users define metrics as well as programmatically generated SQL queries for the data collection step. One of the main benefits of this is the centralization of metric definitions in a unified way. Previously, many teams at Netflix had their own pipelines to calculate success metrics, which caused a lot of fragmentation and discrepancies in calculations.

For each metric, the framework allows contributors to define certain metadata (e.g. the statistical model to use and how to be displayed). In order to compare a metric across two user groups, aggregate data of users in each group need to be collected and compared. For example, Figure 2 shows the specification of a "number of streamers" metric: for each user, the number of streaming sessions with duration more than one hour is collected. For comparison between user groups, a default set of descriptive statistics, as well as proportion tests, are used.

    class NumStreamers(Metric):
        def _expression(self):
            return Max(If_(self.query.streaming_hours > 0, 1, 0))

        def _statistics(self):
            return [DescriptiveStats(), ProportionStats()]

Figure 2: Example Metric definition.

Many related systems in the industry generate the required data for analyses through a rigorous Extract, Transform, Load (ETL) pipeline which is responsible for annotating the available business data with the experiment data. A key design decision of Metrics Repo is that it moves the last mile of metric computation away from data engineering-owned ETL pipelines into dynamically generated SQL that runs on Spark. This allows scientists to add metrics and join arbitrary tables in a faster and much more flexible way since they do not have to conform to a strict predefined schema. The generated SQL is run only on demand and on average takes a few minutes to execute. This ad-hoc data collection removes the need for migrations and expensive backfills when making changes to metrics, avoiding the costly and slow ETL alternative. Adding a new metric is as easy as adding a new field or joining a different table in SQL. The SQL is generated programmatically in Python, which leads to a maintainable and self-documented code base.
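To make the idea of programmatically generated SQL concrete, the following minimal sketch (our own illustration, not Metrics Repo's actual API; table and column names are hypothetical) shows how metric expressions defined in Python can be rendered into a single per-user aggregation query for the data collection step.

    # Illustrative sketch only; Metrics Repo's real interfaces are internal to Netflix.
    # Each metric owns a SQL aggregation expression; the query is assembled in Python.
    METRIC_EXPRESSIONS = {
        "num_streamers": "MAX(CASE WHEN streaming_hours > 0 THEN 1 ELSE 0 END)",
        "total_streaming_hours": "SUM(streaming_hours)",
    }

    def build_collection_query(metric_names, source_table="experiment_logs"):
        select_list = ",\n       ".join(
            f"{METRIC_EXPRESSIONS[name]} AS {name}" for name in metric_names
        )
        return (
            f"SELECT user_id, cell,\n       {select_list}\n"
            f"FROM {source_table}\n"
            f"GROUP BY user_id, cell"
        )

    print(build_collection_query(["num_streamers", "total_streaming_hours"]))

Adding a metric or joining another table then amounts to changing Python code that is versioned and reviewed with the rest of the repository, rather than modifying an upstream ETL pipeline.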
4.3.1 Pre-compute vs Live-compute. When analyzing an experiment, scientists need to see the metrics through different slicings of their data. Slicing is typically done based on different dimensions, e.g., user's country (only US users) or device type (only iOS users). Traditionally, to support this, statistics would be computed for each dimension value over all dimensions (pre-compute). Such computation leads to an explosion of possible comparisons: e.g. statistics for users in each country are compared separately, and statistics for users on different device types are again compared separately. The problem becomes exponential when slicing is applied via conjunction of dimension values (e.g. US users on iOS) or disjunction (e.g. users from US or Canada) due to the number of possible combinations.

To cope with the above problem, instead of pre-computing results for all the possible data slices, the platform adopts the following hybrid solution. When a new analysis is requested, statistics for only a number of commonly used slices are pre-computed. If more slices are needed, the respective statistics are computed on demand (live-compute). Live computation is not instant, but given that on average it takes less than a minute, it is easy to queue all the different slicings and start viewing the results as they become available within seconds. To achieve the above, the data collection is split in two steps: the first one retrieves raw data without filtering and aggregations, whereas the second retrieves the final set of filtered and aggregated data. The first part is usually much more costly to compute (multiple minutes) due to big table joins, so it is calculated in Spark and stored as a table in S3. The resulting table can subsequently be sliced with the requested filtering on demand. For quick slicing over large amounts of data, Presto [33], a distributed SQL engine, has been used due to its fast and interactive nature in computing filter and aggregate queries compared to alternatives such as Spark or Hive.

Lastly, it is worth noting that, given that comparing all the pre-computed slicings is statistically controversial due to multiple hypothesis testing [15], Netflix XP offers segment discovery through Causal Models, which enables automatic discovery of important data slices instead of manually comparing them one by one.

4.3.2 Building trustworthiness. Metrics Repo comes with two powerful features that increase the confidence in changes. The first is a testing framework which, besides unit testing, allows integration testing through which metric calculations run end-to-end on real sample data leveraging Spark. This enforces that every change goes through a continuous integration system, ensuring that none of the well established reports are affected. Contributors are given the appropriate tools and are urged to follow internal best practices when submitting changes. The second feature is the option to run a meta-analysis on historical tests with the proposed changes. This enables contributors to change a metric definition and view how this would have affected hundreds of completed tests, allowing them to confidently decide if they should move forward. Those two features have proven valuable in providing a safety net and a solid base for changes.
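As a rough illustration of the integration-testing idea (the real test harness is internal; the fixture, table, and column names below are invented for the example), a metric's generated SQL can be exercised end-to-end against a small Spark dataframe in a standard pytest suite:

    # Hypothetical end-to-end test: run a metric's aggregation on fixed sample data in Spark.
    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        return SparkSession.builder.master("local[2]").appName("metrics-repo-tests").getOrCreate()

    def test_num_streamers_per_cell(spark):
        rows = [("u1", "treatment", 2.5), ("u2", "treatment", 0.0), ("u3", "control", 1.2)]
        spark.createDataFrame(rows, ["user_id", "cell", "streaming_hours"]) \
             .createOrReplaceTempView("experiment_logs")

        # Stand-in for the SQL that Metrics Repo would generate for the NumStreamers metric.
        result = spark.sql(
            "SELECT cell, SUM(CASE WHEN streaming_hours > 0 THEN 1 ELSE 0 END) AS num_streamers "
            "FROM experiment_logs GROUP BY cell"
        ).collect()

        assert {r["cell"]: r["num_streamers"] for r in result} == {"treatment": 1, "control": 1}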
4.4 Causal Models

Causal Models is a Python library that houses implementations of causal effects models and serves as the statistical engine for Netflix XP. Causal effects models are a restricted class of statistical models that measure causation instead of correlation, a distinction that is crucial in the context of experimentation. Causal Models receives data and metrics from Metrics Repo, then reports summary statistics such as the mean, count, and quantiles under a model, and treatment effect statistics such as the average treatment effect, its variance, its confidence interval and its p-value. Like Metrics Repo, the library is designed for inclusion in that it allows scientists to contribute causal effects models that integrate into the experiment analysis workflow. To support the management of many models, the Causal Models library also employs a modeling framework for causal inference, though we defer that discussion until Section 5 and focus on Causal Models as a mechanism for statistical testing here.

Netflix seeks to utilize a full repertoire of causal effects models from different scientific fields in order to provide rich data for decision making. The two-sample t-test is the most foundational causal effects model in AB testing. It is simple to understand, simple to implement, is easily scaled, and measures causal relationships instead of correlational ones when the data is randomized and controlled. Building on top of that, ordinary least squares (OLS) is a causal effects model that can be used to determine the differences in the averages while filtering noise that the t-test cannot filter. Quantile regression can be used to determine differences in quantiles of the distribution, for example if Netflix is concerned about changes in its most engaged users. Panel models can be used to measure treatment effects through time.

By building modeling tools using the same stack that scientists use, Netflix XP was able to overcome many challenges in graduating multiple causal effects models. Often, advances in modeling are developed by scientists with in-depth knowledge of statistics, and their methods are usually inspired by domain knowledge and experience in their field. To support their work in field experiments, their models are developed in programming languages such as R and Python that emphasize local and interactive computing. The process of graduating such causal effects models into a production engineering system can be inefficient. First, context and knowledge must be transferred. Afterwards, the models would frequently be re-implemented in Spark in order to make them performant in a big data environment. Implementations in a distributed computing environment, such as Spark, make models hard to debug, and introduce a high barrier for scientists to contribute. This challenge often leads scientists to create ad-hoc applications in order to communicate their research and conclusions about an experiment. Instead of reimplementing models, Causal Models is built on Python, and engineers an interface that can integrate these models into Netflix XP while preserving the important engineering requirements discussed in Section 4.1. This created a path from research directly into the experiment analysis workflow. In this case, the tenet of being inclusive to the data science stack improved the science-centric vision, as well as the tenet on trustworthiness. The innovations required to reach this milestone are discussed below.
To make contributions easier for scientists, Causal Models offers all the necessary support to integrate with existing statistics libraries in Python and R, the most common data science languages at Netflix. Having a multilingual framework makes Netflix XP inclusive to scientists from different backgrounds. rpy2 [17] has enabled the use of R inside a Python framework by embedding an R process, but sharing data across them can consume large amounts of memory. In order to minimize RAM usage, the platform employs Apache Arrow [1], an in-memory and cross language data format that offers zero-copy inter-process communication. Additionally, Causal Models provides: (1) parallelization over multiple metrics during the calculation of statistics and (2) caching to simplify managing multiple models for multiple metrics.

Integrating with non-distributed Python and R libraries enables single-machine computation that is easy to debug and extend; however, this emphasis and deviation from distributed computing can reduce the scalability of the experimentation platform. Therefore, the Netflix XP engineering team developed optimizations to scale modeling, so that the stack can still serve production and also offer local computation. This was addressed in two ways: data compression, and high performance numerical computing.

Data compression is an engineering achievement that allows better inclusion of the data science stack, and improves the tenet on scalability. Many causal effects models, such as OLS, compute the difference between the means of two distributions; these means are estimated using averages from a dataset. Some distributions can be losslessly summarized using sufficient statistics [16]. For example, the Normal distribution can be summarized using conditional means and variances. When sufficient statistics are available, causal effects models do not need to be trained on the raw dataset; they can be trained on a much smaller dataset containing the sufficient statistics and features for the model. Compression rates as high as 100x were regularly observed, allowing data that would previously require hundreds of gigabytes of memory to fit in a single machine for local and interactive modeling. Compression is a core part of Netflix XP's analysis workflows, and is applied to all data. For other causal effects models that cannot be summarized using sufficient statistics, a lossy compression is used with sensible defaults that do not materially impact the precision of the treatment effect, as validated by the platform's meta-analysis framework.
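The following toy sketch, which is our own illustration and not the platform's implementation, shows the flavor of compression via sufficient statistics: a metric is collapsed to per-cell counts, means, and variances, and a difference in means with a Welch-style t statistic is then computed from the summaries instead of the raw rows.

    # Toy example of lossless compression into sufficient statistics; column names are invented.
    import numpy as np
    import pandas as pd

    def compress(df):
        # Count, mean and variance per cell are sufficient for a difference-in-means analysis.
        return df.groupby("cell")["metric"].agg(n="count", mean="mean", var="var")

    def difference_in_means(summary):
        ate = summary.loc["treatment", "mean"] - summary.loc["control", "mean"]
        se = np.sqrt(summary.loc["treatment", "var"] / summary.loc["treatment", "n"]
                     + summary.loc["control", "var"] / summary.loc["control", "n"])
        return ate, ate / se  # treatment effect and its t statistic

    raw = pd.DataFrame({
        "cell": ["control"] * 3 + ["treatment"] * 3,
        "metric": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],
    })
    print(difference_in_means(compress(raw)))

However many millions of rows the raw table has, the compressed table has one row per cell (or per cell and covariate combination), which is what makes single-machine, interactive modeling feasible.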
Optimizing numeric computations in Causal Models is another way to be inclusive and performant. At Netflix, there is a focus on high performance numerical computing applied to causal effects. This led to the development of causal effects primitives to support Causal Models through highly optimized and generic functions that are common in causal effects analysis. Scientists can use these primitives to compose their own analyses of experiments. For example, linear models are a widely used causal effects model in Netflix XP. They have simple assumptions, are easy to interpret, and are highly extensible: they can be used to estimate average treatment effects, detect segments where treatment effects are different, and measure treatment effects through time. All of these variations can benefit from a highly optimized implementation of OLS. Previous work from Netflix XP in [37] demonstrates five significant optimizations to standard implementations of OLS that ultimately can compute hundreds of treatment effects over many millions of users in seconds. Many of the causal effects primitives in Causal Models are developed in C++, in order to have low-level control over computation. Although scientists normally use interactive programming languages, many of their primitives are optimized in C or C++, such as the Python library NumPy [34]. C++ enables developers to minimize memory allocations, optimize for cache hits, vectorize functions, and manage references to data without creating deep copies, important aspects of high performance numerical computing. Linear algebra functions that support many causal effects models are invoked through the C++ library Eigen [19]. Using C++, engineers can write optimized functions once, and deliver wrappers for Python and R through pybind11 [22] and Rcpp [10], maintaining the platform's commitment to inclusivity by supporting a multilingual environment.

By creating an inclusive and scalable development experience for causal effects models, Netflix XP has expanded support from two-sample t-tests and Mann Whitney rank tests to many more methods, and has gained confidence that it can include more models that were not originally designed for distributed computing.

Figure 3: Examples of possible analysis flows in Netflix XP.

4.5 XP Viz

XP Viz is the final component of the science-centric experimentation analysis flow in Netflix XP. It is a library that provides a lightweight and configurable translation layer from the statistical output of Causal Models into interactive visualizations. By implementing it as an independent pluggable component, the platform separates the view layer from the computation layer and allows reuse of standardized visualizations in other contexts. The plotting aspects of the visualization layer are based on Plotly's rich library of graph components, allowing different teams to reuse, choose, and customize the visualizations of their metrics.
A key benefit introduced by XP Viz is that it provides first-class support for Jupyter Notebooks. Data scientists at Netflix regularly use Notebooks for their day-to-day development, so supporting their familiar tooling allows them to iterate faster. The integration of the XP Viz library with Notebooks allows scientists to not only compute their metrics in a Notebook when exploring, but also visualize them in the exact same way as they would do in the production UI, ABlaze. This seamless flow from server rendered production views to local Notebook rendered views gives scientists the full power to explore and innovate using the visualizations of their choice.

4.6 Execution of experiment analysis flows

An analysis flow consists of metric definitions from the Metrics Repo, corresponding statistical tests from Causal Models, and visualizations from XP Viz. For instance, a possible flow, depicted in Figure 3, may include the calculation of OLS on total streaming hours and visualization of the results using box plots. All of the above steps are orchestrated by the XP Platform API, a REST API responsible for kicking off computations, keeping state, and storing results.

One of the requirements of the XP Platform API is to always function in an interactive manner, which means it should remain performant and with consistent latency as computational load increases. Such a requirement becomes more important when multiple users are interacting with the system or when a single user requests multiple slicings of an analysis. To achieve this, the heavy computational workload is offloaded to workers instead of using the server processes. This avoids competition for server resources and also offers a sandboxed environment to run any potentially unsafe code. A common solution in such architectures is to have a list of dedicated machines that are responsible for running the jobs. Instead, Netflix chose to run the jobs on its implementation of OpenFaaS, a serverless computing platform. This solution provides a lot of important features such as autoscaling in a cost effective way, efficient management of the job queue, managed deploys, as well as easy health metric and log collection. Leveraging OpenFaaS provides access to a cluster of machines that guarantees the interactive requirements are met as the load increases.

To unify the execution of workflows with the code in exploratory Jupyter Notebooks, a Notebook-based execution flow is enabled. Essentially, each execution is constructed as a parameterized Notebook that gets evaluated by the different workers. This Notebook can then be extracted and re-run by data scientists to fully reproduce the analysis, which allows them to debug, explore, and extend the analysis as desired. The described Notebook integration creates a natural cycle between the production executions and the ad-hoc Notebook explorations; a production execution can be exported to a Notebook while a Notebook execution can be promoted to production.
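The paper does not name the tooling used for parameterized Notebook execution; purely as an illustration of the pattern, a worker could evaluate an analysis template with a library such as papermill, with the parameter names below being hypothetical.

    # Illustration of parameterized Notebook execution (papermill is our assumption, not
    # necessarily what Netflix XP uses); parameter names are invented for the example.
    import papermill as pm

    pm.execute_notebook(
        "analysis_template.ipynb",           # template containing the standard analysis flow
        "outputs/experiment_123_ols.ipynb",  # fully materialized, re-runnable analysis
        parameters={
            "experiment_id": 123,
            "metrics": ["total_streaming_hours"],
            "statistics": ["ols"],
            "slices": [{"country": "US"}],
        },
    )

The materialized output notebook is what a scientist would open to reproduce, debug, or extend the production analysis.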
5 FRAMEWORK FOR MEASURING CAUSAL EFFECTS

The science-centric vision has greatly influenced the design of Causal Models. It offers a software framework that is not only performant, as mentioned in Section 4.4, but also aligns the implementation of a causal effects model with the science of potential outcomes [30, 32], making contributions from scientists natural. Potential outcomes is a generic framing of causal effects computation that is mirrored in Causal Models' programmatic interface. In this way Netflix XP is able to accommodate many causal effects models without having to worry about the domain specific implementation details of the model.

To demonstrate how potential outcomes can be used to unify the computation of three different types of causal effects that Netflix is interested in, consider the following five statistical variables:

(1) y: the metric that needs to be measured.
(2) X: a binary variable indicating whether a user received treatment.
(3) W: other features that are useful for modeling the variation in y.
(4) t: a variable for indexing time.
(5) θ: hyperparameters for a model.

The potential outcomes framework is the thought exercise: what would y be if we apply treatment, and what would y be if we do not apply treatment? In a randomized and controlled experiment where all variables are constant except the treatment, any difference in y must be due to noise, or to the treatment. Furthermore, by using this framing, a variety of treatment effects that are important for a business can be computed from an arbitrary model. The average treatment effect (ATE) on y due to receiving the treatment, X, can be generically formulated as

    ATE = E[y | X = 1, W] − E[y | X = 0, W].

This treatment effect is the expected global difference between the treatment experience and the control experience. Likewise, the causal effect on y due to X for the subpopulation where W = w* is the conditional average treatment effect (CATE), and can be formulated as

    CATE(w*) = E[y | X = 1, W = w*] − E[y | X = 0, W = w*].

This treatment effect shows opportunities to personalize for different segments. In many cases, the treatment effect needs to be traced through time, which is the dynamic treatment effect (DTE)

    DTE(t*) = E[y | X = 1, W, t = t*] − E[y | X = 0, W, t = t*].

This treatment effect can show if a causal effect is diminishing, or if it can persist for a long time. All causal effects models in Causal Models can subscribe to this modeling framework.
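Under randomization these expectations can be estimated with simple group means; the short sketch below (our own toy example with invented column names) computes an ATE and per-segment CATEs from a dataframe in exactly that way.

    # Toy estimates of ATE and CATE from a randomized dataset; not platform code.
    import pandas as pd

    df = pd.DataFrame({
        "y": [1.0, 2.0, 1.5, 2.5, 3.0, 2.0],        # metric
        "X": [0, 0, 0, 1, 1, 1],                    # treatment indicator
        "W": ["US", "CA", "US", "US", "CA", "CA"],  # a segmenting feature
    })

    means = df.groupby("X")["y"].mean()
    ate = means[1] - means[0]                       # estimate of E[y|X=1] - E[y|X=0]

    by_segment = df.groupby(["W", "X"])["y"].mean().unstack("X")
    cate = by_segment[1] - by_segment[0]            # estimate of CATE(w*) for each segment w*

    print(ate, cate.to_dict())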
Many challenging aspects of managing causal effects models are resolved through this software abstraction based on potential outcomes. For example, models can differ in their input, requirements, and assumptions. A two-sample t-test accepts strictly two metrics, one for the control experience and one for the treatment, and requires that the treatment assignment was randomized. Ordinary least squares (OLS) accepts an arbitrary number of metrics, the treatment assignment, and a set of covariates for the model. It requires that the treatment assignment was conditionally randomized, and that the covariates are exogenous and full rank [38]. Finally, it assumes that the noise in the metric is normally distributed. Both of these models assume that the observations about users are statistically independent. This assumption prevents them from being applied to time series data, where following the treatment effect through time is important; a variation of these models would have to acknowledge the autocorrelation in the data. All of these models (the t-test, OLS, and time series variations of them) have different formulas for how to determine if an effect is significant, or just noise. Although these individual models vary, they ultimately only need to return output measuring the expected difference in the potential outcomes.

In addition to creating a path to contribute causal effects models and consolidating three types of treatment effects, Causal Models is able to implement the boilerplate and reduce the amount of code a scientist needs to contribute. All three variations of treatment effects are differences in potential outcomes, where features of the model are controlled to be specific values. They also use the same procedure: (1) train a model, (2) create a copy of the input dataset where treatment is applied to all users, (3) create another copy where treatment is not applied to any user, (4) predict the potential outcomes from each of these data copies, (5) take the average of the predictions, (6) then difference the averages. Finally, a model must implement another method for computing the variance on the treatment effect, so that it can test if the effect is significant or noise. This procedure is a burden to implement for every causal effects model, but it can be reduced through a simple software interface. Each causal effects model that subscribes to the framework only needs two unique methods, then Causal Models completes the work that is common to how all causal effects models compute treatment effects. The interface for an individual model requires methods to:

(1) Train a model on a dataset containing y and X, and optionally W, and t.
(2) Predict the expected value of y for the potential outcomes X = 1, and X = 0.

Causal Models, as a unifying software framework across multiple causal effects models, does the work to prepare the data, invoke the train and predict methods, then aggregate and difference the output. Optionally, a model can also implement methods for ATE, CATE, or DTE directly, for example if there is a specialized computational strategy for them, as in [37]. Finally, the bootstrap in [11] allows Causal Models to compute the variance for an arbitrary causal effects model. Developing this software framework honors the scientific study of causal effects, and is another form of engineering that can allow Netflix XP to better include work from scientists.
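A condensed sketch of this train/predict interface and of the shared potential-outcomes boilerplate is shown below. The class and method names are ours, not the library's, and the OLS model is only a stand-in for a contributed causal effects model.

    # Minimal sketch of the described interface with invented names; not the actual library.
    import pandas as pd
    import statsmodels.formula.api as smf

    class CausalEffectsModel:
        def train(self, data): ...                 # (1) fit the model on y, X and optionally W, t
        def predict(self, data): ...               # (2) predict E[y] for a given dataset

        def ate(self, data):
            # Shared boilerplate: difference the averaged potential outcomes.
            self.train(data)
            treated = data.assign(X=1)             # copy where everyone receives treatment
            control = data.assign(X=0)             # copy where no one receives treatment
            return self.predict(treated).mean() - self.predict(control).mean()

    class OLSModel(CausalEffectsModel):
        def train(self, data):
            self._fit = smf.ols("y ~ X + W", data=data).fit()

        def predict(self, data):
            return self._fit.predict(data)

    df = pd.DataFrame({"y": [1.0, 2.0, 2.5, 3.5], "X": [0, 0, 1, 1], "W": [0.1, 0.4, 0.2, 0.3]})
    print(OLSModel().ate(df))

A variance method (for example via the bootstrap) would complete such a contribution, and CATE or DTE estimates follow the same pattern with W or t held at chosen values.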
6 IMPACT OF SCIENCE-CENTRIC EXPERIMENTATION

In this section, we discuss the impact science-centric experimentation has on the ability to perform deep-dive analysis and contribute new analysis flows to the platform, and reflect on the tangible results the approach brought to Netflix XP.

6.1 Performing deep-dive analysis

The architecture and framework described above make it possible for scientists to easily transition between viewing the calculated results in ABlaze and a deeper dive in Notebooks. To illustrate this flow, it is worth revisiting the NumStreamers metric example (Figure 2) to show how it can be used for further extensions and explorations. Assuming an analysis that includes the number of streamers metric has been calculated, from within ABlaze scientists can, after reviewing the results, click a button which takes them to a generated Jupyter Notebook that replicates the exact same calculations and visualizations [27]. From there, scientists have multiple potential flows.

First, there is the option of viewing and exploring the raw data or a reasonably sized sample. The data is stored as a Pandas dataframe which offers many easy ways for introspection. This flow is particularly useful in cases where a scientist wants to get a better sense of the actual values and their distributions in any of the segments they are interested in. On top of that, it is easy to join the data with other tables that were not part of the initially calculated table in order to enrich it with additional information. Such exploratory flows can prove of tremendous importance in analyzing tests as they provide better understanding of the data and increased insight.

Second, scientists can alter the metric definitions and view updated calculations. For instance, a scientist can redefine the expression for number of streamers to be the people with at least 2 hours of viewing and re-run the statistical analysis. Another common use case is to explore the results of different statistical tests other than the predefined ones. This can be achieved by simply adding a t-test or OLS in the list of statistics of the metric (Figure 2). Lastly, a scientist can choose to visualize the results in any of the supported visualizations just by selecting any of the supported plots.
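Reusing the Figure 2 definition, such a deep-dive edit could look roughly as follows; the TTest and OLS statistic classes are placeholders for whatever statistics Causal Models actually exposes.

    # Sketch of the edits described above, in the style of Figure 2; statistic class names are assumed.
    class NumStreamers(Metric):
        def _expression(self):
            # Redefined: only count users with at least 2 hours of viewing as streamers.
            return Max(If_(self.query.streaming_hours >= 2, 1, 0))

        def _statistics(self):
            # Extend the default statistics with additional tests to compare against.
            return [DescriptiveStats(), ProportionStats(), TTest(), OLS()]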
6.2 Contributing new analysis flows

Within their Notebook, scientists get access to all of the source code from Metrics Repo, Causal Models and XP Viz. This allows them to edit any file they want from within the Notebook, rapidly prototype extensions, and see the impact of the changes. Such a flow, e.g., could be used to explore the definition of a new statistical algorithm. All that is needed is to subclass the causal model base class and conform to the generalized causal models API. In a similar fashion, the definition of a new visualization can be prototyped from within a Notebook. Once the results are satisfactory, the extensions can be promoted back to the project by pushing the changes upstream. This last step essentially closes the loop between ABlaze and Notebook since the new contributions are now available to everyone using the platform.

6.3 Results for Netflix

The above features have already proven to deliver significant value in practice. Within less than a year of the introduction of the new architecture, more than 50 scientists have directly contributed more than 300 metrics. Furthermore, since the development of Causal Models, Netflix XP can broadly take advantage of OLS, quantile bootstrapping, segment discovery, quantile regression, and time series models. OLS, a significant foundational step, dramatically increased statistical power, making metrics more sensitive, and became a launching point for segment discovery and time series analyses.

7 CONCLUSION

In this paper we have introduced the architecture and innovations of Netflix's experimentation platform, which is routinely used to perform online A/B testing for the purposes of informing business decisions. The architecture's design was strongly influenced by a strategic bet to make the platform science-centric and support arbitrary scientific analyses, which led to its being non-distributed and written in Python. Our case-study of the platform has resulted in two novel contributions to the literature: how an experimentation platform can be designed around data science contributions without sacrificing trustworthiness and scalability, and how this is in part achieved by framing the experimentation inference problem generically enough to allow for arbitrary methodologies (in this case via the potential outcomes conceptual framework). Other innovations include compression strategies and low-level statistical optimizations that keep the non-distributed platform performant.

Since the release of the platform described in this paper, there has been a significant increase in the engagement and contributions to the experimentation platform from data scientists. This includes not only the local installation of the experimentation platform tooling, but also direct contributions of metrics and methodologies that have greatly enriched the platform and the analyses it can perform. This "technical symbiosis" of engineers and scientists has greatly increased the pace of innovation in experimentation at Netflix, and has already resulted in even deeper strategy discussions around richer analyses.

The next frontiers for the Netflix experimentation platform revolve around feature-based analyses, automation, and adaptive experiments. Feature-based analyses will allow for richer explorations of the treatment effect and interactions between multiple features in a single experiment. Automation will allow for tests to be programmatically created and modified in response to events on the platform. Adaptive experiments leverage the former two features in order to allow for automated decision making during the test; for example, this might be used to stop tests early if we have sufficient evidence [35, 36] or to use multi-arm bandits to optimally choose per-feature test allocation rates [2, 43]. Working groups of engineers and scientists have already started collaborating on how to best approach these features in a science-centric manner.

We hope that reporting this case study will spark interest in science-centric experimentation platforms, and we welcome feedback from companies or individuals interested in working or collaborating on this important topic.

ACKNOWLEDGMENTS

This work was partially supported by the Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation and by the Software Center. This work has been partially sponsored by the German Ministry of Education and Research (grant no 01Is16043A) and by the Bavarian Ministry of Economic Affairs, Regional Development and Energy through the Centre Digitisation.Bavaria, as part of the Virtual Mobility World (ViM) project.

We also acknowledge Martin Tingley for supporting the science-centric vision, and Rina Chang, Pablo Lacerda de Miranda, Susie Lu, Sri Sri Perangur, and Michael Ramm for their substantial contributions to the experimentation platform.
REFERENCES

[1] Apache Arrow. 2019. Apache Arrow: A cross-language development platform for in-memory data. https://arrow.apache.org/
[2] Susan Athey and Stefan Wager. 2017. Efficient Policy Learning. arXiv:math.ST/1702.02896
[3] Florian Auer and Michael Felderer. 2018. Current state of research on continuous experimentation: A systematic mapping study. In Proceedings - 44th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2018. 335–344. https://doi.org/10.1109/SEAA.2018.00062
[4] Juliette Aurisset, Michael Ramm, and Joshua Parks. 2017. Innovating Faster on Personalization Algorithms at Netflix Using Interleaving. https://medium.com/netflix-techblog/interleaving-in-online-experiments-at-netflix-a04ee392ec55
[5] George Box, Stuart Hunter, and William Hunter. 2005. Statistics for Experimenters. Wiley.
[6] Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3, 2 (Jan 2006), 77–101. https://doi.org/10.1191/1478088706qp063oa
[7] Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Proceedings of the sixth ACM international conference on Web search and data mining. ACM, 123–132.
[8] Pavel Dmitriev, Somit Gupta, Dong Woo Kim, and Garnet Vaz. 2017. A Dirty Dozen. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '17. ACM Press, New York, New York, USA, 1427–1436. https://doi.org/10.1145/3097983.3098024
[9] Dmitriy Ryaboy. [n.d.]. Twitter experimentation: technical overview. https://blog.twitter.com/engineering/en_us/a/2015/twitter-experimentation-technical-overview.html
[10] Dirk Eddelbuettel and Romain François. 2011. Rcpp: Seamless R and C++ Integration. Journal of Statistical Software 40, 8 (2011), 1–18. https://doi.org/10.18637/jss.v040.i08
[11] Bradley Efron and Robert J Tibshirani. 1994. An Introduction to the Bootstrap. CRC Press.
[12] A. Fabijan, P. Dmitriev, H. Holmström Olsson, and J. Bosch. 2018. Effective Online Controlled Experiment Analysis at Large Scale. In 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA). 64–67. https://doi.org/10.1109/SEAA.2018.00020
[13] Aleksander Fabijan, Pavel Dmitriev, Helena Holmström Olsson, and Jan Bosch. 2017. The Evolution of Continuous Experimentation in Software Product Development: From Data to a Data-Driven Organization at Scale. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). IEEE, Buenos Aires, 770–780. https://doi.org/10.1109/ICSE.2017.76
[14] Aleksander Fabijan, Pavel Dmitriev, Helena Holmström Olsson, Jan Bosch, Lukas Vermeer, and Dylan Lewis. 2019. Three Key Checklists and Remedies for Trustworthy Analysis of Online Controlled Experiments at Scale. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). 1–10. https://doi.org/10.1109/ICSE-SEIP.2019.00009
[15] Alessio Farcomeni. 2008. A review of modern multiple hypothesis testing, with particular attention to the false discovery proportion. Statistical Methods in Medical Research 17, 4 (2008), 347–388.
[16] Ronald A Fisher. 1922. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 222, 594-604 (1922), 309–368.
[17] Laurent Gautier. 2019. rpy2 - R in Python. https://rpy2.bitbucket.io/
[18] Corey Grunewald and Matt Jaquish. 2018. Modernizing the Web Playback UI. https://medium.com/netflix-techblog/modernizing-the-web-playback-ui-1ad2f184a5a0
[19] Gaël Guennebaud, Benoît Jacob, et al. 2010. Eigen v3. http://eigen.tuxfamily.org
[20] Somit Gupta, Lucy Ulanova, Sumit Bhardwaj, Pavel Dmitriev, Paul Raff, and Aleksander Fabijan. 2018. The Anatomy of a Large-Scale Experimentation Platform. In 2018 IEEE International Conference on Software Architecture (ICSA). IEEE, 1–109. https://doi.org/10.1109/ICSA.2018.00009
[21] David Issa Mattos, Pavel Dmitriev, Aleksander Fabijan, Jan Bosch, and Helena Holmström Olsson. 2018. An Activity and Metric Model for Online Controlled Experiments. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11271 LNCS. 182–198. https://doi.org/10.1007/978-3-030-03673-7_14
[22] Wenzel Jakob, Jason Rhinelander, and Dean Moldovan. 2019. pybind11 – Seamless operability between C++11 and Python. https://github.com/pybind/pybind11
[23] Raphael Lopez Kaufman, Jegar Pitchforth, and Lukas Vermeer. 2017. Democratizing online controlled experiments at Booking.com. arXiv:1710.08217 [cs] (Oct. 2017). http://arxiv.org/abs/1710.08217
[24] Ron Kohavi, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and Ya Xu. 2012. Trustworthy online controlled experiments: Five Puzzling Outcomes Explained. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '12. ACM Press, New York, New York, USA, 786. https://doi.org/10.1145/2339530.2339653
[25] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M Henne. 2009. Controlled experiments on the web: Survey and practical guide. Data Mining and Knowledge Discovery 18, 1 (2009), 140–181. https://doi.org/10.1007/s10618-008-0114-1
[26] Gopal Krishnan. 2016. Selecting the best artwork for videos through A/B testing. https://medium.com/netflix-techblog/selecting-the-best-artwork-for-videos-through-a-b-testing-f6155c4595f6
[27] Toby Mao, Sri Sri Perangur, and Colin McFarland. [n.d.]. Reimagining Experimentation Analysis at Netflix. https://medium.com/netflix-techblog/reimagining-experimentation-analysis-at-netflix-71356393af21
[28] Nick Nelson. 2016. The Power Of A Picture. https://media.netflix.com/en/company-blog/the-power-of-a-picture
[29] H. H. Olsson and J. Bosch. 2014. From Opinions to Data-Driven Software R&D: A Multi-case Study on How to Close the 'Open Loop' Problem. In 2014 40th EUROMICRO Conference on Software Engineering and Advanced Applications. 9–16. https://doi.org/10.1109/SEAA.2014.75
[30] Donald B Rubin. 2005. Causal Inference Using Potential Outcomes. J. Amer. Statist. Assoc. 100, 469 (2005), 322–331. https://doi.org/10.1198/016214504000001880
[31] Per Runeson and Martin Höst. 2009. Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering 14, 2 (Apr 2009), 131–164. https://doi.org/10.1007/s10664-008-9102-8
[32] Jerzy Splawa-Neyman, Dorota M Dabrowska, and TP Speed. 1990. On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statist. Sci. (1990), 465–472.
[33] Martin Traverso. 2016. Presto: Interacting with petabytes of data at Facebook. https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920/
[34] Stefan Van Der Walt, S Chris Colbert, and Gael Varoquaux. 2011. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering 13, 2 (2011), 22.
[35] John Whitehead. 1997. The Design and Analysis of Sequential Clinical Trials, Revised, 2nd Edition. Wiley.
[36] Christopher Jennison and Bruce W. Turnbull. Group Sequential Methods with Applications to Clinical Trials.
[37] Jeffrey Wong, Randall Lewis, and Matthew Wardrop. 2019. Efficient Computation of Linear Model Treatment Effects in an Experimentation Platform. arXiv:stat.CO/1910.01305
[38] Jeffrey Wooldridge. 2010. Econometric Analysis of Cross Section and Panel Data. The MIT Press, Chapter 4.2.
[39] Huizhi Xie and Juliette Aurisset. 2016. Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix. ACM Press, 645–654. https://doi.org/10.1145/2939672.2939733
[40] Ya Xu and Nanyu Chen. 2016. Evaluating Mobile Apps with A/B and Quasi A/B Tests. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16. ACM Press, New York, New York, USA, 313–322. https://doi.org/10.1145/2939672.2939703
[41] Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015. From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks. In Proc. of KDD'15 (KDD '15). ACM, 2227–2236. https://doi.org/10.1145/2783258.2788602
[42] Zhenyu Zhao, Miao Chen, Don Matheson, and Maria Stone. 2016. Online Experimentation Diagnosis and Troubleshooting Beyond AA Validation. In Proc. of DSAA'16. IEEE, 498–507. https://doi.org/10.1109/DSAA.2016.61
[43] Zhengyuan Zhou, Susan Athey, and Stefan Wager. 2018. Offline Multi-Action Policy Learning: Generalization and Optimization. arXiv:stat.ML/1810.04778
