Peeking at A/B Tests: Why It Matters, and What to Do About It
Ramesh Johari (Stanford University), Pete Koomen (Optimizely, Inc.), Leonid Pekelis (Optimizely, Inc.), and David Walsh (Stanford University)

KDD'17, August 13–17, 2017, Halifax, NS, Canada

Author notes: RJ is a technical advisor to Optimizely, Inc.; this work was completed as part of his work with Optimizely. PK is co-founder and Chief Technology Officer of Optimizely, Inc. LP is a technical advisor to Optimizely, Inc.; this work was completed when he was an employee at Optimizely. This work was completed while DW was employed by Optimizely, Inc.

ABSTRACT

This paper reports on novel statistical methodology, which has been deployed by the commercial A/B testing platform Optimizely to communicate experimental results to their customers. Our methodology addresses the issue that traditional p-values and confidence intervals give unreliable inference. This is because users of A/B testing software are known to continuously monitor these measures as the experiment is running. We provide always valid p-values and confidence intervals that are provably robust to this effect. Not only does this make it safe for a user to continuously monitor, but it empowers her to detect true effects more efficiently. This paper provides simulations and numerical studies on Optimizely's data, demonstrating an improvement in detection performance over traditional methods.

KEYWORDS

A/B testing, sequential hypothesis testing, p-values, confidence intervals

1 INTRODUCTION

Web applications typically optimize their product offerings using randomized controlled trials (RCTs); in industry parlance this is known as A/B testing. The rapid rise of A/B testing has led to the emergence of a number of widely used platforms that handle the implementation of these experiments [10, 20]. The typical A/B test compares the values of a parameter across two variations (control and treatment) to see if one variation offers an opportunity to improve their service, while the A/B testing platform communicates results to the user via standard frequentist parameter testing measures, i.e., p-values and confidence intervals. In doing so, they obtain a very simple "user interface", because these measures isolate the task of analyzing experiments from the details of their design and implementation.

Crucially, the inferential validity of these p-values and confidence intervals requires the separation between the design and analysis of experiments to be strictly maintained. In particular, the sample size must be fixed in advance. Compare this to A/B testing practice, where users often continuously monitor the reported p-values and confidence intervals in order to re-adjust the sample size of an experiment dynamically [14]. Figure 1 shows a typical A/B testing dashboard that enables such behavior.

Figure 1: A typical results page in Optimizely's A/B testing dashboard. The dashboard encourages users to continuously monitor their experiments by providing updated results on experiments in real time.

This "peeking" behavior arises because the opportunity cost of longer experiments is large, so there is value in detecting true effects as quickly as possible, or in giving up if it appears that no effect will be detected soon, so that the user may test something else. Further, most users lack a good prior understanding of both their tolerance for longer experiments and the effect size they seek, frustrating attempts to optimize the sample size in advance. The ability to peek early at results and dynamically trade off maximum detection against minimum sample size therefore seems like a substantial benefit of the real-time data that modern A/B testing environments can provide.

Unfortunately, stopping experiments in an adaptive manner through continuous monitoring of the dashboard will severely and favorably bias the selection of experiments deemed significant. Indeed, very high false positive probabilities can be obtained, well in excess of the nominal desired false positive probability (typically set at 5%). As an example, even with 10,000 samples (quite common in online A/B testing), we find that the false positive probability can easily be inflated by 5-10x. That means that, throughout the industry, users have been drawing inferences that are not supported by their data.
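To make the size of this effect concrete, the following is a minimal simulation sketch (ours, not taken from the paper) of an A/A experiment, i.e., one in which the null hypothesis is true, that is monitored with a standard fixed-horizon two-proportion z-test at many interim looks and stopped at the first significant result. The conversion rate, number of peeks, and number of simulated experiments are illustrative assumptions; a single test run only at the full horizon would reject roughly 5% of the time, while the peeking procedure rejects far more often:

    # Simulation sketch: false positive inflation from continuously monitoring
    # a fixed-horizon test. All constants below are illustrative assumptions.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_max = 10_000      # visitors per variation, as in the 10,000-sample example
    alpha = 0.05        # nominal false positive probability
    rate = 0.10         # true conversion rate in BOTH variations (null is true)
    n_peeks = 100       # number of evenly spaced looks at the dashboard
    n_sims = 2_000      # number of simulated A/A experiments

    def peeked_to_significance():
        """One A/A experiment; True if any interim look shows p < alpha."""
        a = (rng.random(n_max) < rate).cumsum()  # running conversions, variation A
        b = (rng.random(n_max) < rate).cumsum()  # running conversions, variation B
        for n in np.linspace(n_max // n_peeks, n_max, n_peeks, dtype=int):
            pa, pb = a[n - 1] / n, b[n - 1] / n
            pooled = (pa + pb) / 2
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and 2 * stats.norm.sf(abs(pb - pa) / se) < alpha:
                return True  # the user stops early and declares a "winner"
        return False

    fp_rate = np.mean([peeked_to_significance() for _ in range(n_sims)])
    print(f"False positive rate with peeking: {fp_rate:.2f} (nominal {alpha})")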
Our paper presents the approach taken to address this challenge within the large-scale commercial A/B testing platform Optimizely. We develop novel methodology to compute p-values and confidence intervals; our measures, which we call always valid, allow users to continuously monitor the experiment and stop at a data-dependent time of their choosing, while maintaining control over false positive probability at a desired pre-set level. This protects statistically naive users, and lets all users leverage real-time data to trade off detection power and sample size dynamically. As described in the paper, our methods build on classical results in the sequential testing literature in statistics. The methods we describe were implemented in the Optimizely platform in January 2015 as Optimizely Stats Engine, and have been in use across all products including mobile, web, and server-side testing; hundreds of thousands of experiments have been run by thousands of customers since its launch.

In Section 2, we outline the basic A/B testing problem, as well as the typical approach used today. In Section 3, we discuss why continuous monitoring "breaks" the existing paradigm and leads to invalid inference, and we propose a definition of always valid p-values and confidence intervals that admit valid inference despite continuous monitoring of tests by users. In Section 4, we give the approach taken in the Optimizely platform to compute these measures; in particular, they are derived from a novel generalization of the mixture sequential probability ratio test (mSPRT) [16]. In Section 5, we empirically demonstrate that our approach both allows users to control false positive probability, and improves the user's ability to trade off between detection of real effects and the length of the experiment (in an appropriate sense).

The core of our solution is formulated for the basic A/B testing problem with two variations (treatment and control). We conclude in Section 6 by addressing challenges that arise for multivariate testing, where users have many variations and metrics of interest that they compare simultaneously. Multivariate testing immediately gives rise to a severe multiple comparisons problem, where users can overinterpret significant results unless appropriate corrections are applied [21]. In our deployment, always valid p-values are combined with multiple hypothesis testing correction procedures to provide a robust inference platform for experimenters, supporting both continuous monitoring and multivariate testing.

2 PRELIMINARIES

In this section, we describe the typical approach for analyzing A/B tests based on the frequentist theory of hypothesis testing, which we refer to as fixed-horizon testing.

2.1 Experiments and decision rules

One-variation experiment. In a one-variation experiment, we test a single variation (or treatment) against a known baseline. In particular, we suppose independent observations X = (Xn), n = 1, 2, ..., drawn i.i.d. from an exponential family distribution Fθ, where the parameter θ takes values in Θ ⊂ R^p. In this setting, we consider the problem of testing a simple null hypothesis H0 : θ = θ0 against the composite alternative H1 : θ ≠ θ0. Here θ0 is the known baseline of comparison.
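As a rough illustration of the sequential machinery mentioned in the introduction, the sketch below computes the classical mSPRT statistic for this one-variation problem in the normal-mean case with known variance, using a normal mixing distribution over the alternative, and converts it into a p-value sequence that remains valid no matter when monitoring stops. The mixing variance tau2 and the running-minimum construction are illustrative choices drawn from the sequential testing literature; this is a sketch of the general idea, not the exact estimator developed in Section 4:

    # Illustrative sketch of a mixture SPRT (mSPRT) for H0: theta = theta0 with
    # normal observations of known variance sigma2 and a normal mixing
    # distribution N(theta0, tau2) over the alternative. tau2 is an assumed
    # tuning parameter, not a value taken from the paper.
    import numpy as np

    def msprt_p_values(x, theta0, sigma2, tau2):
        """Return one p-value per observation; the sequence is non-increasing
        and stays valid under arbitrarily frequent monitoring."""
        x = np.asarray(x, dtype=float)
        n = np.arange(1, len(x) + 1)
        xbar = np.cumsum(x) / n
        # Closed-form mixture likelihood ratio for the normal case.
        lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
            n ** 2 * tau2 * (xbar - theta0) ** 2
            / (2 * sigma2 * (sigma2 + n * tau2))
        )
        # One way to obtain an always valid p-value: the running minimum of 1/lam.
        return np.minimum.accumulate(np.minimum(1.0, 1.0 / lam))

    # Example with no true effect: the monitored p-value rarely drops below 0.05,
    # however long the stream is watched.
    rng = np.random.default_rng(1)
    p = msprt_p_values(rng.normal(size=5_000), theta0=0.0, sigma2=1.0, tau2=0.01)
    print(p.min())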
Throughout the paper, we index probability distributions by the parameters; e.g., Pθ denotes the probability distribution on the data induced by parameter θ.

Two-variation experiment. In a two-variation experiment, or A/B test, we test two variations (e.g., treatment and control, or A and B) against each other. Here we observe two independent i.i.d. sequences X and Y, corresponding to the observations on visitors receiving experiences A and B respectively. In studying A/B tests, we restrict the data model to the two most common cases encountered in practice: Bernoulli data with success probabilities µA and µB (used to model binary outcomes such as clicks, conversions, etc.), and normal data with means µA and µB and known variance σ² (used to model continuous-valued outcomes such as time on site). In this setting, we consider the problem of testing the null hypothesis H0 : θ := µB − µA = 0 against H1 : θ ≠ 0.

Decision rules. The experimenter needs to decide how long to run the test, and whether to reject the null hypothesis when the test is done. We formalize this process through the notion of a decision rule. Formally, a decision rule is a pair (T, δ), where T is a stopping time that denotes the sample size at which the test is ended, and δ is a binary-valued decision, dependent only on the observations up to time T, where δ = 1 indicates that H0 is rejected. A stopping time is any time that depends only on the data observed up to that time; this definition therefore captures the crucial feature of decision-making in A/B tests that the terminal sample size may be data-dependent.
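To make the decision-rule formalism concrete, here is a minimal illustrative sketch (not from the paper) of two decision rules written as pairs (T, delta): a fixed-horizon rule whose stopping time is a constant chosen before any data are seen, and a data-dependent rule that stops the first time a monitored p-value sequence becomes significant. The function names and the convention of passing in a precomputed p-value sequence are assumptions made for the example. With ordinary fixed-horizon p-values, only the first rule controls the false positive probability; always valid p-values are designed so that data-dependent rules like the second do as well.

    # Sketch of the decision-rule formalism (T, delta); all names are illustrative.
    import numpy as np

    def fixed_horizon_rule(p_values, horizon, alpha=0.05):
        """p_values[n-1] is the p-value computed from the first n observations.
        T is fixed in advance; delta rejects H0 iff the test at T is significant."""
        T = horizon
        delta = int(p_values[T - 1] < alpha)
        return T, delta

    def first_crossing_rule(p_values, n_max, alpha=0.05):
        """Stop at the first sample size where the monitored p-value is below
        alpha, or at n_max if that never happens; here T is data-dependent."""
        below = np.flatnonzero(np.asarray(p_values[:n_max]) < alpha)
        T = int(below[0]) + 1 if below.size else n_max
        delta = int(below.size > 0)
        return T, delta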