arXiv:1803.06336v4 [stat.AP] 12 Sep 2018

Applying the Delta Method in Metric Analytics:
A Practical Guide with Novel Ideas

Alex Deng*                     Ulf Knoblich                   Jiannan Lu
Microsoft Corporation          Microsoft Corporation          Microsoft Corporation
Redmond, WA                    Redmond, WA                    Redmond, WA
[email protected]                [email protected]                [email protected]

* The authors are listed in alphabetical order.

ABSTRACT

During the last decade, the information technology industry has adopted a data-driven culture, relying on online metrics to measure and monitor business performance. Under the setting of big data, the majority of such metrics approximately follow normal distributions, opening up potential opportunities to model them directly without extra model assumptions and to solve big data problems via closed-form formulas using distributed algorithms, at a fraction of the cost of simulation-based procedures like the bootstrap. However, certain attributes of the metrics, such as their data generating processes and aggregation levels, pose numerous challenges for constructing trustworthy estimation and inference procedures. Motivated by four real-life examples in metric development and analytics for large-scale A/B testing, we provide a practical guide to applying the Delta method, one of the most important tools from the classic statistics literature, to address the aforementioned challenges. We emphasize the central role of the Delta method in metric analytics by highlighting both its classic and novel applications.

KEYWORDS

A/B testing, big data, distributed algorithm, large sample theory, online metrics, randomization, quantile inference, longitudinal study

1 INTRODUCTION

1.1 Background

The era of big data brings both blessings and curses [21]. On one hand, it allows us to measure more accurately, to study smaller and subtler effects, and to tackle problems with smaller signal-to-noise ratios. On the other hand, it demands larger storage and more intensive computations, forcing the data science community to strive for efficient algorithms which can run in parallel in a distributed system. Many existing algorithms and methods (e.g., support vector machines [12, 25, 47]) that are known to work well in small data scenarios do not scale well to the big data setting. Recently, there has been an increasing amount of research interest in meta-algorithms that can extend algorithms which are difficult to parallelize into distributed algorithms [6, 52], resembling the divide-and-conquer paradigm [31, 34].

At the same time, there is a class of algorithms which are trivially parallelizable and can therefore downsize big data problems into smaller ones. The key idea behind them, which dates back to Fisher [24], is to summarize the original data set by a low-dimensional vector of summary statistics, which can often be computed in a parallel and distributed way. For example, to estimate the mean and variance of a normal distribution from independent and identically distributed (i.i.d.) observations, we only need their sum and sum of squares, which are the corresponding summary statistics and can be trivially computed in a distributed fashion. In fact, they are sufficient statistics [10], i.e., they can represent the original data-set perfectly without losing any information.
In data-driven businesses such as information technology, these summary statistics are often called metrics, and they are used for measuring and monitoring key performance indicators [15, 18]. In practice, it is often the changes or differences between metrics, rather than the measurements at the most granular level, that are of greater interest. In the context of randomized controlled experimentation (i.e., A/B testing), inferring the changes of metrics can establish causality [36, 44, 50], e.g., whether a new user interface design causes longer view times and more clicks.

We advocate directly modeling the metrics, rather than the original data. When analyzing data at the most granular level (e.g., per user), we need some basic probabilistic model with assumptions such as the i.i.d. property of the underlying observations. When looking at metrics, we also need to know their joint distribution. This is where the blessing of big data comes into play: given a large sample size, the metrics often possess desirable asymptotic properties, thanks to the central limit theorem.

1.2 Central limit theorem and Delta method

While an influential result, the central limit theorem in its basic form only applies to the average of i.i.d. random variables, and in practice our metrics are often more complex. To enjoy the blessing of big data, we employ normal approximations from the Delta method, which extends the central limit theorem more broadly [45]. To ensure that the paper is self-contained, we first review the central limit theorem in its most well-known form; for illustration we only review the uni-variate case. Let X1, ..., Xn be i.i.d. observations with finite mean µ and variance σ² > 0. Let X̄ denote the sample average; then as the sample size n → ∞,

    √n (X̄ − µ)/σ → N(0, 1)  in distribution.

This is arguably one of the most influential results of asymptotic statistics, used in almost all scientific fields whenever an estimate with error bars is presented. A common application of the central limit theorem is to construct the 100(1 − α)% confidence interval of µ as X̄ ± z_{α/2} σ/√n, where z_{α/2} is the (1 − α/2)th quantile of N(0, 1).
The Delta method allows us to extend this asymptotic normality to any smooth transformation of an asymptotically normal statistic. To be specific, consider a random variable Tn (the subscript indicates its dependency on n; Tn is, e.g., a sample average) and a constant θ such that √n(Tn − θ) → N(0, σ²) in distribution as n → ∞. Using the fact that Tn − θ = O(1/√n) and the first order Taylor expansion [43]

    ϕ(Tn) − ϕ(θ) = ϕ′(θ)(Tn − θ) + O{(Tn − θ)²},   (1)

we have, in distribution,

    √n {ϕ(Tn) − ϕ(θ)} → N(0, ϕ′(θ)² σ²).

This is the Delta method. Without relying on any assumptions other than "big data," the Delta method is general. It is also memorable – anyone with basic knowledge of calculus can derive it. Moreover, the calculation is trivially parallelizable and can be easily implemented in a distributed system. Nevertheless, although conceptually and theoretically straightforward, the practical difficulty is to find the right "link" function ϕ that transforms the simple average into our desired metric. Because of the different attributes of the metrics, such as the underlying data generating process and aggregation levels, the process of discovering the corresponding transformation can be challenging. Unfortunately, although various applications of the Delta method have previously appeared in the data mining literature [16, 36, 39, 44], the method itself and the discovery of ϕ were often deemed technical details and only briefly mentioned or relegated to appendices. Motivated by this gap, we aim to provide a practical guide that highlights the effectiveness and importance of the Delta method, hoping to help fellow data scientists, developers and applied researchers conduct trustworthy metric analytics.
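To make the limit statement above concrete, the following minimal sketch (our own illustration, not part of the paper's companion programs; Python with NumPy assumed) checks the Delta method numerically for the transformation ϕ(x) = log x applied to the mean of exponential observations, for which ϕ′(µ)σ = (1/µ) · µ = 1:

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps, mu = 2000, 5000, 2.0   # sample size, replications, E(X)

    # X_i ~ Exponential with mean mu, so sigma = mu; T_n is the sample mean.
    Tn = rng.exponential(mu, size=(reps, n)).mean(axis=1)

    # By the Delta method, sqrt(n) * (log(Tn) - log(mu)) is approximately N(0, 1).
    z = np.sqrt(n) * (np.log(Tn) - np.log(mu))
    print(round(z.mean(), 3), round(z.std(), 3))   # approximately 0 and 1

The empirical standard deviation approaches ϕ′(µ)σ = 1 as n grows, exactly as (1) predicts.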
1.3 Scope and contributions

As a practical guide, this paper presents four applications of the Delta method in real-life scenarios, all of which have been deployed in Microsoft's online A/B testing platform ExP [35, 36] and employed to analyze experimental results on a daily basis:

• In Section 2, we derive the asymptotic distributions of ratio metrics². Compared with standard approaches by Fieller [22, 23], the Delta method provides a much simpler and yet almost equally accurate and effective solution;
• In Section 3, we analyze cluster randomized experiments, where the Delta method offers an efficient alternative to standard statistical machinery known as the mixed effect model [4, 26], and provides unbiased estimates;
• In Section 4, by extending the Delta method to outer confidence intervals [42], we propose a novel hybrid method to construct confidence intervals for quantile metrics with almost no extra computational cost³. Unlike most existing methods, our proposal does not require repeated simulations as in the bootstrap, nor does it require estimating an unknown density function, which is itself often a theoretically challenging and computationally intensive task and has been a center of criticism [7];
• In Section 5, we handle missing data in within-subject studies by combining the Delta method with data augmentation. We demonstrate the effectiveness of the "big data, small problem" approach of directly modeling metrics: compared with alternative methods that need to posit a model for individual subjects, our method requires fewer model assumptions.

The main purpose of this paper is to promote the Delta method as a general, memorable and efficient tool for metric analytics. In particular, our contributions to the existing data mining literature include:

• Practical guidance for scenarios such as inferring ratio metrics, where the Delta method offers scalable and easier-to-implement solutions;
• A novel and computationally efficient solution to estimate the sampling variance of quantile metrics;
• A novel data augmentation technique that employs the Delta method to model metrics resulting from within-subject or longitudinal studies with unspecified missing data patterns.

For reproducibility and knowledge transfer, we provide all relevant computer programs at https://aka.ms/exp/deltamethod.

2 INFERRING PERCENT CHANGES

2.1 Percent change and Fieller interval

Measuring change is a common theme in applied data science. In online A/B testing [15, 17, 36, 37, 44], we estimate the average treatment effect (ATE) by the difference of the same metric measured on the treatment and control groups, respectively. In time series analyses and longitudinal studies, we often track a metric over time and monitor changes between different time points. For illustration, let X1, ..., Xn be i.i.d. observations from the control group with mean µx and variance σx²; Y1, ..., Yn i.i.d. observations from the treatment group with mean µy and variance σy²; and σxy the covariance between the Xi's and Yi's⁴. Let X̄ = Σi Xi/n and Ȳ = Σi Yi/n be the two measurements of the same metric from the control and treatment groups, respectively. Their difference ∆̂ = Ȳ − X̄ is an unbiased estimate of the ATE ∆ = µy − µx. Because both X̄ and Ȳ are approximately normally distributed, their difference ∆̂ also follows an approximately normal distribution with mean ∆ and variance

    Var(∆̂) = (σy² + σx² − 2σxy)/n.

Consequently, the well-known 100(1 − α)% confidence interval of ∆ is ∆̂ ± z_{α/2} √V̂ar(∆̂), where V̂ar(∆̂) is the finite-sample analogue of Var(∆̂) and can be computed using the sample variances and covariance of the treatment and control observations, denoted sy², sx² and sxy respectively.

In practice, however, absolute differences as defined above are often hard to interpret because they are not scale-invariant. Instead, we focus on the relative difference or percent change ∆% = (µy − µx)/µx, estimated by ∆̂% = (Ȳ − X̄)/X̄. The key problem of this section is constructing a 100(1 − α)% confidence interval for ∆̂%. For this classic problem, Fieller [22, 23] seems to be the first to provide a solution.

---
² Pedantically, the ratio of two measurements of the same metric, from the treatment and control groups.
³ Our procedure computes two more quantiles for the confidence interval. Since the main cost of quantile computation is sorting the data, computing three quantiles and computing one cost almost the same.
⁴ For A/B testing, where the treatment and control groups are independently sampled from a super population, σxy = 0.
To be specific, let t_r(α) denote the (1 − α)th quantile of the t distribution with r degrees of freedom, and let g = t²_{α/2}(r) sx²/(n X̄²). Then Fieller's interval for ∆% is

    1/(1 − g) · { Ȳ/X̄ − g sxy/sx² − 1 ± [t_{α/2}(r)/(√n X̄)] √( sy² − 2(Ȳ/X̄) sxy + (Ȳ²/X̄²) sx² − g(sy² − sxy²/sx²) ) }.   (2)

Although widely considered the standard approach for estimating variances of percent changes [38], deriving (2) is cumbersome (see [46] for a modern revisit). The greater problem is that, even given this formula, applying it often requires quite some effort: according to (2), we need to estimate not only the sample variances and covariance, but also the parameter g.

2.2 Delta method and Edgeworth correction

The Delta method provides a more intuitive alternative solution. Although the derivation can be found in classic textbooks such as [10], this paper (as a practical guide) still provides all the relevant technical details. We let Tn = (Ȳ, X̄), θ = (µy, µx) and ϕ(y, x) = y/x. A multivariate analogue of (1) suggests that ϕ(Tn) − ϕ(θ) ≈ ∇ϕ(θ) · (Tn − θ), which implies that

    Ȳ/X̄ − µy/µx ≈ (1/µx)(Ȳ − µy) − (µy/µx²)(X̄ − µx).   (3)

For i = 1, ..., n, let Wi = Yi/µx − µy Xi/µx², which are also i.i.d. observations. Consequently, we can re-write (3) as Ȳ/X̄ − µy/µx ≈ Σi Wi/n, leading to the Delta method based confidence interval

    Ȳ/X̄ ± [z_{α/2}/(√n X̄)] √( sy² − 2(Ȳ/X̄) sxy + (Ȳ²/X̄²) sx² ),   (4)

where the first term is the point estimate and the second term quantifies the uncertainty. (4) is easier to implement than (2), and is in fact the limit of (2) in the large sample case, because as n → ∞ we have t_{α/2}(r) → z_{α/2}, g → 0, and Var(X̄) ∼ O(1/n). Although Fieller's confidence interval can be more accurate for small samples [30], this benefit appears rather limited for big data. Moreover, the Delta method can easily be extended to a better approximation by using the Edgeworth expansion [5, 29]. To be more specific, (3) suggests that in distribution √n (Ȳ/X̄ − µy/µx)/σw → ν, whose cumulative distribution function

    F(t) = Φ(t) − κw (t² − 1) φ(t) / (6√n)

contains a correction term in addition to Φ(t), the cumulative distribution function of the standard normal; here φ(t) is the standard normal density and κw the skewness of the Wi's. By simply replacing the terms ±z_{α/2} in (4) with ν_{α/2} and ν_{1−α/2}, the (α/2)th and (1 − α/2)th quantiles of ν respectively, we obtain the corresponding Edgeworth expansion based confidence interval. Finally, we can apply a second-order bias correction, subtracting (Ȳ sx²/X̄ − sxy)/(n X̄²) from the point estimate in (4); the same correction term can be applied to the Edgeworth expansion based interval.
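To illustrate the difference in implementation effort, here is a sketch of both intervals (our own code, not the paper's companion programs; Python with NumPy/SciPy assumed, paired samples of equal size, and r = n − 1 degrees of freedom taken for Fieller's interval):

    import numpy as np
    from scipy import stats

    def fieller_ci(x, y, alpha=0.05):
        """Fieller's interval (2) for the percent change (mu_y - mu_x) / mu_x."""
        n = len(x)
        xb, yb = x.mean(), y.mean()
        sx2, sy2 = x.var(ddof=1), y.var(ddof=1)
        sxy = np.cov(x, y)[0, 1]                   # 0 in expectation for independent groups
        t = stats.t.ppf(1 - alpha / 2, df=n - 1)   # r = n - 1 is our choice here
        g = t**2 * sx2 / (n * xb**2)
        half = t / (np.sqrt(n) * xb) * np.sqrt(
            sy2 - 2 * (yb / xb) * sxy + (yb / xb) ** 2 * sx2 - g * (sy2 - sxy**2 / sx2))
        center = yb / xb - g * sxy / sx2 - 1
        return (center - half) / (1 - g), (center + half) / (1 - g)

    def delta_ci(x, y, alpha=0.05):
        """Delta method interval (4), shifted by -1 to cover the percent change."""
        n = len(x)
        xb, yb = x.mean(), y.mean()
        sx2, sy2 = x.var(ddof=1), y.var(ddof=1)
        sxy = np.cov(x, y)[0, 1]
        z = stats.norm.ppf(1 - alpha / 2)
        half = z / (np.sqrt(n) * xb) * np.sqrt(
            sy2 - 2 * (yb / xb) * sxy + (yb / xb) ** 2 * sx2)
        return yb / xb - 1 - half, yb / xb - 1 + half

Everything the Delta interval needs is a handful of sample moments, while Fieller's interval additionally tracks g and its interaction with every term under the square root.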
2.3 Numerical examples

To illustrate the performance of Fieller's interval in (2), the Delta method interval in (4), and the Edgeworth interval, with and without bias correction, under different scenarios, we let the sample size n = 20, 50, 200, 2000. For each fixed n, we assume that the treatment and control groups are independent, and consider three simulation models for i.i.d. experimental units i = 1, ..., n:

(1) Normal: Xi ∼ N(µ = 1, σ = 0.1), Yi ∼ N(µ = 1.1, σ = 0.1);
(2) Poisson: Xi ∼ Pois(λ = 1), Yi ∼ Pois(λ = 1.1);
(3) Bernoulli: Xi ∼ Bern(p = 0.5), Yi ∼ Bern(p = 0.6).

The above models aim to mimic the behaviors of our most common metrics, discrete or continuous. For each case, we repeatedly sample M = 10,000 data sets; for each data set we construct Fieller's and the Delta method intervals, then add the bias correction to the Delta method result, and also construct the Edgeworth expansion interval without and with the bias correction.

Table 1: Simulated examples. The first two columns contain the simulation models and sample sizes. The next five columns present the coverage rates of various 95% confidence intervals – Fieller's, Delta method based (w/o and w/ bias correction) and Edgeworth expansion based (w/o and w/ bias correction).

    Method      n     Fieller   Delta    Delta(BC)  Edgeworth  Edgeworth(BC)
    Normal      20    0.9563    0.9421   0.9422     0.9426     0.9426
    Normal      50    0.9529    0.9477   0.9477     0.9478     0.9477
    Normal      200   0.9505    0.9490   0.9491     0.9490     0.9490
    Normal      2000  0.9504    0.9503   0.9503     0.9503     0.9503
    Poisson     20    0.9400    0.9322   0.9370     0.9341     0.9396
    Poisson     50    0.9481    0.9448   0.9464     0.9464     0.9478
    Poisson     200   0.9500    0.9491   0.9493     0.9496     0.9498
    Poisson     2000  0.9494    0.9494   0.9495     0.9494     0.9495
    Bernoulli   20    0.9539    0.9403   0.9490     0.9476     0.9521
    Bernoulli   50    0.9547    0.9507   0.9484     0.9513     0.9539
    Bernoulli   200   0.9525    0.9513   0.9509     0.9517     0.9513
    Bernoulli   2000  0.9502    0.9500   0.9499     0.9501     0.9500

We report the corresponding coverage rates in Table 1, from which we can draw several conclusions. For big data (n ≥ 200), all methods achieve the nominal (i.e., 95%) coverage rate for all simulation models. For small data (n ≤ 50), although Fieller's interval seems more accurate for some simulation models (e.g., normal), the other methods perform comparably, especially after the bias correction. For simplicity of implementation and transparency in applications, we recommend Algorithm 1, which uses the Delta method based interval (4) with the bias correction.

Algorithm 1 Confidence interval for a ratio: Delta method + bias correction
    1: function DeltaCI(X = (X1, ..., Xn); Y = (Y1, ..., Yn); α = 0.05)
    2:   X̄ = mean(X); Ȳ = mean(Y)
    3:   sx² = var(X); sy² = var(Y); sxy = cov(X, Y)
    4:   bc = (Ȳ sx²/X̄ − sxy)/(n X̄²)                ⊲ bias correction term
    5:   pest = Ȳ/X̄ − bc                             ⊲ point estimate
    6:   vest = sy²/X̄² − 2 Ȳ sxy/X̄³ + Ȳ² sx²/X̄⁴    ⊲ variance estimate
    7:   return pest ± z_{1−α/2} × √(vest/n)          ⊲ 100(1 − α)% confidence interval
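The following sketch (ours; Python with NumPy/SciPy assumed) transcribes Algorithm 1 and runs a coverage check in the spirit of Table 1 for the Poisson model; the exact coverage value will vary with the random seed:

    import numpy as np
    from scipy import stats

    def delta_ci_bc(x, y, alpha=0.05):
        """Algorithm 1: Delta interval (4) for Ybar/Xbar with a bias-corrected point estimate."""
        n = len(x)
        xb, yb = x.mean(), y.mean()
        sx2, sy2 = x.var(ddof=1), y.var(ddof=1)
        sxy = np.cov(x, y)[0, 1]
        bc = (yb * sx2 / xb - sxy) / (n * xb**2)       # bias correction term
        pest = yb / xb - bc                            # point estimate
        vest = sy2 / xb**2 - 2 * yb * sxy / xb**3 + yb**2 * sx2 / xb**4
        half = stats.norm.ppf(1 - alpha / 2) * np.sqrt(vest / n)
        return pest - half, pest + half

    # Coverage check, Poisson model with n = 50 (compare the Delta(BC) column of Table 1).
    rng = np.random.default_rng(1)
    n, M, truth = 50, 10_000, 1.1 / 1.0
    cover = 0
    for _ in range(M):
        x = rng.poisson(1.0, n).astype(float)
        y = rng.poisson(1.1, n).astype(float)
        lo, hi = delta_ci_bc(x, y)
        cover += lo <= truth <= hi
    print(cover / M)   # roughly 0.95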
3 DECODING CLUSTER RANDOMIZATION

3.1 The variance estimation problem

Two key concepts in a typical A/B test are the randomization unit – the granularity level at which sampling or randomization is performed – and the analysis unit – the aggregation level of metric computation. Analysis is straightforward when the randomization and analysis units agree [14], e.g., when randomizing by user while also computing the average revenue per user. However, often the randomization unit is a cluster of analysis units (it cannot be more granular than the analysis unit; otherwise the analysis unit would contain observations under both treatment and control, nullifying the purpose of differentiating the two groups). Such cases, sometimes referred to as cluster randomized experiments in the econometrics and statistics literature [1, 33], are quite common in practice, e.g., an enterprise policy prohibiting users within the same organization from receiving different experiences, or the need to reduce bias in the presence of network interference [2, 20, 27]. Perhaps more ubiquitously, for the same experiment we usually have metrics with different analysis units. For example, to meet different business needs, most user-randomized experiments run by ExP contain both user-level and page-level metrics.

We consider two average metrics⁵ of the treatment and control groups, assumed to be independent. Without loss of generality, we only focus on the treatment group with K clusters. For i = 1, ..., K, the ith cluster contains Ni analysis unit level observations Yij (j = 1, ..., Ni). The corresponding average metric is Ȳ = Σij Yij / Σi Ni. We assume that within each cluster the observations Yij are i.i.d. with mean µi and variance σi², and that across clusters (µi, σi, Ni) are also i.i.d.

3.2 Existing and Delta method based solutions

Ȳ is not an average of i.i.d. random variables, and the crux of our analysis is to estimate its variance. Under strict assumptions, closed-form solutions for this problem exist [19, 33]. For example, when Ni = m and σi = σ for all i,

    Var(Ȳ) = (σ² + τ²)/(Km) · {1 + (m − 1)ρ},   (5)

where τ² = Var(µi) is the between-cluster variance and ρ = τ²/(σ² + τ²) is the coefficient of intra-cluster correlation, which quantifies the contribution of the between-cluster variance to the total variance. To facilitate understanding of the variance formula (5), two extreme cases are worth mentioning:

(1) If σ = 0, then for each i = 1, ..., K and all j = 1, ..., Ni, Yij = µi. In this case, ρ = 1 and Var(Ȳ) = τ²/K;
(2) If τ = 0, then µi = µ for all i = 1, ..., K, and the observations Yij are in fact i.i.d. In this case, ρ = 0 and (5) reduces to Var(Ȳ) = σ²/(Km).

Unfortunately, although theoretically sound, (5) has only limited practical value, because the assumptions it makes are unrealistic. In reality, the cluster sizes Ni and the distributions of Yij for each cluster i are different, which means that the µi and σi² are different.

Another common approach is the mixed effect model, also known as multi-level/hierarchical regression [26], where Yij depends on µi and σi², while the parameters themselves follow a "higher order" distribution. Under this setting, we can infer the treatment effect as the "fixed" effect for the treatment indicator term⁶. Stan [9] offers a Markov Chain Monte Carlo (MCMC) implementation aiming to infer the posterior distribution of those parameters, but this requires significant computational effort for big data. Moreover, the estimated ATE, i.e., the coefficient of the treatment assignment indicator, is at the randomization unit (i.e., cluster) level, not the analysis unit level, because it treats all clusters with equal weights; it can be viewed as the effect on the double average (1/K) Σi (Σj Yij / Ni), which is usually different from the population average Ȳ [15]. This distinction does not make a systematic difference when effects across clusters are homogeneous. In practice, however, treatment effects are often heterogeneous, and using mixed effect model estimates without further adjustment steps could lead to severe biases.

On the contrary, the Delta method solves the problem directly from the metric definition. Re-write Ȳ as (Σi Σj Yij)/(Σi Ni). Let Si = Σj Yij, and divide both the numerator and the denominator by K:

    Ȳ = (Σi Si/K) / (Σi Ni/K) = S̄/N̄.

Albeit not an average of i.i.d. random variables, Ȳ is a ratio of two averages of i.i.d. randomization unit level quantities [14]. Therefore, following (3) in Section 2.2,

    Var(Ȳ) ≈ (1/(K µN²)) { σS² − 2 (µS/µN) σSN + (µS²/µN²) σN² }.   (6)

The variance of Ȳ therefore depends on the variance of a centered version of Si (centered by a multiple of Ni, not by the constant E(Si) as typically done). Intuitively, both the variance of the cluster sizes Ni and the variance of the within-cluster sums Si = Σj Yij contribute to (6). In particular, σN² = Var(Ni) is an important contributor to the variance of Ȳ, leading to a practical recommendation for experimenters – it is desirable to make the cluster sizes homogeneous; otherwise it can be difficult to detect small treatment effects due to low statistical power.

---
⁵ Pedantically, they are two measurements of the same metric. We often use "metric" to refer to both the metric itself (e.g., revenue-per-user) and measurements of the metric (e.g., revenue-per-user of the treatment group); the distinction should be clear from the context.
⁶ Y ∼ Treatment + (1|User) in lme4 notation. A detailed discussion is beyond the scope of this paper; see Bates et al. [3], Gelman and Hill [26].
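In code, (6) needs only the per-cluster sums and sizes, which are trivially computed in a distributed fashion. A minimal sketch (ours, with sample moments plugged in for the population quantities; Python/NumPy assumed):

    import numpy as np

    def clustered_mean_var(S, N):
        """Delta method variance (6) of Ybar = sum(S)/sum(N), where S[i] and N[i] are
        the sum of observations and the size of cluster i (i.i.d. across clusters)."""
        S, N = np.asarray(S, float), np.asarray(N, float)
        K = len(S)
        muS, muN = S.mean(), N.mean()
        sS2, sN2 = S.var(ddof=1), N.var(ddof=1)
        sSN = np.cov(S, N)[0, 1]
        return (sS2 - 2 * (muS / muN) * sSN + (muS / muN) ** 2 * sN2) / (K * muN**2)

Each cluster contributes only (Si, Ni), so the computation parallelizes over clusters, matching the sufficient-statistics view of Section 1.1.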
3.3 Numerical examples

Because the coverage property of (6) has been extensively covered in Section 2.3, here we only focus on comparing it with the mixed effect model. We first describe the data generating process, which consists of a two-level hierarchy as described in the previous section. First, at the randomization unit level, let the total number of clusters be K = 1000. To mimic cluster heterogeneity, we divide the clusters into three categories – small, medium and large – and generate the numbers of clusters in the three categories from the multinomial distribution

    (M1, M2, M3)′ ∼ Multinomial{K; p = (1/3, 1/2, 1/6)}.

For the ith cluster, depending on which category it belongs to, we generate Ni and µi in the following way⁷:

• Small: Ni ∼ Poisson(2), µi ∼ N(µ = 0.3, σ = 0.05);
• Medium: Ni ∼ Poisson(5), µi ∼ N(µ = 0.5, σ = 0.1);
• Large: Ni ∼ Poisson(30), µi ∼ N(µ = 0.8, σ = 0.05).

Second, for each fixed i, let Yij ∼ Bernoulli(p = µi) for all j = 1, ..., Ni. This choice is motivated by binary metrics such as page-click-rate, and it lets us determine the ground truth E(Ȳ) = 0.667 by computing the average of the µi weighted by the cluster sizes and the mixture of small, medium and large clusters.

Our goal is to infer E(Ȳ), and we compare the following three methods (a compact sketch of the data generating process follows below):

(1) a naive estimator Ȳ, pretending all observations Yij are i.i.d.;
(2) fitting a mixed effect model with a cluster random effect µi;
(3) using the metric Ȳ as in the first method, but using the Delta method for variance estimation.

---
⁷ The positive correlation between µi and Ni is not important, and the reader can try out the code with different configurations.
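A compact sketch of one draw from this simulation (ours; the guards against empty clusters and out-of-range rates are our additions; Python/NumPy assumed), contrasting the naive and Delta method standard errors:

    import numpy as np

    rng = np.random.default_rng(2)
    K = 1000
    cat = rng.choice(3, size=K, p=[1/3, 1/2, 1/6])   # small / medium / large
    lam = np.array([2, 5, 30])[cat]
    mu = (np.array([0.3, 0.5, 0.8])[cat]
          + np.array([0.05, 0.1, 0.05])[cat] * rng.standard_normal(K))
    N = np.maximum(rng.poisson(lam), 1)              # avoid empty clusters
    S = rng.binomial(N, np.clip(mu, 0, 1))           # per-cluster sum of Bernoulli(mu_i)

    ybar = S.sum() / N.sum()
    naive_var = ybar * (1 - ybar) / N.sum()          # treats all Y_ij as i.i.d.
    muS, muN = S.mean(), N.mean()
    sSN = np.cov(S, N)[0, 1]
    delta_var = (S.var(ddof=1) - 2 * (muS / muN) * sSN
                 + (muS / muN) ** 2 * N.var(ddof=1)) / (K * muN**2)
    print(ybar, np.sqrt(naive_var), np.sqrt(delta_var))   # the naive SE is too small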
Based on the aforementioned data generating mechanism, we repeatedly and independently generate 1000 data sets. For each data set, we compute the point and variance estimates of E(Ȳ) using the naive, mixed effect and Delta methods. We then compute the empirical variance of each of the three estimators and compare it to the average of the estimated variances. We report the results in Table 2.

Table 2: Simulated examples. The first three columns contain the chosen method, the true value of E(Ȳ) and the true standard deviation of the corresponding estimator. The last two columns contain the point estimates and average estimated standard errors.

    Method        Ground Truth  SD(True)  Estimate  Avg. SE(Model)
    Naive         0.667         0.00895   0.667     0.00522
    Mixed effect  0.667         0.00977   0.547     0.00956
    Delta method  0.667         0.00895   0.667     0.00908

Not surprisingly, the naive method under-estimates the true variance – we treated correlated observations as independent. Both the Delta method and the mixed effect model produce satisfactory variance estimates. However, although both the naive and the Delta method correctly estimate E(Ȳ), the mixed effect estimator is severely biased. This shouldn't be a big surprise if we look deeper into the model Yij = α + βi + ϵij with E(ϵij) = 0, where the random effects βi are centered so that E(βi) = 0. The sum of the intercept terms α and βi stands for the per-cluster mean µi, so α represents the average of the per-cluster means: we compute the mean within each cluster first, and then average over clusters. This differs from the metric defined as the simple average of the Yij, in that the former weights all clusters equally while in the latter bigger clusters carry more weight. The two definitions coincide if and only if either there is no heterogeneity (i.e., the per-cluster means µi are all the same) or all clusters have the same size. We can still use the mixed effect model to obtain an unbiased estimate. This requires us to first estimate every βi (and thus µi), and then compute Σi (α + βi) Ni / Σi Ni, applying the correct weights Ni. The mixed effect model with the above formula gives a new estimate, 0.662, much closer to the ground truth.
Unfortunately, it is still hard to get the variance of this new estimator.

In this study we didn't consider the treatment effect. In ATE estimation, the mixed effect model will similarly produce a biased estimate for the same reason, as long as per-cluster treatment effects vary and cluster sizes differ. The fact that the mixed effect model provides a double-average type estimate while the Delta method estimates the "population" mean is analogous to the comparison of the mixed effect model with GEE (generalized estimating equations) [41]. In fact, in the Gaussian case, the Delta method can be seen as the ultimate simplification of GEE's sandwich variance estimator after summarizing the data points into sufficient statistics. But the derivation of GEE is much more involved, while the Delta method can be explained in a few lines; it is not only more memorable, but also provides more insight through (6).

4 EFFICIENT VARIANCE ESTIMATION FOR QUANTILE METRICS

4.1 Sample quantiles and their asymptotics

Although the vast majority of metrics are averages of user telemetry data, quantile metrics form another category that is widely used to focus on the tails of distributions. In particular, this is often the case for performance measurements, where we not only care about an average user's experience, but even more so about those who suffer from the slowest responses. Within the web performance community, quantiles (of, for example, page loading time) at 75%, 95% or 99% often take the spotlight. In addition, the 50% quantile (median) is sometimes used to replace the mean, because it is more robust to outlier observations (e.g., errors in instrumentation). This section focuses on estimating the variances of quantile estimates.

Suppose we have n i.i.d. observations X1, ..., Xn generated by a cumulative distribution function F(x) = P(X ≤ x) with density function f(x)⁸. The theoretical pth quantile of the distribution F is defined as F⁻¹(p). Let X(1), ..., X(n) be the ascending ordering of the original observations. The sample quantile at p is X(np) if np is an integer. Otherwise, let ⌊np⌋ be the floor of np; the sample quantile can then be defined as any number between X(⌊np⌋) and X(⌊np⌋+1), or a linear interpolation of the two⁹. For simplicity we use X(⌊np⌋) here, which does not affect any asymptotic results. It is a well-known fact that, if X1, ..., Xn are i.i.d. observations, then following the central limit theorem and a rather straightforward application of the Delta method, the sample quantile is approximately normal [10, 45]:

    √n { X(⌊np⌋) − F⁻¹(p) } → N( 0, σ² / f²(F⁻¹(p)) ),   (7)

where σ² = p(1 − p). Unfortunately, in practice we rarely have i.i.d. observations. A common scenario is search engine performance telemetry, where we receive an observation (e.g., a page loading time) for each server request or page-view, while randomization is done at a higher level such as device or user. This is the same situation we saw in Section 3, where the Xi are clustered. To simplify later notation, we let Yi = I{Xi ≤ F⁻¹(p)}, where I(·) is the indicator function. Then (6) can be used to compute Var(Ȳ), and (7) holds in the clustered case with σ² = n Var(Ȳ). This generalizes the i.i.d. case, where n Var(Ȳ) = p(1 − p). Note that the Delta method is instrumental in proving (7) itself, but a rigorous proof involves a rather technical last step that is beyond our scope; a formal proof can be found in [45].

---
⁸ We do not consider cases where X has discrete mass and F has jumps; the quantile can then take many values and is not well defined. In practice such cases can be treated as the continuous case with a discrete correction.
⁹ When p = 0.5, the 50% quantile, or median, is often defined as the average of the middle two numbers if we have an even number of observations.
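As a quick numerical check of (7) in the i.i.d. case (our own sketch; Python with NumPy/SciPy assumed), the empirical standard deviation of a sample 95th percentile matches √(p(1 − p)/n) / f(F⁻¹(p)):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n, p, reps = 10_000, 0.95, 2000

    # Sample 95th percentiles of standard normal samples.
    q = np.quantile(rng.standard_normal((reps, n)), p, axis=1)

    f = stats.norm.pdf(stats.norm.ppf(p))          # density at the true quantile
    print(q.std(), np.sqrt(p * (1 - p) / n) / f)   # empirical vs. asymptotic sd from (7)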
4.2 A Delta method solution to a practical issue

Although theoretically sound, the difficulty of applying (7) in practice lies in the denominator f(F⁻¹(p)), whose computation requires the unknown density function f at the unknown quantile F⁻¹(p). A common approach is to estimate f from the observed Xi using non-parametric methods such as kernel density estimation [48]. However, any non-parametric density estimation method trades off bias against variance: to reduce variance, more aggressive smoothing, and hence larger bias, must be introduced into the procedure. This issue is less critical for quantiles in the body of the distribution, e.g., the median, where the density is high and more data exists around the quantile to keep the variance small. As we move into the tail, e.g., the 90th, 95th or 99th percentile, the noise of the density estimate grows, so we must introduce more smoothing and more bias. Because the density appears in the denominator and the density in the tail often decays to 0, a small bias in the estimated density can lead to a big bias in the estimated variance (Brown and Wolfe [7] raised similar criticisms with their simulation study). A second approach is the bootstrap: re-sample the whole dataset many times and compute the quantiles repeatedly. Unlike an average, computing quantiles requires sorting, and sorting in distributed systems (where the data is distributed across the network) requires data shuffling between nodes, which incurs costly network I/O. Thus the bootstrap works well for small-scale data but tends to be too expensive at large scale in its original form (efficient bootstrap on massive data is a research area of its own [34]).

An alternative method without the need for density estimation is therefore desirable, especially from a practical perspective. One such method is the outer confidence interval (outer CI) [40, 42], which produces a closed-form formula for quantile confidence intervals using combinatorial arguments. Recall that Yi = I{Xi ≤ F⁻¹(p)}, so Σi Yi is the count of observations no greater than the quantile. In the aforementioned i.i.d. case, Σi Yi follows a binomial distribution, and consequently, when n is large, √n(Ȳ − p) ≈ N(0, σ²) with σ² = p(1 − p). If the quantile value F⁻¹(p) ∈ [X(r), X(r+1)), then Ȳ = r/n. The above equation can be inverted into a 100(1 − α)% confidence interval for r/n, namely p ± z_{α/2} σ/√n. This means that with probability 1 − α the true quantile lies between the lower rank L = n(p − z_{α/2} σ/√n) and the upper rank U = n(p + z_{α/2} σ/√n) + 1.

The traditional outer CI requires the Xi to be i.i.d. But when there are clusters, σ²/n simply takes a different form, given by the Delta method in (6) instead of p(1 − p)/n, and the result above still holds. Hence the confidence interval for a quantile can be computed in the following steps:

(1) fetch the quantile X(⌊np⌋);
(2) compute Yi = I{Xi ≤ X(⌊np⌋)};
(3) compute µS, µN, σS², σSN, σN²;
(4) compute σ by setting σ²/n equal to the result of equation (6);
(5) compute L, U = n(p ± z_{α/2} σ/√n);
(6) fetch the two ranks X(L) and X(U).

We call this outer CI with pre-adjustment. This method reduces the complexity of computing a quantile and its confidence interval to a Delta method step plus fetching three "ntiles". However, in this algorithm the ranks depend on σ, whose computation depends on the quantile estimate (more specifically, computing the Yi requires a pass through the data after the quantile is estimated). This means that this algorithm requires a first ntile retrieval, then a pass through the data to compute σ, and then another two ntile retrievals. It turns out that computing all three ntiles in one stage is much more efficient than splitting them into two stages. This is because retrieving ntiles can be optimized in the following way: if we only need to fetch tail ranks, it is pointless to sort data away from the tail; we can use a sketching algorithm to narrow down the possible range where our ranks reside and only sort within that range, making it even more efficient to retrieve multiple ntiles at once. Along this line of thought, to make the algorithm more efficient, we notice that (7) also implies that the change from i.i.d. to clustered Xi only requires an adjustment to the numerator σ, which is a simple re-scaling step, and the correction factor does not depend on the unknown density function f in the denominator. If the outer CI provides a good confidence interval in the i.i.d. case, a re-scaled outer CI with the same correction factor should also work for the clustered case, at least when n is large. This leads to the outer CI with post-adjustment algorithm:

(1) compute L, U = n( p ± z_{α/2} √(p(1 − p)/n) );
(2) fetch X(⌊np⌋), X(L) and X(U);
(3) compute Yi = I{Xi ≤ X(⌊np⌋)};
(4) compute µS, µN, σS², σSN, σN²;
(5) compute σ by setting σ²/n equal to the result of equation (6);
(6) compute the correction factor σ/√(p(1 − p)) and apply it to X(L) and X(U).

We implemented this latter method in ExP using Apache Spark [51] and SCOPE [11].
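A self-contained sketch of the post-adjustment algorithm (ours, not the production Spark/SCOPE implementation; Python/NumPy/SciPy assumed; the rank rounding and the re-scaling of the endpoints around the point estimate are our reading of steps (1) and (6)):

    import numpy as np
    from scipy import stats

    def outer_ci_post(x, cluster, p=0.95, alpha=0.05):
        """Quantile and outer CI with post-adjustment for clustered data."""
        x = np.asarray(x, float)
        n = len(x)
        xs = np.sort(x)
        z = stats.norm.ppf(1 - alpha / 2)

        # Steps (1)-(2): i.i.d. ranks and the three 'ntiles'.
        half = z * np.sqrt(p * (1 - p) / n)
        qhat = xs[int(np.floor(n * p)) - 1]                 # X_(floor(np))
        lo = xs[max(int(np.floor(n * (p - half))) - 1, 0)]
        hi = xs[min(int(np.ceil(n * (p + half))), n) - 1]

        # Steps (3)-(5): sigma^2/n from (6) with Y_i = 1{X_i <= qhat}.
        y = (x <= qhat).astype(float)
        ids = np.unique(cluster)
        S = np.array([y[cluster == c].sum() for c in ids])
        N = np.array([np.sum(cluster == c) for c in ids])
        K, muS, muN = len(ids), S.mean(), N.mean()
        sSN = np.cov(S, N)[0, 1]
        var_ybar = (S.var(ddof=1) - 2 * (muS / muN) * sSN
                    + (muS / muN) ** 2 * N.var(ddof=1)) / (K * muN**2)

        # Step (6): re-scale the interval by sigma / sqrt(p(1-p)).
        corr = np.sqrt(n * var_ybar / (p * (1 - p)))
        return qhat, qhat - corr * (qhat - lo), qhat + corr * (hi - qhat)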
4.3 Numerical examples

To test the validity and performance of the adjusted outer CI method, we compare its coverage to that of a standard non-parametric bootstrap (1000 replicates). The simulation setup consists of Nu = 100, ..., 10000 users (clusters) with Nᵢᵘ = 1, ..., 10 observations each (the Nᵢᵘ are uniformly distributed). Each observation is the sum of two independent random variables, Xᵢᵘ = Xi + Xu, where Xu is constant for each user. We consider two cases, one symmetric and one heavy-tailed distribution:

• Normal: Xi, Xu iid ∼ N(µ = 0, σ = 1);
• Log-normal: Xi, Xu iid ∼ Log-normal(µ = 0, σ = 1).

First, we find the "true" 95th percentile of these distributions by computing its value for a very large sample (Nu = 10⁷). Second, we compute the confidence intervals for M = 10000 simulation runs using the bootstrap and the outer CI with pre- and post-adjustment, and compare their coverage estimates (≈ 0.002 standard error), shown in Table 3.

Table 3: Simulated examples. The first two columns contain the simulated models and sample sizes. The last three columns contain the coverage rates of the bootstrap, pre-adjusted and post-adjusted outer confidence intervals.

    Distribution  Nu      Bootstrap  Outer CI (pre-adj.)  Outer CI (post-adj.)
    Normal        100     0.9039     0.9465               0.9369
    Normal        1000    0.9500     0.9549               0.9506
    Normal        10000   0.9500     0.9500               0.9482
    Log-normal    100     0.8551     0.9198               0.9049
    Log-normal    1000    0.9403     0.9474               0.9421
    Log-normal    10000   0.9458     0.9482               0.9479

We found that when the sample contains 1000 or more clusters, all methods provide good coverage. Pre- and post-adjusted outer CI results are both very close to the much more computationally expensive bootstrap (in our un-optimized simulations, the outer CI method was ≈ 20 times faster than the bootstrap). When the sample size was smaller than 1000 clusters, the bootstrap was noticeably inferior to the outer CI. For all sample sizes, pre-adjustment provided slightly higher coverage than post-adjustment, and this difference increased for smaller samples. In addition, because the adjustment tends to widen the confidence interval, unadjusted ranks are more likely to coincide with the quantile value itself, in which case post-adjustment is more likely to underestimate the variance. In conclusion, the post-adjusted outer CI works very well for large sample sizes (Nu ≥ 1000) and reasonably well for smaller samples, though with slightly inferior coverage compared to pre-adjustment. For big data, the outer CI with post-adjustment is recommended for its efficiency, while for medium to small sample sizes the outer CI with pre-adjustment is preferred if accuracy is paramount.
5 MISSING DATA AND WITHIN-SUBJECT ANALYSES

5.1 Background

Consider the case where we track a metric over time. For example, businesses need to track key performance indicators like user engagement and revenue daily or weekly for a given audience. To simplify, say we are tracking weekly visits-per-user, defined as a simple average X̄t for each week t = 1, 2, .... If there is a drop between X̄t−1 and X̄t, how do we set up an alert that avoids triggering on random noise, so that we control the false alarm rate? Looking at the variance,

    Var(X̄t − X̄t−1) = Var(X̄t) + Var(X̄t−1) − 2 Cov(X̄t, X̄t−1),

we need a good estimate of the covariance term, because we know it is non-zero.

Normally, estimating the covariance of two sample averages is trivial, and the procedure is very similar to estimating a sample variance. But there is something special about this case – missing data. Not everyone uses the product every week. For any online or offline metric, we can only define the metric using observed data. If for every user we observed its value in both week t − 1 and week t, the covariance could be estimated using the sample covariance formula. But if there are many users who appear in week t − 1 but not in week t, it is unclear how to proceed. The naive approach is to estimate the covariance using complete observations, i.e., users who appear in both weeks. However, this only works if the data are missing completely at random. In reality, active users show up more often and are more likely to appear in both weeks, so the missing data are obviously not random.

In this section we show how the Delta method can be very useful for estimating Cov(X̄t, X̄t′) for any two time points t and t′. We then use this to show how we can analyze within-subject studies, also known as pre-post, repeated measurement, or longitudinal studies. Our method starts with the metrics directly, in sharp contrast to traditional methods such as mixed effect models, which build models from individual users' data. Because we study the metrics directly, our model is small and easy to solve, with complexity constant in the scale of the data.¹⁰

5.2 Methodology

How do we estimate the covariance under a completely unknown missing data mechanism? There are many existing works on handling missing data. One approach is to model the propensity of a data point being missing using other observed predictors [13]. This requires additional covariates/predictors to be observed, plus the rather strong assumption that, conditional on these predictors, the data are missing completely at random. We present a novel idea using the Delta method after data augmentation, without the need to model the missing data mechanism. Specifically, we use an additional indicator for the presence/absence status of each user in each period t: for user i in period t, let Iit = 1 if user i appears in period t, and 0 otherwise. For each user i in period t, instead of one scalar metric value Xit, we augment it to a vector (Iit, Xit), and when Iit = 0, i.e., the user is missing, we define Xit = 0. Under this simple augmentation, the metric value X̄t, the average over the non-missing measurements in period t, is the same as Σi Xit / Σi Iit. In this connection,

    Cov(X̄t, X̄t′) = Cov( Σi Xit / Σi Iit , Σi Xit′ / Σi Iit′ ) = Cov( X̄t/Īt , X̄t′/Īt′ ),

where on the right-hand side X̄t and Īt denote averages of the augmented Xit and Iit over the same set of users, and the last equality follows from dividing both numerator and denominator by the same total number of users who appear in at least one of the two periods.¹¹ Thanks to the central limit theorem, the vector (Īt, X̄t, Īt′, X̄t′) is asymptotically (multivariate) normal with a covariance matrix Σ that can be estimated using sample variances and covariances, because after augmentation there is no missing data. In Section 2 we already applied the Delta method to compute the variance of a ratio of metrics by taking the first order Taylor expansion. Here, we expand the two ratios to their first order linear forms, (X̄t − µXt)/µIt − µXt (Īt − µIt)/µIt², where µXt and µIt are the means of Xit and Iit (the expansion for t′ is similar). Cov(X̄t/Īt, X̄t′/Īt′) can then be computed using Σ.

---
¹⁰ Part of the work in this section previously appeared in a technical report [28]. Since authoring that report, the authors have developed a better understanding of the differences between mixed effect models and the proposed method, and we present the new findings here. The technical report has more details about other types of repeated measurement models.
¹¹ If there are more than two periods, we can either use only the users who appear in at least one of the two periods, or the users who appear in at least one of all the periods. It is mathematically the same if we add more users and treat them as absent in both I and X; X̄t remains the same.
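A minimal sketch of this computation (ours; Python/NumPy assumed), taking augmented vectors over the union of users and carrying out the linearization explicitly rather than through the matrix Σ:

    import numpy as np

    def cov_metric(I_t, X_t, I_s, X_s):
        """Cov(Xbar_t, Xbar_s) of two observed averages under data augmentation.
        I_*: presence indicators over the union of users; X_* is 0 where absent."""
        n = len(I_t)
        mI_t, mX_t = I_t.mean(), X_t.mean()
        mI_s, mX_s = I_s.mean(), X_s.mean()
        # First-order linearization of (mean of X) / (mean of I) for each period.
        W_t = (X_t - mX_t) / mI_t - mX_t * (I_t - mI_t) / mI_t**2
        W_s = (X_s - mX_s) / mI_s - mX_s * (I_s - mI_s) / mI_s**2
        return np.cov(W_t, W_s)[0, 1] / n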
The naive approach is Since authoring the technical report, the authors developed a better understanding − of the differences between mixed effect models and the proposed method, and we to estimate the covariance using complete observations, i.e. users present our new findings here. The technical report has more details about other types who appear in both weeks. However, this only works if the data is of repeated measurement models. missing completely at random. In reality, active users will show up 11Actually, if there are more than two period, we can either use only users appeared in any of the two periods, or users appeared in any of all the periods. It is mathematically more often and are more likely to appear in both weeks, meaning the same thing if we added more users and then treat them as not appeared in I and the missing data are obviously not random. X , i.e. X¯t remains the same. 7 group II, the assignment order is reversed. We often pick the two We simulate the following 1000 times: each time we run both the groups using random selection, so each period is an A/B test by mixed effect model Xit IsTreatment + Time + (1|User) as well as ∼ itself. the additive cross-over model (8) and record their estimates of the Let X¯it ,i = 1, 2,t = 1, 2 be the metric for group i in period t. Let ATE and the corresponding estimated variance. We then estimate X = X¯11, X¯12, X¯21, X¯22 . We know X N µ, ΣX for a mean vector the true estimator variance using the sample variance of those es- ( ) ∼ ( ) µ and covariance matrix ΣX. ΣX has the form diag Σ, Σ since the timates among 1000 trials and compare that to the mean of the ( ) two groups are independent with the same distribution. With the estimated variance to evaluate the quality of variance estimations. help of the Delta method, we can estimate Σ from the data and treat We also compute the average of ATE estimates and compare to the it as a constant covariance matrix (hence ΣX). Our model concerns true ATE to assess the bias. the mean vector µ θ using other parameters which represent our Table 4: Simulated examples: The first three columns contain the method, ( ) true ATE and standard deviations of the corresponding methods. The last two interest. In a cross-over design, θ = θ1,θ2, ∆ where the first two ( ) columns contain the point estimates and average estimated standard errors. are baseline means for the two periods and our main interest is the treatment effect ∆. (We assume there is no treatment effect Ground Truth SD(True) Estimate Avg. SE(Model) for group I carried over to the 2nd period. To study carry over mixedeffect 6.592 0.1295 7.129 0.1261 effect, a more complicated design needs to be employed.) In this Deltamethod 6.592 0.1573 6.593 0.1568 parameterization, Table 4 summarizes the results. We found both methods pro- vide good variance estimation and the mixed effect model shows µ θ = θ1 + ∆,θ2, θ1,θ2 + ∆ . (8) ( ) ( ) smaller variance. However, mixed effect also displays an upward bias in the ATE estimate while the Delta method closely tracks the The maximum likelihood estimator and Fisher information the- true effect. To further understand the difference between the two, ory [45] paved a general way for us to estimate θ as well as the we separate the data into two groups: users with complete data, variance of the estimators for various mean vector models. For ex- and users who only appear in one period (incomplete group). 
We ample, if we want to model the relative change directly, we just run the mixed effect model for the the two groups separately. Note need to change the addition of ∆ into multiplication of 1 + ∆ . ( ) that in the second group each user only appears once in the data, Notice the model we are fitting here is very small, almost a toy so the model is essentially a linear model. Our working hypothesis example from a text book. All the heavy lifting is in computing is the following: because the mixed effect model assumes a fixed ΣX which is dealt with by the Delta method where all computa- treatment effect, the effect for the complete and incomplete groups tions are trivially distributed. When µ θ = Mθ is linear as in the ( ) must be the same. The mixed effect model can take advantage of additive cross-over model above, θˆ = MT Σ 1M 1MT Σ 1X and ( −X )− −X this assumption and construct an estimator by weighted average of Var θ = I 1 where the Fisher Information I = MT Σ 1M. ( ) − −X the two estimates from the two groups, with the optimal weight in- versely proportional to their variances. Table 5 shows the weighted 5.3b Numerical examples average estimator is indeed very close to mixed effect model for We simulate two groups with 1000 users each. The first group re- both estimate and variance. The weighted average is closer to the ceives treatment in period 1, then control in period 2, while the complete group because the variance there is much smaller than second group receives the inverse. For each user, we impose an that of the incomplete group since within-subject comparison sig- nificantly reduces noise. This explains why the mixed effect model independent user effect ui that follows a normal distribution with can produce misleading results in within-subject analysis when- mean 10 and standard deviation 3, and independent noises ϵit with mean 0 and standard deviation 2. Each user’s base observations (be- ever missing data patterns can be correlated with the treatment effect. The Delta method, on the other hand, offers a flexible and fore treatment effect) for the two periods are ui +ϵi1,ui +ϵi2 . We ( ) then model the missing data and treatment effect such that they robust solution. Table 5: Simulated examples: Point and variance estimates using the mixed are correlated. We define a user’s engagement level li by its user effect vs. weighted average estimators. effect ui through P U < ui , i.e. the probability that a user’s u is ( ) bigger than another random user. We model the treatment effect Estimate Var ∆i as an additive normal with mean 10 and standard deviation 0.3, mixedeffectmodel 7.1290 0.00161 multiplied by li . For each user and each of the two periods, there mixed effect modeloncompletegroup 7.3876 0.00180 is a 1 - max(0.1, li ) probability of this user being missing. We can linear model on incomplete group 5.1174 0.01133 interpret the missing data as a user not showing up, or as a failure weighted avg estimator 7.0766 0.00155 to observe. In this model, without missing data, every user has two observations and the average treatment effect should be E ∆i = 5 ( ) because E li = 0.5. Since we have missing data and it is more 6 CONCLUDING REMARKS ( ) likely for lower engagement levels to be missing, we expect the av- Measure everything is not only an inspiring slogan, but also a cru- erage treatment effect for the population of all observed userstobe cial step towards the holy grail of data-driven decision making. In between 5 to 10. 
Table 4: Simulated examples. The first three columns contain the method, the true ATE and the true standard deviation of the corresponding estimator. The last two columns contain the point estimates and average estimated standard errors.

    Method        Ground Truth  SD(True)  Estimate  Avg. SE(Model)
    Mixed effect  6.592         0.1295    7.129     0.1261
    Delta method  6.592         0.1573    6.593     0.1568

Table 4 summarizes the results. Both methods provide good variance estimation, and the mixed effect model shows smaller variance. However, the mixed effect model also displays an upward bias in the ATE estimate, while the Delta method closely tracks the true effect. To further understand the difference between the two, we separate the data into two groups: users with complete data, and users who appear in only one period (the incomplete group). We run the mixed effect model on the two groups separately. Note that in the second group each user appears only once in the data, so the model is essentially a linear model. Our working hypothesis is the following: because the mixed effect model assumes a fixed treatment effect, the effects for the complete and incomplete groups must be the same. The mixed effect model can take advantage of this assumption and construct an estimator as a weighted average of the two groups' estimates, with the optimal weights inversely proportional to their variances. Table 5 shows that this weighted average estimator is indeed very close to the mixed effect model in both estimate and variance. The weighted average leans towards the complete group, because its variance is much smaller than that of the incomplete group – within-subject comparisons significantly reduce noise. This explains why the mixed effect model can produce misleading results in within-subject analyses whenever missing data patterns are correlated with the treatment effect. The Delta method, on the other hand, offers a flexible and robust solution.

Table 5: Simulated examples. Point and variance estimates of the mixed effect model vs. the weighted average estimator.

    Method                                Estimate  Var
    Mixed effect model                    7.1290    0.00161
    Mixed effect model on complete group  7.3876    0.00180
    Linear model on incomplete group      5.1174    0.01133
    Weighted avg estimator                7.0766    0.00155

6 CONCLUDING REMARKS

"Measure everything" is not only an inspiring slogan, but also a crucial step towards the holy grail of data-driven decision making. In the big data era, innovative technologies have tremendously advanced user telemetry and feedback collection, and distilling insights and knowledge from the collected data is an imperative task for business success.
To do so, we typically apply statistical models at the level of individual observations, fit the models using numerical procedures such as solving optimization problems, and draw conclusions and assess uncertainties from the fitted models. However, for big data such an analytical procedure can be very challenging. Based on the key observation that metrics are approximately normal thanks to the central limit theorem, this paper offers an alternative perspective by advocating modeling the metrics, i.e., summaries or aggregations of the original data sets, in a direct manner. By doing so, we can decompose big data into small problems. Although conceptually sound, in practice these metric-level models often involve nonlinear transformations of the data or complex data generating mechanisms, posing several challenges for trustworthy and efficient metric analytics.

To address these issues, we promoted the Delta method's central role in extending the central limit theorem to new territories. We demonstrated how to apply the Delta method in four real-life applications and illustrated how this approach naturally leads to trivially parallelizable and highly efficient implementations. Among these applications, ratio metrics, cluster randomization and quantile metrics are all common and important scenarios for A/B testing, and for business analytics in general. Within-subject studies are becoming more popular for their superior sensitivities, and missing data with unknown mechanisms are ubiquitous in both the online and offline worlds. Our contributions in technical improvements, novel ideas and new understandings include the bias correction for ratio metric estimation, the combination of the Delta method and outer confidence intervals for quantile metric variance estimation, and the idea of data augmentation for general missing data problems in within-subject studies. We also revealed the connection between the Delta method and mixed effect models, and explained their differences. In addition, we pointed out the advantage of the Delta method in the presence of an unknown missing data mechanism. Overall, we hope this paper can serve as a practical guide for applying the Delta method in large-scale metric analyses and A/B tests, so that it is no longer just a technical detail, but a starting point and a central piece of analytical thinking.

Although the Delta method can help tackle big data problems, it does not replace the need for rigorous experimental designs and probabilistic modeling. For example, "optimal" choices of cluster configurations, randomization mechanisms and data transformations are known to increase the sensitivities of metrics [8, 32, 49]. We leave them as future research directions.
ACKNOWLEDGMENTS
We benefited from insightful discussions with Ya Xu and Daniel Ting on quantile metrics and outer confidence intervals. We thank Jong Ho Lee for implementing the efficient quantile metric estimation and confidence interval construction in ExP using Apache Spark, and Yu Guo for previous work on repeated measurements.

REFERENCES
[1] Susan Athey and Guido W Imbens. 2017. The econometrics of randomized experiments. Handbook of Economic Field Experiments 1 (2017), 73–140.
[2] Lars Backstrom and Jon Kleinberg. 2011. Network bucket testing. In Proceedings of the 20th International Conference on World Wide Web. ACM, 615–624.
[3] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. 2014. Fitting linear mixed-effects models using lme4. arXiv preprint arXiv:1406.5823 (2014).
[4] Douglas Bates, Martin Maechler, Ben Bolker, Steven Walker, et al. 2014. lme4: Linear mixed-effects models using Eigen and S4. R package version 1, 7 (2014), 1–23.
[5] Dennis D Boos and Jacqueline M Hughes-Oliver. 2000. How large does n have to be for Z and t intervals? The American Statistician 54, 2 (2000), 121–128.
[6] Léon Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010. Springer, 177–186.
[7] Morton B Brown and Robert A Wolfe. 1983. Estimation of the variance of percentile estimates. Computational Statistics & Data Analysis 1 (1983), 167–174.
[8] Roman Budylin, Alexey Drutsa, Ilya Katsev, and Valeriya Tsoy. 2018. Consistent transformation of ratio metrics for efficient online controlled experiments. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 55–63.
[9] Bob Carpenter, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Michael A Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2016. Stan: A probabilistic programming language. Journal of Statistical Software 20 (2016), 1–37.
[10] George Casella and Roger L Berger. 2002. Statistical Inference, Second Edition. Duxbury Press: Pacific Grove, CA.
[11] Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. 2008. SCOPE: Easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment 1 (2008), 1265–1276.
[12] Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning 20 (1995), 273–297.
[13] M. Davidian, A. A. Tsiatis, and S. Leon. 2005. Semiparametric estimation of treatment effect in a pretest-posttest study with missing data. Statist. Sci. 20, 3 (2005), 295–301.
[14] A. Deng, J. Lu, and J. Litz. 2017. Trustworthy analysis of online A/B tests: Pitfalls, challenges and solutions. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 641–649.
[15] A. Deng and X. Shi. 2016. Data-driven metric development for online controlled experiments: Seven lessons learned. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[16] Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Proceedings of the 6th ACM WSDM Conference. 123–132.
[17] Pavel Dmitriev, Somit Gupta, Dong Woo Kim, and Garnet Vaz. 2017. A Dirty Dozen: Twelve common metric interpretation pitfalls in online controlled experiments. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). ACM, New York, NY, USA, 1427–1436. https://doi.org/10.1145/3097983.3098024
[18] Pavel Dmitriev and Xian Wu. 2016. Measuring metrics. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management. ACM, 429–437.
[19] Allan Donner. 1987. Statistical methodology for paired cluster designs. American Journal of Epidemiology 126, 5 (1987), 972–979.
[20] Dean Eckles, Brian Karrer, and Johan Ugander. 2017. Design and analysis of experiments in networks: Reducing bias from interference. Journal of Causal Inference 5, 1 (2017).
[21] Jianqing Fan, Fang Han, and Han Liu. 2014. Challenges of big data analysis. National Science Review 1 (2014), 293–314.
[22] Edgar C Fieller. 1940. The biological standardization of insulin. Supplement to the Journal of the Royal Statistical Society 7, 1 (1940), 1–64.
[23] Edgar C Fieller. 1954. Some problems in interval estimation. Journal of the Royal Statistical Society. Series B (Methodological) (1954), 175–185.
[24] Ronald Aylmer Fisher. 1922. On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 222 (1922), 309–368.
[25] Pedro A Forero, Alfonso Cano, and Georgios B Giannakis. 2010. Consensus-based distributed support vector machines. Journal of Machine Learning Research 11, May (2010), 1663–1707.
[26] Andrew Gelman and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
[27] Huan Gui, Ya Xu, Anmol Bhasin, and Jiawei Han. 2015. Network A/B testing: From sampling to estimation. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 399–409.
[28] Yu Guo and Alex Deng. 2015. Flexible online repeated measures experiment. arXiv preprint arXiv:1501.00450 (2015).
[29] Peter Hall. 2013. The Bootstrap and Edgeworth Expansion. Springer Science & Business Media.
[30] Joe Hirschberg and Jenny Lye. 2010. A geometric comparison of the delta and Fieller confidence intervals. The American Statistician 64 (2010), 234–241.
[31] Michael I Jordan, Jason D Lee, and Yun Yang. 2018. Communication-efficient distributed statistical inference. J. Amer. Statist. Assoc., in press (2018). doi:10.1080/01621459.2018.1429274.
[32] Eugene Kharitonov, Alexey Drutsa, and Pavel Serdyukov. 2017. Learning sensitive combinations of A/B test metrics. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 651–659.
[33] Neil Klar and Allan Donner. 2001. Current and future challenges in the design and analysis of cluster randomization trials. Statistics in Medicine 20, 24 (2001), 3729–3740.
[34] Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar, and Michael I Jordan. 2014. A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76, 4 (2014), 795–816.
[35] Ronny Kohavi, Thomas Crook, Roger Longbotham, Brian Frasca, Randy Henne, Juan Lavista Ferres, and Tamir Melamed. 2009. Online experimentation at Microsoft. In Proceedings of the Third International Workshop on Data Mining Case Studies, held at the 5th ACM SIGKDD Conference. 11–23.
[36] Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. Online controlled experiments at large scale. Proceedings of the 19th ACM SIGKDD Conference (2013).
[37] Ron Kohavi, Randal M Henne, and Dan Sommerfield. 2007. Practical guide to controlled experiments on the web: Listen to your customers not to the hippo. In Proceedings of the 13th ACM SIGKDD Conference. 959–967.
[38] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M Henne. 2009. Controlled experiments on the web: Survey and practical guide. Data Mining and Knowledge Discovery 18, 1 (2009), 140–181.
[39] R. Kohavi, R. Longbotham, and T. Walker. 2010. Online experiments: Practical lessons. Computer 43, 9 (Sept 2010), 82–85.
[40] Daniel Krewski. 1976. Distribution-free confidence intervals for quantile intervals. J. Amer. Statist. Assoc. 71, 354 (1976), 420–422.
[41] Kung-Yee Liang and Scott L Zeger. 1986. Longitudinal data analysis using generalized linear models. Biometrika 73, 1 (1986), 13–22.
[42] John S Meyer. 1987. Outer and inner confidence intervals for finite population quantile intervals. J. Amer. Statist. Assoc. 82, 397 (1987), 201–204.
[43] Walter Rudin et al. 1964. Principles of Mathematical Analysis. Vol. 3. McGraw-Hill, New York.
[44] Diane Tang, Ashish Agarwal, Deirdre O'Brien, and Mike Meyer. 2010. Overlapping experiment infrastructure: More, better, faster experimentation. Proceedings of the 16th ACM SIGKDD Conference (2010).
[45] Aad W Van der Vaart. 2000. Asymptotic Statistics. Vol. 3. Cambridge University Press.
[46] Ulrike Von Luxburg and Volker H Franz. 2009. A geometric approach to confidence sets for ratios: Fieller's theorem, generalizations and bootstrap. Statistica Sinica (2009), 1095–1117.
[47] Dongli Wang and Yan Zhou. 2012. Distributed support vector machines: An overview. In Control and Decision Conference (CCDC), 2012 24th Chinese. IEEE, 3897–3901.
[48] Larry Wasserman. 2003. All of Statistics: A Concise Course in Statistical Inference. Springer.
[49] Huizhi Xie and Juliette Aurisset. 2016. Improving the sensitivity of online controlled experiments: Case studies at Netflix. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 645–654.
[50] Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015. From infrastructure to culture: A/B testing challenges in large scale social networks. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2227–2236.
[51] Matei Zaharia, Reynold S Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J Franklin, et al. 2016. Apache Spark: A unified engine for big data processing. Commun. ACM 59, 11 (2016), 56–65.
[52] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola. 2010. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems. 2595–2603.
