Inference for Batched Bandits

Kelly W. Zhang
Department of Computer Science, Harvard University
[email protected]

Lucas Janson
Department of Statistics, Harvard University
[email protected]

Susan A. Murphy
Departments of Statistics and Computer Science, Harvard University
[email protected]

Abstract

As bandit algorithms are increasingly utilized in scientific studies and industrial applications, there is an associated increasing need for reliable inference methods based on the resulting adaptively-collected data. In this work, we develop methods for inference on data collected in batches using a bandit algorithm. We first prove that the ordinary least squares (OLS) estimator, which is asymptotically normal on independently sampled data, is not asymptotically normal on data collected using standard bandit algorithms when there is no unique optimal arm. This asymptotic non-normality result implies that the naive assumption that the OLS estimator is approximately normal can lead to Type-1 error inflation and confidence intervals with below-nominal coverage probabilities. Second, we introduce the Batched OLS estimator (BOLS) that we prove is (1) asymptotically normal on data collected from both multi-arm and contextual bandits and (2) robust to non-stationarity in the baseline reward.

1 Introduction

arXiv:2002.03217v3 [cs.LG] 8 Jan 2021

Due to their regret-minimizing guarantees, bandit algorithms have been increasingly used in real-world sequential decision-making problems, like online advertising [28], mobile health [44], and online education [36]. However, for many real-world problems it is not enough to just minimize regret on a particular problem instance. For example, suppose we have run an online education experiment using a bandit algorithm where we test different types of teaching strategies. When designing a new online course, ideally we could use the data from the previous experiment to inform the design, e.g., under-performing arms could be eliminated or modified. Moreover, to help others designing online courses, we would like to be able to publish our findings about how different teaching strategies compare in their performance. This example demonstrates the need for methods on bandit data that allow practitioners to draw generalizable knowledge from the data they have collected (e.g., how much better one teaching strategy is compared to another) for the sake of scientific discovery and informed decision making.

In this work we focus on methods to construct confidence intervals for the margin—the difference in expected rewards of two bandit arms—from batched bandit data. Rather than constructing high-probability confidence intervals, we are interested in constructing confidence intervals by using the asymptotic distribution of estimators to approximate their finite-sample distribution. Asymptotic approximation methods for statistical inference have a long history of success in science and lead to much narrower confidence intervals than those constructed using high-probability bounds. Most statistical inference methods based on asymptotic approximation assume that treatments

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

are assigned independently [15]. However, bandit data violates this independence assumption because it is collected adaptively, meaning previous actions and rewards inform future action selections. This non-independence makes statistical inference more challenging; e.g., estimators like the sample mean are often biased on bandit data [34, 39].

Throughout, we focus on the batched bandit setting, in which arms of the bandit are pulled in batches. For our asymptotic analysis we fix the total number of batches, T, and allow the number of arm pulls in each batch, n, to go to infinity. Note that we do not need or expect n to go to infinity in real-world experiments; we use the asymptotic distribution of estimators to approximate their finite-sample distribution when constructing confidence intervals. We focus on the batched setting because it closely reflects many of the problem settings where bandit algorithms are applied. For example, in many mobile health [44, 24, 29] and online education problems [22, 36], multiple users use apps / take courses simultaneously, so a batch corresponds to the number of unique users the bandit algorithm acts on at once. The batched setting is even common in online recommendations and advertising because it is impractical to update the bandit after every action if many users visit the site simultaneously [40, 38, 12, 27]. In many such experimental settings the length of the study, T, cannot be arbitrarily adjusted; e.g., in online education, courses generally cannot be made arbitrarily long, and clinical trials often run for a standard amount of time that depends on the domain science (e.g., the length of mobile health studies is a function of the scientific community's belief in how long it should take for users to form a habit). On the other hand, the number of users, n, can in principle grow as large as funding allows.
Additionally, in our batched setting, we assume that the expected rewards of the arms can change over time, i.e., from batch to batch, which reflects the temporal non-stationarity that is prevalent in many real-world bandit application problems. For example, in online recommendation systems, the click-through rate of a given recommendation typically varies over time, e.g., breaking news articles become less popular over time [40, 27]. Online education and mobile health are also highly non-stationary problems because users tend to disengage over time, so the same notification may be much less effective if sent near the end of an experiment than near the beginning [9, 21, 7]. Our statistical inference method does not need to assume that the number of stationary time periods in the experiment is large, and it is robust to temporal non-stationarity from batch to batch.

The first contribution of this work is proving that on bandit data, rather surprisingly, whether standard estimators are asymptotically normal can depend on whether the margin is zero. We prove that for common bandit algorithms, the arm selection probabilities only concentrate if there is a unique optimal arm. Thus, for two-arm bandits, the arm selection probabilities do not concentrate when the margin—the difference in the expected rewards between the arms—is zero. We show that this leads the ordinary least squares (OLS) estimator to be asymptotically normal when the margin is non-zero, and asymptotically non-normal when the margin is zero. Since the OLS estimator does not converge uniformly (over values of the margin), standard inference methods (normal approximations, the bootstrap¹) can lead to inflated Type-1 error and unreliable confidence intervals on bandit data. The second contribution of this work is introducing the Batched OLS (BOLS) estimator, which can be used for reliable inference—even in non-stationary settings—on data collected with batched bandits.
We prove that, regardless of whether the margin is zero or not, the BOLS estimator for the margin for both multi-arm and contextual bandits is asymptotically normal and thus can be used for both hypothesis testing and obtaining confidence intervals. Moreover, BOLS is also automatically robust to non-stationarity in the rewards and can be used for constructing valid confidence intervals even if there is non-stationarity in the baseline reward, i.e., if the rewards of the arms change from batch to batch, but the margin remains constant. If the margin itself is also non-stationary, BOLS can also be used for constructing simultaneous confidence intervals for the margins for each batch.

2 Related Work

Batched Bandits Much work on batched bandits focuses on minimizing regret [35, 10] or identifying the best arm with high probability [2, 18]. The best arm identification literature utilizes high-probability confidence bounds to construct confidence intervals for bandit parameters; we discuss this approach in the next section. Note that in contrast to other batched bandit literature that allows batch sizes to be adjusted adaptively [35], here we do not have adaptive control over the batch sizes.

¹Note that the validity of bootstrap methods relies on uniform convergence [37].

Batched bandits are closely related to multistage adaptive clinical trials, in which between each batch (or stage of the trial) the procedure can be adjusted depending on the outcomes of the previous batches. Our Batched OLS estimator is most closely related to "stage-wise" p-values for group sequential trials that are computed on each stage separately [42]. p-value combination tests are commonly used to combine stage-wise p-values when the sequence of p-values is shown to be independent or p-clud, meaning that under the null each p-value has a Uniform(0,1) distribution conditional on past p-values [42]. [31] formally establish the independence of stage-wise p-values for two-stage trials in which there are a countable number of adaptive rules; note that this rules out bandit algorithms with real-valued arm selection probabilities, like Thompson Sampling. [4] establishes the p-clud property for two-stage adaptive clinical trials under the assumption that the distribution of the second-stage data, conditioned on the decision rule and the first-stage data, is known under the null hypothesis. Neither of these methods is sufficient for obtaining independent p-values for adaptive trials (1) with an arbitrary number of stages, (2) where the exact distribution of rewards is unknown, and (3) where the action selection probabilities can be real numbers, as for Thompson Sampling.

High Probability Confidence Intervals High-probability confidence intervals provide stronger guarantees than those constructed using asymptotic approximations. In particular, these bounds are guaranteed to hold for a finite number of observations and often even hold uniformly over all n and T. These types of bounds are used throughout the bandit and reinforcement learning literature to construct confidence intervals for bandit parameters [14, 20], prove regret bounds [1, 26], and provide guarantees regarding best arm identification [16, 17].
The primary drawback of high-probability confidence intervals is that they are much more conservative than those constructed using asymptotic approximations. This means that many more observations are needed to get a confidence interval of the same width, or for a statistical test to have the same power, when using high-probability confidence intervals compared to those constructed using asymptotic approximation. Since the cost of increasing the number of users in a study can be large, being able to construct narrow—yet reliable—confidence intervals is crucial in many applications. In our simulations we compare our method to high-probability confidence bounds constructed using the self-normalized martingale bound of [1]. This bound is guaranteed to hold on adaptively collected data and is commonly used in the proofs of regret bounds for bandit algorithms. We find that all the approaches based on asymptotic approximations (which we discuss next) significantly outperform the statistical test constructed using a self-normalized martingale bound in terms of power. Moreover, despite the weaker guarantees of statistical inference based on asymptotic approximations, these methods are generally able to provide reliable coverage of confidence intervals and Type-1 error control.

Adaptive Inference based on Asymptotic Approximations A common approach in the literature for performing inference on bandit data is to use adaptive weights, i.e., weights that are a function of the history. An early example of using adaptive weights is that of [33] and [32], who use adaptive weights in estimating the expected reward under the optimal policy when one has access to i.i.d. observational data. They use an Augmented-Inverse-Probability-Weighted estimator with adaptive weights that are a function of the estimated standard deviation of the reward. [32] conjecture that their approach can be adapted to the adaptive sampling case.
Subsequently, [11] developed the adaptively weighted method for inference on bandit data to produce the Adaptively-Weighted Augmented-Inverse-Probability-Weighted (AW-AIPW) estimator for data collected via multi-arm bandits. They prove a central limit theorem (CLT) for AW-AIPW when the adaptive weights satisfy certain conditions. Note, however, that the AW-AIPW estimator does not have guarantees in non-stationary settings. Adaptive weights are also used by [6] to form the W-decorrelated estimator, a debiased version of OLS that is asymptotically normal. In the multi-arm bandit setting, these adaptive weights are a function of the number of times an arm was chosen previously. We found that in the two-arm setting, the W-decorrelated estimator down-weights rewards from later in the study (Appendix F). [5] introduce the Online Debiased Estimator, which also has bias guarantees on adaptive data, but in the more challenging high-dimensional setting. They prove the asymptotic normality of their estimator in the Gaussian autoregressive and two-batch settings. Note that none of these estimation methods have guarantees in non-stationary bandit settings. [25] provide conditions under which the OLS estimator is asymptotically normal on adaptively collected data. However, as noted in [41, 6, 11], classical inference techniques developed for i.i.d.

data often empirically have inflated Type-1 error on bandit data. In Section 4.1, we discuss the restrictive nature of [25]'s CLT conditions.

3 Problem Formulation

Setup and Notation Though our results generalize to K-arm, contextual bandits (see Section 5.2), we first focus on the two-arm bandit for expositional simplicity. Suppose there are $T$ timesteps, or batches, in a study. In each batch $t \in [1{:}T]$, we select $n$ binary actions $\{A_{t,i}\}_{i=1}^n \in \{0,1\}^n$. We then observe independent rewards $\{R_{t,i}\}_{i=1}^n$, one for each action selected. Note that the distribution of these random variables changes with the batch size, $n$. For example, the distribution of the actions one chooses for the 2nd batch, $\{A_{2,i}\}_{i=1}^n$, may change if one has observed $n = 10$ vs. $n = 100$ samples $\{A_{1,i}, R_{1,i}\}_{i=1}^n$ in the first batch. For readability, we omit indexing random variables by $n$, except for the variables $\mathcal{H}_{t-1}^{(n)}$ and $\pi_t^{(n)}$, and filtrations like $\mathcal{G}_{t-1}^{(n)}$, to be introduced next.

For each $t \in [1{:}T]$, the bandit selects actions $\{A_{t,i}\}_{i=1}^n \overset{\text{i.i.d.}}{\sim} \mathrm{Bernoulli}(\pi_t^{(n)})$ conditional on $\mathcal{H}_{t-1}^{(n)} := \{A_{t',i}, R_{t',i}\}_{i=1,\,t'=1}^{i=n,\,t'=t-1}$, the history prior to batch $t$. Note, the action selection probability $\pi_t^{(n)} := \mathbb{P}(A_{t,i} = 1 \mid \mathcal{H}_{t-1}^{(n)})$ depends on the history $\mathcal{H}_{t-1}^{(n)}$. We assume the following conditional mean for the rewards:
$$\mathbb{E}\big[R_{t,i} \mid \mathcal{H}_{t-1}^{(n)}, A_{t,i}\big] = (1 - A_{t,i})\beta_{t,0} + A_{t,i}\beta_{t,1}. \qquad (1)$$
Note in equation (1) we condition on $\mathcal{H}_{t-1}^{(n)}$ because the conditional mean of the reward does not depend on prior rewards or actions. Let $X_{t,i} := [1 - A_{t,i},\, A_{t,i}]^\top \in \mathbb{R}^2$; note $X_{t,i}$ is higher dimensional when we add more arms and/or context variables. We define the errors as $\epsilon_{t,i} := R_{t,i} - X_{t,i}^\top \beta_t$. Equation (1) implies that $\{\epsilon_{t,i} : i \in [1{:}n], t \in [1{:}T]\}$ is a martingale difference array with respect to the filtration $\{\mathcal{G}_t^{(n)}\}_{t=1}^T$, where $\mathcal{G}_t^{(n)} := \sigma\big(\mathcal{H}_{t-1}^{(n)} \cup \{A_{t,i}\}_{i=1}^n\big)$; thus, $\mathbb{E}[\epsilon_{t,i} \mid \mathcal{G}_{t-1}^{(n)}] = 0$ for all $t, i, n$. The parameters $\beta_t = (\beta_{t,0}, \beta_{t,1})$ can change across batches $t \in [1{:}T]$, which allows for non-stationarity between batches. Assuming that $\beta_t = \beta_{t'}$ for all $t, t' \in [1{:}T]$ simplifies to the stationary mean case.
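To make the batched data-generating model concrete, here is a minimal simulation sketch of a single batch drawn according to model (1) with Gaussian noise. This is our own illustration, not the authors' code; the function name `run_batch` and all parameter values are ours.

```python
import numpy as np

def run_batch(rng, n, pi_t, beta_t0, beta_t1, sigma=1.0):
    """Sample one batch of size n from model (1): actions are i.i.d.
    Bernoulli(pi_t) given the history, and rewards have conditional mean
    (1 - A) * beta_t0 + A * beta_t1 plus noise with variance sigma^2."""
    actions = rng.binomial(1, pi_t, size=n)
    mean = (1 - actions) * beta_t0 + actions * beta_t1
    rewards = mean + rng.normal(0.0, sigma, size=n)
    return actions, rewards

rng = np.random.default_rng(0)
actions, rewards = run_batch(rng, n=100, pi_t=0.5, beta_t0=0.0, beta_t1=0.25)
```

A bandit algorithm would compute $\pi_{t+1}^{(n)}$ from the pooled `(actions, rewards)` of earlier batches before drawing the next batch.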

Action Selection Probability Constraint (Clipping) In order to perform inference on bandit data, it is necessary to guarantee that the bandit algorithm explores sufficiently. For example, the CLTs for both the W-decorrelated [6] and the AW-AIPW [11] estimators have conditions that implicitly require that the bandit algorithm cannot sample any given action with probability that goes to zero or one arbitrarily fast. Greater exploration also increases the power of statistical tests regarding the margin [43]. Moreover, if there is non-stationarity in the margin between batches, it is desirable for the bandit algorithm to continue exploring. We explicitly guarantee exploration by constraining the probability that any given action can be sampled (see Definition 1). We allow the action selection probabilities $\pi_t^{(n)}$ to converge to 0 and/or 1 at some rate.

Definition 1. A clipping constraint with rate $f(n)$ means that $\pi_t^{(n)}$ satisfies the following:
$$\lim_{n \to \infty} \mathbb{P}\big(\pi_t^{(n)} \in [f(n),\, 1 - f(n)]\big) = 1. \qquad (2)$$
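For a constant rate $f(n) = c$, enforcing the clipping constraint of Definition 1 amounts to a one-line projection of the algorithm's raw action selection probability into $[c, 1 - c]$. The helper below is a hypothetical sketch of ours (the name `clip_probability` is not from the paper):

```python
def clip_probability(p, f_n):
    """Project an action selection probability into [f_n, 1 - f_n],
    as required by the clipping constraint in Definition 1."""
    return min(max(p, f_n), 1.0 - f_n)

# e.g. a posterior probability of 0.999 with clipping rate f(n) = 0.05
p_clipped = clip_probability(0.999, 0.05)
```

Any bandit algorithm's probabilities (e.g., a Thompson Sampling posterior probability) can be passed through such a projection before actions are drawn.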

4 Asymptotic Distribution of the Ordinary Least Squares Estimator

Suppose we are in the stationary case, and we would like to estimate $\beta$. Consider the OLS estimator $\hat{\beta}^{\mathrm{OLS}} = (X^\top X)^{-1} X^\top R$, where $X := [X_{1,1}, \dots, X_{1,n}, \dots, X_{T,1}, \dots, X_{T,n}]^\top \in \mathbb{R}^{nT \times 2}$ and $R := [R_{1,1}, \dots, R_{1,n}, \dots, R_{T,1}, \dots, R_{T,n}]^\top \in \mathbb{R}^{nT}$. Note that $X^\top X = \sum_{t=1}^T \sum_{i=1}^n X_{t,i} X_{t,i}^\top$.
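Because $X$ has the two columns $[1 - A, A]$, the matrix $X^\top X$ is diagonal and the OLS solution reduces to the per-arm sample means. The sketch below (our own illustration; the function name is ours) computes $\hat{\beta}^{\mathrm{OLS}}$ this way rather than forming $(X^\top X)^{-1} X^\top R$ explicitly:

```python
import numpy as np

def ols_two_arm(actions, rewards):
    """OLS for the two-arm model: since X has columns [1 - A, A],
    (X^T X)^{-1} X^T R reduces to the per-arm sample means."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards)
    beta0 = rewards[actions == 0].mean()
    beta1 = rewards[actions == 1].mean()
    return beta0, beta1

rng = np.random.default_rng(1)
A = rng.binomial(1, 0.5, size=1000)
R = 0.3 * A + rng.normal(size=1000)   # true beta0 = 0, beta1 = 0.3
b0, b1 = ols_two_arm(A, R)
```

With i.i.d. action selection as here, both estimates are close to the truth; the point of this section is that this need not translate into asymptotic normality under adaptive sampling.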

4.1 Conditions for Asymptotic Normality of the OLS Estimator

If $(X_{t,i}, \epsilon_{t,i})$ are i.i.d., $\mathbb{E}[\epsilon_{t,i}] = 0$, $\mathbb{E}[\epsilon_{t,i}^2] = \sigma^2$, and the first two moments of $X_{t,i}$ exist, a classical result from statistics [3] is that the OLS estimator is asymptotically normal, i.e., as $n \to \infty$, $(X^\top X)^{1/2} (\hat{\beta}^{\mathrm{OLS}} - \beta) \xrightarrow{D} N(0, \sigma^2 I_p)$. [25] generalize this result by proving that the OLS estimator is still asymptotically normal in the adaptive sampling case when $X^\top X$ satisfies a certain stability condition. To show that a similar result

Figure 1: Empirical distribution of the Z-statistic ($\sigma^2$ is known) of the OLS estimator for the margin. All simulations are with no margin ($\beta_1 = \beta_0 = 0$); $N(0,1)$ rewards; $T = 25$; and $n = 100$. For $\epsilon$-greedy, $\epsilon = 0.1$.

holds for the batched setting, we generalize the CLT of [25] to triangular arrays (required since the distributions of our random variables vary as the batch size, $n$, changes), as stated in Theorem 1.

Condition 1 (Moments). For all $t, n, i$, $\mathbb{E}[\epsilon_{t,i}^2 \mid \mathcal{G}_{t-1}^{(n)}] = \sigma^2$ and $\mathbb{E}[\epsilon_{t,i}^4 \mid \mathcal{G}_{t-1}^{(n)}] < M < \infty$.

Condition 2 (Stability). For some non-random sequence of scalars $\{a_n\}_{n=1}^\infty$, as $n \to \infty$, $a_n \cdot \frac{1}{nT} \sum_{t=1}^T \sum_{i=1}^n A_{t,i} \xrightarrow{P} 1$.

Theorem 1 (Triangular-array version of [25], Theorem 3). Assuming Conditions 1 and 2, as $n \to \infty$, $(X^\top X)^{1/2} (\hat{\beta}^{\mathrm{OLS}} - \beta) \xrightarrow{D} N(0, \sigma^2 I_p)$.

Note that in the bandit setting, Condition 2 means that prior to running the experiment, the asymptotic rate at which arms will be selected is predictable. We will show that Condition 2 is in a sense necessary for the asymptotic normality of OLS. In Corollary 1 below we state that Conditions 1 and 3, together with a non-zero margin, are sufficient for stability Condition 2. Later, we will show that when the margin is zero, Condition 2 does not hold for many common bandit algorithms, and we prove that this leads the OLS estimator to be asymptotically non-normal.

Condition 3 (Conditionally i.i.d. actions). For each $t \in [1{:}T]$, the actions $\{A_{t,i}\}_{i=1}^n$ are sampled i.i.d. $\mathrm{Bernoulli}(\pi_t^{(n)})$ over $i \in [1{:}n]$, conditional on $\mathcal{H}_{t-1}^{(n)}$.

Corollary 1 (Sufficient conditions for Theorem 1). If Conditions 1 and 3 hold, and the margin is non-zero, data collected in batches using $\epsilon$-greedy, Thompson Sampling, or UCB with a clipping constraint with $f(n) = c$ for some $0 < c \le \frac{1}{2}$ (see Definition 1) satisfy the conditions of Theorem 1.

4.2 Asymptotic Non-Normality under No Margin

We prove the conjecture of [6] that when the margin is zero, the OLS estimator is asymptotically non-normal under common bandit algorithms, including Thompson Sampling, $\epsilon$-greedy, and UCB. Thus, as seen in Figure 1, assuming the OLS estimator is approximately normal on bandit data can lead to inflated Type-1 error, even asymptotically. The asymptotic non-normality of OLS occurs when the margin is zero because when there is no unique optimal arm, $\pi_t^{(n)}$ does not concentrate as $n \to \infty$ (Appendix C). We state the asymptotic non-normality result for Thompson Sampling in Theorem 2; see Appendix C for the proof and similar results for $\epsilon$-greedy and UCB. It is sufficient to prove asymptotic non-normality for $T = 2$. Note, $\hat{\Delta}^{\mathrm{OLS}}$ is the difference in the sample means for each arm, so $\hat{\Delta}^{\mathrm{OLS}} = \hat{\beta}_1^{\mathrm{OLS}} - \hat{\beta}_0^{\mathrm{OLS}}$. The Z-statistic of $\hat{\Delta}^{\mathrm{OLS}}$, which is asymptotically normal under i.i.d. sampling, is as follows:
$$\sqrt{\frac{\big(\sum_{t=1}^2 \sum_{i=1}^n A_{t,i}\big)\big(\sum_{t=1}^2 \sum_{i=1}^n 1 - A_{t,i}\big)}{2\sigma^2 n}}\,\big(\hat{\Delta}^{\mathrm{OLS}} - \Delta\big). \qquad (3)$$

Theorem 2 (Asymptotic non-normality of the OLS estimator under zero margin for Thompson Sampling). Let $T = 2$ and $\pi_1^{(n)} = \frac{1}{2}$. If $\epsilon_{t,i} \overset{\text{i.i.d.}}{\sim} N(0,1)$, we have independent normal priors on the arm means $\tilde{\beta}_0, \tilde{\beta}_1 \overset{\text{i.i.d.}}{\sim} N(0,1)$, and $\pi_2^{(n)} = \pi_{\min} \vee \big[(1 - \pi_{\max}) \wedge \mathbb{P}(\tilde{\beta}_1 > \tilde{\beta}_0 \mid \mathcal{H}_1^{(n)})\big]$ for constants $\pi_{\min}, \pi_{\max}$ with $0 < \pi_{\min} \le \pi_{\max} < 1$, then (3) is asymptotically not normal when the margin is zero.

Figure 2: Empirical undercoverage probabilities (coverage probability below 95%) of confidence intervals based on a normal approximation for the OLS estimator. We use Thompson Sampling with $N(0,1)$ priors, a clipping constraint of $0.05 \le \pi_t^{(n)} \le 0.95$, $N(0,1)$ rewards, $T = 25$, and known $\sigma^2$. Standard errors are < 0.001.

Since the OLS estimator is asymptotically normal when $\Delta \neq 0$ (Corollary 1) and asymptotically not normal when $\Delta = 0$, the OLS estimator does not converge uniformly on data collected under standard bandit algorithms. The non-uniform convergence of the OLS estimator precludes us from using a normal approximation to perform hypothesis testing and construct confidence intervals (see [19]). In real-world applications, there is rarely exactly zero margin. However, the non-uniform convergence of the OLS estimator at zero margin is still practically important because the asymptotic distribution of the OLS estimator when the margin is zero is indicative of the finite-sample distribution when the margin is statistically difficult to differentiate from zero, i.e., when the signal-to-noise ratio, $|\Delta|/\sigma$, is low. Figure 2 shows that even when the margin is non-zero, when the signal-to-noise ratio is low, confidence intervals constructed using a normal approximation have coverage probabilities below the nominal level. Moreover, for any batch size $n$ and noise $\sigma^2$, there exists a non-zero margin size with a finite-sample distribution that is poorly approximated by a normal distribution.
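The two-batch Thompson Sampling setup of Theorem 2 is easy to simulate. The sketch below is our own illustration (not the authors' code); for simplicity it clips the posterior probability symmetrically to $[0.05, 0.95]$, a special case of the clipping in the theorem, and computes the Z-statistic (3) from the pooled arm means with $\sigma^2 = 1$ known:

```python
import numpy as np
from math import erf, sqrt

def z_statistic_two_batch(rng, n, clip=0.05):
    """One replicate of a two-batch Thompson Sampling experiment with zero
    margin: N(0,1) rewards for both arms and N(0,1) priors on the arm means.
    Returns the Z-statistic of equation (3)."""
    # Batch 1: actions sampled with probability 1/2.
    a1 = rng.binomial(1, 0.5, size=n)
    r1 = rng.normal(size=n)
    # Posterior probability that arm 1's mean exceeds arm 0's, under
    # independent N(0,1) priors and N(mean, 1) rewards (conjugate update).
    n1, n0 = a1.sum(), n - a1.sum()
    m1 = r1[a1 == 1].sum() / (n1 + 1)
    m0 = r1[a1 == 0].sum() / (n0 + 1)
    v = 1.0 / (n1 + 1) + 1.0 / (n0 + 1)
    p = 0.5 * (1.0 + erf((m1 - m0) / sqrt(2.0 * v)))
    # Batch 2: clipped posterior probability.
    pi2 = min(max(p, clip), 1.0 - clip)
    a2 = rng.binomial(1, pi2, size=n)
    r2 = rng.normal(size=n)
    # Z-statistic (3) from the pooled arm means.
    A = np.concatenate([a1, a2])
    R = np.concatenate([r1, r2])
    delta_hat = R[A == 1].mean() - R[A == 0].mean()
    return sqrt(A.sum() * (2 * n - A.sum()) / (2.0 * n)) * delta_hat

rng = np.random.default_rng(2)
zs = np.array([z_statistic_two_batch(rng, n=200) for _ in range(500)])
```

Plotting a histogram of `zs` over many replicates reproduces the qualitative non-normality seen in Figure 1.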

5 Batched OLS Estimator

5.1 Batched OLS Estimator for Multi-Arm Bandits

We now introduce the Batched OLS (BOLS) estimator, which is asymptotically normal under a large class of bandit algorithms, even when the margin is zero. Instead of computing the OLS estimator on all the data, we compute the OLS estimator for each batch and normalize it by the variance estimated from that batch. For each $t \in [1{:}T]$, the BOLS estimator of the margin $\Delta_t := \beta_{t,1} - \beta_{t,0}$ is:
$$\hat{\Delta}_t^{\mathrm{BOLS}} = \frac{\sum_{i=1}^n A_{t,i} R_{t,i}}{\sum_{i=1}^n A_{t,i}} - \frac{\sum_{i=1}^n (1 - A_{t,i}) R_{t,i}}{\sum_{i=1}^n 1 - A_{t,i}}.$$

Theorem 3 (Asymptotic normality of the Batched OLS estimator for multi-arm bandits). Assume Conditions 1 (moments) and 3 (conditionally i.i.d. actions), and a clipping rate $f(n) = \frac{1}{n^\alpha}$ for some $0 \le \alpha < 1$ (see Definition 1). Then

$$\begin{pmatrix} \sqrt{\frac{(\sum_{i=1}^n 1 - A_{1,i})(\sum_{i=1}^n A_{1,i})}{n}}\,(\hat{\Delta}_1^{\mathrm{BOLS}} - \Delta_1) \\ \sqrt{\frac{(\sum_{i=1}^n 1 - A_{2,i})(\sum_{i=1}^n A_{2,i})}{n}}\,(\hat{\Delta}_2^{\mathrm{BOLS}} - \Delta_2) \\ \vdots \\ \sqrt{\frac{(\sum_{i=1}^n 1 - A_{T,i})(\sum_{i=1}^n A_{T,i})}{n}}\,(\hat{\Delta}_T^{\mathrm{BOLS}} - \Delta_T) \end{pmatrix} \xrightarrow{D} N(0, \sigma^2 I_T).$$
It is straightforward to generalize Theorem 3 to the case in which batches have different sizes, provided the size of the smallest batch goes to infinity and the batch sizes are independent of the history.

By Theorem 3, for the stationary margin case, we can test $H_0: \Delta = c$ vs. $H_1: \Delta \neq c$ with the following statistic, which is asymptotically normal under the null:
$$\frac{1}{\sqrt{T}} \sum_{t=1}^T \sqrt{\frac{(\sum_{i=1}^n 1 - A_{t,i})(\sum_{i=1}^n A_{t,i})}{n \sigma^2}}\,(\hat{\Delta}_t^{\mathrm{BOLS}} - c). \qquad (4)$$
This type of test statistic—a weighted combination of asymptotically independent normal variables, a special case of the inverse normal p-value combination test—has been used in simple settings in which the studies (e.g., batches) are independent (e.g., when conducting meta-analyses across multiple studies) [26]. Here the ability to use this type of test statistic is novel since, due to the bandit algorithm, the batches are not independent. The work here demonstrates asymptotic independence, and thus for large $n$ the Z-statistics from each batch should be approximately independently distributed.

The key to proving asymptotic normality for BOLS is that the following ratio converges in probability to one:
$$\frac{1}{n} \cdot \frac{(\sum_{i=1}^n 1 - A_{t,i})(\sum_{i=1}^n A_{t,i})}{n \pi_t^{(n)} (1 - \pi_t^{(n)})} \xrightarrow{P} 1.$$
Since $\pi_t^{(n)} \in \mathcal{G}_{t-1}^{(n)}$, the quantity $\frac{1}{n \pi_t^{(n)} (1 - \pi_t^{(n)})}$ is a constant given $\mathcal{G}_{t-1}^{(n)}$. Thus, even if $\pi_t^{(n)}$ does not concentrate, we are still able to apply the martingale CLT [8] to prove asymptotic normality. See Appendix B for more details.
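The per-batch BOLS estimates and the combined statistic (4) are straightforward to compute. The sketch below is our own illustration (function names are ours) and treats $\sigma^2$ as known; the paper estimates it per batch:

```python
import numpy as np

def bols_margin(actions, rewards):
    """Per-batch BOLS estimate of the margin: difference of arm means."""
    return rewards[actions == 1].mean() - rewards[actions == 0].mean()

def bols_test_statistic(batches, c=0.0, sigma2=1.0):
    """Test statistic (4): per-batch Z-statistics averaged over sqrt(T).
    `batches` is a list of (actions, rewards) array pairs of common size n."""
    T = len(batches)
    zs = []
    for actions, rewards in batches:
        n = len(actions)
        n1 = actions.sum()
        n0 = n - n1
        weight = np.sqrt(n0 * n1 / (n * sigma2))
        zs.append(weight * (bols_margin(actions, rewards) - c))
    return sum(zs) / np.sqrt(T)

rng = np.random.default_rng(3)
batches = []
for t in range(25):
    a = rng.binomial(1, 0.5, size=100)
    r = rng.normal(size=100)          # zero margin under the null
    batches.append((a, r))
stat = bols_test_statistic(batches)
```

Under $H_0$, `stat` is approximately standard normal, so the two-sided test rejects when its absolute value exceeds the normal quantile $z_{1-\alpha/2}$.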

5.2 Batched OLS Estimator for Contextual Bandits

For contextual $K$-arm bandits, for any two arms $x, y \in [0{:}K-1]$, we can estimate the margin between them, $\Delta_{t,x-y} := \beta_{t,x} - \beta_{t,y} \in \mathbb{R}^d$. In each batch, we observe context vectors $\{C_{t,i}\}_{i=1}^n$ with $C_{t,i} \in \mathbb{R}^d$. We redefine the history $\mathcal{H}_{t-1}^{(n)} := \{C_{t',i}, A_{t',i}, R_{t',i}\}_{i=1,\,t'=1}^{i=n,\,t'=t-1}$ and define the filtration $\mathcal{F}_t^{(n)} := \sigma\big(\mathcal{H}_{t-1}^{(n)} \cup \{A_{t,i}, C_{t,i}\}_{i=1}^n\big)$. The action selection probabilities $\pi_t^{(n)}$ are now functions of the context, so $\pi_t^{(n)}(C_{t,i}) \in [0,1]^K$ is a vector whose $k$-th dimension equals $\mathbb{P}(A_{t,i} = k \mid \mathcal{H}_{t-1}^{(n)}, C_{t,i})$. We assume the following conditional mean model for the reward: $\mathbb{E}\big[R_{t,i} \mid \mathcal{F}_t^{(n)}\big] = \sum_{k=0}^{K-1} \mathbb{I}(A_{t,i} = k)\, C_{t,i}^\top \beta_{t,k}$, and let $\epsilon_{t,i} := R_{t,i} - \sum_{k=0}^{K-1} \mathbb{I}(A_{t,i} = k)\, C_{t,i}^\top \beta_{t,k}$.

Condition 4 (Conditionally i.i.d. contexts). For each $t$, the contexts $C_{t,1}, C_{t,2}, \dots, C_{t,n}$ are i.i.d., and their first two moments, $\mu_t, \Sigma_t$, are non-random given $\mathcal{H}_{t-1}^{(n)}$, i.e., $\mu_t, \Sigma_t \in \sigma(\mathcal{H}_{t-1}^{(n)})$.

Condition 5 (Bounded contexts). $\|C_{t,i}\|_{\max} \le u$ for all $i, t, n$ for some constant $u$. Also, the minimum eigenvalue of $\Sigma_t$ is lower bounded, i.e., $\lambda_{\min}(\Sigma_t) > l > 0$.

Definition 2. A conditional clipping constraint with rate $f(n)$ means that the action selection probabilities $\pi_t^{(n)} : \mathbb{R}^d \to [0,1]^K$ satisfy the following:
$$\mathbb{P}\big(\forall\, c \in \mathbb{R}^d,\ \pi_t^{(n)}(c) \in [f(n),\, 1 - f(n)]^K\big) \to 1.$$

For each $t \in [1{:}T]$, we have the OLS estimator for $\Delta_{t,x-y}$: $\hat{\Delta}_t^{\mathrm{OLS}} := \hat{\beta}_{t,x}^{\mathrm{OLS}} - \hat{\beta}_{t,y}^{\mathrm{OLS}}$, where $C_{t,k} := \sum_{i=1}^n \mathbb{I}(A_{t,i} = k)\, C_{t,i} C_{t,i}^\top \in \mathbb{R}^{d \times d}$ and $\hat{\beta}_{t,k}^{\mathrm{OLS}} = C_{t,k}^{-1} \sum_{i=1}^n \mathbb{I}(A_{t,i} = k)\, C_{t,i} R_{t,i}$.

Theorem 4 (Asymptotic normality of the Batched OLS estimator for contextual bandits). Assume Conditions 1 (moments)², 3 (conditionally i.i.d. actions), 4, and 5, and a conditional clipping rate $f(n) = c$ for some $0 \le c < \frac{1}{2}$ (see Definition 2). Then
$$\begin{pmatrix} \big(C_{1,x}^{-1} + C_{1,y}^{-1}\big)^{-1/2} (\hat{\Delta}_1^{\mathrm{OLS}} - \Delta_{1,x-y}) \\ \big(C_{2,x}^{-1} + C_{2,y}^{-1}\big)^{-1/2} (\hat{\Delta}_2^{\mathrm{OLS}} - \Delta_{2,x-y}) \\ \vdots \\ \big(C_{T,x}^{-1} + C_{T,y}^{-1}\big)^{-1/2} (\hat{\Delta}_T^{\mathrm{OLS}} - \Delta_{T,x-y}) \end{pmatrix} \xrightarrow{D} N(0, \sigma^2 I_{Td}).$$
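To illustrate the per-arm, per-batch quantities $C_{t,k}$ and $\hat{\beta}_{t,k}^{\mathrm{OLS}}$, the following sketch (our own; the function name and the simulated linear model are ours) estimates the margin between two arms within a single batch:

```python
import numpy as np

def batch_contextual_ols(contexts, actions, rewards, arm):
    """Per-batch OLS coefficient for one arm in the linear contextual model:
    beta_hat = C_k^{-1} sum_i 1{A_i = arm} C_i R_i, where
    C_k = sum_i 1{A_i = arm} C_i C_i^T."""
    mask = actions == arm
    Ck = contexts[mask].T @ contexts[mask]   # d x d Gram matrix for this arm
    bk = contexts[mask].T @ rewards[mask]    # d-vector
    return np.linalg.solve(Ck, bk), Ck

rng = np.random.default_rng(4)
n, d = 500, 3
C = rng.normal(size=(n, d))
A = rng.integers(0, 2, size=n)               # two arms, uniform selection
beta = {0: np.zeros(d), 1: np.array([0.5, 0.0, -0.5])}
R = np.where(A == 1, C @ beta[1], C @ beta[0]) + rng.normal(size=n)
beta1_hat, C1 = batch_contextual_ols(C, A, R, arm=1)
beta0_hat, C0 = batch_contextual_ols(C, A, R, arm=0)
delta_hat = beta1_hat - beta0_hat
```

Theorem 4 then normalizes `delta_hat` by the matrix square root of $(C_{t,x}^{-1} + C_{t,y}^{-1})^{-1}$ and stacks these statistics across batches.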

5.3 Batched OLS Statistic for Non-Stationary Bandits

Many of the real-world problems to which we would like to apply bandit algorithms exhibit non-stationarity over time. For example, in online advertising, the effectiveness of an ad may change over time due to exposure

²Assume an analogous condition for the contextual bandit case, where $\mathcal{G}_t^{(n)}$ is replaced by $\mathcal{F}_t^{(n)}$.

to competing ads and general societal changes that could affect perceptions of an ad. We may believe that the expected reward for a given action varies over time, but that the margin is constant from batch to batch. In the online advertising setting, this would mean that whether one ad is better than another is stable, but the overall effectiveness of both ads may change over time. In this case, we can simply use the BOLS test statistic described earlier in equation (4) to test $H_0: \Delta = 0$ vs. $H_1: \Delta \neq 0$. Note that the BOLS test statistic for the margin is robust to non-stationarity in the baseline reward without any adjustment. Moreover, in our simulation settings we estimate the variance $\sigma^2$ separately for each batch, which allows for non-stationarity in the variance between batches as well; see Appendix A for variance estimation details and Section 6 for simulation results. Additionally, in the case that we believe the margin itself may vary from batch to batch, the BOLS test statistic can also be used to construct confidence regions that contain the true margin $\Delta_t$ for each batch simultaneously; see Appendix A.5 for details.

6 Simulation Experiments

Procedure We focus on the two-arm bandit setting and test whether the margin is zero, specifically $H_0: \Delta = 0$ vs. $H_1: \Delta \neq 0$. We perform experiments in which the noise variance $\sigma^2$ is estimated, and we assume homoscedastic errors throughout. See Appendix A.4 for more details about how we estimate the noise variance and about our experimental setup. In Figures 3 and 4, we display results for stationary bandits, and in Figure 5 we show results for bandits with non-stationary baseline rewards. See Appendix A.5 for results for bandits with non-stationary margins.

In our simulations, we found that OLS and AW-AIPW have inflated Type-1 error. Since Type-1 error control is a hard constraint, solutions with inflated Type-1 error are infeasible. In the power plots, we adjust the cutoffs of the estimators to ensure proper Type-1 error control; if an estimator has inflated Type-1 error under the null, in the power simulations we use a critical value estimated using the simulations under the null. Note that it is infeasible to make these cutoff adjustments in real experiments (unless one found the worst-case setting), as there are many nuisance parameters—like the expected rewards for each arm and the noise variance—which can affect cutoff values.

Results Figure 3 shows that for small sample sizes ($nT \lesssim 300$), BOLS has more reliable Type-1 error control than AW-AIPW with variance-stabilizing weights. After $nT \ge 500$ samples, AW-AIPW has proper Type-1 error control, and by Figure 4 it always has slightly greater power than BOLS in the stationary setting. The W-decorrelated estimator has reliable Type-1 error control, but very low power compared to AW-AIPW and BOLS. Finally, the high-probability, self-normalized martingale bound of [1], which we use for hypothesis testing, has very low power compared to the statistical inference methods based on asymptotic approximations. In Figure 5, we display simulation results for the non-stationary baseline reward setting. Whereas the other estimators have no Type-1 error guarantees, BOLS still has proper Type-1 error control in the non-stationary baseline reward setting. Moreover, BOLS can have much greater power than other estimators when there is non-stationarity in the baseline reward. Overall, BOLS is favorable over other estimators in small-sample settings or when one wants to be robust to non-stationarity in the baseline reward—at the cost of losing a little power if the environment is stationary.

Figure 3: Stationary setting: Type-1 error for a two-sided test of $H_0: \Delta = 0$ vs. $H_1: \Delta \neq 0$ ($\alpha = 0.05$). We set $\beta_1 = \beta_0 = 0$, $n = 25$, and a clipping constraint of $0.1 \le \pi_t^{(n)} \le 0.9$. We use 100k Monte Carlo simulations and standard errors are < 0.001.

Figure 4: Stationary setting: Power for a two-sided test of $H_0: \Delta = 0$ vs. $H_1: \Delta \neq 0$ ($\alpha = 0.05$). We set $\beta_1 = 0$, $\beta_0 = 0.25$, $n = 25$, and a clipping constraint of $0.1 \le \pi_t^{(n)} \le 0.9$. We use 100k Monte Carlo simulations and standard errors are < 0.002. We account for Type-1 error inflation as described in Section 6.

Figure 5: Non-stationary baseline reward setting: Type-1 error (upper left) and power (upper right) for a two-sided test of $H_0: \Delta = 0$ vs. $H_1: \Delta \neq 0$ ($\alpha = 0.05$). In the lower two plots we plot the expected rewards for each arm; note the margin is constant across batches. We use $n = 25$ and a clipping constraint of $0.1 \le \pi_t^{(n)} \le 0.9$. We use 100k Monte Carlo simulations and standard errors are < 0.002.

7 Discussion

We found that the OLS estimator is asymptotically non-normal when the margin is zero due to the non-concentration of the action selection probabilities. Since the OLS estimator is a canonical example of a method-of-moments estimator [13], our results suggest that the inferential guarantees of standard method-of-moments estimators may fail to hold on adaptively collected data when there is no unique optimal, regret-minimizing policy. We develop the Batched OLS estimator, which is asymptotically normal even when the action selection probabilities do not concentrate. An open question is whether batched versions of general method-of-moments estimators could similarly be used for adaptive inference.

8 Broader Impact

Our work has the positive impact of encouraging the use of valid statistical inference methods on bandit data, which ultimately leads to more reliable scientific conclusions. In addition, by providing a valid statistical inference method on bandit data, our work facilitates the use of bandit algorithms in experimentation.

Acknowledgments and Disclosure of Funding

Research reported in this paper was supported by National Institute on Alcohol Abuse and Alcoholism (NIAAA) of the National Institutes of Health under award number R01AA23187, National Institute on Drug Abuse (NIDA) of the National Institutes of Health under award number P50DA039838, National Cancer Institute (NCI) of the National Institutes of Health under award number U01CA229437, and by NIH/NIBIB and OD award number P41EB028242. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE1745303. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

[1] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
[2] Arpit Agarwal, Shivani Agarwal, Sepehr Assadi, and Sanjeev Khanna. Learning with limited rounds of adaptivity: Coin tossing, multi-armed bandits, and ranking from pairwise comparisons. In Conference on Learning Theory, pages 39–75, 2017.
[3] Takeshi Amemiya. Advanced Econometrics. Harvard University Press, 1985.
[4] W Brannath, G Gutjahr, and P Bauer. Probabilistic foundation of confirmatory adaptive designs. Journal of the American Statistical Association, 107(498):824–832, 2012.
[5] Yash Deshpande, Adel Javanmard, and Mohammad Mehrabi. Online debiasing for adaptively collected high-dimensional data. arXiv preprint arXiv:1911.01040, 2019.
[6] Yash Deshpande, Lester Mackey, Vasilis Syrgkanis, and Matt Taddy. Accurate inference for adaptive linear models. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1194–1203, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.
[7] Katie L Druce, William G Dixon, and John McBeth. Maximizing engagement in mobile health studies: lessons learned and future directions. Rheumatic Disease Clinics, 45(2):159–172, 2019.
[8] Aryeh Dvoretzky. Asymptotic normality for sums of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the University of California, 1972.
[9] Gunther Eysenbach. The law of attrition. Journal of Medical Internet Research, 7(1):e11, 2005.
[10] Zijun Gao, Yanjun Han, Zhimei Ren, and Zhengqing Zhou. Batched multi-armed bandits problem. Conference on Neural Information Processing Systems, 2019.
[11] Vitor Hadad, David A Hirshberg, Ruohan Zhan, Stefan Wager, and Susan Athey. Confidence intervals for policy evaluation in adaptive experiments. arXiv preprint arXiv:1911.02768, 2019.
[12] Yanjun Han, Zhengqing Zhou, Zhengyuan Zhou, Jose Blanchet, Peter W Glynn, and Yinyu Ye. Sequential batch learning in finite-action linear contextual bandits. arXiv preprint arXiv:2004.06321, 2020.
[13] Martin L. Hazelton. Methods of Moments Estimation, pages 816–817. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
[14] Steven R Howard, Aaditya Ramdas, Jon McAuliffe, and Jasjeet Sekhon. Uniform, nonparametric, non-asymptotic confidence sequences. arXiv preprint arXiv:1810.08240, 2018.
[15] Guido W Imbens and Donald B Rubin. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press, 2015.
[16] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil'ucb: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439, 2014.
[17] Kevin Jamieson and Robert Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1–6. IEEE, 2014.
[18] Kwang-Sung Jun, Kevin G Jamieson, Robert D Nowak, and Xiaojin Zhu. Top arm identification in multi-armed bandits with batch arm pulls. In AISTATS, pages 139–148, 2016.
[19] Maximilian Kasy. Uniformity and the delta method. Journal of Econometric Methods, 8(1), 2019.
[20] Emilie Kaufmann and Wouter Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. arXiv preprint arXiv:1811.11419, 2018.
[21] René F Kizilcec, Chris Piech, and Emily Schneider. Deconstructing disengagement: analyzing learner subpopulations in massive open online courses. In Proceedings of the Third International Conference on Learning Analytics and Knowledge, pages 170–179, 2013.
[22] René F Kizilcec, Justin Reich, Michael Yeomans, Christoph Dann, Emma Brunskill, Glenn Lopez, Selen Turkay, Joseph Jay Williams, and Dustin Tingley. Scaling up behavioral science interventions in online education. Proceedings of the National Academy of Sciences, 117(26):14900–14905, 2020.
[23] Predrag Klasnja, Eric B Hekler, Saul Shiffman, Audrey Boruvka, Daniel Almirall, Ambuj Tewari, and Susan A Murphy. Microrandomized trials: An experimental design for developing just-in-time adaptive interventions. Health Psychology, 34(S):1220, 2015.
[24] Predrag Klasnja, Shawna Smith, Nicholas J Seewald, Andy Lee, Kelly Hall, Brook Luers, Eric B Hekler, and Susan A Murphy. Efficacy of contextually tailored suggestions for physical activity: A micro-randomized optimization trial of HeartSteps. Annals of Behavioral Medicine, 53(6):573–582, 2019.
[25] Tze Leung Lai and Ching Zong Wei. Least squares estimates in stochastic regression models with applications to identification and control of dynamic systems. The Annals of Statistics, 10(1):154–166, 1982.
[26] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
[27] Chang Li and Maarten De Rijke. Cascading non-stationary bandits: Online learning to rank in the non-stationary cascade model. arXiv preprint arXiv:1905.12370, 2019.
[28] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670, 2010.
[29] Peng Liao, Kristjan Greenewald, Predrag Klasnja, and Susan Murphy. Personalized HeartSteps: A reinforcement learning algorithm for optimizing physical activity. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(1):1–22, 2020.
[30] Peng Liao, Predrag Klasnja, Ambuj Tewari, and Susan A Murphy. Sample size calculations for micro-randomized trials in mHealth. Statistics in Medicine, 35(12):1944–1971, 2016.
[31] Qing Liu, Michael A Proschan, and Gordon W Pledger. A unified theory of two-stage adaptive designs. Journal of the American Statistical Association, 97(460):1034–1041, 2002.
[32] Alexander R Luedtke and Mark J van der Laan. Parametric-rate inference for one-sided differentiable parameters. Journal of the American Statistical Association, 113(522):780–788, 2018.
[33] Alexander R Luedtke and Mark J van der Laan. Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics, 44(2):713, 2016.
[34] Xinkun Nie, Xiaoying Tian, Jonathan Taylor, and James Zou. Why adaptively collected data have negative bias and how to correct for it. International Conference on Artificial Intelligence and Statistics, 2018.
[35] Vianney Perchet, Philippe Rigollet, Sylvain Chassang, Erik Snowberg, et al. Batched bandit problems. The Annals of Statistics, 44(2):660–681, 2016.
[36] Anna Rafferty, Huiji Ying, and Joseph Williams. Statistical consequences of using multi-armed bandits to conduct adaptive educational experiments. Journal of Educational Data Mining, 11(1):47–79, 2019.
[37] Joseph P Romano, Azeem M Shaikh, et al. On the uniform asymptotic validity of subsampling and the bootstrap. The Annals of Statistics, 40(6):2798–2822, 2012.
[38] Eric M Schwartz, Eric T Bradlow, and Peter S Fader. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science, 36(4):500–522, 2017.
[39] Jaehyeok Shin, Aaditya Ramdas, and Alessandro Rinaldo. Are sample means in multi-armed bandits positively or negatively biased? In Advances in Neural Information Processing Systems, pages 7100–7109, 2019.
[40] Liang Tang, Yexi Jiang, Lei Li, and Tao Li. Ensemble contextual bandits for personalized recommendation. In Proceedings of the 8th ACM Conference on Recommender Systems, pages 73–80, 2014.
[41] Sofía S Villar, Jack Bowden, and James Wason. Multi-armed bandit models for the design of clinical trials: benefits and challenges. Statistical Science: a review journal of the Institute of Mathematical Statistics, 30(2):199, 2015.
[42] Gernot Wassmer and Werner Brannath. Group Sequential and Confirmatory Adaptive Designs in Clinical Trials. Springer, 2016.
[43] Jiayu Yao, Emma Brunskill, Weiwei Pan, Susan Murphy, and Finale Doshi-Velez. Power-constrained bandits. arXiv preprint arXiv:2004.06230, 2020.
[44] Elad Yom-Tov, Guy Feraru, Mark Kozdoba, Shie Mannor, Moshe Tennenholtz, and Irit Hochberg. Encouraging physical activity in patients with diabetes: intervention using a reinforcement learning system. Journal of Medical Internet Research, 19(10):e338, 2017.

A Simulation Details

A.1 W-Decorrelated Estimator

For the W-decorrelated estimator [6], for a batch size of n and for T batches, we set λ to be the $\frac{1}{nT}$ quantile of $\lambda_{\min}(X^\top X)/\log(nT)$, where $\lambda_{\min}(X^\top X)$ denotes the minimum eigenvalue of $X^\top X$. This procedure for choosing λ is motivated by the conditions of Theorem 4 of [6] and follows the methods used by [6] in their simulation experiments. We had to adjust the original procedure for choosing λ used by [6] (who set λ to the 0.15 quantile of $\lambda_{\min}(X^\top X)$), because they only evaluated the W-decorrelated method when the total number of samples was nT = 1000, and valid values of λ change with the sample size.
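To make the quantile procedure concrete, here is a minimal sketch. The `choose_lambda` helper, the Monte Carlo loop, and the Gaussian toy designs are our own illustrative assumptions, not the authors' code:

```python
import numpy as np

def choose_lambda(design_matrices, n, T):
    """Set lambda to the 1/(nT) empirical quantile of
    lambda_min(X^T X) / log(nT) across simulated design matrices."""
    vals = [np.linalg.eigvalsh(X.T @ X)[0] / np.log(n * T)  # eigvalsh: ascending order
            for X in design_matrices]
    return float(np.quantile(vals, 1.0 / (n * T)))

# Toy usage with Gaussian designs (illustrative only).
rng = np.random.default_rng(0)
n, T = 25, 2
sims = [rng.standard_normal((n * T, 2)) for _ in range(200)]
lam = choose_lambda(sims, n, T)
```

In practice the design matrices would come from simulated bandit runs rather than i.i.d. Gaussian draws.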

A.2 AW-AIPW Estimator

Since the AW-AIPW test statistic for the treatment effect is not explicitly written in the original paper [11], we now write the formulas for the AW-AIPW estimator of the treatment effect: $\hat\Delta^{\textrm{AW-AIPW}} := \hat\beta_1^{\textrm{AW-AIPW}} - \hat\beta_0^{\textrm{AW-AIPW}}$. We use the variance-stabilizing weights, equal to the square root of the sampling probabilities, $\sqrt{\pi_t^{(n)}}$ and $\sqrt{1-\pi_t^{(n)}}$. Below, $N_{t,1} = \sum_{i=1}^n A_{t,i}$ and $N_{t,0} = \sum_{i=1}^n (1-A_{t,i})$.
\[ Y_{t,1} := \frac{A_{t,i}}{\pi_t^{(n)}} R_{t,i} + \bigg(1 - \frac{A_{t,i}}{\pi_t^{(n)}}\bigg) \frac{\sum_{t'=1}^{t-1}\sum_{i=1}^n A_{t',i} R_{t',i}}{\sum_{t'=1}^{t-1} N_{t',1}} \]
\[ Y_{t,0} := \frac{1-A_{t,i}}{1-\pi_t^{(n)}} R_{t,i} + \bigg(1 - \frac{1-A_{t,i}}{1-\pi_t^{(n)}}\bigg) \frac{\sum_{t'=1}^{t-1}\sum_{i=1}^n (1-A_{t',i}) R_{t',i}}{\sum_{t'=1}^{t-1} N_{t',0}} \]
\[ \hat\beta_1^{\textrm{AW-AIPW}} := \frac{\sum_{t=1}^T \sum_{i=1}^n \sqrt{\pi_t^{(n)}}\, Y_{t,1}}{\sum_{t=1}^T \sum_{i=1}^n \sqrt{\pi_t^{(n)}}} \quad\textrm{and}\quad \hat\beta_0^{\textrm{AW-AIPW}} := \frac{\sum_{t=1}^T \sum_{i=1}^n \sqrt{1-\pi_t^{(n)}}\, Y_{t,0}}{\sum_{t=1}^T \sum_{i=1}^n \sqrt{1-\pi_t^{(n)}}} \]

The variance estimator for $\hat\Delta^{\textrm{AW-AIPW}}$ is $\hat V_0 + \hat V_1 + 2\hat C_{0,1}$, where

\[ \hat V_1 := \frac{\sum_{t=1}^T\sum_{i=1}^n \pi_t^{(n)} \big(Y_{t,1} - \hat\beta_1^{\textrm{AW-AIPW}}\big)^2}{\Big(\sum_{t=1}^T\sum_{i=1}^n \sqrt{\pi_t^{(n)}}\Big)^2} \quad\textrm{and}\quad \hat V_0 := \frac{\sum_{t=1}^T\sum_{i=1}^n \big(1-\pi_t^{(n)}\big) \big(Y_{t,0} - \hat\beta_0^{\textrm{AW-AIPW}}\big)^2}{\Big(\sum_{t=1}^T\sum_{i=1}^n \sqrt{1-\pi_t^{(n)}}\Big)^2} \]
\[ \hat C_{0,1} := -\frac{\sum_{t=1}^T\sum_{i=1}^n \sqrt{\pi_t^{(n)}\big(1-\pi_t^{(n)}\big)}\, \big(Y_{t,1} - \hat\beta_1^{\textrm{AW-AIPW}}\big)\big(Y_{t,0} - \hat\beta_0^{\textrm{AW-AIPW}}\big)}{\Big(\sum_{t=1}^T\sum_{i=1}^n \sqrt{\pi_t^{(n)}}\Big)\Big(\sum_{t=1}^T\sum_{i=1}^n \sqrt{1-\pi_t^{(n)}}\Big)} \]
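As a concrete illustration of the estimator, here is a minimal sketch of the AW-AIPW point estimate with variance-stabilizing weights. The array layout, the zero fallback for batches with no past data, and the simulated data are our own assumptions; only the point estimate is computed, not the variance:

```python
import numpy as np

def aw_aipw_margin(A, R, pi):
    """AW-AIPW margin estimate with weights sqrt(pi_t), sqrt(1 - pi_t).

    A, R: (T, n) arrays of actions and rewards; pi: length-T array of
    arm-1 sampling probabilities. Batch-t pseudo-outcomes plug in the
    running mean of rewards from batches 1..t-1 (0 if none yet)."""
    T, n = A.shape
    b1_num = b1_den = b0_num = b0_den = 0.0
    for t in range(T):
        past_A, past_R = A[:t].ravel(), R[:t].ravel()
        mu1 = past_R[past_A == 1].mean() if (past_A == 1).any() else 0.0
        mu0 = past_R[past_A == 0].mean() if (past_A == 0).any() else 0.0
        Y1 = (A[t] / pi[t]) * R[t] + (1 - A[t] / pi[t]) * mu1
        Y0 = ((1 - A[t]) / (1 - pi[t])) * R[t] \
            + (1 - (1 - A[t]) / (1 - pi[t])) * mu0
        b1_num += np.sqrt(pi[t]) * Y1.sum()
        b1_den += n * np.sqrt(pi[t])
        b0_num += np.sqrt(1 - pi[t]) * Y0.sum()
        b0_den += n * np.sqrt(1 - pi[t])
    return b1_num / b1_den - b0_num / b0_den

rng = np.random.default_rng(1)
T, n = 10, 25
pi = np.full(T, 0.5)
A = rng.binomial(1, pi[:, None] * np.ones((T, n)))
R = A * 1.0 + rng.standard_normal((T, n))  # true margin Delta = 1
delta_hat = aw_aipw_margin(A, R, pi)
```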

A.3 Self-Normalized Martingale Bound

By the self-normalized martingale bound of [1], specifically Theorem 1 and Lemma 6, we have that in the two-arm bandit setting,
\[ P\Big( \forall\, T, n \ge 1,\; \big|\hat\beta_1^{\textrm{OLS}} - \beta_1\big| \le c_{1,T} \textrm{ and } \big|\hat\beta_0^{\textrm{OLS}} - \beta_0\big| \le c_{0,T} \Big) \ge 1-\delta \]
where
\[ c_{a,T} = \sqrt{ \sigma^2\, \frac{2\big(1+\sum_{t=1}^T N_{t,a}\big)}{\big(\sum_{t=1}^T N_{t,a}\big)^2} \bigg(1 + 2\log\Big(\frac{\sqrt{1+\sum_{t=1}^T N_{t,a}}}{\delta}\Big)\bigg) }. \]
We estimate σ² using the procedure stated below for the OLS estimator. We reject the null hypothesis that ∆ = 0 whenever the confidence bounds for the two arms are non-overlapping, specifically when
\[ \hat\beta_1^{\textrm{OLS}} + c_{1,T} \le \hat\beta_0^{\textrm{OLS}} - c_{0,T} \quad\textrm{or}\quad \hat\beta_0^{\textrm{OLS}} + c_{0,T} \le \hat\beta_1^{\textrm{OLS}} - c_{1,T}. \]
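A small sketch of this decision rule follows. The radius formula here is our own reading of the bound in [1] (the exact constants should be checked against the original theorem), and the helper names are our own:

```python
import numpy as np

def snm_radius(N_a, sigma2, delta):
    """Confidence radius c_{a,T} for arm a after N_a total pulls, per
    one reading of the self-normalized bound of [1] (constants are an
    assumption here, not a verbatim transcription)."""
    S = float(N_a)
    return np.sqrt(sigma2 * 2.0 * (1.0 + S) / S**2
                   * (1.0 + 2.0 * np.log(np.sqrt(1.0 + S) / delta)))

def reject_null(b1, b0, N1, N0, sigma2=1.0, delta=0.05):
    """Reject H0: Delta = 0 when the two arms' confidence intervals
    do not overlap."""
    c1 = snm_radius(N1, sigma2, delta)
    c0 = snm_radius(N0, sigma2, delta)
    return bool(b1 + c1 <= b0 - c0 or b0 + c0 <= b1 - c1)
```

With 100 pulls per arm and δ = 0.05, the radius is roughly 0.5, so a margin of 5 is rejected while a margin of 0.1 is not.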

A.4 Estimating Noise Variance

OLS Estimator Given the OLS estimators for the means of each arm, $\hat\beta_1^{\textrm{OLS}}, \hat\beta_0^{\textrm{OLS}}$, we estimate the noise variance σ² as follows:
\[ \hat\sigma^2 := \frac{1}{nT-2} \sum_{t=1}^T \sum_{i=1}^n \Big( R_{t,i} - A_{t,i}\hat\beta_1^{\textrm{OLS}} - (1-A_{t,i})\hat\beta_0^{\textrm{OLS}} \Big)^2. \]
We use a degrees-of-freedom bias correction by normalizing by nT − 2 rather than nT. Since the W-decorrelated estimator is a modified version of the OLS estimator, we also use this same noise variance estimator for the W-decorrelated estimator; we found that this worked well in practice in terms of Type-1 error control.
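The pooled noise-variance estimate is straightforward to compute; a minimal sketch (the array shapes and the simulated data-generating values are our own choices):

```python
import numpy as np

def ols_noise_variance(A, R):
    """Arm-mean OLS estimates and the degrees-of-freedom-corrected
    noise variance, normalizing by nT - 2 rather than nT."""
    A, R = A.ravel(), R.ravel()
    b1, b0 = R[A == 1].mean(), R[A == 0].mean()
    resid = R - A * b1 - (1 - A) * b0
    return b1, b0, (resid ** 2).sum() / (R.size - 2)

rng = np.random.default_rng(2)
A = rng.binomial(1, 0.5, size=500)
R = 2.0 * A + rng.standard_normal(500)  # beta1 = 2, beta0 = 0, sigma^2 = 1
b1, b0, s2 = ols_noise_variance(A, R)
```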

Batched OLS Given the Batched OLS estimators for the means of each arm for each batch, $\hat\beta_{t,1}^{\textrm{BOLS}}, \hat\beta_{t,0}^{\textrm{BOLS}}$, we estimate the noise variance for each batch $\sigma_t^2$ as follows:
\[ \hat\sigma_t^2 := \frac{1}{n-2} \sum_{i=1}^n \Big( R_{t,i} - A_{t,i}\hat\beta_{t,1}^{\textrm{BOLS}} - (1-A_{t,i})\hat\beta_{t,0}^{\textrm{BOLS}} \Big)^2. \]
Again, we use a degrees-of-freedom bias correction by normalizing by n − 2 rather than n. We prove the consistency of $\hat\sigma_t^2$ (meaning $\hat\sigma_t^2 \overset{P}{\to} \sigma^2$) in Corollary 4. Using BOLS to test H0 : ∆ = a vs. H1 : ∆ ≠ a, we use the following test statistic:

\[ \frac{1}{\sqrt{T}} \sum_{t=1}^T \sqrt{\frac{N_{t,0}N_{t,1}}{n\hat\sigma_t^2}}\, \big(\hat\Delta_t^{\textrm{BOLS}} - a\big). \]
Above, $N_{t,1} = \sum_{i=1}^n A_{t,i}$ and $N_{t,0} = \sum_{i=1}^n (1-A_{t,i})$. For this test statistic, we use cutoffs based on the Student-t distribution, i.e., for $Y_t \overset{\textrm{i.i.d.}}{\sim} t_{n-2}$ we use a cutoff $c_{\alpha/2}$ such that
\[ P\bigg( \bigg| \frac{1}{\sqrt{T}} \sum_{t=1}^T Y_t \bigg| > c_{\alpha/2} \bigg) = \alpha. \]

We found cα/2 by simulating draws from the Student-t distribution.
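The statistic and the simulated cutoff can be sketched as follows; the function names, the Monte Carlo settings, and the simulated data are our own illustrative choices:

```python
import numpy as np

def bols_statistic(A, R, a=0.0):
    """BOLS test statistic for H0: Delta = a, using a per-batch
    noise-variance estimate (normalized by n - 2)."""
    T, n = A.shape
    total = 0.0
    for t in range(T):
        At, Rt = A[t], R[t]
        N1 = At.sum()
        N0 = n - N1
        b1, b0 = Rt[At == 1].mean(), Rt[At == 0].mean()
        resid = Rt - At * b1 - (1 - At) * b0
        s2 = (resid ** 2).sum() / (n - 2)
        total += np.sqrt(N0 * N1 / (n * s2)) * ((b1 - b0) - a)
    return total / np.sqrt(T)

def simulated_cutoff(T, n, alpha=0.05, reps=50_000, seed=0):
    """Monte Carlo two-sided cutoff for the sum of T iid t_{n-2}
    draws scaled by 1/sqrt(T)."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_t(n - 2, size=(reps, T)).sum(axis=1) / np.sqrt(T)
    return float(np.quantile(np.abs(Z), 1.0 - alpha))

rng = np.random.default_rng(3)
T, n = 5, 25
A = rng.binomial(1, 0.5, size=(T, n))
R = rng.standard_normal((T, n))  # data generated under H0
stat, cut = bols_statistic(A, R), simulated_cutoff(T, n)
```

The test rejects when `abs(stat) > cut`; for small n the simulated cutoff is slightly wider than the Normal one, reflecting the heavier t tails.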

A.5 Non-Stationary Treatment Effect

When we believe that the margin itself varies from batch to batch, we are able to construct a confidence region that contains the true margin ∆_t for each batch simultaneously with probability 1 − α.

Corollary 2 (Confidence band for margin for non-stationary bandits). Assume the same conditions as Theorem 3. Let $z_x$ be the $x$th quantile of the standard Normal distribution, i.e., for Z ∼ N(0, 1), $P(Z < z_x) = x$. For each t ∈ [1 : T], we define the interval

\[ L_t = \hat\Delta_t^{\textrm{OLS}} \pm z_{1-\frac{\alpha}{2T}} \sqrt{\frac{\sigma^2 n}{N_{t,0}N_{t,1}}}. \]
Then $\lim_{n\to\infty} P\big( \forall\, t \in [1:T],\; \Delta_t \in L_t \big) \ge 1-\alpha$. Above, $N_{t,1} = \sum_{i=1}^n A_{t,i}$ and $N_{t,0} = \sum_{i=1}^n (1-A_{t,i})$.

Proof: Note that by Corollary 3,
\[ P\big( \exists\, t \in [1:T] \textrm{ s.t. } \Delta_t \notin L_t \big) \le \sum_{t=1}^T P\big( \Delta_t \notin L_t \big) \to \sum_{t=1}^T \frac{\alpha}{T} = \alpha, \]
where the limit is as n → ∞. Since
\[ P\big( \forall\, t \in [1:T],\; \Delta_t \in L_t \big) = 1 - P\big( \exists\, t \in [1:T] \textrm{ s.t. } \Delta_t \notin L_t \big), \]
we have
\[ \lim_{n\to\infty} P\big( \forall\, t \in [1:T],\; \Delta_t \in L_t \big) \ge 1-\alpha. \]
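The Bonferroni-style band is easy to compute in practice; a minimal sketch (we use the stdlib `NormalDist` for the Normal quantile, and the data-generating setup is our own):

```python
import numpy as np
from statistics import NormalDist

def margin_band(A, R, sigma2, alpha=0.05):
    """Simultaneous (1 - alpha) confidence intervals for the
    per-batch margins Delta_t, splitting alpha over the T batches."""
    T, n = A.shape
    z = NormalDist().inv_cdf(1.0 - alpha / (2 * T))
    bands = []
    for t in range(T):
        N1 = A[t].sum()
        N0 = n - N1
        d = R[t][A[t] == 1].mean() - R[t][A[t] == 0].mean()
        half = z * np.sqrt(sigma2 * n / (N0 * N1))
        bands.append((d - half, d + half))
    return bands

rng = np.random.default_rng(4)
T, n = 5, 25
A = rng.binomial(1, 0.5, size=(T, n))
R = 0.5 * A + rng.standard_normal((T, n))  # Delta_t = 0.5 in every batch
bands = margin_band(A, R, sigma2=1.0)
```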

We can also test the null hypothesis of no margin against the alternative that at least one batch has non-zero margin, i.e., H0 : ∀t ∈ [1 : T], ∆_t = 0 vs. H1 : ∃t ∈ [1 : T] s.t. ∆_t ≠ 0. Note that the global null stated above is of great interest in the mobile health literature [23, 30]. Specifically, we use the following test statistic:
\[ \sum_{t=1}^T \frac{N_{t,0}N_{t,1}}{\sigma^2 n} \big(\hat\Delta_t^{\textrm{OLS}} - 0\big)^2, \]
which by Theorem 3 converges in distribution to a chi-squared distribution with T degrees of freedom under the null ∆_t = 0 for all t. To account for estimating the noise variance σ², in our simulations for this test statistic we use cutoffs based on the Student-t distribution, i.e., for $Y_t \overset{\textrm{i.i.d.}}{\sim} t_{n-2}$ we use a cutoff $c_\alpha$ such that
\[ P\bigg( \sum_{t=1}^T Y_t^2 > c_\alpha \bigg) = \alpha. \]

We found cα by simulating draws from the Student-t distribution. In the plots below we call the test statistic in (A.5) “BOLS Non-Stationary Treatment Effect” (BOLS NSTE). BOLS NSTE performs poorly in terms of power compared to other test statistics in the stationary setting; however, in the non-stationary setting, BOLS NSTE significantly outperforms all other test statistics, which tend to have low power when the average treatment effect is close to zero. Note that the W-decorrelated estimator performs well in the left plot of Figure 8; this is because as we show in Appendix F, the W-decorrelated estimator upweights samples from the earlier batches in the study. So when the treatment effect is large in the beginning of the study, the W-decorrelated estimator has high power and when the treatment effect is small or zero in the beginning of the study, the W-decorrelated estimator has low power.
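The global-null statistic can be sketched as follows; the data-generating choices below are our own, used only to contrast the statistic's behavior under the null and under a large per-batch margin:

```python
import numpy as np

def global_null_statistic(A, R, sigma2):
    """Statistic for H0: Delta_t = 0 for all t; approximately
    chi-squared with T degrees of freedom under H0."""
    T, n = A.shape
    total = 0.0
    for t in range(T):
        N1 = A[t].sum()
        N0 = n - N1
        d = R[t][A[t] == 1].mean() - R[t][A[t] == 0].mean()
        total += (N0 * N1 / (sigma2 * n)) * d ** 2
    return total

rng = np.random.default_rng(5)
T, n = 5, 25
A = rng.binomial(1, 0.5, size=(T, n))
R_null = rng.standard_normal((T, n))           # all margins zero
R_alt = 2.0 * A + rng.standard_normal((T, n))  # margin 2 in every batch
s_null = global_null_statistic(A, R_null, 1.0)
s_alt = global_null_statistic(A, R_alt, 1.0)
```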

Figure 6: Stationary Setting: Type-1 error for a two-sided test of H0 : ∆ = 0 vs. H1 : ∆ ≠ 0 (α = 0.05). We set β1 = β0 = 0, n = 25, and a clipping constraint of 0.1 ≤ $\pi_t^{(n)}$ ≤ 0.9. We use 100k Monte Carlo simulations and standard errors are < 0.001.

Figure 7: Stationary Setting: Power for a two-sided test of H0 : ∆ = 0 vs. H1 : ∆ ≠ 0 (α = 0.05). We set β1 = 0, β0 = 0.25, n = 25, and a clipping constraint of 0.1 ≤ $\pi_t^{(n)}$ ≤ 0.9. We use 100k Monte Carlo simulations and standard errors are < 0.002. We account for Type-1 error inflation as described in Section 6.

Figure 8: Non-stationary setting: The two upper plots display the power of estimators for a two-sided test of H0 : ∀t ∈ [1 : T], β_{t,1} − β_{t,0} = 0 vs. H1 : ∃t ∈ [1 : T], β_{t,1} − β_{t,0} ≠ 0 (α = 0.05). The two lower plots display two treatment effect trends; the left plot considers a decreasing trend (quadratic function) and the right plot considers an oscillating trend (sine function). We set n = 25 and a clipping constraint of 0.1 ≤ $\pi_t^{(n)}$ ≤ 0.9. We use 100k Monte Carlo simulations and standard errors are < 0.002.

B Asymptotic Normality of the OLS Estimator

Condition 6 (Weak moments). For all t, n, i, $E[\epsilon_{t,i}^2 \mid \mathcal{G}_{t-1}^{(n)}] = \sigma^2$, and for all t, n, i, $E[\varphi(\epsilon_{t,i}^2) \mid \mathcal{G}_{t-1}^{(n)}] < M < \infty$ a.s. for some function φ where $\lim_{x\to\infty} \varphi(x)/x = \infty$.

Condition 7 (Stability). There exists a sequence of nonrandom positive-definite symmetric matrices, $V_n$, such that

(a) $V_n^{-1} \big( \sum_{t=1}^T \sum_{i=1}^n X_{t,i} X_{t,i}^\top \big)^{1/2} = V_n^{-1} (X^\top X)^{1/2} \overset{P}{\to} I_p$

(b) $\max_{i \in [1:n],\, t \in [1:T]} \big\| V_n^{-1} X_{t,i} \big\|_2 \overset{P}{\to} 0$

Theorem 5 (Triangular array version of Lai & Wei (1982), Theorem 3). Let $X_{t,i} \in \mathbb{R}^p$ be non-anticipating with respect to filtration $\{\mathcal{G}_t^{(n)}\}_{t=1}^T$, so $X_{t,i}$ is $\mathcal{G}_{t-1}^{(n)}$-measurable. We assume the following conditional mean model for rewards:

\[ E\big[ R_{t,i} \mid \mathcal{G}_{t-1}^{(n)} \big] = X_{t,i}^\top \beta. \]
We define $\epsilon_{t,i} := R_{t,i} - X_{t,i}^\top \beta$. Note that $\{\epsilon_{t,i}\}_{i=1,t=1}^{i=n,t=T}$ is a martingale difference array with respect to filtration $\{\mathcal{G}_t^{(n)}\}_{t=1}^T$. Assuming Conditions 6 and 7, as n → ∞,
\[ (X^\top X)^{1/2} \big( \hat\beta^{\textrm{OLS}} - \beta \big) \overset{D}{\to} \mathcal{N}(0, \sigma^2 I_p). \]
Note, in the body of the paper we state that this theorem holds in the two-arm bandit case assuming Conditions 2 and 1. Note that Condition 1 is sufficient for Condition 6 and Condition 2 is sufficient for Condition 7 in the two-arm bandit case.

Proof:
\[ \hat\beta^{\textrm{OLS}} = (X^\top X)^{-1} X^\top R = (X^\top X)^{-1} X^\top (X\beta + \epsilon) \]
\[ \hat\beta^{\textrm{OLS}} - \beta = (X^\top X)^{-1} X^\top \epsilon = \bigg( \sum_{t=1}^T \sum_{i=1}^n X_{t,i} X_{t,i}^\top \bigg)^{-1} \sum_{t=1}^T \sum_{i=1}^n X_{t,i} \epsilon_{t,i} \]
It is sufficient to show that as n → ∞:
\[ (X^\top X)^{-1/2} \sum_{t=1}^T \sum_{i=1}^n X_{t,i} \epsilon_{t,i} \overset{D}{\to} \mathcal{N}(0, \sigma^2 I_p) \]
By Slutsky's Theorem and Condition 7 (a), it is also sufficient to show that as n → ∞,
\[ V_n^{-1} \sum_{t=1}^T \sum_{i=1}^n X_{t,i} \epsilon_{t,i} \overset{D}{\to} \mathcal{N}(0, \sigma^2 I_p) \]

By the Cramér-Wold device, to show multivariate normality it is sufficient to show that for any fixed $c \in \mathbb{R}^p$ s.t. $\|c\|_2 = 1$, as n → ∞,
\[ c^\top V_n^{-1} \sum_{t=1}^T \sum_{i=1}^n X_{t,i} \epsilon_{t,i} \overset{D}{\to} \mathcal{N}(0, \sigma^2) \]
We will prove this central limit theorem by using a triangular array martingale central limit theorem, specifically Theorem 2.2 of [8]. We will do this by letting $Y_{t,i} = c^\top V_n^{-1} X_{t,i} \epsilon_{t,i}$. The theorem states that as n → ∞, $\sum_{t=1}^T \sum_{i=1}^n Y_{t,i} \overset{D}{\to} \mathcal{N}(0, \sigma^2)$ if the following conditions hold as n → ∞:

(a) $\sum_{t=1}^T \sum_{i=1}^n E\big[ Y_{t,i} \mid \mathcal{G}_{t-1}^{(n)} \big] \overset{P}{\to} 0$

(b) $\sum_{t=1}^T \sum_{i=1}^n E\big[ Y_{t,i}^2 \mid \mathcal{G}_{t-1}^{(n)} \big] \overset{P}{\to} \sigma^2$

(c) $\forall\, \delta > 0$, $\sum_{t=1}^T \sum_{i=1}^n E\big[ Y_{t,i}^2 \mathbb{I}_{(|Y_{t,i}| > \delta)} \mid \mathcal{G}_{t-1}^{(n)} \big] \overset{P}{\to} 0$

Useful Properties Note that by Cauchy-Schwarz and Condition 7 (b), as n → ∞,
\[ \max_{i \in [1:n],\, t \in [1:T]} \big| c^\top V_n^{-1} X_{t,i} \big| \le \max_{i \in [1:n],\, t \in [1:T]} \|c\|_2 \big\| V_n^{-1} X_{t,i} \big\|_2 \overset{P}{\to} 0 \]
By the continuous mapping theorem, and since the square function on non-negative inputs is order preserving, as n → ∞,
\[ \max_{i \in [1:n],\, t \in [1:T]} \big( c^\top V_n^{-1} X_{t,i} \big)^2 = \Big( \max_{i \in [1:n],\, t \in [1:T]} \big| c^\top V_n^{-1} X_{t,i} \big| \Big)^2 \overset{P}{\to} 0 \quad (5) \]

By Condition 7 (a) and the continuous mapping theorem, $c^\top V_n^{-1} (X^\top X)^{1/2} \overset{P}{\to} c^\top$, so
\[ c^\top V_n^{-1} (X^\top X)^{1/2} (X^\top X)^{1/2} V_n^{-1} c \overset{P}{\to} c^\top c = 1 \]
Thus,
\[ c^\top V_n^{-1} \bigg( \sum_{t=1}^T \sum_{i=1}^n X_{t,i} X_{t,i}^\top \bigg) V_n^{-1} c = \sum_{t=1}^T \sum_{i=1}^n c^\top V_n^{-1} X_{t,i} X_{t,i}^\top V_n^{-1} c \overset{P}{\to} 1 \]
Since $c^\top V_n^{-1} X_{t,i}$ is a scalar, as n → ∞,
\[ \sum_{t=1}^T \sum_{i=1}^n \big( c^\top V_n^{-1} X_{t,i} \big)^2 \overset{P}{\to} 1 \quad (6) \]

Condition (a): Martingale

\[ \sum_{t=1}^T \sum_{i=1}^n E\big[ c^\top V_n^{-1} X_{t,i} \epsilon_{t,i} \mid \mathcal{G}_{t-1}^{(n)} \big] = \sum_{t=1}^T \sum_{i=1}^n c^\top V_n^{-1} X_{t,i}\, E\big[ \epsilon_{t,i} \mid \mathcal{G}_{t-1}^{(n)} \big] = 0 \]

Condition (b): Conditional Variance

\[ \sum_{t=1}^T \sum_{i=1}^n E\big[ (c^\top V_n^{-1} X_{t,i})^2 \epsilon_{t,i}^2 \mid \mathcal{G}_{t-1}^{(n)} \big] = \sum_{t=1}^T \sum_{i=1}^n (c^\top V_n^{-1} X_{t,i})^2\, E\big[ \epsilon_{t,i}^2 \mid \mathcal{G}_{t-1}^{(n)} \big] = \sigma^2 \sum_{t=1}^T \sum_{i=1}^n (c^\top V_n^{-1} X_{t,i})^2 \overset{P}{\to} \sigma^2 \]
where the second equality holds by Condition 6 and the limit holds by (6) as n → ∞.

Condition (c): Lindeberg Condition Let δ > 0. We want to show that as n → ∞,
\[ \sum_{t=1}^T \sum_{i=1}^n Z_{t,i}^2\, E\big[ \epsilon_{t,i}^2 \mathbb{I}_{(Z_{t,i}^2 \epsilon_{t,i}^2 > \delta^2)} \mid \mathcal{G}_{t-1}^{(n)} \big] \overset{P}{\to} 0 \]
where above, we define $Z_{t,i} := c^\top V_n^{-1} X_{t,i}$. By Condition 6, we have that for all n ≥ 1,
\[ \max_{t \in [1:T],\, i \in [1:n]} E\big[ \varphi(\epsilon_{t,i}^2) \mid \mathcal{G}_{t-1}^{(n)} \big] < M \]
Since we assume that $\lim_{x\to\infty} \varphi(x)/x = \infty$, for all m ≥ 1 there exists a $b_m$ s.t. $\varphi(x) \ge mMx$ for all $x \ge b_m$. So, for all n, t, i,
\[ M \ge E\big[ \varphi(\epsilon_{t,i}^2) \mid \mathcal{G}_{t-1}^{(n)} \big] \ge E\big[ \varphi(\epsilon_{t,i}^2) \mathbb{I}_{(\epsilon_{t,i}^2 \ge b_m)} \mid \mathcal{G}_{t-1}^{(n)} \big] \ge mM\, E\big[ \epsilon_{t,i}^2 \mathbb{I}_{(\epsilon_{t,i}^2 \ge b_m)} \mid \mathcal{G}_{t-1}^{(n)} \big] \]
Thus,
\[ \max_{t \in [1:T],\, i \in [1:n]} E\big[ \epsilon_{t,i}^2 \mathbb{I}_{(\epsilon_{t,i}^2 \ge b_m)} \mid \mathcal{G}_{t-1}^{(n)} \big] \le \frac{1}{m} \]
So we have that
\[ \sum_{t=1}^T \sum_{i=1}^n Z_{t,i}^2\, E\big[ \epsilon_{t,i}^2 \mathbb{I}_{(Z_{t,i}^2 \epsilon_{t,i}^2 > \delta^2)} \mid \mathcal{G}_{t-1}^{(n)} \big] \]
\[ = \sum_{t=1}^T \sum_{i=1}^n Z_{t,i}^2 \Big( E\big[ \epsilon_{t,i}^2 \mathbb{I}_{(Z_{t,i}^2 \epsilon_{t,i}^2 > \delta^2)} \mid \mathcal{G}_{t-1}^{(n)} \big] \mathbb{I}_{(Z_{t,i}^2 \le \delta^2/b_m)} + E\big[ \epsilon_{t,i}^2 \mathbb{I}_{(Z_{t,i}^2 \epsilon_{t,i}^2 > \delta^2)} \mid \mathcal{G}_{t-1}^{(n)} \big] \mathbb{I}_{(Z_{t,i}^2 > \delta^2/b_m)} \Big) \]
\[ \le \sum_{t=1}^T \sum_{i=1}^n Z_{t,i}^2 \Big( E\big[ \epsilon_{t,i}^2 \mathbb{I}_{(\epsilon_{t,i}^2 > b_m)} \mid \mathcal{G}_{t-1}^{(n)} \big] + \sigma^2 \mathbb{I}_{(Z_{t,i}^2 > \delta^2/b_m)} \Big) \]
\[ \le \Big( \frac{1}{m} + \sigma^2 \mathbb{I}_{(\max_{t' \in [1:T],\, j \in [1:n]} Z_{t',j}^2 > \delta^2/b_m)} \Big) \sum_{t=1}^T \sum_{i=1}^n Z_{t,i}^2 \]
By Slutsky's Theorem and (6), it is sufficient to show that as n → ∞,
\[ \frac{1}{m} + \sigma^2 \mathbb{I}_{(\max_{t' \in [1:T],\, j \in [1:n]} Z_{t',j}^2 > \delta^2/b_m)} \overset{P}{\to} 0 \]
For any ε > 0,
\[ P\Big( \frac{1}{m} + \sigma^2 \mathbb{I}_{(\max_{t',j} Z_{t',j}^2 > \delta^2/b_m)} > \varepsilon \Big) \le \mathbb{I}_{(\frac{1}{m} > \frac{\varepsilon}{2})} + P\Big( \sigma^2 \mathbb{I}_{(\max_{t',j} Z_{t',j}^2 > \delta^2/b_m)} > \frac{\varepsilon}{2} \Big) \]
We can choose m such that $\frac{1}{m} \le \frac{\varepsilon}{2}$, so $\mathbb{I}_{(\frac{1}{m} > \frac{\varepsilon}{2})} = 0$. For the second term (note that m is now fixed),
\[ P\Big( \sigma^2 \mathbb{I}_{(\max_{t',j} Z_{t',j}^2 > \delta^2/b_m)} > \frac{\varepsilon}{2} \Big) \le P\Big( \max_{t' \in [1:T],\, j \in [1:n]} Z_{t',j}^2 > \delta^2/b_m \Big) \to 0 \]
where the last limit holds by (5) as n → ∞.

B.1 Corollary 1 (Sufficient conditions for Theorem 5)

Under Conditions 1 and 3, when the treatment effect is non-zero, data collected in batches using ε-greedy, Thompson Sampling, or UCB with a fixed clipping constraint (see Definition 1) will satisfy the conditions of Theorem 5.

Proof: The only condition of Theorem 5 that needs to be verified is Condition 2. To satisfy Condition 2, it is sufficient to show that for any given ∆, for some constant c ∈ (0, T),
\[ \frac{1}{n} \sum_{t=1}^T \sum_{i=1}^n A_{t,i} = \frac{1}{n} \sum_{t=1}^T N_{t,1} \overset{P}{\to} c. \]

ε-greedy We assume without loss of generality that ∆ > 0 and $\pi_1^{(n)} = \frac{1}{2}$. Recall that for ε-greedy, for a ∈ [2 : T],
\[ \pi_a^{(n)} = \begin{cases} 1 - \frac{\varepsilon}{2} & \textrm{if } \dfrac{\sum_{t=1}^{a-1} \sum_{i=1}^n A_{t,i} R_{t,i}}{\sum_{t'=1}^{a-1} N_{t',1}} > \dfrac{\sum_{t=1}^{a-1} \sum_{i=1}^n (1-A_{t,i}) R_{t,i}}{\sum_{t'=1}^{a-1} N_{t',0}} \\ \frac{\varepsilon}{2} & \textrm{otherwise} \end{cases} \]
Thus to show that $\pi_a^{(n)} \overset{P}{\to} 1 - \frac{\varepsilon}{2}$ for all a ∈ [2 : T], it is sufficient to show that
\[ P\bigg( \frac{\sum_{t=1}^{a-1} \sum_{i=1}^n A_{t,i} R_{t,i}}{\sum_{t'=1}^{a-1} N_{t',1}} > \frac{\sum_{t=1}^{a-1} \sum_{i=1}^n (1-A_{t,i}) R_{t,i}}{\sum_{t'=1}^{a-1} N_{t',0}} \bigg) \to 1 \quad (7) \]
To show (7), it is equivalent to show that
\[ P\bigg( \Delta > \frac{\sum_{t=1}^{a-1} \sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i}}{\sum_{t'=1}^{a-1} N_{t',0}} - \frac{\sum_{t=1}^{a-1} \sum_{i=1}^n A_{t,i}\epsilon_{t,i}}{\sum_{t'=1}^{a-1} N_{t',1}} \bigg) \to 1 \quad (8) \]
To show (8), it is sufficient to show that
\[ \frac{\sum_{t=1}^{a-1} \sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i}}{\sum_{t'=1}^{a-1} N_{t',0}} - \frac{\sum_{t=1}^{a-1} \sum_{i=1}^n A_{t,i}\epsilon_{t,i}}{\sum_{t'=1}^{a-1} N_{t',1}} \overset{P}{\to} 0. \quad (9) \]
To show (9), it is equivalent to show that
\[ \sum_{t=1}^{a-1} \frac{\sqrt{N_{t,0}}}{\sum_{t'=1}^{a-1} N_{t',0}} \frac{\sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i}}{\sqrt{N_{t,0}}} - \sum_{t=1}^{a-1} \frac{\sqrt{N_{t,1}}}{\sum_{t'=1}^{a-1} N_{t',1}} \frac{\sum_{i=1}^n A_{t,i}\epsilon_{t,i}}{\sqrt{N_{t,1}}} \overset{P}{\to} 0. \quad (10) \]

By Lemma 1, for all t ∈ [1 : T],
\[ \frac{N_{t,1}}{\pi_t^{(n)} n} \overset{P}{\to} 1 \]
Thus by Slutsky's Theorem, to show (10), it is sufficient to show that
\[ \sum_{t=1}^{a-1} \frac{\sqrt{n(1-\pi_t^{(n)})}}{n \sum_{t'=1}^{a-1} (1-\pi_{t'}^{(n)})} \frac{\sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i}}{\sqrt{N_{t,0}}} - \sum_{t=1}^{a-1} \frac{\sqrt{n\pi_t^{(n)}}}{n \sum_{t'=1}^{a-1} \pi_{t'}^{(n)}} \frac{\sum_{i=1}^n A_{t,i}\epsilon_{t,i}}{\sqrt{N_{t,1}}} \overset{P}{\to} 0. \quad (11) \]
Since $\pi_t^{(n)} \in [\frac{\varepsilon}{2}, 1-\frac{\varepsilon}{2}]$ for all t, n, the left-hand side of (11) equals the following:
\[ \sum_{t=1}^{a-1} o_p(1)\, \frac{\sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i}}{\sqrt{N_{t,0}}} - \sum_{t=1}^{a-1} o_p(1)\, \frac{\sum_{i=1}^n A_{t,i}\epsilon_{t,i}}{\sqrt{N_{t,1}}} \overset{P}{\to} 0. \]
The above limit holds because by Theorem 3, we have that
\[ \bigg( \frac{\sum_{i=1}^n A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}}, \frac{\sum_{i=1}^n (1-A_{1,i})\epsilon_{1,i}}{\sqrt{N_{1,0}}}, \ldots, \frac{\sum_{i=1}^n A_{T,i}\epsilon_{T,i}}{\sqrt{N_{T,1}}}, \frac{\sum_{i=1}^n (1-A_{T,i})\epsilon_{T,i}}{\sqrt{N_{T,0}}} \bigg) \overset{D}{\to} \mathcal{N}(0, \sigma^2 I_{2T}). \quad (12) \]
Thus, by Slutsky's Theorem and Lemma 1, we have that
\[ \frac{1}{n}\sum_{t=1}^T N_{t,1} \overset{P}{\to} \frac{1}{2} + (T-1)\Big(1-\frac{\varepsilon}{2}\Big) \quad\textrm{and}\quad \frac{1}{n}\sum_{t=1}^T N_{t,0} \overset{P}{\to} \frac{1}{2} + (T-1)\frac{\varepsilon}{2} \]
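The limiting allocation fraction for ε-greedy can be checked numerically. The following toy simulation is our own (Gaussian rewards and the particular ε, T, ∆ values are illustrative assumptions):

```python
import numpy as np

def eps_greedy_arm1_fraction(n, T, eps, delta, rng):
    """Run one batched eps-greedy trial and return (1/n) * sum_t N_{t,1}.
    Batch 1 samples arm 1 w.p. 1/2; afterwards the arm with the higher
    running mean is played w.p. 1 - eps/2."""
    pi = 0.5
    total_arm1 = 0
    s1 = s0 = c1 = c0 = 0.0
    for _ in range(T):
        A = rng.binomial(1, pi, size=n)
        R = np.where(A == 1, delta, 0.0) + rng.standard_normal(n)
        s1 += R[A == 1].sum(); c1 += A.sum()
        s0 += R[A == 0].sum(); c0 += n - A.sum()
        total_arm1 += A.sum()
        pi = 1 - eps / 2 if s1 / c1 > s0 / c0 else eps / 2
    return total_arm1 / n

rng = np.random.default_rng(6)
eps, T, delta = 0.2, 5, 1.0
frac = np.mean([eps_greedy_arm1_fraction(2000, T, eps, delta, rng)
                for _ in range(20)])
expected = 0.5 + (T - 1) * (1 - eps / 2)  # limit from the display above
```

With a large margin (∆ = 1) and large batches, the empirical fraction sits close to 1/2 + (T − 1)(1 − ε/2), matching the stated limit.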

Thompson Sampling We assume without loss of generality that ∆ > 0 and $\pi_1^{(n)} = \frac{1}{2}$. Recall that for Thompson Sampling with independent standard normal priors ($\tilde\beta_1, \tilde\beta_0 \overset{\textrm{i.i.d.}}{\sim} \mathcal{N}(0,1)$), for a ∈ [2 : T],
\[ \pi_a^{(n)} = \pi_{\min} \vee \Big( \pi_{\max} \wedge P\big( \tilde\beta_1 > \tilde\beta_0 \mid \mathcal{H}_{a-1}^{(n)} \big) \Big) \]
Given the independent standard normal priors on $\tilde\beta_1, \tilde\beta_0$, we have the following posterior distribution:
\[ \tilde\beta_1 - \tilde\beta_0 \mid \mathcal{H}_{a-1}^{(n)} \sim \mathcal{N}\bigg( \frac{\sum_{t=1}^{a-1}\sum_{i=1}^n A_{t,i}R_{t,i}}{\sigma^2 + \sum_{t=1}^{a-1} N_{t,1}} - \frac{\sum_{t=1}^{a-1}\sum_{i=1}^n (1-A_{t,i})R_{t,i}}{\sigma^2 + \sum_{t=1}^{a-1} N_{t,0}},\; \frac{\sigma^2\big(\sigma^2 + \sum_{t=1}^{a-1} N_{t,1}\big) + \sigma^2\big(\sigma^2 + \sum_{t=1}^{a-1} N_{t,0}\big)}{\big(\sigma^2 + \sum_{t=1}^{a-1} N_{t,0}\big)\big(\sigma^2 + \sum_{t=1}^{a-1} N_{t,1}\big)} \bigg) =: \mathcal{N}\big( \mu_{a-1}^{(n)}, (\sigma_{a-1}^{(n)})^2 \big) \]
Thus to show that $\pi_a^{(n)} \overset{P}{\to} \pi_{\max}$ for all a ∈ [2 : T], it is sufficient to show that $\mu_{a-1}^{(n)} \overset{P}{\to} \Delta$ and $(\sigma_{a-1}^{(n)})^2 \overset{P}{\to} 0$ for all a ∈ [2 : T]. By Lemma 1, for all t ∈ [1 : T],
\[ \frac{N_{t,1}}{\pi_t^{(n)} n} \overset{P}{\to} 1 \]
Thus, to show $(\sigma_{a-1}^{(n)})^2 \overset{P}{\to} 0$, it is sufficient to show that
\[ \frac{\sigma^2\big(\sigma^2 + n\sum_{t=1}^{a-1}\pi_t^{(n)}\big) + \sigma^2\big(\sigma^2 + n\sum_{t=1}^{a-1}(1-\pi_t^{(n)})\big)}{\big(\sigma^2 + n\sum_{t=1}^{a-1}(1-\pi_t^{(n)})\big)\big(\sigma^2 + n\sum_{t=1}^{a-1}\pi_t^{(n)}\big)} \overset{P}{\to} 0 \]
The above limit holds because $\pi_t^{(n)} \in [\pi_{\min}, \pi_{\max}]$ for $0 < \pi_{\min} \le \pi_{\max} < 1$ by the clipping condition. We now show that $\mu_{a-1}^{(n)} \overset{P}{\to} \Delta$, which is equivalent to showing that the following converges in probability to ∆:
\[ \frac{\sum_{t=1}^{a-1}\sum_{i=1}^n A_{t,i}R_{t,i}}{\sigma^2 + \sum_{t=1}^{a-1} N_{t,1}} - \frac{\sum_{t=1}^{a-1}\sum_{i=1}^n (1-A_{t,i})R_{t,i}}{\sigma^2 + \sum_{t=1}^{a-1} N_{t,0}} \]
\[ = \frac{\sum_{t=1}^{a-1} N_{t,1}}{\sigma^2 + \sum_{t=1}^{a-1} N_{t,1}} \cdot \frac{\sum_{t=1}^{a-1}\sum_{i=1}^n A_{t,i}R_{t,i}}{\sum_{t=1}^{a-1} N_{t,1}} - \frac{\sum_{t=1}^{a-1} N_{t,0}}{\sigma^2 + \sum_{t=1}^{a-1} N_{t,0}} \cdot \frac{\sum_{t=1}^{a-1}\sum_{i=1}^n (1-A_{t,i})R_{t,i}}{\sum_{t=1}^{a-1} N_{t,0}} \]
\[ = \frac{\sum_{t=1}^{a-1} N_{t,1}}{\sigma^2 + \sum_{t=1}^{a-1} N_{t,1}} \bigg( \beta_1 + \frac{\sum_{t=1}^{a-1}\sum_{i=1}^n A_{t,i}\epsilon_{t,i}}{\sum_{t=1}^{a-1} N_{t,1}} \bigg) - \frac{\sum_{t=1}^{a-1} N_{t,0}}{\sigma^2 + \sum_{t=1}^{a-1} N_{t,0}} \bigg( \beta_0 + \frac{\sum_{t=1}^{a-1}\sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i}}{\sum_{t=1}^{a-1} N_{t,0}} \bigg) \quad (13) \]
Note that
\[ \frac{\sum_{t=1}^{a-1} N_{t,1}}{\sigma^2 + \sum_{t=1}^{a-1} N_{t,1}}\, \beta_1 - \frac{\sum_{t=1}^{a-1} N_{t,0}}{\sigma^2 + \sum_{t=1}^{a-1} N_{t,0}}\, \beta_0 \overset{P}{\to} \Delta \quad (14) \]
Equation (14) above holds by Lemma 1, because
\[ \frac{n\sum_{t=1}^{a-1}\pi_t^{(n)}}{\sigma^2 + n\sum_{t=1}^{a-1}\pi_t^{(n)}} \overset{P}{\to} 1 \quad\textrm{and}\quad \frac{n\sum_{t=1}^{a-1}(1-\pi_t^{(n)})}{\sigma^2 + n\sum_{t=1}^{a-1}(1-\pi_t^{(n)})} \overset{P}{\to} 1 \quad (15) \]
which hold because $\pi_t^{(n)} \in [\pi_{\min}, \pi_{\max}]$ due to our clipping condition. By Slutsky's Theorem and (14), to show that (13) converges in probability to ∆, it is sufficient to show that
\[ \frac{\sum_{t=1}^{a-1}\sum_{i=1}^n A_{t,i}\epsilon_{t,i}}{\sigma^2 + \sum_{t=1}^{a-1} N_{t,1}} - \frac{\sum_{t=1}^{a-1}\sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i}}{\sigma^2 + \sum_{t=1}^{a-1} N_{t,0}} \overset{P}{\to} 0. \quad (16) \]
Equation (16) is equivalent to the following:
\[ \sum_{t=1}^{a-1} \frac{\sqrt{N_{t,1}}}{\sigma^2 + \sum_{t'=1}^{a-1} N_{t',1}} \frac{\sum_{i=1}^n A_{t,i}\epsilon_{t,i}}{\sqrt{N_{t,1}}} - \sum_{t=1}^{a-1} \frac{\sqrt{N_{t,0}}}{\sigma^2 + \sum_{t'=1}^{a-1} N_{t',0}} \frac{\sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i}}{\sqrt{N_{t,0}}} \overset{P}{\to} 0 \quad (17) \]
By Lemma 1, to show (17) it is sufficient to show that
\[ \sum_{t=1}^{a-1} \frac{\sqrt{n\pi_t^{(n)}}}{\sigma^2 + n\sum_{t'=1}^{a-1}\pi_{t'}^{(n)}} \frac{\sum_{i=1}^n A_{t,i}\epsilon_{t,i}}{\sqrt{N_{t,1}}} - \sum_{t=1}^{a-1} \frac{\sqrt{n(1-\pi_t^{(n)})}}{\sigma^2 + n\sum_{t'=1}^{a-1}(1-\pi_{t'}^{(n)})} \frac{\sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i}}{\sqrt{N_{t,0}}} \overset{P}{\to} 0 \quad (18) \]
Since $\pi_t^{(n)} \in [\pi_{\min}, \pi_{\max}]$ due to our clipping condition, the left-hand side of (18) equals the following:
\[ \sum_{t=1}^{a-1} o_p(1)\, \frac{\sum_{i=1}^n A_{t,i}\epsilon_{t,i}}{\sqrt{N_{t,1}}} - \sum_{t=1}^{a-1} o_p(1)\, \frac{\sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i}}{\sqrt{N_{t,0}}} \overset{P}{\to} 0 \]
The above limit holds by (12). Thus, by Slutsky's Theorem and Lemma 1, we have that
\[ \frac{1}{n}\sum_{t=1}^T N_{t,1} \overset{P}{\to} \frac{1}{2} + (T-1)\pi_{\max} \quad\textrm{and}\quad \frac{1}{n}\sum_{t=1}^T N_{t,0} \overset{P}{\to} \frac{1}{2} + (T-1)\pi_{\min} \]

UCB We assume without loss of generality that ∆ > 0 and $\pi_1^{(n)} = \frac{1}{2}$. Recall that for UCB, for a ∈ [2 : T],
\[ \pi_a^{(n)} = \begin{cases} \pi_{\max} & \textrm{if } U_{a-1,1} > U_{a-1,0} \\ 1 - \pi_{\max} & \textrm{otherwise} \end{cases} \]
where we define the upper confidence bounds U for any confidence level δ with 0 < δ < 1 as follows:
\[ U_{a-1,1} = \begin{cases} \infty & \textrm{if } \sum_{t=1}^{a-1} N_{t,1} = 0 \\ \dfrac{\sum_{t=1}^{a-1}\sum_{i=1}^n A_{t,i}R_{t,i}}{\sum_{t=1}^{a-1} N_{t,1}} + \sqrt{\dfrac{2\log 1/\delta}{\sum_{t=1}^{a-1} N_{t,1}}} & \textrm{otherwise} \end{cases} \]
\[ U_{a-1,0} = \begin{cases} \infty & \textrm{if } \sum_{t=1}^{a-1} N_{t,0} = 0 \\ \dfrac{\sum_{t=1}^{a-1}\sum_{i=1}^n (1-A_{t,i})R_{t,i}}{\sum_{t=1}^{a-1} N_{t,0}} + \sqrt{\dfrac{2\log 1/\delta}{\sum_{t=1}^{a-1} N_{t,0}}} & \textrm{otherwise} \end{cases} \]
Thus to show that $\pi_a^{(n)} \overset{P}{\to} \pi_{\max}$ for all a ∈ [2 : T], it is sufficient to show that $\mathbb{I}_{(U_{a,1} > U_{a,0})} \overset{P}{\to} 1$, which is equivalent to showing that the following converges in probability to 1:
\[ \mathbb{I}_{(\sum_{t=1}^a N_{t,1} > 0,\, \sum_{t=1}^a N_{t,0} > 0)}\, \mathbb{I}_{\Big( \frac{\sum_{t=1}^a\sum_{i=1}^n A_{t,i}R_{t,i}}{\sum_{t=1}^a N_{t,1}} + \sqrt{\frac{2\log 1/\delta}{\sum_{t=1}^a N_{t,1}}} > \frac{\sum_{t=1}^a\sum_{i=1}^n (1-A_{t,i})R_{t,i}}{\sum_{t=1}^a N_{t,0}} + \sqrt{\frac{2\log 1/\delta}{\sum_{t=1}^a N_{t,0}}} \Big)} + \mathbb{I}_{(\sum_{t=1}^a N_{t,1} = 0,\, \sum_{t=1}^a N_{t,0} > 0)} \]
\[ = o_p(1) + \mathbb{I}_{\Big( (\beta_1 - \beta_0) + \frac{\sum_{t=1}^a\sum_{i=1}^n A_{t,i}\epsilon_{t,i}}{\sum_{t=1}^a N_{t,1}} + \sqrt{\frac{2\log 1/\delta}{\sum_{t=1}^a N_{t,1}}} > \frac{\sum_{t=1}^a\sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i}}{\sum_{t=1}^a N_{t,0}} + \sqrt{\frac{2\log 1/\delta}{\sum_{t=1}^a N_{t,0}}} \Big)} \]
Note that to show that the above converges in probability to 1, it is sufficient to show the following:
\[ \frac{\sum_{t=1}^a\sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i}}{\sum_{t=1}^a N_{t,0}} + \sqrt{\frac{2\log 1/\delta}{\sum_{t=1}^a N_{t,0}}} - \frac{\sum_{t=1}^a\sum_{i=1}^n A_{t,i}\epsilon_{t,i}}{\sum_{t=1}^a N_{t,1}} - \sqrt{\frac{2\log 1/\delta}{\sum_{t=1}^a N_{t,1}}} \overset{P}{\to} 0 \]
Note that for fixed δ, we have that $\sqrt{\frac{2\log 1/\delta}{\sum_{t=1}^a N_{t,0}}} \overset{P}{\to} 0$, since $\frac{N_{1,0}}{n/2} \overset{P}{\to} 1$. Also note that
\[ \frac{\sum_{t=1}^a\sum_{i=1}^n (1-A_{t,i})\epsilon_{t,i}}{\sum_{t=1}^a N_{t,0}} - \frac{\sum_{t=1}^a\sum_{i=1}^n A_{t,i}\epsilon_{t,i}}{\sum_{t=1}^a N_{t,1}} \overset{P}{\to} 0, \]
by the same argument made in the ε-greedy case to show (9). Thus, by Slutsky's Theorem and Lemma 1, we have that
\[ \frac{1}{n}\sum_{t=1}^T N_{t,1} \overset{P}{\to} \frac{1}{2} + (T-1)\pi_{\max} \quad\textrm{and}\quad \frac{1}{n}\sum_{t=1}^T N_{t,0} \overset{P}{\to} \frac{1}{2} + (T-1)(1-\pi_{\max}) \]

C Non-uniform convergence of the OLS Estimator

Definition 3 (Non-concentration of a sequence of random variables). For a sequence of random variables $\{Y_i\}_{i=1}^n$ on probability space (Ω, F, P), we say $Y_n$ does not concentrate if for each a ∈ ℝ there exists an $\epsilon_a > 0$ with
\[ P\big( \omega \in \Omega : |Y_n(\omega) - a| > \epsilon_a \big) \not\to 0. \]

C.1 Thompson Sampling

Proposition 1 (Non-concentration of sampling probabilities under Thompson Sampling). Under the assumptions of Theorem 2, the posterior distribution that arm 1 is better than arm 0 converges as follows: 1 if ∆ > 0 ˜ ˜ (n) D  P β1 > β0 H1 → 0 if ∆ < 0 Uniform[0, 1] if ∆ = 0 (n) Thus, the sampling probabilities πt do not concentrate when ∆ = 0. Pn Pn Proof: Below, Nt,1 = i=1 At,i and Nt,0 = i=1(1 − At,i). Posterior means: Pn 2  ˜ (n) i=1(1 − A1,i)R1,i σ β0|H1 ∼ N 2 , 2 σa + N1,0 σa + N0,1 Pn 2  ˜ (n) i=1 A1,iR1,i σa β1|H1 ∼ N 2 , 2 σa + N1,1 σa + N1,1 ˜ ˜ (n) 2 β1 − β0 | H1 ∼ N (µn, σn) Pn Pn 2 2 2 2 i=1 A1,iR1,i i=1(1−A1,i)R1,i 2 σa(σa+N1,1)+σa(σa+N1,0) for µn := 2 − 2 and σn := 2 2 . σa+N1,1 σa+N1,0 (σa+N1,0)(σa+N1,1)  ˜ ˜  ˜ ˜ (n) ˜ ˜ (n) β1 − β0 − µn µn (n) P (β1 > β0 | H1 ) = P (β1 − β0 > 0 | H1 ) = P > − H1 σn σn

For Z ∼ N (0, 1) independent of µn, σn.       µn (n) µn (n) µn (n) = P Z > − H1 = P Z < H1 = Φ H1 σn σn σn s Pn Pn  2 2 µn i=1 A1,iR1,i i=1(1 − A1,i)R1,i (σa + N1,0)(σa + N1,1) = 2 − 2 4 2 σn σa + N1,1 σa + N1,0 2σa + σan s  Pn Pn  2 2 β1N1,1 + i=1 A1,i1,i β0N1,0 + i=1(1 − A1,i)1,i (σa + N1,0)(σa + N1,1) = 2 − 2 4 2 σa + N1,1 σa + N1,0 2σa + σan s s Pn A  N (σ2 + N ) Pn (1 − A ) N (σ2 + N ) = i=1 1,i 1,i 1,1 a 1,0 − i=1 1,i 1,i 1,0 a 1,1 p 4 2 2 p 4 2 2 N1,1 (2σa + σan)(σa + N1,1) N1,0 (2σa + σan)(σa + N1,0) s   2 2 N1,1 N1,0 (σa + N1,0)(σa + N1,1) + β1 2 − β0 2 4 2 =: Bn + Cn σa + N1,1 σa + N1,0 2σa + σan

N1,1 N1,0 Let’s first examine Cn. Note that β1 = β0 + ∆, so β1 2 − β0 2 equals σa+N1,1 σa+N1,0   N1,1 N1,0 N1,1 N1,1 N1,0 = (β0 + ∆) 2 − β0 2 = ∆ 2 + β0 2 − 2 σa + N1,1 σa + N1,0 σa + N1,1 σa + N1,1 σa + N1,0  2 2  N1,1/n N1,1(σa + N1,0) − N1,0(σa + N1,1) = ∆ 2 + β0 2 2 (σa + N1,1)/n (σa + N1,1)(σa + N1,1)

23 1     2 + o(1) 2 N1,1 − N1,0 1 = ∆ 1 + β0σa 2 2 = ∆[1 + o(1)] + o 2 + o(1) (σa + N1,1)(σa + N1,1) n where the last equality holds by the Strong Law of Large Numbers because 1 1 1 1 1   2 (N − N ) [ − + o(1)] o(1) 1 n 1,1 1,0 = n 2 2 = n = o 1 2 2 1 1 1 n2 (σa + N1,1)(σa + N1,1) [ 2 + o(1)][ 2 + o(1)] 4 + o(1) n Thus, s    2 2 1 (σa + N1,0)(σa + N1,1) Cn = ∆[1 + o(1)] + o 4 2 n 2σa + σan s   1  n[ 1 + o(1)][ 1 + o(1)] √  1  2 2   √ = ∆[1 + o(1)] + o 2 = n∆ 1/(2σa) + o(1) + o n o(1) + σa n

Let's now examine $B_n$. Note that
$$ \sqrt{ \frac{N_{1,1}(\sigma_a^2+N_{1,0})}{(2\sigma_a^4+\sigma_a^2 n)(\sigma_a^2+N_{1,1})} } = \sqrt{ \frac{[\frac12+o(1)][\frac12+o(1)]}{[\sigma_a^2+o(1)][\frac12+o(1)]} } = \sqrt{\frac{1}{2\sigma_a^2}} + o(1), \qquad \sqrt{ \frac{N_{1,0}(\sigma_a^2+N_{1,1})}{(2\sigma_a^4+\sigma_a^2 n)(\sigma_a^2+N_{1,0})} } = \sqrt{\frac{1}{2\sigma_a^2}} + o(1). $$
Note that by Theorem 3, $\big( \frac{1}{\sqrt{N_{1,1}}}\sum_{i=1}^n \epsilon_{1,i}A_{1,i},\ \frac{1}{\sqrt{N_{1,0}}}\sum_{i=1}^n \epsilon_{1,i}(1-A_{1,i}) \big) \xrightarrow{D} \mathcal{N}(0, I_2)$. Thus by Slutsky's Theorem,
$$ \begin{pmatrix} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} \sqrt{\frac{N_{1,1}(\sigma_a^2+N_{1,0})}{(2\sigma_a^4+\sigma_a^2 n)(\sigma_a^2+N_{1,1})}} \\[4pt] \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{\sqrt{N_{1,0}}} \sqrt{\frac{N_{1,0}(\sigma_a^2+N_{1,1})}{(2\sigma_a^4+\sigma_a^2 n)(\sigma_a^2+N_{1,0})}} \end{pmatrix} = \begin{pmatrix} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} \Big[ \sqrt{\frac{1}{2\sigma_a^2}} + o(1) \Big] \\[4pt] \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{\sqrt{N_{1,0}}} \Big[ \sqrt{\frac{1}{2\sigma_a^2}} + o(1) \Big] \end{pmatrix} \xrightarrow{D} \mathcal{N}\Big( 0,\ \frac{1}{2\sigma_a^2} I_2 \Big). $$

Thus, we have that $B_n \xrightarrow{D} \mathcal{N}\big(0, \frac{1}{\sigma_a^2}\big)$. Since we assume that the algorithm's variance is correctly specified, $\sigma_a^2 = 1$, so
$$ B_n + C_n \xrightarrow{D} \begin{cases} \infty & \text{if } \Delta > 0 \\ -\infty & \text{if } \Delta < 0 \\ \mathcal{N}(0,1) & \text{if } \Delta = 0. \end{cases} $$
Thus, by the continuous mapping theorem,
$$ P\big( \tilde\beta_1 > \tilde\beta_0 \,\big|\, H_1^{(n)} \big) = \Phi\Big( \frac{\mu_n}{\sigma_n} \Big) = \Phi(B_n + C_n) \xrightarrow{D} \begin{cases} 1 & \text{if } \Delta > 0 \\ 0 & \text{if } \Delta < 0 \\ \mathrm{Uniform}[0,1] & \text{if } \Delta = 0. \end{cases} \qquad \blacksquare $$
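Proposition 1 can be checked numerically. The following minimal simulation sketch (not from the paper; the helper name, the choice $\beta_0 = 0$, one batch of size $n$ with $\pi_1 = \frac12$, standard normal errors, and $\sigma_a = 1$ are our assumptions) computes the posterior probability $\Phi(\mu_n/\sigma_n)$ across Monte Carlo replications: under $\Delta = 0$ it stays spread over $(0,1)$, while under $\Delta > 0$ it concentrates near 1.

```python
import math
import random

def posterior_prob_arm1_better(n, delta, sigma_a=1.0, seed=0):
    """One batch with A_{1,i} ~ Bernoulli(1/2); returns the Thompson-sampling
    posterior probability that arm 1 beats arm 0 (Phi(mu_n / sigma_n))
    under N(0, 1) priors and assumed noise variance sigma_a^2."""
    rng = random.Random(seed)
    s1 = s0 = 0.0
    n1 = n0 = 0
    for _ in range(n):
        arm1 = rng.random() < 0.5
        reward = (delta if arm1 else 0.0) + rng.gauss(0.0, 1.0)  # beta_0 = 0
        if arm1:
            s1 += reward; n1 += 1
        else:
            s0 += reward; n0 += 1
    mu = s1 / (sigma_a**2 + n1) - s0 / (sigma_a**2 + n0)
    var = sigma_a**2 / (sigma_a**2 + n1) + sigma_a**2 / (sigma_a**2 + n0)
    return 0.5 * (1.0 + math.erf(mu / math.sqrt(var) / math.sqrt(2.0)))

# Zero margin: the posterior probability is roughly Uniform[0, 1] across runs
probs0 = [posterior_prob_arm1_better(2000, 0.0, seed=s) for s in range(300)]
# Positive margin: it concentrates near 1
probs1 = [posterior_prob_arm1_better(2000, 0.5, seed=s) for s in range(300)]

# second moment about 1/2; close to 1/12 for a Uniform[0, 1] variable
spread = sum((p - 0.5) ** 2 for p in probs0) / len(probs0)
```

This is only an illustration of the limiting behavior, not a substitute for the proof; the non-concentration is visible already at moderate batch sizes.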

Proof of Theorem 2 (Non-uniform convergence of the OLS estimator of the treatment effect for Thompson Sampling): The normalized errors of the OLS estimator for $\Delta$, which are asymptotically normal under i.i.d. sampling, are as follows:
$$ \sqrt{ \frac{(N_{1,1}+N_{2,1})(N_{1,0}+N_{2,0})}{2n} } \big( \hat\beta_1^{OLS} - \hat\beta_0^{OLS} - \Delta \big) $$
$$ = \sqrt{ \frac{(N_{1,1}+N_{2,1})(N_{1,0}+N_{2,0})}{2n} } \bigg( \frac{\sum_{t=1}^2\sum_{i=1}^n A_{t,i}R_{t,i}}{N_{1,1}+N_{2,1}} - \frac{\sum_{t=1}^2\sum_{i=1}^n (1-A_{t,i})R_{t,i}}{N_{1,0}+N_{2,0}} - \Delta \bigg) $$
$$ = \sqrt{ \frac{(N_{1,1}+N_{2,1})(N_{1,0}+N_{2,0})}{2n} } \bigg( (\beta_1-\beta_0) - \Delta + \frac{\sum_{t,i} A_{t,i}\epsilon_{t,i}}{N_{1,1}+N_{2,1}} - \frac{\sum_{t,i} (1-A_{t,i})\epsilon_{t,i}}{N_{1,0}+N_{2,0}} \bigg) $$
$$ = \sqrt{ \frac{N_{1,0}+N_{2,0}}{2n} } \frac{\sum_{t,i} A_{t,i}\epsilon_{t,i}}{\sqrt{N_{1,1}+N_{2,1}}} - \sqrt{ \frac{N_{1,1}+N_{2,1}}{2n} } \frac{\sum_{t,i} (1-A_{t,i})\epsilon_{t,i}}{\sqrt{N_{1,0}+N_{2,0}}} $$

$$ = [1,-1,1,-1] \begin{pmatrix} \sqrt{\frac{N_{1,0}+N_{2,0}}{2n}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}+N_{2,1}}} \\[4pt] \sqrt{\frac{N_{1,1}+N_{2,1}}{2n}} \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{\sqrt{N_{1,0}+N_{2,0}}} \\[4pt] \sqrt{\frac{N_{1,0}+N_{2,0}}{2n}} \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{1,1}+N_{2,1}}} \\[4pt] \sqrt{\frac{N_{1,1}+N_{2,1}}{2n}} \frac{\sum_i (1-A_{2,i})\epsilon_{2,i}}{\sqrt{N_{1,0}+N_{2,0}}} \end{pmatrix} = [1,-1,1,-1] \begin{pmatrix} \sqrt{\frac{N_{1,0}+N_{2,0}}{2(N_{1,1}+N_{2,1})}} \sqrt{\frac{N_{1,1}}{n}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} \\[4pt] \sqrt{\frac{N_{1,1}+N_{2,1}}{2(N_{1,0}+N_{2,0})}} \sqrt{\frac{N_{1,0}}{n}} \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{\sqrt{N_{1,0}}} \\[4pt] \sqrt{\frac{N_{1,0}+N_{2,0}}{2(N_{1,1}+N_{2,1})}} \sqrt{\frac{N_{2,1}}{n}} \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}} \\[4pt] \sqrt{\frac{N_{1,1}+N_{2,1}}{2(N_{1,0}+N_{2,0})}} \sqrt{\frac{N_{2,0}}{n}} \frac{\sum_i (1-A_{2,i})\epsilon_{2,i}}{\sqrt{N_{2,0}}} \end{pmatrix} \quad (19) $$

By Theorem 3, $\Big( \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}},\ \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{\sqrt{N_{1,0}}},\ \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}},\ \frac{\sum_i (1-A_{2,i})\epsilon_{2,i}}{\sqrt{N_{2,0}}} \Big) \xrightarrow{D} \mathcal{N}(0, I_4)$. By Lemma 1 and Slutsky's Theorem,
$$ \sqrt{ \frac{2n(N_{1,1}+N_{2,1})}{N_{1,1}(N_{1,0}+N_{2,0})} } \sqrt{ \frac{\frac12\big(\frac12+[1-\pi_2]\big)}{2\big(\frac12+\pi_2\big)} } = 1 + o_p(1), $$
thus
$$ \sqrt{\frac{N_{1,0}+N_{2,0}}{2(N_{1,1}+N_{2,1})}} \sqrt{\frac{N_{1,1}}{n}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} = \sqrt{\frac{\frac12\big(\frac12+[1-\pi_2]\big)}{2\big(\frac12+\pi_2\big)}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} + o_p(1) \sqrt{\frac{N_{1,0}+N_{2,0}}{2(N_{1,1}+N_{2,1})}} \sqrt{\frac{N_{1,1}}{n}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}}. $$
Note that $\sqrt{\frac{N_{1,0}+N_{2,0}}{2(N_{1,1}+N_{2,1})}}$ is stochastically bounded because for any $K > 2$,
$$ P\bigg( \frac{N_{1,0}+N_{2,0}}{2(N_{1,1}+N_{2,1})} > K \bigg) \le P\bigg( \frac{n}{N_{1,1}} > K \bigg) = P\bigg( \frac{N_{1,1}}{n} < \frac1K \bigg) \to 0, $$

where the limit holds by the law of large numbers since $N_{1,1} \sim \mathrm{Binomial}(n, \frac12)$. Thus, since $\frac{N_{1,1}}{n} \le 1$ and $\frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} \xrightarrow{D} \mathcal{N}(0,1)$,
$$ o_p(1) \sqrt{\frac{N_{1,0}+N_{2,0}}{2(N_{1,1}+N_{2,1})}} \sqrt{\frac{N_{1,1}}{n}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} = o_p(1). $$
We can perform the above procedure on the other three terms. Thus, equation (19) is equal to the following:
$$ [1,-1,1,-1] \begin{pmatrix} \sqrt{\frac{1/2+1-\pi_2}{4(1/2+\pi_2)}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} \\[4pt] \sqrt{\frac{1/2+\pi_2}{4(1/2+1-\pi_2)}} \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{\sqrt{N_{1,0}}} \\[4pt] \sqrt{\frac{(1/2+1-\pi_2)\pi_2}{2(1/2+\pi_2)}} \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}} \\[4pt] \sqrt{\frac{(1/2+\pi_2)(1-\pi_2)}{2(1/2+1-\pi_2)}} \frac{\sum_i (1-A_{2,i})\epsilon_{2,i}}{\sqrt{N_{2,0}}} \end{pmatrix} + o_p(1). $$
Recall that we showed earlier in Proposition 1 that
$$ \pi_2^{(n)} = \pi_{\min} \vee \Big( \pi_{\max} \wedge \Phi\Big( \frac{\mu_n}{\sigma_n} \Big) \Big) = \pi_{\min} \vee \big( \pi_{\max} \wedge \Phi(B_n + C_n) \big) $$
$$ = \pi_{\min} \vee \bigg( \pi_{\max} \wedge \Phi\bigg( \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{2N_{1,1}}} - \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{\sqrt{2N_{1,0}}} + \sqrt{n}\,\Delta\Big[ \frac12 + o(1) \Big] + o(1) \bigg) \bigg). $$

When $\Delta > 0$, $\pi_2^{(n)} \xrightarrow{P} \pi_{\max}$, and when $\Delta < 0$, $\pi_2^{(n)} \xrightarrow{P} \pi_{\min}$. We now consider the $\Delta = 0$ case:
$$ \pi_2^{(n)} = \pi_{\min} \vee \bigg( \pi_{\max} \wedge \Phi\bigg( \frac{1}{\sqrt2}\bigg[ \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} - \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{\sqrt{N_{1,0}}} \bigg] + o(1) \bigg) \bigg). $$

By Slutsky's Theorem, for $Z_1, Z_2, Z_3, Z_4 \stackrel{i.i.d.}{\sim} \mathcal{N}(0,1)$,
$$ [1,-1,1,-1] \begin{pmatrix} \sqrt{\frac{1/2+1-\pi_2}{4(1/2+\pi_2)}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} \\[4pt] \sqrt{\frac{1/2+\pi_2}{4(1/2+1-\pi_2)}} \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{\sqrt{N_{1,0}}} \\[4pt] \sqrt{\frac{(1/2+1-\pi_2)\pi_2}{2(1/2+\pi_2)}} \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}} \\[4pt] \sqrt{\frac{(1/2+\pi_2)(1-\pi_2)}{2(1/2+1-\pi_2)}} \frac{\sum_i (1-A_{2,i})\epsilon_{2,i}}{\sqrt{N_{2,0}}} \end{pmatrix} + o_p(1) \xrightarrow{D} [1,-1,1,-1] \begin{pmatrix} \sqrt{\frac{1/2+1-\pi_*}{4(1/2+\pi_*)}} Z_1 \\[4pt] \sqrt{\frac{1/2+\pi_*}{4(1/2+1-\pi_*)}} Z_2 \\[4pt] \sqrt{\frac{(1/2+1-\pi_*)\pi_*}{2(1/2+\pi_*)}} Z_3 \\[4pt] \sqrt{\frac{(1/2+\pi_*)(1-\pi_*)}{2(1/2+1-\pi_*)}} Z_4 \end{pmatrix} $$
$$ = \sqrt{\frac{1/2+1-\pi_*}{2(1/2+\pi_*)}} \Big( \sqrt{1/2}\,Z_1 + \sqrt{\pi_*}\,Z_3 \Big) - \sqrt{\frac{1/2+\pi_*}{2(1/2+1-\pi_*)}} \Big( \sqrt{1/2}\,Z_2 + \sqrt{1-\pi_*}\,Z_4 \Big), $$
where
$$ \pi_* = \begin{cases} \pi_{\max} & \text{if } \Delta > 0 \\ \pi_{\min} & \text{if } \Delta < 0 \\ \pi_{\min} \vee \big( \pi_{\max} \wedge \Phi\big[ \sqrt{1/2}\,(Z_1 - Z_2) \big] \big) & \text{if } \Delta = 0. \end{cases} \qquad \blacksquare $$

C.2 $\epsilon$-Greedy

Proposition 2 (Non-concentration of the sampling probabilities under zero treatment effect for $\epsilon$-greedy). Let $T = 2$ and $\pi_1^{(n)} = \frac12$ for all $n$. We assume that $\{\epsilon_{t,i}\}_{i=1}^n \stackrel{i.i.d.}{\sim} \mathcal{N}(0,1)$, and
$$ \pi_2^{(n)} = \begin{cases} 1 - \frac{\epsilon}{2} & \text{if } \frac{\sum_i A_{1,i}R_{1,i}}{N_{1,1}} > \frac{\sum_i (1-A_{1,i})R_{1,i}}{N_{1,0}} \\ \frac{\epsilon}{2} & \text{otherwise.} \end{cases} $$
Then the sampling probability $\pi_2^{(n)}$ does not concentrate when $\beta_1 = \beta_0$.

Proof: We define
$$ M_n := \mathbb{I}\bigg( \frac{\sum_i A_{1,i}R_{1,i}}{N_{1,1}} > \frac{\sum_i (1-A_{1,i})R_{1,i}}{N_{1,0}} \bigg) = \mathbb{I}\bigg( (\beta_1-\beta_0) + \frac{\sum_i A_{1,i}\epsilon_{1,i}}{N_{1,1}} > \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{N_{1,0}} \bigg). $$
Note that when $M_n = 1$, $\pi_2^{(n)} = 1 - \frac{\epsilon}{2}$, and when $M_n = 0$, $\pi_2^{(n)} = \frac{\epsilon}{2}$.

When the margin is zero, $M_n$ does not concentrate because for all $N_{1,1}, N_{1,0}$, since $\epsilon_{1,i} \stackrel{i.i.d.}{\sim} \mathcal{N}(0,1)$,
$$ P\bigg( \frac{\sum_i A_{1,i}\epsilon_{1,i}}{N_{1,1}} > \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{N_{1,0}} \bigg) = P\bigg( \frac{1}{\sqrt{N_{1,1}}} Z_1 - \frac{1}{\sqrt{N_{1,0}}} Z_2 > 0 \bigg) = \frac12 $$

for $Z_1, Z_2 \stackrel{i.i.d.}{\sim} \mathcal{N}(0,1)$. Thus, we have shown that $\pi_2^{(n)}$ does not concentrate when $\beta_1 - \beta_0 = 0$. $\blacksquare$
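Proposition 2 is easy to see in simulation. The sketch below (our own illustration; the helper name, $\beta_0 = \beta_1 = 0$, $n = 1000$, and $\epsilon = 0.2$ are assumptions) draws the first batch and reports the resulting $\epsilon$-greedy sampling probability $\pi_2^{(n)}$: under a zero margin it keeps flipping between $\epsilon/2$ and $1 - \epsilon/2$, roughly half the time each, no matter how large $n$ is.

```python
import random

def second_batch_prob(n, eps=0.2, margin=0.0, seed=0):
    """First batch with A_{1,i} ~ Bernoulli(1/2); returns the epsilon-greedy
    sampling probability pi_2 for the second batch (Proposition 2)."""
    rng = random.Random(seed)
    s1 = s0 = 0.0
    n1 = n0 = 0
    for _ in range(n):
        arm1 = rng.random() < 0.5
        reward = (margin if arm1 else 0.0) + rng.gauss(0.0, 1.0)
        if arm1:
            s1 += reward; n1 += 1
        else:
            s0 += reward; n0 += 1
    return 1 - eps / 2 if s1 / n1 > s0 / n0 else eps / 2

pis = [second_batch_prob(1000, seed=s) for s in range(400)]
# fraction of runs in which the greedy arm is arm 1; hovers near 1/2
frac_high = sum(p > 0.5 for p in pis) / len(pis)
```

With a nonzero `margin`, the same function returns $1 - \epsilon/2$ (or $\epsilon/2$) in essentially every run, which is the concentration behavior the proposition rules out only at the zero margin.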

Theorem 6 (Non-uniform convergence of the OLS estimator of the treatment effect for $\epsilon$-greedy). Assuming the setup and conditions of Proposition 2, and that $\beta_1 = b$, we show that the normalized errors of the OLS estimator converge in distribution as follows:

$$ \sqrt{N_{1,1}+N_{2,1}} \big( \hat\beta_1^{OLS} - b \big) \xrightarrow{D} Y, \qquad Y = \begin{cases} Z_1 & \text{if } \beta_1 - \beta_0 \ne 0 \\ \Big( \sqrt{\frac{1}{3-\epsilon}}\,Z_1 + \sqrt{\frac{2-\epsilon}{3-\epsilon}}\,Z_3 \Big) \mathbb{I}(Z_1 > Z_2) + \Big( \sqrt{\frac{1}{1+\epsilon}}\,Z_1 + \sqrt{\frac{\epsilon}{1+\epsilon}}\,Z_3 \Big) \mathbb{I}(Z_1 < Z_2) & \text{if } \beta_1 - \beta_0 = 0, \end{cases} $$
for $Z_1, Z_2, Z_3 \stackrel{i.i.d.}{\sim} \mathcal{N}(0,1)$.

Proof: The normalized errors of the OLS estimator for $\beta_1$ are
$$ \sqrt{N_{1,1}+N_{2,1}} \bigg( \frac{\sum_{t=1}^2\sum_{i=1}^n A_{t,i}R_{t,i}}{N_{1,1}+N_{2,1}} - b \bigg) = \frac{\sum_{t=1}^2\sum_{i=1}^n A_{t,i}\epsilon_{t,i}}{\sqrt{N_{1,1}+N_{2,1}}} $$

$$ = [1,1] \begin{pmatrix} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}+N_{2,1}}} \\[4pt] \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{1,1}+N_{2,1}}} \end{pmatrix} = [1,1] \begin{pmatrix} \sqrt{\frac{N_{1,1}}{N_{1,1}+N_{2,1}}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} \\[4pt] \sqrt{\frac{N_{2,1}}{N_{1,1}+N_{2,1}}} \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}} \end{pmatrix} $$

By Slutsky's Theorem and Lemma 1,
$$ \Bigg( \sqrt{\frac{1/2}{1/2+\pi_2^{(n)}}} \sqrt{\frac{N_{1,1}+N_{2,1}}{N_{1,1}}},\ \sqrt{\frac{\pi_2^{(n)}}{1/2+\pi_2^{(n)}}} \sqrt{\frac{N_{1,1}+N_{2,1}}{N_{2,1}}} \Bigg) \xrightarrow{P} (1, 1), $$
so the above equals
$$ [1,1] \begin{pmatrix} \Big[ \sqrt{\frac{1/2}{1/2+\pi_2^{(n)}}} + o_p(1) \Big] \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} \\[4pt] \Big[ \sqrt{\frac{\pi_2^{(n)}}{1/2+\pi_2^{(n)}}} + o_p(1) \Big] \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}} \end{pmatrix}. $$
The last equality holds because by Theorem 3, $\Big( \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}},\ \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}} \Big) \xrightarrow{D} \mathcal{N}(0, I_2)$.

Let's define
$$ M_n := \mathbb{I}\bigg( \frac{\sum_i A_{1,i}R_{1,i}}{N_{1,1}} > \frac{\sum_i (1-A_{1,i})R_{1,i}}{N_{1,0}} \bigg) = \mathbb{I}\bigg( (\beta_1-\beta_0) + \frac{\sum_i A_{1,i}\epsilon_{1,i}}{N_{1,1}} > \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{N_{1,0}} \bigg). $$
Note that when $M_n = 1$, $\pi_2^{(n)} = 1 - \frac{\epsilon}{2}$, and when $M_n = 0$, $\pi_2^{(n)} = \frac{\epsilon}{2}$.

Multiplying both sides of the inequality by $\sqrt{N_{1,0}}$,
$$ M_n = \mathbb{I}\bigg( \sqrt{N_{1,0}}\,(\beta_1-\beta_0) + \sqrt{\frac{N_{1,0}}{N_{1,1}}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} > \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{\sqrt{N_{1,0}}} \bigg) $$

$$ = \mathbb{I}\bigg( \sqrt{N_{1,0}}\,(\beta_1-\beta_0) + [1+o_p(1)] \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} > \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{\sqrt{N_{1,0}}} \bigg), $$
where the last equality holds because $\sqrt{\frac{N_{1,0}}{N_{1,1}}} \xrightarrow{P} 1$ by Lemma 1, Slutsky's Theorem, and the continuous mapping theorem. Thus, by Proposition 2,
$$ M_n \xrightarrow{P} \begin{cases} 1 & \text{if } \beta_1-\beta_0 > 0 \\ 0 & \text{if } \beta_1-\beta_0 < 0, \end{cases} \qquad \text{and } M_n \text{ does not concentrate if } \beta_1-\beta_0 = 0. $$

Note that
$$ [1,1] \begin{pmatrix} \sqrt{\frac{1/2}{1/2+\pi_2^{(n)}}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} + o_p(1) \\[4pt] \sqrt{\frac{\pi_2^{(n)}}{1/2+\pi_2^{(n)}}} \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}} + o_p(1) \end{pmatrix} = [1,1] \begin{pmatrix} \sqrt{\frac{1/2}{1/2+1-\epsilon/2}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} + o_p(1) \\[4pt] \sqrt{\frac{1-\epsilon/2}{1/2+1-\epsilon/2}} \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}} + o_p(1) \end{pmatrix} M_n + [1,1] \begin{pmatrix} \sqrt{\frac{1/2}{1/2+\epsilon/2}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} + o_p(1) \\[4pt] \sqrt{\frac{\epsilon/2}{1/2+\epsilon/2}} \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}} + o_p(1) \end{pmatrix} (1 - M_n). $$
Also note that by Theorem 3,
$$ \bigg( \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}},\ \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{\sqrt{N_{1,0}}},\ \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}},\ \frac{\sum_i (1-A_{2,i})\epsilon_{2,i}}{\sqrt{N_{2,0}}} \bigg) \xrightarrow{D} \mathcal{N}(0, I_4). $$

When $\beta_1 > \beta_0$, $M_n \xrightarrow{P} 1$, and when $\beta_1 < \beta_0$, $M_n \xrightarrow{P} 0$; in both these cases the normalized errors are asymptotically normal. We now focus on the case that $\beta_1 = \beta_0$. By the continuous mapping theorem and Slutsky's theorem, for $Z_1, Z_2, Z_3, Z_4 \stackrel{i.i.d.}{\sim} \mathcal{N}(0,1)$,

$$ [1,1] \begin{pmatrix} \sqrt{\frac{1/2}{1/2+1-\epsilon/2}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} + o_p(1) \\[4pt] \sqrt{\frac{1-\epsilon/2}{1/2+1-\epsilon/2}} \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}} + o_p(1) \end{pmatrix} \mathbb{I}\bigg( [1+o_p(1)] \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} > \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{\sqrt{N_{1,0}}} \bigg) $$

$$ + [1,1] \begin{pmatrix} \sqrt{\frac{1/2}{1/2+\epsilon/2}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} + o_p(1) \\[4pt] \sqrt{\frac{\epsilon/2}{1/2+\epsilon/2}} \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}} + o_p(1) \end{pmatrix} \bigg[ 1 - \mathbb{I}\bigg( [1+o_p(1)] \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} > \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{\sqrt{N_{1,0}}} \bigg) \bigg] $$

$$ \xrightarrow{D} \Big( \sqrt{\tfrac{1}{3-\epsilon}}\,Z_1 + \sqrt{\tfrac{2-\epsilon}{3-\epsilon}}\,Z_3 \Big) \mathbb{I}(Z_1 > Z_2) + \Big( \sqrt{\tfrac{1}{1+\epsilon}}\,Z_1 + \sqrt{\tfrac{\epsilon}{1+\epsilon}}\,Z_3 \Big) \mathbb{I}(Z_1 < Z_2). $$
Thus,
$$ \sqrt{N_{1,1}+N_{2,1}} \big( \hat\beta_1^{OLS} - b \big) \xrightarrow{D} Y := \begin{cases} \sqrt{\frac{1}{3-\epsilon}}\,Z_1 + \sqrt{\frac{2-\epsilon}{3-\epsilon}}\,Z_3 & \text{if } \beta_1-\beta_0 > 0 \\ \sqrt{\frac{1}{1+\epsilon}}\,Z_1 + \sqrt{\frac{\epsilon}{1+\epsilon}}\,Z_3 & \text{if } \beta_1-\beta_0 < 0 \\ \Big( \sqrt{\frac{1}{3-\epsilon}}\,Z_1 + \sqrt{\frac{2-\epsilon}{3-\epsilon}}\,Z_3 \Big) \mathbb{I}(Z_1>Z_2) + \Big( \sqrt{\frac{1}{1+\epsilon}}\,Z_1 + \sqrt{\frac{\epsilon}{1+\epsilon}}\,Z_3 \Big) \mathbb{I}(Z_1<Z_2) & \text{if } \beta_1-\beta_0 = 0. \end{cases} \qquad \blacksquare $$

C.3 UCB

Theorem 7 (Asymptotic non-normality under zero treatment effect for clipped UCB). Let $T = 2$ and $\pi_1^{(n)} = \frac12$ for all $n$. We assume that $\{\epsilon_{t,i}\}_{i=1}^n \stackrel{i.i.d.}{\sim} \mathcal{N}(0,1)$, and
$$ \pi_2^{(n)} = \begin{cases} \pi_{\max} & \text{if } U_1 > U_0 \\ 1 - \pi_{\max} & \text{otherwise,} \end{cases} $$
where we define the upper confidence bounds $U_1, U_0$ for any confidence level $\delta$ with $0 < \delta < 1$ as follows:
$$ U_1 = \begin{cases} \infty & \text{if } N_{1,1} = 0 \\ \frac{\sum_i A_{1,i}R_{1,i}}{N_{1,1}} + \sqrt{\frac{2\log 1/\delta}{N_{1,1}}} & \text{otherwise,} \end{cases} \qquad U_0 = \begin{cases} \infty & \text{if } N_{1,0} = 0 \\ \frac{\sum_i (1-A_{1,i})R_{1,i}}{N_{1,0}} + \sqrt{\frac{2\log 1/\delta}{N_{1,0}}} & \text{otherwise.} \end{cases} $$

Assuming the above conditions, and that $\beta_1 = b$, we show that the normalized errors of the OLS estimator converge in distribution as follows:
$$ \sqrt{N_{1,1}+N_{2,1}} \big( \hat\beta_1^{OLS} - b \big) \xrightarrow{D} Y, $$
$$ Y = \begin{cases} Z_1 & \text{if } \Delta \ne 0 \\ \Big( \sqrt{\frac{1/2}{1/2+\pi_{\max}}}\,Z_1 + \sqrt{\frac{\pi_{\max}}{1/2+\pi_{\max}}}\,Z_3 \Big) \mathbb{I}(Z_1>Z_2) + \Big( \sqrt{\frac{1/2}{1/2+1-\pi_{\max}}}\,Z_1 + \sqrt{\frac{1-\pi_{\max}}{1/2+1-\pi_{\max}}}\,Z_3 \Big) \mathbb{I}(Z_1<Z_2) & \text{if } \Delta = 0, \end{cases} $$
for $Z_1, Z_2, Z_3 \stackrel{i.i.d.}{\sim} \mathcal{N}(0,1)$.

Proof: The proof is very similar to that of the asymptotic non-normality result for $\epsilon$-greedy. By the same arguments made in the $\epsilon$-greedy case, we have that
$$ \sqrt{N_{1,1}+N_{2,1}} \bigg( \frac{\sum_{t=1}^2\sum_{i=1}^n A_{t,i}R_{t,i}}{N_{1,1}+N_{2,1}} - b \bigg) = [1,1] \begin{pmatrix} \sqrt{\frac{1/2}{1/2+\pi_2^{(n)}}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} + o_p(1) \\[4pt] \sqrt{\frac{\pi_2^{(n)}}{1/2+\pi_2^{(n)}}} \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}} + o_p(1) \end{pmatrix} $$

Assuming $n \ge 1$, we then define
$$ M_n := \mathbb{I}(U_1 > U_0) $$
$$ = \mathbb{I}(N_{1,1}>0, N_{1,0}>0)\, \mathbb{I}\bigg( \frac{\sum_i A_{1,i}R_{1,i}}{N_{1,1}} + \sqrt{\frac{2\log 1/\delta}{N_{1,1}}} > \frac{\sum_i (1-A_{1,i})R_{1,i}}{N_{1,0}} + \sqrt{\frac{2\log 1/\delta}{N_{1,0}}} \bigg) + \mathbb{I}(N_{1,1}=0, N_{1,0}>0) $$

$$ = \mathbb{I}(N_{1,1}>0, N_{1,0}>0)\, \mathbb{I}\bigg( (\beta_1-\beta_0) + \frac{\sum_i A_{1,i}\epsilon_{1,i}}{N_{1,1}} + \sqrt{\frac{2\log 1/\delta}{N_{1,1}}} > \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{N_{1,0}} + \sqrt{\frac{2\log 1/\delta}{N_{1,0}}} \bigg) + \mathbb{I}(N_{1,1}=0, N_{1,0}>0) $$

$$ = \mathbb{I}(N_{1,1}>0, N_{1,0}>0)\, \mathbb{I}\bigg( \sqrt{N_{1,0}}\,(\beta_1-\beta_0) + \sqrt{\frac{N_{1,0}}{N_{1,1}}}\bigg[ \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} + \sqrt{2\log 1/\delta} \bigg] > \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{\sqrt{N_{1,0}}} + \sqrt{2\log 1/\delta} \bigg) + \mathbb{I}(N_{1,1}=0, N_{1,0}>0). $$

Note that $\frac{N_{1,0}}{N_{1,1}} \xrightarrow{P} 1$ by Lemma 1. Thus, by Slutsky's Theorem and the continuous mapping theorem,
$$ M_n = \mathbb{I}\bigg( \sqrt{N_{1,0}}\,(\beta_1-\beta_0) + [1+o_p(1)] \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} + o_p(1) > \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{\sqrt{N_{1,0}}} \bigg) + o_p(1). \quad (20) $$
Note that
$$ \begin{pmatrix} \sqrt{\frac{1/2}{1/2+\pi_2^{(n)}}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} + o_p(1) \\[4pt] \sqrt{\frac{\pi_2^{(n)}}{1/2+\pi_2^{(n)}}} \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}} + o_p(1) \end{pmatrix} = \begin{pmatrix} \sqrt{\frac{1/2}{1/2+\pi_{\max}}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} + o_p(1) \\[4pt] \sqrt{\frac{\pi_{\max}}{1/2+\pi_{\max}}} \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}} + o_p(1) \end{pmatrix} M_n + \begin{pmatrix} \sqrt{\frac{1/2}{1/2+1-\pi_{\max}}} \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}} + o_p(1) \\[4pt] \sqrt{\frac{1-\pi_{\max}}{1/2+1-\pi_{\max}}} \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}} + o_p(1) \end{pmatrix} (1 - M_n). $$

Let $\big( Z_1^{(n)}, Z_2^{(n)}, Z_3^{(n)}, Z_4^{(n)} \big) := \Big( \frac{\sum_i A_{1,i}\epsilon_{1,i}}{\sqrt{N_{1,1}}},\ \frac{\sum_i (1-A_{1,i})\epsilon_{1,i}}{\sqrt{N_{1,0}}},\ \frac{\sum_i A_{2,i}\epsilon_{2,i}}{\sqrt{N_{2,1}}},\ \frac{\sum_i (1-A_{2,i})\epsilon_{2,i}}{\sqrt{N_{2,0}}} \Big)$. Note that by Theorem 3, $\big( Z_1^{(n)}, Z_2^{(n)}, Z_3^{(n)}, Z_4^{(n)} \big) \xrightarrow{D} \mathcal{N}(0, I_4)$.

When $\beta_1 > \beta_0$, $M_n \xrightarrow{P} 1$, and when $\beta_1 < \beta_0$, $M_n \xrightarrow{P} 0$; in both these cases the normalized errors are asymptotically normal. We now focus on the case that $\beta_1 = \beta_0$. By the continuous mapping theorem and Slutsky's theorem, the normalized errors equal
$$ [1,1] \begin{pmatrix} \sqrt{\frac{1/2}{1/2+\pi_{\max}}}\,Z_1^{(n)} + o_p(1) \\[4pt] \sqrt{\frac{\pi_{\max}}{1/2+\pi_{\max}}}\,Z_3^{(n)} + o_p(1) \end{pmatrix} \mathbb{I}\big( [1+o_p(1)]Z_1^{(n)} + o_p(1) > Z_2^{(n)} \big) + [1,1] \begin{pmatrix} \sqrt{\frac{1/2}{1/2+1-\pi_{\max}}}\,Z_1^{(n)} + o_p(1) \\[4pt] \sqrt{\frac{1-\pi_{\max}}{1/2+1-\pi_{\max}}}\,Z_3^{(n)} + o_p(1) \end{pmatrix} \Big[ 1 - \mathbb{I}\big( [1+o_p(1)]Z_1^{(n)} + o_p(1) > Z_2^{(n)} \big) \Big] + o_p(1) $$

$$ \xrightarrow{D} \Big( \sqrt{\tfrac{1/2}{1/2+\pi_{\max}}}\,Z_1 + \sqrt{\tfrac{\pi_{\max}}{1/2+\pi_{\max}}}\,Z_3 \Big) \mathbb{I}(Z_1>Z_2) + \Big( \sqrt{\tfrac{1/2}{1/2+1-\pi_{\max}}}\,Z_1 + \sqrt{\tfrac{1-\pi_{\max}}{1/2+1-\pi_{\max}}}\,Z_3 \Big) \mathbb{I}(Z_1<Z_2). \qquad \blacksquare $$

Note that (20) implies that if $\beta_1 = \beta_0$, then $\pi_2^{(n)}$ does not concentrate.
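The non-concentration of the UCB arm-selection indicator in (20) can also be seen in a small simulation (our own sketch; the helper name, $\beta_0 = \beta_1 = 0$, $\pi_1 = \frac12$, $\delta = 0.05$, and standard normal errors are assumptions). Under a zero margin, the fraction of runs with $U_1 > U_0$ stays near $\frac12$ as the batch size grows, so $\pi_2^{(n)}$ keeps flipping between $\pi_{\max}$ and $1 - \pi_{\max}$.

```python
import math
import random

def ucb_prefers_arm1(n, margin=0.0, delta_conf=0.05, seed=0):
    """One first batch with A_{1,i} ~ Bernoulli(1/2); returns the indicator
    I(U_1 > U_0) for the UCB indices of Theorem 7."""
    rng = random.Random(seed)
    s1 = s0 = 0.0
    n1 = n0 = 0
    for _ in range(n):
        arm1 = rng.random() < 0.5
        reward = (margin if arm1 else 0.0) + rng.gauss(0.0, 1.0)
        if arm1:
            s1 += reward; n1 += 1
        else:
            s0 += reward; n0 += 1
    bonus = lambda m: math.sqrt(2.0 * math.log(1.0 / delta_conf) / m)
    return (s1 / n1 + bonus(n1)) > (s0 / n0 + bonus(n0))

# fraction of runs preferring arm 1 at increasing batch sizes, zero margin
fracs = [sum(ucb_prefers_arm1(n, seed=s) for s in range(300)) / 300
         for n in (100, 1000, 10000)]
```

By symmetry of the two arms at zero margin, each entry of `fracs` should hover near $\frac12$ rather than converge to 0 or 1.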

D Asymptotic Normality of the Batched OLS Estimator: Multi-Arm Bandits

Theorem 3 (Asymptotic normality of the Batched OLS estimator for multi-arm bandits). Assuming Conditions 6 (weak moments) and 3 (conditionally i.i.d. actions), and a clipping rate of $f(n) = \omega(\frac1n)$ (Definition 1),
$$ \Big( \mathrm{diag}(N_{1,0}, N_{1,1})^{1/2} \big( \hat\beta_1^{BOLS} - \beta_1 \big),\ \mathrm{diag}(N_{2,0}, N_{2,1})^{1/2} \big( \hat\beta_2^{BOLS} - \beta_2 \big),\ \dots,\ \mathrm{diag}(N_{T,0}, N_{T,1})^{1/2} \big( \hat\beta_T^{BOLS} - \beta_T \big) \Big) \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_{2T}), $$
where $\beta_t = (\beta_{t,0}, \beta_{t,1})$, $N_{t,1} = \sum_{i=1}^n A_{t,i}$, and $N_{t,0} = \sum_{i=1}^n (1-A_{t,i})$. Note that in the body of this paper, we state Theorem 3 with conditions that are sufficient for the weaker conditions we use here.

Lemma 1. Assuming the conditions of Theorem 3, for any batch $t \in [1\colon T]$,
$$ \frac{N_{t,1}}{n\pi_t^{(n)}} = \frac{\sum_{i=1}^n A_{t,i}}{n\pi_t^{(n)}} \xrightarrow{P} 1 \qquad \text{and} \qquad \frac{N_{t,0}}{n(1-\pi_t^{(n)})} = \frac{\sum_{i=1}^n (1-A_{t,i})}{n(1-\pi_t^{(n)})} \xrightarrow{P} 1. $$

Proof of Lemma 1: To prove that $\frac{N_{t,1}}{n\pi_t^{(n)}} \xrightarrow{P} 1$, it is equivalent to show that $\frac{1}{n\pi_t^{(n)}}\sum_{i=1}^n (A_{t,i}-\pi_t^{(n)}) \xrightarrow{P} 0$. Let $\epsilon > 0$.
$$ P\bigg( \bigg| \frac{1}{n\pi_t^{(n)}}\sum_{i=1}^n (A_{t,i}-\pi_t^{(n)}) \bigg| > \epsilon \bigg) = P\bigg( \bigg| \frac{1}{n\pi_t^{(n)}}\sum_{i=1}^n (A_{t,i}-\pi_t^{(n)}) \bigg| \Big[ \mathbb{I}\big(\pi_t^{(n)} \in [f(n),1-f(n)]\big) + \mathbb{I}\big(\pi_t^{(n)} \notin [f(n),1-f(n)]\big) \Big] > \epsilon \bigg) $$

$$ \le P\bigg( \bigg| \frac{1}{n\pi_t^{(n)}}\sum_i (A_{t,i}-\pi_t^{(n)}) \bigg| \mathbb{I}\big(\pi_t^{(n)} \in [f(n),1-f(n)]\big) > \frac{\epsilon}{2} \bigg) + P\bigg( \bigg| \frac{1}{n\pi_t^{(n)}}\sum_i (A_{t,i}-\pi_t^{(n)}) \bigg| \mathbb{I}\big(\pi_t^{(n)} \notin [f(n),1-f(n)]\big) > \frac{\epsilon}{2} \bigg). $$

Since by our clipping assumption $\mathbb{I}\big(\pi_t^{(n)} \in [f(n),1-f(n)]\big) \xrightarrow{P} 1$, the second probability above converges to 0 as $n \to \infty$. We now show that the first probability also goes to zero. Note that $\mathbb{E}\big[ \frac{1}{n\pi_t^{(n)}}\sum_i (A_{t,i}-\pi_t^{(n)}) \big] = \mathbb{E}\big[ \frac{1}{n\pi_t^{(n)}}\sum_i (\mathbb{E}[A_{t,i}\mid H_{t-1}^{(n)}]-\pi_t^{(n)}) \big] = 0$. So by Chebyshev's inequality, for any $\epsilon > 0$,
$$ P\bigg( \bigg| \frac{1}{n\pi_t^{(n)}}\sum_i (A_{t,i}-\pi_t^{(n)}) \bigg| \mathbb{I}\big(\pi_t^{(n)} \in [f(n),1-f(n)]\big) > \epsilon \bigg) \le \frac{1}{\epsilon^2 n^2}\, \mathbb{E}\bigg[ \frac{1}{(\pi_t^{(n)})^2} \Big( \sum_i (A_{t,i}-\pi_t^{(n)}) \Big)^2 \mathbb{I}\big(\pi_t^{(n)} \in [f(n),1-f(n)]\big) \bigg] $$
$$ \le \frac{1}{\epsilon^2 n^2} \sum_{i=1}^n\sum_{j=1}^n \mathbb{E}\bigg[ \frac{1}{(\pi_t^{(n)})^2}\, \mathbb{I}\big(\pi_t^{(n)} \in [f(n),1-f(n)]\big)\, \mathbb{E}\Big[ A_{t,i}A_{t,j} - \pi_t^{(n)}(A_{t,i}+A_{t,j}) + (\pi_t^{(n)})^2 \,\Big|\, H_{t-1}^{(n)} \Big] \bigg] $$

$$ = \frac{1}{\epsilon^2 n^2} \sum_{i=1}^n\sum_{j=1}^n \mathbb{E}\bigg[ \frac{1}{(\pi_t^{(n)})^2}\, \mathbb{I}\big(\pi_t^{(n)} \in [f(n),1-f(n)]\big) \Big( \mathbb{E}\big[ A_{t,i}A_{t,j} \,\big|\, H_{t-1}^{(n)} \big] - (\pi_t^{(n)})^2 \Big) \bigg]. \quad (21) $$

Note that if $i \ne j$, since $A_{t,i} \mid H_{t-1}^{(n)} \stackrel{i.i.d.}{\sim} \mathrm{Bernoulli}(\pi_t^{(n)})$, $\mathbb{E}[A_{t,i}A_{t,j}\mid H_{t-1}^{(n)}] = \mathbb{E}[A_{t,i}\mid H_{t-1}^{(n)}]\,\mathbb{E}[A_{t,j}\mid H_{t-1}^{(n)}] = (\pi_t^{(n)})^2$, so (21) above equals
$$ \frac{1}{\epsilon^2 n^2} \sum_{i=1}^n \mathbb{E}\bigg[ \frac{1}{(\pi_t^{(n)})^2}\, \mathbb{I}\big(\pi_t^{(n)} \in [f(n),1-f(n)]\big) \Big( \mathbb{E}\big[ A_{t,i} \,\big|\, H_{t-1}^{(n)} \big] - (\pi_t^{(n)})^2 \Big) \bigg] $$

$$ = \frac{1}{\epsilon^2 n^2} \sum_{i=1}^n \mathbb{E}\bigg[ \frac{1-\pi_t^{(n)}}{\pi_t^{(n)}}\, \mathbb{I}\big(\pi_t^{(n)} \in [f(n),1-f(n)]\big) \bigg] = \frac{1}{\epsilon^2 n}\, \mathbb{E}\bigg[ \frac{1-\pi_t^{(n)}}{\pi_t^{(n)}}\, \mathbb{I}\big(\pi_t^{(n)} \in [f(n),1-f(n)]\big) \bigg] \le \frac{1}{\epsilon^2 n f(n)} \to 0, $$
where the limit holds because we assume $f(n) = \omega(\frac1n)$, so $f(n)\,n \to \infty$. We can make a very similar argument for $\frac{N_{t,0}}{n(1-\pi_t^{(n)})} \xrightarrow{P} 1$. $\blacksquare$
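Lemma 1 can be illustrated directly (our own sketch; the helper name and the choice of clipping floor $f(n) = n^{-1/2}$, which is $\omega(\frac1n)$, are assumptions): even when the sampling probability shrinks with $n$ at the clipping floor, the ratio $N_{t,1}/(n\pi_t^{(n)})$ tightens around 1 as $n$ grows.

```python
import random

def ratio(n, pi, seed=0):
    """Returns N_{t,1} / (n * pi_t) for one batch of n Bernoulli(pi) actions."""
    rng = random.Random(seed)
    n1 = sum(rng.random() < pi for _ in range(n))
    return n1 / (n * pi)

worst = []
for n in (10**3, 10**4, 10**5):
    pi = n ** -0.5   # probability at the clipping floor f(n) = n^{-1/2}
    worst.append(max(abs(ratio(n, pi, seed=s) - 1.0) for s in range(100)))
# worst-case deviation of the ratio from 1 shrinks as n grows
```

The point of the $f(n) = \omega(\frac1n)$ condition is visible here: the concentration is driven by $n\pi_t^{(n)} \ge n f(n) \to \infty$, not by $\pi_t^{(n)}$ itself staying bounded away from zero.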

Proof of Theorem 3 (Asymptotic normality of the Batched OLS estimator for multi-arm bandits): For readability, in this proof we drop the $(n)$ superscript on $\pi_t^{(n)}$. Note that
$$ \mathrm{diag}(N_{t,0}, N_{t,1})^{1/2} \big( \hat\beta_t^{BOLS} - \beta_t \big) = \mathrm{diag}(N_{t,0}, N_{t,1})^{-1/2} \sum_{i=1}^n \begin{bmatrix} 1-A_{t,i} \\ A_{t,i} \end{bmatrix} \epsilon_{t,i}. $$
We want to show that
$$ \bigg( N_{1,0}^{-1/2}\sum_i (1-A_{1,i})\epsilon_{1,i},\ N_{1,1}^{-1/2}\sum_i A_{1,i}\epsilon_{1,i},\ \dots,\ N_{T,0}^{-1/2}\sum_i (1-A_{T,i})\epsilon_{T,i},\ N_{T,1}^{-1/2}\sum_i A_{T,i}\epsilon_{T,i} \bigg) \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_{2T}). $$
By Lemma 1 and Slutsky's Theorem, it is sufficient to show that as $n \to \infty$,

$$ \bigg( \frac{1}{\sqrt{n(1-\pi_1)}}\sum_i (1-A_{1,i})\epsilon_{1,i},\ \frac{1}{\sqrt{n\pi_1}}\sum_i A_{1,i}\epsilon_{1,i},\ \dots,\ \frac{1}{\sqrt{n(1-\pi_T)}}\sum_i (1-A_{T,i})\epsilon_{T,i},\ \frac{1}{\sqrt{n\pi_T}}\sum_i A_{T,i}\epsilon_{T,i} \bigg) \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_{2T}). $$
By the Cramér–Wold device, it is sufficient to show that for any fixed vector $c \in \mathbb{R}^{2T}$ with $\|c\|_2 = 1$, as $n \to \infty$,
$$ \sum_{t=1}^T n^{-1/2}\, c_t^\top \begin{bmatrix} 1-\pi_t & 0 \\ 0 & \pi_t \end{bmatrix}^{-1/2} \sum_{i=1}^n \begin{bmatrix} 1-A_{t,i} \\ A_{t,i} \end{bmatrix} \epsilon_{t,i} \xrightarrow{D} \mathcal{N}(0, \sigma^2), $$

where we break up $c$ so that $c = [c_1, c_2, \dots, c_T]^\top \in \mathbb{R}^{2T}$ with $c_t \in \mathbb{R}^2$ for $t \in [1\colon T]$. Let us define
$$ Y_{t,i} := n^{-1/2}\, c_t^\top \begin{bmatrix} 1-\pi_t & 0 \\ 0 & \pi_t \end{bmatrix}^{-1/2} \begin{bmatrix} 1-A_{t,i} \\ A_{t,i} \end{bmatrix} \epsilon_{t,i}. $$
The sequence $\{Y_{1,1}, Y_{1,2}, \dots, Y_{1,n}, \dots, Y_{T,1}, \dots, Y_{T,n}\}$ is a martingale difference array with respect to the sequence of histories $\{H_t^{(n)}\}_{t=1}^T$, since
$$ \mathbb{E}[Y_{t,i} \mid H_{t-1}^{(n)}] = n^{-1/2}\, c_t^\top \begin{bmatrix} 1-\pi_t & 0 \\ 0 & \pi_t \end{bmatrix}^{-1/2} \begin{bmatrix} (1-\pi_t)\,\mathbb{E}[\epsilon_{t,i} \mid H_{t-1}^{(n)}, A_{t,i}=0] \\ \pi_t\,\mathbb{E}[\epsilon_{t,i} \mid H_{t-1}^{(n)}, A_{t,i}=1] \end{bmatrix} = 0 $$
for all $i \in [1\colon n]$ and all $t \in [1\colon T]$. We then apply the martingale central limit theorem of [8] to $Y_{t,i}$ to show the desired result (see the proof of Theorem 5 in Appendix B for the statement of the martingale CLT conditions).

Condition (a): Martingale condition. The first condition holds because $\mathbb{E}[Y_{t,i}\mid H_{t-1}^{(n)}] = 0$ for all $i \in [1\colon n]$ and all $t \in [1\colon T]$.

Condition (b): Conditional variance.
$$ \sum_{t=1}^T\sum_{i=1}^n \mathbb{E}[Y_{t,i}^2 \mid H_{t-1}^{(n)}] = \sum_t\sum_i \frac1n\, c_t^\top \begin{bmatrix} 1-\pi_t & 0 \\ 0 & \pi_t \end{bmatrix}^{-1/2} \begin{bmatrix} \mathbb{E}[(1-A_{t,i})\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}] & 0 \\ 0 & \mathbb{E}[A_{t,i}\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}] \end{bmatrix} \begin{bmatrix} 1-\pi_t & 0 \\ 0 & \pi_t \end{bmatrix}^{-1/2} c_t. $$
Since $\mathbb{E}[A_{t,i}\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}] = \pi_t\, \mathbb{E}[\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}, A_{t,i}=1] = \sigma^2\pi_t$ and $\mathbb{E}[(1-A_{t,i})\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}] = (1-\pi_t)\,\mathbb{E}[\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}, A_{t,i}=0] = \sigma^2(1-\pi_t)$,
$$ = \sum_{t=1}^T\sum_{i=1}^n n^{-1} c_t^\top c_t \sigma^2 = \sum_{t=1}^T c_t^\top c_t \sigma^2 = \sigma^2. $$

Condition (c): Lindeberg condition. Let $\delta > 0$. For $c_t = [c_{t,0}, c_{t,1}]^\top$,
$$ \sum_{t=1}^T\sum_{i=1}^n \mathbb{E}\big[ Y_{t,i}^2\, \mathbb{I}(Y_{t,i}^2 > \delta^2) \,\big|\, H_{t-1}^{(n)} \big] = \sum_t \frac1n \sum_i \bigg( c_{t,0}^2\, \mathbb{E}\Big[ \epsilon_{t,i}^2\, \mathbb{I}\Big( \epsilon_{t,i}^2 > \tfrac{n\delta^2(1-\pi_t)}{c_{t,0}^2} \Big) \,\Big|\, H_{t-1}^{(n)}, A_{t,i}=0 \Big] + c_{t,1}^2\, \mathbb{E}\Big[ \epsilon_{t,i}^2\, \mathbb{I}\Big( \epsilon_{t,i}^2 > \tfrac{n\delta^2\pi_t}{c_{t,1}^2} \Big) \,\Big|\, H_{t-1}^{(n)}, A_{t,i}=1 \Big] \bigg) $$
$$ \le \sum_{t=1}^T \max_{i\in[1\colon n]} \bigg( c_{t,0}^2\, \mathbb{E}\Big[ \epsilon_{t,i}^2\, \mathbb{I}\Big( \epsilon_{t,i}^2 > \tfrac{n\delta^2(1-\pi_t)}{c_{t,0}^2} \Big) \,\Big|\, H_{t-1}^{(n)}, A_{t,i}=0 \Big] + c_{t,1}^2\, \mathbb{E}\Big[ \epsilon_{t,i}^2\, \mathbb{I}\Big( \epsilon_{t,i}^2 > \tfrac{n\delta^2\pi_t}{c_{t,1}^2} \Big) \,\Big|\, H_{t-1}^{(n)}, A_{t,i}=1 \Big] \bigg). $$
Note that for any $t \in [1\colon T]$ and $i \in [1\colon n]$,
$$ \mathbb{E}\Big[ \epsilon_{t,i}^2\, \mathbb{I}\Big( \epsilon_{t,i}^2 > \tfrac{n\delta^2\pi_t}{c_{t,1}^2} \Big) \,\Big|\, H_{t-1}^{(n)}, A_{t,i}=1 \Big] \Big[ \mathbb{I}\big(\pi_t \in [f(n),1-f(n)]\big) + \mathbb{I}\big(\pi_t \notin [f(n),1-f(n)]\big) \Big] \le \mathbb{E}\Big[ \epsilon_{t,i}^2\, \mathbb{I}\Big( \epsilon_{t,i}^2 > \tfrac{n\delta^2 f(n)}{c_{t,1}^2} \Big) \,\Big|\, H_{t-1}^{(n)}, A_{t,i}=1 \Big] + \sigma^2\, \mathbb{I}\big(\pi_t \notin [f(n),1-f(n)]\big). $$
The second term converges in probability to zero as $n \to \infty$ by our clipping assumption. We now show that the first term goes to zero in probability. Since we assume $f(n) = \omega(\frac1n)$, $nf(n) \to \infty$, so it is sufficient to show that for all $t$,
$$ \lim_{m\to\infty} \max_{i\in[1\colon n]} \mathbb{E}\big[ \epsilon_{t,i}^2\, \mathbb{I}(\epsilon_{t,i}^2 > m) \,\big|\, H_{t-1}^{(n)}, A_{t,i}=1 \big] = 0. $$
By Condition 6, we have that for all $n \ge 1$, $\max_{t\in[1\colon T], i\in[1\colon n]} \mathbb{E}[\varphi(\epsilon_{t,i}^2)\mid H_{t-1}^{(n)}, A_{t,i}=1] < M$. Since we assume that $\lim_{x\to\infty}\varphi(x)/x = \infty$, for all $m$ there exists a $b_m$ such that $\varphi(x) \ge mMx$ for all $x \ge b_m$. So, for all $n, t, i$,
$$ M \ge \mathbb{E}\big[ \varphi(\epsilon_{t,i}^2) \,\big|\, H_{t-1}^{(n)}, A_{t,i}=1 \big] \ge \mathbb{E}\big[ \varphi(\epsilon_{t,i}^2)\, \mathbb{I}(\epsilon_{t,i}^2 \ge b_m) \,\big|\, H_{t-1}^{(n)}, A_{t,i}=1 \big] \ge mM\, \mathbb{E}\big[ \epsilon_{t,i}^2\, \mathbb{I}(\epsilon_{t,i}^2 \ge b_m) \,\big|\, H_{t-1}^{(n)}, A_{t,i}=1 \big]. $$
Thus,
$$ \max_{t\in[1\colon T], i\in[1\colon n]} \mathbb{E}\big[ \epsilon_{t,i}^2\, \mathbb{I}(\epsilon_{t,i}^2 \ge b_m) \,\big|\, H_{t-1}^{(n)}, A_{t,i}=1 \big] \le \frac1m. $$
We can make a very similar argument that for all $t \in [1\colon T]$, as $n \to \infty$,
$$ \max_{i\in[1\colon n]} \mathbb{E}\Big[ \epsilon_{t,i}^2\, \mathbb{I}\Big( \epsilon_{t,i}^2 > \tfrac{n\delta^2(1-\pi_t)}{c_{t,0}^2} \Big) \,\Big|\, H_{t-1}^{(n)}, A_{t,i}=0 \Big] \xrightarrow{P} 0. \qquad \blacksquare $$

Corollary 3 (Asymptotic Normality of the Batched OLS Estimator of the Margin; two-arm bandit setting). Assume the same conditions as Theorem 3.
For each $t \in [1\colon T]$, we have the BOLS estimator of the margin $\Delta_t = \beta_{t,1} - \beta_{t,0}$:
$$ \hat\Delta_t^{BOLS} = \frac{\sum_{i=1}^n A_{t,i}R_{t,i}}{N_{t,1}} - \frac{\sum_{i=1}^n (1-A_{t,i})R_{t,i}}{N_{t,0}}. $$
We show that as $n \to \infty$,
$$ \bigg( \sqrt{\frac{N_{1,0}N_{1,1}}{n}} \big( \hat\Delta_1^{BOLS} - \Delta_1 \big),\ \sqrt{\frac{N_{2,0}N_{2,1}}{n}} \big( \hat\Delta_2^{BOLS} - \Delta_2 \big),\ \dots,\ \sqrt{\frac{N_{T,0}N_{T,1}}{n}} \big( \hat\Delta_T^{BOLS} - \Delta_T \big) \bigg) \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_T). $$

Proof: Writing $\hat\Delta_t^{BOLS}$ as the arm-1 sample mean minus the arm-0 sample mean, $\hat\Delta_t^{BOLS} - \Delta_t = \frac{\sum_i A_{t,i}\epsilon_{t,i}}{N_{t,1}} - \frac{\sum_i (1-A_{t,i})\epsilon_{t,i}}{N_{t,0}}$, so
$$ \sqrt{\frac{N_{t,0}N_{t,1}}{n}} \big( \hat\Delta_t^{BOLS} - \Delta_t \big) = \sqrt{\frac{N_{t,0}}{n}} \frac{\sum_i A_{t,i}\epsilon_{t,i}}{\sqrt{N_{t,1}}} - \sqrt{\frac{N_{t,1}}{n}} \frac{\sum_i (1-A_{t,i})\epsilon_{t,i}}{\sqrt{N_{t,0}}} = \bigg[ -\sqrt{\frac{N_{t,1}}{n}},\ \sqrt{\frac{N_{t,0}}{n}} \bigg] \mathrm{diag}(N_{t,0}, N_{t,1})^{-1/2} \sum_{i=1}^n \begin{bmatrix} 1-A_{t,i} \\ A_{t,i} \end{bmatrix} \epsilon_{t,i}. $$
By Slutsky's Theorem and Lemma 1, it is sufficient to show that as $n \to \infty$,
$$ \Bigg( \Big[ -\sqrt{\pi_t^{(n)}},\ \sqrt{1-\pi_t^{(n)}} \Big] \begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2} \frac{1}{\sqrt n} \sum_{i=1}^n \begin{bmatrix} 1-A_{t,i} \\ A_{t,i} \end{bmatrix} \epsilon_{t,i} \Bigg)_{t=1}^T \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_T). $$
By the Cramér–Wold device, it is sufficient to show that for any fixed vector $d = [d_1, d_2, \dots, d_T]^\top \in \mathbb{R}^T$ with $\|d\|_2 = 1$,
$$ \sum_{t=1}^T \frac{d_t}{\sqrt n} \Big[ -\sqrt{\pi_t^{(n)}},\ \sqrt{1-\pi_t^{(n)}} \Big] \begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2} \sum_{i=1}^n \begin{bmatrix} 1-A_{t,i} \\ A_{t,i} \end{bmatrix} \epsilon_{t,i} \xrightarrow{D} \mathcal{N}(0, \sigma^2). $$
Define
$$ Y_{t,i} := \frac{d_t}{\sqrt n} \Big[ -\sqrt{\pi_t^{(n)}},\ \sqrt{1-\pi_t^{(n)}} \Big] \begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2} \begin{bmatrix} 1-A_{t,i} \\ A_{t,i} \end{bmatrix} \epsilon_{t,i}. $$
Then $\{Y_{1,1}, Y_{1,2}, \dots, Y_{1,n}, \dots, Y_{T,1}, \dots, Y_{T,n}\}$ is a martingale difference array with respect to the sequence of histories $\{H_t^{(n)}\}_{t=1}^T$ because, for all $i \in [1\colon n]$ and $t \in [1\colon T]$,

$$ \mathbb{E}[Y_{t,i} \mid H_{t-1}^{(n)}] = \frac{d_t}{\sqrt n} \Big[ -\sqrt{\pi_t^{(n)}},\ \sqrt{1-\pi_t^{(n)}} \Big] \begin{bmatrix} 1-\pi_t^{(n)} & 0 \\ 0 & \pi_t^{(n)} \end{bmatrix}^{-1/2} \begin{bmatrix} (1-\pi_t^{(n)})\,\mathbb{E}[\epsilon_{t,i}\mid H_{t-1}^{(n)}, A_{t,i}=0] \\ \pi_t^{(n)}\,\mathbb{E}[\epsilon_{t,i}\mid H_{t-1}^{(n)}, A_{t,i}=1] \end{bmatrix} = 0. $$
We now apply the martingale central limit theorem of [8] to $Y_{t,i}$ to show the desired result. Verifying the conditions for the martingale CLT is equivalent to what we did in the proof of Theorem 3—the only difference is that we replace $c_t^\top$ in the Theorem 3 proof with $d_t\big[ -\sqrt{\pi_t^{(n)}}, \sqrt{1-\pi_t^{(n)}} \big]$ in this proof. Even though $c_t$ is a constant vector and $d_t\big[ -\sqrt{\pi_t^{(n)}}, \sqrt{1-\pi_t^{(n)}} \big]$ is a random vector, the proof still goes through with this adjusted $c_t$ vector, since (i) $d_t\big[ -\sqrt{\pi_t^{(n)}}, \sqrt{1-\pi_t^{(n)}} \big] \in H_{t-1}^{(n)}$, (ii) $\big\| \big[ -\sqrt{\pi_t^{(n)}}, \sqrt{1-\pi_t^{(n)}} \big] \big\|_2 = 1$, and (iii) $\frac{n\delta^2\pi_t^{(n)}}{c_{t,1}^2} = \frac{n\delta^2\pi_t^{(n)}}{d_t^2(1-\pi_t^{(n)})} \to \infty$ and $\frac{n\delta^2(1-\pi_t^{(n)})}{c_{t,0}^2} = \frac{n\delta^2(1-\pi_t^{(n)})}{d_t^2\pi_t^{(n)}} \to \infty$. $\blacksquare$
Corollary 4 (Consistency of the BOLS variance estimator). Assuming Conditions 1 (moments) and 3 (conditionally i.i.d. actions), and a clipping rate of $f(n) = \omega(\frac1n)$ (Definition 1), for all $t \in [1\colon T]$, as $n \to \infty$,
$$ \hat\sigma_t^2 = \frac{1}{n-2} \sum_{i=1}^n \Big( R_{t,i} - A_{t,i}\hat\beta_{t,1}^{BOLS} - (1-A_{t,i})\hat\beta_{t,0}^{BOLS} \Big)^2 \xrightarrow{P} \sigma^2. $$
Proof:
$$ \hat\sigma_t^2 = \frac{1}{n-2} \sum_{i=1}^n \Big( R_{t,i} - A_{t,i}\hat\beta_{t,1}^{BOLS} - (1-A_{t,i})\hat\beta_{t,0}^{BOLS} \Big)^2 $$
$$ = \frac{1}{n-2} \sum_i \bigg( A_{t,i}\beta_{t,1} + (1-A_{t,i})\beta_{t,0} + \epsilon_{t,i} - A_{t,i}\bigg[ \beta_{t,1} + \frac{\sum_j A_{t,j}\epsilon_{t,j}}{N_{t,1}} \bigg] - (1-A_{t,i})\bigg[ \beta_{t,0} + \frac{\sum_j (1-A_{t,j})\epsilon_{t,j}}{N_{t,0}} \bigg] \bigg)^2 $$
$$ = \frac{1}{n-2} \sum_i \bigg( \epsilon_{t,i} - A_{t,i} \frac{\sum_j A_{t,j}\epsilon_{t,j}}{N_{t,1}} - (1-A_{t,i}) \frac{\sum_j (1-A_{t,j})\epsilon_{t,j}}{N_{t,0}} \bigg)^2 $$

$$ = \frac{1}{n-2} \sum_i \bigg( \epsilon_{t,i}^2 - 2A_{t,i}\epsilon_{t,i} \frac{\sum_j A_{t,j}\epsilon_{t,j}}{N_{t,1}} - 2(1-A_{t,i})\epsilon_{t,i} \frac{\sum_j (1-A_{t,j})\epsilon_{t,j}}{N_{t,0}} + A_{t,i} \frac{\big(\sum_j A_{t,j}\epsilon_{t,j}\big)^2}{N_{t,1}^2} + (1-A_{t,i}) \frac{\big(\sum_j (1-A_{t,j})\epsilon_{t,j}\big)^2}{N_{t,0}^2} \bigg) $$

$$ = \frac{1}{n-2}\sum_i \epsilon_{t,i}^2 - 2\frac{\big(\sum_i A_{t,i}\epsilon_{t,i}\big)^2}{(n-2)N_{t,1}} - 2\frac{\big(\sum_i (1-A_{t,i})\epsilon_{t,i}\big)^2}{(n-2)N_{t,0}} + \frac{N_{t,1}}{n-2}\bigg( \frac{\sum_i A_{t,i}\epsilon_{t,i}}{N_{t,1}} \bigg)^2 + \frac{N_{t,0}}{n-2}\bigg( \frac{\sum_i (1-A_{t,i})\epsilon_{t,i}}{N_{t,0}} \bigg)^2 $$

$$ = \frac{1}{n-2}\sum_i \epsilon_{t,i}^2 - \frac{\big(\sum_i A_{t,i}\epsilon_{t,i}\big)^2}{(n-2)N_{t,1}} - \frac{\big(\sum_i (1-A_{t,i})\epsilon_{t,i}\big)^2}{(n-2)N_{t,0}}. $$
Note that $\frac{1}{n-2}\sum_i \epsilon_{t,i}^2 \xrightarrow{P} \sigma^2$ because for all $\delta > 0$,

$$ P\bigg( \bigg| \frac{1}{n-2}\sum_i \epsilon_{t,i}^2 - \sigma^2 \bigg| > \delta \bigg) \le P\bigg( \bigg| \frac{1}{n-2}\sum_i \big( \epsilon_{t,i}^2 - \sigma^2 \big) \bigg| > \frac{\delta}{2} \bigg) + P\bigg( \frac{2\sigma^2}{n-2} > \frac{\delta}{2} \bigg). $$
The second term above is zero for sufficiently large $n$, so we now focus on the first term. By Chebyshev's inequality,
$$ P\bigg( \bigg| \frac{1}{n-2}\sum_i \big( \epsilon_{t,i}^2 - \sigma^2 \big) \bigg| > \frac{\delta}{2} \bigg) \le \frac{4}{\delta^2(n-2)^2}\, \mathbb{E}\bigg[ \sum_i\sum_j \big( \epsilon_{t,i}^2 - \sigma^2 \big)\big( \epsilon_{t,j}^2 - \sigma^2 \big) \bigg] = \frac{4}{\delta^2(n-2)^2} \sum_i \mathbb{E}\big[ \big( \epsilon_{t,i}^2 - \sigma^2 \big)^2 \big], $$

where the equality above holds because for $i \ne j$, $\mathbb{E}\big[ (\epsilon_{t,i}^2-\sigma^2)(\epsilon_{t,j}^2-\sigma^2) \big] = \mathbb{E}\big[ \mathbb{E}[\epsilon_{t,i}^2-\sigma^2 \mid H_{t-1}^{(n)}]\, \mathbb{E}[\epsilon_{t,j}^2-\sigma^2 \mid H_{t-1}^{(n)}] \big] = 0$. By Condition 1, $\mathbb{E}[\epsilon_{t,i}^4 \mid H_{t-1}^{(n)}] < M < \infty$, so

$$ \frac{4}{\delta^2(n-2)^2} \sum_{i=1}^n \mathbb{E}\big[ \mathbb{E}\big[ \epsilon_{t,i}^4 - 2\epsilon_{t,i}^2\sigma^2 + \sigma^4 \,\big|\, H_{t-1}^{(n)} \big] \big] \le \frac{4n(M+\sigma^4)}{\delta^2(n-2)^2} \to 0. $$

Thus by Slutsky's Theorem it is sufficient to show that $\frac{(\sum_i A_{t,i}\epsilon_{t,i})^2}{(n-2)N_{t,1}} + \frac{(\sum_i (1-A_{t,i})\epsilon_{t,i})^2}{(n-2)N_{t,0}} \xrightarrow{P} 0$. We will only show that $\frac{(\sum_i A_{t,i}\epsilon_{t,i})^2}{(n-2)N_{t,1}} \xrightarrow{P} 0$; $\frac{(\sum_i (1-A_{t,i})\epsilon_{t,i})^2}{(n-2)N_{t,0}} \xrightarrow{P} 0$ holds by a very similar argument.

Note that by Lemma 1, $\frac{N_{t,1}}{n\pi_t^{(n)}} \xrightarrow{P} 1$. Thus, by Slutsky's Theorem it is sufficient to show that $\frac{(\sum_i A_{t,i}\epsilon_{t,i})^2}{(n-2)n\pi_t^{(n)}} \xrightarrow{P} 0$. Let $\delta > 0$. By Markov's inequality,
$$ P\bigg( \frac{\big(\sum_i A_{t,i}\epsilon_{t,i}\big)^2}{(n-2)n\pi_t^{(n)}} > \delta \bigg) \le \frac{1}{\delta}\, \mathbb{E}\bigg[ \frac{1}{(n-2)n\pi_t^{(n)}} \Big( \sum_i A_{t,i}\epsilon_{t,i} \Big)^2 \bigg] = \frac{1}{\delta}\, \mathbb{E}\bigg[ \frac{1}{(n-2)n\pi_t^{(n)}} \sum_i\sum_j A_{t,i}A_{t,j}\epsilon_{t,i}\epsilon_{t,j} \bigg]. $$

Since $\pi_t^{(n)} \in H_{t-1}^{(n)}$,
$$ = \frac{1}{\delta}\, \mathbb{E}\bigg[ \frac{1}{(n-2)n\pi_t^{(n)}} \sum_i\sum_j \mathbb{E}\big[ A_{t,i}A_{t,j}\epsilon_{t,i}\epsilon_{t,j} \,\big|\, H_{t-1}^{(n)} \big] \bigg]. $$

Since for $i \ne j$, $\mathbb{E}[A_{t,i}A_{t,j}\epsilon_{t,i}\epsilon_{t,j} \mid H_{t-1}^{(n)}] = \mathbb{E}[A_{t,i}\epsilon_{t,i}\mid H_{t-1}^{(n)}]\, \mathbb{E}[A_{t,j}\epsilon_{t,j}\mid H_{t-1}^{(n)}] = 0$,
$$ = \frac{1}{\delta}\, \mathbb{E}\bigg[ \frac{1}{(n-2)n\pi_t^{(n)}} \sum_{i=1}^n \mathbb{E}\big[ A_{t,i}\epsilon_{t,i}^2 \,\big|\, H_{t-1}^{(n)} \big] \bigg]. $$

Since $\mathbb{E}[A_{t,i}\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}] = \mathbb{E}[\epsilon_{t,i}^2 \mid H_{t-1}^{(n)}, A_{t,i}=1]\, \pi_t^{(n)} = \sigma^2\pi_t^{(n)}$,
$$ = \frac{1}{\delta}\, \mathbb{E}\bigg[ \frac{n\sigma^2\pi_t^{(n)}}{(n-2)n\pi_t^{(n)}} \bigg] = \frac{\sigma^2}{\delta(n-2)} \to 0. \qquad \blacksquare $$
E Asymptotic Normality of the Batched OLS Estimator: Contextual Bandits

Theorem 4 (Asymptotic normality of the Batched OLS statistic). For a $K$-armed contextual bandit, for each $t \in [1\colon T]$ we have the BOLS estimator:

$$ \hat\beta_t^{BOLS} = \begin{bmatrix} \mathbf{C}_{t,0} & 0 & \cdots & 0 \\ 0 & \mathbf{C}_{t,1} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \mathbf{C}_{t,K-1} \end{bmatrix}^{-1} \sum_{i=1}^n \begin{bmatrix} \mathbb{I}(A_{t,i}=0)\,C_{t,i} \\ \mathbb{I}(A_{t,i}=1)\,C_{t,i} \\ \vdots \\ \mathbb{I}(A_{t,i}=K-1)\,C_{t,i} \end{bmatrix} R_{t,i} \in \mathbb{R}^{Kd}, $$
where $\mathbf{C}_{t,k} := \sum_{i=1}^n \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top \in \mathbb{R}^{d\times d}$. Assuming Conditions 6 (weak moments), 3 (conditionally i.i.d. actions), 4 (conditionally i.i.d. contexts), and 5 (bounded contexts), and a conditional clipping rate $f(n) = c$ for some $0 < c \le \frac12$ (see Definition 2), we show that as $n \to \infty$,
$$ \begin{pmatrix} \mathrm{diag}\big( \mathbf{C}_{1,0}, \mathbf{C}_{1,1}, \dots, \mathbf{C}_{1,K-1} \big)^{1/2} \big( \hat\beta_1^{BOLS} - \beta_1 \big) \\ \mathrm{diag}\big( \mathbf{C}_{2,0}, \mathbf{C}_{2,1}, \dots, \mathbf{C}_{2,K-1} \big)^{1/2} \big( \hat\beta_2^{BOLS} - \beta_2 \big) \\ \vdots \\ \mathrm{diag}\big( \mathbf{C}_{T,0}, \mathbf{C}_{T,1}, \dots, \mathbf{C}_{T,K-1} \big)^{1/2} \big( \hat\beta_T^{BOLS} - \beta_T \big) \end{pmatrix} \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_{TKd}). $$

Lemma 2. Assuming the conditions of Theorem 4, for any batch $t \in [1\colon T]$ and any arm $k \in [0\colon K-1]$, as $n \to \infty$,
$$ \bigg( \sum_{i=1}^n \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top \bigg) \big( nZ_{t,k}P_{t,k} \big)^{-1} \xrightarrow{P} I_d \quad (22) $$
$$ \bigg( \sum_{i=1}^n \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top \bigg)^{1/2} \big( nZ_{t,k}P_{t,k} \big)^{-1/2} \xrightarrow{P} I_d \quad (23) $$
where $P_{t,k} := P(A_{t,i}=k \mid H_{t-1}^{(n)})$ and $Z_{t,k} := \mathbb{E}\big[ C_{t,i}C_{t,i}^\top \,\big|\, H_{t-1}^{(n)}, A_{t,i}=k \big]$.

Proof of Lemma 2: We first show that as $n \to \infty$, $\frac1n \sum_{i=1}^n \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top - Z_{t,k}P_{t,k} \xrightarrow{P} 0$. It is sufficient to show that the convergence holds entry-wise, so for any $r, s \in [0\colon d-1]$, as $n \to \infty$, $\frac1n\sum_i \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top(r,s) - P_{t,k}Z_{t,k}(r,s) \xrightarrow{P} 0$. Note that
$$ \mathbb{E}\big[ \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top(r,s) - P_{t,k}Z_{t,k}(r,s) \big] = \mathbb{E}\big[ \mathbb{E}\big[ C_{t,i}C_{t,i}^\top(r,s) \,\big|\, H_{t-1}^{(n)}, A_{t,i}=k \big] P_{t,k} - P_{t,k}Z_{t,k}(r,s) \big] = 0. $$

By Chebyshev's inequality, for any $\epsilon > 0$,
$$ P\bigg( \bigg| \frac1n\sum_i \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top(r,s) - P_{t,k}Z_{t,k}(r,s) \bigg| > \epsilon \bigg) \le \frac{1}{\epsilon^2 n^2}\, \mathbb{E}\bigg[ \bigg( \sum_i \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top(r,s) - P_{t,k}Z_{t,k}(r,s) \bigg)^2 \bigg] $$
$$ = \frac{1}{\epsilon^2 n^2} \sum_i\sum_j \mathbb{E}\Big[ \big( \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top(r,s) - P_{t,k}Z_{t,k}(r,s) \big)\big( \mathbb{I}(A_{t,j}=k)\,C_{t,j}C_{t,j}^\top(r,s) - P_{t,k}Z_{t,k}(r,s) \big) \Big]. \quad (24) $$
By conditional independence and by the law of iterated expectations (conditioning on $H_{t-1}^{(n)}$), for $i \ne j$ the cross terms in (24) vanish. Thus, (24) equals
$$ \frac{1}{\epsilon^2 n^2} \sum_i \mathbb{E}\Big[ \big( \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top(r,s) - P_{t,k}Z_{t,k}(r,s) \big)^2 \Big] = \frac{1}{\epsilon^2 n^2} \sum_i \mathbb{E}\Big[ \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top(r,s)^2 - P_{t,k}^2 Z_{t,k}(r,s)^2 \Big] \le \frac{2\max(u^4, 1)}{\epsilon^2 n} \to 0 $$
as $n \to \infty$, where the last inequality holds by Condition 5.

Proving Equation (22): It is sufficient to show that
$$ \frac{2\max(du^2, 1)}{\epsilon^2 n}\, \big\| \big( nZ_{t,k}P_{t,k} \big)^{-1} \big\|_{op} = \frac{2\max(du^2, 1)}{\epsilon^2 n^2 P_{t,k}}\, \big\| Z_{t,k}^{-1} \big\|_{op} \xrightarrow{P} 0. \quad (25) $$

We define the random variable $M_t^{(n)} = \mathbb{I}\big( \forall\, c \in \mathbb{R}^d,\ A_t(H_{t-1}^{(n)}, c) \in [f(n), 1-f(n)]^K \big)$, representing whether the conditional clipping condition is satisfied. Note that by our conditional clipping assumption, $M_t^{(n)} \xrightarrow{P} 1$ as $n \to \infty$. The left-hand side of (25) is equal to
$$ \frac{2\max(du^2, 1)}{\epsilon^2 n^2 P_{t,k}}\, \big\| Z_{t,k}^{-1} \big\|_{op} \big( M_t^{(n)} + (1 - M_t^{(n)}) \big) = \frac{2\max(du^2, 1)}{\epsilon^2 n^2 P_{t,k}}\, \big\| Z_{t,k}^{-1} \big\|_{op}\, M_t^{(n)} + o_p(1). \quad (26) $$

By our conditional clipping condition and Bayes' rule, we have that for all $c \in [-u, u]^d$,
$$ P\big( C_{t,i}=c \,\big|\, A_{t,i}=k, H_{t-1}^{(n)}, M_t^{(n)}=1 \big) = \frac{P\big( A_{t,i}=k \,\big|\, C_{t,i}=c, H_{t-1}^{(n)}, M_t^{(n)}=1 \big)\, P\big( C_{t,i}=c \,\big|\, H_{t-1}^{(n)}, M_t^{(n)}=1 \big)}{P\big( A_{t,i}=k \,\big|\, H_{t-1}^{(n)}, M_t^{(n)}=1 \big)} \ge \frac{f(n)}{1}\, P\big( C_{t,i}=c \,\big|\, H_{t-1}^{(n)}, M_t^{(n)}=1 \big). $$
Thus, we have that
$$ Z_{t,k} M_t^{(n)} = \mathbb{E}\big[ C_{t,i}C_{t,i}^\top \,\big|\, H_{t-1}^{(n)}, A_{t,i}=k, M_t^{(n)}=1 \big] M_t^{(n)} \succeq f(n)\, \mathbb{E}\big[ C_{t,i}C_{t,i}^\top \,\big|\, H_{t-1}^{(n)}, M_t^{(n)}=1 \big] M_t^{(n)} = f(n)\, \Sigma_t^{(n)} M_t^{(n)}. $$
Applying matrix inverses to both sides of the above inequality, we get that

$$ \lambda_{\max}\big( Z_{t,k}^{-1} M_t^{(n)} \big) \le \frac{1}{f(n)}\, \lambda_{\max}\Big( \big( \Sigma_t^{(n)} \big)^{-1} \Big) M_t^{(n)} \le \frac{1}{l\, f(n)}, \quad (27) $$
where the last inequality holds for a constant $l$ by Condition 5. Recall that $P_{t,k} = P(A_{t,i}=k \mid H_{t-1}^{(n)})$, so $P_{t,k} \mid (M_t^{(n)}=1) \ge f(n)$. Thus, (26) is bounded above by
$$ \frac{2\max(du^2, 1)}{\epsilon^2 n^2\, l\, f(n)^2} + o_p(1) \xrightarrow{P} 0, $$
where the limit holds because we assume that $f(n) = c$ for some $0 < c \le \frac12$.

Proving Equation (23): By Condition 5, $\big\| \frac1n \mathbf{C}_{t,k} \big\|_{\max} \le u$ and $\| Z_{t,k}P_{t,k} \|_{\max} \le u$. Thus, any continuous function of $\frac1n\mathbf{C}_{t,k}$ and $Z_{t,k}P_{t,k}$ will have compact support and thus be uniformly continuous. For any uniformly continuous function $f \colon \mathbb{R}^{d\times d} \to \mathbb{R}^{d\times d}$ and any $\epsilon > 0$, there exists a $\delta > 0$ such that for any matrices $A, B \in \mathbb{R}^{d\times d}$, whenever $\|A-B\|_{op} < \delta$, then $\|f(A)-f(B)\|_{op} < \epsilon$. Thus, for any $\epsilon > 0$, there exists some $\delta > 0$ such that
$$ P\bigg( \bigg\| \frac1n\sum_i \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top - Z_{t,k}P_{t,k} \bigg\|_{op} > \delta \bigg) \to 0 $$
implies
$$ P\bigg( \bigg\| f\bigg( \frac1n\sum_i \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top \bigg) - f\big( Z_{t,k}P_{t,k} \big) \bigg\|_{op} > \epsilon \bigg) \to 0. $$

Thus, by letting $f$ be the matrix square-root function,
$$ \bigg( \frac1n\sum_i \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top \bigg)^{1/2} - \big( Z_{t,k}P_{t,k} \big)^{1/2} \xrightarrow{P} 0. $$

We now want to show that for some constant $r > 0$, $P\big( \big\| Z_{t,k}^{-1} \frac{1}{P_{t,k}} \big\|_{op} > r \big) \to 0$, because this would imply that
$$ \bigg[ \bigg( \frac1n\sum_i \mathbb{I}(A_{t,i}=k)\,C_{t,i}C_{t,i}^\top \bigg)^{1/2} - \big( Z_{t,k}P_{t,k} \big)^{1/2} \bigg] \big( Z_{t,k}P_{t,k} \big)^{-1/2} \xrightarrow{P} 0. $$

Recall that $M_t^{(n)} = \mathbb{I}\big( \forall\, c \in \mathbb{R}^d,\ A_t(H_{t-1}^{(n)}, c) \in [f(n), 1-f(n)]^K \big)$ represents whether the conditional clipping condition is satisfied, and
$$ Z_{t,k}^{-1} = Z_{t,k}^{-1}\big( M_t^{(n)} + (1 - M_t^{(n)}) \big) = Z_{t,k}^{-1} M_t^{(n)} + o_p(1). $$
Thus it is sufficient to show that $P\big( \big\| Z_{t,k}^{-1} \frac{1}{P_{t,k}} M_t^{(n)} \big\|_{op} > r \big) \to 0$. Recall that by equation (27) we have that
$$ \lambda_{\max}\big( Z_{t,k}^{-1} M_t^{(n)} \big) \le \frac{1}{f(n)}\, \lambda_{\max}\Big( \big( \Sigma_t^{(n)} \big)^{-1} \Big) M_t^{(n)} \le \frac{1}{l\, f(n)}. $$
Also note that $P_{t,k} = P(A_{t,i}=k \mid H_{t-1}^{(n)})$, so $P_{t,k} \mid (M_t^{(n)}=1) \ge f(n)$. Thus we have that
$$ P\bigg( \bigg\| Z_{t,k}^{-1} \frac{1}{P_{t,k}} M_t^{(n)} \bigg\|_{op} > r \bigg) \le \mathbb{I}\bigg( \frac{1}{l\, f(n)^2} > r \bigg) = 0 $$
for $r > \frac{1}{l\, f(n)^2} = \frac{1}{l c^2}$, since we assume that $f(n) = c$ for some $0 < c \le \frac12$. $\blacksquare$
(n)  > (n) Proof of Theorem 4: We define Pt,k := P(At,i = k|Ht−1) and Zt,k := E Ct,iCt,i Ht−1,At,i = k. We also define  −1/2  Ct,0 Ct,iIAt,i=0 n  C−1/2 C  (n)  1/2 ˆ X  t,1 t,iIAt,i=1  D := Diagonal C , C , ..., C (β − β ) =   t,i t t,0 t,1 t,K−1 t t  .  i=1  .  −1/2 Ct,K−1 Ct,iIAt,i=K−1

(n) (n) (n) > D 2 We want to show that [D1 , D2 , ..., DT ] → N (0, σ IT Kd). By Lemma 2 and Slutsky’s Theorem, (n) (n) (n) > D 2 it sufficient to show that as n → ∞, [Q1 , Q2 , ..., QT ] → N (0, σ IT Kd) for  √ 1 −1/2  Zt,0 Ct,iIAt,i=0 nPt,0  −1/2  n √ 1  Zt,1 Ct,iIAt,i=1  (n) X  nPt,1  Q :=   t,i t  .  i=1  .    √ 1 −1/2 Zt,K−1Ct,iIAt,i=K−1 nPt,K−1 T Kd By Cramer Wold device, it is sufficient to show that for any b ∈ R with kbk2 = 1, where Kd b = [b1, b2, ..., bT ] for bt ∈ R , as n → ∞. T X > (n) D 2 bt Qt → N (0, σ ) (28) t=1 d We can further define for all t ∈ [1: T ], bt = [bt,0, bt,1, ..., bt,K−1] with bt,k ∈ R . Thus to show (28) it is equivalent to show that T K−1 n X X > 1 −1/2 X D 2 bt,k Z IAt,i=kCt,it,i → N (0, σ ) pnP t,k t=1 k=0 t,k i=1

We define $Y_{t,i}^{(n)} := \sum_{k=0}^{K-1} b_{t,k}^\top \frac{1}{\sqrt{n P_{t,k}}} Z_{t,k}^{-1/2}\, \mathbb{I}_{A_{t,i}=k}\, C_{t,i}\, \epsilon_{t,i}$. The sequence
\[
Y_{1,1}^{(n)}, Y_{1,2}^{(n)}, \dots, Y_{1,n}^{(n)}, \dots, Y_{T,1}^{(n)}, Y_{T,2}^{(n)}, \dots, Y_{T,n}^{(n)}
\]
is a martingale difference array with respect to the sequence of histories $\{H_{t-1}^{(n)}\}_{t=1}^T$ because
\[
\mathbb{E}[Y_{t,i}^{(n)} \mid H_{t-1}^{(n)}] = \mathbb{E}\Big[ \mathbb{E}[Y_{t,i}^{(n)} \mid H_{t-1}^{(n)}, A_{t,i}, C_{t,i}] \;\Big|\; H_{t-1}^{(n)} \Big] = 0
\]
for all $i \in [1{:}n]$ and all $t \in [1{:}T]$. We then apply the martingale central limit theorem of [8] to $Y_{t,i}^{(n)}$ to show the desired result (see the proof of Theorem 5 in Appendix B for the statement of the martingale CLT conditions). Note that the first condition (a) of the martingale CLT is already satisfied, as we just showed that the $Y_{t,i}^{(n)}$ form a martingale difference array with respect to $H_{t-1}^{(n)}$.

Condition (b): Conditional Variance.
\[
\sum_{t=1}^T \sum_{i=1}^n \mathbb{E}\big[ (Y_{t,i}^{(n)})^2 \mid H_{t-1}^{(n)} \big] = \sum_{t=1}^T \sum_{i=1}^n \mathbb{E}\left[ \bigg( \sum_{k=0}^{K-1} b_{t,k}^\top \frac{1}{\sqrt{n P_{t,k}}} Z_{t,k}^{-1/2}\, \mathbb{I}_{A_{t,i}=k}\, C_{t,i}\, \epsilon_{t,i} \bigg)^2 \,\middle|\, H_{t-1}^{(n)} \right]
\]
\[
= \sum_{t=1}^T \sum_{i=1}^n \sum_{k=0}^{K-1} \frac{1}{n P_{t,k}}\, b_{t,k}^\top Z_{t,k}^{-1/2}\, \mathbb{E}\big[ \mathbb{I}_{A_{t,i}=k}\, C_{t,i} C_{t,i}^\top \epsilon_{t,i}^2 \mid H_{t-1}^{(n)} \big] Z_{t,k}^{-1/2} b_{t,k},
\]
where the cross terms vanish because $\mathbb{I}_{A_{t,i}=k} \mathbb{I}_{A_{t,i}=k'} = 0$ for $k \neq k'$. By the law of iterated expectations (conditioning on $H_{t-1}^{(n)}, A_{t,i}, C_{t,i}$) and Condition 6,
\[
= \frac{1}{n} \sum_{t=1}^T \sum_{i=1}^n \sum_{k=0}^{K-1} \frac{1}{P_{t,k}}\, b_{t,k}^\top Z_{t,k}^{-1/2}\, \mathbb{E}\big[ \mathbb{I}_{A_{t,i}=k}\, C_{t,i} C_{t,i}^\top \mid H_{t-1}^{(n)} \big] Z_{t,k}^{-1/2} b_{t,k}\, \sigma^2
\]
\[
= \frac{1}{n} \sum_{t=1}^T \sum_{i=1}^n \sum_{k=0}^{K-1} \frac{1}{P_{t,k}}\, b_{t,k}^\top Z_{t,k}^{-1/2}\, \mathbb{E}\big[ C_{t,i} C_{t,i}^\top \mid H_{t-1}^{(n)}, A_{t,i} = k \big] P_{t,k}\, Z_{t,k}^{-1/2} b_{t,k}\, \sigma^2
\]
\[
= \frac{1}{n} \sum_{t=1}^T \sum_{i=1}^n \sum_{k=0}^{K-1} b_{t,k}^\top I_d\, b_{t,k}\, \sigma^2 = \sigma^2 \sum_{t=1}^T \sum_{k=0}^{K-1} b_{t,k}^\top b_{t,k} = \sigma^2.
\]

Condition (c): Lindeberg Condition. Let $\delta > 0$.
\[
\sum_{t=1}^T \sum_{i=1}^n \mathbb{E}\Big[ (Y_{t,i}^{(n)})^2\, \mathbb{I}_{((Y_{t,i}^{(n)})^2 > \delta^2)} \,\Big|\, H_{t-1}^{(n)} \Big] = \sum_{t=1}^T \sum_{i=1}^n \mathbb{E}\left[ \bigg( \sum_{k=0}^{K-1} b_{t,k}^\top \frac{1}{\sqrt{n P_{t,k}}} Z_{t,k}^{-1/2}\, \mathbb{I}_{A_{t,i}=k}\, C_{t,i}\, \epsilon_{t,i} \bigg)^2 \mathbb{I}_{((Y_{t,i}^{(n)})^2 > \delta^2)} \,\middle|\, H_{t-1}^{(n)} \right]
\]
\[
= \sum_{t=1}^T \sum_{i=1}^n \sum_{k=0}^{K-1} \frac{1}{n P_{t,k}}\, b_{t,k}^\top Z_{t,k}^{-1/2}\, \mathbb{E}\big[ \mathbb{I}_{A_{t,i}=k}\, C_{t,i} C_{t,i}^\top \epsilon_{t,i}^2\, \mathbb{I}_{((Y_{t,i}^{(n)})^2 > \delta^2)} \mid H_{t-1}^{(n)} \big] Z_{t,k}^{-1/2} b_{t,k}.
\]
It is sufficient to show that for any $t \in [1{:}T]$ and any $k \in [0{:}K-1]$ the following converges in probability to zero:
\[
\sum_{i=1}^n \frac{1}{n P_{t,k}}\, b_{t,k}^\top Z_{t,k}^{-1/2}\, \mathbb{E}\big[ \mathbb{I}_{A_{t,i}=k}\, C_{t,i} C_{t,i}^\top \epsilon_{t,i}^2\, \mathbb{I}_{((Y_{t,i}^{(n)})^2 > \delta^2)} \mid H_{t-1}^{(n)} \big] Z_{t,k}^{-1/2} b_{t,k}.
\]
Recall that $Y_{t,i}^{(n)} = \sum_{k=0}^{K-1} b_{t,k}^\top \frac{1}{\sqrt{n P_{t,k}}} Z_{t,k}^{-1/2}\, \mathbb{I}_{A_{t,i}=k}\, C_{t,i}\, \epsilon_{t,i}$; on the event $A_{t,i} = k$ only the $k^{\mathrm{th}}$ summand is non-zero, so conditioning on $A_{t,i} = k$, the above equals
\[
\frac{1}{n} \sum_{i=1}^n b_{t,k}^\top Z_{t,k}^{-1/2}\, \mathbb{E}\Big[ C_{t,i} C_{t,i}^\top \epsilon_{t,i}^2\, \mathbb{I}_{\left( \frac{1}{n P_{t,k}} b_{t,k}^\top Z_{t,k}^{-1/2} C_{t,i} C_{t,i}^\top Z_{t,k}^{-1/2} b_{t,k}\, \epsilon_{t,i}^2 > \delta^2 \right)} \,\Big|\, H_{t-1}^{(n)}, A_{t,i} = k \Big] Z_{t,k}^{-1/2} b_{t,k}.
\]
Since $C_{t,i} \in [-u, u]^d$, by the Gershgorin circle theorem we can bound the maximum eigenvalue of $C_{t,i} C_{t,i}^\top$ by some constant $a > 0$. Thus the above is
\[
\le \frac{a}{n} \sum_{i=1}^n b_{t,k}^\top Z_{t,k}^{-1} b_{t,k}\, \mathbb{E}\Big[ \epsilon_{t,i}^2\, \mathbb{I}_{\left( \frac{a}{n P_{t,k}} b_{t,k}^\top Z_{t,k}^{-1} b_{t,k}\, \epsilon_{t,i}^2 > \delta^2 \right)} \,\Big|\, H_{t-1}^{(n)}, A_{t,i} = k \Big].
\]

We define the random variable $M_t^{(n)} = \mathbb{I}_{\left(\forall c \in \mathbb{R}^d,\; A_t(H_{t-1}^{(n)}, c) \in [f(n),\, 1-f(n)]^K\right)}$, representing whether the conditional clipping condition is satisfied. Note that by our conditional clipping assumption, $M_t^{(n)} \xrightarrow{P} 1$ as $n \to \infty$. Thus the previous bound equals
\[
\frac{a}{n} \sum_{i=1}^n b_{t,k}^\top Z_{t,k}^{-1} b_{t,k}\, \mathbb{E}\Big[ \epsilon_{t,i}^2\, \mathbb{I}_{\left( \frac{a}{n P_{t,k}} b_{t,k}^\top Z_{t,k}^{-1} b_{t,k}\, \epsilon_{t,i}^2 > \delta^2 \right)} \,\Big|\, H_{t-1}^{(n)}, A_{t,i} = k \Big] \big( M_t^{(n)} + (1 - M_t^{(n)}) \big)
\]
\[
= \frac{a}{n} \sum_{i=1}^n b_{t,k}^\top Z_{t,k}^{-1} b_{t,k}\, \mathbb{E}\Big[ \epsilon_{t,i}^2\, \mathbb{I}_{\left( \frac{a}{n P_{t,k}} b_{t,k}^\top Z_{t,k}^{-1} b_{t,k}\, \epsilon_{t,i}^2 > \delta^2 \right)} \,\Big|\, H_{t-1}^{(n)}, A_{t,i} = k \Big] M_t^{(n)} + o_p(1). \tag{29}
\]
By equation (27), we have that

\[
\lambda_{\max}\big( Z_{t,k}^{-1} \big) \le \frac{1}{f(n)}\, \lambda_{\max}\big( (\Sigma_t^{(n)})^{-1} \big) \le \frac{1}{l\, f(n)}.
\]

Recall that $P_{t,k} = \mathbb{P}(A_{t,i} = k \mid H_{t-1}^{(n)})$, so $P_{t,k} \mid (M_t^{(n)} = 1) \ge f(n)$. Thus we have that equation (29) is upper bounded by the following:
\[
\le \frac{1}{n} \sum_{i=1}^n \frac{b_{t,k}^\top b_{t,k}}{l\, f(n)}\, \mathbb{E}\Big[ \epsilon_{t,i}^2\, \mathbb{I}_{\left( \frac{a}{n f(n)} \frac{b_{t,k}^\top b_{t,k}}{l\, f(n)}\, \epsilon_{t,i}^2 > \delta^2 \right)} \,\Big|\, H_{t-1}^{(n)}, A_{t,i} = k \Big] + o_p(1)
\]
\[
= \frac{1}{n} \sum_{i=1}^n \frac{b_{t,k}^\top b_{t,k}}{l\, f(n)}\, \mathbb{E}\Big[ \epsilon_{t,i}^2\, \mathbb{I}_{\left( \epsilon_{t,i}^2 > \delta^2 \frac{l\, n f(n)^2}{a\, b_{t,k}^\top b_{t,k}} \right)} \,\Big|\, H_{t-1}^{(n)}, A_{t,i} = k \Big] + o_p(1).
\]
It is sufficient to show that
\[
\lim_{n \to \infty} \max_{i \in [1:n]} \frac{1}{f(n)}\, \mathbb{E}\Big[ \epsilon_{t,i}^2\, \mathbb{I}_{\left( \epsilon_{t,i}^2 > \delta^2 \frac{l\, n f(n)^2}{a\, b_{t,k}^\top b_{t,k}} \right)} \,\Big|\, H_{t-1}^{(n)}, A_{t,i} = k \Big] = 0. \tag{30}
\]

By Condition 6, we have that for all $n \ge 1$, $\max_{t \in [1:T],\, i \in [1:n]} \mathbb{E}[\varphi(\epsilon_{t,i}^2) \mid H_{t-1}^{(n)}, A_{t,i} = k] < M$. Since we assume that $\lim_{x \to \infty} \frac{\varphi(x)}{x} = \infty$, for all $m \ge 1$ there exists a $b_m$ such that $\varphi(x) \ge m M x$ for all $x \ge b_m$. So, for all $n, t, i$,

\[
M \ge \mathbb{E}\big[ \varphi(\epsilon_{t,i}^2) \mid H_{t-1}^{(n)}, A_{t,i} = k \big] \ge \mathbb{E}\big[ \varphi(\epsilon_{t,i}^2)\, \mathbb{I}_{(\epsilon_{t,i}^2 \ge b_m)} \mid H_{t-1}^{(n)}, A_{t,i} = k \big]
\]

2 (n) ≥ mM [ 2 |H ,At,i = k] E t,iI(t,i≥bm) t−1 2 (n) 1 Thus, max [ 2 |H ,At,i = k] ≤ ; so i∈[1 : n] E t,iI(t,i≥bm) t−1 m 2 (n) limm→∞ max [ 2 |H ,At,i = k] = 0. Since by our conditional clip- i∈[1 : n] E t,iI(t,i≥bm) t−1 1 2 ping assumption, f(n) = c for some 0 < c ≤ 2 thus nf(n) → ∞. So equation (30) holds.

Corollary 5 (Asymptotic Normality of the Batched OLS Estimator for the Margin with Context Statistic). Assume the same conditions as Theorem 4. For any two arms $x, y \in [0{:}K-1]$ and for all $t \in [1{:}T]$, we have the BOLS estimator of $\Delta_{t,x-y} := \beta_{t,x} - \beta_{t,y}$. We show that as $n \to \infty$,
\[
\begin{bmatrix} \big( C_{1,x}^{-1} + C_{1,y}^{-1} \big)^{-1/2} \big( \hat{\Delta}_{1,x-y}^{\mathrm{BOLS}} - \Delta_{1,x-y} \big) \\ \big( C_{2,x}^{-1} + C_{2,y}^{-1} \big)^{-1/2} \big( \hat{\Delta}_{2,x-y}^{\mathrm{BOLS}} - \Delta_{2,x-y} \big) \\ \vdots \\ \big( C_{T,x}^{-1} + C_{T,y}^{-1} \big)^{-1/2} \big( \hat{\Delta}_{T,x-y}^{\mathrm{BOLS}} - \Delta_{T,x-y} \big) \end{bmatrix} \xrightarrow{D} \mathcal{N}(0, \sigma^2 I_{Td}),
\]
where
\[
\hat{\Delta}_{t,x-y}^{\mathrm{BOLS}} = C_{t,x}^{-1} \sum_{i=1}^n \mathbb{I}_{(A_{t,i}=x)}\, C_{t,i} R_{t,i} - C_{t,y}^{-1} \sum_{i=1}^n \mathbb{I}_{(A_{t,i}=y)}\, C_{t,i} R_{t,i}.
\]
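For intuition, the per-batch margin estimator and its normalization can be computed directly from data. The following is a minimal Python sketch (our own illustration, not code from the paper; the function name `bols_margin` and the simulated design are ours): it runs per-arm OLS within one batch and returns the margin estimate together with the matrix $(C_{t,x}^{-1} + C_{t,y}^{-1})^{-1/2}$ used to studentize it. With noiseless rewards, the true margin is recovered exactly.

```python
import numpy as np

def bols_margin(C, A, R, x, y):
    """Batched OLS margin estimator for one batch.

    C : (n, d) array of context vectors C_{t,i}
    A : (n,) array of arm indices A_{t,i}
    R : (n,) array of rewards R_{t,i}
    Returns (margin_hat, norm), where norm = (Cx^{-1} + Cy^{-1})^{-1/2},
    so that norm @ (margin_hat - margin) is approximately N(0, sigma^2 I_d).
    """
    def arm_stats(k):
        mask = (A == k)
        Ck = C[mask].T @ C[mask]                            # Gram matrix for arm k
        beta_k = np.linalg.solve(Ck, C[mask].T @ R[mask])   # per-arm OLS coefficients
        return Ck, beta_k

    Cx, bx = arm_stats(x)
    Cy, by = arm_stats(y)
    margin_hat = bx - by
    S = np.linalg.inv(Cx) + np.linalg.inv(Cy)
    # Inverse matrix square root of S via its eigendecomposition.
    w, V = np.linalg.eigh(S)
    norm = V @ np.diag(w ** -0.5) @ V.T
    return margin_hat, norm

# Noiseless sanity check: the estimator recovers beta_x - beta_y exactly.
rng = np.random.default_rng(0)
n, d = 200, 3
C = rng.normal(size=(n, d))
A = rng.integers(0, 2, size=n)                  # two arms, 0 and 1
beta = {0: np.array([1.0, -2.0, 0.5]), 1: np.array([0.0, 1.0, 1.0])}
R = np.array([C[i] @ beta[A[i]] for i in range(n)])
margin_hat, norm = bols_margin(C, A, R, x=1, y=0)
```

With noisy rewards, `norm @ (margin_hat - true_margin)` is the per-batch statistic whose joint asymptotic normality across batches Corollary 5 establishes.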

Proof: By the Cramer–Wold device, it is sufficient to show that for any fixed vector $d \in \mathbb{R}^{Td}$ with $\|d\|_2 = 1$, where $d = [d_1, d_2, \dots, d_T]$ for $d_t \in \mathbb{R}^d$, as $n \to \infty$,
\[
\sum_{t=1}^T d_t^\top \big( C_{t,x}^{-1} + C_{t,y}^{-1} \big)^{-1/2} \big( \hat{\Delta}_{t,x-y}^{\mathrm{BOLS}} - \Delta_{t,x-y} \big) \xrightarrow{D} \mathcal{N}(0, \sigma^2).
\]

\[
\sum_{t=1}^T d_t^\top \big( C_{t,x}^{-1} + C_{t,y}^{-1} \big)^{-1/2} \big( \hat{\Delta}_{t,x-y}^{\mathrm{BOLS}} - \Delta_{t,x-y} \big) = \sum_{t=1}^T d_t^\top \big( C_{t,x}^{-1} + C_{t,y}^{-1} \big)^{-1/2} \bigg( C_{t,x}^{-1} \sum_{i=1}^n \mathbb{I}_{(A_{t,i}=x)}\, C_{t,i}\, \epsilon_{t,i} - C_{t,y}^{-1} \sum_{i=1}^n \mathbb{I}_{(A_{t,i}=y)}\, C_{t,i}\, \epsilon_{t,i} \bigg).
\]
By Lemma 2, as $n \to \infty$, $\frac{1}{n P_{t,x}} Z_{t,x}^{-1} C_{t,x} \xrightarrow{P} I_d$ and $\frac{1}{n P_{t,y}} Z_{t,y}^{-1} C_{t,y} \xrightarrow{P} I_d$, so by Slutsky's Theorem it is sufficient to show that as $n \to \infty$,
\[
\sum_{t=1}^T d_t^\top \big( C_{t,x}^{-1} + C_{t,y}^{-1} \big)^{-1/2} \bigg( \frac{1}{n P_{t,x}} Z_{t,x}^{-1} \sum_{i=1}^n \mathbb{I}_{(A_{t,i}=x)}\, C_{t,i}\, \epsilon_{t,i} - \frac{1}{n P_{t,y}} Z_{t,y}^{-1} \sum_{i=1}^n \mathbb{I}_{(A_{t,i}=y)}\, C_{t,i}\, \epsilon_{t,i} \bigg) \xrightarrow{D} \mathcal{N}(0, \sigma^2).
\]
By Lemma 2 and the continuous mapping theorem, $n P_{t,x} Z_{t,x} C_{t,x}^{-1} \xrightarrow{P} I_d$ and $n P_{t,y} Z_{t,y} C_{t,y}^{-1} \xrightarrow{P} I_d$. So by Slutsky's Theorem,
\[
\bigg( \frac{1}{P_{t,x}} Z_{t,x}^{-1} + \frac{1}{P_{t,y}} Z_{t,y}^{-1} \bigg)^{-1/2} \big( n C_{t,x}^{-1} + n C_{t,y}^{-1} \big)^{1/2} \xrightarrow{P} I_d.
\]
So, returning to our CLT, by Slutsky's Theorem, it is sufficient to show that as $n \to \infty$,
\[
\sum_{t=1}^T d_t^\top \bigg( \frac{1}{n P_{t,x}} Z_{t,x}^{-1} + \frac{1}{n P_{t,y}} Z_{t,y}^{-1} \bigg)^{-1/2} \frac{1}{n P_{t,x}} Z_{t,x}^{-1} \sum_{i=1}^n \mathbb{I}_{(A_{t,i}=x)}\, C_{t,i}\, \epsilon_{t,i} - \sum_{t=1}^T d_t^\top \bigg( \frac{1}{n P_{t,x}} Z_{t,x}^{-1} + \frac{1}{n P_{t,y}} Z_{t,y}^{-1} \bigg)^{-1/2} \frac{1}{n P_{t,y}} Z_{t,y}^{-1} \sum_{i=1}^n \mathbb{I}_{(A_{t,i}=y)}\, C_{t,i}\, \epsilon_{t,i} \xrightarrow{D} \mathcal{N}(0, \sigma^2).
\]
The above sum equals the following:
\[
= \sum_{t=1}^T d_t^\top \bigg( \frac{1}{n P_{t,x}} Z_{t,x}^{-1} + \frac{1}{n P_{t,y}} Z_{t,y}^{-1} \bigg)^{-1/2} \frac{1}{\sqrt{n P_{t,x}}} Z_{t,x}^{-1/2} \frac{1}{\sqrt{n P_{t,x}}} Z_{t,x}^{-1/2} \sum_{i=1}^n \mathbb{I}_{(A_{t,i}=x)}\, C_{t,i}\, \epsilon_{t,i}
\]
\[
- \sum_{t=1}^T d_t^\top \bigg( \frac{1}{n P_{t,x}} Z_{t,x}^{-1} + \frac{1}{n P_{t,y}} Z_{t,y}^{-1} \bigg)^{-1/2} \frac{1}{\sqrt{n P_{t,y}}} Z_{t,y}^{-1/2} \frac{1}{\sqrt{n P_{t,y}}} Z_{t,y}^{-1/2} \sum_{i=1}^n \mathbb{I}_{(A_{t,i}=y)}\, C_{t,i}\, \epsilon_{t,i}.
\]

Asymptotic normality holds by the same martingale CLT as we used in the proof of Theorem 4. The only difference is that we adjust our $b_{t,k}$ vector from Theorem 4 to the following:
\[
b_{t,k} := \begin{cases} 0 & \text{if } k \notin \{x, y\} \\[4pt] \Big[ d_t^\top \big( \frac{1}{n P_{t,x}} Z_{t,x}^{-1} + \frac{1}{n P_{t,y}} Z_{t,y}^{-1} \big)^{-1/2} \frac{1}{\sqrt{n P_{t,x}}} Z_{t,x}^{-1/2} \Big]^\top & \text{if } k = x \\[4pt] -\Big[ d_t^\top \big( \frac{1}{n P_{t,x}} Z_{t,x}^{-1} + \frac{1}{n P_{t,y}} Z_{t,y}^{-1} \big)^{-1/2} \frac{1}{\sqrt{n P_{t,y}}} Z_{t,y}^{-1/2} \Big]^\top & \text{if } k = y \end{cases}
\]

The proof still goes through with this adjustment because for all $k \in [0{:}K-1]$: (i) $b_{t,k}$ is measurable with respect to $H_{t-1}^{(n)}$; (ii) $\sum_{t=1}^T \sum_{k=0}^{K-1} b_{t,k}^\top b_{t,k} = \sum_{t=1}^T d_t^\top d_t = 1$; and (iii) $\frac{l\, n f(n)^2}{a\, b_{t,k}^\top b_{t,k}} \to \infty$ still holds because $b_{t,k}^\top b_{t,k}$ is bounded above by one.

F W-Decorrelated Estimator [6]

To better understand why the W-decorrelated estimator has relatively low power, but is still able to guarantee asymptotic normality, we now investigate the form of the W-decorrelated estimator in the two-arm bandit setting.

F.1 Decorrelation Approach

We now assume we are in the unbatched setting (i.e., a batch size of one), as the W-decorrelated estimator was developed for this setting; however, these results easily translate to the batched setting. We now let $n$ index the total number of samples (previously this was $nT$) and examine asymptotics as $n \to \infty$. We assume the following model:
\[
R_n = X_n \beta + \epsilon_n,
\]
where $R_n, \epsilon_n \in \mathbb{R}^n$, $X_n \in \mathbb{R}^{n \times p}$, and $\beta \in \mathbb{R}^p$. The W-decorrelated OLS estimator is defined as follows:
\[
\hat{\beta}^d = \hat{\beta}^{\mathrm{OLS}} + W_n \big( R_n - X_n \hat{\beta}^{\mathrm{OLS}} \big).
\]
With this definition we have that
\[
\hat{\beta}^d - \beta = \hat{\beta}^{\mathrm{OLS}} + W_n \big( R_n - X_n \hat{\beta}^{\mathrm{OLS}} \big) - \beta = \hat{\beta}^{\mathrm{OLS}} + W_n (X_n \beta + \epsilon_n) - W_n X_n \hat{\beta}^{\mathrm{OLS}} - \beta = (I_p - W_n X_n)\big( \hat{\beta}^{\mathrm{OLS}} - \beta \big) + W_n \epsilon_n.
\]
Note that if $\mathbb{E}[W_n \epsilon_n] = \mathbb{E}\big[ \sum_{i=1}^n W_i \epsilon_i \big] = 0$ (where $W_i$ is the $i^{\mathrm{th}}$ column of $W_n$), then $\mathbb{E}\big[ (I_p - W_n X_n)(\hat{\beta}^{\mathrm{OLS}} - \beta) \big]$ would be the bias of the estimator. We assume $\{\epsilon_i\}$ is a martingale difference sequence with respect to the filtration $\{\mathcal{G}_i\}_{i=1}^n$. Thus, if we constrain $W_i$ to be $\mathcal{G}_{i-1}$-measurable,
\[
\mathbb{E}[W_n \epsilon_n] = \mathbb{E}\bigg[ \sum_{i=1}^n W_i \epsilon_i \bigg] = \sum_{i=1}^n \mathbb{E}\big[ \mathbb{E}[W_i \epsilon_i \mid \mathcal{G}_{i-1}] \big] = \sum_{i=1}^n \mathbb{E}\big[ W_i\, \mathbb{E}[\epsilon_i \mid \mathcal{G}_{i-1}] \big] = 0.
\]
Trading off Bias and Variance: While decreasing $\mathbb{E}\big[ (I_p - W_n X_n)(\hat{\beta}^{\mathrm{OLS}} - \beta) \big]$ will decrease the bias, making $W_n$ larger in norm will increase the variance. So the trade-off between bias and variance can be adjusted with different values of $\lambda$ in the following optimization problem:
\[
\|I_p - W_n X_n\|_F^2 + \lambda \|W_n\|_F^2 = \|I_p - W_n X_n\|_F^2 + \lambda\, \mathrm{Tr}(W_n W_n^\top).
\]

Optimizing for $W_n$: The authors propose to optimize for $W_n$ in a recursive fashion, so that the $i^{\mathrm{th}}$ column $W_i$ only depends on $\{X_j\}_{j \le i} \cup \{\epsilon_j\}_{j \le i-1}$ (so $\sum_{i=1}^n \mathbb{E}[W_i \epsilon_i] = 0$). We let $W_0 = 0$, $X_0 = 0$, and recursively define the matrix $W_n := [W_{n-1}\; W_n]$, appending the new column
\[
W_n = \operatorname*{argmin}_{W \in \mathbb{R}^p}\; \|I_p - W_{n-1} X_{n-1} - W X_n^\top\|_F^2 + \lambda \|W\|_2^2,
\]
where $W_{n-1} = [W_1; W_2; \dots; W_{n-1}] \in \mathbb{R}^{p \times (n-1)}$ and $X_{n-1} = [X_1; X_2; \dots; X_{n-1}]^\top \in \mathbb{R}^{(n-1) \times p}$. Now, let us find the closed-form solution for each step of this minimization:
\[
\frac{d}{dW} \Big( \|I_p - W_{n-1} X_{n-1} - W X_n^\top\|_F^2 + \lambda \|W\|_2^2 \Big) = 2 \big( I_p - W_{n-1} X_{n-1} - W X_n^\top \big)(-X_n) + 2 \lambda W.
\]
The Hessian is positive definite,
\[
\frac{d^2}{dW\, dW^\top} \Big( \|I_p - W_{n-1} X_{n-1} - W X_n^\top\|_F^2 + \lambda \|W\|_2^2 \Big) = 2 X_n X_n^\top + 2 \lambda I_p \succ 0,
\]
so we can find the minimizing $W$ by setting the first derivative to zero:

\[
0 = 2 \big( I_p - W_{n-1} X_{n-1} - W X_n^\top \big)(-X_n) + 2 \lambda W
\]
\[
\big( I_p - W_{n-1} X_{n-1} - W X_n^\top \big) X_n = \lambda W
\]
\[
(I_p - W_{n-1} X_{n-1}) X_n = \lambda W + W X_n^\top X_n = \big( \lambda + \|X_n\|_2^2 \big) W
\]

\[
W^* = (I_p - W_{n-1} X_{n-1})\, \frac{X_n}{\lambda + \|X_n\|_2^2}
\]
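The recursive construction is straightforward to implement. Below is a minimal Python sketch (our own, with invented names such as `w_decorrelated_columns`; not code from [6]): it builds the columns of $W_n$ one at a time while maintaining the residual $I_p - W_{i-1} X_{i-1}$, and the small example verifies that each column satisfies the first-order condition $(I_p - W_{i-1}X_{i-1} - W_i X_i^\top) X_i = \lambda W_i$.

```python
import numpy as np

def w_decorrelated_columns(X, lam):
    """Build W_n column by column via W_i = (I_p - W_{i-1} X_{i-1}) X_i / (lam + ||X_i||^2).

    X : (n, p) design matrix with rows X_i;  returns W of shape (p, n).
    """
    n, p = X.shape
    W = np.zeros((p, n))
    residual = np.eye(p)                  # holds I_p - W_{i-1} X_{i-1}
    for i in range(n):
        x = X[i]
        W[:, i] = residual @ x / (lam + x @ x)
        residual = residual - np.outer(W[:, i], x)   # subtract the new W_i X_i^T term
    return W

def w_decorrelated_estimator(X, R, lam):
    """beta_hat_d = beta_hat_OLS + W_n (R - X beta_hat_OLS)."""
    beta_ols, *_ = np.linalg.lstsq(X, R, rcond=None)
    W = w_decorrelated_columns(X, lam)
    return beta_ols + W @ (R - X @ beta_ols)

# Small example used below to check the first-order optimality condition.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
lam = 2.0
W = w_decorrelated_columns(X, lam)
```

Because column $W_i$ uses only $X_1, \dots, X_i$, it is $\mathcal{G}_{i-1}$-measurable whenever $X_i$ is non-anticipating, which is exactly what makes $\mathbb{E}[W_n \epsilon_n] = 0$ above.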

Proposition 3 (W-decorrelated estimator and time discounting in the two-arm bandit setting). Suppose we have a 2-arm bandit. $A_i$ is an indicator that equals 1 if arm 1 is chosen for the $i^{\mathrm{th}}$ sample, and 0 if arm 0 is chosen. We define $X_i := [1 - A_i, A_i] \in \mathbb{R}^2$. We assume the following model of rewards:
\[
R_i = X_i^\top \beta + \epsilon_i = A_i \beta_1 + (1 - A_i) \beta_0 + \epsilon_i.
\]
We further assume that $\{\epsilon_i\}_{i=1}^n$ is a martingale difference sequence with respect to the filtration $\{\mathcal{G}_i\}_{i=1}^n$. We also assume that the $X_i$ are non-anticipating with respect to $\{\mathcal{G}_i\}_{i=1}^n$. Note the W-decorrelated estimator:
\[
\hat{\beta}^d = \hat{\beta}^{\mathrm{OLS}} + W_n \big( R_n - X_n \hat{\beta}^{\mathrm{OLS}} \big).
\]
We show that for $W_n = [W_1; W_2; \dots; W_n] \in \mathbb{R}^{p \times n}$ and choice of constant $\lambda$,
\[
W_i = \begin{bmatrix} \frac{1}{\lambda+1} (1 - A_i) \big( 1 - \frac{1}{\lambda+1} \big)^{N_{0,i-1}} \\[4pt] \frac{1}{\lambda+1}\, A_i \big( 1 - \frac{1}{\lambda+1} \big)^{N_{1,i-1}} \end{bmatrix} \in \mathbb{R}^2,
\]
where $N_{1,i-1} := \sum_{j=1}^{i-1} A_j$ and $N_{0,i-1} := \sum_{j=1}^{i-1} (1 - A_j)$.

Moreover, we show that the W-decorrelated estimator for the mean of arm 1, $\beta_1$, is as follows:
\[
\hat{\beta}_1^d = \bigg( 1 - \sum_{i=1}^n A_i \frac{1}{\lambda+1} \Big( 1 - \frac{1}{\lambda+1} \Big)^{N_{1,i-1}} \bigg) \hat{\beta}_1^{\mathrm{OLS}} + \sum_{i=1}^n A_i R_i \cdot \frac{1}{\lambda+1} \Big( 1 - \frac{1}{\lambda+1} \Big)^{N_{1,i-1}}
\]

where $\hat{\beta}_1^{\mathrm{OLS}} = \frac{\sum_{i=1}^n A_i R_i}{N_{1,n}}$ for $N_{1,n} = \sum_{i=1}^n A_i$. Since [6] require that $\lambda \ge 1$ for their CLT results to hold, the W-decorrelated estimator down-weights samples drawn later in the study and up-weights earlier samples.

Proof: Recall the formula for $W_i$:
\[
W_i = (I_p - W_{i-1} X_{i-1})\, \frac{X_i}{\lambda + \|X_i\|_2^2}.
\]
We let $W_i = [W_{0,i}, W_{1,i}]^\top$. Since $X_i = [1 - A_i, A_i]$ with $A_i \in \{0, 1\}$, we have $\|X_i\|_2^2 = 1$, so the denominator is always $\lambda + 1$. For notational simplicity, we let $r = \frac{1}{\lambda+1}$. We now solve for $W_{1,n}$:

\[
W_{1,1} = (1 - 0) \cdot r A_1 = r A_1
\]

\[
W_{1,2} = (1 - W_{1,1} A_1) \cdot r A_2 = (1 - r A_1)\, r A_2
\]
\[
W_{1,3} = \Big( 1 - \sum_{i=1}^2 W_{1,i} A_i \Big) \cdot r A_3 = \big( 1 - r A_1 - (1 - r A_1) r A_2 \big) \cdot r A_3 = (1 - r A_1)(1 - r A_2) \cdot r A_3
\]
\[
W_{1,4} = \Big( 1 - \sum_{i=1}^3 W_{1,i} A_i \Big) \cdot r A_4 = \big( 1 - r A_1 - (1 - r A_1) r A_2 - (1 - r A_1)(1 - r A_2) r A_3 \big) \cdot r A_4 = (1 - r A_1)(1 - r A_2)(1 - r A_3) \cdot r A_4
\]
We have that for arbitrary $n$,

\[
W_{1,n} = \Big( 1 - \sum_{i=1}^{n-1} W_{1,i} A_i \Big) \cdot r A_n = r A_n \prod_{i=1}^{n-1} (1 - r A_i) = r A_n (1 - r)^{\sum_{i=1}^{n-1} A_i} = r A_n (1 - r)^{N_{1,n-1}}.
\]
By symmetry, we have that
\[
W_{0,n} = \Big( 1 - \sum_{i=1}^{n-1} W_{0,i} (1 - A_i) \Big) \cdot r (1 - A_n) = r (1 - A_n)(1 - r)^{N_{0,n-1}}.
\]

Note the W-decorrelated estimator for $\beta_1$:
\[
\hat{\beta}_1^d = \hat{\beta}_1^{\mathrm{OLS}} + \sum_{i=1}^n A_i \big( R_i - \hat{\beta}_1^{\mathrm{OLS}} \big)\, r (1 - r)^{N_{1,i-1}} = \bigg( 1 - \sum_{i=1}^n A_i\, r (1 - r)^{N_{1,i-1}} \bigg) \hat{\beta}_1^{\mathrm{OLS}} + \sum_{i=1}^n A_i R_i \cdot r (1 - r)^{N_{1,i-1}}.
\]
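The closed form above is easy to check numerically against the generic column recursion $W_i = (I_2 - W_{i-1} X_{i-1}) X_i / (\lambda + 1)$. A minimal Python sketch (our own illustration, with an arbitrary arm sequence and $\lambda = 2$ chosen for the example):

```python
import numpy as np

lam = 2.0
r = 1.0 / (lam + 1.0)
A = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1])          # arbitrary arm-1 indicators A_i
X = np.stack([1 - A, A], axis=1).astype(float)      # rows X_i = [1 - A_i, A_i]

# Generic recursion: W_i = (I_2 - W_{i-1} X_{i-1}) X_i / (lam + ||X_i||^2), with ||X_i||^2 = 1.
n = len(A)
W = np.zeros((2, n))
residual = np.eye(2)
for i in range(n):
    W[:, i] = residual @ X[i] / (lam + 1.0)
    residual -= np.outer(W[:, i], X[i])

# Closed form: W_{1,i} = r A_i (1-r)^{N_{1,i-1}} and W_{0,i} = r (1-A_i) (1-r)^{N_{0,i-1}}.
# The weight of each arm-1 sample shrinks as more arm-1 pulls accrue, i.e.,
# later samples are down-weighted, matching the discussion above.
N1 = np.concatenate([[0], np.cumsum(A)[:-1]])       # N_{1,i-1}
N0 = np.concatenate([[0], np.cumsum(1 - A)[:-1]])   # N_{0,i-1}
W_closed = np.stack([r * (1 - A) * (1 - r) ** N0, r * A * (1 - r) ** N1])
```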
