The Big Data Bootstrap

Ariel Kleiner [email protected]
Ameet Talwalkar [email protected]
Purnamrita Sarkar [email protected]
Computer Science Division, University of California, Berkeley, CA 94720, USA

Michael I. Jordan [email protected]
Computer Science Division and Department of Statistics, University of California, Berkeley, CA 94720, USA

Abstract

The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large datasets, the computation of bootstrap-based quantities can be prohibitively demanding. As an alternative, we present the Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to obtain a robust, computationally efficient means of assessing estimator quality. BLB is well suited to modern parallel and distributed computing architectures and retains the generic applicability, statistical efficiency, and favorable theoretical properties of the bootstrap. We provide the results of an extensive empirical and theoretical investigation of BLB's behavior, including a study of its statistical correctness, its large-scale implementation and performance, selection of hyperparameters, and performance on real data.

1. Introduction

Assessing the quality of estimates based upon finite data is a task of fundamental importance in data analysis. For example, when estimating a vector of model parameters given a training dataset, it is useful to be able to quantify the uncertainty in that estimate (e.g., via a confidence region), its bias, and its risk. Such quality assessments provide far more information than a simple point estimate itself and can be used to improve human interpretation of inferential outputs, perform hypothesis testing, do bias correction, make more efficient use of available resources (e.g., by ceasing to process data when the confidence region is sufficiently small), perform active learning, and do feature selection, among many more potential uses.

Accurate assessment of estimate quality has been a longstanding concern in statistics. A great deal of classical work in this vein has proceeded via asymptotic analysis, which relies on deep study of particular classes of estimators in particular settings (Politis et al., 1999). While this approach ensures asymptotic correctness and allows analytic computation, its applicability is limited to cases in which the relevant asymptotic analysis is tractable and has actually been performed. In contrast, recent decades have seen greater focus on more automatic methods, which generally require significantly less analysis, at the expense of doing more computation. The bootstrap (Efron, 1979; Efron & Tibshirani, 1993) is perhaps the best known and most widely used among these methods, due to its simplicity, generic applicability, and automatic nature. Efforts to ameliorate statistical shortcomings of the bootstrap in turn led to the development of related methods such as the m out of n bootstrap and subsampling (Bickel et al., 1997; Politis et al., 1999).

Despite the computational demands of this more automatic methodology, advancements have been driven primarily by a desire to remedy shortcomings in statistical correctness. However, with the advent of increasingly large datasets and diverse sets of often complex and exploratory queries, computational considerations and automation (i.e., lack of reliance on deep analysis of the specific estimator and setting of interest) are increasingly important. Even as the amount of available data grows, the number of parameters to be estimated and the number of potential sources of bias often also grow, leading to a need to be able to tractably assess estimator quality in the setting of large data.

Thus, unlike previous work on estimator quality assessment, here we directly address computational costs and scalability in addition to statistical correctness and automation. Indeed, existing methods have significant drawbacks with respect to one or more of these desiderata. The bootstrap, despite its strong statistical properties, has high—even prohibitive—computational costs; thus, its usefulness is severely blunted by the large datasets increasingly encountered in practice. We also find that its relatives, such as the m out of n bootstrap and subsampling, can have lesser computational costs, as expected, but are generally not robust to specification of hyperparameters (such as the number of subsampled data points) and are also somewhat less automatic due to their need to explicitly utilize theoretically derived estimator convergence rates.

Motivated by the need for an automatic, accurate means of assessing estimator quality that is scalable to large datasets, we present a new procedure, the Bag of Little Bootstraps (BLB), which functions by combining the results of bootstrapping multiple small subsets of a larger original dataset. BLB has a significantly more favorable computational profile than the bootstrap, as it only requires repeated computation of the estimator under consideration on quantities of data that can be much smaller than the original dataset. Hence, BLB is well suited to implementation on modern distributed and parallel computing architectures. Our procedure maintains the bootstrap's automatic and generic applicability, favorable statistical properties, and simplicity of implementation. Finally, as we show empirically, BLB is consistently more robust than alternatives such as the m out of n bootstrap and subsampling.

We next formalize our statistical setting and notation in Section 2, discuss relevant prior work in Section 3, and present BLB in full detail in Section 4. Subsequently, we present an empirical and theoretical study of statistical correctness in Section 5, an exploration of scalability including large-scale experiments on a distributed computing platform in Section 6, practical methods for automatically selecting hyperparameters in Section 7, and assessments on real data in Section 8.

2. Setting and Notation

We assume that we observe a sample X_1, ..., X_n ∈ X drawn i.i.d. from some true (unknown) underlying distribution P ∈ P. Based only on this observed data, we obtain an estimate θ̂_n = θ(P_n) ∈ Θ, where P_n = n^{-1} ∑_{i=1}^{n} δ_{X_i} is the empirical distribution of X_1, ..., X_n. The true (unknown) population value to be estimated is θ(P). For example, θ̂_n might estimate a measure of correlation, the parameters of a regressor, or the prediction accuracy of a trained classification model. Noting that θ̂_n is a random quantity because it is based on n random observations, we define Q_n(P) ∈ Q as the true underlying distribution of θ̂_n, which is determined by both P and the form of the mapping θ. Our end goal is the computation of some metric ξ(Q_n(P)), for ξ : Q → Ξ with Ξ a vector space, which informatively summarizes Q_n(P). For instance, ξ might compute a confidence region, a standard error, or a bias. In practice, we do not have direct knowledge of P or Q_n(P), and so we must estimate ξ(Q_n(P)) itself based only on the observed data and knowledge of the form of the mapping θ.

3. Related Work

The bootstrap (Efron, 1979; Efron & Tibshirani, 1993) provides an automatic and widely applicable means of quantifying estimator quality: it simply uses the data-driven plugin approximation ξ(Q_n(P)) ≈ ξ(Q_n(P_n)). While ξ(Q_n(P_n)) cannot be computed exactly in most cases, it is generally amenable to straightforward Monte Carlo approximation as follows: repeatedly resample n points i.i.d. from P_n, compute the estimate on each resample, form the empirical distribution Q_n* of the computed estimates, and approximate ξ(Q_n(P)) ≈ ξ(Q_n*). Though conceptually simple and powerful, this procedure requires repeated estimator computation on resamples having size comparable to that of the original dataset. Therefore, if the original dataset is large, then this computation can be costly.
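To make the Monte Carlo procedure concrete, the following minimal Python/NumPy sketch implements the plugin approximation just described; the estimator and quality callables are hypothetical stand-ins for θ and ξ and are not specified by the paper.

```python
import numpy as np

def bootstrap_quality(X, estimator, quality, num_resamples, seed=None):
    """Monte Carlo approximation of xi(Q_n(P_n)): a minimal sketch.

    `estimator` and `quality` are hypothetical callables standing in for
    theta and xi; neither is specified by the paper.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    estimates = []
    for _ in range(num_resamples):
        # Resample n points i.i.d. from the empirical distribution P_n.
        idx = rng.integers(0, n, size=n)
        estimates.append(estimator(X[idx]))
    # Summarize the empirical distribution of the estimates with xi.
    return quality(np.asarray(estimates))
```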
While the literature does contain some discussion of techniques for improving the computational efficiency of the bootstrap, that work is largely devoted to reducing the number of Monte Carlo resamples required (Efron, 1988; Efron & Tibshirani, 1993). These techniques in general introduce significant additional complexity of implementation and do not obviate the need for repeated estimator computation on resamples having size comparable to that of the original dataset.

Bootstrap variants such as the m out of n bootstrap (Bickel et al., 1997) and subsampling (Politis et al., 1999) were introduced to achieve statistical consistency in edge cases in which the bootstrap fails, though they also have the potential to yield computational benefits. The m out of n bootstrap (and subsampling) functions as follows, for m < n: repeatedly resample m points i.i.d. from P_n (subsample m points without replacement from X_1, ..., X_n), compute the estimate on each resample (subsample), form the empirical distribution Q_{n,m}* of the computed estimates, approximate ξ(Q_m(P)) ≈ ξ(Q_{n,m}*), and apply an analytical correction to in turn approximate ξ(Q_n(P)). This final analytical correction uses prior knowledge of the convergence rate of θ̂_n as n increases and is necessary because each estimate is computed based on only m rather than n points.

These procedures have a more favorable computational profile than the bootstrap, as they only require repeated estimator computation on smaller sets of data. However, they require knowledge and explicit use of the convergence rate of θ̂_n, and, as we show in our empirical investigation below, their success is sensitive to the choice of m. While schemes have been proposed for automatic selection of an optimal value of m (Bickel & Sakov, 2008), they require significantly greater computation, which would eliminate any computational gains. It is also worth noting that some work on the m out of n bootstrap has explicitly sought to reduce computational costs using two different values of m in conjunction with extrapolation (Bickel & Yahav, 1988; Bickel & Sakov, 2002). However, these approaches explicitly use series expansions of the cdf values of the estimator's sampling distribution and hence are less automatically usable; they also require execution of the m out of n bootstrap for multiple different values of m.
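As a concrete illustration, here is a hedged sketch of the m out of n bootstrap for the special case in which ξ is a standard error and θ̂_n is assumed to converge at the usual n^{-1/2} rate, so that the analytical correction reduces to a √(m/n) rescaling; other quality measures and convergence rates require different corrections, and the estimator callable is a hypothetical placeholder.

```python
import numpy as np

def m_out_of_n_stderr(X, estimator, m, num_resamples, seed=None):
    """Sketch of the m out of n bootstrap when xi is a standard error.

    Assumes theta_hat_n converges at the usual n^(-1/2) rate, so the
    standard error computed from size-m resamples is rescaled by
    sqrt(m / n); other quality measures need their own corrections.
    `estimator` is a hypothetical placeholder for theta.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    estimates = []
    for _ in range(num_resamples):
        idx = rng.integers(0, n, size=m)  # resample m < n points i.i.d. from P_n
        estimates.append(estimator(X[idx]))
    se_m = np.std(np.asarray(estimates), axis=0, ddof=1)
    return np.sqrt(m / n) * se_m          # analytical correction back to scale n
```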
Algorithm 1 Bag of Little Bootstraps (BLB)

Input: Data X_1, ..., X_n
  θ: estimator of interest
  ξ: estimator quality assessment
  b: subset size
  s: number of sampled subsets
  r: number of Monte Carlo iterations
Output: An estimate of ξ(Q_n(P))

for j ← 1 to s do
  // Subsample the data
  Randomly sample a set I = {i_1, ..., i_b} of b indices from {1, ..., n} without replacement
  [or, choose I to be a disjoint subset of size b from a predefined random partition of {1, ..., n}]
  // Approximate ξ(Q_n(P_{n,b}^{(j)}))
  for k ← 1 to r do
    Sample (n_1, ..., n_b) ∼ Multinomial(n, 1_b/b)
    P_{n,k}* ← n^{-1} ∑_{a=1}^{b} n_a δ_{X_{i_a}}
    θ̂_{n,k}* ← θ(P_{n,k}*)
  end for
  Q_{n,j}* ← r^{-1} ∑_{k=1}^{r} δ_{θ̂_{n,k}*}
  ξ_{n,j}* ← ξ(Q_{n,j}*)
end for
// Average the values ξ(Q_n(P_{n,b}^{(j)})) computed for the different data subsets
return s^{-1} ∑_{j=1}^{s} ξ_{n,j}*

4. Bag of Little Bootstraps (BLB)

The Bag of Little Bootstraps (Algorithm 1) functions by averaging the results of bootstrapping multiple small subsets of X_1, ..., X_n. More formally, given a subset size b < n, BLB samples s subsets of size b from the original n data points, uniformly at random (one can also impose the constraint that the subsets be disjoint). Let I_1, ..., I_s ⊂ {1, ..., n} be the corresponding index multisets (note that |I_j| = b, ∀j), and let P_{n,b}^{(j)} = b^{-1} ∑_{i∈I_j} δ_{X_i} be the empirical distribution corresponding to subset j. BLB's estimate of ξ(Q_n(P)) is then given by s^{-1} ∑_{j=1}^{s} ξ(Q_n(P_{n,b}^{(j)})). Although the terms ξ(Q_n(P_{n,b}^{(j)})) cannot be computed analytically in general, they are computed numerically by the inner loop in Algorithm 1 via Monte Carlo approximation in the manner of the bootstrap: for each term j, we repeatedly resample n points i.i.d. from P_{n,b}^{(j)}, compute the estimate on each resample, form the empirical distribution Q_{n,j}* of the computed estimates, and approximate ξ(Q_n(P_{n,b}^{(j)})) ≈ ξ(Q_{n,j}*).

To realize the substantial computational benefits afforded by BLB, we utilize the following crucial fact: each BLB resample, despite having nominal size n, contains at most b distinct data points. In particular, to generate each resample, it suffices to draw a vector of counts from an n-trial uniform multinomial distribution over b objects. We can then represent each resample by simply maintaining the at most b distinct points present within it, accompanied by corresponding sampled counts (i.e., each resample requires only storage space in O(b)). In turn, if the estimator can work directly with this weighted data representation, then its computational requirements—with respect to both time and storage space—scale only in b, rather than n. Fortunately, this property does indeed hold for many if not most commonly used estimators, including M-estimators such as linear and kernel regression, logistic regression, and Support Vector Machines, among many others.
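The following Python/NumPy sketch shows how Algorithm 1 and the weighted-data representation fit together; the estimator and quality callables are hypothetical placeholders (anything accepting the weighted representation, e.g., a weighted logistic regression fit and a confidence-interval computation, could be substituted).

```python
import numpy as np

def blb(X, estimator, quality, b, s, r, seed=None):
    """Sketch of Algorithm 1 using the weighted-data representation.

    `estimator(points, weights)` must accept the b distinct points of a
    subsample together with multinomial counts summing to n, and
    `quality(estimates)` maps the r resulting estimates to an estimate
    of xi; both callables are hypothetical placeholders.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    xi_values = []
    for _ in range(s):
        # Subsample b of the n points without replacement.
        subset = X[rng.choice(n, size=b, replace=False)]
        estimates = []
        for _ in range(r):
            # A resample of nominal size n is just a vector of counts drawn
            # from an n-trial uniform multinomial over the b points (O(b) storage).
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            estimates.append(estimator(subset, counts))
        # Approximate xi(Q_n(P^(j)_{n,b})) from this subsample's estimates.
        xi_values.append(quality(np.asarray(estimates)))
    # Average the per-subsample quality assessments.
    return np.mean(np.asarray(xi_values), axis=0)
```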
As a result, BLB only requires repeated computation on small subsets of the original dataset and avoids the bootstrap's problematic need for repeated computation of estimates on resamples having size comparable to that of the original dataset. A simple and standard calculation (Efron & Tibshirani, 1993) shows that each bootstrap resample contains approximately 0.632n distinct points, which is large if n is large. In contrast, as discussed above, each BLB resample contains at most b distinct points, and b can be chosen to be much smaller than n or 0.632n. For example, we might take b = n^γ where γ ∈ [0.5, 1]. More concretely, if n = 1,000,000, then each bootstrap resample would contain approximately 632,000 distinct points, whereas with b = n^0.6 each BLB subsample and resample would contain at most 3,981 distinct points. If each data point occupies 1 MB of storage space, then the original dataset would occupy 1 TB, a bootstrap resample would occupy approximately 632 GB, and each BLB subsample or resample would occupy at most 4 GB.

BLB thus has a significantly more favorable computational profile than the bootstrap. As seen in subsequent sections, our procedure typically requires less total computation to reach comparably high accuracy (fairly modest values of s and r suffice); is significantly more amenable to implementation on distributed and parallel computing architectures which are often used to process large datasets; maintains the favorable statistical properties of the bootstrap; and is more robust than the m out of n bootstrap and subsampling to the choice of subset size.

5. Statistical Correctness

We show via both a simulation study and theoretical analysis that BLB shares the favorable statistical performance of the bootstrap while also being consistently more robust than the m out of n bootstrap and subsampling to the choice of b. Here we present a representative summary of our investigation of statistical correctness; see Kleiner et al. (2012) for more detail.

We investigate the empirical performance characteristics of BLB and compare to existing methods via experiments on different simulated datasets and estimation tasks. Use of simulated data is necessary here because it allows knowledge of Q_n(P) and hence ξ(Q_n(P)) (i.e., ground truth). We consider two different settings: regression and classification. For both settings, the data have the form X_i = (X̃_i, Y_i) ∼ P, i.i.d. for i = 1, ..., n, where X̃_i ∈ R^d; Y_i ∈ R for regression, whereas Y_i ∈ {0, 1} for classification. We use n = 20,000 for the plots shown, and d is set to 100 for regression and 10 for classification. In each case, θ̂_n estimates a parameter vector in R^d (via either least squares or logistic regression with Newton's method, all implemented in MATLAB) for a linear or generalized linear model of the mapping between X̃_i and Y_i. We define ξ as computing a set of marginal 95% confidence intervals, one for each element of the estimated parameter vector (averaging across ξ's consists of averaging element-wise interval boundaries).
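The paper does not spell out the exact interval construction; one simple percentile-based realization of ξ, together with the relative width deviation used for the evaluation described below, might look as follows (the percentile form is an assumption for illustration).

```python
import numpy as np

def marginal_percentile_cis(estimates, level=0.95):
    """Form a marginal confidence interval for each parameter component.

    `estimates` is an (r, d) array of resample estimates; the percentile
    construction is an illustrative choice, not necessarily the one used
    in the paper's experiments.
    """
    alpha = 1.0 - level
    lower = np.percentile(estimates, 100 * alpha / 2, axis=0)
    upper = np.percentile(estimates, 100 * (1 - alpha / 2), axis=0)
    return np.stack([lower, upper], axis=1)  # shape (d, 2)

def mean_relative_width_error(widths, true_widths):
    """Average (across dimensions) relative absolute deviation |c - c_o| / c_o."""
    widths, true_widths = np.asarray(widths), np.asarray(true_widths)
    return float(np.mean(np.abs(widths - true_widths) / true_widths))
```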
To evaluate the various quality assessment procedures on a given estimation task and true underlying data distribution P, we first compute the ground truth ξ(Q_n(P)) based on 2,000 realizations of datasets of size n from P. Then, for an independent dataset realization of size n from the true underlying distribution, we run each quality assessment procedure and record the estimate of ξ(Q_n(P)) produced after each iteration (e.g., after each bootstrap resample or BLB subsample is processed), as well as the cumulative time required to produce that estimate. Every such estimate is evaluated based on the average (across dimensions) relative absolute deviation of its component-wise confidence intervals' widths from the corresponding true widths: given an estimated confidence interval width c and a true width c_o, the relative deviation of c from c_o is defined as |c − c_o|/c_o. We repeat this process on five independent dataset realizations of size n and average the resulting relative errors and corresponding times across these five datasets to obtain a trajectory of relative error versus time for each quality assessment procedure (the trajectories' variances are small relative to the relevant differences between their means, so the variances are not shown in our plots). To maintain consistency of notation, we henceforth refer to the m out of n bootstrap as the b out of n bootstrap. For BLB, the b out of n bootstrap, and subsampling, we consider b = n^γ where γ ∈ {0.5, 0.6, 0.7, 0.8, 0.9}; we use r = 100 in all runs of BLB.

Figure 1 shows a set of representative results for the classification setting, where P generates the components of X̃_i i.i.d. from StudentT(3) and Y_i ∼ Bernoulli((1 + exp(−X̃_i^T 1))^{-1}); we use this representative empirical setting in subsequent sections as well. As seen in the figure, BLB (left plot) succeeds in converging to low relative error more quickly than the bootstrap for b > n^0.5, while converging to somewhat higher relative error for b = n^0.5. BLB is also more robust than the b out of n bootstrap (middle plot), which fails to converge to low relative error for b ≤ n^0.6. In fact, even for b = n^0.5, BLB's performance is substantially superior to that of the b out of n bootstrap. For the aforementioned case in which BLB does not match the relative error of the bootstrap, additional empirical results (right plot) and our theoretical analysis indicate that this discrepancy in relative error diminishes as n increases. Identical evaluation of subsampling (plots not shown) shows that it performs strictly worse than the b out of n bootstrap. Qualitatively similar results also hold in both the classification and regression settings (with the latter generally showing better performance) when P generates X̃_i from Normal, Gamma, or StudentT distributions, and when P uses a nonlinear noisy mapping between X̃_i and Y_i (so that we estimate a misspecified model).


Figure 1. Results for the classification setting with linear data generating distribution and StudentT X̃_i distribution. For both BLB and the b out of n bootstrap (BOFN), b = n^γ, with the value of γ for each trajectory given in the legend. (left and middle) Relative error vs. processing time, with n = 20,000. The left plot shows BLB with the bootstrap (BOOT), and the middle plot shows BOFN. (right) Relative error (after convergence) vs. n for BLB, BOFN, and BOOT.

These experiments illustrate the statistical correctness of BLB, as well as its improved robustness to the choice of b. The results are borne out by our theoretical analysis, which shows that BLB has statistical properties that are identical to those of the bootstrap, under the same conditions that have been used in prior analysis of the bootstrap. In particular, BLB is asymptotically consistent for a broad class of estimators and quality measures (see Kleiner et al. (2012) for proof):

Theorem 1. Suppose that θ is Hadamard differentiable at P tangentially to some subspace, with P and P_{n,b}^{(j)} viewed as maps from some Donsker class F to R such that F_δ is measurable for every δ > 0, where F_δ = {f − g : f, g ∈ F, ρ_P(f − g) < δ} and ρ_P(f) = (P(f − Pf)^2)^{1/2}. Additionally, assume that ξ is continuous in the space of distributions Q with respect to a metric that metrizes weak convergence. Then, up to centering and rescaling of θ̂_n, s^{-1} ∑_{j=1}^{s} ξ(Q_n(P_{n,b}^{(j)})) − ξ(Q_n(P)) → 0 in probability as n → ∞, for any sequence b → ∞ and for any fixed s. Furthermore, the result holds for sequences s → ∞ if E|ξ(Q_n(P_{n,b}^{(j)}))| < ∞.

BLB is also higher-order correct: despite the fact that the procedure only applies the estimator in question to subsets of the full observed dataset, it shares the fast convergence rates of the bootstrap, which permit convergence of the procedure's output at rate O(1/n) rather than the rate of O(1/√n) achieved by asymptotic approximations. To achieve these fast convergence rates, some natural conditions on BLB's hyperparameters are required, for example that b = Ω(√n) and that s be sufficiently large with respect to the variability in the observed data. Quite interestingly, these conditions permit b to be significantly smaller than n, with b/n → 0 as n → ∞. See Kleiner et al. (2012) for our theoretical results on higher-order correctness.

6. Scalability

The experiments of the preceding section, though primarily intended to investigate statistical performance, also provide some insight into computational performance: as seen in Figure 1, when computing serially, BLB generally requires less time, and hence less total computation, than the bootstrap to attain comparably high accuracy. Those results only hint at BLB's superior ability to scale computationally to large datasets, which we now demonstrate in full in the following discussion and via large-scale experiments.

Modern massive datasets often exceed both the processing and storage capabilities of individual processors or compute nodes, thus necessitating the use of parallel and distributed computing architectures. Indeed, in the large data setting, computing a single full-data point estimate often requires simultaneous distributed computation across multiple compute nodes, among which the observed dataset is partitioned. As a result, the scalability of a quality assessment method is closely tied to its ability to effectively utilize such computing resources.
Due to the large size of bootstrap resamples (recall that approximately 63% of data points appear at least once in each resample), the following is the most natural avenue for applying the bootstrap to large-scale data using distributed computing: given data on a cluster of compute nodes, parallelize the estimate computation on each resample across the cluster, and compute on one resample at a time. This approach, while at least potentially feasible, remains quite problematic. Each estimate computation requires the use of an entire cluster of compute nodes, and the bootstrap repeatedly incurs the associated overhead, such as the cost of repeatedly communicating intermediate data among nodes. Additionally, many cluster computing systems in widespread use (e.g., Hadoop MapReduce) store data only on disk, rather than in memory, due to physical size constraints (if the data size exceeds the amount of available memory) or architectural constraints (e.g., the need for fault tolerance). In that case, the bootstrap incurs the extreme costs associated with repeatedly reading a very large dataset from disk; though disk read costs may be acceptable when (slowly) computing only a single point estimate, they easily become prohibitive when computing many estimates on one hundred or more resamples.
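The roughly 63% figure quoted above (equivalently, the 0.632n distinct points cited in Section 4) follows from the standard calculation sketched here.

```latex
% Chance that a given point appears at least once in a size-n resample drawn
% with replacement, and the resulting expected number of distinct points:
\Pr[X_i \text{ appears at least once}]
   = 1 - \left(1 - \tfrac{1}{n}\right)^{n}
   \;\xrightarrow{\;n \to \infty\;}\; 1 - e^{-1} \approx 0.632,
\qquad
\mathbb{E}[\text{number of distinct points}]
   = n \left(1 - \left(1 - \tfrac{1}{n}\right)^{n}\right) \approx 0.632\, n .
```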

In contrast, BLB permits computation on multiple (or even all) subsamples and resamples simultaneously in parallel, allowing for straightforward and effective distributed and parallel implementations which enable effective scalability and large computational gains. Because BLB subsamples and resamples can be significantly smaller than the original dataset, they can be transferred to, stored by, and processed on individual (or very small sets of) compute nodes. For example, we can naturally leverage modern hierarchical distributed architectures by distributing subsamples to different compute nodes and subsequently using intra-node parallelism to compute across different resamples generated from the same subsample. Note that generation and distribution of the subsamples requires only a single pass over the full dataset (i.e., only a single read of the full dataset from disk, if it is stored only on disk), after which all required data (i.e., the subsamples) can be stored in memory. Beyond this significant architectural benefit, we also achieve implementation and algorithmic benefits: we do not need to parallelize the estimator computation internally, as BLB uses the available parallelism to compute on multiple resamples simultaneously, and exposing the estimator to only b rather than n distinct points significantly reduces the computational cost of estimation, particularly if the estimator computation scales super-linearly.

We now empirically substantiate the preceding discussion via large-scale experiments performed on Amazon EC2. We use the representative experimental setup of Figure 1, but now with d = 3,000, n = 6,000,000, Y_i ∼ Bernoulli((1 + exp(−X̃_i^T 1/√d))^{-1}), and the logistic regression implemented using L-BFGS. The size of a full observed dataset in this setting is thus approximately 150 GB. We compare the performance of BLB and the bootstrap (now omitting the m out of n bootstrap and subsampling due to the shortcomings illustrated in Section 5), both implemented as described above (we parallelize the estimator computation for the bootstrap by simply distributing gradient computations via MapReduce) using the Spark cluster computing framework (Spark, 2012), which provides the ability to either read data from disk or cache it in memory (provided that sufficient memory is available) for faster repeated access. For BLB, we use r = 50, s = 5, and b = n^0.7. Due to the larger data size and use of a distributed architecture, we now implement the bootstrap using Poisson resampling, and we compute ground truth using 200 independent dataset realizations.

Figure 2. Large-scale results for BLB and the bootstrap (BOOT) on 150 GB of data. Because BLB is fully parallelized across subsamples, we only show the relative error and total processing time of its final output. (left) Data stored on disk. (right) Data cached in memory.

In the left plot of Figure 2, we show results obtained using a cluster of 10 worker nodes, each having 6 GB of memory and 8 compute cores; thus, the total memory of the cluster is 60 GB, and the full dataset (150 GB) can only be stored on disk. As expected, the time required by the bootstrap to produce even a low-accuracy output is prohibitively high, while BLB provides a high-accuracy output quite quickly, in less than the time required to process even a single bootstrap resample. In the right plot of Figure 2, we show results obtained using a cluster of 20 worker nodes, each having 12 GB of memory and 4 compute cores; thus, the total memory of the cluster is 240 GB, and we cache the full dataset in memory for fast repeated access. Unsurprisingly, the bootstrap's performance improves significantly with respect to the previous experiment. However, the performance of BLB (which also improves) remains substantially better than that of the bootstrap.

Thus, relative to the bootstrap, BLB both allows more natural and effective use of parallel and distributed computational resources and decreases the total computational cost of assessing estimator quality. Finally, it is worth noting that even if only a single compute node is available, BLB allows the following somewhat counterintuitive possibility: even if it is prohibitive to actually compute a point estimate for the full observed data using a single compute node (because the full dataset is large), it may still be possible to efficiently assess such a point estimate's quality using only a single compute node by processing one subsample (and the associated resamples) at a time.
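To illustrate the parallelization pattern (one subsample per worker, with all resamples for that subsample handled inside the worker), here is a single-machine sketch using Python's multiprocessing module; the estimator (a weighted mean) and quality measure (a component-wise standard error) are deliberately simple stand-ins, not the paper's logistic-regression/Spark implementation.

```python
from multiprocessing import Pool
import numpy as np

def estimator(subset, counts):
    """Hypothetical weighted estimator: a weighted mean of the subset."""
    return np.average(subset, axis=0, weights=counts)

def quality(estimates):
    """Hypothetical xi: component-wise standard error of the estimates."""
    return np.std(estimates, axis=0, ddof=1)

def process_subsample(args):
    """Bootstrap one subsample; all r resamples touch only its b distinct points."""
    subset, n, r, seed = args
    rng = np.random.default_rng(seed)
    b = len(subset)
    estimates = [estimator(subset, rng.multinomial(n, np.full(b, 1.0 / b)))
                 for _ in range(r)]
    return quality(np.asarray(estimates))

def parallel_blb(X, b, s, r, workers=4, seed=0):
    """Distribute BLB's s subsamples across worker processes."""
    n = len(X)
    rng = np.random.default_rng(seed)
    jobs = [(X[rng.choice(n, size=b, replace=False)], n, r, child_seed)
            for child_seed in rng.integers(0, 2**32, size=s)]
    with Pool(workers) as pool:
        xi_values = pool.map(process_subsample, jobs)  # subsamples in parallel
    return np.mean(np.asarray(xi_values), axis=0)

if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(100_000, 5))
    print(parallel_blb(X, b=int(100_000 ** 0.7), s=5, r=50))
```

With only one worker, the same loop processes a single subsample (and its resamples) at a time, which is the single-node possibility noted above.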

7. Hyperparameter Selection

Like existing resampling-based procedures such as the bootstrap, BLB requires the specification of hyperparameters controlling the number of subsamples and resamples processed. Setting such hyperparameters to be sufficiently large is necessary to ensure good statistical performance; however, setting them to be unnecessarily large results in wasted computation. Prior work on the bootstrap and related procedures generally assumes that a procedure's user will simply select a priori a large, constant number of resamples to be processed (with the exception of Tibshirani (1985), who does not provide a general solution to this issue). However, this approach reduces the level of automation of these methods and can be quite inefficient in the large data setting.

Thus, we now examine the dependence of BLB's performance on the choice of r and s, with the goal of better understanding their influence and providing guidance and adaptive (i.e., more automatic) methods for their selection. The left plot of Figure 3 illustrates the influence of r and s, giving the relative errors achieved by BLB with b = n^0.7 for different r, s pairs in the representative empirical setting described in Section 5. In particular, note that for all but the smallest values of r and s, it is possible to choose these values independently such that BLB achieves low relative error; in this case, choosing s ≥ 3, r ≥ 50 is sufficient.

While these results are useful and provide some guidance, we expect the minimal sufficient values of r and s to change based on the identity of ξ (e.g., we expect a confidence interval to be harder to compute and hence to require larger r than a standard error) and the properties of the underlying data. Thus, to help avoid the need to choose r and s conservatively, we now provide a means for adaptive hyperparameter selection, which we validate empirically.

Concretely, to select r adaptively in the inner loop of Algorithm 1, we propose, for each subsample j, to continue to process resamples and update ξ_{n,j}* until it has ceased to change significantly. Noting that the values θ̂_{n,k}* used to compute ξ_{n,j}* are conditionally i.i.d. given a subsample, for most forms of ξ the series of computed ξ_{n,j}* values will be well behaved and will converge (in many cases at rate O(1/√r), though with unknown constant) to a constant target value as more resamples are processed. Therefore, it suffices to process resamples (i.e., to increase r) until we are satisfied that ξ_{n,j}* has ceased to fluctuate significantly; we propose using Algorithm 2 to assess this convergence. The same scheme can be used to select s adaptively by processing more subsamples (i.e., increasing s) until BLB's output value s^{-1} ∑_{j=1}^{s} ξ_{n,j}* has stabilized; in this case, one can simultaneously also choose r adaptively and independently for each subsample.

Algorithm 2 Convergence Assessment

Input: A series z^(1), z^(2), ..., z^(t) ∈ R^d
  w ∈ N: window size (< t)
  ε ∈ R: target relative error (> 0)
Output: true iff the input series is deemed to have ceased to fluctuate beyond the target relative error

if ∀j ∈ [1, w], (1/d) ∑_{i=1}^{d} |z_i^(t−j) − z_i^(t)| / |z_i^(t)| ≤ ε then
  return true
else
  return false
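A direct transcription of Algorithm 2 into Python might look as follows; the handling of series shorter than the window is an assumption, since the algorithm itself presumes w < t.

```python
import numpy as np

def has_converged(series, window, epsilon):
    """Algorithm 2: has the series z^(1), ..., z^(t) of d-dimensional values
    ceased to fluctuate beyond the target relative error epsilon?"""
    series = np.asarray(series, dtype=float)
    if len(series) <= window:
        return False  # not enough history yet (assumed behavior; w < t required)
    current = series[-1]
    # Compare z^(t) against each of the previous w values z^(t-1), ..., z^(t-w).
    for previous in series[-(window + 1):-1]:
        if np.mean(np.abs(previous - current) / np.abs(current)) > epsilon:
            return False
    return True
```

In the adaptive scheme, one keeps increasing r (or s), appends the updated ξ estimate to the series after each new resample (or subsample), and stops as soon as has_converged returns True.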
The middle plot of Figure 3 shows the results of applying such adaptive hyperparameter selection. For selection of r we use ε = 0.05 and w = 20, and for selection of s we use ε = 0.05 and w = 3. As seen in the plot, the adaptive hyperparameter selection allows BLB to cease computing shortly after it has converged (to low relative error), limiting the amount of unnecessary computation that is performed. Though selected a priori, ε and w are more intuitively interpretable and less dependent on the details of ξ and the underlying data generating distribution than r and s. Indeed, the aforementioned specific values of ε and w yield results of comparably good quality when also used for a variety of other synthetic and real data generation settings (see Section 8 below), as well as for different forms of ξ (see the righthand table in Figure 3, which shows that smaller values of r are selected when ξ is easier to compute). Thus, our scheme significantly helps to alleviate the burden of hyperparameter selection.

Automatic selection of a value of b in a computationally efficient manner is more difficult due to the inability to reuse computations performed for different values of b. One could consider similarly increasing b from some small value until the output of BLB stabilizes; devising a means of doing so efficiently is the subject of future work. Nonetheless, based on our fairly extensive empirical investigation, it seems that b = n^0.7 is a reasonable and effective choice in many situations.

8. Real Data

We now present the results of applying BLB to several different real datasets. In this setting, given the absence of ground truth, it is not possible to objectively evaluate estimator quality assessment methods' statistical correctness. As a result, we are reduced to comparing the outputs of different methods to each other; we also now report the average (across dimensions) absolute confidence interval width produced by each procedure, rather than relative error.

[Figure 3, right panel: statistics of the values of r selected by BLB's adaptive hyperparameter selection]

        CI     STDERR
mean    89.6   67.7
min     50     40
max     150    110

Figure 3. Results for BLB hyperparameter selection in the empirical setting of Figure 1, with b = n^0.7. (left) Relative error achieved by BLB for different values of r and s. (middle) Relative error vs. processing time (without parallelization) for BLB using adaptive selection of r and s (resulting stopping times of BLB trajectories are marked by squares) and the bootstrap (BOOT). (right) Statistics of the different values of r selected by BLB's adaptive hyperparameter selection (across multiple subsamples) when ξ is either our usual confidence interval-based quality measure (CI), or a component-wise standard error (STDERR); the relative errors achieved by BLB and the bootstrap are comparable in both cases.

Figure 4 shows results for BLB, the bootstrap, and the b out of n bootstrap on the UCI connect4 dataset (logistic regression, d = 42, n = 67,557). We select the BLB hyperparameters r and s using the adaptive method described in the previous section. Notably, the outputs of BLB for all values of b considered, and the output of the bootstrap, are tightly clustered around the same value; additionally, as expected, BLB converges more quickly than the bootstrap. However, the values produced by the b out of n bootstrap vary significantly as b changes, thus further highlighting this procedure's lack of robustness. We have obtained qualitatively similar results on six additional UCI datasets (ct-slice, magic, millionsong, parkinsons, poker, shuttle) with different estimators (linear regression and logistic regression) and a range of different values of n and d; see Kleiner et al. (2012) for additional plots.

Figure 4. Average absolute confidence interval width vs. processing time on the UCI connect4 dataset (logistic regression, d = 42, n = 67,557). (left) Results for BLB (with adaptive hyperparameter selection) and the bootstrap (BOOT). (right) Results for the b out of n bootstrap (BOFN).

9. Conclusion

We have presented a new procedure, BLB, which provides a powerful new alternative for automatic, accurate assessment of estimator quality that is well suited to large-scale data and modern parallel and distributed computing architectures.

Acknowledgments

This research is supported in part by an NSF CISE Expeditions award, gifts from Google, SAP, Amazon Web Services, Blue Goji, Cisco, Cloudera, Ericsson, General Electric, Hewlett Packard, Huawei, Intel, MarkLogic, Microsoft, NetApp, Oracle, Quanta, Splunk, VMware and by DARPA (contract #FA8650-11-C-7136). Ameet Talwalkar was supported by NSF award No. 1122732.

References

Bickel, P. J. and Sakov, A. Extrapolation and the bootstrap. Sankhya: The Indian Journal of Statistics, 64:640–652, 2002.

Bickel, P. J. and Sakov, A. On the choice of m in the m out of n bootstrap and confidence bounds for extrema. Statistica Sinica, 18:967–985, 2008.

Bickel, P. J. and Yahav, J. A. Richardson extrapolation and the bootstrap. Journal of the American Statistical Association, 83(402):387–393, 1988.

Bickel, P. J., Gotze, F., and van Zwet, W. Resampling fewer than n observations: Gains, losses, and remedies for losses. Statistica Sinica, 7:1–31, 1997.

Efron, B. Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1):1–26, 1979.

Efron, B. More efficient bootstrap computations. Journal of the American Statistical Association, 85(409):79–89, 1988.

Efron, B. and Tibshirani, R. An Introduction to the Bootstrap. Chapman and Hall, 1993.

Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. I. A scalable bootstrap for massive data, May 2012. URL http://arxiv.org/abs/1112.5016.

Politis, D., Romano, J., and Wolf, M. Subsampling. Springer, 1999.

Spark. Spark cluster computing framework, May 2012. URL http://www.spark-project.org.

Tibshirani, R. How many bootstraps? Technical report, Department of Statistics, Stanford University, Stanford, CA, 1985.