The Big Data Bootstrap

Ariel Kleiner [email protected]
Ameet Talwalkar [email protected]
Purnamrita Sarkar [email protected]
Computer Science Division, University of California, Berkeley, CA 94720, USA

Michael I. Jordan [email protected]
Computer Science Division and Department of Statistics, University of California, Berkeley, CA 94720, USA

Abstract

The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large datasets, the computation of bootstrap-based quantities can be prohibitively demanding. As an alternative, we present the Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to obtain a robust, computationally efficient means of assessing estimator quality. BLB is well suited to modern parallel and distributed computing architectures and retains the generic applicability, statistical efficiency, and favorable theoretical properties of the bootstrap. We provide the results of an extensive empirical and theoretical investigation of BLB's behavior, including a study of its statistical correctness, its large-scale implementation and performance, selection of hyperparameters, and performance on real data.

1. Introduction

Assessing the quality of estimates based upon finite data is a task of fundamental importance in data analysis. For example, when estimating a vector of model parameters given a training dataset, it is useful to be able to quantify the uncertainty in that estimate (e.g., via a confidence region), its bias, and its risk. Such quality assessments provide far more information than a simple point estimate itself and can be used to improve human interpretation of inferential outputs, perform hypothesis testing, do bias correction, make more efficient use of available resources (e.g., by ceasing to process data when the confidence region is sufficiently small), perform active learning, and do feature selection, among many more potential uses.

Accurate assessment of estimate quality has been a longstanding concern in statistics. A great deal of classical work in this vein has proceeded via asymptotic analysis, which relies on deep study of particular classes of estimators in particular settings (Politis et al., 1999). While this approach ensures asymptotic correctness and allows analytic computation, its applicability is limited to cases in which the relevant asymptotic analysis is tractable and has actually been performed. In contrast, recent decades have seen greater focus on more automatic methods, which generally require significantly less analysis, at the expense of doing more computation. The bootstrap (Efron, 1979; Efron & Tibshirani, 1993) is perhaps the best known and most widely used among these methods, due to its simplicity, generic applicability, and automatic nature. Efforts to ameliorate statistical shortcomings of the bootstrap in turn led to the development of related methods such as the m out of n bootstrap and subsampling (Bickel et al., 1997; Politis et al., 1999).

Despite the computational demands of this more automatic methodology, advancements have been driven primarily by a desire to remedy shortcomings in statistical correctness. However, with the advent of increasingly large datasets and diverse sets of often complex and exploratory queries, computational considerations and automation (i.e., lack of reliance on deep analysis of the specific estimator and setting of interest) are increasingly important. Even as the amount of available data grows, the number of parameters to be estimated and the number of potential sources of bias often also grow, leading to a need to be able to tractably assess estimator quality in the setting of large data.

Thus, unlike previous work on estimator quality assessment, here we directly address computational costs and scalability in addition to statistical correctness and automation. Indeed, existing methods have significant drawbacks with respect to one or more of these desiderata. The bootstrap, despite its strong statistical properties, has high—even prohibitive—computational costs; thus, its usefulness is severely blunted by the large datasets increasingly encountered in practice. We also find that its relatives, such as the m out of n bootstrap and subsampling, can have lesser computational costs, as expected, but are generally not robust to specification of hyperparameters (such as the number of subsampled data points) and are also somewhat less automatic due to their need to explicitly utilize theoretically derived estimator convergence rates.

Motivated by the need for an automatic, accurate means of assessing estimator quality that is scalable to large datasets, we present a new procedure, the Bag of Little Bootstraps (BLB), which functions by combining the results of bootstrapping multiple small subsets of a larger original dataset. BLB has a significantly more favorable computational profile than the bootstrap, as it only requires repeated computation of the estimator under consideration on quantities of data that can be much smaller than the original dataset. Hence, BLB is well suited to implementation on modern distributed and parallel computing architectures. Our procedure maintains the bootstrap's automatic and generic applicability, favorable statistical properties, and simplicity of implementation. Finally, as we show empirically, BLB is consistently more robust than alternatives such as the m out of n bootstrap and subsampling.

We next formalize our statistical setting and notation in Section 2, discuss relevant prior work in Section 3, and present BLB in full detail in Section 4. Subsequently, we present an empirical and theoretical study of statistical correctness in Section 5, an exploration of scalability including large-scale experiments on a distributed computing platform in Section 6, practical methods for automatically selecting hyperparameters in Section 7, and assessments on real data in Section 8.

2. Setting and Notation

We assume that we observe a sample X_1, ..., X_n ∈ X drawn i.i.d. from some true (unknown) underlying distribution P ∈ P. Based only on this observed data, we obtain an estimate θ̂_n = θ(P_n) ∈ Θ, where P_n = n^{-1} ∑_{i=1}^{n} δ_{X_i} is the empirical distribution of X_1, ..., X_n. The true (unknown) population value to be estimated is θ(P). For example, θ̂_n might estimate a measure of correlation, the parameters of a regressor, or the prediction accuracy of a trained classification model. Noting that θ̂_n is a random quantity because it is based on n random observations, we define Q_n(P) ∈ Q as the true underlying distribution of θ̂_n, which is determined by both P and the form of the mapping θ. Our end goal is the computation of some metric ξ(Q_n(P)), for ξ : Q → Ξ with Ξ a vector space, which informatively summarizes Q_n(P). For instance, ξ might compute a confidence region, a standard error, or a bias. In practice, we do not have direct knowledge of P or Q_n(P), and so we must estimate ξ(Q_n(P)) itself based only on the observed data and knowledge of the form of the mapping θ.

3. Related Work

The bootstrap (Efron, 1979; Efron & Tibshirani, 1993) provides an automatic and widely applicable means of quantifying estimator quality: it simply uses the data-driven plugin approximation ξ(Q_n(P)) ≈ ξ(Q_n(P_n)). While ξ(Q_n(P_n)) cannot be computed exactly in most cases, it is generally amenable to straightforward Monte Carlo approximation as follows: repeatedly resample n points i.i.d. from P_n, compute the estimate on each resample, form the empirical distribution Q_n* of the computed estimates, and approximate ξ(Q_n(P)) ≈ ξ(Q_n*). Though conceptually simple and powerful, this procedure requires repeated estimator computation on resamples having size comparable to that of the original dataset. Therefore, if the original dataset is large, then this computation can be costly.
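To make the Monte Carlo procedure concrete, the following minimal Python/NumPy sketch implements the plugin approximation just described; the estimator and quality callables are hypothetical stand-ins for θ and ξ and are not specified by the paper.

```python
import numpy as np

def bootstrap_quality(X, estimator, quality, num_resamples, seed=None):
    """Monte Carlo approximation of xi(Q_n(P_n)): a minimal sketch.

    `estimator` and `quality` are hypothetical callables standing in for
    theta and xi; neither is specified by the paper.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    estimates = []
    for _ in range(num_resamples):
        # Resample n points i.i.d. from the empirical distribution P_n.
        idx = rng.integers(0, n, size=n)
        estimates.append(estimator(X[idx]))
    # Summarize the empirical distribution of the estimates with xi.
    return quality(np.asarray(estimates))
```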
While the literature does contain some discussion of techniques for improving the computational efficiency of the bootstrap, that work is largely devoted to reducing the number of Monte Carlo resamples required (Efron, 1988; Efron & Tibshirani, 1993). These techniques in general introduce significant additional complexity of implementation and do not obviate the need for repeated estimator computation on resamples having size comparable to that of the original dataset.

Bootstrap variants such as the m out of n bootstrap (Bickel et al., 1997) and subsampling (Politis et al., 1999) were introduced to achieve statistical consistency in edge cases in which the bootstrap fails, though they also have the potential to yield computational benefits. The m out of n bootstrap (and subsampling) functions as follows, for m < n: repeatedly resample m points i.i.d. from P_n (subsample m points without replacement from X_1, ..., X_n), compute the estimate on each resample (subsample), form the empirical distribution Q_{n,m}* of the computed estimates, approximate ξ(Q_m(P)) ≈ ξ(Q_{n,m}*), and apply an analytical correction to in turn approximate ξ(Q_n(P)). This final analytical correction uses prior knowledge of the convergence rate of θ̂_n as n increases and is necessary because each estimate is computed based on only m rather than n points.

These procedures have a more favorable computational profile than the bootstrap, as they only require repeated estimator computation on smaller sets of data. However, they require knowledge and explicit use of the convergence rate of θ̂_n, and, as we show in our empirical investigation below, their success is sensitive to the choice of m. While schemes have been proposed for automatic selection of an optimal value of m (Bickel & Sakov, 2008), they require significantly greater computation, which would eliminate any computational gains. It is also worth noting that some work on the m out of n bootstrap has explicitly sought to reduce computational costs using two different values of m in conjunction with extrapolation (Bickel & Yahav, 1988; Bickel & Sakov, 2002). However, these approaches explicitly use series expansions of the cdf values of the estimator's sampling distribution and hence are less automatically usable; they also require execution of the m out of n bootstrap for multiple different values of m.
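As a concrete illustration, here is a hedged sketch of the m out of n bootstrap for the special case in which ξ is a standard error and θ̂_n is assumed to converge at the usual n^{-1/2} rate, so that the analytical correction reduces to a √(m/n) rescaling; other quality measures and convergence rates require different corrections, and the estimator callable is a hypothetical placeholder.

```python
import numpy as np

def m_out_of_n_stderr(X, estimator, m, num_resamples, seed=None):
    """Sketch of the m out of n bootstrap when xi is a standard error.

    Assumes theta_hat_n converges at the usual n^(-1/2) rate, so the
    standard error computed from size-m resamples is rescaled by
    sqrt(m / n); other quality measures need their own corrections.
    `estimator` is a hypothetical placeholder for theta.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    estimates = []
    for _ in range(num_resamples):
        idx = rng.integers(0, n, size=m)  # resample m < n points i.i.d. from P_n
        estimates.append(estimator(X[idx]))
    se_m = np.std(np.asarray(estimates), axis=0, ddof=1)
    return np.sqrt(m / n) * se_m          # analytical correction back to scale n
```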
Algorithm 1 Bag of Little Bootstraps (BLB)

Input: Data X_1, ..., X_n
  θ: estimator of interest
  ξ: estimator quality assessment
  b: subset size
  s: number of sampled subsets
  r: number of Monte Carlo iterations
Output: An estimate of ξ(Q_n(P))

for j ← 1 to s do
  // Subsample the data
  Randomly sample a set I = {i_1, ..., i_b} of b indices from {1, ..., n} without replacement
  [or, choose I to be a disjoint subset of size b from a predefined random partition of {1, ..., n}]
  // Approximate ξ(Q_n(P_{n,b}^{(j)}))
  for k ← 1 to r do
    Sample (n_1, ..., n_b) ∼ Multinomial(n, 1_b/b)
    P_{n,k}* ← n^{-1} ∑_{a=1}^{b} n_a δ_{X_{i_a}}
    θ̂_{n,k}* ← θ(P_{n,k}*)
  end for
  Q_{n,j}* ← r^{-1} ∑_{k=1}^{r} δ_{θ̂_{n,k}*}
  ξ_{n,j}* ← ξ(Q_{n,j}*)
end for
// Average the values ξ(Q_n(P_{n,b}^{(j)})) computed for the different data subsets
return s^{-1} ∑_{j=1}^{s} ξ_{n,j}*

4. Bag of Little Bootstraps (BLB)

The Bag of Little Bootstraps (Algorithm 1) functions by averaging the results of bootstrapping multiple small subsets of X_1, ..., X_n. More formally, given a subset size b < n, BLB samples s subsets of size b from the original n data points, uniformly at random (one can also impose the constraint that the subsets be disjoint). Let I_1, ..., I_s ⊂ {1, ..., n} be the corresponding index multisets (note that |I_j| = b, ∀j), and let P_{n,b}^{(j)} = b^{-1} ∑_{i∈I_j} δ_{X_i} be the empirical distribution corresponding to subset j. BLB's estimate of ξ(Q_n(P)) is then given by s^{-1} ∑_{j=1}^{s} ξ(Q_n(P_{n,b}^{(j)})). Although the terms ξ(Q_n(P_{n,b}^{(j)})) cannot be computed analytically in general, they are computed numerically by the inner loop in Algorithm 1 via Monte Carlo approximation in the manner of the bootstrap: for each term j, we repeatedly resample n points i.i.d. from P_{n,b}^{(j)}, compute the estimate on each resample, form the empirical distribution Q_{n,j}* of the computed estimates, and approximate ξ(Q_n(P_{n,b}^{(j)})) ≈ ξ(Q_{n,j}*).

To realize the substantial computational benefits afforded by BLB, we utilize the following crucial fact: each BLB resample, despite having nominal size n, contains at most b distinct data points. In particular, to generate each resample, it suffices to draw a vector of counts from an n-trial uniform multinomial distribution over b objects. We can then represent each resample by simply maintaining the at most b distinct points present within it, accompanied by corresponding sampled counts (i.e., each resample requires only storage space in O(b)). In turn, if the estimator can work directly with this weighted data representation, then its computational requirements—with respect to both time and storage space—scale only in b, rather than n. Fortunately, this property does indeed hold for many if not most commonly used estimators, including M-estimators such as linear and kernel regression, logistic regression, and Support Vector Machines, among many others.
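The following Python/NumPy sketch shows how Algorithm 1 and the weighted-data representation fit together; the estimator and quality callables are hypothetical placeholders (anything accepting the weighted representation, e.g., a weighted logistic regression fit and a confidence-interval computation, could be substituted).

```python
import numpy as np

def blb(X, estimator, quality, b, s, r, seed=None):
    """Sketch of Algorithm 1 using the weighted-data representation.

    `estimator(points, weights)` must accept the b distinct points of a
    subsample together with multinomial counts summing to n, and
    `quality(estimates)` maps the r resulting estimates to an estimate
    of xi; both callables are hypothetical placeholders.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    xi_values = []
    for _ in range(s):
        # Subsample b of the n points without replacement.
        subset = X[rng.choice(n, size=b, replace=False)]
        estimates = []
        for _ in range(r):
            # A resample of nominal size n is just a vector of counts drawn
            # from an n-trial uniform multinomial over the b points (O(b) storage).
            counts = rng.multinomial(n, np.full(b, 1.0 / b))
            estimates.append(estimator(subset, counts))
        # Approximate xi(Q_n(P^(j)_{n,b})) from this subsample's estimates.
        xi_values.append(quality(np.asarray(estimates)))
    # Average the per-subsample quality assessments.
    return np.mean(np.asarray(xi_values), axis=0)
```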
As a result, BLB only requires repeated computation on small subsets of the original dataset and avoids the bootstrap's problematic need for repeated computation of estimates on resamples having size comparable to that of the original dataset. A simple and standard calculation (Efron & Tibshirani, 1993) shows that each bootstrap resample contains approximately 0.632n distinct points, which is large if n is large. In contrast, as discussed above, each BLB resample contains at most b distinct points, and b can be chosen to be much smaller than n or 0.632n. For example, we might take b = n^γ where γ ∈ [0.5, 1]. More concretely, if n = 1,000,000, then each bootstrap resample would contain approximately 632,000 distinct points, whereas with b = n^0.6 each BLB subsample and resample would contain at most 3,981 distinct points. If each data point occupies 1 MB of storage space, then the original dataset would occupy 1 TB, a bootstrap resample would occupy approximately 632 GB, and each BLB subsample or resample would occupy at most 4 GB.

BLB thus has a significantly more favorable computational profile than the bootstrap. As seen in subsequent sections, our procedure typically requires less total computation to reach comparably high accuracy (fairly modest values of s and r suffice); is significantly more amenable to implementation on distributed and parallel computing architectures which are often used to process large datasets; maintains the favorable statistical properties of the bootstrap; and is more robust than the m out of n bootstrap and subsampling to the choice of subset size.

5. Statistical Correctness

We show via both a simulation study and theoretical analysis that BLB shares the favorable statistical performance of the bootstrap while also being consistently more robust than the m out of n bootstrap and subsampling to the choice of b. Here we present a representative summary of our investigation of statistical correctness; see Kleiner et al. (2012) for more detail.

We investigate the empirical performance characteristics of BLB and compare to existing methods via experiments on different simulated datasets and estimation tasks. Use of simulated data is necessary here because it allows knowledge of Q_n(P) and hence ξ(Q_n(P)) (i.e., ground truth). We consider two different settings: regression and classification. For both settings, the data have the form X_i = (X̃_i, Y_i) ∼ P, i.i.d. for i = 1, ..., n, where X̃_i ∈ R^d; Y_i ∈ R for regression, whereas Y_i ∈ {0, 1} for classification. We use n = 20,000 for the plots shown, and d is set to 100 for regression and 10 for classification. In each case, θ̂_n estimates a parameter vector in R^d (via either least squares or logistic regression with Newton's method, all implemented in MATLAB) for a linear or generalized linear model of the mapping between X̃_i and Y_i. We define ξ as computing a set of marginal 95% confidence intervals, one for each element of the estimated parameter vector (averaging across ξ's consists of averaging element-wise interval boundaries).
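The paper does not spell out the exact interval construction; one simple percentile-based realization of ξ, together with the relative width deviation used for the evaluation described below, might look as follows (the percentile form is an assumption for illustration).

```python
import numpy as np

def marginal_percentile_cis(estimates, level=0.95):
    """Form a marginal confidence interval for each parameter component.

    `estimates` is an (r, d) array of resample estimates; the percentile
    construction is an illustrative choice, not necessarily the one used
    in the paper's experiments.
    """
    alpha = 1.0 - level
    lower = np.percentile(estimates, 100 * alpha / 2, axis=0)
    upper = np.percentile(estimates, 100 * (1 - alpha / 2), axis=0)
    return np.stack([lower, upper], axis=1)  # shape (d, 2)

def mean_relative_width_error(widths, true_widths):
    """Average (across dimensions) relative absolute deviation |c - c_o| / c_o."""
    widths, true_widths = np.asarray(widths), np.asarray(true_widths)
    return float(np.mean(np.abs(widths - true_widths) / true_widths))
```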
To evaluate the various quality assessment procedures on a given estimation task and true underlying data distribution P, we first compute the ground truth ξ(Q_n(P)) based on 2,000 realizations of datasets of size n from P. Then, for an independent dataset realization of size n from the true underlying distribution, we run each quality assessment procedure and record the estimate of ξ(Q_n(P)) produced after each iteration (e.g., after each bootstrap resample or BLB subsample is processed), as well as the cumulative time required to produce that estimate. Every such estimate is evaluated based on the average (across dimensions) relative absolute deviation of its component-wise confidence intervals' widths from the corresponding true widths: given an estimated confidence interval width c and a true width c_o, the relative deviation of c from c_o is defined as |c − c_o|/c_o. We repeat this process on five independent dataset realizations of size n and average the resulting relative errors and corresponding times across these five datasets to obtain a trajectory of relative error versus time for each quality assessment procedure (the trajectories' variances are small relative to the relevant differences between their means, so the variances are not shown in our plots). To maintain consistency of notation, we henceforth refer to the m out of n bootstrap as the b out of n bootstrap. For BLB, the b out of n bootstrap, and subsampling, we consider b = n^γ where γ ∈ {0.5, 0.6, 0.7, 0.8, 0.9}; we use r = 100 in all runs of BLB.

Figure 1 shows a set of representative results for the classification setting, where P generates the components of X̃_i i.i.d. from StudentT(3) and Y_i ∼ Bernoulli((1 + exp(−X̃_i^T 1))^{-1}); we use this representative empirical setting in subsequent sections as well. As seen in the figure, BLB (left plot) succeeds in converging to low relative error more quickly than the bootstrap for b > n^0.5, while converging to somewhat higher relative error for b = n^0.5. BLB is also more robust than the b out of n bootstrap (middle plot), which fails to converge to low relative error for b ≤ n^0.6. In fact, even for b = n^0.5, BLB's performance is substantially superior to that of the b out of n bootstrap. For the aforementioned case in which BLB does not match the relative error of the bootstrap, additional empirical results (right plot) and our theoretical analysis indicate that this discrepancy in relative error diminishes as n increases. Identical evaluation of subsampling (plots not shown) shows that it performs strictly worse than the b out of n bootstrap. Qualitatively similar results also hold in both the classification and regression settings (with the latter generally showing better performance) when P generates X̃_i from Normal, Gamma, or StudentT distributions, and when P uses a nonlinear noisy mapping between X̃_i and Y_i (so that we estimate a misspecified model).


Figure 1. Results for the classification setting with linear data generating distribution and StudentT X̃_i distribution. For both BLB and the b out of n bootstrap (BOFN), b = n^γ, with the value of γ for each trajectory given in the legend. (left and middle) Relative error vs. processing time, with n = 20,000. The left plot shows BLB with the bootstrap (BOOT), and the middle plot shows BOFN. (right) Relative error (after convergence) vs. n for BLB, BOFN, and BOOT.

These experiments illustrate the statistical correctness of BLB, as well as its improved robustness to the choice of b. The results are borne out by our theoretical analysis, which shows that BLB has statistical properties that are identical to those of the bootstrap, under the same conditions that have been used in prior analysis of the bootstrap. In particular, BLB is asymptotically consistent for a broad class of estimators and quality measures (see Kleiner et al. (2012) for proof):

Theorem 1. Suppose that θ is Hadamard differentiable at P tangentially to some subspace, with P and P_{n,b}^{(j)} viewed as maps from some Donsker class F to R such that F_δ is measurable for every δ > 0, where F_δ = {f − g : f, g ∈ F, ρ_P(f − g) < δ} and ρ_P(f) = (P(f − Pf)^2)^{1/2}. Additionally, assume that ξ is continuous in the space of distributions Q with respect to a metric that metrizes weak convergence. Then, up to centering and rescaling of θ̂_n, s^{-1} ∑_{j=1}^{s} ξ(Q_n(P_{n,b}^{(j)})) − ξ(Q_n(P)) → 0 in probability as n → ∞, for any sequence b → ∞ and for any fixed s. Furthermore, the result holds for sequences s → ∞ if E|ξ(Q_n(P_{n,b}^{(j)}))| < ∞.

BLB is also higher-order correct: despite the fact that the procedure only applies the estimator in question to subsets of the full observed dataset, it shares the fast convergence rates of the bootstrap, which permit convergence of the procedure's output at rate O(1/n) rather than the rate of O(1/√n) achieved by asymptotic approximations. To achieve these fast convergence rates, some natural conditions on BLB's hyperparameters are required, for example that b = Ω(√n) and that s be sufficiently large with respect to the variability in the observed data. Quite interestingly, these conditions permit b to be significantly smaller than n, with b/n → 0 as n → ∞. See Kleiner et al. (2012) for our theoretical results on higher-order correctness.

6. Scalability

The experiments of the preceding section, though primarily intended to investigate statistical performance, also provide some insight into computational performance: as seen in Figure 1, when computing serially, BLB generally requires less time, and hence less total computation, than the bootstrap to attain comparably high accuracy. Those results only hint at BLB's superior ability to scale computationally to large datasets, which we now demonstrate in full in the following discussion and via large-scale experiments.

Modern massive datasets often exceed both the processing and storage capabilities of individual processors or compute nodes, thus necessitating the use of parallel and distributed computing architectures. Indeed, in the large data setting, computing a single full-data point estimate often requires simultaneous distributed computation across multiple compute nodes, among which the observed dataset is partitioned. As a result, the scalability of a quality assessment method is closely tied to its ability to effectively utilize such computing resources.
Due to the large size of bootstrap resamples (recall that approximately 63% of data points appear at least once in each resample), the following is the most natural avenue for applying the bootstrap to large-scale data using distributed computing: given data on a cluster of compute nodes, parallelize the estimate computation on each resample across the cluster, and compute on one resample at a time. This approach, while at least potentially feasible, remains quite problematic. Each estimate computation requires the use of an entire cluster of compute nodes, and the bootstrap repeatedly incurs the associated overhead, such as the cost of repeatedly communicating intermediate data among nodes. Additionally, many cluster computing systems in widespread use (e.g., Hadoop MapReduce) store data only on disk, rather than in memory, due to physical size constraints (if the data size exceeds the amount of available memory) or architectural constraints (e.g., the need for fault tolerance). In that case, the bootstrap incurs the extreme costs associated with repeatedly reading a very large dataset from disk; though disk read costs may be acceptable when (slowly) computing only a single point estimate, they easily become prohibitive when computing many estimates on one hundred or more resamples.
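The roughly 63% figure quoted above (equivalently, the 0.632n distinct points cited in Section 4) follows from the standard calculation sketched here.

```latex
% Chance that a given point appears at least once in a size-n resample drawn
% with replacement, and the resulting expected number of distinct points:
\Pr[X_i \text{ appears at least once}]
   = 1 - \left(1 - \tfrac{1}{n}\right)^{n}
   \;\xrightarrow{\;n \to \infty\;}\; 1 - e^{-1} \approx 0.632,
\qquad
\mathbb{E}[\text{number of distinct points}]
   = n \left(1 - \left(1 - \tfrac{1}{n}\right)^{n}\right) \approx 0.632\, n .
```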

In contrast, BLB permits computation on multiple (or even all) subsamples and resamples simultaneously in parallel, allowing for straightforward and effective distributed and parallel implementations which enable effective scalability and large computational gains. Because BLB subsamples and resamples can be significantly smaller than the original dataset, they can be transferred to, stored by, and processed on individual (or very small sets of) compute nodes. For example, we can naturally leverage modern hierarchical distributed architectures by distributing subsamples to different compute nodes and subsequently using intra-node parallelism to compute across different resamples generated from the same subsample. Note that generation and distribution of the subsamples requires only a single pass over the full dataset (i.e., only a single read of the full dataset from disk, if it is stored only on disk), after which all required data (i.e., the subsamples) can be stored in memory. Beyond this significant architectural benefit, we also achieve implementation and algorithmic benefits: we do not need to parallelize the estimator computation internally, as BLB uses the available parallelism to compute on multiple resamples simultaneously, and exposing the estimator to only b rather than n distinct points significantly reduces the computational cost of estimation, particularly if the estimator computation scales super-linearly.

We now empirically substantiate the preceding discussion via large-scale experiments performed on Amazon EC2. We use the representative experimental setup of Figure 1, but now with d = 3,000, n = 6,000,000, Y_i ∼ Bernoulli((1 + exp(−X̃_i^T 1/√d))^{-1}), and the logistic regression implemented using L-BFGS. The size of a full observed dataset in this setting is thus approximately 150 GB. We compare the performance of BLB and the bootstrap (now omitting the m out of n bootstrap and subsampling due to the shortcomings illustrated in Section 5), both implemented as described above (we parallelize the estimator computation for the bootstrap by simply distributing gradient computations via MapReduce) using the Spark cluster computing framework (Spark, 2012), which provides the ability to either read data from disk or cache it in memory (provided that sufficient memory is available) for faster repeated access. For BLB, we use r = 50, s = 5, and b = n^0.7. Due to the larger data size and use of a distributed architecture, we now implement the bootstrap using Poisson resampling, and we compute ground truth using 200 independent dataset realizations.

Figure 2. Large-scale results for BLB and the bootstrap (BOOT) on 150 GB of data. Because BLB is fully parallelized across subsamples, we only show the relative error and total processing time of its final output. (left) Data stored on disk. (right) Data cached in memory.

In the left plot of Figure 2, we show results obtained using a cluster of 10 worker nodes, each having 6 GB of memory and 8 compute cores; thus, the total memory of the cluster is 60 GB, and the full dataset (150 GB) can only be stored on disk. As expected, the time required by the bootstrap to produce even a low-accuracy output is prohibitively high, while BLB provides a high-accuracy output quite quickly, in less than the time required to process even a single bootstrap resample. In the right plot of Figure 2, we show results obtained using a cluster of 20 worker nodes, each having 12 GB of memory and 4 compute cores; thus, the total memory of the cluster is 240 GB, and we cache the full dataset in memory for fast repeated access. Unsurprisingly, the bootstrap's performance improves significantly with respect to the previous experiment. However, the performance of BLB (which also improves) remains substantially better than that of the bootstrap.

Thus, relative to the bootstrap, BLB both allows more natural and effective use of parallel and distributed computational resources and decreases the total computational cost of assessing estimator quality. Finally, it is worth noting that even if only a single compute node is available, BLB allows the following somewhat counterintuitive possibility: even if it is prohibitive to actually compute a point estimate for the full observed data using a single compute node (because the full dataset is large), it may still be possible to efficiently assess such a point estimate's quality using only a single compute node by processing one subsample (and the associated resamples) at a time.
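To illustrate the parallelization pattern (one subsample per worker, with all resamples for that subsample handled inside the worker), here is a single-machine sketch using Python's multiprocessing module; the estimator (a weighted mean) and quality measure (a component-wise standard error) are deliberately simple stand-ins, not the paper's logistic-regression/Spark implementation.

```python
from multiprocessing import Pool
import numpy as np

def estimator(subset, counts):
    """Hypothetical weighted estimator: a weighted mean of the subset."""
    return np.average(subset, axis=0, weights=counts)

def quality(estimates):
    """Hypothetical xi: component-wise standard error of the estimates."""
    return np.std(estimates, axis=0, ddof=1)

def process_subsample(args):
    """Bootstrap one subsample; all r resamples touch only its b distinct points."""
    subset, n, r, seed = args
    rng = np.random.default_rng(seed)
    b = len(subset)
    estimates = [estimator(subset, rng.multinomial(n, np.full(b, 1.0 / b)))
                 for _ in range(r)]
    return quality(np.asarray(estimates))

def parallel_blb(X, b, s, r, workers=4, seed=0):
    """Distribute BLB's s subsamples across worker processes."""
    n = len(X)
    rng = np.random.default_rng(seed)
    jobs = [(X[rng.choice(n, size=b, replace=False)], n, r, child_seed)
            for child_seed in rng.integers(0, 2**32, size=s)]
    with Pool(workers) as pool:
        xi_values = pool.map(process_subsample, jobs)  # subsamples in parallel
    return np.mean(np.asarray(xi_values), axis=0)

if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(100_000, 5))
    print(parallel_blb(X, b=int(100_000 ** 0.7), s=5, r=50))
```

With only one worker, the same loop processes a single subsample (and its resamples) at a time, which is the single-node possibility noted above.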

7. Hyperparameter Selection

Like existing resampling-based procedures such as the bootstrap, BLB requires the specification of hyperparameters controlling the number of subsamples and resamples processed. Setting such hyperparameters to be sufficiently large is necessary to ensure good statistical performance; however, setting them to be unnecessarily large results in wasted computation. Prior work on the bootstrap and related procedures generally assumes that a procedure's user will simply select a priori a large, constant number of resamples to be processed (with the exception of Tibshirani (1985), who does not provide a general solution to this issue). However, this approach reduces the level of automation of these methods and can be quite inefficient in the large data setting.

Thus, we now examine the dependence of BLB's performance on the choice of r and s, with the goal of better understanding their influence and providing guidance and adaptive (i.e., more automatic) methods for their selection. The left plot of Figure 3 illustrates the influence of r and s, giving the relative errors achieved by BLB with b = n^0.7 for different r, s pairs in the representative empirical setting described in Section 5. In particular, note that for all but the smallest values of r and s, it is possible to choose these values independently such that BLB achieves low relative error; in this case, choosing s ≥ 3, r ≥ 50 is sufficient.

While these results are useful and provide some guidance, we expect the minimal sufficient values of r and s to change based on the identity of ξ (e.g., we expect a confidence interval to be harder to compute and hence to require larger r than a standard error) and the properties of the underlying data. Thus, to help avoid the need to choose r and s conservatively, we now provide a means for adaptive hyperparameter selection, which we validate empirically.

Concretely, to select r adaptively in the inner loop of Algorithm 1, we propose, for each subsample j, to continue to process resamples and update ξ_{n,j}* until it has ceased to change significantly. Noting that the values θ̂_{n,k}* used to compute ξ_{n,j}* are conditionally i.i.d. given a subsample, for most forms of ξ the series of computed ξ_{n,j}* values will be well behaved and will converge (in many cases at rate O(1/√r), though with unknown constant) to a constant target value as more resamples are processed. Therefore, it suffices to process resamples (i.e., to increase r) until we are satisfied that ξ_{n,j}* has ceased to fluctuate significantly; we propose using Algorithm 2 to assess this convergence. The same scheme can be used to select s adaptively by processing more subsamples (i.e., increasing s) until BLB's output value s^{-1} ∑_{j=1}^{s} ξ_{n,j}* has stabilized; in this case, one can simultaneously also choose r adaptively and independently for each subsample.

Algorithm 2 Convergence Assessment

Input: A series z^(1), z^(2), ..., z^(t) ∈ R^d
  w ∈ N: window size (< t)
  ε ∈ R: target relative error (> 0)
Output: true iff the input series is deemed to have ceased to fluctuate beyond the target relative error

if ∀j ∈ [1, w], (1/d) ∑_{i=1}^{d} |z_i^(t−j) − z_i^(t)| / |z_i^(t)| ≤ ε then
  return true
else
  return false
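A direct transcription of Algorithm 2 into Python might look as follows; the handling of series shorter than the window is an assumption, since the algorithm itself presumes w < t.

```python
import numpy as np

def has_converged(series, window, epsilon):
    """Algorithm 2: has the series z^(1), ..., z^(t) of d-dimensional values
    ceased to fluctuate beyond the target relative error epsilon?"""
    series = np.asarray(series, dtype=float)
    if len(series) <= window:
        return False  # not enough history yet (assumed behavior; w < t required)
    current = series[-1]
    # Compare z^(t) against each of the previous w values z^(t-1), ..., z^(t-w).
    for previous in series[-(window + 1):-1]:
        if np.mean(np.abs(previous - current) / np.abs(current)) > epsilon:
            return False
    return True
```

In the adaptive scheme, one keeps increasing r (or s), appends the updated ξ estimate to the series after each new resample (or subsample), and stops as soon as has_converged returns True.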
The middle plot of Figure 3 shows the results of applying such adaptive hyperparameter selection. For selection of r we use ε = 0.05 and w = 20, and for selection of s we use ε = 0.05 and w = 3. As seen in the plot, the adaptive hyperparameter selection allows BLB to cease computing shortly after it has converged (to low relative error), limiting the amount of unnecessary computation that is performed. Though selected a priori, ε and w are more intuitively interpretable and less dependent on the details of ξ and the underlying data generating distribution than r and s. Indeed, the aforementioned specific values of ε and w yield results of comparably good quality when also used for a variety of other synthetic and real data generation settings (see Section 8 below), as well as for different forms of ξ (see the righthand table in Figure 3, which shows that smaller values of r are selected when ξ is easier to compute). Thus, our scheme significantly helps to alleviate the burden of hyperparameter selection.

Automatic selection of a value of b in a computationally efficient manner is more difficult due to the inability to reuse computations performed for different values of b. One could consider similarly increasing b from some small value until the output of BLB stabilizes; devising a means of doing so efficiently is the subject of future work. Nonetheless, based on our fairly extensive empirical investigation, it seems that b = n^0.7 is a reasonable and effective choice in many situations.

8. Real Data

We now present the results of applying BLB to several different real datasets. In this setting, given the absence of ground truth, it is not possible to objectively evaluate estimator quality assessment methods' statistical correctness. As a result, we are reduced to comparing the outputs of different methods to each other; we also now report the average (across dimensions) absolute confidence interval width produced by each procedure, rather than relative error.

[Figure 3, right panel: statistics of the values of r selected by BLB's adaptive hyperparameter selection]

        CI     STDERR
mean    89.6   67.7
min     50     40
max     150    110

Figure 3. Results for BLB hyperparameter selection in the empirical setting of Figure 1, with b = n^0.7. (left) Relative error achieved by BLB for different values of r and s. (middle) Relative error vs. processing time (without parallelization) for BLB using adaptive selection of r and s (resulting stopping times of BLB trajectories are marked by squares) and the bootstrap (BOOT). (right) Statistics of the different values of r selected by BLB's adaptive hyperparameter selection (across multiple subsamples) when ξ is either our usual confidence interval-based quality measure (CI), or a component-wise standard error (STDERR); the relative errors achieved by BLB and the bootstrap are comparable in both cases.

Figure 4 shows results for BLB, the bootstrap, and the b out of n bootstrap on the UCI connect4 dataset (logistic regression, d = 42, n = 67,557). We select the BLB hyperparameters r and s using the adaptive method described in the previous section. Notably, the outputs of BLB for all values of b considered, and the output of the bootstrap, are tightly clustered around the same value; additionally, as expected, BLB converges more quickly than the bootstrap. However, the values produced by the b out of n bootstrap vary significantly as b changes, thus further highlighting this procedure's lack of robustness. We have obtained qualitatively similar results on six additional UCI datasets (ct-slice, magic, millionsong, parkinsons, poker, shuttle) with different estimators (linear regression and logistic regression) and a range of different values of n and d; see Kleiner et al. (2012) for additional plots.

Figure 4. Average absolute confidence interval width vs. processing time on the UCI connect4 dataset (logistic regression, d = 42, n = 67,557). (left) Results for BLB (with adaptive hyperparameter selection) and the bootstrap (BOOT). (right) Results for the b out of n bootstrap (BOFN).

9. Conclusion

We have presented a new procedure, BLB, which provides a powerful new alternative for automatic, accurate assessment of estimator quality that is well suited to large-scale data and modern parallel and distributed computing architectures.

Acknowledgments

This research is supported in part by an NSF CISE Expeditions award, gifts from Google, SAP, Amazon Web Services, Blue Goji, Cisco, Cloudera, Ericsson, General Electric, Hewlett Packard, Huawei, Intel, MarkLogic, Microsoft, NetApp, Oracle, Quanta, Splunk, VMware and by DARPA (contract #FA8650-11-C-7136). Ameet Talwalkar was supported by NSF award No. 1122732.

References

Bickel, P. J. and Sakov, A. Extrapolation and the bootstrap. Sankhya: The Indian Journal of Statistics, 64:640–652, 2002.

Bickel, P. J. and Sakov, A. On the choice of m in the m out of n bootstrap and confidence bounds for extrema. Statistica Sinica, 18:967–985, 2008.

Bickel, P. J. and Yahav, J. A. Richardson extrapolation and the bootstrap. Journal of the American Statistical Association, 83(402):387–393, 1988.

Bickel, P. J., Gotze, F., and van Zwet, W. Resampling fewer than n observations: Gains, losses, and remedies for losses. Statistica Sinica, 7:1–31, 1997.

Efron, B. Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7(1):1–26, 1979.

Efron, B. More efficient bootstrap computations. Journal of the American Statistical Association, 85(409):79–89, 1988.

Efron, B. and Tibshirani, R. An Introduction to the Bootstrap. Chapman and Hall, 1993.

Kleiner, A., Talwalkar, A., Sarkar, P., and Jordan, M. I. A scalable bootstrap for massive data, May 2012. URL http://arxiv.org/abs/1112.5016.

Politis, D., Romano, J., and Wolf, M. Subsampling. Springer, 1999.

Spark. Spark cluster computing framework, May 2012. URL http://www.spark-project.org.

Tibshirani, R. How many bootstraps? Technical report, Department of Statistics, Stanford University, Stanford, CA, 1985.