Approximate Slice Sampling for Bayesian Posterior Inference
Christopher DuBois∗ (GraphLab, Inc.), Anoop Korattikara∗ (Dept. of Computer Science, UC Irvine), Max Welling (Informatics Institute, University of Amsterdam), Padhraic Smyth (Dept. of Computer Science, UC Irvine)

∗ The first two authors contributed equally. Appearing in Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copyright 2014 by the authors.

Abstract

In this paper, we advance the theory of large scale Bayesian posterior inference by introducing a new approximate slice sampler that uses only small mini-batches of data in every iteration. While this introduces a bias in the stationary distribution, the computational savings allow us to draw more samples in a given amount of time and reduce sampling variance. We empirically verify on three different models that the approximate slice sampling algorithm can significantly outperform a traditional slice sampler if we are allowed only a fixed amount of computing time for our simulations.

1 Introduction

In a time where the amount of data is expanding at an exponential rate, one might argue that Bayesian posterior inference is neither necessary nor a luxury that one can afford computationally. However, the lesson learned from the "deep learning" literature is that the best performing models in the context of big data are models with very many adjustable degrees of freedom (on the order of tens of billions of parameters in some cases). In line with the nonparametric Bayesian philosophy, real data is usually more complex than any finitely parameterized model can describe, leading to the conclusion that even in the context of big data the best performing model will be the most complex model that does not overfit the data. To find that boundary between under- and overfitting, it is most convenient to over-parameterize a model and regularize the model capacity back through methods such as regularization penalties, early stopping, adding noise to the input features, or, as is now popular in the context of deep learning, dropout. We believe deep learning is so successful because it provides the means to train arbitrarily complex models on very large datasets that can be effectively regularized.

Methods for Bayesian posterior inference provide a principled alternative to frequentist regularization techniques. The workhorse for approximate posterior inference has long been MCMC, which unfortunately is not (yet) quite up to the task of dealing with very large datasets. The main reason is that every iteration of sampling requires O(N) calculations to decide whether a proposed parameter is accepted or not. Frequentist methods based on stochastic gradients are orders of magnitude faster because they effectively exploit the fact that, far away from convergence, the information in the data about how to improve the model is highly redundant. Therefore, only a few data-cases need to be queried in order to make a good guess on how to improve the model. Thus, a key question is whether it is possible to design MCMC algorithms that behave more like stochastic gradient based algorithms.

Some methods have appeared recently [1, 13, 5] that use stochastic minibatches of the data at every iteration rather than the entire dataset. These methods exploit an inherent trade-off between bias (sampling from the wrong distribution asymptotically) and variance (error due to sampling noise). Given a limit on the total sampling time, it may be beneficial to use a biased procedure if it allows you to sample faster, resulting in lower error due to variance.

The work presented here is an extension of [5], where sequential hypothesis testing was used to adaptively choose the size of the mini-batch in an approximate Metropolis-Hastings step. The size of the mini-batch is just large enough so that the proposed sample is accepted or rejected with sufficient confidence. That work used a random walk kernel, which is well known to mix slowly. Here we show that a similar strategy can be exploited to speed up a slice sampler [10], which is known to have much better mixing behavior than an MCMC sampler based on a random walk kernel.

This paper is organized as follows. First, we review the slice sampling algorithm in Section 2. In Section 3, we discuss the bias-variance trade-off for MCMC algorithms. Then, we develop an approximate slice sampler based on stochastic minibatches in Section 4. In Section 5 we test our algorithm on a number of models, including L1-regularized linear regression, multinomial regression and logistic regression. We conclude in Section 6 and discuss possible future work.
2 Slice Sampling

Slice sampling [10] is an MCMC method for sampling from a (potentially unnormalized) density P0(θ), where θ ∈ Θ ⊂ R^D, by sampling uniformly from the (D + 1)-dimensional region under the density. In this section, we will first introduce slice sampling for univariate distributions and then describe how it can be easily extended to the multivariate case.

We start by defining an auxiliary variable h, and a joint distribution p(θ, h) that is uniform over the region under the density (i.e. the set {(θ, h) : θ ∈ Θ, 0 < h < P0(θ)}) and 0 elsewhere. Since marginalizing p(θ, h) over h yields P0(θ), sampling from p(θ, h) and discarding h is equivalent to sampling from P0(θ). However, sampling from p(θ, h) is not straightforward and is usually implemented as an MCMC method where in every iteration t, we first sample ht ∼ p(h|θt−1) and then sample θt ∼ p(θ|ht).

Sampling ht ∼ p(h|θt−1) is trivial since p(h|θt−1) is just Uniform[0, P0(θt−1)]. For later convenience, we will introduce the uniform random variable ut ∼ Uniform[0, 1], so that ht can be defined deterministically as ht = ut P0(θt−1). Now, p(θ|ht) is a uniform distribution over the 'slice' S[ut, θt−1] = {θ : P0(θ) > ut P0(θt−1)}. Sampling θt ∼ p(θ|ht) is hard because of the difficulty in finding all segments of the slice in a finite amount of time.
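To make the slice-membership condition above concrete, the test P0(θ) > ut P0(θt−1) is, in practice, typically evaluated in log space for numerical stability. The sketch below is our own illustration of that test, not the paper's code; the name log_p0 (the log of the unnormalized density) is an assumption for this example. This is the check that the OnSlice procedure in Algorithm 1 (below) carries out at each evaluation point.

```python
import numpy as np

def on_slice(theta, theta_prev, u, log_p0):
    """Test whether theta lies on the slice S[u, theta_prev],
    i.e. whether P0(theta) > u * P0(theta_prev), evaluated in log space."""
    return log_p0(theta) > np.log(u) + log_p0(theta_prev)

# Example with a standard Gaussian (up to a constant): log P0(x) = -x**2 / 2
log_p0 = lambda x: -0.5 * x ** 2
print(on_slice(0.5, 1.0, u=0.7, log_p0=log_p0))  # True: P0(0.5) > 0.7 * P0(1.0)
```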
The authors of [10] developed a method where, instead of sampling from p(θ|ht) directly, a 'stepping-out' procedure is used to find an interval (L, R) around θt−1 that includes most or all of the slice, and θt is sampled uniformly from the part of (L, R) that is on S[ut, θt−1]. Pseudo code for one iteration of slice sampling is shown in Algorithm 1. A pictorial illustration is given in Figure 1.

The stepping-out procedure involves placing a window (L, R) of width w around θt−1 and widening the interval by stepping away from θt−1 in opposite directions until L and R are no longer in the slice. The maximum number of step-outs can be limited by a constant Vmax. Note that the initial placement of the interval (L, R) around θt−1 in Algorithm 1 is random, and Vmax is randomly divided into a limit on stepping to the left, VL, and a limit on stepping to the right, VR. These are necessary for detailed balance to hold, as described in [10].

Once a suitable interval containing all or most of the slice has been found, θt is sampled by repeatedly drawing a candidate state θprop uniformly from (L, R) until it falls on the slice S. At this point, we set θt ← θprop and move to the next iteration. If a proposed state θprop does not fall on the slice, the interval (L, R) can be shrunk to (L, θprop) or (θprop, R) such that θt−1 is still included in the new interval. A proof that Algorithm 1 defines a Markov chain with the correct stationary distribution p(θ, h) can be found in [10].

Algorithm 1 Slice Sampling
Input: θt−1, w, Vmax
Output: θt
  Draw ut ∼ Uniform[0, 1]
  // Pick an initial interval randomly:
  Draw w′ ∼ Uniform[0, 1]
  Initialize L ← θt−1 − w′w and R ← L + w
  // Determine maximum steps in each direction:
  Draw v ∼ Uniform[0, 1]
  Set VL ← Floor(Vmax v) and VR ← Vmax − 1 − VL
  // Obtain boundaries of slice:
  while VL > 0 and OnSlice(L, θt−1, ut) do
    VL ← VL − 1
    L ← L − w
  end while
  while VR > 0 and OnSlice(R, θt−1, ut) do
    VR ← VR − 1
    R ← R + w
  end while
  // Sample until proposal is on slice:
  loop
    Draw η ∼ Uniform[0, 1]
    θprop ← L + η(R − L)
    if OnSlice(θprop, θt−1, ut) then
      θt ← θprop
      break
    end if
    if θprop < θt−1 then
      L ← θprop
    else
      R ← θprop
    end if
  end loop

The algorithm uses Procedure OnSlice, both during the stepping-out phase and the rejection sampling phase, to check whether a point (L, R or θprop) lies on the slice S[ut, θt−1]. In a practical implementation,

Figure 1: Illustration of the slice sampling procedure. Left: A "slice" is chosen at a height h and with endpoints (L, R) that both lie above the unnormalized density (the curved line). Right: We sample proposed locations θprop until we obtain a sample on the slice (in green), which is our θt+1. The proposed method replaces each function evaluation (represented as dots) with a sequential hypothesis test.
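To make the stepping-out and shrinkage steps concrete, here is a minimal Python sketch of one iteration of the exact slice sampler, following the pseudo code of Algorithm 1. The function name slice_sample_step and the argument log_p0 (the log of the unnormalized density) are our own naming for illustration; the inline on_slice closure plays the role of the OnSlice procedure.

```python
import numpy as np

def slice_sample_step(theta_prev, w, v_max, log_p0, rng):
    """One iteration of univariate slice sampling with stepping out (cf. Algorithm 1)."""
    # Slice height h = u * P0(theta_prev), kept in log space
    u = rng.uniform()
    log_h = np.log(u) + log_p0(theta_prev)

    def on_slice(x):
        # OnSlice test: is P0(x) > u * P0(theta_prev)?
        return log_p0(x) > log_h

    # Pick an initial interval of width w randomly around theta_prev
    L = theta_prev - rng.uniform() * w
    R = L + w

    # Randomly divide the maximum number of step-outs between the two directions
    V_L = int(np.floor(v_max * rng.uniform()))
    V_R = v_max - 1 - V_L

    # Step out until the endpoints leave the slice (or the budget runs out)
    while V_L > 0 and on_slice(L):
        V_L -= 1
        L -= w
    while V_R > 0 and on_slice(R):
        V_R -= 1
        R += w

    # Propose uniformly from (L, R); shrink the interval on each rejection
    while True:
        theta_prop = L + rng.uniform() * (R - L)
        if on_slice(theta_prop):
            return theta_prop
        if theta_prop < theta_prev:
            L = theta_prop
        else:
            R = theta_prop
```

For example, with log_p0 = lambda x: -0.5 * x ** 2 and rng = np.random.default_rng(0), repeatedly applying theta = slice_sample_step(theta, w=1.0, v_max=10, log_p0=log_p0, rng=rng) yields a chain targeting a standard Gaussian. The approximate sampler developed in Section 4 replaces each such density evaluation with a sequential hypothesis test on a mini-batch of the data, as noted in the caption of Figure 1.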