Approximate Slice Sampling for Bayesian Posterior Inference

Total Page:16

File Type:pdf, Size:1020Kb

Approximate Slice Sampling for Bayesian Posterior Inference Approximate Slice Sampling for Bayesian Posterior Inference Christopher DuBois∗ Anoop Korattikara∗ Max Welling Padhraic Smyth GraphLab, Inc. Dept. of Computer Science Informatics Institute Dept. of Computer Science UC Irvine University of Amsterdam UC Irvine Abstract features, or as is now popular in the context of deep learning, dropout. We believe deep learning is so suc- cessful because it provides the means to train arbitrar- In this paper, we advance the theory of large ily complex models on very large datasets that can be scale Bayesian posterior inference by intro- effectively regularized. ducing a new approximate slice sampler that uses only small mini-batches of data in ev- Methods for Bayesian posterior inference provide a ery iteration. While this introduces a bias principled alternative to frequentist regularization in the stationary distribution, the computa- techniques. The workhorse for approximate posterior tional savings allow us to draw more samples inference has long been MCMC, which unfortunately in a given amount of time and reduce sam- is not (yet) quite up to the task of dealing with very pling variance. We empirically verify on three large datasets. The main reason is that every iter- different models that the approximate slice ation of sampling requires O(N) calculations to de- sampling algorithm can significantly outper- cide whether a proposed parameter is accepted or not. form a traditional slice sampler if we are al- Frequentist methods based on stochastic gradients are lowed only a fixed amount of computing time orders of magnitude faster because they effectively ex- for our simulations. ploit the fact that far away from convergence, the in- formation in the data about how to improve the model is highly redundant. Therefore, only a few data-cases 1 Introduction need to be queried in order to make a good guess on how to improve the model. Thus, a key question is In a time where the amount of data is expanding at an whether it is possible to design MCMC algorithms that exponential rate, one might argue that Bayesian pos- behave more like stochastic gradient based algorithms. terior inference is neither necessary nor a luxury that Some methods have appeared recently [1, 13, 5] that one can afford computationally. However, the lesson use stochastic minibatches of the data at every iter- learned from the \deep learning" literature is that the ation rather then the entire dataset. These methods best performing models in the context of big data are exploit an inherent trade-off between bias (sampling models with very many adjustable degrees of freedom from the wrong distribution asymptotically) and vari- (in the order of tens of billions of parameters in some ance (error due to sampling noise). Given a limit on cases). In line with the nonparametric Bayesian phi- the total sampling time, it may be beneficial to use losophy, real data is usually more complex than any a biased procedure if it allows you to sample faster, finitely parameterized model can describe, leading to resulting in lower error due to variance. the conclusion that even in the context of big data the The work presented here is an extension of [5] where best performing model will be the most complex model sequential hypothesis testing was used to adaptively that does not overfit the data. To find that boundary choose the size of the mini-batch in an approximate between under- and overfitting, it is most convenient Metropolis-Hastings step. The size of the mini-batch to over-parameterize a model and regularize the model is just large enough so that the proposed sample is capacity back through methods such as regularization accepted or rejected with sufficient confidence. That penalties, early stopping, adding noise to the input work used a random walk which is well known to be rather slow mixing. Here we show that a similar strat- Appearing in Proceedings of the 17th International Con- egy can be exploited to speed up a slice sampler [10], ference on Artificial Intelligence and Statistics (AISTATS) which is known to have much better mixing behav- 2014, Reykjavik, Iceland. JMLR: W&CP volume 33. Copy- right 2014 by the authors. ∗. The first two authors contributed equally. 185 Approximate Slice Sampling for Bayesian Posterior Inference ior than an MCMC sampler based on a random walk Note that the initial placement of the interval (L; R) kernel. around θt−1 in Algorithm 1 is random, and Vmax is randomly divided into a limit on stepping to the left, This paper is organized as follows. First, we review V , and a limit on stepping to the right, V . These are the slice sampling algorithm in Section 2. In Section L R necessary for detailed balance to hold, as described in 3, we discuss the bias-variance trade-off for MCMC al- [10]. gorithms. Then, we develop an approximate slice sam- pler based on stochastic minibatches in Section 4. In Once a suitable interval containing all or most of the Section 5 we test our algorithm on a number of mod- slice has been found, θt is sampled by repeatedly draw- els, including L1-regularized linear regression, multi- ing a candidate state θprop uniformly from (L; R) until nomial regression and logistic regression. We conclude it falls on the slice S. At this point, we set θt θprop in Section 6 and discuss possible future work. and move to the next iteration. If a proposed state θprop does not fall on the slice, the interval (L; R) can 2 Slice Sampling be shrunk to (L; θprop) or (θprop;R) such that θt−1 is still included in the new interval. A proof that Algo- Slice sampling [10] is an MCMC method for sam- rithm 1 defines a Markov chain with the correct sta- tionary distribution p(θ; h) can be found in [10]. pling from a (potentially unnormalized) density P0(θ), where θ 2 Θ ⊂ RD, by sampling uniformly from the Algorithm 1 Slice Sampling D + 1 dimensional region under the density. In this section, we will first introduce slice sampling for uni- Input: θt−1, w, Vmax variate distributions and then describe how it can be Output: θt easily extended to the multivariate case. Draw ut ∼ Uniform[0; 1] // Pick an initial interval randomly: We start by defining an auxiliary variable h, Draw w ∼ Uniform[0; 1] and a joint distribution ( ) that is uniform p θ; h Initialize L θt−1 − ww and R L + w over the region under the density (i.e. the set // Determine maximum steps in each direction: f(θ; h): θ 2 Θ; 0 < h < P0(θ)g) and 0 elsewhere. Since Draw v ∼ Uniform[0; 1] marginalizing ( ) over yields ( ), sampling p θ; h h P0 θ Set VL Floor(Vmaxv) and VR Vmax − 1 − VL from p(θ; h) and discarding h is equivalent to sam- // Obtain boundaries of slice: pling from ( ). However, sampling from ( ) is P0 θ p θ; h while VL > 0 and OnSlice(L; θt−1; ut) do not straightforward and is usually implemented as an VL VL − 1 MCMC method where in every iteration t, we first L L − w sample ht ∼ p(hjθt−1) and then sample θt ∼ p(θjht). end while while V > 0 and OnSlice(R; θ − ; u ) do Sampling ht ∼ p(hjθt−1) is trivial since p(hjθt−1) R t 1 t V V − 1 is just Uniform[0;P0(θt−1)]. For later convenience, R R R R + w we will introduce the uniform random variable ut ∼ end while Uniform[0; 1], so that ht can be defined determin- // Sample until proposal is on slice: istically as ht = utP0(θt−1). Now, p(θjht) is a loop uniform distribution over the `slice' S[ut; θt−1] = Draw η ∼ Uniform[0; 1] fθ : P0(θ) > utP0(θt−1)g. Sampling θt ∼ p(θjht) is hard because of the difficulty in finding all segments θprop L + η(R − L) of the slice in a finite amount of time. if OnSlice(θprop; θt−1; ut) then θt θprop The authors of [10] developed a method where instead break of sampling from p(θjht) directly, a `stepping-out' pro- end if cedure is used to find an interval ( ) around L; R θt−1 if θprop < θt−1 then that includes most or all of the slice, and is sampled θt L θprop uniformly from the part of (L; R) that is on S[ut; θt−1]. else Pseudo code for one iteration of slice sampling is shown R θprop in Algorithm 1. A pictorial illustration is given in Fig- end if ure 1. end loop The stepping out procedure involves placing a window (L; R) of width w around θt−1 and widening the inter- The algorithm uses Procedure OnSlice, both dur- val by stepping away from θt−1 in opposite directions ing the stepping-out phase and the rejection sampling until L and R are no longer in the slice. The maximum phase, to check whether a point (L, R or θprop) lies number of step-outs can be limited by a constant Vmax. on the slice S[ut; θt−1]. In a practical implementation, 186 Running heading author breaks the line h L ✓t R ✓t ✓t+1 ✓prop Figure 1: Illustration of the slice sampling procedure. Left: A \slice" is chosen at a height h and with endpoints (L; R) that both lie above the unnormalized density (the curved line). Right: We sample proposed locations θprop until we obtain a sample on the slice (in green), which is our θt+1. The proposed method replaces each function evaluation (represented as dots) with a sequential hypothesis test.
Recommended publications
  • Sampling Methods (Gatsby ML1 2017)
    Probabilistic & Unsupervised Learning Sampling Methods Maneesh Sahani [email protected] Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London Term 1, Autumn 2017 Both are often intractable. Deterministic approximations on distributions (factored variational / mean-field; BP; EP) or expectations (Bethe / Kikuchi methods) provide tractability, at the expense of a fixed approximation penalty. An alternative is to represent distributions and compute expectations using randomly generated samples. Results are consistent, often unbiased, and precision can generally be improved to an arbitrary degree by increasing the number of samples. Sampling For inference and learning we need to compute both: I Posterior distributions (on latents and/or parameters) or predictive distributions. I Expectations with respect to these distributions. Deterministic approximations on distributions (factored variational / mean-field; BP; EP) or expectations (Bethe / Kikuchi methods) provide tractability, at the expense of a fixed approximation penalty. An alternative is to represent distributions and compute expectations using randomly generated samples. Results are consistent, often unbiased, and precision can generally be improved to an arbitrary degree by increasing the number of samples. Sampling For inference and learning we need to compute both: I Posterior distributions (on latents and/or parameters) or predictive distributions. I Expectations with respect to these distributions. Both are often intractable. An alternative is to represent distributions and compute expectations using randomly generated samples. Results are consistent, often unbiased, and precision can generally be improved to an arbitrary degree by increasing the number of samples. Sampling For inference and learning we need to compute both: I Posterior distributions (on latents and/or parameters) or predictive distributions.
    [Show full text]
  • Slice Sampling for General Completely Random Measures
    Slice Sampling for General Completely Random Measures Peiyuan Zhu Alexandre Bouchard-Cotˆ e´ Trevor Campbell Department of Statistics University of British Columbia Vancouver, BC V6T 1Z4 Abstract such model can select each category zero or once, the latent categories are called features [1], whereas if each Completely random measures provide a princi- category can be selected with multiplicities, the latent pled approach to creating flexible unsupervised categories are called traits [2]. models, where the number of latent features is Consider for example the problem of modelling movie infinite and the number of features that influ- ratings for a set of users. As a first rough approximation, ence the data grows with the size of the data an analyst may entertain a clustering over the movies set. Due to the infinity the latent features, poste- and hope to automatically infer movie genres. Cluster- rior inference requires either marginalization— ing in this context is limited; users may like or dislike resulting in dependence structures that prevent movies based on many overlapping factors such as genre, efficient computation via parallelization and actor and score preferences. Feature models, in contrast, conjugacy—or finite truncation, which arbi- support inference of these overlapping movie attributes. trarily limits the flexibility of the model. In this paper we present a novel Markov chain As the amount of data increases, one may hope to cap- Monte Carlo algorithm for posterior inference ture increasingly sophisticated patterns. We therefore that adaptively sets the truncation level using want the model to increase its complexity accordingly. auxiliary slice variables, enabling efficient, par- In our movie example, this means uncovering more and allelized computation without sacrificing flexi- more diverse user preference patterns and movie attributes bility.
    [Show full text]
  • Efficient Sampling for Gaussian Linear Regression with Arbitrary Priors
    Efficient sampling for Gaussian linear regression with arbitrary priors P. Richard Hahn ∗ Arizona State University and Jingyu He Booth School of Business, University of Chicago and Hedibert F. Lopes INSPER Institute of Education and Research May 18, 2018 Abstract This paper develops a slice sampler for Bayesian linear regression models with arbitrary priors. The new sampler has two advantages over current approaches. One, it is faster than many custom implementations that rely on auxiliary latent variables, if the number of regressors is large. Two, it can be used with any prior with a density function that can be evaluated up to a normalizing constant, making it ideal for investigating the properties of new shrinkage priors without having to develop custom sampling algorithms. The new sampler takes advantage of the special structure of the linear regression likelihood, allowing it to produce better effective sample size per second than common alternative approaches. Keywords: Bayesian computation, linear regression, shrinkage priors, slice sampling ∗The authors gratefully acknowledge the Booth School of Business for support. 1 1 Introduction This paper develops a computationally efficient posterior sampling algorithm for Bayesian linear regression models with Gaussian errors. Our new approach is motivated by the fact that existing software implementations for Bayesian linear regression do not readily handle problems with large number of observations (hundreds of thousands) and predictors (thou- sands). Moreover, existing sampling algorithms for popular shrinkage priors are bespoke Gibbs samplers based on case-specific latent variable representations. By contrast, the new algorithm does not rely on case-specific auxiliary variable representations, which allows for rapid prototyping of novel shrinkage priors outside the conditionally Gaussian framework.
    [Show full text]
  • Sampling Methods (Gatsby ML1 2015)
    Probabilistic & Unsupervised Learning Sampling Methods Maneesh Sahani [email protected] Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science University College London Term 1, Autumn 2016 Both are often intractable. Deterministic approximations on distributions (factored variational / mean-field; BP; EP) or expectations (Bethe / Kikuchi methods) provide tractability, at the expense of a fixed approximation penalty. An alternative is to represent distributions and compute expectations using randomly generated samples. Results are consistent, often unbiased, and precision can generally be improved to an arbitrary degree by increasing the number of samples. Sampling For inference and learning we need to compute both: I Posterior distributions (on latents and/or parameters) or predictive distributions. I Expectations with respect to these distributions. Deterministic approximations on distributions (factored variational / mean-field; BP; EP) or expectations (Bethe / Kikuchi methods) provide tractability, at the expense of a fixed approximation penalty. An alternative is to represent distributions and compute expectations using randomly generated samples. Results are consistent, often unbiased, and precision can generally be improved to an arbitrary degree by increasing the number of samples. Sampling For inference and learning we need to compute both: I Posterior distributions (on latents and/or parameters) or predictive distributions. I Expectations with respect to these distributions. Both are often intractable. An alternative is to represent distributions and compute expectations using randomly generated samples. Results are consistent, often unbiased, and precision can generally be improved to an arbitrary degree by increasing the number of samples. Sampling For inference and learning we need to compute both: I Posterior distributions (on latents and/or parameters) or predictive distributions.
    [Show full text]
  • Computer Intensive Statistics STAT:7400
    Computer Intensive Statistics STAT:7400 Luke Tierney Spring 2019 Introduction Syllabus and Background Basics • Review the course syllabus http://www.stat.uiowa.edu/˜luke/classes/STAT7400/syllabus.pdf • Fill out info sheets. – name – field – statistics background – computing background Homework • some problems will cover ideas not covered in class • working together is OK • try to work on your own • write-up must be your own • do not use solutions from previous years • submission by GitHub at http://github.uiowa.edu or by Icon at http://icon.uiowa.edu/. 1 Computer Intensive Statistics STAT:7400, Spring 2019 Tierney Project • Find a topic you are interested in. • Written report plus possibly some form of presentation. Ask Questions • Ask questions if you are confused or think a point needs more discussion. • Questions can lead to interesting discussions. 2 Computer Intensive Statistics STAT:7400, Spring 2019 Tierney Computational Tools Computers and Operating Systems • We will use software available on the Linux workstations in the Mathe- matical Sciences labs (Schaeffer 346 in particular). • Most things we will do can be done remotely by using ssh to log into one of the machines in Schaeffer 346 using ssh. These machines are l-lnx2xy.stat.uiowa.edu with xy = 00, 01, 02, . , 19. • You can also access the CLAS Linux systems using a browser at http://fastx.divms.uiowa.edu/ – This connects you to one of several servers. – It is OK to run small jobs on these servers. – For larger jobs you should log into one of the lab machines. • Most of the software we will use is available free for installing on any Linux, Mac OS X, or Windows computer.
    [Show full text]
  • Slice Sampling with Multivariate Steps by Madeleine B. Thompson A
    Slice Sampling with Multivariate Steps by Madeleine B. Thompson A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Statistics University of Toronto Copyright c 2011 by Madeleine B. Thompson ii Abstract Slice Sampling with Multivariate Steps Madeleine B. Thompson Doctor of Philosophy Graduate Department of Statistics University of Toronto 2011 Markov chain Monte Carlo (MCMC) allows statisticians to sample from a wide variety of multidimensional probability distributions. Unfortunately, MCMC is often difficult to use when components of the target distribution are highly correlated or have disparate variances. This thesis presents three results that attempt to address this problem. First, it demonstrates a means for graphical comparison of MCMC methods, which allows re- searchers to compare the behavior of a variety of samplers on a variety of distributions. Second, it presents a collection of new slice-sampling MCMC methods. These methods either adapt globally or use the adaptive crumb framework for sampling with multivariate steps. They perform well with minimal tuning on distributions when popular methods do not. Methods in the first group learn an approximation to the covariance of the tar- get distribution and use its eigendecomposition to take non-axis-aligned steps. Methods in the second group use the gradients at rejected proposed moves to approximate the local shape of the target distribution so that subsequent proposals move more efficiently through the state space. Finally, this thesis explores the scaling of slice sampling with multivariate steps with respect to dimension, resulting in a formula for optimally choos- ing scale tuning parameters.
    [Show full text]
  • Slice Sampling on Hamiltonian Trajectories
    Slice Sampling on Hamiltonian Trajectories Benjamin Bloem-Reddy [email protected] John P. Cunningham [email protected] Department of Statistics, Columbia University Abstract tonian system, and samples a new point from the distribu- tion by simulating a dynamical trajectory from the current Hamiltonian Monte Carlo and slice sampling sample point. With careful tuning, HMC exhibits favorable are amongst the most widely used and stud- mixing properties in many situations, particularly in high ied classes of Markov Chain Monte Carlo sam- dimensions, due to its ability to take large steps in sam- plers. We connect these two methods and present ple space if the dynamics are simulated for long enough. Hamiltonian slice sampling, which allows slice Proper tuning of HMC parameters can be difficult, and sampling to be carried out along Hamiltonian there has been much interest in automating it; examples in- trajectories, or transformations thereof. Hamil- clude Wang et al.(2013); Hoffman & Gelman(2014). Fur- tonian slice sampling clarifies a class of model thermore, better mixing rates associated with longer trajec- priors that induce closed-form slice samplers. tories can be computationally expensive because each sim- More pragmatically, inheriting properties of slice ulation step requires an evaluation or numerical approxi- samplers, it offers advantages over Hamiltonian mation of the gradient of the distribution. Monte Carlo, in that it has fewer tunable hyper- parameters and does not require gradient infor- Similar in objective but different in approach is slice sam- mation. We demonstrate the utility of Hamil- pling, which attempts to sample uniformly from the volume tonian slice sampling out of the box on prob- under the target density.
    [Show full text]
  • A Markov Chain Monte Carlo Version of the Genetic Algorithm Differential Evolution: Easy Bayesian Computing for Real Parameter Spaces
    Stat Comput (2006) 16:239–249 DOI 10.1007/s11222-006-8769-1 A Markov Chain Monte Carlo version of the genetic algorithm Differential Evolution: easy Bayesian computing for real parameter spaces Cajo J. F. Ter Braak Received: April 2004 / Accepted: January 2006 C Springer Science + Business Media, LLC 2006 Abstract Differential Evolution (DE) is a simple genetic al- Keywords Block updating · Evolutionary Monte Carlo · gorithm for numerical optimization in real parameter spaces. Metropolis algorithm · Population Markov Chain Monte In a statistical context one would not just want the optimum Carlo · Simulated Annealing · Simulated Tempering; but also its uncertainty. The uncertainty distribution can be Theophylline Kinetics. obtained by a Bayesian analysis (after specifying prior and likelihood) using Markov Chain Monte Carlo (MCMC) sim- ulation. This paper integrates the essential ideas of DE and 1. Introduction MCMC, resulting in Differential Evolution Markov Chain (DE-MC). DE-MC is a population MCMC algorithm, in This paper combines the genetic algorithm called Differen- which multiple chains are run in parallel. DE-MC solves tial Evolution (DE) (Price and Storn, 1997, Storn and Price, an important problem in MCMC, namely that of choosing an 1997, Price, 1999) for global optimization over real parame- appropriate scale and orientation for the jumping distribu- ter space with Markov Chain Monte Carlo (MCMC) (Gilks tion. In DE-MC the jumps are simply a fixed multiple of the et al., 1996) so as to generate a sample from a target distribu- differences of two random parameter vectors that are cur- tion. In Bayesian analysis the target distribution is typically a rently in the population.
    [Show full text]
  • Slice Sampler Algorithm for Generalized Pareto Distribution
    Slice sampler algorithm for generalized Pareto distribution Mohammad Rostami∗y, Mohd Bakri Adam Yahyaz, Mohamed Hisham Yahyax, Noor Akma Ibrahim{ Abstract In this paper, we developed the slice sampler algorithm for the gener- alized Pareto distribution (GPD) model. Two simulation studies have shown the performance of the peaks over given threshold (POT) and GPD density function on various simulated data sets. The results were compared with another commonly used Markov chain Monte Carlo (MCMC) technique called Metropolis-Hastings algorithm. Based on the results, the slice sampler algorithm provides closer posterior mean values and shorter 95% quantile based credible intervals compared to the Metropolis-Hastings algorithm. Moreover, the slice sampler algo- rithm presents a higher level of stationarity in terms of the scale and shape parameters compared with the Metropolis-Hastings algorithm. Finally, the slice sampler algorithm was employed to estimate the re- turn and risk values of investment in Malaysian gold market. Keywords: extreme value theory, Markov chain Monte Carlo, slice sampler, Metropolis-Hastings algorithm, Bayesian analysis, gold price. 2000 AMS Classification: 62C12, 62F15, 60G70, 62G32, 97K50, 65C40. 1. Introduction The field of extreme value theory (EVT) goes back to 1927, when Fr´echet [1] formulated the functional equation of stability for maxima, which later was solved with some restrictions by Fisher and Tippett [2] and finally by Gnedenko [3] and De Haan [4]. There are two main approaches in the modelling of extreme values. First, under certain conditions, the asymptotic distribution of a series of maxima (minima) can be properly approximated by Gumbel, Weibull and Frechet distributions which have been unified in a generalized form named generalized extreme value (GEV) distribution [5].
    [Show full text]
  • Sampling Algorithms, from Survey Sampling to Monte Carlo Methods: Tutorial and Literature Review
    Sampling Algorithms, from Survey Sampling to Monte Carlo Methods: Tutorial and Literature Review Benyamin Ghojogh* [email protected] Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada Hadi Nekoei* [email protected] MILA (Montreal Institute for Learning Algorithms) – Quebec AI Institute, Montreal, Quebec, Canada Aydin Ghojogh* [email protected] Fakhri Karray [email protected] Department of Electrical and Computer Engineering, Centre for Pattern Analysis and Machine Intelligence, University of Waterloo, Waterloo, ON, Canada Mark Crowley [email protected] Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada * The first three authors contributed equally to this work. Abstract pendent, we cover importance sampling and re- This paper is a tutorial and literature review on jection sampling. For MCMC methods, we cover sampling algorithms. We have two main types Metropolis algorithm, Metropolis-Hastings al- of sampling in statistics. The first type is sur- gorithm, Gibbs sampling, and slice sampling. vey sampling which draws samples from a set or Then, we explain the random walk behaviour of population. The second type is sampling from Monte Carlo methods and more efficient Monte probability distribution where we have a proba- Carlo methods, including Hamiltonian (or hy- bility density or mass function. In this paper, we brid) Monte Carlo, Adler’s overrelaxation, and cover both types of sampling. First, we review ordered overrelaxation. Finally, we summarize some required background on mean squared er- the characteristics, pros, and cons of sampling ror, variance, bias, maximum likelihood estima- methods compared to each other. This paper can tion, Bernoulli, Binomial, and Hypergeometric be useful for different fields of statistics, machine distributions, the Horvitz–Thompson estimator, learning, reinforcement learning, and computa- and the Markov property.
    [Show full text]
  • Download File
    Advances in Bayesian inference and stable optimization for large-scale machine learning problems Francois Fagan Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School of Arts and Sciences COLUMBIA UNIVERSITY 2019 c 2018 Francois Fagan All Rights Reserved ABSTRACT Advances in Bayesian inference and stable optimization for large-scale machine learning problems Francois Fagan A core task in machine learning, and the topic of this thesis, is developing faster and more accurate methods of posterior inference in probabilistic models. The thesis has two components. The first explores using deterministic methods to improve the efficiency of Markov Chain Monte Carlo (MCMC) algorithms. We propose new MCMC algorithms that can use deterministic methods as a \prior" to bias MCMC proposals to be in areas of high posterior density, leading to highly efficient sampling. In Chapter 2 we develop such methods for continuous distributions, and in Chapter 3 for binary distributions. The resulting methods consistently outperform existing state-of-the-art sampling techniques, sometimes by several orders of magnitude. Chapter 4 uses similar ideas as in Chapters 2 and 3, but in the context of modeling the performance of left-handed players in one-on-one interactive sports. The second part of this thesis explores the use of stable stochastic gradient descent (SGD) methods for computing a maximum a posteriori (MAP) estimate in large-scale machine learning problems. In Chapter 5 we propose two such methods for softmax regression. The first is an implementation of Implicit SGD (ISGD), a stable but difficult to implement SGD method, and the second is a new SGD method specifically designed for optimizing a double-sum formulation of the softmax.
    [Show full text]
  • Multivariate Slice Sampling
    Multivariate Slice Sampling A Thesis Submitted to the Faculty of Drexel University by Jingjing Lu in partial ful¯llment of the requirements for the degree of Doctor of Philosophy June 2008 °c Copyright June 2008 Jingjing Lu . All Rights Reserved. Dedications To My beloved family Acknowledgements I would like to express my gratitude to my entire thesis committee: Professor Avijit Banerjee, Professor Thomas Hindelang, Professor Seung-Lae Kim, Pro- fessor Merrill Liechty, Professor Maria Rieders, who were always available and always willing to help. This dissertation could not have been ¯nished without them. Especially, I would like to give my special thanks to my advisor and friend, Professor Liechty, who o®ered me stimulating suggestions and encouragement during my graduate studies. I would like to thank Decision Sciences department and all the faculty and sta® at LeBow College of Business for their help and caring. I thank my family for their support and encouragement all the time. Thank you for always being there for me. Many thanks also go to all my friends, who share a lot of happiness and dreams with me. i Table of Contents List of Figures ................................................................ v List of Tables ................................................................. vii Abstract ....................................................................... viii 1. Introduction ............................................................... 1 1.1 Bayesian Computations ............................................. 4 1.2 Monte
    [Show full text]