Slice Sampling on Hamiltonian Trajectories

Benjamin Bloem-Reddy [email protected]
John P. Cunningham [email protected]
Department of Statistics, Columbia University

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

Hamiltonian Monte Carlo and slice sampling are amongst the most widely used and studied classes of Markov Chain Monte Carlo samplers. We connect these two methods and present Hamiltonian slice sampling, which allows slice sampling to be carried out along Hamiltonian trajectories, or transformations thereof. Hamiltonian slice sampling clarifies a class of model priors that induce closed-form slice samplers. More pragmatically, inheriting properties of slice samplers, it offers advantages over Hamiltonian Monte Carlo, in that it has fewer tunable hyperparameters and does not require gradient information. We demonstrate the utility of Hamiltonian slice sampling out of the box on problems ranging from Gaussian process regression to Pitman-Yor based mixture models.

1. Introduction

After decades of work in approximate inference and numerical integration, Markov Chain Monte Carlo (MCMC) techniques remain the gold standard for working with intractable probabilistic models, throughout statistics and machine learning. Of course, this gold standard also comes with sometimes severe computational requirements, which has spurred many developments for increasing the efficacy of MCMC. Accordingly, numerous MCMC algorithms have been proposed in different fields and with different motivations, and perhaps as a result the similarities between some popular methods have not been highlighted or exploited. Here we consider two important classes of MCMC methods: Hamiltonian Monte Carlo (HMC) (Neal, 2011) and slice sampling (Neal, 2003).

HMC considers the (negative log) probability of the intractable distribution as the potential energy of a Hamiltonian system, and samples a new point from the distribution by simulating a dynamical trajectory from the current sample point. With careful tuning, HMC exhibits favorable mixing properties in many situations, particularly in high dimensions, due to its ability to take large steps in sample space if the dynamics are simulated for long enough. Proper tuning of HMC parameters can be difficult, and there has been much interest in automating it; examples include Wang et al. (2013); Hoffman & Gelman (2014). Furthermore, better mixing rates associated with longer trajectories can be computationally expensive because each simulation step requires an evaluation or numerical approximation of the gradient of the distribution.

Similar in objective but different in approach is slice sampling, which attempts to sample uniformly from the volume under the target density. Slice sampling has been employed successfully in a wide range of inference problems, in large part because of its flexibility and relative ease of tuning (very few if any tunable parameters). Although efficiently slice sampling from univariate distributions is straightforward, in higher dimensions it is more difficult. Previous approaches include slice sampling each dimension individually, amongst others. One particularly successful approach is to generate a curve, parameterized by a single scalar value, through the high-dimensional sample space. Elliptical slice sampling (ESS) (Murray et al., 2010) is one such approach, generating ellipses parameterized by θ ∈ [0, 2π]. As we show in section 2.2, ESS is a special case of our proposed sampling algorithm.

In this paper, we explore the connections between HMC and slice sampling by observing that the elliptical trajectory used in ESS is the Hamiltonian flow of a Gaussian potential. This observation is perhaps not surprising – indeed it appears at least implicitly in Neal (2011); Pakman & Paninski (2013); Strathmann et al. (2015) – but here we leverage that observation in a way not previously considered. To wit, this connection between HMC and ESS suggests that we might perform slice sampling along more general Hamiltonian trajectories, a method we introduce under the name Hamiltonian slice sampling (HSS). HSS is of theoretical interest, allowing us to consider model priors that will induce simple closed form slice samplers. Perhaps more importantly, HSS is of practical utility. As the conceptual offspring of HMC and slice sampling, it inherits the relatively small amount of required tuning from ESS and the ability to take large steps in sample space from HMC techniques.

In particular, we offer the following contributions:

• We clarify the link between two popular classes of MCMC techniques.

• We introduce Hamiltonian slice sampling, a general sampler for target distributions that factor into two components (e.g., a prior and likelihood), where the prior factorizes or can be transformed so as to factorize, and all dependence structure in the target distribution is induced by the likelihood or the intermediate transformation.

• We show that the prior can be either of a form which induces analytical Hamiltonian trajectories, or more generally, of a form such that we can derive such a trajectory via a measure preserving transformation. Notable members of this class include Gaussian process and stick-breaking priors.

• We demonstrate the usefulness of HSS on a range of models, both parametric and nonparametric.

We first review HMC and ESS to establish notation and a conceptual framework. We then introduce HSS generally, followed by a specific version based on a transformation to the unit interval. Finally, we demonstrate the effectiveness and flexibility of HSS on two different popular probabilistic models.

2. Sampling via Hamiltonian dynamics

We are interested in the problem of generating samples of a random variable f from an intractable distribution π∗(f), either directly or via a measure preserving transformation r⁻¹(q) = f. For clarity, we use f to denote the quantity of interest in its natural space as defined by the distribution π∗, and q denotes the transformed quantity of interest in Hamiltonian phase space, as described in detail below. We defer further discussion of the map r(·) until section 2.4, but note that in typical implementations of HMC and ESS, r(·) is the identity map, i.e. f = q.

2.1. Hamiltonian Monte Carlo

HMC generates MCMC samples from a target distribution, often an intractable posterior distribution, π∗(q) := (1/Z) π̃(q), with normalizing constant Z, by simulating the Hamiltonian dynamics of a particle in the potential U(q) = −log π̃(q). We are free to specify a distribution for the particle's starting momentum p, given which the system evolves deterministically in an augmented state space (q, p) according to Hamilton's equations. In particular, at the i-th sampling iteration of HMC, the initial conditions of the system are q₀ = q^(i−1), the previous sample, and p₀, which in most implementations is sampled from N(0, M). M is a "mass" matrix that may be used to express a priori beliefs about the scaling and dependence of the different dimensions, and for models with high dependence between sampling dimensions, M can greatly affect sampling efficiency. An active area of research is investigating how to adapt M for increased sampling efficiency; for example Girolami & Calderhead (2011); Betancourt et al. (2016). For simplicity, in this work we assume throughout that M is set ahead of time and remains fixed. The resulting Hamiltonian is

H(q, p) = −log π̃(q) + (1/2) pᵀ M⁻¹ p.    (1)

When solved exactly, Hamilton's equations generate Metropolis-Hastings (MH) proposals that are accepted with probability 1. In most situations of interest, Hamilton's equations do not have an analytic solution, so HMC is often performed using a numerical integrator, e.g. Störmer-Verlet, which is sensitive to tuning and can be computationally expensive due to evaluation or numerical approximation of the gradient at each step. See Neal (2011) for more details.
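To make the role of the integrator and the mass matrix concrete, the sketch below (our own illustration in Python with NumPy, not code from the paper) implements one HMC transition using the Störmer-Verlet (leapfrog) integrator with a fixed M; the function names and interface are assumptions, and the user supplies log π̃ and its gradient, which is exactly the per-step gradient cost noted above.

```python
import numpy as np


def hmc_step(q, log_p_tilde, grad_log_p_tilde, M, step_size, n_steps, rng):
    """One HMC transition targeting pi*(q) proportional to exp(log_p_tilde(q)).

    Uses the Stormer-Verlet (leapfrog) integrator, so every integration
    step evaluates grad_log_p_tilde.
    """
    M_inv = np.linalg.inv(M)

    # Initial conditions: q0 is the previous sample, p0 ~ N(0, M).
    p = rng.multivariate_normal(np.zeros(len(q)), M)
    q_new, p_new = np.array(q, dtype=float), p.copy()

    # Leapfrog integration of Hamilton's equations for U(q) = -log_p_tilde(q).
    p_new = p_new + 0.5 * step_size * grad_log_p_tilde(q_new)
    for step in range(n_steps):
        q_new = q_new + step_size * (M_inv @ p_new)
        if step < n_steps - 1:
            p_new = p_new + step_size * grad_log_p_tilde(q_new)
    p_new = p_new + 0.5 * step_size * grad_log_p_tilde(q_new)

    # Hamiltonian of equation (1); the MH test corrects for discretization
    # error (exact solutions of Hamilton's equations would always accept).
    def hamiltonian(q_, p_):
        return -log_p_tilde(q_) + 0.5 * p_ @ M_inv @ p_

    log_accept = hamiltonian(q, p) - hamiltonian(q_new, p_new)
    return q_new if np.log(rng.random()) < log_accept else q
```

Each of the n_steps leapfrog steps calls grad_log_p_tilde once, which is why long trajectories are expensive, and both step_size and n_steps are tuning parameters that HSS is designed to avoid.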
2.2. Univariate and Elliptical Slice Sampling

Univariate slice sampling (Neal, 2003) generates samples uniformly from the area beneath a density p(x), such that the resulting samples are distributed according to p(x). It proceeds as follows: given a sample x₀, a threshold h ∼ U(0, p(x₀)) is sampled; the next sample x₁ is then sampled uniformly from the slice S := {x : p(x) > h}. When S is not known in closed form, a proxy slice S′ can be randomly constructed from operations that leave the uniform distribution on S invariant. In this paper, we use the stepping out and shrinkage procedure, with parameters w, the width, and m, the step out limit, from Neal (2003).
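The stepping-out and shrinkage procedure itself is specified in Neal (2003); the following is a minimal Python sketch of it for a single variable, assuming a log-density callable log_p together with the width w and step-out limit m named above (the function name is our own).

```python
import numpy as np


def slice_sample_1d(x0, log_p, w, m, rng):
    """One univariate slice sampling update (Neal, 2003).

    w is the initial bracket width and m bounds the number of
    stepping-out expansions, matching the parameters in the text.
    """
    # Slice threshold h ~ U(0, p(x0)), handled in log space for stability.
    log_h = log_p(x0) + np.log(rng.random())

    # Stepping out: place a width-w bracket randomly around x0, then
    # expand each end until it leaves the slice or the budget m is spent.
    left = x0 - w * rng.random()
    right = left + w
    j = int(np.floor(m * rng.random()))
    k = (m - 1) - j
    while j > 0 and log_p(left) > log_h:
        left -= w
        j -= 1
    while k > 0 and log_p(right) > log_h:
        right += w
        k -= 1

    # Shrinkage: propose uniformly on the bracket, shrinking it toward
    # x0 whenever the proposal falls outside the slice S = {x : p(x) > h}.
    while True:
        x1 = left + (right - left) * rng.random()
        if log_p(x1) > log_h:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1
```

Both the stepping-out and shrinkage operations leave the uniform distribution on the slice invariant, so the returned x₁ is a valid MCMC transition for p(x).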
ESS (Murray et al., 2010) is a popular sampling algorithm for latent Gaussian models, e.g., Markov random field or Gaussian process (GP) models. For a latent multivariate Gaussian f ∈ ℝᵈ with prior π(f) = N(f | 0, Σ), ESS generates MCMC transitions f → f′ by sampling an auxiliary variable ν ∼ N(0, Σ) and slice sampling the likelihood supported on the ellipse defined by

f′ = f cos θ + ν sin θ,   θ ∈ [0, 2π].    (2)

Noting the fact that the solutions to Hamilton's equations in an elliptical potential are ellipses, we may reinterpret ESS. The Hamiltonian induced by π̃(q) ∝ N(q | 0, Σ) and momentum distribution p ∼ N(0, M) is

H(q, p) = (1/2) qᵀ Σ⁻¹ q + (1/2) pᵀ M⁻¹ p.    (3)

When M = Σ⁻¹, Hamilton's equations yield the trajectory

Algorithm 1 Hamiltonian slice sampling
Input: Current state f; differentiable, one-to-one transformation q := r(f) with Jacobian J(r⁻¹(q))
Output: A new state f′. If f is drawn from π∗, the marginal distribution of f′ is also π∗.
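For a concrete instance of equation (2), the following is a minimal Python sketch of a standard elliptical slice sampling transition (Murray et al., 2010), which the paper identifies as a special case of HSS; it is not the paper's Algorithm 1, and the function name and log-likelihood interface are our own assumptions.

```python
import numpy as np


def elliptical_slice_step(f, log_lik, Sigma, rng):
    """One elliptical slice sampling transition (Murray et al., 2010).

    Targets a posterior proportional to N(f | 0, Sigma) * exp(log_lik(f)),
    slice sampling the likelihood along the ellipse of equation (2).
    """
    # Auxiliary variable nu ~ N(0, Sigma) defines the ellipse through f.
    nu = rng.multivariate_normal(np.zeros(len(f)), Sigma)

    # Log-likelihood threshold (the slice height), in log space.
    log_y = log_lik(f) + np.log(rng.random())

    # Initial angle and bracket; theta = 0 recovers the current state f.
    theta = 2.0 * np.pi * rng.random()
    theta_min, theta_max = theta - 2.0 * np.pi, theta

    while True:
        f_prime = f * np.cos(theta) + nu * np.sin(theta)  # equation (2)
        if log_lik(f_prime) > log_y:
            return f_prime
        # Shrink the angle bracket toward 0 and propose again.
        if theta < 0.0:
            theta_min = theta
        else:
            theta_max = theta
        theta = theta_min + (theta_max - theta_min) * rng.random()
```

Here the angle θ plays the role of the scalar trajectory parameter, and with M = Σ⁻¹ in equation (3) this ellipse is exactly the Hamiltonian flow that the paper generalizes.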