IEOR E4703: Monte-Carlo Simulation MCMC and Bayesian Modeling
Total Page:16
File Type:pdf, Size:1020Kb
Load more
										Recommended publications
									
								- 
												
												The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo
Journal of Machine Learning Research 15 (2014) 1593-1623 Submitted 11/11; Revised 10/13; Published 4/14 The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo Matthew D. Hoffman [email protected] Adobe Research 601 Townsend St. San Francisco, CA 94110, USA Andrew Gelman [email protected] Departments of Statistics and Political Science Columbia University New York, NY 10027, USA Editor: Anthanasios Kottas Abstract Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlated parameters that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than simpler methods such as random walk Metropolis or Gibbs sampling. However, HMC's performance is highly sensitive to two user-specified parameters: a step size and a desired number of steps L. In particular, if L is too small then the algorithm exhibits undesirable random walk behavior, while if L is too large the algorithm wastes computation. We introduce the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. Empirically, NUTS performs at least as efficiently as (and sometimes more efficiently than) a well tuned standard HMC method, without requiring user intervention or costly tuning runs. - 
												
												Markov Chain Monte Carlo Convergence Diagnostics: a Comparative Review
Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review By Mary Kathryn Cowles and Bradley P. Carlin1 Abstract A critical issue for users of Markov Chain Monte Carlo (MCMC) methods in applications is how to determine when it is safe to stop sampling and use the samples to estimate characteristics of the distribu- tion of interest. Research into methods of computing theoretical convergence bounds holds promise for the future but currently has yielded relatively little that is of practical use in applied work. Consequently, most MCMC users address the convergence problem by applying diagnostic tools to the output produced by running their samplers. After giving a brief overview of the area, we provide an expository review of thirteen convergence diagnostics, describing the theoretical basis and practical implementation of each. We then compare their performance in two simple models and conclude that all the methods can fail to detect the sorts of convergence failure they were designed to identify. We thus recommend a combination of strategies aimed at evaluating and accelerating MCMC sampler convergence, including applying diagnostic procedures to a small number of parallel chains, monitoring autocorrelations and cross- correlations, and modifying parameterizations or sampling algorithms appropriately. We emphasize, however, that it is not possible to say with certainty that a finite sample from an MCMC algorithm is rep- resentative of an underlying stationary distribution. 1 Mary Kathryn Cowles is Assistant Professor of Biostatistics, Harvard School of Public Health, Boston, MA 02115. Bradley P. Carlin is Associate Professor, Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455. - 
												
												Meta-Learning MCMC Proposals
Meta-Learning MCMC Proposals Tongzhou Wang∗ Yi Wu Facebook AI Research University of California, Berkeley [email protected] [email protected] David A. Moorey Stuart J. Russell Google University of California, Berkeley [email protected] [email protected] Abstract Effective implementations of sampling-based probabilistic inference often require manually constructed, model-specific proposals. Inspired by recent progresses in meta-learning for training learning agents that can generalize to unseen environ- ments, we propose a meta-learning approach to building effective and generalizable MCMC proposals. We parametrize the proposal as a neural network to provide fast approximations to block Gibbs conditionals. The learned neural proposals generalize to occurrences of common structural motifs across different models, allowing for the construction of a library of learned inference primitives that can accelerate inference on unseen models with no model-specific training required. We explore several applications including open-universe Gaussian mixture models, in which our learned proposals outperform a hand-tuned sampler, and a real-world named entity recognition task, in which our sampler yields higher final F1 scores than classical single-site Gibbs sampling. 1 Introduction Model-based probabilistic inference is a highly successful paradigm for machine learning, with applications to tasks as diverse as movie recommendation [31], visual scene perception [17], music transcription [3], etc. People learn and plan using mental models, and indeed the entire enterprise of modern science can be viewed as constructing a sophisticated hierarchy of models of physical, mental, and social phenomena. Probabilistic programming provides a formal representation of models as sample-generating programs, promising the ability to explore a even richer range of models. - 
												
												Markov Chain Monte Carlo Edps 590BAY
Markov Chain Monte Carlo Edps 590BAY Carolyn J. Anderson Department of Educational Psychology c Board of Trustees, University of Illinois Fall 2019 Overivew Bayesian computing Markov Chains Metropolis Example Practice Assessment Metropolis 2 Practice Overview ◮ Introduction to Bayesian computing. ◮ Markov Chain ◮ Metropolis algorithm for mean of normal given fixed variance. ◮ Revisit anorexia data. ◮ Practice problem with SAT data. ◮ Some tools for assessing convergence ◮ Metropolis algorithm for mean and variance of normal. ◮ Anorexia data. ◮ Practice problem. ◮ Summary Depending on the book that you select for this course, read either Gelman et al. pp 2751-291 or Kruschke Chapters pp 143–218. I am relying more on Gelman et al. C.J. Anderson (Illinois) MCMC Fall 2019 2.2/ 45 Overivew Bayesian computing Markov Chains Metropolis Example Practice Assessment Metropolis 2 Practice Introduction to Bayesian Computing ◮ Our major goal is to approximate the posterior distributions of unknown parameters and use them to estimate parameters. ◮ The analytic computations are fine for simple problems, ◮ Beta-binomial for bounded counts ◮ Normal-normal for continuous variables ◮ Gamma-Poisson for (unbounded) counts ◮ Dirichelt-Multinomial for multicategory variables (i.e., a categorical variable) ◮ Models in the exponential family with small number of parameters ◮ For large number of parameters and more complex models ◮ Algebra of analytic solution becomes overwhelming ◮ Grid takes too much time. ◮ Too difficult for most applications. C.J. Anderson (Illinois) MCMC Fall 2019 3.3/ 45 Overivew Bayesian computing Markov Chains Metropolis Example Practice Assessment Metropolis 2 Practice Steps in Modeling Recall that the steps in an analysis: 1. Choose model for data (i.e., p(y θ)) and model for parameters | (i.e., p(θ) and p(θ y)). - 
												
												Bayesian Phylogenetics and Markov Chain Monte Carlo Will Freyman
IB200, Spring 2016 University of California, Berkeley Lecture 12: Bayesian phylogenetics and Markov chain Monte Carlo Will Freyman 1 Basic Probability Theory Probability is a quantitative measurement of the likelihood of an outcome of some random process. The probability of an event, like flipping a coin and getting heads is notated P (heads). The probability of getting tails is then 1 − P (heads) = P (tails). • The joint probability of event A and event B both occuring is written as P (A; B). The joint probability of mutually exclusive events, like flipping a coin once and getting heads and tails, is 0. • The probability of A occuring given that B has already occurred is written P (AjB), which is read \the probability of A given B". This is a conditional probability. • The marginal probability of A is P (A), which is calculated by summing or integrating the joint probability over B. In other words, P (A) = P (A; B) + P (A; not B). • Joint, conditional, and marginal probabilities can be combined with the expression P (A; B) = P (A)P (BjA) = P (B)P (AjB). • The above expression can be rearranged into Bayes' theorem: P (BjA)P (A) P (AjB) = P (B) Bayes' theorem is a straightforward and uncontroversial way to calculate an inverse conditional probability. • If P (AjB) = P (A) and P (BjA) = P (B) then A and B are independent. Then the joint probability is calculated P (A; B) = P (A)P (B). 2 Interpretations of Probability What exactly is a probability? Does it really exist? There are two major interpretations of probability: • Frequentists believe that the probability of an event is its relative frequency over time. - 
												
												The Markov Chain Monte Carlo Approach to Importance Sampling
The Markov Chain Monte Carlo Approach to Importance Sampling in Stochastic Programming by Berk Ustun B.S., Operations Research, University of California, Berkeley (2009) B.A., Economics, University of California, Berkeley (2009) Submitted to the School of Engineering in partial fulfillment of the requirements for the degree of Master of Science in Computation for Design and Optimization at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY September 2012 c Massachusetts Institute of Technology 2012. All rights reserved. Author.................................................................... School of Engineering August 10, 2012 Certified by . Mort Webster Assistant Professor of Engineering Systems Thesis Supervisor Certified by . Youssef Marzouk Class of 1942 Associate Professor of Aeronautics and Astronautics Thesis Reader Accepted by . Nicolas Hadjiconstantinou Associate Professor of Mechanical Engineering Director, Computation for Design and Optimization 2 The Markov Chain Monte Carlo Approach to Importance Sampling in Stochastic Programming by Berk Ustun Submitted to the School of Engineering on August 10, 2012, in partial fulfillment of the requirements for the degree of Master of Science in Computation for Design and Optimization Abstract Stochastic programming models are large-scale optimization problems that are used to fa- cilitate decision-making under uncertainty. Optimization algorithms for such problems need to evaluate the expected future costs of current decisions, often referred to as the recourse function. In practice, this calculation is computationally difficult as it involves the evaluation of a multidimensional integral whose integrand is an optimization problem. Accordingly, the recourse function is estimated using quadrature rules or Monte Carlo methods. Although Monte Carlo methods present numerous computational benefits over quadrature rules, they require a large number of samples to produce accurate results when they are embedded in an optimization algorithm. - 
												
												Arxiv:1801.02309V4 [Stat.ML] 11 Dec 2019
Journal of Machine Learning Research 20 (2019) 1-42 Submitted 4/19; Revised 11/19; Published 12/19 Log-concave sampling: Metropolis-Hastings algorithms are fast Raaz Dwivedi∗;y [email protected] Yuansi Chen∗;} [email protected] Martin J. Wainwright};y;z [email protected] Bin Yu};y [email protected] Department of Statistics} Department of Electrical Engineering and Computer Sciencesy University of California, Berkeley Voleon Groupz, Berkeley Editor: Suvrit Sra Abstract We study the problem of sampling from a strongly log-concave density supported on Rd, and prove a non-asymptotic upper bound on the mixing time of the Metropolis-adjusted Langevin algorithm (MALA). The method draws samples by simulating a Markov chain obtained from the discretization of an appropriate Langevin diffusion, combined with an accept-reject step. Relative to known guarantees for the unadjusted Langevin algorithm (ULA), our bounds show that the use of an accept-reject step in MALA leads to an ex- ponentially improved dependence on the error-tolerance. Concretely, in order to obtain samples with TV error at most δ for a density with condition number κ, we show that MALA requires κd log(1/δ) steps from a warm start, as compared to the κ2d/δ2 steps establishedO in past work on ULA. We also demonstrate the gains of a modifiedO version of MALA over ULA for weakly log-concave densities. Furthermore, we derive mixing time bounds for the Metropolized random walk (MRW) and obtain (κ) mixing time slower than MALA. We provide numerical examples that support ourO theoretical findings, and demonstrate the benefits of Metropolis-Hastings adjustment for Langevin-type sampling algorithms. - 
												
												Markov Chain Monte Carlo (Mcmc) Methods
MARKOV CHAIN MONTE CARLO (MCMC) METHODS 0These notes utilize a few sources: some insights are taken from Profs. Vardeman’s and Carriquiry’s lecture notes, some from a great book on Monte Carlo strategies in scientific computing by J.S. Liu. EE 527, Detection and Estimation Theory, # 4c 1 Markov Chains: Basic Theory Definition 1. A (discrete time/discrete state space) Markov chain (MC) is a sequence of random quantities {Xk}, each taking values in a (finite or) countable set X , with the property that P {Xn = xn | X1 = x1,...,Xn−1 = xn−1} = P {Xn = xn | Xn−1 = xn−1}. Definition 2. A Markov Chain is stationary if 0 P {Xn = x | Xn−1 = x } is independent of n. Without loss of generality, we will henceforth name the elements of X with the integers 1, 2, 3,... and call them “states.” Definition 3. With 4 pij = P {Xn = j | Xn−1 = i} the square matrix 4 P = {pij} EE 527, Detection and Estimation Theory, # 4c 2 is called the transition matrix for a stationary Markov Chain and the pij are called transition probabilities. Note that a transition matrix has nonnegative entries and its rows sum to 1. Such matrices are called stochastic matrices. More notation for a stationary MC: Define (k) 4 pij = P {Xn+k = j | Xn = i} and (k) 4 fij = P {Xn+k = j, Xn+k−1 6= j, . , Xn+1 6= j, | Xn = i}. These are respectively the probabilities of moving from i to j in k steps and first moving from i to j in k steps. - 
												
												An Introduction to Markov Chain Monte Carlo Methods and Their Actuarial Applications
AN INTRODUCTION TO MARKOV CHAIN MONTE CARLO METHODS AND THEIR ACTUARIAL APPLICATIONS DAVID P. M. SCOLLNIK Department of Mathematics and Statistics University of Calgary Abstract This paper introduces the readers of the Proceed- ings to an important class of computer based simula- tion techniques known as Markov chain Monte Carlo (MCMC) methods. General properties characterizing these methods will be discussed, but the main empha- sis will be placed on one MCMC method known as the Gibbs sampler. The Gibbs sampler permits one to simu- late realizations from complicated stochastic models in high dimensions by making use of the model’s associated full conditional distributions, which will generally have a much simpler and more manageable form. In its most extreme version, the Gibbs sampler reduces the analy- sis of a complicated multivariate stochastic model to the consideration of that model’s associated univariate full conditional distributions. In this paper, the Gibbs sampler will be illustrated with four examples. The first three of these examples serve as rather elementary yet instructive applications of the Gibbs sampler. The fourth example describes a reasonably sophisticated application of the Gibbs sam- pler in the important arena of credibility for classifica- tion ratemaking via hierarchical models, and involves the Bayesian prediction of frequency counts in workers compensation insurance. 114 AN INTRODUCTION TO MARKOV CHAIN MONTE CARLO METHODS 115 1. INTRODUCTION The purpose of this paper is to acquaint the readership of the Proceedings with a class of simulation techniques known as Markov chain Monte Carlo (MCMC) methods. These methods permit a practitioner to simulate a dependent sequence of ran- dom draws from very complicated stochastic models. - 
												
												An Introduction to Data Analysis Using the Pymc3 Probabilistic Programming Framework: a Case Study with Gaussian Mixture Modeling
An introduction to data analysis using the PyMC3 probabilistic programming framework: A case study with Gaussian Mixture Modeling Shi Xian Liewa, Mohsen Afrasiabia, and Joseph L. Austerweila aDepartment of Psychology, University of Wisconsin-Madison, Madison, WI, USA August 28, 2019 Author Note Correspondence concerning this article should be addressed to: Shi Xian Liew, 1202 West Johnson Street, Madison, WI 53706. E-mail: [email protected]. This work was funded by a Vilas Life Cycle Professorship and the VCRGE at University of Wisconsin-Madison with funding from the WARF 1 RUNNING HEAD: Introduction to PyMC3 with Gaussian Mixture Models 2 Abstract Recent developments in modern probabilistic programming have offered users many practical tools of Bayesian data analysis. However, the adoption of such techniques by the general psychology community is still fairly limited. This tutorial aims to provide non-technicians with an accessible guide to PyMC3, a robust probabilistic programming language that allows for straightforward Bayesian data analysis. We focus on a series of increasingly complex Gaussian mixture models – building up from fitting basic univariate models to more complex multivariate models fit to real-world data. We also explore how PyMC3 can be configured to obtain significant increases in computational speed by taking advantage of a machine’s GPU, in addition to the conditions under which such acceleration can be expected. All example analyses are detailed with step-by-step instructions and corresponding Python code. Keywords: probabilistic programming; Bayesian data analysis; Markov chain Monte Carlo; computational modeling; Gaussian mixture modeling RUNNING HEAD: Introduction to PyMC3 with Gaussian Mixture Models 3 1 Introduction Over the last decade, there has been a shift in the norms of what counts as rigorous research in psychological science. - 
												
												Hamiltonian Monte Carlo Swindles
Hamiltonian Monte Carlo Swindles Dan Piponi Matthew D. Hoffman Pavel Sountsov Google Research Google Research Google Research Abstract become indistinguishable (Mangoubi and Smith, 2017; Bou-Rabee et al., 2018). These results were proven in Hamiltonian Monte Carlo (HMC) is a power- service of proofs of convergence and mixing speed for ful Markov chain Monte Carlo (MCMC) algo- HMC, but have been used by Heng and Jacob (2019) rithm for estimating expectations with respect to derive unbiased estimators from finite-length HMC to continuous un-normalized probability dis- chains. tributions. MCMC estimators typically have Empirically, we observe that two HMC chains targeting higher variance than classical Monte Carlo slightly different stationary distributions still couple with i.i.d. samples due to autocorrelations; approximately when driven by the same randomness. most MCMC research tries to reduce these In this paper, we propose methods that exploit this autocorrelations. In this work, we explore approximate-coupling phenomenon to reduce the vari- a complementary approach to variance re- ance of HMC-based estimators. These methods are duction based on two classical Monte Carlo based on two classical Monte Carlo variance-reduction “swindles”: first, running an auxiliary coupled techniques (sometimes called “swindles”): antithetic chain targeting a tractable approximation to sampling and control variates. the target distribution, and using the aux- iliary samples as control variates; and sec- In the antithetic scheme, we induce anti-correlation ond, generating anti-correlated ("antithetic") between two chains by driving the second chain with samples by running two chains with flipped momentum variables that are negated versions of the randomness. Both ideas have been explored momenta from the first chain. - 
												
												Hamiltonian Monte Carlo Within Stan
Hamiltonian Monte Carlo within Stan Daniel Lee Columbia University, Statistics Department [email protected] BayesComp mc-stan.org 1 Why MCMC? • Have data. • Have a rich statistical model. • No analytic solution. • (Point estimate not adequate.) 2 Review: MCMC • Markov Chain Monte Carlo. The samples form a Markov Chain. • Markov property: Pr(θn+1 j θ1; : : : ; θn) = Pr(θn+1 j θn) • Invariant distribution: π × Pr = π • Detailed balance: sufficient condition: R Pr(θn+1; A) = A q(θn+1; θn)dy π(θn+1)q(θn+1; θn) = π(θn)q(θn; θn+1) 3 Review: RWMH • Want: samples from posterior distribution: Pr(θjx) • Need: some function proportional to joint model. f (x, θ) / Pr(x, θ) • Algorithm: Given f (x, θ), x, N, Pr(θn+1 j θn) For n = 1 to N do Sample θˆ ∼ q(θˆ j θn−1) f (x,θ)ˆ With probability α = min 1; , set θn θˆ, else θn θn−1 f (x,θn−1) 4 Review: Hamiltonian Dynamics • (Implicit: d = dimension) • q = position (d-vector) • p = momentum (d-vector) • U(q) = potential energy • K(p) = kinectic energy • Hamiltonian system: H(q; p) = U(q) + K(p) 5 Review: Hamiltonian Dynamics • for i = 1; : : : ; d dqi = @H dt @pi dpi = − @H dt @qi • kinectic energy usually defined as K(p) = pT M−1p=2 • for i = 1; : : : ; d dqi −1 dt = [M p]i dpi = − @U dt @qi 6 Connection to MCMC • q, position, is the vector of parameters • U(q), potential energy, is (proportional to) the minus the log probability density of the parameters • p, momentum, are augmented variables • K(p), kinectic energy, is calculated • Hamiltonian dynamics used to update q.