INTRODUCTION to MONTE CARLO METHODS Tance Sampling

Total Page:16

File Type:pdf, Size:1020Kb

INTRODUCTION to MONTE CARLO METHODS Tance Sampling INTRODUCTION TO MONTE CARLO METHODS DJC MACKAY Department of Physics Cambridge University Cavendish Laboratory Madingley Road Cambridge CB HE United Kingdom ABSTRACT This chapter describ es a sequence of Monte Carlo metho ds imp or tance sampling rejection samplingtheMetrop olis metho dand Gibbs samplingFor each metho d we discuss whether the metho d is exp ected to b e useful for highdimensional problems such as arise in in ference with graphical mo dels After the metho ds have b een describ ed the terminology of Markovchain Monte Carlo metho ds is presented The chapter concludes with a discussion of advanced metho ds includ ing metho ds for reducing random walk b ehaviour For details of Monte Carlo metho ds theorems and pro ofs and a full list of references the reader is directed to Neal Gilks Richardson and Spiegelhalter and Tanner The problems to b e solved The aims of Monte Carlo metho ds are to solve one or b oth of the following problems r R Problem to generate samples fx g from a given probability distri r bution P x Problem to estimate exp ectations of functions under this distribution for example Z N d hxi x P xx Please note that I will use the word sample in the following sense a sample from a distribution P x is a single realization x whose probability distributio n is P x This contrasts with the alternative usage in statistics where sample refers to a collection of realizations fxg DJC MACKAY The probability distribution P x whichwewillcallthetarget density might b e a distribution from statistical physics or a conditional distribution arising in data mo delling for example the p osterior probabilityofa mo dels parameters given some observed data We will generally assume that x is an N dimensional vector with real comp onents x but wewill n sometimes consider discrete spaces also We will concentrate on the rst problem sampling b ecause if wehave solved it then we can solve the second problem by using the random sam r R ples fx g to give the estimator r X r x R r r R Clearly if the vectors fx g are generated from P x then the exp ecta r tion of is Also as the number of samples R increases the variance of will decrease as where is the variance of R Z N d x P xx This is one of the imp ortant prop erties of Monte Carlo metho ds The accuracy of the Monte Carlo estimate equation is indep endent of the dimensionality of the space sampled Tobe So regardless of the dimensionality precise the variance of goesas R r of xitmay b e that as few as a dozen indep endent samples fx g suce to estimate satisfactorily We will nd later however that high dimensionality can cause other dif culties for Monte Carlo metho ds Obtaining indep endent samples from a given distribution P x is often not easy WHY IS SAMPLING FROM P x HARD We will assume that the density from whichwe wish to draw samples P x can b e evaluated at least to within a multiplicative constant that is we can evaluate a function P x such that P x P xZ If wecanevaluate P x why can we not easily solve problem Whyisit in general dicult to obtain samples from P x There are two diculties The rst is that wetypically do not know the normalizing constant Z N Z d x P x MONTE CARLO METHODS 3 3 P*(x) P*(x) 2.5 2.5 2 2 1.5 1.5 1 1 0.5 0.5 0 0 -4 -2 0 2 4 -4 -2 0 2 4 a b Howtodraw samples Figure a The function P x exp x x from this density b The function P xevaluated at a discrete set of uniformly spaced p oints fx gHowtodraw samples from this discrete distribution i The second is that even if we did know Z the problem of drawing samples from P x is still a challenging one esp ecially in highdimensional spaces There are only a few highdimensional densities from whichitiseasyto draw samples for example the Gaussian distribution Let us start from a simple onedimensional example Imagine that we wish to draw samples from the density P xP xZ where i h x P x exp x x We can plot this function gure a But that do es not mean wecandraw samples from it To give ourselves a simpler problem we could discretize the variable x and ask for samples from the discrete probability distribution over a set of uniformly spaced p oints fx g gure b How could wesolve i this problem If weevaluate p P x at each p oint x we can compute i i i X Z p i i and Z p p i i and we can then sample from the probability distribution fp g using various i metho ds based on a source of random bits But what is the cost of this pro cedure and how do es it scale with the dimensionality of the space A sample from a univariate Gaussian can be generated by computing p logu where u and u are uniformly distributed in cosu DJC MACKAY N Let us concentrate on the initial cost of evaluating Z To compute Z equation wehave to visit every p oint in the space In gure b there are uniformly spaced p oints in one dimension If our system had N dimensions N say then the corresp onding number of points would b e an unimaginable number of evaluations of P Even if each comp onent x only to ok two discrete values the number of evaluations of n P would b e anumb er that is still horribly huge equal to the fourth power of the numb er of particles in the universe One system with states is a collection of spins for example a fragment of an Ising mo del or Boltzmann machine or Markov eld Yeomans whose probability distribution is prop ortional to P x expE x where x fg and n X X E x J x x H x mn m n n n mn n The energy function E x is readily evaluated for any x But if wewishto evaluate this function at al l states x the computer time required would b e function evaluations The Ising mo del is a simple mo del which has b een around for a long time but the task of generating samples from the distribution P x P xZ is still an activeresearch area as evidenced by the work of Propp and Wilson UNIFORM SAMPLING Having agreed that we cannot visit every lo cation x in the state space we might consider trying to solve the second problem estimating the exp ec r R uniformly tation of a function x bydrawing random samples fx g r from the state space and evaluating P x at those p oints Then we could intro duce Z dened by R R X r Z P x R r R N and estimate d x xP xby R r X P x r x Z R r MONTE CARLO METHODS Is anything wrong with this strategy Well it dep ends on the functions x and P x Let us assume that x is a b enign smo othly varying function and concentrate on the nature of P x A highdimensional distribution is often concentrated in a small region of the state space known as its H X typical set T whose volume is given by jT j where H X is the ShannonGibbs entropy of the probability distribution P x X H X P x log P x x If almost all the probability mass is lo cated in the typical set and x R N is a b enign function the value of d x xP x will b e principally determined by the values that xtakes on in the typical set So uniform sampling will only stand a chance of giving a go o d estimate of if we makethenumb er of samples R suciently large that we are likely to hit the typical set a numb er of times So howmany samples are required Let us take the case of the Ising mo del again The total size of the state space N H is states and the typical set has size So each sample has a chance H N of of falling in the typical set The numb er of samples required to hit the typical set once is thus of order N H R min So what is H At high temp eratures the probability distribution of an Ising mo del tends to a uniform distribution and the entropy tends to H N bits so R is of order Under these conditions uniform max min sampling maywell b e a satisfactory technique for estimating But high temp eratures are not of great interest Considerably more interesting are intermediate temp eratures such as the critical temp erature at whichthe Ising mo del melts from an ordered phase to a disordered phase At this temp erature the entropy of an Ising mo del is roughly NbitsFor this probability distribution the numb er of samples required simply to hit the typical set once is of order N N N R min whichforN is ab out This is roughly the square of the number of particles in the universe Thus uniform sampling is utterly useless for the study of Ising mo dels of mo dest size And in most highdimensional problems if the distribution P x is not actually uniform uniform sampling is unlikely to b e useful OVERVIEW Having established that drawing samples from a highdimensional distri bution P x P xZ is dicult even if P x is easy to evaluate we will DJC MACKAY φ(x) Q*(x) P*(x) x Figure Functions involved in imp ortance sampling We wish to estimate the exp ecta tion of x under P x P x We can generate samples from the simpler distribution Qx Q x We can evaluate Q and P at any p oint now study a sequence of Monte Carlo metho ds imp ortance sampling rejection sampling the Metrop olis metho d and Gibbs sampling Imp ortance sampling Imp ortance sampling is not a metho d for generating samples from P x problem it is just a metho d for estimating the exp ectation of a func tion x problem It can b e viewed as a generalization of the uniform sampling metho d For illustrative purp oses let us imagine that the target distribution is a onedimensional density P x It is assumed that we are able to evalu ate this density at least to within a multiplicative constant thus wecan evaluate a function P xsuch that P x P xZ But P x is to o complicated a function for us to b e able to sample from it directlyWenow assume that wehave
Recommended publications
  • The Exponential Family 1 Definition
    The Exponential Family David M. Blei Columbia University November 9, 2016 The exponential family is a class of densities (Brown, 1986). It encompasses many familiar forms of likelihoods, such as the Gaussian, Poisson, multinomial, and Bernoulli. It also encompasses their conjugate priors, such as the Gamma, Dirichlet, and beta. 1 Definition A probability density in the exponential family has this form p.x / h.x/ exp >t.x/ a./ ; (1) j D f g where is the natural parameter; t.x/ are sufficient statistics; h.x/ is the “base measure;” a./ is the log normalizer. Examples of exponential family distributions include Gaussian, gamma, Poisson, Bernoulli, multinomial, Markov models. Examples of distributions that are not in this family include student-t, mixtures, and hidden Markov models. (We are considering these families as distributions of data. The latent variables are implicitly marginalized out.) The statistic t.x/ is called sufficient because the probability as a function of only depends on x through t.x/. The exponential family has fundamental connections to the world of graphical models (Wainwright and Jordan, 2008). For our purposes, we’ll use exponential 1 families as components in directed graphical models, e.g., in the mixtures of Gaussians. The log normalizer ensures that the density integrates to 1, Z a./ log h.x/ exp >t.x/ d.x/ (2) D f g This is the negative logarithm of the normalizing constant. The function h.x/ can be a source of confusion. One way to interpret h.x/ is the (unnormalized) distribution of x when 0. It might involve statistics of x that D are not in t.x/, i.e., that do not vary with the natural parameter.
    [Show full text]
  • 1 Introduction to Bayesian Inference 2 Introduction to Gibbs Sampling
    MCMC I: July 5, 2016 1 MCMC I 8th Summer Institute in Statistics and Modeling in Infectious Diseases Course Time Plan July 13-15, 2016 Instructors: Vladimir Minin, Kari Auranen, M. Elizabeth Halloran Course Description: This module is an introduction to Markov chain Monte Carlo methods with some simple applications in infectious disease studies. The course includes an introduction to Bayesian inference, Monte Carlo, MCMC, some background theory, and convergence diagnostics. Algorithms include Gibbs sampling and Metropolis-Hastings and combinations. Programming is in R. Familiarity with the R statistical package or other computing language is needed. Course schedule: The course is composed of 10 90-minute sessions, for a total of 15 hours of instruction. 1 Introduction to Bayesian Inference • Overview of the course. • Bayesian inference: Likelihood, prior, posterior, normalizing constant • Conjugate priors; Beta-binomial; Poisson-gamma; normal-normal • Posterior summaries, mean, mode, posterior intervals • Motivating examples: Chain binomial model (Reed-Frost), General Epidemic Model, SIS model. • Lab: – Goals: Warm-up with R for simple Bayesian computation – Example: Posterior distribution of transmission probability with a binomial sampling distribution using a conjugate beta prior distribution – Summarizing posterior inference (mean, median, posterior quantiles and intervals) – Varying the amount of prior information – Writing an R function 2 Introduction to Gibbs Sampling • Chain binomial model and data augmentation • Brief introduction
    [Show full text]
  • Fast Bayesian Non-Negative Matrix Factorisation and Tri-Factorisation
    Fast Bayesian Non-Negative Matrix Factorisation and Tri-Factorisation Thomas Brouwer Jes Frellsen Pietro Lio’ Computer Laboratory Department of Engineering Computer Laboratory University of Cambridge University of Cambridge University of Cambridge United Kingdom United Kingdom United Kingdom [email protected] [email protected] [email protected] Abstract We present a fast variational Bayesian algorithm for performing non-negative matrix factorisation and tri-factorisation. We show that our approach achieves faster convergence per iteration and timestep (wall-clock) than Gibbs sampling and non-probabilistic approaches, and do not require additional samples to estimate the posterior. We show that in particular for matrix tri-factorisation convergence is difficult, but our variational Bayesian approach offers a fast solution, allowing the tri-factorisation approach to be used more effectively. 1 Introduction Non-negative matrix factorisation methods Lee and Seung [1999] have been used extensively in recent years to decompose matrices into latent factors, helping us reveal hidden structure and predict missing values. In particular we decompose a given matrix into two smaller matrices so that their product approximates the original one. The non-negativity constraint makes the resulting matrices easier to interpret, and is often inherent to the problem – such as in image processing or bioinformatics (Lee and Seung [1999], Wang et al. [2013]). Some approaches approximate a maximum likelihood (ML) or maximum a posteriori (MAP) solution that minimises the difference between the observed matrix and the decomposition of this matrix. This gives a single point estimate, which can lead to overfitting more easily and neglects uncertainty. Instead, we may wish to find a full distribution over the matrices using a Bayesian approach, where we define prior distributions over the matrices and then compute their posterior after observing the actual data.
    [Show full text]
  • Bayesian Inference in the Normal Linear Regression Model
    Bayesian Inference in the Normal Linear Regression Model () Bayesian Methods for Regression 1 / 53 Bayesian Analysis of the Normal Linear Regression Model Now see how general Bayesian theory of overview lecture works in familiar regression model Reading: textbook chapters 2, 3 and 6 Chapter 2 presents theory for simple regression model (no matrix algebra) Chapter 3 does multiple regression In lecture, I will go straight to multiple regression Begin with regression model under classical assumptions (independent errors, homoskedasticity, etc.) Chapter 6 frees up classical assumptions in several ways Lecture will cover one way: Bayesian treatment of a particular type of heteroskedasticity () Bayesian Methods for Regression 2 / 53 The Regression Model Assume k explanatory variables, xi1,..,xik for i = 1, .., N and regression model: yi = b1 + b2xi2 + ... + bk xik + #i . Note xi1 is implicitly set to 1 to allow for an intercept. Matrix notation: y1 y2 y = 2 . 3 6 . 7 6 7 6 y 7 6 N 7 4 5 # is N 1 vector stacked in same way as y () Bayesian Methods for Regression 3 / 53 b is k 1 vector X is N k matrix 1 x12 .. x1k 1 x22 .. x2k X = 2 ..... 3 6 ..... 7 6 7 6 1 x .. x 7 6 N2 Nk 7 4 5 Regression model can be written as: y = X b + #. () Bayesian Methods for Regression 4 / 53 The Likelihood Function Likelihood can be derived under the classical assumptions: 1 2 # is N(0N , h IN ) where h = s . All elements of X are either fixed (i.e. not random variables). Exercise 10.1, Bayesian Econometric Methods shows that likelihood function can be written in
    [Show full text]
  • An Introduction to Markov Chain Monte Carlo Methods and Their Actuarial Applications
    AN INTRODUCTION TO MARKOV CHAIN MONTE CARLO METHODS AND THEIR ACTUARIAL APPLICATIONS DAVID P. M. SCOLLNIK Department of Mathematics and Statistics University of Calgary Abstract This paper introduces the readers of the Proceed- ings to an important class of computer based simula- tion techniques known as Markov chain Monte Carlo (MCMC) methods. General properties characterizing these methods will be discussed, but the main empha- sis will be placed on one MCMC method known as the Gibbs sampler. The Gibbs sampler permits one to simu- late realizations from complicated stochastic models in high dimensions by making use of the model’s associated full conditional distributions, which will generally have a much simpler and more manageable form. In its most extreme version, the Gibbs sampler reduces the analy- sis of a complicated multivariate stochastic model to the consideration of that model’s associated univariate full conditional distributions. In this paper, the Gibbs sampler will be illustrated with four examples. The first three of these examples serve as rather elementary yet instructive applications of the Gibbs sampler. The fourth example describes a reasonably sophisticated application of the Gibbs sam- pler in the important arena of credibility for classifica- tion ratemaking via hierarchical models, and involves the Bayesian prediction of frequency counts in workers compensation insurance. 114 AN INTRODUCTION TO MARKOV CHAIN MONTE CARLO METHODS 115 1. INTRODUCTION The purpose of this paper is to acquaint the readership of the Proceedings with a class of simulation techniques known as Markov chain Monte Carlo (MCMC) methods. These methods permit a practitioner to simulate a dependent sequence of ran- dom draws from very complicated stochastic models.
    [Show full text]
  • Accelerating Bayesian Inference on Structured Graphs Using Parallel Gibbs Sampling
    Accelerating Bayesian Inference on Structured Graphs Using Parallel Gibbs Sampling Glenn G. Ko∗, Yuji Chai∗, Rob A. Rutenbary, David Brooks∗, Gu-Yeon Wei∗ ∗Harvard University, Cambridge, USA yUniversity of Pittsburgh, Pittsburgh, USA [email protected], [email protected], [email protected], fdbrooks, [email protected] Abstract—Bayesian models and inference is a class of machine with substantial uncertainty and noise is normally reserved for learning that is useful for solving problems where the amount Bayesian models and inference. of data is scarce and prior knowledge about the application Bayesian models pose challenges that are different from allows you to draw better conclusions. However, Bayesian models often requires computing high-dimensional integrals and finding deep learning. They tend to have more complex structure, the posterior distribution can be intractable. One of the most that may be hierarchical with dependencies between different commonly used approximate methods for Bayesian inference distributions Without an easily-defined common computational is Gibbs sampling, which is a Markov chain Monte Carlo kernel and hardware accelerators like GPU and TPU [5], and (MCMC) technique to estimate target stationary distribution. software frameworks such as Tensorflow [6] and PyTorch [7], The idea in Gibbs sampling is to generate posterior samples by iterating through each of the variables to sample from its it is hard to build and run models fast and efficiently. This conditional given all the other variables fixed. While Gibbs work explores hardware accelerators for Bayesian models and sampling is a popular method for probabilistic graphical models inference which has growing interest and demand. such as Markov Random Field (MRF), the plain algorithm is Probabilistic graphical models are Bayesian models described slow as it goes through each of the variables sequentially.
    [Show full text]
  • Markov Chain Monte Carlo 1 Importance Sampling
    CS731 Spring 2011 Advanced Artificial Intelligence Markov Chain Monte Carlo Lecturer: Xiaojin Zhu [email protected] A fundamental problem in machine learning is to generate samples from a distribution: x ∼ p(x). (1) This problem has many important applications. For example, one can approximate the expectation of a function φ(x) Z µ ≡ Ep[φ(x)] = φ(x)p(x)dx (2) by the sample average over x1 ... xn ∼ p(x): n 1 X µˆ ≡ φ(x ) (3) n i i=1 This is known as the Monte Carlo method. A concrete application is in Bayesian predictive modeling where R p(x | θ)p(θ | Data)dθ. Clearly,µ ˆ is an unbiased estimator of µ. Furthermore, the variance ofµ ˆ decreases as V(φ)/n (4) R 2 where V(φ) = (φ(x) − µ) p(x)dx. Thus Monte Carlo methods depend on the sample size n, not on the dimensionality of x. 1 Often, p is only known up to a constant. That is, we assume p(x) = Z p˜(x) and we are able to evaluate p˜(x) at any point x. However, we cannot compute the normalization constant Z. For instance, in the 1 Bayesian modeling example, the posterior distribution is p(θ | Data) = Z p(θ)p(Data | θ). This makes sampling harder. All methods below work onp ˜. 1 Importance Sampling Unlike other methods discussed later, importance sampling is not a method to sample from p. Rather, it is a method to compute (3). Assume we know the unnormalizedp ˜. It turns out that we (or Matlab) only know how to directly sample from very few distributions such as uniform, Gaussian, etc.; p may not be one of them in general.
    [Show full text]
  • Gibbs Sampling, Exponential Families and Orthogonal Polynomials
    Statistical Science 2008, Vol. 23, No. 2, 151–178 DOI: 10.1214/07-STS252 c Institute of Mathematical Statistics, 2008 Gibbs Sampling, Exponential Families and Orthogonal Polynomials1 Persi Diaconis, Kshitij Khare and Laurent Saloff-Coste Abstract. We give families of examples where sharp rates of conver- gence to stationarity of the widely used Gibbs sampler are available. The examples involve standard exponential families and their conjugate priors. In each case, the transition operator is explicitly diagonalizable with classical orthogonal polynomials as eigenfunctions. Key words and phrases: Gibbs sampler, running time analyses, expo- nential families, conjugate priors, location families, orthogonal polyno- mials, singular value decomposition. 1. INTRODUCTION which has f as stationary density under mild con- ditions discussed in [4, 102]. The Gibbs sampler, also known as Glauber dy- The algorithm was introduced in 1963 by Glauber namics or the heat-bath algorithm, is a mainstay of [49] to do simulations for Ising models, and inde- scientific computing. It provides a way to draw sam- pendently by Turcin [103]. It is still a standard tool ples from a multivariate probability density f(x1,x2, of statistical physics, both for practical simulation ...,xp), perhaps only known up to a normalizing (e.g., [88]) and as a natural dynamics (e.g., [10]). constant, by a sequence of one-dimensional sampling The basic Dobrushin uniqueness theorem showing problems. From (X1,...,Xp) proceed to (X1′ , X2,..., existence of Gibbs measures was proved based on Xp), then (X1′ , X2′ , X3,...,Xp),..., (X1′ , X2′ ,...,Xp′ ) this dynamics (e.g., [54]). It was introduced as a where at the ith stage, the coordinate is sampled base for image analysis by Geman and Geman [46].
    [Show full text]
  • Gibbs Sampling
    4 Modern Model Estimation Part 1: Gibbs Sampling The estimation of a Bayesian model is the most difficult part of undertaking a Bayesian analysis. Given that researchers may use different priors for any particular model, estimation must be tailored to the specific model under consideration. Classical analyses, on the other hand, often involve the use of standard likelihood functions, and hence, once an estimation routine is developed, it can be used again and again. The trade-off for the additional work required for a Bayesian analysis is that (1) a more appropriate model for the data can be constructed than extant software may allow, (2) more measures of model fit and outlier/influential case diagnostics can be produced, and (3) more information is generally available to summarize knowledge about model parameters than a classical analysis based on maximum likelihood (ML) estimation provides. Along these same lines, additional measures may be constructed to test hypotheses concerning parameters not directly estimated in the model. In this chapter, I first discuss the goal of model estimation in the Bayesian paradigm and contrast it with that of maximum likelihood estimation. Then, I discuss modern simulation/sampling methods used by Bayesian statisticians to perform analyses, including Gibbs sampling. In the next chapter, I discuss the Metropolis-Hastings algorithm as an alternative to Gibbs sampling. 4.1 What Bayesians want and why As the discussion of ML estimation in Chapter 2 showed, the ML approach finds the parameter values that maximize the likelihood function for the ob- served data and then produces point estimates of the standard errors of these estimates.
    [Show full text]
  • Differentially Private Bayesian Inference for Exponential Families
    Differentially Private Bayesian Inference for Exponential Families Garrett Bernstein College of Information and Computer Sciences University of Massachusetts Amherst Amherst, MA 01002 [email protected] Daniel Sheldon College of Information and Computer Sciences University of Massachusetts Amherst Amherst, MA 01002 [email protected] Abstract The study of private inference has been sparked by growing concern regarding the analysis of data when it stems from sensitive sources. We present the first method for private Bayesian inference in exponential families that properly accounts for noise introduced by the privacy mechanism. It is efficient because it works only with sufficient statistics and not individual data. Unlike other methods, it gives properly calibrated posterior beliefs in the non-asymptotic data regime. 1 Introduction Differential privacy is the dominant standard for privacy [1]. A randomized algorithm that satisfies differential privacy offers protection to individuals by guaranteeing that its output is insensitive to changes caused by the data of any single individual entering or leaving the data set. An algorithm can be made differentially private by applying one of several general-purpose mechanisms to randomize the computation in an appropriate way, for example, by adding noise calibrated to the sensitivity of the quantity being computed, where sensitivity captures how much the quantity depends on any individual’s data [1]. Due to the obvious importance of protecting individual privacy while drawing population level inferences from data, differentially private algorithms have been developed for a broad range of machine learning tasks [2–9]. There is a growing interest in private methods for Bayesian inference [10–14].
    [Show full text]
  • A New Hyperprior Distribution for Bayesian Regression Model with Application in Genomics
    bioRxiv preprint doi: https://doi.org/10.1101/102244; this version posted January 22, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. A New Hyperprior Distribution for Bayesian Regression Model with Application in Genomics Renato Rodrigues Silva 1,* 1 Institute of Mathematics and Statistics, Federal University of Goi´as, Goi^ania,Goi´as,Brazil *[email protected] Abstract In the regression analysis, there are situations where the model have more predictor variables than observations of dependent variable, resulting in the problem known as \large p small n". In the last fifteen years, this problem has been received a lot of attention, specially in the genome-wide context. Here we purposed the bayes H model, a bayesian regression model using mixture of two scaled inverse chi square as hyperprior distribution of variance for each regression coefficient. This model is implemented in the R package BayesH. Introduction 1 In the regression analysis, there are situations where the model have more predictor 2 variables than observations of dependent variable, resulting in the problem known as 3 \large p small n" [1]. 4 To figure out this problem, there are already exists some methods developed as ridge 5 regression [2], least absolute shrinkage and selection operator (LASSO) regression [3], 6 bridge regression [4], smoothly clipped absolute deviation (SCAD) regression [5] and 7 others. This class of regression models is known in the literature as regression model 8 with penalized likelihood [6]. In the bayesian paradigma, there are also some methods 9 purposed as stochastic search variable selection [7], and Bayesian LASSO [8].
    [Show full text]
  • Stat 451 Lecture Notes 0512 Simulating Random Variables
    Stat 451 Lecture Notes 0512 Simulating Random Variables Ryan Martin UIC www.math.uic.edu/~rgmartin 1Based on Chapter 6 in Givens & Hoeting, Chapter 22 in Lange, and Chapter 2 in Robert & Casella 2Updated: March 7, 2016 1 / 47 Outline 1 Introduction 2 Direct sampling techniques 3 Fundamental theorem of simulation 4 Indirect sampling techniques 5 Sampling importance resampling 6 Summary 2 / 47 Motivation Simulation is a very powerful tool for statisticians. It allows us to investigate the performance of statistical methods before delving deep into difficult theoretical work. At a more practical level, integrals themselves are important for statisticians: p-values are nothing but integrals; Bayesians need to evaluate integrals to produce posterior probabilities, point estimates, and model selection criteria. Therefore, there is a need to understand simulation techniques and how they can be used for integral approximations. 3 / 47 Basic Monte Carlo Suppose we have a function '(x) and we'd like to compute Ef'(X )g = R '(x)f (x) dx, where f (x) is a density. There is no guarantee that the techniques we learn in calculus are sufficient to evaluate this integral analytically. Thankfully, the law of large numbers (LLN) is here to help: If X1; X2;::: are iid samples from f (x), then 1 Pn R n i=1 '(Xi ) ! '(x)f (x) dx with prob 1. Suggests that a generic approximation of the integral be obtained by sampling lots of Xi 's from f (x) and replacing integration with averaging. This is the heart of the Monte Carlo method. 4 / 47 What follows? Here we focus mostly on simulation techniques.
    [Show full text]