Prior vs Likelihood vs Posterior; Posterior Predictive Distribution; Poisson Data


Prior vs Likelihood vs Posterior; Posterior Predictive Distribution; Poisson Data
Statistics 220, Spring 2005
Copyright © 2005 by Mark E. Irwin

Choosing the Likelihood Model

While much thought is put into choosing priors in a Bayesian analysis, the data (likelihood) model can also have a big effect. Choices that need to be made involve:

- Independence vs exchangeable vs more complex dependence
- Tail size, e.g. normal vs t_df
- Probability of events

Example: Probability of God's Existence

Two different analyses, both using the prior P[God] = P[No God] = 0.5 and likelihood ratio components

    D_i = P[Data_i | God] / P[Data_i | No God]

    Evidence (Data_i)                      D_i (Unwin)   D_i (Shermer)
    Recognition of goodness                    10             0.5
    Existence of moral evil                     0.5            0.1
    Existence of natural evil                   0.1            0.1
    Intranatural miracles (prayers)             2              1
    Extranatural miracles (resurrection)        1              0.5
    Religious experiences                       2              0.1

Resulting posterior probabilities P[God | Data]:

- Unwin: 2/3
- Shermer: 0.00025

So even starting with the same prior, different beliefs about what the data say give quite different posterior probabilities.

This is based on an analysis published in a July 2004 Scientific American article (available on the course web site on the Articles page). Stephen D. Unwin is a risk management consultant who has done work in physics on quantum gravity; he is the author of the book The Probability of God. Michael Shermer is the publisher of Skeptic and a regular contributor to Scientific American as author of the column Skeptic.

Note that in the article, Bayes' rule is presented as

    P[God | Data] = P[God] D / (P[God] D + P[No God])

where D is the overall likelihood ratio. See if you can show that this is equivalent to the usual version of Bayes' rule, under the assumption that the components of the data model are independent (so that D is the product of the D_i).
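The two posterior probabilities above can be checked numerically. A minimal sketch in Python, transcribing the likelihood ratios from the table and applying the article's form of Bayes' rule:

```python
from math import prod

# Likelihood ratio components D_i = P[Data_i | God] / P[Data_i | No God],
# transcribed from the table above.
unwin   = [10, 0.5, 0.1, 2, 1, 2]
shermer = [0.5, 0.1, 0.1, 1, 0.5, 0.1]

def posterior_god(ratios, prior=0.5):
    """Article's form of Bayes' rule: P[God]D / (P[God]D + P[No God])."""
    D = prod(ratios)  # overall likelihood ratio, assuming independent components
    return prior * D / (prior * D + (1 - prior))

print(posterior_god(unwin))    # close to 2/3
print(posterior_god(shermer))  # close to 0.00025
```

Unwin's ratios multiply to D = 2 (posterior 2/3 exactly); Shermer's multiply to D = 0.00025, and with a 50/50 prior the posterior is essentially D itself.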
Prior vs Likelihood vs Posterior

The posterior distribution can be seen as a compromise between the prior and the data. In general, this can be seen from two well-known relationships:

    E[θ] = E[E[θ | y]]                            (1)
    Var(θ) = E[Var(θ | y)] + Var(E[θ | y])        (2)

The first equation says that the prior mean is the average of all possible posterior means (averaged over all possible data sets). The second says that the posterior variance is, on average, smaller than the prior variance; the size of the difference depends on the variability of the posterior means.

This can be exhibited more precisely using examples.

Binomial model, conjugate prior:

    π ~ Beta(a, b)
    y | π ~ Bin(n, π)
    π | y ~ Beta(a + y, b + n − y)

Prior mean: E[π] = a/(a + b) = π̃. MLE: π̂ = y/n.

Then the posterior mean satisfies

    E[π | y] = (a + b)/(a + b + n) · π̃ + n/(a + b + n) · π̂ = π̄,

a weighted average of the prior mean and the sample proportion (MLE).

Prior variance:

    Var(π) = π̃(1 − π̃) / (a + b + 1)

Posterior variance:

    Var(π | y) = π̄(1 − π̄) / (a + b + n + 1)

So if n is large enough, the posterior variance will be smaller than the prior variance.

Normal model, conjugate prior, fixed variance:

    θ ~ N(μ₀, τ₀²)
    yᵢ | θ ~ iid N(θ, σ²),  i = 1, …, n
    θ | y ~ N(μₙ, τₙ²)

Then

    μₙ = (μ₀/τ₀² + n ȳ/σ²) / (1/τ₀² + n/σ²)   and   1/τₙ² = 1/τ₀² + n/σ²

So the posterior mean is a weighted average of the prior mean and the sample mean of the data.

The posterior mean can be thought of in two other ways:

    μₙ = μ₀ + τ₀²/(σ²/n + τ₀²) · (ȳ − μ₀)
       = ȳ − (σ²/n)/(σ²/n + τ₀²) · (ȳ − μ₀)

The first form has μₙ as the prior mean adjusted toward the sample average of the data. The second form has the sample average shrunk toward the prior mean. In most problems, the posterior mean can be thought of as a shrinkage estimator, where the estimate based on the data alone is shrunk toward the prior mean.
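The weighted-average form of the Beta posterior mean is easy to verify numerically. A small sketch (the values n = 20, y = 8, a = 4, b = 3 match one of the numerical examples in these notes; any positive values work):

```python
a, b = 4, 3          # Beta(a, b) prior
n, y = 20, 8         # binomial data: y successes in n trials

prior_mean = a / (a + b)                 # prior mean
mle = y / n                              # sample proportion
w = (a + b) / (a + b + n)                # weight on the prior mean

# weighted-average form of the posterior mean
post_mean = w * prior_mean + (1 - w) * mle

# direct conjugate update: pi | y ~ Beta(a + y, b + n - y)
direct = (a + y) / (a + b + n)

print(post_mean, direct)  # both equal 12/27
```

The prior acts like a + b pseudo-observations: as n grows, the weight w on the prior mean goes to zero.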
The shrinkage may not be expressible in quite as nice a form for more general problems.

In this example the posterior variance is never bigger than the prior variance, since

    1/τₙ² = 1/τ₀² + n/σ² ≥ 1/τ₀²   and   1/τₙ² ≥ n/σ²

The first part of this can be thought of as

    Posterior Precision = Prior Precision + Data Precision

The first inequality gives τₙ² ≤ τ₀²; the second gives τₙ² ≤ σ²/n.

Numerical examples with a Beta(4, 3) prior (the figures showed the prior, likelihood, and posterior densities for π in each case):

n = 5, y = 2:
    E[π] = 4/7 = 0.571,  π̂ = 2/5 = 0.4,  E[π | y] = 6/12 = 0.5
    Var(π) = 3/98 = 0.0306,  Var(π | y) = 1/52 = 0.0192
    SD(π) = 0.175,  SD(π | y) = 0.139

n = 20, y = 8:
    E[π] = 4/7,  π̂ = 2/5,  E[π | y] = 12/27 = 0.444
    Var(π) = 0.0306,  Var(π | y) = 0.0088
    SD(π) = 0.175,  SD(π | y) = 0.094

n = 100, y = 40:
    E[π] = 4/7,  π̂ = 2/5,  E[π | y] = 44/107 = 0.411
    Var(π) = 0.0306,  Var(π | y) = 0.0022
    SD(π) = 0.175,  SD(π | y) = 0.047

n = 1000, y = 400:
    E[π] = 4/7,  π̂ = 2/5,  E[π | y] = 404/1007 = 0.401
    Var(π) = 0.0306,  Var(π | y) = 0.00024
    SD(π) = 0.175,  SD(π | y) = 0.0154

As n grows, the posterior concentrates around the MLE and the influence of the prior fades.

Prediction

Another useful summary is the posterior predictive distribution of a future observation ỹ:

    p(ỹ | y) = ∫ p(ỹ | y, θ) p(θ | y) dθ

In many situations, ỹ will be conditionally independent of y given θ. The distribution in that case reduces to

    p(ỹ | y) = ∫ p(ỹ | θ) p(θ | y) dθ

In many situations this can be difficult to calculate, though it is often easy with a conjugate prior.
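The numbers in these examples come straight from the conjugate formulas, so they can be reproduced in a few lines. A sketch with the Beta(4, 3) prior used above:

```python
a, b = 4, 3
prior_mean = a / (a + b)
prior_var = prior_mean * (1 - prior_mean) / (a + b + 1)   # about 0.0306

for n, y in [(5, 2), (20, 8), (100, 40), (1000, 400)]:
    post_mean = (a + y) / (a + b + n)                     # Beta(a+y, b+n-y) mean
    post_var = post_mean * (1 - post_mean) / (a + b + n + 1)
    print(n, round(post_mean, 3), round(post_var, 5), round(post_var ** 0.5, 4))
```

The printed posterior SDs shrink roughly like 1/√n once n dominates a + b, matching the table values.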
For example, with the Binomial-Beta model, the posterior distribution of the success probability is Beta(a₁, b₁) (for some a₁, b₁). The distribution of the number of successes ỹ in m new trials is then

    p(ỹ | y) = ∫₀¹ p(ỹ | π) p(π | y) dπ
             = ∫₀¹ C(m, ỹ) π^ỹ (1 − π)^(m−ỹ) · [Γ(a₁ + b₁) / (Γ(a₁) Γ(b₁))] π^(a₁−1) (1 − π)^(b₁−1) dπ
             = C(m, ỹ) · [Γ(a₁ + b₁) / (Γ(a₁) Γ(b₁))] · [Γ(a₁ + ỹ) Γ(b₁ + m − ỹ) / Γ(a₁ + b₁ + m)]

which is an example of the Beta-Binomial distribution. The mean of this distribution is

    E[ỹ | y] = m · a₁/(a₁ + b₁) = m π̃

where here π̃ = a₁/(a₁ + b₁) is the posterior mean of π. This can be gotten by applying

    E[ỹ | y] = E[E[ỹ | π] | y] = E[m π | y]

The variance can be gotten from

    Var(ỹ | y) = Var(E[ỹ | π] | y) + E[Var(ỹ | π) | y] = Var(m π | y) + E[m π (1 − π) | y]

which is of the form

    m π̃ (1 − π̃) {1 + (m − 1) τ²}

with τ² = 1/(a₁ + b₁ + 1).

One way of thinking about this is that there are two pieces of uncertainty in predicting a new observation:

1. Uncertainty about the true success probability
2. Deviation of an observation from its expected value

This is clearer with the Normal-Normal model with fixed variance. As we saw earlier, the posterior distribution is of the form θ | y ~ N(μₙ, τₙ²). Then

    p(ỹ | y) = ∫ (1/(σ√(2π))) exp(−(ỹ − θ)²/(2σ²)) · (1/(τₙ√(2π))) exp(−(θ − μₙ)²/(2τₙ²)) dθ

A little bit of calculus will show that this reduces to a normal density with mean

    E[ỹ | y] = μₙ = E[E[ỹ | θ] | y] = E[θ | y]

and variance

    Var(ỹ | y) = τₙ² + σ² = Var(E[ỹ | θ] | y) + E[Var(ỹ | θ) | y] = Var(θ | y) + E[σ² | y]

An analogue to this is the variance for prediction in linear regression, which is exactly of this form:

    Var(ŷ | x) = σ² (1 + 1/n + (x − x̄)² / ((n − 1) sₓ²))

Simulating the posterior predictive distribution

This is easy to do, assuming that you can simulate from the posterior distribution of the parameter, which is usually feasible. It involves two steps:

1. Simulate θᵢ from θ | y,  i = 1, …, m
2. Simulate ỹᵢ from ỹ | θᵢ (= ỹ | θᵢ, y),  i = 1, …, m

The pairs (θᵢ, ỹᵢ) are draws from the joint distribution of (θ, ỹ) given y.
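The Beta-Binomial predictive pmf above can be evaluated directly with log-gamma functions. A sketch using a₁ = 6, b₁ = 6 (the posterior from the n = 5, y = 2 example with the Beta(4, 3) prior) and m = 10 new trials; the pmf should sum to 1 and have mean m·a₁/(a₁ + b₁) = 5:

```python
from math import lgamma, exp

def beta_binomial_pmf(k, m, a1, b1):
    """p(ytilde = k | y) = C(m, k) * B(a1 + k, b1 + m - k) / B(a1, b1)."""
    logp = (lgamma(m + 1) - lgamma(k + 1) - lgamma(m - k + 1)   # log C(m, k)
            + lgamma(a1 + b1) - lgamma(a1) - lgamma(b1)         # -log B(a1, b1)
            + lgamma(a1 + k) + lgamma(b1 + m - k)
            - lgamma(a1 + b1 + m))                              # log B(a1+k, b1+m-k)
    return exp(logp)

a1, b1, m = 6, 6, 10
pmf = [beta_binomial_pmf(k, m, a1, b1) for k in range(m + 1)]
mean = sum(k * p for k, p in enumerate(pmf))
print(sum(pmf))  # close to 1
print(mean)      # close to m * a1 / (a1 + b1) = 5
```

Working on the log scale avoids overflow in the gamma functions for larger m, a₁, b₁.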
Therefore the ỹᵢ are draws from ỹ | y.

Why the interest in the posterior predictive distribution?

- You might want to make predictions. For example, what will happen to a stock in 6 months?
- Model checking: is your model reasonable? There are a number of ways of doing this. Future observations can be compared with the posterior predictive distribution; another option is something along the lines of cross-validation.
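The two-step recipe can be sketched for the Normal-Normal model, where the predictive distribution is known exactly (N(μₙ, τₙ² + σ²)), so the simulation can be checked against the closed form. The prior and data summaries below are made up for illustration:

```python
import random
import statistics

random.seed(42)

# made-up prior and data summaries for the Normal-Normal model
mu0, tau0 = 0.0, 2.0          # theta ~ N(mu0, tau0^2)
sigma, n, ybar = 1.0, 10, 1.5  # y_i | theta ~ N(theta, sigma^2)

prec_n = 1 / tau0**2 + n / sigma**2                      # posterior precision
tau_n = prec_n ** -0.5
mu_n = (mu0 / tau0**2 + n * ybar / sigma**2) / prec_n    # posterior mean

# Step 1: draw theta_i from theta | y.  Step 2: draw ytilde_i from ytilde | theta_i.
draws = []
for _ in range(50_000):
    theta = random.gauss(mu_n, tau_n)
    draws.append(random.gauss(theta, sigma))

print(statistics.mean(draws))      # close to mu_n
print(statistics.variance(draws))  # close to tau_n^2 + sigma^2
```

The same two-step pattern works for any model where posterior draws are available, which is what makes posterior predictive checks routine in simulation-based Bayesian computation.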