Bayesian Linear Model: Gory Details

Pubh7440 Notes by Sudipto Banerjee

Let $y = [y_i]_{i=1}^{n}$ be an $n \times 1$ vector of independent observations on a dependent variable (or response) from $n$ experimental units. Associated with each $y_i$ is a $p \times 1$ vector of regressors, say $x_i$, leading to the linear regression model

$$ y = X\beta + \epsilon, \qquad (1) $$

where $X = [x_i^T]_{i=1}^{n}$ is the $n \times p$ matrix of regressors, with $i$-th row $x_i^T$, and is assumed fixed; $\beta$ is the $p \times 1$ vector of regression coefficients (slopes); and $\epsilon = [\epsilon_i]_{i=1}^{n}$ is the vector of random variables representing "pure error" or measurement error in the dependent variable. For independent observations, we assume $\epsilon \sim MVN(0, \sigma^2 I_n)$, viz. that each component $\epsilon_i \stackrel{iid}{\sim} N(0, \sigma^2)$. Furthermore, we will assume that the columns of $X$ are linearly independent, so that the rank of $X$ is $p$.

1 The NIG conjugate prior family

A popular Bayesian model builds upon the linear regression of $y$ using conjugate priors by specifying

$$ p(\beta, \sigma^2) = p(\beta \mid \sigma^2)\, p(\sigma^2) = N(\mu_\beta, \sigma^2 V_\beta) \times IG(a, b) = NIG(\mu_\beta, V_\beta, a, b) $$
$$ = \frac{b^a}{(2\pi)^{p/2} |V_\beta|^{1/2} \Gamma(a)} \left(\frac{1}{\sigma^2}\right)^{a + p/2 + 1} \exp\left\{ -\frac{1}{\sigma^2}\left[ b + \frac{1}{2}(\beta - \mu_\beta)^T V_\beta^{-1} (\beta - \mu_\beta) \right] \right\} $$
$$ \propto \left(\frac{1}{\sigma^2}\right)^{a + p/2 + 1} \exp\left\{ -\frac{1}{\sigma^2}\left[ b + \frac{1}{2}(\beta - \mu_\beta)^T V_\beta^{-1} (\beta - \mu_\beta) \right] \right\}, \qquad (2) $$

where $\Gamma(\cdot)$ represents the Gamma function and the $IG(a, b)$ prior density for $\sigma^2$ is given by

$$ p(\sigma^2) = \frac{b^a}{\Gamma(a)} \left(\frac{1}{\sigma^2}\right)^{a+1} \exp\left(-\frac{b}{\sigma^2}\right), \qquad \sigma^2 > 0, $$

where $a, b > 0$. We call this the Normal-Inverse-Gamma (NIG) prior and denote it as $NIG(\mu_\beta, V_\beta, a, b)$.

The NIG probability distribution is a joint probability distribution of a vector $\beta$ and a scalar $\sigma^2$. If $(\beta, \sigma^2) \sim NIG(\mu, V, a, b)$, then an interesting analytic form results from integrating out $\sigma^2$ from the joint density:

$$ \int NIG(\mu, V, a, b)\, d\sigma^2 = \frac{b^a}{(2\pi)^{p/2}|V|^{1/2}\Gamma(a)} \int \left(\frac{1}{\sigma^2}\right)^{a + p/2 + 1} \exp\left\{ -\frac{1}{\sigma^2}\left[ b + \frac{1}{2}(\beta - \mu)^T V^{-1}(\beta - \mu) \right] \right\} d\sigma^2 $$
$$ = \frac{b^a\, \Gamma\!\left(a + \frac{p}{2}\right)}{(2\pi)^{p/2}|V|^{1/2}\Gamma(a)} \left[ b + \frac{1}{2}(\beta - \mu)^T V^{-1}(\beta - \mu) \right]^{-(a + p/2)} $$
$$ = \frac{\Gamma\!\left(a + \frac{p}{2}\right)}{\pi^{p/2}\left| (2a)\,\frac{b}{a} V \right|^{1/2}\Gamma(a)} \left[ 1 + \frac{(\beta - \mu)^T \left(\frac{b}{a}V\right)^{-1}(\beta - \mu)}{2a} \right]^{-\frac{2a+p}{2}}. $$

This is a multivariate $t$ density:

$$ MVSt_{\nu}(\mu, \Sigma) = \frac{\Gamma\!\left(\frac{\nu + p}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\, \pi^{p/2}\, |\nu\Sigma|^{1/2}} \left[ 1 + \frac{(\beta - \mu)^T \Sigma^{-1}(\beta - \mu)}{\nu} \right]^{-\frac{\nu + p}{2}}, \qquad (3) $$

with $\nu = 2a$ and $\Sigma = \frac{b}{a}V$.

2 The likelihood

The likelihood for the model is defined, up to proportionality, as the joint probability of observing the data given the parameters. Since $X$ is fixed, the likelihood is given by

$$ p(y \mid \beta, \sigma^2) = N(X\beta, \sigma^2 I) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{ -\frac{1}{2\sigma^2}(y - X\beta)^T(y - X\beta) \right\}. \qquad (4) $$

3 The posterior distribution from the NIG prior

Inference will proceed from the posterior distribution

$$ p(\beta, \sigma^2 \mid y) = \frac{p(\beta, \sigma^2)\, p(y \mid \beta, \sigma^2)}{p(y)}, $$

where $p(y) = \int p(\beta, \sigma^2)\, p(y \mid \beta, \sigma^2)\, d\beta\, d\sigma^2$ is the marginal distribution of the data.

The key to deriving the joint posterior distribution is the following easily verified multivariate completion of squares, or ellipsoidal rectification, identity:

$$ u^T A u - 2\alpha^T u = (u - A^{-1}\alpha)^T A (u - A^{-1}\alpha) - \alpha^T A^{-1}\alpha, \qquad (5) $$

where $A$ is a symmetric positive definite (hence invertible) matrix. An application of this identity immediately reveals

$$ \frac{1}{\sigma^2}\left[ b + \frac{1}{2}(\beta - \mu_\beta)^T V_\beta^{-1}(\beta - \mu_\beta) + \frac{1}{2}(y - X\beta)^T(y - X\beta) \right] = \frac{1}{\sigma^2}\left[ b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1}(\beta - \mu^*) \right], $$

using which we can write the posterior as

$$ p(\beta, \sigma^2 \mid y) \propto \left(\frac{1}{\sigma^2}\right)^{a + (n+p)/2 + 1} \exp\left\{ -\frac{1}{\sigma^2}\left[ b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1}(\beta - \mu^*) \right] \right\}, \qquad (6) $$

where

$$ \mu^* = (V_\beta^{-1} + X^T X)^{-1}(V_\beta^{-1}\mu_\beta + X^T y), $$
$$ V^* = (V_\beta^{-1} + X^T X)^{-1}, $$
$$ a^* = a + n/2, $$
$$ b^* = b + \frac{1}{2}\left[ \mu_\beta^T V_\beta^{-1}\mu_\beta + y^T y - \mu^{*T} V^{*-1}\mu^* \right]. $$

This posterior distribution is easily identified as $NIG(\mu^*, V^*, a^*, b^*)$, proving the NIG family to be conjugate for the linear regression model.
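As a computational companion to the update in (6), the following is a minimal sketch of the conjugate NIG update, assuming NumPy; the function name nig_posterior and the simulated toy data are illustrative and not part of the original notes.

```python
import numpy as np

def nig_posterior(y, X, mu_beta, V_beta, a, b):
    """Conjugate NIG(mu*, V*, a*, b*) update for y = X beta + eps, following (6).
    Illustrative helper; names and interface are assumptions, not from the notes."""
    n, p = X.shape
    V_beta_inv = np.linalg.inv(V_beta)
    V_star = np.linalg.inv(V_beta_inv + X.T @ X)          # V* = (V_beta^{-1} + X'X)^{-1}
    mu_star = V_star @ (V_beta_inv @ mu_beta + X.T @ y)    # mu* = V*(V_beta^{-1} mu_beta + X'y)
    a_star = a + n / 2.0
    b_star = b + 0.5 * (mu_beta @ V_beta_inv @ mu_beta + y @ y
                        - mu_star @ np.linalg.inv(V_star) @ mu_star)
    return mu_star, V_star, a_star, b_star

# Toy example: simulate data and update a NIG(0, I, a=2, b=1) prior.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)
mu_star, V_star, a_star, b_star = nig_posterior(y, X, np.zeros(p), np.eye(p), 2.0, 1.0)
print(mu_star)                  # posterior mean of beta
print(b_star / (a_star - 1))    # posterior mean of sigma^2 (IG mean, valid since a* > 1)
```

The last line uses the fact that the $IG(a^*, b^*)$ distribution has mean $b^*/(a^* - 1)$ whenever $a^* > 1$.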
Note that the marginal posterior distribution of $\sigma^2$ is immediately seen to be $IG(a^*, b^*)$, whose density is given by

$$ p(\sigma^2 \mid y) = \frac{b^{*\,a^*}}{\Gamma(a^*)} \left(\frac{1}{\sigma^2}\right)^{a^*+1} \exp\left(-\frac{b^*}{\sigma^2}\right). \qquad (7) $$

The marginal posterior distribution of $\beta$ is obtained by integrating out $\sigma^2$ from the NIG joint posterior as follows:

$$ p(\beta \mid y) = \int p(\beta, \sigma^2 \mid y)\, d\sigma^2 = \int NIG(\mu^*, V^*, a^*, b^*)\, d\sigma^2 $$
$$ \propto \int \left(\frac{1}{\sigma^2}\right)^{a^* + p/2 + 1} \exp\left\{ -\frac{1}{\sigma^2}\left[ b^* + \frac{1}{2}(\beta - \mu^*)^T V^{*-1}(\beta - \mu^*) \right] \right\} d\sigma^2 $$
$$ \propto \left[ 1 + \frac{(\beta - \mu^*)^T V^{*-1}(\beta - \mu^*)}{2b^*} \right]^{-(a^* + p/2)}. $$

This is a multivariate $t$ density:

$$ MVSt_{\nu^*}(\mu^*, \Sigma^*) = \frac{\Gamma\!\left(\frac{\nu^* + p}{2}\right)}{\Gamma\!\left(\frac{\nu^*}{2}\right)\, \pi^{p/2}\, |\nu^*\Sigma^*|^{1/2}} \left[ 1 + \frac{(\beta - \mu^*)^T \Sigma^{*-1}(\beta - \mu^*)}{\nu^*} \right]^{-\frac{\nu^* + p}{2}}, \qquad (8) $$

with $\nu^* = 2a^*$ and $\Sigma^* = \frac{b^*}{a^*}V^*$.

4 A useful expression for the NIG scale parameter

Here we will prove:

$$ b^* = b + \frac{1}{2}(y - X\mu_\beta)^T \left( I + X V_\beta X^T \right)^{-1} (y - X\mu_\beta). \qquad (9) $$

On account of the expression for $b^*$ derived in the preceding section, it suffices to prove that

$$ y^T y + \mu_\beta^T V_\beta^{-1}\mu_\beta - \mu^{*T} V^{*-1}\mu^* = (y - X\mu_\beta)^T \left( I + X V_\beta X^T \right)^{-1} (y - X\mu_\beta). $$

Substituting $\mu^* = V^*(V_\beta^{-1}\mu_\beta + X^T y)$ in the left-hand side above, we obtain

$$ y^T y + \mu_\beta^T V_\beta^{-1}\mu_\beta - \mu^{*T} V^{*-1}\mu^* = y^T y + \mu_\beta^T V_\beta^{-1}\mu_\beta - (V_\beta^{-1}\mu_\beta + X^T y)^T V^* (V_\beta^{-1}\mu_\beta + X^T y) $$
$$ = y^T (I - X V^* X^T)\, y - 2\, y^T X V^* V_\beta^{-1}\mu_\beta + \mu_\beta^T (V_\beta^{-1} - V_\beta^{-1} V^* V_\beta^{-1})\,\mu_\beta. \qquad (10) $$

Further development of the proof will employ two tricky identities. The first is the well-known Sherman-Woodbury-Morrison identity in matrix algebra:

$$ (A + BDC)^{-1} = A^{-1} - A^{-1}B\left( D^{-1} + C A^{-1} B \right)^{-1} C A^{-1}, \qquad (11) $$

where $A$ and $D$ are square invertible matrices and $B$ and $C$ are rectangular (square if $A$ and $D$ have the same dimensions) matrices such that the multiplications are well-defined. This identity is easily verified by multiplying the right-hand side by $A + BDC$ and simplifying to reduce it to the identity matrix.

Applying (11) twice, once with $A = V_\beta$ and $D = (X^T X)^{-1}$ to get the second equality and then with $A = (X^T X)^{-1}$ and $D = V_\beta$ to get the third equality, we have

$$ V_\beta^{-1} - V_\beta^{-1} V^* V_\beta^{-1} = V_\beta^{-1} - V_\beta^{-1}(V_\beta^{-1} + X^T X)^{-1} V_\beta^{-1} $$
$$ = \left[ V_\beta + (X^T X)^{-1} \right]^{-1} $$
$$ = X^T X - X^T X (X^T X + V_\beta^{-1})^{-1} X^T X $$
$$ = X^T (I_n - X V^* X^T)\, X. \qquad (12) $$

The next identity notes that since $V^*(V_\beta^{-1} + X^T X) = I_p$, we have $V^* V_\beta^{-1} = I_p - V^* X^T X$, so that

$$ X V^* V_\beta^{-1} = X - X V^* X^T X = (I_n - X V^* X^T)\, X. \qquad (13) $$

Substituting (12) and (13) in (10), we obtain

$$ y^T (I_n - X V^* X^T)\, y - 2\, y^T (I_n - X V^* X^T) X\mu_\beta + \mu_\beta^T X^T (I_n - X V^* X^T) X \mu_\beta $$
$$ = (y - X\mu_\beta)^T (I_n - X V^* X^T)(y - X\mu_\beta) $$
$$ = (y - X\mu_\beta)^T (I_n + X V_\beta X^T)^{-1} (y - X\mu_\beta), \qquad (14) $$

where the last step is again a consequence of (11):

$$ (I_n + X V_\beta X^T)^{-1} = I_n - X(V_\beta^{-1} + X^T X)^{-1} X^T = I_n - X V^* X^T. $$
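The identity (9) also lends itself to a quick numerical sanity check. The sketch below (assuming NumPy, with arbitrary simulated $X$, $y$, $\mu_\beta$, and $V_\beta$; none of these values come from the notes) compares $b^*$ computed from the Section 3 expression with the form in (9).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
mu_beta = rng.normal(size=p)
A = rng.normal(size=(p, p))
V_beta = A @ A.T + p * np.eye(p)   # symmetric positive definite prior scale matrix
b = 2.0

# b* from the Section 3 expression
V_beta_inv = np.linalg.inv(V_beta)
V_star = np.linalg.inv(V_beta_inv + X.T @ X)
mu_star = V_star @ (V_beta_inv @ mu_beta + X.T @ y)
b_star_sec3 = b + 0.5 * (mu_beta @ V_beta_inv @ mu_beta + y @ y
                         - mu_star @ np.linalg.inv(V_star) @ mu_star)

# b* from equation (9)
r = y - X @ mu_beta
b_star_eq9 = b + 0.5 * r @ np.linalg.solve(np.eye(n) + X @ V_beta @ X.T, r)

print(np.isclose(b_star_sec3, b_star_eq9))   # True: the two expressions agree
```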
5 Marginal distributions - the hard way

To obtain the marginal distribution of $y$, we first compute $p(y \mid \sigma^2)$ by integrating out $\beta$, and subsequently integrate out $\sigma^2$ to obtain $p(y)$. To be precise, we use the expression for $b^*$ derived in the preceding section and proceed as follows:

$$ p(y \mid \sigma^2) = \int p(y \mid \beta, \sigma^2)\, p(\beta \mid \sigma^2)\, d\beta = \int N(X\beta, \sigma^2 I_n) \times N(\mu_\beta, \sigma^2 V_\beta)\, d\beta $$
$$ = \int \frac{1}{(2\pi\sigma^2)^{\frac{n+p}{2}} |V_\beta|^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}\left[ (y - X\beta)^T (y - X\beta) + (\beta - \mu_\beta)^T V_\beta^{-1}(\beta - \mu_\beta) \right] \right\} d\beta $$
$$ = \frac{1}{(2\pi\sigma^2)^{\frac{n+p}{2}} |V_\beta|^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(y - X\mu_\beta)^T (I + X V_\beta X^T)^{-1}(y - X\mu_\beta) \right\} \int \exp\left\{ -\frac{1}{2\sigma^2}(\beta - \mu^*)^T V^{*-1}(\beta - \mu^*) \right\} d\beta $$
$$ = \frac{|V^*|^{1/2}}{(2\pi\sigma^2)^{n/2} |V_\beta|^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(y - X\mu_\beta)^T (I + X V_\beta X^T)^{-1}(y - X\mu_\beta) \right\} $$
$$ = \frac{1}{(2\pi\sigma^2)^{n/2} |I + X V_\beta X^T|^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(y - X\mu_\beta)^T (I + X V_\beta X^T)^{-1}(y - X\mu_\beta) \right\} $$
$$ = N\!\left( X\mu_\beta,\; \sigma^2 (I + X V_\beta X^T) \right). \qquad (15) $$

Here we have applied the matrix identity

$$ |A + BDC| = |A|\,|D|\,|D^{-1} + C A^{-1} B| \qquad (16) $$

to obtain

$$ |I + X V_\beta X^T| = |V_\beta|\,|V_\beta^{-1} + X^T X| = \frac{|V_\beta|}{|V^*|}. $$
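The two facts used in this section, the Woodbury form $(I + X V_\beta X^T)^{-1} = I - X V^* X^T$ and the determinant identity (16), can also be verified numerically. The sketch below is a minimal check assuming NumPy and arbitrary simulated matrices; it is an illustration, not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 15, 3
X = rng.normal(size=(n, p))
A = rng.normal(size=(p, p))
V_beta = A @ A.T + p * np.eye(p)                       # positive definite prior scale
V_star = np.linalg.inv(np.linalg.inv(V_beta) + X.T @ X)

M = np.eye(n) + X @ V_beta @ X.T                       # the n x n marginal covariance scale

# Determinant identity: |I + X V_beta X^T| = |V_beta| / |V*|
print(np.isclose(np.linalg.det(M),
                 np.linalg.det(V_beta) / np.linalg.det(V_star)))   # True

# Woodbury form used in (14): (I + X V_beta X^T)^{-1} = I - X V* X^T
print(np.allclose(np.linalg.inv(M),
                  np.eye(n) - X @ V_star @ X.T))                    # True
```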