Hierarchical Models – Motivation: James-Stein Inference

Hierarchical models – motivation
James-Stein inference

• Suppose X ∼ N(θ, 1)
  – X is admissible (not dominated) for estimating θ with squared error loss
• Now Xi ∼ N(θi, 1), i = 1, …, r
  – X = (X1, …, Xr) is admissible if r = 1, 2 but not for r ≥ 3
  – for r ≥ 3,
        δi = (1 − (r − 2) / Σi Xi²) Xi
    yields better estimates
  – known as James-Stein estimation

[slide 57]

Hierarchical models – motivation
James-Stein inference (cont'd)

• Bayes view: Xi ∼ N(θi, 1) and θi ∼ N(0, a)
  – posterior distn: θi | Xi ∼ N((1 − 1/(a+1)) Xi, a/(a+1))
  – posterior mean is (1 − 1/(a+1)) Xi
  – need to estimate a; one natural approach yields James-Stein
• Summary
  – estimation results depend on the loss function
  – squared-error loss: do well on average but maybe poorly for one component
  – powerful lesson about combining related problems to get improved inferences

[slide 58]

Hierarchical Models

Suppose we have data Yij, i = 1, …, nj, j = 1, …, J, such that Yij, i = 1, …, nj, are independent given θj with distribution p(Y | θj).

e.g. scores (Y) for students (i) in classrooms (j)

It might be reasonable to expect the θj's to be "similar" (but not necessarily identical). Therefore, we may try to estimate the population distribution of the θj's. This is achieved in a natural way if we use a prior distribution in which the θj's are viewed as a sample from a common population distribution.

[slide 59]

Hierarchical Models

• Key: The observed data, yij, with units indexed by i within groups indexed by j, can be used to estimate aspects of the population distribution of the θj's even though the values of θj are not themselves observed.
• How? It is natural to model such a problem hierarchically:
  – observable outcomes modeled conditionally on parameters θ
  – θ given a probabilistic specification in terms of other parameters, φ, known as hyperparameters

[slide 60]

Hierarchical Models

• Nonhierarchical models are usually inappropriate for hierarchical data.
  – a single θ (i.e. θj ≡ θ for all j) may be inadequate to fit a combined data set
  – separate unrelated θj's are likely to "overfit" the data
  – information about one θj can be obtained from the other groups' data
• A hierarchical model uses many parameters, but the population distribution induces enough structure to avoid overfitting.

[slide 61]

Setting up hierarchical models
Exchangeability

Recall: A set of random variables (θ1, …, θk) is exchangeable if the joint distribution is invariant to permutations of the indexes (1, …, k). The indexes contain no information about the values of the random variables.

• hierarchical models often use exchangeable models for the prior distribution of model parameters
• iid random variables are one example
• seemingly non-exchangeable r.v.'s may become exchangeable if we condition on all available information (e.g., regression analysis)

[slide 62]

Setting up hierarchical models
Exchangeable models

• Basic form of the exchangeable model
  – θ = (θ1, …, θk) are independent conditional on additional parameters φ (known as hyperparameters):
        p(θ | φ) = ∏_{j=1}^k p(θj | φ)
  – φ is referred to as the hyperparameter(s), with hyperprior distn p(φ)
  – implies p(θ) = ∫ p(θ | φ) p(φ) dφ
  – work with the joint posterior distribution, p(θ, φ | y)
• One objection to the exchangeable model is that we may have other information, say Xj. In that case we may take
        p(θ1, …, θJ | X1, …, XJ) = ∏_{j=1}^J p(θj | φ, Xj)

[slide 63]
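A quick numerical illustration of the James-Stein result on slides 57–58 may help. The following Python sketch is not from the slides: the true means, the dimension r, the number of replications, and the seed are arbitrary choices. It compares the total squared error of the raw estimates Xi with that of the shrinkage estimates δi for r ≥ 3.

```python
import numpy as np

# Minimal simulation sketch (illustrative values, not from the slides):
# compare X with the James-Stein estimate under total squared error loss.
rng = np.random.default_rng(0)
r, n_sims = 10, 5000
theta = rng.normal(0.0, 2.0, size=r)            # arbitrary "true" means

X = rng.normal(theta, 1.0, size=(n_sims, r))    # X_i ~ N(theta_i, 1), independently
shrink = 1.0 - (r - 2) / np.sum(X**2, axis=1, keepdims=True)
delta = shrink * X                              # James-Stein estimate

risk_raw = np.mean(np.sum((X - theta) ** 2, axis=1))
risk_js = np.mean(np.sum((delta - theta) ** 2, axis=1))
print(f"estimated risk: raw {risk_raw:.2f}, James-Stein {risk_js:.2f}")
```

For r = 10 the James-Stein risk should come out noticeably smaller, consistent with the inadmissibility of X for r ≥ 3.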
Setting up hierarchical models

• Model is specified in nested stages
  – sampling distribution p(y | θ) (first level of hierarchy)
  – prior (or population) distribution for θ: p(θ | φ) (second level of hierarchy)
  – prior distribution for φ (hyperprior): p(φ)
  – Note: more levels are possible
  – the hyperprior at the highest level is often diffuse, but improper priors must be checked carefully to avoid improper posterior distributions

[slide 64]

Setting up hierarchical models

• Inference
  – Joint distn: p(y, θ, φ) = p(y | θ, φ) p(θ | φ) p(φ) = p(y | θ) p(θ | φ) p(φ)
  – Posterior distribution:
        p(θ, φ | y) ∝ p(φ) p(θ | φ) p(y | θ) = p(θ | y, φ) p(φ | y)
    ∗ often p(θ | φ) is conjugate for p(y | θ)
    ∗ if we know (or fix) φ, p(θ | y, φ) follows from conjugacy
    ∗ then we need inference for φ: p(φ | y)

[slide 65]

Computational approaches for hierarchical models

• Marginal model:
        p(y | φ) = ∫ p(y | θ) p(θ | φ) dθ
  do inference only for φ (e.g. marginal maximum likelihood)
• Empirical Bayes:
        p(θ | y, φ̂) ∝ p(y | θ) p(θ | φ̂)
  do inference for θ
• Hierarchical Bayes (a.k.a. full Bayes):
        p(θ, φ | y) ∝ p(y | θ) p(θ | φ) p(φ)
  inference for θ and φ

[slide 66]

Hierarchical models and random effects
Animal breeding example

Consider the following mixed linear model, commonly used in animal breeding studies:

        Y = Xβ + Zu + e

    X = design matrix for fixed effects
    Z = design matrix for random effects
    β = fixed effects parameters
    u = random effects parameters
    e = individual variation ∼ N(0, σe² I)

so that

        Y | β, u, σe² ∼ N(Xβ + Zu, σe² I)
        u | σa² ∼ N(0, σa² A)

(can also think of β as random with p(β) ∝ 1)

[slide 67]

Hierarchical models and random effects
Animal breeding example

• Marginal model (after integrating out u):
        Y | β, σa², σe² ∼ N(Xβ, σa² Z A Z′ + σe² I)
• Note: the separation of parameters into θ and φ is somewhat ambiguous here:
  – the model specification suggests φ = {σa²} and θ = {β, u, σe²}
  – the marginal model suggests φ = {β, σa², σe²} and θ = {u}

[slide 68]
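To make the marginal model on slide 68 concrete, here is a hedged NumPy/SciPy sketch that builds the covariance σa² ZAZ′ + σe² I and evaluates the marginal log-likelihood of Y given β, σa², σe² after integrating out u. Everything below (the sizes, the design matrices X and Z, the choice A = I, and the data y) is invented for illustration; it is not the slides' example.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
n, q = 20, 5
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed-effects design (made up)
Z = rng.integers(0, 2, size=(n, q)).astype(float)        # random-effects design (made up)
A = np.eye(q)                                            # relationship matrix, taken as identity here

def marginal_loglik(y, beta, sigma_a2, sigma_e2):
    """log p(y | beta, sigma_a^2, sigma_e^2) with u integrated out."""
    cov = sigma_a2 * Z @ A @ Z.T + sigma_e2 * np.eye(n)
    return multivariate_normal(mean=X @ beta, cov=cov).logpdf(y)

# Simulate data from the model and evaluate the marginal log-likelihood.
beta_true = np.array([1.0, 0.5])
u_true = rng.normal(0.0, 1.0, size=q)
y = X @ beta_true + Z @ u_true + rng.normal(0.0, 0.5, size=n)
print(marginal_loglik(y, beta_true, sigma_a2=1.0, sigma_e2=0.25))
```

Maximising this function over (β, σa², σe²) corresponds to the "inference only for φ" route; REML instead works with the likelihood of error contrasts to account for estimating β, which is beyond this sketch.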
Hierarchical models and random effects
Animal breeding example

• Empirical Bayes (known as REML/BLUP)
  We can estimate σa², σe² by marginal (restricted?) maximum likelihood, giving σ̂a², σ̂e². Then
        p(u, β | y, σ̂a², σ̂e²) ∝ p(y | β, u, σ̂e²) p(u | σ̂a²)
  (a joint normal distn)
• Hierarchical Bayes
        p(β, u, σa², σe² | y) ∝ p(y | β, u, σe²) p(u | σa²) p(β, σa², σe²)

[slide 69]

Computation with hierarchical models

• Two cases
  – conjugate case (p(θ | φ) is a conjugate prior for p(y | θ))
    ∗ approach described below
  – non-conjugate case
    ∗ requires more advanced computing
    ∗ problem-specific implementations
• Computational strategy for the conjugate case
  – write p(θ, φ | y) = p(φ | y) p(θ | φ, y)
  – identify the conditional posterior density of θ given φ, p(θ | φ, y) (easy for conjugate models)
  – obtain the marginal posterior distribution of φ, p(φ | y)
  – simulate from p(φ | y) and then from p(θ | φ, y)

[slide 70]

Computation with hierarchical models
The marginal posterior distribution p(φ | y)

• Approaches for obtaining p(φ | y)
  – integration: p(φ | y) = ∫ p(θ, φ | y) dθ
  – algebra: for a convenient value of θ,
        p(φ | y) = p(θ, φ | y) / p(θ | φ, y)
• Sampling from p(φ | y)
  – easy if it is a known distribution
  – grid if φ is low-dimensional
  – more sophisticated methods (later)

[slide 71]

Beta-binomial example

• Series of toxicology studies
• Study j: nj exchangeable individuals, yj develop tumors
• Model specification:
  – yj | θj ∼ Bin(nj, θj), j = 1, …, J (indep)
  – θj, j = 1, …, J | α, β ∼ Beta(α, β) (iid)
  – p(α, β): to be specified later, hopefully "non-informative"
• Marginal model:
  – can integrate out θj, j = 1, …, J in this case:
        p(y | α, β) = ∫ ⋯ ∫ ∏_{j=1}^J [ Γ(α+β)/(Γ(α)Γ(β)) θj^{α−1} (1−θj)^{β−1} · (nj choose yj) θj^{yj} (1−θj)^{nj−yj} ] dθ1 ⋯ dθJ
                    = ∏_{j=1}^J (nj choose yj) · Γ(α+β)/(Γ(α)Γ(β)) · Γ(α+yj) Γ(β+nj−yj) / Γ(α+β+nj)
  – yj, j = 1, …, J are indep
  – the distn of yj is known as the beta-binomial distn

[slide 72]

Beta-binomial example

• Conditional distn of the θ's given α, β, y
  – p(θ | α, β, y) = ∏_j Beta(α + yj, β + nj − yj)
  – independent conjugate analyses
  – find this by algebra or by inspection of p(θ, α, β | y)
  – the analysis is thus reduced to finding (and simulating from) p(α, β | y)
• Marginal posterior distn of α, β
        p(α, β | y) ∝ p(α, β) ∏_{j=1}^J Γ(α+β)/(Γ(α)Γ(β)) · Γ(α+yj) Γ(β+nj−yj) / Γ(α+β+nj)
  – could derive this from the marginal distn on the previous slide
  – could also derive it from the joint posterior distn
  – not a known distn (in α, β) but easy to evaluate

[slide 73]

Beta-binomial example

• Hyperprior distn p(α, β)
  – First try: p(α, β) ∝ 1 (flat, noninformative?)
    ∗ equivalent to p(α/(α+β), α+β) ∝ (α+β)
    ∗ equivalent to p(log(α/β), log(α+β)) ∝ αβ
    ∗ check to see whether the posterior is proper
      · consider different cases (e.g., α → 0 with β fixed)
      · if α, β → ∞ with α/(α+β) = c, then p(α, β | y) ∝ constant (not integrable)
      · this is an improper distn
      · a contour plot would also show this (lots of probability extending out towards infinity)

[slide 74]
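The marginal posterior on slide 73 is easiest to handle on the log scale using log-gamma functions. The sketch below is illustrative only: the y and n values are made up (not the rat tumor data), and the binomial coefficients are dropped since they do not involve α, β.

```python
import numpy as np
from scipy.special import gammaln

# Made-up data for illustration (the slides use the rat tumor data).
y = np.array([0, 1, 2, 4, 5])
n = np.array([10, 12, 14, 18, 20])

def log_post(alpha, beta, log_prior):
    """log p(alpha, beta | y) up to an additive constant, for a given log hyperprior."""
    loglik = np.sum(gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
                    + gammaln(alpha + y) + gammaln(beta + n - y)
                    - gammaln(alpha + beta + n))
    return loglik + log_prior(alpha, beta)

# e.g. the flat hyperprior p(alpha, beta) ∝ 1 (which slide 74 shows yields an improper posterior):
flat = lambda a, b: 0.0
print(log_post(2.0, 10.0, flat))
```

This function can be evaluated over a grid of (α, β) values, which is exactly what the grid approach on the following slides needs.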
Beta-binomial example

• Hyperprior distn p(α, β)
  – Second try: p(α/(α+β), α+β) ∝ 1 (flat on prior mean and precision)
    ∗ more intuitive; these two params are plausibly independent
    ∗ equivalent to p(α, β) ∝ 1/(α+β)
    ∗ still leads to an improper posterior distn
  – Third try: p(log(α/β), log(α+β)) ∝ 1 (flat on a natural transformation of prior mean and variance)
    ∗ equivalent to p(α, β) ∝ 1/(αβ)
    ∗ still leads to an improper posterior distn
  – Fourth try: p(α/(α+β), (α+β)^{−1/2}) ∝ 1 (flat on prior s.d. and prior mean)
    ∗ equivalent to p(α, β) ∝ (α+β)^{−5/2}
    ∗ "final answer" – a proper posterior distn
    ∗ equivalent to p(log(α/β), log(α+β)) ∝ αβ (α+β)^{−5/2} (this will come up later)

[slide 75]

Beta-binomial example

• Computing
  – later we consider more sophisticated approaches
  – for now, use a grid approach:
    ∗ simulate α, β from a grid approximation to the posterior distn
    ∗ then simulate the θ's using the conjugate beta posterior distn
  – convenient to work on the (log(α/β), log(α+β)) scale because the contours "look better" and we can get away with a smaller grid
• Illustrate with the rat tumor data (separate handout)

[slide 76]

Normal-normal hierarchical model

• Data model
  – yj | θj ∼ N(θj, σj²), j = 1, …, J (indep)
  – the σj²'s are assumed known (we can relax this assumption later)
  – motivation: yj could be a summary statistic with an (approx) normal distn from the j-th study (e.g., a regression coefficient or a sample mean)
• Prior distn
  – need a prior distn p(θ1, …, θJ)
  – if exchangeable, then model the θ's as iid given parameters φ
  – some additional comments follow

[slide 77]

Normal-normal hierarchical model

• Constructing a prior distribution
  Can think of this data model as a one-way ANOVA model (especially if yj is a sample mean of nj obs in group j).
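Putting the pieces together, here is a minimal sketch of the grid approach from slide 76, working on the (log(α/β), log(α+β)) scale with the proper hyperprior p(α, β) ∝ (α+β)^{−5/2}, i.e. p(log(α/β), log(α+β)) ∝ αβ(α+β)^{−5/2}. The data and the grid limits are invented placeholders (the slides use the rat tumor data, for which the limits would be chosen by inspecting the posterior contours).

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(2)
y = np.array([0, 1, 2, 4, 5])            # made-up data
n = np.array([10, 12, 14, 18, 20])

# Grid on (u, v) = (log(alpha/beta), log(alpha+beta)); the limits are guesses.
u = np.linspace(-3.0, 0.0, 120)
v = np.linspace(0.5, 5.0, 120)
U, V = np.meshgrid(u, v, indexing="ij")
alpha = np.exp(V) / (1.0 + np.exp(-U))   # alpha = (alpha+beta) * alpha/(alpha+beta)
beta = np.exp(V) - alpha

# Log posterior on the (u, v) scale: beta-binomial likelihood plus
# log prior log(alpha) + log(beta) - 2.5*log(alpha+beta).
A, B = alpha[..., None], beta[..., None]
loglik = np.sum(gammaln(A + B) - gammaln(A) - gammaln(B)
                + gammaln(A + y) + gammaln(B + n - y)
                - gammaln(A + B + n), axis=-1)
logpost = loglik + np.log(alpha) + np.log(beta) - 2.5 * V
prob = np.exp(logpost - logpost.max())
prob /= prob.sum()

# Simulate (alpha, beta) from the grid, then each theta_j from its conjugate Beta posterior.
idx = rng.choice(prob.size, size=1000, p=prob.ravel())
a_draw, b_draw = alpha.ravel()[idx], beta.ravel()[idx]
theta = rng.beta(a_draw[:, None] + y, b_draw[:, None] + n - y)   # shape (1000, J)
print(theta.mean(axis=0))                # posterior means of the theta_j's
```

In a careful analysis one would also check that the grid captures essentially all of the posterior mass and, if desired, add a small jitter to the draws within grid cells.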