Empirical Bayes Methods for Combining Likelihoods Bradley EFRON

Total Pages: 16

File Type: pdf; Size: 1020 KB

Suppose that several independent experiments are observed, each one yielding a likelihood Lk(θk) for a real-valued parameter of interest θk. For example, θk might be the log-odds ratio for a 2 x 2 table relating to the kth population in a series of medical experiments. This article concerns the following empirical Bayes question: How can we combine all of the likelihoods Lk to get an interval estimate for any one of the θk's, say θ1? The results are presented in the form of a realistic computational scheme that allows model building and model checking in the spirit of a regression analysis. No special mathematical forms are required for the priors or the likelihoods. This scheme is designed to take advantage of recent methods that produce approximate numerical likelihoods Lk(θk) even in very complicated situations, with all nuisance parameters eliminated. The empirical Bayes likelihood theory is extended to situations where the θk's have a regression structure as well as an empirical Bayes relationship. Most of the discussion is presented in terms of a hierarchical Bayes model and concerns how such a model can be implemented without requiring large amounts of Bayesian input. Frequentist approaches, such as bias correction and robustness, play a central role in the methodology.

KEY WORDS: ABC method; Confidence expectation; Generalized linear mixed models; Hierarchical Bayes; Meta-analysis for likelihoods; Relevance; Special exponential families.

1. INTRODUCTION

A typical statistical analysis blends data from independent experimental units into a single combined inference for a parameter of interest θ. Empirical Bayes, or hierarchical or meta-analytic, analyses involve a second level of data acquisition. Several independent experiments are observed, each involving many units, but each perhaps having a different value of the parameter θ. Inferences about one or all of the θ's are then made on the basis of the full two-level compound data set. This article concerns the construction of empirical Bayes interval estimates for the individual parameters θk when the observed data consist of a separate likelihood for each case.

Table 1, the ulcer data, exemplifies this situation. Forty-one randomized trials of a new surgical treatment for stomach ulcers were conducted between 1980 and 1989 (Sacks, Chalmers, Blum, Berrier, and Pagano 1990). The kth experiment's data are recorded as

    (ak, bk, ck, dk)    (k = 1, 2, ..., 41),    (1)

where ak and bk are the numbers of occurrences and nonoccurrences for the Treatment (the new surgery) and ck and dk are the occurrences and nonoccurrences for Control (an older surgery). Occurrence here refers to an adverse event, recurrent bleeding. The estimated log-odds ratio for experiment k,

    θ̂k = log{ (ak/bk) / (ck/dk) },    (2)

measures the excess occurrence of Treatment over Control. For example, θ̂13 = .60, suggesting a greater rate of occurrence for the Treatment in the 13th experimental population, but θ̂8 = -4.17 indicates a lesser rate of occurrence in the 8th population. An approximate standard deviation for θ̂k, also appearing in Table 1, is

    SDk = { 1/(ak + .5) + 1/(bk + .5) + 1/(ck + .5) + 1/(dk + .5) }^(1/2);    (3)

SD13 = .61, for example.

The statistic θ̂k is an estimate of the true log-odds ratio θk in the kth experimental population,

    θk = log[ (Pk{Occurrence | Treatment} / Pk{Nonoccurrence | Treatment}) / (Pk{Occurrence | Control} / Pk{Nonoccurrence | Control}) ],    (4)

where Pk indicates probabilities for population k. The data (ak, bk, ck, dk), considered as a 2 x 2 table with fixed margins, give a conditional likelihood for θk,

    Lk(θk) = C(ak + bk, ak) C(ck + dk, ck) exp(θk ak) / Sk(θk)    (5)

(Lehmann 1959, sec. 4.6), where C(n, x) denotes the binomial coefficient. Here Sk(θk) is the sum of the numerator in (5) over the allowable choices of ak subject to the 2 x 2 table's marginal constraints.
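To make equations (2), (3), and (5) concrete, here is a small numerical sketch. It is not part of the paper; it assumes Python with NumPy and SciPy and uses experiment 13's counts from Table 1. It computes the point estimate, its approximate standard deviation, and the conditional likelihood of the log-odds ratio on a grid, normalized to integrate to 1 as in Figure 1.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def log_binom(n, k):
    """log of the binomial coefficient C(n, k)."""
    return gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)

def log_odds_and_sd(a, b, c, d):
    """Point estimate (2) and approximate standard deviation (3) for one 2x2 table."""
    theta_hat = np.log((a / b) / (c / d))
    sd = np.sqrt(1/(a + .5) + 1/(b + .5) + 1/(c + .5) + 1/(d + .5))
    return theta_hat, sd

def conditional_likelihood(a, b, c, d, theta):
    """Conditional likelihood (5) of the log-odds ratio with both margins fixed,
    normalized to integrate to 1 over the grid `theta` (as in Figure 1)."""
    n1, n2, m1 = a + b, c + d, a + c            # row totals and first column total
    support = np.arange(max(0, m1 - n2), min(n1, m1) + 1)
    log_w = log_binom(n1, support) + log_binom(n2, m1 - support)
    log_S = logsumexp(log_w[None, :] + np.outer(theta, support), axis=1)
    lik = np.exp(log_binom(n1, a) + log_binom(n2, c) + theta * a - log_S)
    return lik / (lik.sum() * (theta[1] - theta[0]))

# Experiment 13 of Table 1: (a, b, c, d) = (9, 12, 7, 17)
theta = np.linspace(-6, 3, 451)
print(log_odds_and_sd(9, 12, 7, 17))            # approximately (0.60, 0.61)
L13 = conditional_likelihood(9, 12, 7, 17, theta)
```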
Figure 1 shows 10 of the 41 likelihoods (5), each normalized so as to integrate to 1 over the range of the graph. It seems clear that the θk values are not all the same; for instance, L8(θ8) and L13(θ13) barely overlap. On the other hand, the θk values are not wildly discrepant, most of the 41 likelihood functions being concentrated in the range θk ∈ (-6, 3).

[Figure 1. Ten of the 41 Individual Likelihoods Lk(θk), (5); k = 1, 3, 5, 6, 7, 8, 10, 12, 13, and 40; Likelihoods Normalized to Satisfy ∫ Lk(θk) dθk = 1. Stars indicate L40(θ40), which is more negatively located than the other 40 likelihoods; L41(θ41), not shown, is perfectly flat. Horizontal axis: log-odds ratio θ.]

This article concerns making interval estimates for any one of the parameters, say θ8, on the basis of all the data in Table 1. We could of course use only the data from experiment 8 to form a classical confidence interval for θ8. This amounts to assuming that the other 40 experiments have no relevance to the 8th population. At the other extreme, we could assume that all of the θk values were equal and form a confidence interval for the common log-odds ratio θ from the totals (a+, b+, c+, d+) = (170, 736, 352, 556).
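The "all θk equal" extreme can be checked on the back of an envelope. The following sketch is again not from the paper; it forms a rough Wald-type interval rather than a careful conditional analysis, using the pooled totals quoted above.

```python
import numpy as np

# Pooled 2x2 totals quoted in the text: (a+, b+, c+, d+) = (170, 736, 352, 556)
a, b, c, d = 170, 736, 352, 556
theta_hat = np.log((a / b) / (c / d))        # common log-odds ratio estimate
se = np.sqrt(1/a + 1/b + 1/c + 1/d)          # large-sample standard error
print(theta_hat, theta_hat - 1.96 * se, theta_hat + 1.96 * se)
# roughly -1.01 with 95% limits near (-1.22, -0.79)
```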
The empirical Bayes methods discussed here compromise between these extremes. Our goal is to make a correct Bayesian assessment of the information concerning θ8 in the other 40 experiments without having to specify a full Bayes prior distribution. Morris (1983) gives an excellent description of the empirical Bayes viewpoint.

Our empirical Bayes confidence intervals will be constructed directly from the likelihoods Lk(θk), k = 1, 2, ..., K, with no other reference to the original data that gave these likelihoods. This can be a big advantage in situations where the individual experiments are more complicated than those in Table 1. Suppose, for instance, that we had observed d-dimensional normal vectors xk ~ Nd(μk, Σk) independently for k = 1, 2, ..., K and that the parameter of interest was θk = second-largest eigenvalue of Σk. Current research has provided several good ways to construct approximate likelihoods Lk(θk) for θk alone, with all nuisance parameters eliminated (see Barndorff-Nielsen 1986, Cox and Reid 1987, and Efron 1993). The set of approximate likelihoods Lk(θk), k = 1, 2, ..., K, could serve as the input to our empirical Bayes analysis. In this way empirical Bayes methods can be brought to bear on very complicated situations.

The emphasis here is on a general approach to empirical Bayes confidence intervals that does not require mathematically tractable forms for the likelihoods or for the family of possible prior densities. Our results, which are primarily methodological, are presented in the form of a realistic computational algorithm that allows model building and model checking, in the spirit of a regression analysis. As a price for this degree of generality, all of our methods are approximate. They can be thought of as computer-based generalizations of the analytic results of Carlin and Gelfand (1990, 1991), Laird and Louis (1987), Morris (1983), and other works referred to by those authors. Section 6 extends the empirical Bayes likelihood methods to problems also having a regression structure. This extension gives a very general version of mixed models applying to generalized linear model (GLM) analyses.

The framework for our discussion is the hierarchical Bayes model described in Section 2.

Table 1. Ulcer Data

    Experiment    a    b    c    d    θ̂      SD
    *1            7    8   11    2   -1.84    .86
    2             8   11    8    8    -.32    .66
    *3            5   29    4   35     .41    .68
    4             7   29    4   27     .49    .65
    *5            3    9    0   12     Inf   1.57
    *6            4    3    4    0    -Inf   1.65
    *7            4   13   13   11   -1.35    .68
    *8            1   15   13    3   -4.17   1.04
    9             3   11    7   15    -.54    .76
    *10           2   36   12   20   -2.38    .75
    11            6    6    8    0    -Inf   1.56
    *12           2    5    7    2   -2.17   1.06
    *13           9   12    7   17     .60    .61
    14            7   14    5   20     .69    .66
    15            3   22   11   21   -1.35    .68
    16            4    7    6    4    -.97    .86
    17            2    8    8    2   -2.77   1.02
    18            1   30    4   23   -1.65    .98
    19            4   24   15   16   -1.73    .62
    20            7   36   16   27   -1.11    .51
    21            6   34   13    8   -2.22    .61
    22            4   14    5   34     .66    .71
    23           14   54   13   61     .20    .42
    24            6   15    8   13    -.43    .64
    25            0    6    6    0    -Inf   2.08
    26            1    9    5   10   -1.50   1.02
    27            5   12    5   10    -.18    .73
    28            0   10   12    2    -Inf   1.60
    29            0   22    8   16    -Inf   1.49
    30            2   16   10   11   -1.98    .80
    31            1   14    7    6   -2.79   1.01
    32            8   16   15   12    -.92    .57
    33            6    6    7    2   -1.25    .92
    34            0   20    5   18    -Inf   1.51
    35            4   13    2   14     .77    .87
    36           10   30   12    8   -1.50    .57
    37            3   13    2   14     .48    .91
    38            4   30    5   14    -.99    .71
    39            7   31   15   22   -1.11    .52
    *40           0   34   34    0    -Inf   2.01
    41            0    9    0   16      NA   2.04

NOTE: 41 independent experiments comparing Treatment, a new surgery for stomach ulcer, with Control, an older surgery; (a, b) = (occurrences, nonoccurrences) on Treatment; (c, d) = (occurrences, nonoccurrences) on Control; occurrence refers to an adverse event, recurrent bleeding. θ̂ is the estimated log-odds ratio (2); SD is the estimated standard deviation for θ̂ (3). Experiments 40 and 41 are not included in the empirical Bayes analysis of Section 3. Stars indicate likelihoods shown in Figure 1.

Bradley Efron is Professor, Department of Statistics, Stanford University, Stanford, CA 94305. Research was supported by National Science Foundation Grant DMS92-04864 and National Institutes of Health Grant 5 RO1 CA59039-20. © 1996 American Statistical Association, Journal of the American Statistical Association, June 1996, Vol. 91, No. 434, Theory and Methods.
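As a purely illustrative companion to the introduction, the sketch below combines the likelihoods through a fitted normal prior: the prior's mean and scale are chosen to maximize the product of the marginal likelihoods ∫ g(θ) Lk(θ) dθ, and the resulting posterior for θ8 is proportional to g(θ) L8(θ). This is emphatically not the paper's procedure (which uses special exponential family priors, bias corrections, and ABC confidence intervals); it only shows the mechanics of combining likelihoods. It reuses `conditional_likelihood` and `theta` from the first sketch, and types in only the first eight rows of Table 1.

```python
import numpy as np
from scipy.optimize import minimize

# (a, b, c, d) for experiments 1-8 of Table 1; extend with the remaining rows
# (the analysis of Section 3 uses experiments 1-39).
tables = [(7, 8, 11, 2), (8, 11, 8, 8), (5, 29, 4, 35), (7, 29, 4, 27),
          (3, 9, 0, 12), (4, 3, 4, 0), (4, 13, 13, 11), (1, 15, 13, 3)]
liks = [conditional_likelihood(a, b, c, d, theta) for a, b, c, d in tables]
dt = theta[1] - theta[0]

def neg_log_marginal(params):
    """Minus log of prod_k integral g(theta) L_k(theta) dtheta for a normal prior g."""
    mu, log_sigma = params
    g = np.exp(-0.5 * ((theta - mu) / np.exp(log_sigma)) ** 2)
    g = g / (g.sum() * dt)
    return -sum(np.log((g * L).sum() * dt) for L in liks)

mu_hat, log_sig_hat = minimize(neg_log_marginal, x0=np.array([-1.0, 0.0])).x
g_hat = np.exp(-0.5 * ((theta - mu_hat) / np.exp(log_sig_hat)) ** 2)
post8 = g_hat * liks[7]                      # experiment 8 is the 8th entry
post8 = post8 / (post8.sum() * dt)
cdf = np.cumsum(post8) * dt
print(theta[np.searchsorted(cdf, 0.05)], theta[np.searchsorted(cdf, 0.95)])
```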
Recommended publications
  • Assessing Fairness with Unlabeled Data and Bayesian Inference
Can I Trust My Fairness Metric? Assessing Fairness with Unlabeled Data and Bayesian Inference. Disi Ji (Department of Computer Science), Padhraic Smyth (Department of Computer Science), and Mark Steyvers (Department of Cognitive Sciences), University of California, Irvine. Abstract: We investigate the problem of reliably assessing group fairness when labeled examples are few but unlabeled examples are plentiful. We propose a general Bayesian framework that can augment labeled data with unlabeled data to produce more accurate and lower-variance estimates compared to methods based on labeled data alone. Our approach estimates calibrated scores for unlabeled examples in each group using a hierarchical latent variable model conditioned on labeled examples. This in turn allows for inference of posterior distributions with associated notions of uncertainty for a variety of group fairness metrics. We demonstrate that our approach leads to significant and consistent reductions in estimation error across multiple well-known fairness datasets, sensitive attributes, and predictive models. The results show the benefits of using both unlabeled data and Bayesian inference in terms of assessing whether a prediction model is fair or not. 1 Introduction. Machine learning models are increasingly used to make important decisions about individuals. At the same time it has become apparent that these models are susceptible to producing systematically biased decisions with respect to sensitive attributes such as gender, ethnicity, and age [Angwin et al., 2017, Berk et al., 2018, Corbett-Davies and Goel, 2018, Chen et al., 2019, Beutel et al., 2019]. This has led to a significant amount of recent work in machine learning addressing these issues, including research on both (i) definitions of fairness in a machine learning context (e.g., Dwork et al.
  • Bayesian Inference Chapter 4: Regression and Hierarchical Models
Bayesian Inference, Chapter 4: Regression and Hierarchical Models. Conchi Ausín and Mike Wiper, Department of Statistics, Universidad Carlos III de Madrid. Master in Business Administration and Quantitative Methods; Master in Mathematical Engineering. Objective: We analyze the Bayesian approach to fitting normal and generalized linear models and introduce the Bayesian hierarchical modeling approach. Also, we study the modeling and forecasting of time series. Contents: 1. Normal linear models (1.1 ANOVA model; 1.2 Simple linear regression model); 2. Generalized linear models; 3. Hierarchical models; 4. Dynamic models. Normal linear models: A normal linear model is of the following form: y = Xθ + ε, where y = (y1, ..., yn)' is the observed data, X is a known n × k matrix, called the design matrix, θ = (θ1, ..., θk)' is the parameter set, and ε follows a multivariate normal distribution; usually it is assumed that ε ∼ N(0, (1/φ) I). A simple example of a normal linear model is the simple linear regression model, where X' = (1 1 ... 1; x1 x2 ... xn) and θ = (α, β)'. Consider a normal linear model, y = Xθ + ε. A conjugate prior distribution is a normal-gamma distribution:
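A minimal numerical companion to the normal linear model above; this sketch is not from the slides, treats the error variance as known instead of using the full normal-gamma prior, and assumes NumPy.

```python
import numpy as np

# Simulated simple linear regression, design matrix X = [1, x] as in the slides
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
theta_true = np.array([1.0, 2.0])            # (alpha, beta)
sigma = 0.3                                  # error sd, treated as known here
y = X @ theta_true + sigma * rng.normal(size=x.size)

# Conjugate normal prior theta ~ N(m0, S0); the posterior is also normal
m0, S0 = np.zeros(2), 10.0 * np.eye(2)
Sn = np.linalg.inv(np.linalg.inv(S0) + X.T @ X / sigma**2)   # posterior covariance
mn = Sn @ (np.linalg.inv(S0) @ m0 + X.T @ y / sigma**2)      # posterior mean
print(mn)                                    # close to theta_true for this much data
```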
  • Fast Inference in Generalized Linear Models Via Expected Log-Likelihoods
Fast inference in generalized linear models via expected log-likelihoods. Alexandro D. Ramirez (Weill Cornell Medical College, NY, NY, U.S.A.) and Liam Paninski (Columbia University Department of Statistics, Center for Theoretical Neuroscience, Grossman Center for the Statistics of Mind, Kavli Institute for Brain Science, NY, NY, U.S.A.). Abstract: Generalized linear models play an essential role in a wide variety of statistical applications. This paper discusses an approximation of the likelihood in these models that can greatly facilitate computation. The basic idea is to replace a sum that appears in the exact log-likelihood by an expectation over the model covariates; the resulting "expected log-likelihood" can in many cases be computed significantly faster than the exact log-likelihood. In many neuroscience experiments the distribution over model covariates is controlled by the experimenter and the expected log-likelihood approximation becomes particularly useful; for example, estimators based on maximizing this expected log-likelihood (or a penalized version thereof) can often be obtained with orders of magnitude computational savings compared to the exact maximum likelihood estimators. A risk analysis establishes that these maximum EL estimators often come with little cost in accuracy (and in some cases even improved accuracy) compared to standard maximum likelihood estimates. Finally, we find that these methods can significantly decrease the computation time of marginal likelihood calculations for model selection and of Markov chain Monte Carlo methods for sampling from the posterior parameter distribution. We illustrate our results by applying these methods to a computationally-challenging dataset of neural spike trains obtained via large-scale multi-electrode recordings in the primate retina.
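The summary's central device can be illustrated in a few lines. The sketch below is not from the paper; it assumes a Poisson GLM with exponential link and Gaussian covariates, for which E[exp(x'β)] = exp(β'Cβ/2) is available in closed form, and compares the exact log-likelihood with its expected-log-likelihood approximation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5000, 10
C = np.eye(p)                                   # covariate covariance
X = rng.multivariate_normal(np.zeros(p), C, size=n)
beta = rng.normal(scale=0.2, size=p)
y = rng.poisson(np.exp(X @ beta))               # Poisson GLM, exponential link

def exact_loglik(b):
    eta = X @ b
    return y @ eta - np.exp(eta).sum()          # log y_i! terms dropped

def expected_loglik(b):
    # The O(n) nonlinear sum is replaced by n * E[exp(x'b)] = n * exp(b'Cb/2)
    # for Gaussian covariates; the linear term is kept exact.
    return y @ (X @ b) - n * np.exp(b @ C @ b / 2.0)

print(exact_loglik(beta), expected_loglik(beta))   # close for large n
```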
  • The Bayesian Lasso
The Bayesian Lasso. Trevor Park and George Casella, University of Florida, Gainesville, Florida, USA. Summary: The Lasso estimate for linear regression parameters can be interpreted as a Bayesian posterior mode estimate when the priors on the regression parameters are independent double-exponential (Laplace) distributions. This posterior can also be accessed through a Gibbs sampler using conjugate normal priors for the regression parameters, with independent exponential hyperpriors on their variances. This leads to tractable full conditional distributions through a connection with the inverse Gaussian distribution. Although the Bayesian Lasso does not automatically perform variable selection, it does provide standard errors and Bayesian credible intervals that can guide variable selection. Moreover, the structure of the hierarchical model provides both Bayesian and likelihood methods for selecting the Lasso parameter. The methods described here can also be extended to other Lasso-related estimation methods like bridge regression and robust variants. Keywords: Gibbs sampler, inverse Gaussian, linear regression, empirical Bayes, penalised regression, hierarchical models, scale mixture of normals. 1. Introduction. The Lasso of Tibshirani (1996) is a method for simultaneous shrinkage and model selection in regression problems. It is most commonly applied to the linear regression model y = μ1n + Xβ + ε, where y is the n × 1 vector of responses, μ is the overall mean, X is the n × p matrix of standardised regressors, β = (β1, ..., βp)' is the vector of regression coefficients to be estimated, and ε is the n × 1 vector of independent and identically distributed normal errors with mean 0 and unknown variance σ². The estimate of μ is taken as the average ȳ of the responses, and the Lasso estimate of β minimises the sum of the squared residuals, subject to a given bound t on its L1 norm.
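A compact sketch of the Gibbs sampler outlined in the summary, using the full conditionals implied by the normal/exponential (equivalently Laplace) hierarchy. It is not the authors' code, it holds the Lasso parameter λ fixed rather than selecting it, and it assumes y has been centred and X standardised.

```python
import numpy as np

def bayesian_lasso_gibbs(X, y, lam=1.0, n_iter=2000, rng=None):
    """Gibbs sampler sketch: conjugate normal priors on beta given the tau_j^2,
    independent exponential hyperpriors on the tau_j^2 (marginal Laplace priors
    on the beta_j), and a flat prior on log sigma^2.  lam is held fixed."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    beta, sigma2, inv_tau2 = np.zeros(p), 1.0, np.ones(p)
    draws = np.empty((n_iter, p))
    for it in range(n_iter):
        # beta | rest ~ N(A^{-1} X'y, sigma2 A^{-1}) with A = X'X + diag(1/tau_j^2)
        A_inv = np.linalg.inv(XtX + np.diag(inv_tau2))
        beta = rng.multivariate_normal(A_inv @ Xty, sigma2 * A_inv)
        # sigma2 | rest ~ Inv-Gamma((n-1)/2 + p/2, ||y-Xb||^2/2 + sum_j b_j^2/tau_j^2 / 2)
        resid = y - X @ beta
        shape = (n - 1) / 2.0 + p / 2.0
        scale = resid @ resid / 2.0 + beta @ (inv_tau2 * beta) / 2.0
        sigma2 = scale / rng.gamma(shape)
        # 1/tau_j^2 | rest ~ Inverse-Gaussian(sqrt(lam^2 sigma2 / beta_j^2), lam^2)
        inv_tau2 = rng.wald(np.sqrt(lam**2 * sigma2 / beta**2), lam**2)
        draws[it] = beta
    return draws

# Toy run on simulated data with two nonzero coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X = (X - X.mean(0)) / X.std(0)
y = X @ np.array([1.5, 0.0, 0.0, -1.0, 0.0]) + rng.normal(scale=0.5, size=50)
y = y - y.mean()
samples = bayesian_lasso_gibbs(X, y, lam=1.0, rng=rng)
print(samples[500:].mean(axis=0))            # posterior means after burn-in
```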
  • Posterior Propriety and Admissibility of Hyperpriors in Normal Hierarchical Models
The Annals of Statistics, 2005, Vol. 33, No. 2, 606–646. DOI: 10.1214/009053605000000075. © Institute of Mathematical Statistics, 2005. Posterior Propriety and Admissibility of Hyperpriors in Normal Hierarchical Models. By James O. Berger, William Strawderman and Dejun Tang; Duke University and SAMSI, Rutgers University, and Novartis Pharmaceuticals. Hierarchical modeling is wonderful and here to stay, but hyperparameter priors are often chosen in a casual fashion. Unfortunately, as the number of hyperparameters grows, the effects of casual choices can multiply, leading to considerably inferior performance. As an extreme, but not uncommon, example, use of the wrong hyperparameter priors can even lead to impropriety of the posterior. For exchangeable hierarchical multivariate normal models, we first determine when a standard class of hierarchical priors results in proper or improper posteriors. We next determine which elements of this class lead to admissible estimators of the mean under quadratic loss; such considerations provide one useful guideline for choice among hierarchical priors. Finally, computational issues with the resulting posterior distributions are addressed. 1. Introduction. 1.1. The model and the problems. Consider the block multivariate normal situation (sometimes called the "matrix of means problem") specified by the following hierarchical Bayesian model: (1) X ∼ Np(θ, I), θ ∼ Np(B, Σπ), where X and θ are partitioned into m blocks X1, ..., Xm and θ1, ..., θm, each a p × 1 vector. AMS 2000 subject classifications: Primary 62C15; secondary 62F15. Key words and phrases: Covariance matrix, quadratic loss, frequentist risk, posterior impropriety, objective priors, Markov chain Monte Carlo.
  • Stat 3701 Lecture Notes: Statistical Models, Part II
Stat 3701 Lecture Notes: Statistical Models, Part II. Charles J. Geyer, August 12, 2020. 1 License: This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). 2 R: The version of R used to make this document is 4.0.2. The version of the rmarkdown package used to make this document is 2.3. The version of the alabama package used to make this document is 2015.3.1. The version of the numDeriv package used to make this document is 2016.8.1.1. 3 General Statistical Models: The time has come to learn some theory. This is a preview of STAT 5101–5102. We don’t need to learn much theory. We will proceed with the general strategy of all introductory statistics: don’t derive anything, just tell you stuff. 3.1 Probability Models. 3.1.1 Kinds of Probability Theory. There are two kinds of probability theory. There is the kind you will learn in STAT 5101–5102 (or 4101–4102, which is more or less the same except for leaving out the multivariable stuff). The material covered goes back hundreds of years, some of it discovered in the early 1600’s. And there is the kind you would learn in MATH 8651–8652 if you could take it, which no undergraduate does (it is a very hard Ph.D. level math course). The material covered goes back to 1933. It is all “new math”. These two kinds of probability can be called classical and measure-theoretic, respectively.
  • Observed Information Matrix for MUB Models
Quaderni di Statistica, Vol. 8, 2006. Observed information matrix for MUB models. Domenico Piccolo, Dipartimento di Scienze Statistiche, Università degli Studi di Napoli Federico II. Summary: In this paper we derive the observed information matrix for MUB models, without and with covariates. After a review of this class of models for ordinal data and of the E-M algorithms, we derive some closed forms for the asymptotic variance-covariance matrix of the maximum likelihood estimators of MUB models. Also, some new results about feeling and uncertainty parameters are presented. The work lingers over the computational aspects of the procedure with explicit reference to a matrix-oriented language. Finally, the finite sample performance of the asymptotic results is investigated by means of a simulation experiment. General considerations aimed at extending the application of MUB models conclude the paper. Keywords: Ordinal data, MUB models, Observed information matrix. 1. Introduction. "Choosing to do or not to do something is a ubiquitous state of activity in all societies" (Louvier et al., 2000, 1). Indeed, "Almost without exception, every human beings undertake involves a choice (consciously or sub-consciously), including the choice not to choose" (Hensher et al., 2005, xxiii). From an operational point of view, the finiteness of alternatives limits the analysis to discrete choices (Train, 2003) and the statistical interest in this area is mainly devoted to generate probability structures adequate to interpret, fit and forecast human choices. Generally, the nature of the choices is qualitative (or categorical), and classical statistical models introduced for continuous phenomena are neither suitable nor effective.
  • A New Hyperprior Distribution for Bayesian Regression Model with Application in Genomics
A New Hyperprior Distribution for Bayesian Regression Model with Application in Genomics. Renato Rodrigues Silva, Institute of Mathematics and Statistics, Federal University of Goiás, Goiânia, Goiás, Brazil. (bioRxiv preprint, doi: https://doi.org/10.1101/102244, posted January 22, 2017.) Abstract: In regression analysis there are situations where the model has more predictor variables than observations of the dependent variable, resulting in the problem known as "large p small n". In the last fifteen years, this problem has received a lot of attention, especially in the genome-wide context. Here we propose the Bayes H model, a Bayesian regression model using a mixture of two scaled inverse chi-square distributions as the hyperprior distribution of the variance for each regression coefficient. This model is implemented in the R package BayesH. Introduction: In regression analysis there are situations where the model has more predictor variables than observations of the dependent variable, resulting in the problem known as "large p small n" [1]. To deal with this problem, there already exist methods such as ridge regression [2], least absolute shrinkage and selection operator (LASSO) regression [3], bridge regression [4], smoothly clipped absolute deviation (SCAD) regression [5] and others. This class of regression models is known in the literature as regression with penalized likelihood [6]. In the Bayesian paradigm, there are also some methods proposed, such as stochastic search variable selection [7] and the Bayesian LASSO [8].
  • Bayesian Learning
SC4/SM8 Advanced Topics in Statistical Machine Learning: Bayesian Learning. Dino Sejdinovic, Department of Statistics, Oxford. Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/. The Bayesian Learning Framework: Bayesian learning treats the parameter vector θ as a random variable; the process of learning is then computation of the posterior distribution p(θ | D). In addition to the likelihood p(D | θ) we need to specify a prior distribution p(θ). The posterior distribution is then given by the Bayes Theorem: p(θ | D) = p(D | θ) p(θ) / p(D). Likelihood: p(D | θ). Posterior: p(θ | D). Prior: p(θ). Marginal likelihood: p(D) = ∫_Θ p(D | θ) p(θ) dθ. Summarizing the posterior: posterior mode θ̂_MAP = argmax_{θ ∈ Θ} p(θ | D) (maximum a posteriori); posterior mean θ̂_mean = E[θ | D]; posterior variance Var[θ | D]. Bayesian Inference on the Categorical Distribution: Suppose we observe data y1, ..., yn with yi ∈ {1, ..., K}, and model them as i.i.d. with pmf π = (π1, ..., πK): p(D | π) = ∏_{i=1}^n π_{yi} = ∏_{k=1}^K πk^{nk}, with nk = Σ_{i=1}^n 1(yi = k) and πk > 0, Σ_{k=1}^K πk = 1. The conjugate prior on π is the Dirichlet distribution Dir(α1, ..., αK) with parameters αk > 0 and density p(π) = [Γ(Σ_{k=1}^K αk) / ∏_{k=1}^K Γ(αk)] ∏_{k=1}^K πk^(αk − 1) on the probability simplex {π : πk > 0, Σ_{k=1}^K πk = 1}.
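The Dirichlet-categorical update described above reduces to adding category counts to the prior parameters; a tiny sketch (assuming NumPy, with made-up counts) follows.

```python
import numpy as np

# Dirichlet-categorical conjugacy: prior Dir(alpha), counts n_k, posterior Dir(alpha + n)
alpha = np.array([1.0, 1.0, 1.0])             # symmetric prior, K = 3
y = np.array([1, 3, 1, 2, 1, 1, 3, 1])        # hypothetical observations in {1, 2, 3}
counts = np.bincount(y, minlength=4)[1:]      # n_k for k = 1, 2, 3
alpha_post = alpha + counts
print(alpha_post, alpha_post / alpha_post.sum())   # posterior parameters and E[pi | D]
```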
  • Method for Computation of the Fisher Information Matrix in the Expectation–Maximization Algorithm
    1 August 2016 Method for Computation of the Fisher Information Matrix in the Expectation–Maximization Algorithm Lingyao Meng Department of Applied Mathematics & Statistics The Johns Hopkins University Baltimore, Maryland 21218 Abstract The expectation–maximization (EM) algorithm is an iterative computational method to calculate the maximum likelihood estimators (MLEs) from the sample data. It converts a complicated one-time calculation for the MLE of the incomplete data to a series of relatively simple calculations for the MLEs of the complete data. When the MLE is available, we naturally want the Fisher information matrix (FIM) of unknown parameters. The FIM is, in fact, a good measure of the amount of information a sample of data provides and can be used to determine the lower bound of the variance and the asymptotic variance of the estimators. However, one of the limitations of the EM is that the FIM is not an automatic by-product of the algorithm. In this paper, we review some basic ideas of the EM and the FIM. Then we construct a simple Monte Carlo-based method requiring only the gradient values of the function we obtain from the E step and basic operations. Finally, we conduct theoretical analysis and numerical examples to show the efficiency of our method. The key part of our method is to utilize the simultaneous perturbation stochastic approximation method to approximate the Hessian matrix from the gradient of the conditional expectation of the complete-data log-likelihood function. Key words: Fisher information matrix, EM algorithm, Monte Carlo, Simultaneous perturbation stochastic approximation 1. Introduction The expectation–maximization (EM) algorithm introduced by Dempster, Laird and Rubin in 1977 is a well-known method to compute the MLE iteratively from the observed data.
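The numerical core of the described approach is a simultaneous-perturbation estimate of a Hessian from gradient evaluations. The sketch below is not the paper's code; it averages the standard symmetrized two-sided SPSA estimate and checks it on a toy quadratic, where in the EM setting the gradient would instead come from the E step's conditional expectation.

```python
import numpy as np

def spsa_hessian(grad, theta, n_rep=500, c=1e-3, rng=None):
    """Average of symmetrized simultaneous-perturbation estimates of the Hessian
    of the function whose gradient is `grad`, evaluated at `theta`."""
    rng = np.random.default_rng() if rng is None else rng
    p = theta.size
    H = np.zeros((p, p))
    for _ in range(n_rep):
        delta = rng.choice([-1.0, 1.0], size=p)         # simultaneous perturbation
        dg = grad(theta + c * delta) - grad(theta - c * delta)
        outer = np.outer(dg / (2.0 * c), 1.0 / delta)
        H += 0.5 * (outer + outer.T)                    # symmetrize each estimate
    return H / n_rep

# Toy check on a quadratic whose Hessian is known to be A
A = np.array([[2.0, 0.3], [0.3, 1.0]])
grad = lambda th: A @ th
print(spsa_hessian(grad, np.array([0.5, -0.2])))        # approximately A
```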
  • 9 Introduction to Hierarchical Models
9 Introduction to Hierarchical Models. One of the important features of a Bayesian approach is the relative ease with which hierarchical models can be constructed and estimated using Gibbs sampling. In fact, one of the key reasons for the recent growth in the use of Bayesian methods in the social sciences is that the use of hierarchical models has also increased dramatically in the last two decades. Hierarchical models serve two purposes. One purpose is methodological; the other is substantive. Methodologically, when units of analysis are drawn from clusters within a population (communities, neighborhoods, city blocks, etc.), they can no longer be considered independent. Individuals who come from the same cluster will be more similar to each other than they will be to individuals from other clusters. Therefore, unobserved variables may induce statistical dependence between observations within clusters that may be uncaptured by covariates within the model, violating a key assumption of maximum likelihood estimation as it is typically conducted when independence of errors is assumed. Recall that a likelihood function, when observations are independent, is simply the product of the density functions for each observation taken over all the observations. However, when independence does not hold, we cannot construct the likelihood as simply. Thus, one reason for constructing hierarchical models is to compensate for the biases—largely in the standard errors—that are introduced when the independence assumption is violated. See Ezell, Land, and Cohen (2003) for a thorough review of the approaches that have been used to correct standard errors in hazard modeling applications with repeated events, one class of models in which repeated measurement yields hierarchical clustering.
  • Model Checking and Improvement
CHAPTER 6 Model checking and improvement. 6.1 The place of model checking in applied Bayesian statistics. Once we have accomplished the first two steps of a Bayesian analysis—constructing a probability model and computing (typically using simulation) the posterior distribution of all estimands—we should not ignore the relatively easy step of assessing the fit of the model to the data and to our substantive knowledge. It is difficult to include in a probability distribution all of one’s knowledge about a problem, and so it is wise to investigate what aspects of reality are not captured by the model. Checking the model is crucial to statistical analysis. Bayesian prior-to-posterior inferences assume the whole structure of a probability model and can yield misleading inferences when the model is poor. A good Bayesian analysis, therefore, should include at least some check of the adequacy of the fit of the model to the data and the plausibility of the model for the purposes for which the model will be used. This is sometimes discussed as a problem of sensitivity to the prior distribution, but in practice the likelihood model is typically just as suspect; throughout, we use ‘model’ to encompass the sampling distribution, the prior distribution, any hierarchical structure, and issues such as which explanatory variables have been included in a regression. Sensitivity analysis and model improvement. It is typically the case that more than one reasonable probability model can provide an adequate fit to the data in a scientific problem. The basic question of a sensitivity analysis is: how much do posterior inferences change when other reasonable probability models are used in place of the present model? Other reasonable models may differ substantially from the present model in the prior specification, the sampling distribution, or in what information is included (for example, predictor variables in a regression).