Bayesian Methods for Function Estimation

Total Page:16

File Type:pdf, Size:1020Kb

Bayesian Methods for Function Estimation hs25 v.2005/04/21 Prn:19/05/2005; 15:11 F:hs25013.tex; VTEX/VJ p. 1 aid: 25013 pii: S0169-7161(05)25013-7 docsubty: REV Handbook of Statistics, Vol. 25 ISSN: 0169-7161 13 © 2005 Elsevier B.V. All rights reserved. 1 1 DOI 10.1016/S0169-7161(05)25013-7 2 2 3 3 4 4 5 5 6 Bayesian Methods for Function Estimation 6 7 7 8 8 9 9 10 Nidhan Choudhuri, Subhashis Ghosal and Anindya Roy 10 11 11 12 12 Abstract 13 13 14 14 15 Keywords: consistency; convergence rate; Dirichlet process; density estimation; 15 16 Markov chain Monte Carlo; posterior distribution; regression function; spectral den- 16 17 sity; transition density 17 18 18 19 19 20 20 21 1. Introduction 21 22 22 23 Nonparametric and semiparametric statistical models are increasingly replacing para- 23 24 metric models, for the latter’s lack of sufficient flexibility to address a wide vari- 24 25 ety of data. A nonparametric or semiparametric model involves at least one infinite- 25 26 dimensional parameter, usually a function, and hence may also be referred to as an 26 27 infinite-dimensional model. Functions of common interest, among many others, include 27 28 the cumulative distribution function, density function, regression function, hazard rate, 28 29 transition density of a Markov process, and spectral density of a time series. While fre- 29 30 quentist methods for nonparametric estimation have been flourishing for many of these 30 31 problems, nonparametric Bayesian estimation methods had been relatively less devel- 31 32 oped. 32 33 Besides philosophical reasons, there are some practical advantages of the Bayesian 33 34 approach. On the one hand, the Bayesian approach allows one to reflect ones prior be- 34 35 liefs into the analysis. On the other hand, the Bayesian approach is straightforward in 35 36 principle where inference is based on the posterior distribution only. Subjective elici- 36 37 tation of priors is relatively simple in a parametric framework, and in the absence of 37 38 any concrete knowledge, there are many default mechanisms for prior specification. 38 39 However, the recent popularity of Bayesian analysis comes from the availability of 39 40 various Markov chain Monte Carlo (MCMC) algorithms that make the computation 40 41 feasible with today’s computers in almost every parametric problem. Prediction, which 41 42 is sometimes the primary objective of a statistical analysis, is solved most naturally if 42 43 one follows the Bayesian approach. Many non-Bayesian methods, including the max- 43 44 imum likelihood estimator (MLE), can have very unnatural behavior (such as staying 44 45 on the boundary with high probability) when the parameter space is restricted, while 45 377 hs25 v.2005/04/21 Prn:19/05/2005; 15:11 F:hs25013.tex; VTEX/VJ p. 2 aid: 25013 pii: S0169-7161(05)25013-7 docsubty: REV 378 N. Choudhuri, S. Ghosal and A. Roy 1 a Bayesian estimator does not suffer from this drawback. Besides, the optimality of a 1 2 parametric Bayesian procedure is often justified through large sample as well as finite 2 3 sample admissibility properties. 3 4 The difficulties for a Bayesian analysis in a nonparametric framework is threefold. 4 5 First, a subjective elicitation of a prior is not possible due to the vastness of the para- 5 6 meter space and the construction of a default prior becomes difficult mainly due to the 6 7 absence of the Lebesgue measure. Secondly, common MCMC techniques do not di- 7 8 rectly apply as the parameter space is infinite-dimensional. Sampling from the posterior 8 9 distribution often requires innovative MCMC algorithms that depend on the problem at 9 10 hand as well as the prior given on the functional parameter. Some of these techniques 10 11 include the introduction of latent variables, data augmentation and reparametrization of 11 12 the parameter space. Thus, the problem of prior elicitation cannot be separated from the 12 13 computational issues. 13 14 When a statistical method is developed, particular attention should be given to the 14 15 quality of the corresponding solution. Of the many different criteria, asymptotic con- 15 16 sistency and rate of convergence are perhaps among the least disputed. Consistency 16 17 may be thought of as a validation of the method used by the Bayesian. Consider an 17 18 18 imaginary experiment where an experimenter generates observations from a given sto- 19 19 chastic model with some value of the parameter and presents the data to a Bayesian 20 20 without revealing the true value of the parameter. If enough information is provided in 21 21 the form of a large number of observations, the Bayesian’s assessment of the unknown 22 22 parameter should be close to the true value of it. Another reason to study consistency is 23 23 its relationship with robustness with respect to the choice of the prior. Due to the lack 24 24 of complete faith in the prior, we should require that at least eventually, the data over- 25 25 26 rides the prior opinion. Alternatively two Bayesians, with two different priors, presented 26 27 with the same data eventually must agree. This large sample “merging of opinions” 27 28 is equivalent to consistency (Blackwell and Dubins, 1962; Diaconis and Freedman, 28 29 1986a, 1986b; Ghosh et al., 1994). For virtually all finite-dimensional problems, the 29 30 posterior distribution is consistent (Ibragimov and Has’minskii, 1981; Le Cam, 1986; 30 31 Ghosal et al., 1995) if the prior does not rule out the true value. This is roughly a 31 32 consequence of the fact that the likelihood is highly peaked near the true value of the 32 33 parameter if the sample size is large. However, for infinite-dimensional problems, such 33 34 a conclusion is false (Freedman, 1963; Diaconis and Freedman, 1986a, 1986b; Doss, 34 35 1985a, 1985b; Kim and Lee, 2001). Thus posterior consistency must be verified before 35 36 using a prior. 36 37 In this chapter, we review Bayesian methods for some important curve estimation 37 38 problems. There are several good reviews available in the literature such as Hjort (1996, 38 39 2003), Wasserman (1998), Ghosal et al. (1999a), the monograph of Ghosh and Ra- 39 40 mamoorthi (2003) and several chapters in this volume. We omit many details which 40 41 may be found from these sources. We focus on three different aspects of the problem: 41 42 prior specification, computation and asymptotic properties of the posterior distribution. 42 43 In Section 2, we describe various priors on infinite-dimensional spaces. General results 43 44 on posterior consistency and rate of convergence are reviewed in Section 3. Specific 44 45 curve estimation problems are addressed in the subsequent sections. 45 hs25 v.2005/04/21 Prn:19/05/2005; 15:11 F:hs25013.tex; VTEX/VJ p. 3 aid: 25013 pii: S0169-7161(05)25013-7 docsubty: REV Bayesian methods for function estimation 379 1 2. Priors on infinite-dimensional spaces 1 2 2 3 A well accepted criterion for the choice of a nonparametric prior is that the prior has 3 4 a large or full topological support. Intuitively, such a prior can reach every corner of 4 5 the parameter space and thus can be expected to have a consistent posterior. More flex- 5 6 ible models have higher complexity and hence the process of prior elicitation becomes 6 7 more complex. Priors are usually constructed from the consideration of mathematical 7 8 tractability, feasibility of computation, and good large sample behavior. The form of the 8 9 prior is chosen according to some default mechanism while the key hyper-parameters 9 10 are chosen to reflect any prior beliefs. A prior on a function space may be thought of 10 11 as a stochastic process taking values in the given function space. Thus, a prior may be 11 12 specified by describing a sampling scheme that generate random function with desired 12 13 properties or they can be specified by describing the finite-dimensional laws. An advan- 13 14 tage of the first approach is that the existence of the prior measure is automatic, while 14 15 for the latter, the nontrivial proposition of existence needs to be established. Often the 15 16 function space is approximated by a sequence of sieves in such a way that it is easier to 16 17 put a prior on these sieves. A prior on the entire space is then described by letting the 17 18 index of the sieve vary with the sample size, or by putting a further prior on the index 18 19 thus leading to a hierarchical mixture prior. Here we describe some general methods of 19 20 prior construction on function spaces. 20 21 21 22 2.1. Dirichlet process 22 23 Dirichlet processes were introduced by Ferguson (1973) as prior distributions on the 23 24 space of probability measures on a given measurable space (X, B).LetM>0 and G be 24 25 a probability measure on (X, B). A Dirichlet process on (X, B) with parameters (M, G) 25 26 is a random probability measure P which assigns a number P(B)to every B ∈ B such 26 27 that 27 28 28 [ ] 29 (i) P(B)is a measurable 0, 1 -valued random variable; 29 B 30 (ii) each realization of P is a probability measure on (X, ); 30 { } 31 (iii) for each measurable finite partition B1,...,Bk of X, the joint distribution of 31 32 the vector (P (B1),...,P(Bk)) on the k-dimensional unit simplex has Dirichlet 32 (k; MG(B ), .
Recommended publications
  • ACTS 4304 FORMULA SUMMARY Lesson 1: Basic Probability Summary of Probability Concepts Probability Functions
    ACTS 4304 FORMULA SUMMARY Lesson 1: Basic Probability Summary of Probability Concepts Probability Functions F (x) = P r(X ≤ x) S(x) = 1 − F (x) dF (x) f(x) = dx H(x) = − ln S(x) dH(x) f(x) h(x) = = dx S(x) Functions of random variables Z 1 Expected Value E[g(x)] = g(x)f(x)dx −∞ 0 n n-th raw moment µn = E[X ] n n-th central moment µn = E[(X − µ) ] Variance σ2 = E[(X − µ)2] = E[X2] − µ2 µ µ0 − 3µ0 µ + 2µ3 Skewness γ = 3 = 3 2 1 σ3 σ3 µ µ0 − 4µ0 µ + 6µ0 µ2 − 3µ4 Kurtosis γ = 4 = 4 3 2 2 σ4 σ4 Moment generating function M(t) = E[etX ] Probability generating function P (z) = E[zX ] More concepts • Standard deviation (σ) is positive square root of variance • Coefficient of variation is CV = σ/µ • 100p-th percentile π is any point satisfying F (π−) ≤ p and F (π) ≥ p. If F is continuous, it is the unique point satisfying F (π) = p • Median is 50-th percentile; n-th quartile is 25n-th percentile • Mode is x which maximizes f(x) (n) n (n) • MX (0) = E[X ], where M is the n-th derivative (n) PX (0) • n! = P r(X = n) (n) • PX (1) is the n-th factorial moment of X. Bayes' Theorem P r(BjA)P r(A) P r(AjB) = P r(B) fY (yjx)fX (x) fX (xjy) = fY (y) Law of total probability 2 If Bi is a set of exhaustive (in other words, P r([iBi) = 1) and mutually exclusive (in other words P r(Bi \ Bj) = 0 for i 6= j) events, then for any event A, X X P r(A) = P r(A \ Bi) = P r(Bi)P r(AjBi) i i Correspondingly, for continuous distributions, Z P r(A) = P r(Ajx)f(x)dx Conditional Expectation Formula EX [X] = EY [EX [XjY ]] 3 Lesson 2: Parametric Distributions Forms of probability
    [Show full text]
  • STAT 830 the Basics of Nonparametric Models The
    STAT 830 The basics of nonparametric models The Empirical Distribution Function { EDF The most common interpretation of probability is that the probability of an event is the long run relative frequency of that event when the basic experiment is repeated over and over independently. So, for instance, if X is a random variable then P (X ≤ x) should be the fraction of X values which turn out to be no more than x in a long sequence of trials. In general an empirical probability or expected value is just such a fraction or average computed from the data. To make this precise, suppose we have a sample X1;:::;Xn of iid real valued random variables. Then we make the following definitions: Definition: The empirical distribution function, or EDF, is n 1 X F^ (x) = 1(X ≤ x): n n i i=1 This is a cumulative distribution function. It is an estimate of F , the cdf of the Xs. People also speak of the empirical distribution of the sample: n 1 X P^(A) = 1(X 2 A) n i i=1 ^ This is the probability distribution whose cdf is Fn. ^ Now we consider the qualities of Fn as an estimate, the standard error of the estimate, the estimated standard error, confidence intervals, simultaneous confidence intervals and so on. To begin with we describe the best known summaries of the quality of an estimator: bias, variance, mean squared error and root mean squared error. Bias, variance, MSE and RMSE There are many ways to judge the quality of estimates of a parameter φ; all of them focus on the distribution of the estimation error φ^−φ.
    [Show full text]
  • Estimation of Rate of Rare Occurrence of Events with Empirical Bayes
    Estimating Rate of Occurrence of Rare Events with Empirical Bayes: A Railway Application John Quigley (Corresponding author), Tim Bedford and Lesley Walls Department of Management Science University of Strathclyde, Glasgow, G1 1QE, SCOTLAND [email protected] Tel +44 141 548 3152 Fax +44 141 552 6686 Abstract Classical approaches to estimating the rate of occurrence of events perform poorly when data are few. Maximum Likelihood Estimators result in overly optimistic point estimates of zero for situations where there have been no events. Alternative empirical based approaches have been proposed based on median estimators or non- informative prior distributions. While these alternatives offer an improvement over point estimates of zero, they can be overly conservative. Empirical Bayes procedures offer an unbiased approach through pooling data across different hazards to support stronger statistical inference. This paper considers the application of Empirical Bayes to high consequence low frequency events, where estimates are required for risk mitigation decision support such as As Low As Reasonably Possible (ALARP). A summary of Empirical Bayes methods is given and the choices of estimation procedures to obtain interval estimates are discussed. The approaches illustrated within the case study are based on the estimation of the rate of occurrence of train derailments within the UK. The usefulness of Empirical Bayes within this context is discussed. 1 Key words: Empirical Bayes, Credibility Theory, Zero-failure, Estimation, Homogeneous Poisson Process 1. Introduction The Safety Risk Model (SRM) is a large scale Fault and Event Tree model used to assess risk on the UK Railways. The objectives of the SRM (see Risk Profile Bulletin [1]) are to provide an understanding of the nature of the current risk on the mainline railway and risk information and profiles relating to the mainline railway.
    [Show full text]
  • Long-Term Actuarial Mathematics Study Note
    EDUCATION COMMITTEE OF THE SOCIETY OF ACTUARIES LONG-TERM ACTUARIAL MATHEMATICS STUDY NOTE CHAPTERS 10–12 OF LOSS MODELS, FROM DATA TO DECISIONS, FIFTH EDITION by Stuart A. Klugman, Harry H. Panjer and Gordon E. Willmot Copyright © 2018. Posted with permission of the authors. The Education Committee provides study notes to persons preparing for the examinations of the Society of Actuaries. They are intended to acquaint candidates with some of the theoretical and practical considerations involved in the various subjects. While varying opinions are presented where appropriate, limits on the length of the material and other considerations sometimes prevent the inclusion of all possible opinions. These study notes do not, however, represent any official opinion, interpretations or endorsement of the Society of Actuaries or its Education Committee. The Society is grateful to the authors for their contributions in preparing the study notes. LTAM-22-18 Printed in U.S.A. CONTENTS Review of mathematical statistics 3 10.1 Introduction and four data sets 3 10.2 Point estimation 5 10.2.1 Introduction 5 10.2.2 Measures of quality 6 10.2.3 Exercises 16 10.3 Interval estimation 17 10.3.1 Exercises 19 10.4 Construction of Parametric Estimators 20 10.4.1 Method of moments and percentile matching 20 10.4.2 Exercises 23 10.5 Tests of hypotheses 26 10.5.1 Exercise 29 10.6 Solutions to Exercises 30 Maximum likelihood estimation 39 11.1 Introduction 39 11.2 Individual data 41 11.2.1 Exercises 42 11.3 Grouped data 45 11.3.1 Exercises 46 i ii CONTENTS 11.4 Truncated
    [Show full text]
  • Empirical Information Metrics for Prediction Power and Experiment Planning
    Information 2011, 2, 17-40; doi:10.3390/info2010017 OPEN ACCESS information ISSN 2078-2489 www.mdpi.com/journal/information Article Empirical Information Metrics for Prediction Power and Experiment Planning Christopher Lee 1;2;3 1 Department of Chemistry & Biochemistry, University of California, Los Angeles, CA 90095, USA 2 Department of Computer Science, University of California, Los Angeles, CA 90095, USA 3 Institute for Genomics & Proteomics, University of California, Los Angeles, CA 90095, USA; E-Mail: [email protected]; Tel: 310-825-7374; Fax: 310-206-7286 Received: 8 October 2010; in revised form: 30 November 2010 / Accepted: 21 December 2010 / Published: 11 January 2011 Abstract: In principle, information theory could provide useful metrics for statistical inference. In practice this is impeded by divergent assumptions: Information theory assumes the joint distribution of variables of interest is known, whereas in statistical inference it is hidden and is the goal of inference. To integrate these approaches we note a common theme they share, namely the measurement of prediction power. We generalize this concept as an information metric, subject to several requirements: Calculation of the metric must be objective or model-free; unbiased; convergent; probabilistically bounded; and low in computational complexity. Unfortunately, widely used model selection metrics such as Maximum Likelihood, the Akaike Information Criterion and Bayesian Information Criterion do not necessarily meet all these requirements. We define four distinct empirical information metrics measured via sampling, with explicit Law of Large Numbers convergence guarantees, which meet these requirements: Ie, the empirical information, a measure of average prediction power; Ib, the overfitting bias information, which measures selection bias in the modeling procedure; Ip, the potential information, which measures the total remaining information in the observations not yet discovered by the model; and Im, the model information, which measures the model’s extrapolation prediction power.
    [Show full text]