Bayes Estimator Recap - Example


Biostatistics 602 - Statistical Inference
Lecture 16: Evaluation of Bayes Estimator
Hyun Min Kang
March 14th, 2013

Last Lecture
• What is a Bayes Estimator?
• Is a Bayes Estimator the best unbiased estimator?
• Compared to other estimators, what are the advantages of a Bayes Estimator?
• What is a conjugate family?
• What are the conjugate families of the Binomial, Poisson, and Normal distributions?

Recap - Bayes Estimator
• $\theta$ : parameter
• $\pi(\theta)$ : prior distribution
• $X|\theta \sim f_X(x|\theta)$ : sampling distribution
• Posterior distribution of $\theta|x$:
$$\pi(\theta|x) = \frac{f_X(x|\theta)\pi(\theta)}{m(x)} = \frac{\text{Joint}}{\text{Marginal}}$$
where $m(x) = \int f(x|\theta)\pi(\theta)\,d\theta$ (Bayes' rule).
• The Bayes estimator of $\theta$ is
$$E(\theta|x) = \int_{\theta\in\Omega} \theta\,\pi(\theta|x)\,d\theta$$

Recap - Example
• $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \text{Bernoulli}(p)$
• $\pi(p) \sim \text{Beta}(\alpha, \beta)$
• Prior guess: $\hat p = \frac{\alpha}{\alpha+\beta}$
• Posterior distribution: $\pi(p|x) \sim \text{Beta}(\sum x_i + \alpha,\; n - \sum x_i + \beta)$
• Bayes estimator:
$$\hat p = \frac{\alpha + \sum x_i}{\alpha+\beta+n} = \frac{\sum x_i}{n}\cdot\frac{n}{\alpha+\beta+n} + \frac{\alpha}{\alpha+\beta}\cdot\frac{\alpha+\beta}{\alpha+\beta+n}$$

Loss Function Optimality
Let $L(\theta, \hat\theta)$ be a function of $\theta$ and $\hat\theta$, and let $\hat\theta$ be an estimator.
• If $\hat\theta = \theta$, the estimator makes a correct decision and the loss is 0.
• If $\hat\theta \ne \theta$, it makes a mistake and the loss is not 0.
The mean squared error (MSE) is the average loss under squared error loss,
$$\text{MSE}(\hat\theta) = E[\hat\theta - \theta]^2 = E[L(\theta, \hat\theta)],$$
which is the expectation of the loss if $\hat\theta$ is used to estimate $\theta$.

Loss Functions
• Squared error loss: $L(\theta, \hat\theta) = (\hat\theta - \theta)^2$
• Absolute error loss: $L(\theta, \hat\theta) = |\hat\theta - \theta|$
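The Bernoulli/Beta recap above can be checked numerically. A minimal sketch (the data and the hyperparameters $\alpha$, $\beta$ are illustrative choices, not values from the lecture):

```python
# Beta(alpha, beta) prior with Bernoulli data: the Bayes estimator
# (posterior mean) is a weighted average of the sample mean and the
# prior mean, with weights n/(alpha+beta+n) and (alpha+beta)/(alpha+beta+n).
alpha, beta = 2.0, 3.0                      # illustrative prior hyperparameters
x = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]          # illustrative Bernoulli sample
n, s = len(x), sum(x)

posterior = (s + alpha, n - s + beta)       # Beta(sum(x)+alpha, n-sum(x)+beta)
p_bayes = (s + alpha) / (alpha + beta + n)  # posterior mean

# The same estimator written as a weighted average of data and prior:
p_mle, p_prior = s / n, alpha / (alpha + beta)
w = n / (alpha + beta + n)
p_weighted = w * p_mle + (1 - w) * p_prior

print(posterior, p_bayes, p_weighted)
```

As the sample size $n$ grows, the weight $w$ on the sample mean approaches 1 and the prior's influence fades.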
• A loss that penalizes overestimation more than underestimation:
$$L(\theta, \hat\theta) = (\hat\theta - \theta)^2 I(\hat\theta < \theta) + 10(\hat\theta - \theta)^2 I(\hat\theta \ge \theta)$$

Risk Function - Average Loss
$$R(\theta, \hat\theta) = E[L(\theta, \hat\theta(X))\,|\,\theta]$$
If $L(\theta, \hat\theta) = (\hat\theta - \theta)^2$, then $R(\theta, \hat\theta)$ is the MSE. An estimator with smaller $R(\theta, \hat\theta)$ is preferred.

Definition: Bayes Risk
The Bayes risk is the average risk across all values of $\theta$, given the prior $\pi(\theta)$:
$$\int_\Omega R(\theta, \hat\theta)\pi(\theta)\,d\theta$$
The Bayes rule with respect to a prior $\pi$ is the optimal estimator with respect to the Bayes risk, i.e. the one that minimizes the Bayes risk.

Alternative Definition of Bayes Risk
$$\begin{aligned}
\int_\Omega R(\theta, \hat\theta)\pi(\theta)\,d\theta
&= \int_\Omega E[L(\theta, \hat\theta(X))\,|\,\theta]\,\pi(\theta)\,d\theta\\
&= \int_\Omega \left[\int_{\mathcal X} f(x|\theta)L(\theta, \hat\theta(x))\,dx\right]\pi(\theta)\,d\theta\\
&= \int_\Omega \left[\int_{\mathcal X} f(x|\theta)L(\theta, \hat\theta(x))\pi(\theta)\,dx\right]d\theta\\
&= \int_\Omega \left[\int_{\mathcal X} \pi(\theta|x)m(x)L(\theta, \hat\theta(x))\,dx\right]d\theta\\
&= \int_{\mathcal X}\left[\int_\Omega L(\theta, \hat\theta(x))\pi(\theta|x)\,d\theta\right] m(x)\,dx
\end{aligned}$$

Posterior Expected Loss
The posterior expected loss is defined as
$$\int_\Omega \pi(\theta|x)L(\theta, \hat\theta(x))\,d\theta$$
An alternative definition of the Bayes rule estimator is the estimator that minimizes the posterior expected loss.

Bayes Estimator Based on Squared Error Loss
With $L(\hat\theta, \theta) = (\hat\theta - \theta)^2$,
$$\text{Posterior expected loss} = \int_\Omega (\theta - \hat\theta)^2\pi(\theta|x)\,d\theta = E[(\theta - \hat\theta)^2\,|\,X = x]$$
So the goal is to minimize $E[(\theta - \hat\theta)^2\,|\,X = x]$:
$$\begin{aligned}
E[(\theta - \hat\theta)^2\,|\,X = x]
&= E[(\theta - E(\theta|x) + E(\theta|x) - \hat\theta)^2\,|\,X = x]\\
&= E[(\theta - E(\theta|x))^2\,|\,X = x] + E[(E(\theta|x) - \hat\theta)^2\,|\,X = x]\\
&= E[(\theta - E(\theta|x))^2\,|\,X = x] + \left(E(\theta|x) - \hat\theta\right)^2
\end{aligned}$$
which is minimized when $\hat\theta = E(\theta|x)$.
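The fact that the posterior mean minimizes the posterior expected squared error loss can be checked numerically. A sketch using an illustrative Beta(9, 6) posterior (any Beta posterior would do) and a simple Riemann sum for the integral:

```python
from math import gamma

# Posterior expected squared error loss E[(theta - t)^2 | x] under an
# illustrative Beta(a, b) posterior, approximated by a midpoint Riemann
# sum; the minimizer over a grid of candidates t should be the posterior
# mean a/(a+b).
a, b = 9.0, 6.0

def beta_pdf(p):
    return gamma(a + b) / (gamma(a) * gamma(b)) * p**(a - 1) * (1 - p)**(b - 1)

m = 2000
grid = [(i + 0.5) / m for i in range(m)]        # midpoints in (0, 1)
pdf_vals = [beta_pdf(p) for p in grid]

def post_exp_sq_loss(t):
    return sum((p - t)**2 * w for p, w in zip(grid, pdf_vals)) / m

candidates = [i / 100 for i in range(1, 100)]
t_best = min(candidates, key=post_exp_sq_loss)
post_mean = a / (a + b)                          # 0.6
print(t_best, post_mean)
```

The grid search recovers the posterior mean, matching the algebraic decomposition above.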
Summary So Far
• Loss function $L(\theta, \hat\theta)$: e.g. $(\hat\theta - \theta)^2$, $|\hat\theta - \theta|$.
• Risk function $R(\theta, \hat\theta)$: the average of $L(\theta, \hat\theta)$ across all $x \in \mathcal X$. For squared error loss, the risk function is the same as the MSE.
• Bayes risk: the average risk across all $\theta$, based on the prior of $\theta$. Equivalently, the average posterior expected loss across all $x \in \mathcal X$.
• Bayes estimator based on squared error loss: $\hat\theta = E[\theta|x]$, which both minimizes the Bayes risk and minimizes the posterior expected loss.

Bayes Estimator Based on Absolute Error Loss
Suppose that $L(\theta, \hat\theta) = |\hat\theta - \theta|$. The posterior expected loss is
$$E[L(\theta, \hat\theta(x))] = \int_\Omega |\theta - \hat\theta(x)|\,\pi(\theta|x)\,d\theta = E[\,|\theta - \hat\theta|\;\big|\;X = x]
= \int_{-\infty}^{\hat\theta} (\hat\theta - \theta)\pi(\theta|x)\,d\theta + \int_{\hat\theta}^{\infty} (\theta - \hat\theta)\pi(\theta|x)\,d\theta$$
Setting $\frac{\partial}{\partial\hat\theta} E[L(\theta, \hat\theta(x))] = 0$ shows that $\hat\theta$ is the posterior median.

Two Bayes Rules
Consider a point estimation problem for a real-valued parameter $\theta$.
• For squared error loss, the posterior expected loss is
$$\int_\Omega (\theta - \hat\theta)^2\pi(\theta|x)\,d\theta = E[(\theta - \hat\theta)^2\,|\,X = x]$$
This expected value is minimized by $\hat\theta = E(\theta|x)$, so the Bayes rule estimator is the mean of the posterior distribution.
• For absolute error loss, the posterior expected loss is $E[\,|\theta - \hat\theta|\;\big|\;X = x]$. As shown previously, this is minimized by choosing $\hat\theta$ as the median of $\pi(\theta|x)$.

Example
• $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} \text{Bernoulli}(p)$, with prior $\pi(p) \sim \text{Beta}(\alpha, \beta)$.
• The posterior distribution follows $\text{Beta}(\sum x_i + \alpha,\; n - \sum x_i + \beta)$.
• The Bayes estimator that minimizes the posterior expected squared error loss is the posterior mean
$$\hat p = \frac{\sum x_i + \alpha}{\alpha+\beta+n}$$
• The Bayes estimator that minimizes the posterior expected absolute error loss is the posterior median.
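The posterior median has no closed form here, but it is easy to find numerically. A sketch for an illustrative Beta(9, 6) posterior, scanning a Riemann-sum CDF until it crosses 1/2:

```python
from math import gamma

# Numerical sketch: locate the posterior median of an illustrative
# Beta(a, b) posterior by accumulating the pdf (midpoint Riemann sum)
# until the CDF reaches 1/2.
a, b = 9.0, 6.0
const = gamma(a + b) / (gamma(a) * gamma(b))

m = 100000
cdf, median = 0.0, None
for i in range(m):
    p = (i + 0.5) / m
    cdf += const * p**(a - 1) * (1 - p)**(b - 1) / m
    if cdf >= 0.5:
        median = p
        break

print(median)
```

For this posterior the median lies close to, but not exactly at, the posterior mean 0.6, since the Beta(9, 6) density is mildly skewed.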
The posterior median $\hat p$ satisfies
$$\int_0^{\hat p} \frac{\Gamma(\alpha+\beta+n)}{\Gamma(\sum x_i+\alpha)\,\Gamma(n-\sum x_i+\beta)}\; p^{\sum x_i+\alpha-1}(1-p)^{n-\sum x_i+\beta-1}\,dp = \frac{1}{2}$$

Asymptotic Evaluation of Point Estimators
The behaviors of an estimator as the sample size $n$ approaches infinity are known as its asymptotic properties.

Definition - Consistency
Let $W_n = W_n(X_1, \ldots, X_n) = W_n(\mathbf X)$ be a sequence of estimators for $\tau(\theta)$. We say $W_n$ is consistent for estimating $\tau(\theta)$ if $W_n \xrightarrow{P} \tau(\theta)$ under $P_\theta$ for every $\theta \in \Omega$.

$W_n \xrightarrow{P} \tau(\theta)$ (converges in probability to $\tau(\theta)$) means that, given any $\epsilon > 0$,
$$\lim_{n\to\infty} \Pr(|W_n - \tau(\theta)| \ge \epsilon) = 0, \qquad \lim_{n\to\infty} \Pr(|W_n - \tau(\theta)| < \epsilon) = 1$$
Since $|W_n - \tau(\theta)| < \epsilon$ can be read as "$W_n$ is close to $\tau(\theta)$", consistency means that the probability of $W_n$ being close to $\tau(\theta)$ approaches 1 as $n$ goes to $\infty$.

Tools for Proving Consistency
• Use the definition (complicated)
• Chebychev's Inequality:
$$\Pr(|W_n - \tau(\theta)| \ge \epsilon) = \Pr((W_n - \tau(\theta))^2 \ge \epsilon^2) \le \frac{E[W_n - \tau(\theta)]^2}{\epsilon^2} = \frac{\text{MSE}(W_n)}{\epsilon^2} = \frac{\text{Bias}^2(W_n) + \text{Var}(W_n)}{\epsilon^2}$$
So it suffices to show that both $\text{Bias}(W_n)$ and $\text{Var}(W_n)$ converge to zero.

Theorem 10.1.3
If $W_n$ is a sequence of estimators of $\tau(\theta)$ satisfying, for all $\theta$,
• $\lim_{n\to\infty} \text{Bias}(W_n) = 0$
• $\lim_{n\to\infty} \text{Var}(W_n) = 0$
then $W_n$ is consistent for $\tau(\theta)$.

Theorem 5.5.2 (Weak Law of Large Numbers)
Let $X_1, \ldots, X_n$ be i.i.d. random variables with $E(X) = \mu$ and $\text{Var}(X) = \sigma^2 < \infty$. Then $\bar X_n$ converges in probability to $\mu$, i.e. $\bar X_n \xrightarrow{P} \mu$.
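The Chebychev route to consistency can be made concrete for the sample mean: its bias is 0 and its variance is $\sigma^2/n$, so the bound on $\Pr(|\bar X_n - \mu| \ge \epsilon)$ shrinks to 0. A small sketch with illustrative values of $\sigma^2$ and $\epsilon$:

```python
# Chebychev bound on Pr(|Xbar_n - mu| >= eps):
#   (Bias^2(Xbar_n) + Var(Xbar_n)) / eps^2 = (sigma^2 / n) / eps^2
sigma2, eps = 4.0, 0.5                  # illustrative variance and tolerance

def cheby_bound(n):
    bias2, var = 0.0, sigma2 / n        # the sample mean is unbiased
    return (bias2 + var) / eps**2

bounds = [cheby_bound(n) for n in (10, 100, 1000, 10000)]
print(bounds)
```

The bound decreases like $1/n$, which is exactly the pattern Theorem 10.1.3 exploits.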
Consistent Sequence of Estimators

Theorem 10.1.5
Let $W_n$ be a consistent sequence of estimators of $\tau(\theta)$, and let $a_n$, $b_n$ be sequences of constants satisfying
1. $\lim_{n\to\infty} a_n = 1$
2. $\lim_{n\to\infty} b_n = 0$
Then $U_n = a_n W_n + b_n$ is also a consistent sequence of estimators of $\tau(\theta)$.

Continuous Map Theorem
If $W_n$ is consistent for $\theta$ and $g$ is a continuous function, then $g(W_n)$ is consistent for $g(\theta)$.

Example
$X_1, \ldots, X_n$ are i.i.d. samples from a distribution with mean $\mu$ and variance $\sigma^2 < \infty$.
1. Show that $\bar X_n$ is consistent for $\mu$.
2. Show that $\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2$ is consistent for $\sigma^2$.
3. Show that $\frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2$ is consistent for $\sigma^2$.
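A simulation sketch for the example above, using illustrative normal data (the distribution, $\mu$, $\sigma$, and the seed are arbitrary choices): both variance estimators should land near $\sigma^2$ for large $n$, since they differ only by the factor $a_n = n/(n-1) \to 1$ covered by Theorem 10.1.5.

```python
import random

# Both the 1/n and 1/(n-1) versions of the sample variance should be
# close to sigma^2 for large n; they differ by a_n = n/(n-1) -> 1.
random.seed(602)
mu, sigma = 1.0, 2.0
n = 100000
xs = [random.gauss(mu, sigma) for _ in range(n)]

xbar = sum(xs) / n
ss = sum((x - xbar)**2 for x in xs)
var_n = ss / n                # 1/n version
var_n1 = ss / (n - 1)         # 1/(n-1) version (= var_n * n/(n-1))

print(xbar, var_n, var_n1)
```

With $n = 100{,}000$ the two estimates are nearly indistinguishable, illustrating why the $1/n$ vs $1/(n-1)$ distinction matters for unbiasedness but not for consistency.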