Point Estimators Math 218, Mathematical Statistics


D Joyce, Spring 2016

The main job of statistics is to make inferences, specifically inferences about parameters based on data from a sample.

We assume that a sample $X_1, \ldots, X_n$ comes from a particular distribution, called the population distribution, and although that particular distribution is not known, it is assumed that it is one of a family of distributions parametrized by one or more parameters.

For example, if there are only two possible outcomes, then the distribution is a Bernoulli distribution parametrized by one parameter $p$, the probability of success.

For another example, many measurements are assumed to be normally distributed, and for a normal distribution, there are two parameters $\mu$ and $\sigma^2$.

Point estimation. The first kind of inference that we'll look at is estimating the values of parameters. A point estimator $\hat\theta$ of a parameter $\theta$ is some function of the sample $X_1, \ldots, X_n$. Any function of a sample is called a sample statistic.

For example, for the Bernoulli distribution, a typical estimator for $p$ is the sample mean $\overline{X}$, that is, $\hat p$ is often taken to be $\overline{X}$.

For another example, for the normal distribution, $\hat\mu$ is often taken to be $\overline{X}$, and $\hat\sigma^2$ is often taken to be the sample variance $S^2 = \frac{1}{n-1}\sum (X_i - \overline{X})^2$, although sometimes $\hat\sigma^2$ is taken to be the version of the sample variance that divides by $n$, namely $\frac{1}{n}\sum (X_i - \overline{X})^2$.

Sometimes, there are many different choices for estimators. To estimate the population mean $\mu$, besides (1) the sample mean $\overline{X}$, you might instead take (2) the midrange value $\frac{1}{2}(X_{\min} + X_{\max})$, or (3) the median, or (4) just about any other statistic that has "central tendencies." Which of these is best, and why?

Desirable criteria for point estimators. The MSE. Well, of course, we want our point estimator $\hat\theta$ of the parameter $\theta$ to be close to $\theta$. We can't expect them to be equal, of course, because of sampling error. How should we measure how far off the random variable $\hat\theta$ is from the unknown constant $\theta$? One standard measure is what is called the mean squared error, abbreviated MSE and defined by

$$\mathrm{MSE}(\hat\theta) = E\big((\hat\theta - \theta)^2\big),$$

the expected square of the error. If we have two different estimators of $\theta$, the one that has the smaller MSE is closer to the actual parameter (in some sense of "closer").

Variance and bias. The MSE of an estimator $\hat\theta$ can be split into two parts, the estimator's variance and a "bias" term. We're familiar with the variance of a random variable $X$; it's

$$\mathrm{Var}(X) = \sigma_X^2 = E\big((X - \mu_X)^2\big) = E(X^2) - \mu_X^2.$$

Right now, though, our random variable is $\hat\theta$, so its variance is

$$\mathrm{Var}(\hat\theta) = E\big((\hat\theta - E(\hat\theta))^2\big) = E(\hat\theta^2) - (E(\hat\theta))^2.$$

The $\mathrm{MSE}(\hat\theta)$ is the sum of this variance and one other component. Let's see what that other component is.

$$\begin{aligned}
\mathrm{MSE} &= E\big((\hat\theta - \theta)^2\big) \\
&= E(\hat\theta^2 - 2\hat\theta\theta + \theta^2) \\
&= E(\hat\theta^2) - (E(\hat\theta))^2 + (E(\hat\theta))^2 - 2E(\hat\theta)\theta + \theta^2 \\
&= \mathrm{Var}(\hat\theta) + (E(\hat\theta))^2 - 2E(\hat\theta)\theta + \theta^2 \\
&= \mathrm{Var}(\hat\theta) + (E(\hat\theta) - \theta)^2
\end{aligned}$$

The expression $E(\hat\theta) - \theta$ is called the bias of the estimator $\hat\theta$.

$$\mathrm{Bias}(\hat\theta) = E(\hat\theta) - \theta.$$

If $\mathrm{Bias}(\hat\theta)$ is positive, that means that you expect the estimator $\hat\theta$ to be too large; if negative, then too small. From the computation above, we see that the MSE is the sum of the variance of $\hat\theta$ and the square of the bias of $\hat\theta$:

$$\mathrm{MSE}(\hat\theta) = \mathrm{Var}(\hat\theta) + (\mathrm{Bias}(\hat\theta))^2.$$

So, what makes a good estimator? We've only seen a couple of measures of the goodness of an estimator, and people disagree just based on them. The authors say the best estimator is the unbiased one with the smallest variance. Others say the smallest MSE is most important since it's even closer to the parameter. Others say that there are other criteria that are more important than the ones we've just discussed. Example 6.4 discusses the sample variance $S^2$ as an estimator for the variance $\sigma^2$ of a normal distribution. Dividing by $n - 1$ leads to an unbiased estimator, but dividing by $n$ leads to an estimator with a much smaller MSE, at least for small $n$. As $n$ increases, the difference between the MSEs of the two estimators (the one involving $n - 1$ and the one involving $n$) approaches 0.

Standard error and estimated standard error of an estimator. The variance of $\hat\theta$ is one measure of its error, and that's often reported as its square root, the standard deviation of $\hat\theta$, called the standard error of the estimator $\hat\theta$, abbreviated SE. Unfortunately, the actual value of this standard error is not known because the parameters of the distribution are unknown. But they can be estimated, and so the standard error can be estimated too. So, what's usually reported is the estimated standard error.

An example. Suppose the population has two unknown parameters $\mu$ and $\sigma^2$, as is the case for normal populations. The sample mean $\overline{X}$ is often used to estimate $\mu$. The standard deviation of $\overline{X}$, that is, the SE of the estimator $\overline{X}$, is $\sigma/\sqrt{n}$. But $\sigma$ is estimated by the sample standard deviation $S$. Thus, the estimated standard error for $\overline{X}$ is $S/\sqrt{n}$. This particular estimated standard error is so commonly used that it has its own name, the standard error of the mean, abbreviated SEM.

The method of moments. There are other ways of coming up with estimators. One is called the method of moments. The idea is that to estimate a moment $\mu_k$ of the population distribution, just use the corresponding moment $\hat\mu_k$ of the sample. They're analogous, anyway.

$$\mu_k = E(X^k) \quad\text{while}\quad \hat\mu_k = \frac{1}{n}\sum_{i=1}^{n} X_i^k.$$

The method of moments goes further than that, though. If you want to estimate $k$ parameters $\theta_1, \ldots, \theta_k$, and those parameters aren't moments themselves, then find those parameters in terms of the moments.

Example 6.6, page 202, shows how this is done when a random sample is taken from a uniform distribution on an unknown interval $[\theta_1, \theta_2]$. Using the method of moments, it turns out that

$$\hat\theta_1, \hat\theta_2 = \hat\mu_1 \pm \sqrt{3(\hat\mu_2 - \hat\mu_1^2)}$$

where $\hat\mu_1 = \overline{X} = \frac{1}{n}\sum X_i$, and $\hat\mu_2 = \frac{1}{n}\sum X_i^2$.

Math 218 Home Page at http://math.clarku.edu/~djoyce/ma218/
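The split of the MSE into variance plus squared bias, and the comparison between dividing by $n - 1$ and dividing by $n$ when estimating $\sigma^2$, can be checked by simulation. The following is only a rough sketch in Python: the population (normal with mean 0 and variance 1), the sample size 5, the number of simulated samples, the seed, and all function names are assumptions made for illustration and do not come from the notes or the textbook.

```python
import random

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def s2_unbiased(xs):
    """Sample variance with divisor n - 1."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def s2_divide_by_n(xs):
    """Sample variance with divisor n."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def mse_parts(estimates, theta):
    """Return (mse, variance, bias) measured over the simulated estimates."""
    center = mean(estimates)
    mse = mean([(t - theta) ** 2 for t in estimates])
    variance = mean([(t - center) ** 2 for t in estimates])
    bias = center - theta
    return mse, variance, bias

# Assumed setup: N(0, 1) population, samples of size 5, 20000 repetitions.
rng = random.Random(218)
sigma2, n, reps = 1.0, 5, 20000
samples = [[rng.gauss(0.0, sigma2 ** 0.5) for _ in range(n)] for _ in range(reps)]

results = {}
for name, estimator in [("n-1", s2_unbiased), ("n", s2_divide_by_n)]:
    results[name] = mse_parts([estimator(xs) for xs in samples], sigma2)
    mse, variance, bias = results[name]
    print(f"divide by {name}: MSE={mse:.4f}  Var={variance:.4f}  "
          f"Bias={bias:.4f}  Var+Bias^2={variance + bias ** 2:.4f}")
```

In each printed row the MSE column and the Var+Bias^2 column should agree up to rounding, because the identity also holds for the simulated estimates themselves. Both estimators are evaluated on the same simulated samples so that the comparison of their MSEs is not blurred by extra sampling noise.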
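The estimated standard error of the sample mean, $S/\sqrt{n}$, is a one-line computation. Here is a small Python sketch; the function name and the measurement values are made up purely for illustration.

```python
import math
import statistics

def standard_error_of_mean(xs):
    """Estimated standard error of the sample mean: S / sqrt(n).

    statistics.stdev uses the n - 1 divisor, matching the sample
    standard deviation S described in the text.
    """
    return statistics.stdev(xs) / math.sqrt(len(xs))

# Hypothetical measurements, invented for this example.
measurements = [4.9, 5.1, 5.0, 5.3, 4.7, 5.2, 4.8, 5.0]
sem = standard_error_of_mean(measurements)
print(f"mean = {statistics.fmean(measurements):.3f}, SEM = {sem:.3f}")
```

Because of the $\sqrt{n}$ in the denominator, quadrupling the sample size roughly halves the SEM when $S$ stays about the same.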
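The method-of-moments estimators for a uniform distribution on $[\theta_1, \theta_2]$ can be tried out on simulated data. The sketch below is in Python and rests on assumptions chosen only for illustration: the true interval $[2, 7]$, the sample size, the seed, and the function name are all hypothetical.

```python
import math
import random

def uniform_method_of_moments(xs):
    """Method-of-moments estimates (theta1_hat, theta2_hat) for Uniform[theta1, theta2].

    Uses the first two sample moments:
    theta_hat = mu1_hat -/+ sqrt(3 * (mu2_hat - mu1_hat**2)).
    """
    n = len(xs)
    mu1_hat = sum(xs) / n
    mu2_hat = sum(x * x for x in xs) / n
    # Guard against a tiny negative value caused by floating-point rounding.
    half_width = math.sqrt(max(0.0, 3.0 * (mu2_hat - mu1_hat ** 2)))
    return mu1_hat - half_width, mu1_hat + half_width

# Assumed true interval and sample size for the simulation.
rng = random.Random(2016)
true_theta1, true_theta2 = 2.0, 7.0
sample = [rng.uniform(true_theta1, true_theta2) for _ in range(5000)]

theta1_hat, theta2_hat = uniform_method_of_moments(sample)
print(f"estimated interval: [{theta1_hat:.3f}, {theta2_hat:.3f}]")
```

Note that these estimates need not enclose every observation: some data points can fall outside the estimated interval, which is one reason other estimators are sometimes preferred for this problem.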