A Framework for Sample Efficient Interval Estimation with Control

Total Page:16

File Type:pdf, Size:1020Kb

A Framework for Sample Efficient Interval Estimation with Control A Framework for Sample Efficient Interval Estimation with Control Variates Shengjia Zhao Christopher Yeh Stefano Ermon Stanford University Stanford University Stanford University Abstract timates, e.g. for each individual district or county, meaning that we need to draw a sufficient number of We consider the problem of estimating confi- samples for every district or county. Another diffi- dence intervals for the mean of a random vari- culty can arise when we need high accuracy (i.e. a able, where the goal is to produce the smallest small confidence interval), since confidence intervals possible interval for a given number of sam- producedp by typical concentration inequalities have size ples. While minimax optimal algorithms are O(1= number of samples). In other words, to reduce known for this problem in the general case, the size of a confidence interval by a factor of 10, we improved performance is possible under addi- need 100 times more samples. tional assumptions. In particular, we design This dilemma is generally unavoidable because concen- an estimation algorithm to take advantage of tration inequalities such as Chernoff or Chebychev are side information in the form of a control vari- minimax optimal: there exist distributions for which ate, leveraging order statistics. Under certain these inequalities cannot be improved. No-free-lunch conditions on the quality of the control vari- results (Van der Vaart, 2000) imply that any alter- ates, we show improved asymptotic efficiency native estimation algorithm that performs better (i.e. compared to existing estimation algorithms. outputs a confidence interval with smaller size) on some Empirically, we demonstrate superior perfor- problems, must perform worse on other problems. Nev- mance on several real world surveying and ertheless, we can identify a subset of problems where estimation tasks where we use the output of better estimation algorithms are possible. regression models as the control variates. We consider the class of problems where we have side information. We formalize side information as a ran- 1 Introduction dom variable with known expectation and whose value is close (with high probability) to the random variable we want to estimate. Following the terminology in the Many real world problems require estimation of the Monte Carlo simulation literature, we call this side in- mean of a random variable from unbiased samples. In formation random variable a “control variate” (Lemieux, high risk applications, the estimation algorithm should 2017). Instead of the original estimation task, we can output a confidence interval, with the guarantee that estimate the expected difference between the original the true mean belongs to the interval with e.g. 99% random variable and the control variate. The hope is probability. The classic tools for this task are concen- that the distribution of this difference is concentrated tration inequalities, such as the Chernoff, Bernstein, around 0. Some estimation algorithms output very or Chebychev inequalities (Hoeffding, 1962; Bernstein, good (small sized) confidence intervals for distributions 1924; Vershynin, 2010). concentrated around 0 compared to classic methods However, for many tasks, obtaining unbiased samples is such as Chernoff bounds. expensive. For example, when estimating demographic Many practical problems have very good control vari- quantities, such as income or political preference, draw- ates. One important class of problems is when we have ing unbiased samples can require field survey. To make a predictor for the random variable, and we can use the the situation worse, often we seek more granular es- prediction as the control variate. For example, if we would like to estimate a neighborhood’s average voting Proceedings of the 23rdInternational Conference on Artificial pattern, then we might have a prediction function for Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. political preference from Google Street View images of PMLR: Volume 108. Copyright 2020 by the author(s). that neighborhood (Gebru et al., 2017); if we would like A Framework for Sample Efficient Interval Estimation with Control Variates to estimate the average asset wealth of households in a very small probability that Y = +1, then E[Y ] is certain geographic region, we might have a regressor unbounded, but µmean can be finite, which makes it that predicts income from satellite images (Jean et al., impossible to bound jµmean − E[Y ]j. Common assump- 2016). These classifiers could be trained on past data tions include sub-Gaussian, bounded moments, sub- (e.g. previous year survey results) or similar datasets, exponential, etc (Vershynin, 2010). We will consider or they could even be crafted by hand. the sub-Gaussian and bounded moments assumptions in this paper. For these problems we propose concentration bounds based on order statistics. In particular, we draw a Sub-Gaussian: If we assume 8t 2 R; E[et(Y −E[Y ])] ≤ 2 2 connection between the recently proposed concept of eσ t =2, then Chernoff-Hoeffding is the classic con- resilience (Steinhardt, 2018) and concentration bounds centration inequality for confidence interval estima- on order statistics. We show how to use these con- tion (Hoeffding, 1962) centration inequalities to design estimation algorithms " s # that output better confidence intervals when we have 2σ2 2 Pr jµ^ − [Y ]j > log ≤ ζ: a good control variate (e.g. output confidence intervals E m ζ of size O(1=number of samples)). Our proposed estimation algorithm always produces In particular, when Y is supported on [a; b] we have valid confidence intervals, i.e. the true mean belongs to " s # the interval with a specified probability. The only risk (b − a)2 2 is that when the control variate is poor, the confidence Pr jµ^ − [Y ]j > log ≤ ζ: E 2m ζ interval could be worse (larger) than classic baselines such as Chernoff inequalities. We empirically show superior performance of the proposed estimation algo- Bounded Moments: Another common assumption rithm on three real world tasks: bounding regression is that the k-th order moment of Y is bounded, i.e. error, estimating average wealth with satellite images, E[jY − E[Y ]jk] ≤ σk for some σ > 0. and estimating the covariance between wealth and ed- Under bounded moment assumptions, several concen- ucation level. tration inequalities are known, such as the Chebychev inequality, Kolmogorov inequality (Hájek and Rényi, 2 Problem Setup 1955) and Bernstein inequality (Bernstein, 1924). For example, when Y has a bounded second order moment, Our objective is to estimate the mean of some random the Chebychev inequality states the following: variable Y taking values in Y ⊆ d. Given i.i.d. sam- R ples y ; : : : ; y ∼ Y and some choice of confidence level σ 1 m Pr jµ^ − E[Y ]j ≥ p ≤ ζ (2) ζ 2 (0; 1), an estimation algorithm outputs µ^ 2 Rd and ζm an confidence interval size c 2 R+. The estimation algorithm must satisfy It is known that these bounds are (asymptotically in m) minimax optimal. There exist random variables Y Pr [kµ^ − E[Y ]k > c] ≤ ζ (1) that satisfy the respective assumptions of each inequal- ity, and the inequality cannot be improved (Hoeffding, where the probability Pr is with respect to the random 1962). Therefore, to further improve these estimation samples of y ; : : : ; y and any additional randomness 1 m algorithms, additional assumptions will be necessary. in the execution of the (randomized) algorithm, and kµ^ − E[Y ]k is any choice of semi-norm. 2.2 Control Variates We will first focus on one dimensional problems where ~ Y 2 R, and choose the norm kµ^ − E[Y ]k = jµ^ − E[Y ]j, Suppose Y is another random variable jointly dis- then extend several results to more general setups. tributed with Y and that we know its mean E[Y~ ] (or have a very accurate estimate of it). We also 2.1 Baseline Estimators and Optimality have samples drawn from their joint distribution (y1; y~1);:::; (ym; y~m) ∼ Y; Y~ . The classical approach is to estimate [Y ] with the E For Y~ to be useful for our application, its value needs empirical mean µ = 1 Pm y , and its estimation mean m i=1 i to be close to Y . In other words, Y − Y~ should be a error jµ − [Y ]j can be controlled using concentra- mean E random variable that is concentrated around 0. The tion inequalities. purpose of this variable Y~ is similar to a “control variate” To obtain concentration inequalities, we need some in the Monte Carlo community, but we use it here for assumptions on Y. For example, if there is some a different task of interval estimation. Shengjia Zhao, Christopher Yeh, Stefano Ermon For example, in our household wealth estimation ex- Eq.(4) is satisfied for some ζ 2 (0; 1) and c > 0, then ample, let Y denote the household-level wealth in a randomly sampled village, and let Y~ be the predicted Z(m−k) − Z(1+k) Pr jµ^ − E[Y ]j > r + ≤ ζ: (5) household-level wealth based on the satellite image of 2 that village. If our predictor is accurate (i.e. Y ≈ Y~ with high probability), then Y~ could be an effective Proof of Proposition 1. See Appendix. control variate for Y . In addition, we can also estimate E[Y~ ] very accurately because a very large number of Note that in Eq.(5), the confidence interval size con- satellite images are available with little cost (Wulder Z(m−k)−Z(1+k) sists of two parts: 2 and c. We will show et al., 2012), so obtaining samples from Y~ (without the that c is much smaller (in fact, has asymptotically bet- corresponding Y ) is inexpensive. ter rates) compared to baselines previously discussed ~ We would like to design an estimation algorithm that in Section 2.1. If we have a good control variate Y (close to Y ), then Z(m−k)−Z(1+k) will be small.
Recommended publications
  • Confidence Interval Estimation in System Dynamics Models: Bootstrapping Vs
    Confidence Interval Estimation in System Dynamics Models: Bootstrapping vs. Likelihood Ratio Method Gokhan Dogan* MIT Sloan School of Management, 30 Wadsworth Street, E53-364, Cambridge, Massachusetts 02142 [email protected] Abstract In this paper we discuss confidence interval estimation for system dynamics models. Confidence interval estimation is important because without confidence intervals, we cannot determine whether an estimated parameter value is significantly different from 0 or any other value, and therefore we cannot determine how much confidence to place in the estimate. We compare two methods for confidence interval estimation. The first, the “likelihood ratio method,” is based on maximum likelihood estimation. This method has been used in the system dynamics literature and is built in to some popular software packages. It is computationally efficient but requires strong assumptions about the model and data. These assumptions are frequently violated by the autocorrelation, endogeneity of explanatory variables and heteroskedasticity properties of dynamic models. The second method is called “bootstrapping.” Bootstrapping requires more computation but does not impose strong assumptions on the model or data. We describe the methods and illustrate them with a series of applications from actual modeling projects. Considering the empirical results presented in the paper and the fact that the bootstrapping method requires less assumptions, we suggest that bootstrapping is a better tool for confidence interval estimation in system dynamics
    [Show full text]
  • WHAT DID FISHER MEAN by an ESTIMATE? 3 Ideas but Is in Conflict with His Ideology of Statistical Inference
    Submitted to the Annals of Applied Probability WHAT DID FISHER MEAN BY AN ESTIMATE? By Esa Uusipaikka∗ University of Turku Fisher’s Method of Maximum Likelihood is shown to be a proce- dure for the construction of likelihood intervals or regions, instead of a procedure of point estimation. Based on Fisher’s articles and books it is justified that by estimation Fisher meant the construction of likelihood intervals or regions from appropriate likelihood function and that an estimate is a statistic, that is, a function from a sample space to a parameter space such that the likelihood function obtained from the sampling distribution of the statistic at the observed value of the statistic is used to construct likelihood intervals or regions. Thus Problem of Estimation is how to choose the ’best’ estimate. Fisher’s solution for the problem of estimation is Maximum Likeli- hood Estimate (MLE). Fisher’s Theory of Statistical Estimation is a chain of ideas used to justify MLE as the solution of the problem of estimation. The construction of confidence intervals by the delta method from the asymptotic normal distribution of MLE is based on Fisher’s ideas, but is against his ’logic of statistical inference’. Instead the construc- tion of confidence intervals from the profile likelihood function of a given interest function of the parameter vector is considered as a solution more in line with Fisher’s ’ideology’. A new method of cal- culation of profile likelihood-based confidence intervals for general smooth interest functions in general statistical models is considered. 1. Introduction. ’Collected Papers of R.A.
    [Show full text]
  • Likelihood Confidence Intervals When Only Ranges Are Available
    Article Likelihood Confidence Intervals When Only Ranges are Available Szilárd Nemes Institute of Clinical Sciences, Sahlgrenska Academy, University of Gothenburg, Gothenburg 405 30, Sweden; [email protected] Received: 22 December 2018; Accepted: 3 February 2019; Published: 6 February 2019 Abstract: Research papers represent an important and rich source of comparative data. The change is to extract the information of interest. Herein, we look at the possibilities to construct confidence intervals for sample averages when only ranges are available with maximum likelihood estimation with order statistics (MLEOS). Using Monte Carlo simulation, we looked at the confidence interval coverage characteristics for likelihood ratio and Wald-type approximate 95% confidence intervals. We saw indication that the likelihood ratio interval had better coverage and narrower intervals. For single parameter distributions, MLEOS is directly applicable. For location-scale distribution is recommended that the variance (or combination of it) to be estimated using standard formulas and used as a plug-in. Keywords: range, likelihood, order statistics, coverage 1. Introduction One of the tasks statisticians face is extracting and possibly inferring biologically/clinically relevant information from published papers. This aspect of applied statistics is well developed, and one can choose to form many easy to use and performant algorithms that aid problem solving. Often, these algorithms aim to aid statisticians/practitioners to extract variability of different measures or biomarkers that is needed for power calculation and research design [1,2]. While these algorithms are efficient and easy to use, they mostly are not probabilistic in nature, thus they do not offer means for statistical inference. Yet another field of applied statistics that aims to help practitioners in extracting relevant information when only partial data is available propose a probabilistic approach with order statistics.
    [Show full text]
  • 32. Statistics 1 32
    32. Statistics 1 32. STATISTICS Revised September 2007 by G. Cowan (RHUL). This chapter gives an overview of statistical methods used in High Energy Physics. In statistics, we are interested in using a given sample of data to make inferences about a probabilistic model, e.g., to assess the model’s validity or to determine the values of its parameters. There are two main approaches to statistical inference, which we may call frequentist and Bayesian. In frequentist statistics, probability is interpreted as the frequency of the outcome of a repeatable experiment. The most important tools in this framework are parameter estimation, covered in Section 32.1, and statistical tests, discussed in Section 32.2. Frequentist confidence intervals, which are constructed so as to cover the true value of a parameter with a specified probability, are treated in Section 32.3.2. Note that in frequentist statistics one does not define a probability for a hypothesis or for a parameter. Frequentist statistics provides the usual tools for reporting objectively the outcome of an experiment without needing to incorporate prior beliefs concerning the parameter being measured or the theory being tested. As such they are used for reporting most measurements and their statistical uncertainties in High Energy Physics. In Bayesian statistics, the interpretation of probability is more general and includes degree of belief (called subjective probability). One can then speak of a probability density function (p.d.f.) for a parameter, which expresses one’s state of knowledge about where its true value lies. Bayesian methods allow for a natural way to input additional information, such as physical boundaries and subjective information; in fact they require as input the prior p.d.f.
    [Show full text]
  • Interval Estimation Statistics (OA3102)
    Module 5: Interval Estimation Statistics (OA3102) Professor Ron Fricker Naval Postgraduate School Monterey, California Reading assignment: WM&S chapter 8.5-8.9 Revision: 1-12 1 Goals for this Module • Interval estimation – i.e., confidence intervals – Terminology – Pivotal method for creating confidence intervals • Types of intervals – Large-sample confidence intervals – One-sided vs. two-sided intervals – Small-sample confidence intervals for the mean, differences in two means – Confidence interval for the variance • Sample size calculations Revision: 1-12 2 Interval Estimation • Instead of estimating a parameter with a single number, estimate it with an interval • Ideally, interval will have two properties: – It will contain the target parameter q – It will be relatively narrow • But, as we will see, since interval endpoints are a function of the data, – They will be variable – So we cannot be sure q will fall in the interval Revision: 1-12 3 Objective for Interval Estimation • So, we can’t be sure that the interval contains q, but we will be able to calculate the probability the interval contains q • Interval estimation objective: Find an interval estimator capable of generating narrow intervals with a high probability of enclosing q Revision: 1-12 4 Why Interval Estimation? • As before, we want to use a sample to infer something about a larger population • However, samples are variable – We’d get different values with each new sample – So our point estimates are variable • Point estimates do not give any information about how far
    [Show full text]
  • Confidence Intervals in Analysis and Reporting of Clinical Trials Guangbin Peng, Eli Lilly and Company, Indianapolis, IN
    Confidence Intervals in Analysis and Reporting of Clinical Trials Guangbin Peng, Eli Lilly and Company, Indianapolis, IN ABSTRACT for more frequent use of confidence Regulatory agencies around the world intervals (Simon, 1993). have recommended reporting confidence There is a close relationship between intervals for treatment differences along confidence intervals and significance with the results of significance tests. tests (Hahn and Meeker, 1991); in fact, a SAS provides easy and convenient ways confidence interval can often be used to to produce confidence intervals using test a hypothesis. If the 100(1-α)% procedures such as PROC GLM and confidence interval for the mean PROC UNIVARIATE in conjunction treatment difference in a clinical trial with ODS (output delivery system). In does not contain zero, there is evidence this paper, I will discuss the relationship to indicate a treatment difference at the between significance tests and 100 α % significance level. This strategy confidence intervals, summarize the is equivalent to the hypothesis test that types of confidence intervals used in rejects the null hypothesis of no mean clinical study reports, and provide treatment difference at the level of α. examples from clinical trials to illustrate Compared to p-values, confidence the computation of distribution- intervals are generally more informative. dependent confidence intervals for the They provide quantitative bounds that mean treatment difference and express the uncertainty inherent in distribution-free confidence intervals for estimation, instead of merely an accept the median response within each or reject statement. The length of a treatment group using SAS. confidence interval depends on the sample size; this influence of sample INTRODUCTION size is evident from observing the length Confidence interval estimation and of the interval, while this is not the case significance testing (hypothesis testing) for a significance test.
    [Show full text]
  • Interval Estimation for Drop-The-Losers Designs
    Biometrika (2010), 97,2,pp. 405–418 doi: 10.1093/biomet/asq003 C 2010 Biometrika Trust Advance Access publication 14 April 2010 Printed in Great Britain Interval estimation for drop-the-losers designs BY SAMUEL S. WU Department of Epidemiology and Health Policy Research, University of Florida, Gainesville, Florida 32610, U.S.A. [email protected]fl.edu WEIZHEN WANG Downloaded from Department of Mathematics and Statistics, Wright State University, Dayton, Ohio 45435, U.S.A. [email protected] AND MARK C. K. YANG Department of Statistics, University of Florida, Gainesville, Florida 32610, U.S.A. http://biomet.oxfordjournals.org [email protected]fl.edu SUMMARY In the first stage of a two-stage, drop-the-losers design, a candidate for the best treatment is selected. At the second stage, additional observations are collected to decide whether the candidate is actually better than the control. The design also allows the investigator to stop the trial for ethical reasons at the end of the first stage if there is already strong evidence of futility or superiority. Two types of tests have recently been developed, one based on the combined means and the other based on the combined p-values, but corresponding inter- at University of Florida on May 27, 2010 val estimators are unavailable except in special cases. The problem is that, in most cases, the interval estimators depend on the mean configuration of all treatments in the first stage, which is unknown. In this paper, we prove a basic stochastic ordering lemma that enables us to bridge the gap between hypothesis testing and interval estimation.
    [Show full text]
  • Interval Estimation, Point Estimation, and Null Hypothesis Significance Testing Calibrated by an Estimated Posterior Probability of the Null Hypothesis David R
    Interval estimation, point estimation, and null hypothesis significance testing calibrated by an estimated posterior probability of the null hypothesis David R. Bickel To cite this version: David R. Bickel. Interval estimation, point estimation, and null hypothesis significance testing cali- brated by an estimated posterior probability of the null hypothesis. 2020. hal-02496126 HAL Id: hal-02496126 https://hal.archives-ouvertes.fr/hal-02496126 Preprint submitted on 2 Mar 2020 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Interval estimation, point estimation, and null hypothesis significance testing calibrated by an estimated posterior probability of the null hypothesis March 2, 2020 David R. Bickel Ottawa Institute of Systems Biology Department of Biochemistry, Microbiology and Immunology Department of Mathematics and Statistics University of Ottawa 451 Smyth Road Ottawa, Ontario, K1H 8M5 +01 (613) 562-5800, ext. 8670 [email protected] Abstract Much of the blame for failed attempts to replicate reports of scientific findings has been placed on ubiquitous and persistent misinterpretations of the p value. An increasingly popular solution is to transform a two-sided p value to a lower bound on a Bayes factor. Another solution is to interpret a one-sided p value as an approximate posterior probability.
    [Show full text]
  • Statistical Inference II: Interval Estimation and Hypotheses Tests
    Statistical Inference II: The Principles of Interval Estimation and Hypothesis Testing The Principles of Interval Estimation and Hypothesis Testing 1. Introduction In Statistical Inference I we described how to estimate the mean and variance of a population, and the properties of those estimation procedures. In Statistical Inference II we introduce two more aspects of statistical inference: confidence intervals and hypothesis tests. In contrast to a point estimate of the population mean β, like b = 17.158, a confidence interval estimate is a range of values which may contain the true population mean. A confidence interval estimate contains information not only about the location of the population mean but also about the precision with which we estimate it. A hypothesis test is a statistical procedure for using data to check the compatibility of a conjecture about a population with the information contained in a sample of data. Continuing the example from Statistical Inference I, suppose airplane designers have been basing seat designs based on the assumption that the average hip width of U.S. passengers is 16 inches. Is the information contained in the random sample of 50 hip measurements compatible with this conjecture, or not? These are the issues we consider in Statistical Inference II. 2. Interval Estimation for Mean of Normal Population When σ2 is Known Let Y be a random variable from a normal population. That is, assume YN~,()β σ2 . Assume that we have a random sample of size T from this population, YY12,,, YT . The least squares estimator of the population mean is T = bYT∑ i (2.1) i=1 This estimator has a normal distribution if the population is normal, bN~,()βσ2 T (2.2) For the present, let us assume that the population variance σ2 is known.
    [Show full text]
  • Contents 3 Parameter and Interval Estimation
    Probability and Statistics (part 3) : Parameter and Interval Estimation (by Evan Dummit, 2020, v. 1.25) Contents 3 Parameter and Interval Estimation 1 3.1 Parameter Estimation . 1 3.1.1 Maximum Likelihood Estimates . 2 3.1.2 Biased and Unbiased Estimators . 5 3.1.3 Eciency of Estimators . 8 3.2 Interval Estimation . 12 3.2.1 Condence Intervals . 12 3.2.2 Normal Condence Intervals . 13 3.2.3 Binomial Condence Intervals . 16 3 Parameter and Interval Estimation In the previous chapter, we discussed random variables and developed the notion of a probability distribution, and then established some fundamental results such as the central limit theorem that give strong and useful information about the statistical properties of a sample drawn from a xed, known distribution. Our goal in this chapter is, in some sense, to invert this analysis: starting instead with data obtained by sampling a distribution or probability model with certain unknown parameters, we would like to extract information about the most reasonable values for these parameters given the observed data. We begin by discussing pointwise parameter estimates and estimators, analyzing various properties that we would like these estimators to have such as unbiasedness and eciency, and nally establish the optimality of several estimators that arise from the basic distributions we have encountered such as the normal and binomial distributions. We then broaden our focus to interval estimation: from nding the best estimate of a single value to nding such an estimate along with a measurement of its expected precision. We treat in scrupulous detail several important cases for constructing such condence intervals, and close with some applications of these ideas to polling data.
    [Show full text]
  • Applied Resampling Methods
    APPLIED RESAMPLING METHODS Michael A. Martin, PhD, AStat Steven E. Stern, PhD, AStat © 2001 Michael A. Martin and Steven E. Stern [email protected] [email protected] 3 MAJOR GOALS OF STATISTICS 1. What data should I collect? How do I collect my data (sampling)? What assumptions am I making about my data? 2. How do I organise and analyse my data? Graphically? Numerically? 3. What do my data analyses and summaries mean? How “accurate” are my estimators? How “precise” are my estimators? What do “accurate” and “precise” even mean? Statistical Inference promises answers to 3. A simple example: Does aspirin prevent heart attacks? Experimental situation: population consists of healthy, middle-aged men Type of study: Controlled – half of the subjects were given aspirin, the other half a placebo (control group) Randomised – subjects were randomly assigned to aspirin and placebo groups Double-blind – neither patients nor doctors knew to which group patients belonged The Data: Subjects heart attacks Aspirin Group 11037 104 Placebo Group 11034 189 How might we analyse this data? One possible approach (there are others…) Rate of heart attacks in aspirin group: 104 =0.0094. 11037 Rate of heart attacks in placebo group: 189 =0.0171. 11034 A reasonable statistic of interest is qˆ = Rate in Aspirin Group =0.55. Rate in Placebo Group This is a ratio estimator. We could also take logs to produce a location-type estimator based † on differences. † † Now what? How did we do? Data Collection: • the use of a control group allows meaningful comparison………………… 3 • randomisation reduces the risk of systematic bias……………………………3 • double-blind study removes potential for “psychological” effects or the tendency for doctors to treat higher-risk patients……3 Analysing data: • The statistic of interest makes sense – it obviously compares the rate of heart attacks between the two groups.
    [Show full text]
  • Resampling for Estimation and Inference for Variances 1 Introduction
    ResamplingInt. Statistical Inst.: Proc. for 58th Estimation World Statistical and Congress, Inference 2011, Dublin for(Session Variances IPS056) p.1022 Thompson, Mary (1st author) University of Waterloo, Department of Statistics and Actuarial Science 200 University Avenue West Waterloo N2L 3G1, Canada E-mail: [email protected] Wang, Zilin (2nd author) Wilfrid Laurier University, Department of Mathematics 75 University Avenue West Waterloo N2L 3C5, Canada E-mail: [email protected] 1 Introduction In survey sampling, estimating and making inferences on population variances can be challenging when sampling designs are complicated. Most of the literature limits the investigation to the case where a random sample is obtained without replacement (For example, Thompson (1997) and Cho and Cho (2008)). In this paper, we pursue a computational approach to inference for population variances and, ultimately, variance components with complex survey data. This computational approach is based on a resampling technique, artificial population boot- strapping (APB), introduced in Wang and Thompson (2010). The APB procedure departs from the common application of resampling procedures in survey sampling. It belongs to the category of bootstrapping without replacement, with unequal probability sampling being used to select the resamples. In particular, using a set of complex survey data, many artificial populations are built, and within each of the those populations, resamples are selected by using the same probability sam- pling design as the original sample. The APB procedure entails not only mimicking the sampling design, but also creating artificial populations to resemble as closely as possible the population from which the original sample is selected. As a result, inferences on parameters for the artificial population using resamples would resemble inferences on parameters from the finite population using the original sample.
    [Show full text]