Introduction to Resampling Methods Using R


Contents

1 Sampling from known distributions and simulation
  1.1 Sampling from normal distributions
  1.2 Specifying seeds
  1.3 Sampling from exponential distributions
2 Bootstrapping
  2.1 Bootstrap distributions
  2.2 Bootstrap confidence intervals
    2.2.1 Percentile method
    2.2.2 Pivot method
    2.2.3 Standard bootstrap
3 Randomization tests
  3.1 Creating random permutations
  3.2 Comparing groups
    3.2.1 Exact randomization distribution
    3.2.2 Random sampling the randomization distribution
    3.2.3 Choice of test statistic
  3.3 Wilcoxon rank-sum test
  3.4 Selecting among two-sample tests
  3.5 More than two groups
  3.6 Contingency tables
4 Methods for correlation and regression
  4.1 Randomization test for linear relation
    4.1.1 Pearson correlation or slope of regression line
    4.1.2 Rank correlation
  4.2 Bootstrap intervals for correlation and slope
    4.2.1 Bivariate bootstrap sampling
    4.2.2 Confidence intervals
    4.2.3 Fixed-X sampling for the slope
5 Two-sample bootstrap intervals

1. Sampling from known distributions and simulation

In introductory statistics courses we are told that the t-test is "robust" to departures from normality, especially if the sample size is large. What this means is if we specify a particular Type I error rate, then the actual proportion of false rejections will be close to the Type I error rate. Let's create and run a simulation to explore this.

Steps.
1. Generate a random sample from some population distribution
2. Calculate sample mean, standard deviation and t test statistic
3. Decide if the null hypothesis is rejected
4. Repeat 1-3, counting the number of rejections

1.1 Sampling from normal distributions

counter <- 0                          # set counter to 0
t.crit <- qt(0.95, 14)                # 5% critical value
for (i in 1:1000) {
  x <- rnorm(15, 25, 4)               # draw a random sample of size 15 from a N(25, 4) distribution
  t <- (mean(x) - 25) * sqrt(15) / sd(x)
  if (t >= t.crit)                    # check to see if result is significant
    counter <- counter + 1            # increase counter by 1
}
counter / 1000                        # compute estimate of Type I error rate

## [1] 0.06

If we execute this code again, a different set of random samples will be selected, and a different estimate obtained.

counter <- 0                          # set counter to 0
t.crit <- qt(0.95, 14)                # 5% critical value
for (i in 1:1000) {
  x <- rnorm(15, 25, 4)               # draw a random sample of size 15 from a N(25, 4) distribution
  t <- (mean(x) - 25) * sqrt(15) / sd(x)
  if (t >= t.crit)                    # check to see if result is significant
    counter <- counter + 1            # increase counter by 1
}
counter / 1000                        # compute estimate of Type I error rate

## [1] 0.043
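The two runs above gave estimates of 0.06 and 0.043 against a nominal rate of 0.05. Before reading much into that difference, it helps to know how much Monte Carlo error to expect from 1000 replications. A minimal sketch, not part of the original handout (the names p and se are introduced here only for illustration):

p <- 0.05                        # nominal Type I error rate
nsims <- 1000                    # number of simulated samples
se <- sqrt(p * (1 - p) / nsims)  # binomial standard error of the estimated rejection rate
se                               # roughly 0.007
c(p - 2 * se, p + 2 * se)        # estimates in roughly (0.036, 0.064) are consistent with 0.05

Both estimates fall inside this range, so the run-to-run variation is about what we would expect from the simulation itself.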
1.2 Specifying seeds

Specify a seed to get identical results each time.

set.seed(4123)
counter <- 0                          # set counter to 0
t.crit <- qt(0.95, 14)                # 5% critical value
for (i in 1:1000) {
  x <- rnorm(15, 25, 4)               # draw a random sample of size 15 from a N(25, 4) distribution
  t <- (mean(x) - 25) * sqrt(15) / sd(x)
  if (t >= t.crit)                    # check to see if result is significant
    counter <- counter + 1            # increase counter by 1
}
counter / 1000                        # compute estimate of Type I error rate

## [1] 0.043

Execute the same code again:

set.seed(4123)
counter <- 0                          # set counter to 0
t.crit <- qt(0.95, 14)                # 5% critical value
for (i in 1:1000) {
  x <- rnorm(15, 25, 4)               # draw a random sample of size 15 from a N(25, 4) distribution
  t <- (mean(x) - 25) * sqrt(15) / sd(x)
  if (t >= t.crit)                    # check to see if result is significant
    counter <- counter + 1            # increase counter by 1
}
counter / 1000                        # compute estimate of Type I error rate

## [1] 0.043

Instead of using a counter, we may want to store the results so they can be explored later. In the code below, a vector is created and used to store the calculated t-statistics.

set.seed(4123)
nsims <- 1000
t.crit <- qt(0.95, 14)                # 5% critical value
results <- numeric(nsims)             # Vector to store t statistics
for (i in 1:nsims) {
  x <- rnorm(15, mean = 0, sd = 1)    # draw a random sample of size 15 from a N(0, 1) distribution
  results[i] <- (mean(x) - 0) * sqrt(15) / sd(x)
}
sum(results >= t.crit) / nsims        # compute estimate of error rate

## [1] 0.043

Having the results saved in a vector allows us to explore the actual sampling distribution. Below we graphically assess agreement between the theoretical and actual distributions.

hist(results, freq = F, ylim = c(0, 0.4))   # Plot histogram of t statistics
curve(dt(x, 14), add = TRUE)                # superimpose t(14) density

1.3 Sampling from exponential distributions

set.seed(4123)
nsims <- 1000
t.crit <- qt(0.95, 14)                # 5% critical value
results <- numeric(nsims)             # Vector to store t statistics
for (i in 1:nsims) {
  x <- rexp(15, rate = 1/25)          # draw a random sample of size 15 from an Exp(mean = 25) distribution
  results[i] <- (mean(x) - 25) * sqrt(15) / sd(x)
}
sum(results >= t.crit) / nsims        # compute estimate of error rate

## [1] 0.015

Graphically assess agreement between the theoretical and actual distributions.

hist(results, freq = F, xlim = c(-6, 4), ylim = c(0, 0.4))   # Plot histogram of t statistics
curve(dt(x, 14), add = TRUE)                                 # superimpose t(14) density

Available distributions (http://www.stat.umn.edu/geyer/old/5101/rlook.html)

Distribution                      Functions
Beta                              pbeta      qbeta      dbeta      rbeta
Binomial                          pbinom     qbinom     dbinom     rbinom
Cauchy                            pcauchy    qcauchy    dcauchy    rcauchy
Chi-Square                        pchisq     qchisq     dchisq     rchisq
Exponential                       pexp       qexp       dexp       rexp
F                                 pf         qf         df         rf
Gamma                             pgamma     qgamma     dgamma     rgamma
Geometric                         pgeom      qgeom      dgeom      rgeom
Hypergeometric                    phyper     qhyper     dhyper     rhyper
Logistic                          plogis     qlogis     dlogis     rlogis
Log Normal                        plnorm     qlnorm     dlnorm     rlnorm
Negative Binomial                 pnbinom    qnbinom    dnbinom    rnbinom
Normal                            pnorm      qnorm      dnorm      rnorm
Poisson                           ppois      qpois      dpois      rpois
Student t                         pt         qt         dt         rt
Studentized Range                 ptukey     qtukey     dtukey     rtukey
Uniform                           punif      qunif      dunif      runif
Weibull                           pweibull   qweibull   dweibull   rweibull
Wilcoxon Rank Sum Statistic       pwilcox    qwilcox    dwilcox    rwilcox
Wilcoxon Signed Rank Statistic    psignrank  qsignrank  dsignrank  rsignrank
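The function names in the table all follow the same pattern: the prefixes p, q, d, and r give the cumulative distribution function, the quantile function, the density (or probability mass) function, and random number generation, respectively. A small illustration, not from the original handout, using the normal distribution:

pnorm(1.96, mean = 0, sd = 1)    # P(X <= 1.96), about 0.975
qnorm(0.975, mean = 0, sd = 1)   # 97.5th percentile, about 1.96
dnorm(0, mean = 0, sd = 1)       # density at 0, about 0.399
rnorm(5, mean = 0, sd = 1)       # five random draws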
2. Bootstrap Confidence Intervals

Suppose we want to estimate a population parameter, based on a random sample.

- Classical World: Observe one sample and the value of the sample statistic. The sampling distribution is determined by considering all possible (unobserved) samples from the same assumed population. We cannot directly observe the sampling distribution.

- Bootstrap World: Rather than assume a population, consider the observed sample to be the best estimate of the population. In fact, we will assume that it represents the probability distribution for the population. We can then generate all (or at least very many) possible samples by taking bootstrap samples (with replacement) from this estimated population, and thus observe the sampling distribution of the sample estimator.

2.1 Drawing bootstrap samples using R

We start with a very small data set, a set of new employee test scores: 23, 31, 37, 46, 49, 55, 57. First select a sample of size 7, with replacement, and compute the mean of the bootstrap sample.

score <- c(37, 49, 55, 57, 23, 31, 46)
mean <- mean(score)
mean

## [1] 42.57143

boot <- sample(score, size = 7, replace = TRUE)
boot

## [1] 31 37 31 31 31 37 31

mean.boot <- mean(boot)
mean.boot

## [1] 32.71429

We need to do this many times to estimate the sampling distribution of the mean.

score <- c(37, 49, 55, 57, 23, 31, 46)
mean <- mean(score)
mean

## [1] 42.57143

N <- length(score)
nboots <- 10000
boot.result <- numeric(nboots)
for (i in 1:nboots) {
  boot.samp <- sample(score, N, replace = TRUE)
  boot.result[i] <- mean(boot.samp)
}
hist(boot.result)

2.2 Bootstrap confidence intervals

Example. Suppose we have a random sample of size 30 from an exponential distribution with mean 25. We want to use the sample mean to estimate the population mean. We will discuss three ways to construct confidence intervals using bootstrapping.

2.2.1 Percentile method

If the estimator of the population parameter is the statistic used to create the distribution (e.g., the sample mean), then the confidence interval is simply the equal-tail quantiles that correspond to the confidence level.

Bootstrap 95% percentile confidence interval:

set.seed(4123)
x.exp <- rexp(30, rate = 1/25)
x.exp

##  [1]   2.528853   2.235845  40.423011   5.255557   3.355874   2.724010
##  [7]  10.787030   9.792154  31.882324  15.816492  13.925713  38.726646
## [13]  19.283214  70.833128  30.556819  31.620638  64.401698  34.713028
## [19]   6.153141  81.498008  16.034828  80.867315  41.204157  77.066567
## [25]  14.237094  47.647705  35.505526  13.203980 104.795832  11.897545

boxplot(x.exp)

n <- length(x.exp)
mean.exp <- mean(x.exp)
nboots <- 10000
boot.result <- numeric(nboots)
for (i in 1:nboots) {
  boot.samp <- sample(x.exp, n, replace = TRUE)
  boot.result[i] <- mean(boot.samp)
}
hist(boot.result)

mean.exp

## [1] 31.96579

quantile(boot.result, c(0.025, 0.975))

##     2.5%    97.5%
## 22.49355 42.19327

2.2.2 Pivot method

A pivot quantity is a function of the estimator whose distribution does not depend on the parameter being estimated.

Example: Suppose we are estimating the population mean μ, based on the sample mean Ȳ. Then the statistic

    T = (Ȳ − μ) / (S / √n)

has a Student's t distribution with n − 1 degrees of freedom. Because the distribution of T does not depend on μ, T is a pivot quantity.

When such a quantity exists, we can then use bootstrapping to estimate the distribution of the pivot quantity (essentially a custom table) and use quantiles from the table to create the confidence interval.
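The handout does not show code for this step, but a minimal sketch of the idea, continuing with the x.exp sample from the percentile example, might look like the following (the names t.star and se.exp are introduced here for illustration, not taken from the original text):

n <- length(x.exp)
mean.exp <- mean(x.exp)
se.exp <- sd(x.exp) / sqrt(n)            # standard error from the original sample
nboots <- 10000
t.star <- numeric(nboots)                # bootstrap estimates of the pivot T
for (i in 1:nboots) {
  boot.samp <- sample(x.exp, n, replace = TRUE)
  t.star[i] <- (mean(boot.samp) - mean.exp) / (sd(boot.samp) / sqrt(n))
}
q <- quantile(t.star, c(0.025, 0.975))   # bootstrap "t table" for the pivot
c(mean.exp - q[2] * se.exp,              # lower limit of the 95% interval
  mean.exp - q[1] * se.exp)              # upper limit of the 95% interval

Because the bootstrap distribution of the pivot replaces the t table, the resulting interval need not be symmetric about the sample mean, which is often an advantage for skewed data such as the exponential sample used here.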