4 Resampling Methods: The Bootstrap

• Situation: Let $x_1, x_2, \ldots, x_n$ be a SRS of size $n$ taken from a distribution that is unknown. Let $\theta$ be a parameter of interest associated with this distribution and let $\hat{\theta} = S(x_1, x_2, \ldots, x_n)$ be a statistic used to estimate $\theta$.
  – For example, the sample mean $\hat{\theta} = S(x_1, x_2, \ldots, x_n) = \bar{x}$ is a statistic used to estimate the true mean.
• Goals: (i) provide a standard error estimate $se_B(\hat{\theta})$ for $\hat{\theta}$, (ii) estimate the bias of $\hat{\theta}$, and (iii) generate a confidence interval for the parameter $\theta$.
• Bootstrap methods are computer-intensive methods of providing these estimates, and they depend on bootstrap samples.
• An (independent) bootstrap sample is a SRS of size $n$ taken with replacement from the data $x_1, x_2, \ldots, x_n$.
• We denote a bootstrap sample as $x_1^*, x_2^*, \ldots, x_n^*$, which consists of members of the original data set $x_1, x_2, \ldots, x_n$ with some members appearing zero times, some appearing only once, some appearing twice, and so on.
• A bootstrap sample replication of $\hat{\theta}$, denoted $\hat{\theta}^*$, is the value of $\hat{\theta}$ evaluated using the bootstrap sample $x_1^*, x_2^*, \ldots, x_n^*$.
• The bootstrap algorithm requires that a large number ($B$) of bootstrap samples be taken. The bootstrap sample replication $\hat{\theta}^*$ is then calculated for each of the $B$ bootstrap samples. We will denote the $b$th bootstrap replication as $\hat{\theta}^*(b)$ for $b = 1, 2, \ldots, B$.
• The notes in this section are based on Efron and Tibshirani (1993) and Manly (2007).

4.1 The Bootstrap Estimate of the Standard Error

• The bootstrap estimate of the standard error of $\hat{\theta}$ is
      $$se_B(\hat{\theta}) = \sqrt{\frac{\sum_{b=1}^{B} \left[\hat{\theta}^*(b) - \hat{\theta}^*(\cdot)\right]^2}{B-1}} \qquad (7)$$
  where $\hat{\theta}^*(\cdot) = \frac{1}{B}\sum_{b=1}^{B} \hat{\theta}^*(b)$ is the sample mean of the $B$ bootstrap replications.
• Note that $se_B(\hat{\theta})$ is just the sample standard deviation of the $B$ bootstrap replications.
• The limit of $se_B(\hat{\theta})$ as $B \to \infty$ is the ideal bootstrap estimate of the standard error.
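As a minimal sketch of this algorithm in Python (NumPy assumed; the function name, the default $B$, and the example data are illustrative choices, not from the notes):

```python
import numpy as np

def bootstrap_se(data, stat, B=2000, rng=None):
    """Bootstrap estimate of the standard error of stat(data), as in eq. (7)."""
    rng = np.random.default_rng(rng)
    data = np.asarray(data)
    n = len(data)
    # Draw B bootstrap samples (each a SRS of size n taken with replacement
    # from the data) and evaluate the statistic on each one.
    reps = np.array([stat(rng.choice(data, size=n, replace=True))
                     for _ in range(B)])
    # se_B is just the sample standard deviation of the B replications
    # (denominator B - 1, hence ddof=1).
    return reps.std(ddof=1)

# Example: standard error of the sample mean.
se = bootstrap_se([0, 1, 2, 3, 4, 8, 8, 9, 10, 11], np.mean, B=2000, rng=1)
```

Because each replication is computed from a resample of the same size $n$, increasing $B$ only reduces the simulation noise in $se_B(\hat{\theta})$; it does not add information beyond the original sample.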
• Under most circumstances, as the sample size $n$ increases, the sampling distribution of $\hat{\theta}$ becomes more normally distributed. Under this assumption, an approximate t-based bootstrap confidence interval can be generated using $se_B(\hat{\theta})$ and a t-distribution:
      $$\hat{\theta} \pm t^*\, se_B(\hat{\theta})$$
  where $t^*$ has $n - 1$ degrees of freedom.

4.2 The Bootstrap Estimate of Bias

• The bias of $\hat{\theta} = S(X_1, X_2, \ldots, X_n)$ as an estimator of $\theta$ is defined to be $\text{bias}(\hat{\theta}) = E_F(\hat{\theta}) - \theta$, with the expectation taken with respect to the distribution $F$.
• The bootstrap estimate of the bias of $\hat{\theta}$ as an estimate of $\theta$ is calculated by replacing the distribution $F$ with the empirical cumulative distribution function $\hat{F}$. This yields
      $$\widehat{\text{bias}}_B(\hat{\theta}) = \hat{\theta}^*(\cdot) - \hat{\theta} \quad\text{where}\quad \hat{\theta}^*(\cdot) = \frac{1}{B}\sum_{b=1}^{B} \hat{\theta}^*(b).$$
• Then, the bias-corrected estimate of $\theta$ is
      $$\tilde{\theta}_B = \hat{\theta} - \widehat{\text{bias}}_B(\hat{\theta}) = 2\hat{\theta} - \hat{\theta}^*(\cdot).$$
• One problem with estimating the bias is that the variance of $\widehat{\text{bias}}_B(\hat{\theta})$ is often large. Efron and Tibshirani (1993) recommend using:
  1. $\hat{\theta}$ if $\widehat{\text{bias}}_B(\hat{\theta})$ is small relative to $se_B(\hat{\theta})$.
  2. $\tilde{\theta}$ if $\widehat{\text{bias}}_B(\hat{\theta})$ is large relative to $se_B(\hat{\theta})$.
  They suggest and justify that if $\widehat{\text{bias}}_B(\hat{\theta}) < 0.25\, se_B(\hat{\theta})$, then the bias can be ignored (unless the goal is precise confidence interval estimation using this standard error).
• Manly (2007) suggests that when using bias correction, it is better to center the confidence interval limits at $\tilde{\theta}$. This would yield the approximate bias-corrected t-based confidence interval:
      $$\tilde{\theta} \pm t^*\, se_B(\hat{\theta})$$

4.3 Introductory Bootstrap Example

• Consider a SRS with $n = 10$ having y-values
      0  1  2  3  4  8  8  9  10  11
• The following output is based on $B = 40$ bootstrap replications of the sample mean $\bar{x}$, the sample standard deviation $s$, the sample variance $s^2$, and the sample median.
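Before turning to the example output, the bias estimate and the bias-corrected t-based interval of Section 4.2 can be sketched in Python (NumPy assumed; the function name and the way $t^*$ is passed in are illustrative choices, not from the notes):

```python
import numpy as np

def bootstrap_bias_ci(data, stat, t_star, B=2000, rng=None):
    """Bias estimate, bias-corrected estimate, and a t-based interval.

    t_star is the t critical value with n - 1 degrees of freedom.
    """
    rng = np.random.default_rng(rng)
    data = np.asarray(data)
    n = len(data)
    theta_hat = stat(data)
    reps = np.array([stat(rng.choice(data, size=n, replace=True))
                     for _ in range(B)])
    bias = reps.mean() - theta_hat      # bias_B = theta*(.) - theta_hat
    theta_tilde = theta_hat - bias      # = 2*theta_hat - theta*(.)
    se = reps.std(ddof=1)
    # Efron & Tibshirani's rule of thumb: ignore the bias (center at
    # theta_hat) when |bias_B| < 0.25 * se_B; otherwise center at theta_tilde
    # as Manly suggests.
    center = theta_hat if abs(bias) < 0.25 * se else theta_tilde
    return bias, theta_tilde, (center - t_star * se, center + t_star * se)

# n = 10, so t* has 9 degrees of freedom; 2.262 is the 95% critical value.
bias, theta_tilde, ci = bootstrap_bias_ci(
    [0, 1, 2, 3, 4, 8, 8, 9, 10, 11], np.mean, t_star=2.262, rng=1)
```

For the sample mean the true bias is zero, so the estimated bias here reflects only bootstrap simulation noise and the rule of thumb leaves the interval centered at $\hat{\theta}$.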
• The terms in the output are equivalent to the following:
      theta(hat) = $\hat{\theta}$ = the sample estimate of a parameter
      mean       = the sample mean $\bar{x}$
      s          = the sample standard deviation $s$
      variance   = the sample variance $s^2$
      median     = the sample median $m$
• These statistics are used as estimates of the population parameters $\mu$, $\sigma$, $\sigma^2$, and $M$.

theta(hat) values for mean, standard deviation, variance, median
      mean        s  variance    median
    5.6000   4.0332   16.2667    6.0000

The number of bootstrap samples B = 40

The bootstrap samples
     8  3  2  0  2 11 11 10  8  2
     2  9  9  1  9  1 10  4 11  4
     8  9  3  8  2  8  3  9  2  2
    10  2 11  9  4  8  2  8  3  9
     2  8  3  9  8  2  1  8  9  9
    10 10  2 11  4  8  8  3  2  3
    10  8  0  8  9 10  9  1  1 10
     3 10  8  3  2  3 11  9  3 10
    10  1 10  8  1  9 10  0  3  8
     3  3  3  4  3  0  1 11 11  4
     2  3  8  2 10  1  4  8  8  2
     2 10  1  3  2  0 10  3  8  4
     8  2 11 11  1  9 11  3  8  4
     9 11 10  2  8  1  9 10  9  9
     8  4  8  4  0  1  3  1  9 11
     4  9  8  0 10 10  4  1  8  8
     9 11  8 10  3  0  8 10  2  4
     8 11  0  3  2 10  1  4  8  8
     8  8  3  4  8  0  3 10 11  2
     2  1  8  0  2  3 11 11  9  8
     2 11  4  2  8  8  8  0  1 10
     4  2  0  4 11  0  0  9  1 10
    10  2  4  1  4  3  3 10  3  2
     1  8  1  8  4  8  3  0  9 10
     8  3  9  1  1  8 10  3 10  0
     8 11  8  2  1  2  0 11  3 11
     0  8  3  8 10 10  8  1  8  9
     2  0  3  3  9 11  2 11  3 11
    11  1  0 10  9  9  8  2 11  1
     2  8 10  1 11  4  8  4  3  2
    10  9  1  0  3  2  2  0  2 10
     2  8 11  3  0  4  4  4  3  3
     3  0  8  2  4  8  1  3 10  4
    10  9  8 10  8 10  0  0 11  8
     8 11  1  3  3 11  3  0 10  8
     8  3  1  4  8  3  8  9  1  1
     0  9 11  2  4 10  2  8 10  4
     8 11  0  8  0  8  1  3  0  2
     9  9  8 11 10  2 11 11  9  1
     1 11 11 11  3  0  0  3  0  9

Next, we calculate the sample mean, sample standard deviation, sample variance, and sample median for each of the bootstrap samples. These represent $\hat{\theta}^*_b$ values.
The column labels below represent bootstrap labels for four different $\hat{\theta}^*_b$ cases:
      mean     = a bootstrap sample mean $\bar{x}^*_b$
      std dev  = a bootstrap sample standard deviation $s^*_b$
      variance = a bootstrap sample variance $s^{2*}_b$
      median   = a bootstrap sample median $m^*_b$

Bootstrap replications: theta(hat)^*_b
      mean  std dev  variance    median
    5.7000   4.2960   18.4556    5.5000
    6.0000   3.9721   15.7778    6.5000
    5.4000   3.2042   10.2667    5.5000
    6.6000   3.4705   12.0444    8.0000
    5.9000   3.4140   11.6556    8.0000
    6.1000   3.6347   13.2111    6.0000
    6.6000   4.1687   17.3778    8.5000
    6.2000   3.6757   13.5111    5.5000
    6.0000   4.2164   17.7778    8.0000
    4.3000   3.7431   14.0111    3.0000
    4.8000   3.3267   11.0667    3.5000
    4.3000   3.6833   13.5667    3.0000
    6.8000   3.9384   15.5111    8.0000
    7.8000   3.4254   11.7333    9.0000
    4.9000   3.8427   14.7667    4.0000
    6.2000   3.6757   13.5111    8.0000
    6.5000   3.8944   15.1667    8.0000
    5.5000   3.9511   15.6111    6.0000
    5.7000   3.7431   14.0111    6.0000
    5.5000   4.3012   18.5000    5.5000
    5.4000   4.0332   16.2667    6.0000
    4.1000   4.3576   18.9889    3.0000
    4.2000   3.1903   10.1778    3.0000
    5.2000   3.7947   14.4000    6.0000
    5.3000   4.0565   16.4556    5.5000
    5.7000   4.5228   20.4556    5.5000
    6.5000   3.7193   13.8333    8.0000
    5.5000   4.4284   19.6111    3.0000
    6.2000   4.5898   21.0667    8.5000
    5.3000   3.6225   13.1222    4.0000
    3.9000   4.0947   16.7667    2.0000
    4.2000   3.1198    9.7333    3.5000
    4.3000   3.3015   10.9000    3.5000
    7.4000   4.0332   16.2667    8.5000
    5.8000   4.2374   17.9556    5.5000
    4.6000   3.3066   10.9333    3.5000
    6.0000   4.0277   16.2222    6.0000
    4.1000   4.2019   17.6556    2.5000
    8.1000   3.6347   13.2111    9.0000
    4.9000   4.9766   24.7667    3.0000

Take the mean of each column ($\hat{\theta}^*(\cdot)$), yielding $\bar{x}^*(\cdot)$, $s^*(\cdot)$, $s^{2*}(\cdot)$, and $m^*(\cdot)$.

Mean of the B bootstrap replications: theta(hat)^*_(.)
      mean        s  variance    median
    5.5875   3.8707   15.1581    5.6250

Take the standard deviation of each column ($se_B(\hat{\theta})$), yielding $se_B(\bar{x}^*)$, $se_B(s^*)$, $se_B(s^{2*})$, and $se_B(m^*)$.

Bootstrap standard error: s.e._B(theta(hat))
      mean        s  variance    median
    1.0229   0.4248    3.3422    2.1266

Finally, calculate the estimates of bias $\widehat{\text{bias}}_B(\hat{\theta}) = \hat{\theta}^*(\cdot) - \hat{\theta}$ for the four cases.
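The steps of the example above can be reproduced with a short script (NumPy assumed). The theta(hat) row is deterministic and matches the output exactly; the replication summaries depend on the random seed, so they will differ from the printed values while following the same pattern:

```python
import numpy as np

y = np.array([0, 1, 2, 3, 4, 8, 8, 9, 10, 11])

# theta(hat) values: mean, s, variance, median of the original sample.
theta_hat = {"mean": y.mean(),
             "s": y.std(ddof=1),
             "variance": y.var(ddof=1),
             "median": np.median(y)}

# B = 40 bootstrap replications of each statistic.
rng = np.random.default_rng(0)
B = 40
reps = {k: [] for k in theta_hat}
for _ in range(B):
    ys = rng.choice(y, size=len(y), replace=True)   # one bootstrap sample
    reps["mean"].append(ys.mean())
    reps["s"].append(ys.std(ddof=1))
    reps["variance"].append(ys.var(ddof=1))
    reps["median"].append(np.median(ys))

# Column means theta*(.), column standard deviations se_B, and bias estimates.
theta_star_dot = {k: np.mean(v) for k, v in reps.items()}
se_B = {k: np.std(v, ddof=1) for k, v in reps.items()}
bias_B = {k: theta_star_dot[k] - theta_hat[k] for k in theta_hat}
```

With only $B = 40$ replications, as in the printed output, the summaries are still noisy; in practice a larger $B$ (several hundred for standard errors, more for intervals) is typical.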