Lecture 3: Statistical Sampling Uncertainty
Total Page:16
File Type:pdf, Size:1020Kb

Recommended publications
04 – Everything You Want to Know About Correlation but Were
EVERYTHING YOU WANT TO KNOW ABOUT CORRELATION BUT WERE AFRAID TO ASK, FRED KUO. MOTIVATION • Correlation as a source of confusion • Some of the confusion may arise from the literary use of the word to convey dependence, as most people use “correlation” and “dependence” interchangeably • The word “correlation” is ubiquitous in cost/schedule risk analysis and yet there are many misconceptions about it. • A better understanding of the meaning and derivation of the correlation coefficient, and of what it truly measures, is beneficial for cost/schedule analysts. • Many times “true” correlation is not obtainable, as will be demonstrated in this presentation; what should the risk analyst do? • Are there any measures of dependence other than correlation? • Concordance and Discordance • Co-monotonicity and Counter-monotonicity • Conditional Correlation etc. CONTENTS • What is Correlation? • Correlation and dependence • Some examples • Defining and Estimating Correlation • How many data points for an accurate calculation? • The use and misuse of correlation • Some examples • Correlation and Cost Estimate • How does correlation affect cost estimates? • Portfolio effect? • Correlation and Schedule Risk • How does correlation affect schedule risks? • How Shall We Go From Here? • Some ideas for risk analysis POPULARITY AND SHORTCOMINGS OF CORRELATION • Why Is Correlation Popular? • Correlation is a natural measure of dependence for a Multivariate Normal Distribution (MVN) and the so-called elliptical family of distributions • It is easy to calculate analytically; we only need to calculate covariance and variance to get correlation • Correlation and covariance are easy to manipulate under linear operations • Correlation Shortcomings • Variances of R.V.
11. Correlation and Linear Regression
11. Correlation and linear regression The goal in this chapter is to introduce correlation and linear regression. These are the standard tools that statisticians rely on when analysing the relationship between continuous predictors and continuous outcomes. 11.1 Correlations In this section we’ll talk about how to describe the relationships between variables in the data. To do that, we want to talk mostly about the correlation between variables. But first, we need some data. 11.1.1 The data

Table 11.1: Descriptive statistics for the parenthood data.

variable                   min    max    mean   median  std. dev  IQR
Dan’s grumpiness           41     91     63.71  62      10.05     14
Dan’s hours slept          4.84   9.00   6.97   7.03    1.02      1.45
Dan’s son’s hours slept    3.25   12.07  8.05   7.95    2.07      3.21

Let’s turn to a topic close to every parent’s heart: sleep. The data set we’ll use is fictitious, but based on real events. Suppose I’m curious to find out how much my infant son’s sleeping habits affect my mood. Let’s say that I can rate my grumpiness very precisely, on a scale from 0 (not at all grumpy) to 100 (grumpy as a very, very grumpy old man or woman). And let’s also assume that I’ve been measuring my grumpiness, my sleeping patterns and my son’s sleeping patterns for quite some time now. Let’s say, for 100 days. And, being a nerd, I’ve saved the data as a file called parenthood.csv.
Construct Validity and Reliability of the Work Environment Assessment Instrument WE-10
International Journal of Environmental Research and Public Health Article Construct Validity and Reliability of the Work Environment Assessment Instrument WE-10 Rudy de Barros Ahrens 1,*, Luciana da Silva Lirani 2 and Antonio Carlos de Francisco 3 1 Department of Business, Faculty Sagrada Família (FASF), Ponta Grossa, PR 84010-760, Brazil 2 Department of Health Sciences Center, State University Northern of Paraná (UENP), Jacarezinho, PR 86400-000, Brazil; [email protected] 3 Department of Industrial Engineering and Post-Graduation in Production Engineering, Federal University of Technology—Paraná (UTFPR), Ponta Grossa, PR 84017-220, Brazil; [email protected] * Correspondence: [email protected] Received: 1 September 2020; Accepted: 29 September 2020; Published: 9 October 2020 Abstract: The purpose of this study was to validate the construct and reliability of an instrument to assess the work environment as a single tool based on quality of life (QL), quality of work life (QWL), and organizational climate (OC). The methodology tested the construct validity through Exploratory Factor Analysis (EFA) and reliability through Cronbach’s alpha. The EFA returned a Kaiser–Meyer–Olkin (KMO) value of 0.917; which demonstrated that the data were adequate for the factor analysis; and a significant Bartlett’s test of sphericity (χ2 = 7465.349; Df = 1225; p ≤ 0.000). After the EFA; the varimax rotation method was employed for a factor through commonality analysis; reducing the 14 initial factors to 10. Only question 30 presented commonality lower than 0.5; and the other questions returned values higher than 0.5 in the commonality analysis. Regarding the reliability of the instrument; all of the questions presented reliability as the values varied between 0.953 and 0.956.
14: Correlation
14: Correlation Introduction | Scatter Plot | The Correlational Coefficient | Hypothesis Test | Assumptions | An Additional Example Introduction Correlation quantifies the extent to which two quantitative variables, X and Y, “go together.” When high values of X are associated with high values of Y, a positive correlation exists. When high values of X are associated with low values of Y, a negative correlation exists. Illustrative data set. We use the data set bicycle.sav to illustrate correlational methods. In this cross-sectional data set, each observation represents a neighborhood. The X variable is socioeconomic status measured as the percentage of children in a neighborhood receiving free or reduced-fee lunches at school. The Y variable is bicycle helmet use measured as the percentage of bicycle riders in the neighborhood wearing helmets. Twelve neighborhoods are considered:

Neighborhood    X (% receiving reduced-fee lunch)    Y (% wearing bicycle helmets)
Fair Oaks       50                                   22.1
Strandwood      11                                   35.9
Walnut Acres    2                                    57.9
Discov. Bay     19                                   22.2
Belshaw         26                                   42.4
Kennedy         73                                   5.8
Cassell         81                                   3.6
Miner           51                                   21.4
Sedgewick       11                                   55.2
Sakamoto        2                                    33.3
Toyon           19                                   32.4
Lietz           25                                   38.4

There are twelve observations (n = 12). Overall, x̄ = 30.83 and ȳ = 30.883. We want to explore the relation between socioeconomic status and the use of bicycle helmets. It should be noted that an outlier (84, 46.6) has been removed from this data set so that we may quantify the linear relation between X and Y. Scatter Plot The first step is to create a scatter plot of the data.
The Probability Lifesaver: Order Statistics and the Median Theorem
The Probability Lifesaver: Order Statistics and the Median Theorem Steven J. Miller December 30, 2015 Contents 1 Order Statistics and the Median Theorem: 1.1 Definition of the Median; 1.2 Order Statistics; 1.3 Examples of Order Statistics; 1.4 The Sample Distribution of the Median; 1.5 Technical bounds for proof of Median Theorem; 1.6 The Median of Normal Random Variables. Greetings again! In this supplemental chapter we develop the theory of order statistics in order to prove The Median Theorem. This is a beautiful result in its own right, but also extremely important as a substitute for the Central Limit Theorem, and allows us to say non-trivial things when the CLT is unavailable. Chapter 1 Order Statistics and the Median Theorem The Central Limit Theorem is one of the gems of probability. It’s easy to use and its hypotheses are satisfied in a wealth of problems. Many courses build towards a proof of this beautiful and powerful result, as it truly is ‘central’ to the entire subject. Not to detract from the majesty of this wonderful result, however, what happens in those instances where it’s unavailable? For example, one of the key assumptions that must be met is that our random variables need to have finite higher moments, or at the very least a finite variance. What if we were to consider sums of Cauchy random variables? Is there anything we can say? This is not just a question of theoretical interest, of mathematicians generalizing for the sake of generalization. The following example from economics highlights why this chapter is more than just of theoretical interest.
Central Limit Theorem and Its Applications to Baseball
Central Limit Theorem and Its Applications to Baseball by Nicole Anderson. A project submitted to the Department of Mathematical Sciences in conformity with the requirements for Math 4301 (Honours Seminar), Lakehead University, Thunder Bay, Ontario, Canada. Copyright © (2014) Nicole Anderson. Abstract This honours project is on the Central Limit Theorem (CLT). The CLT is considered to be one of the most powerful theorems in all of statistics and probability. In probability theory, the CLT states that, given certain conditions, the sample mean of a sufficiently large number of iterates of independent random variables, each with a well-defined expected value and well-defined variance, will be approximately normally distributed. In this project, a brief historical review of the CLT is provided, and some basic concepts, two proofs of the CLT and several properties are discussed. As an application, we discuss how to use the CLT to study the sampling distribution of the sample mean and hypothesis testing using baseball statistics. Acknowledgements I would like to thank my supervisor, Dr. Li, who helped me by sharing his knowledge and many resources to help make this paper come to life. I would also like to thank Dr. Adam Van Tuyl for all of his help with LaTeX, and support throughout this project. Thank you very much! Contents Abstract; Acknowledgements; Chapter 1. Introduction: 1. Historical Review of Central Limit Theorem; 2. Central Limit Theorem in Practice. Chapter 2. Preliminaries: 1. Definitions; 2. Central Limit Theorem. Chapter 3. Proofs of Central Limit Theorem: 1.
CORRELATION COEFFICIENTS Ice Cream and Crime Difficulty Scale ☺ ☺ (moderately hard)
5 COMPUTING CORRELATION COEFFICIENTS Ice Cream and Crime Difficulty Scale ☺ ☺ (moderately hard) WHAT YOU WILL LEARN IN THIS CHAPTER • Understanding what correlations are and how they work • Computing a simple correlation coefficient • Interpreting the value of the correlation coefficient • Understanding what other types of correlations exist and when they should be used WHAT ARE CORRELATIONS ALL ABOUT? Measures of central tendency and measures of variability are not the only descriptive statistics that we are interested in using to get a picture of what a set of scores looks like. You have already learned that knowing the values of the one most representative score (central tendency) and a measure of spread or dispersion (variability) is critical for describing the characteristics of a distribution. However, sometimes we are as interested in the relationship between variables—or, to be more precise, how the value of one variable changes when the value of another variable changes. The way we express this interest is through the computation of a simple correlation coefficient. For example, what’s the relationship between age and strength? Income and years of education? Memory skills and amount of drug use? Your political attitudes and the attitudes of your parents? A correlation coefficient is a numerical index that reflects the relationship or association between two variables. The value of this descriptive statistic ranges between −1.00 and +1.00. Copyright ©2020 by SAGE Publications, Inc. This work may not be reproduced or distributed in any form or by any means without express written permission of the publisher.
Lecture 4 Multivariate Normal Distribution and Multivariate CLT
Lecture 4 Multivariate normal distribution and multivariate CLT. We start with several simple observations. If $X = (x_1, \ldots, x_k)^T$ is a $k \times 1$ random vector then its expectation is $EX = (Ex_1, \ldots, Ex_k)^T$ and its covariance matrix is $\mathrm{Cov}(X) = E(X - EX)(X - EX)^T$. Notice that a covariance matrix is always symmetric, $\mathrm{Cov}(X)^T = \mathrm{Cov}(X)$, and nonnegative definite, i.e. for any $k \times 1$ vector $a$, $a^T \mathrm{Cov}(X)\, a = E\, a^T (X - EX)(X - EX)^T a = E\,|a^T (X - EX)|^2 \ge 0$. We will often use that for any vector $X$ its squared length can be written as $|X|^2 = X^T X$. If we multiply a random $k \times 1$ vector $X$ by an $n \times k$ matrix $A$ then the covariance of $Y = AX$ is an $n \times n$ matrix $\mathrm{Cov}(Y) = E\, A (X - EX)(X - EX)^T A^T = A\, \mathrm{Cov}(X)\, A^T$. Multivariate normal distribution. Let us consider a $k \times 1$ vector $g = (g_1, \ldots, g_k)^T$ of i.i.d. standard normal random variables. The covariance of $g$ is, obviously, a $k \times k$ identity matrix, $\mathrm{Cov}(g) = I$. Given an $n \times k$ matrix $A$, the covariance of $Ag$ is an $n \times n$ matrix $\Sigma := \mathrm{Cov}(Ag) = A I A^T = A A^T$. Definition. The distribution of a vector $Ag$ is called a (multivariate) normal distribution with covariance $\Sigma$ and is denoted $N(0, \Sigma)$. One can also shift this distribution: the distribution of $Ag + a$ is called a normal distribution with mean $a$ and covariance $\Sigma$ and is denoted $N(a, \Sigma)$. There is one potential problem with the above definition - we assume that the distribution depends only on covariance matrix $\Sigma$ and does not depend on the construction, i.e.
Lectures on Statistics
Lectures on Statistics William G. Faris December 1, 2003 Contents 1 Expectation: 1.1 Random variables and expectation; 1.2 The sample mean; 1.3 The sample variance; 1.4 The central limit theorem; 1.5 Joint distributions of random variables; 1.6 Problems. 2 Probability: 2.1 Events and probability; 2.2 The sample proportion; 2.3 The central limit theorem; 2.4 Problems. 3 Estimation: 3.1 Estimating means; 3.2 Two population means; 3.3 Estimating population proportions; 3.4 Two population proportions; 3.5 Supplement: Confidence intervals; 3.6 Problems. 4 Hypothesis testing: 4.1 Null and alternative hypothesis; 4.2 Hypothesis on a mean; 4.3 Two means; 4.4 Hypothesis on a proportion; 4.5 Two proportions; 4.6 Independence; 4.7 Power; 4.8 Loss; 4.9 Supplement: P-values; 4.10 Problems. 5 Order statistics: 5.1 Sample median and population median; 5.2 Comparison of sample mean and sample median; 5.3 The Kolmogorov-Smirnov statistic; 5.4 Other goodness of fit statistics; 5.5 Comparison with a fitted distribution; 5.6 Supplement: Uniform order statistics; 5.7 Problems. 6 The bootstrap: 6.1 Bootstrap samples; 6.2 The ideal bootstrap estimator; 6.3 The Monte Carlo bootstrap estimator; 6.4 Supplement: Sampling from a finite population
Eight Things You Need to Know About Interpreting Correlations
Research Skills One, Correlation interpretation, Graham Hole v.1.0. Eight things you need to know about interpreting correlations: A correlation coefficient is a single number that represents the degree of association between two sets of measurements. It ranges from +1 (perfect positive correlation) through 0 (no correlation at all) to -1 (perfect negative correlation). Correlations are easy to calculate, but their interpretation is fraught with difficulties because the apparent size of the correlation can be affected by so many different things. The following are some of the issues that you need to take into account when interpreting the results of a correlation test. 1. Correlation does not imply causality: This is the single most important thing to remember about correlations. If there is a strong correlation between two variables, it's easy to jump to the conclusion that one of the variables causes the change in the other. However, this is not a valid conclusion. If you have two variables, X and Y, it might be that X causes Y; that Y causes X; or that a third factor, Z (or even a whole set of other factors) gives rise to the changes in both X and Y. For example, suppose there is a correlation between how many slices of pizza I eat (variable X), and how happy I am (variable Y). It might be that X causes Y - so that the more pizza slices I eat, the happier I become. But it might equally well be that Y causes X - the happier I am, the more pizza I eat.
Central Limit Theorems When Data Are Dependent: Addressing the Pedagogical Gaps
Institute for Empirical Research in Economics, University of Zurich. Working Paper Series, ISSN 1424-0459. Working Paper No. 480. Central Limit Theorems When Data Are Dependent: Addressing the Pedagogical Gaps. Timothy Falcon Crack and Olivier Ledoit. February 2010. Timothy Falcon Crack (corresponding author), Professor of Finance, University of Otago, Department of Finance and Quantitative Analysis, PO Box 56, Dunedin, New Zealand, [email protected]. Olivier Ledoit, Research Associate, Institute for Empirical Research in Economics, University of Zurich, [email protected]. Version: August 18, 2009. ABSTRACT Although dependence in financial data is pervasive, standard doctoral-level econometrics texts do not make clear that the common central limit theorems (CLTs) contained therein fail when applied to dependent data. More advanced books that are clear in their CLT assumptions do not contain any worked examples of CLTs that apply to dependent data. We address these pedagogical gaps by discussing dependence in financial data and dependence assumptions in CLTs and by giving a worked example of the application of a CLT for dependent data to the case of the derivation of the asymptotic distribution of the sample variance of a Gaussian AR(1). We also provide code and the results for a Monte-Carlo simulation used to check the results of the derivation. INTRODUCTION Financial data exhibit dependence. This dependence invalidates the assumptions of common central limit theorems (CLTs). Although dependence in financial data has been a high-profile research area for over 70 years, standard doctoral-level econometrics texts are not always clear about the dependence assumptions needed for common CLTs.
Chapter 11 -- Correlation
Contents 11 Association Between Variables: 11.4 Correlation; 11.4.1 Introduction; 11.4.2 Correlation Coefficient; 11.4.3 Pearson's r; 11.4.4 Test of Significance for r; 11.4.5 Correlation and Causation; 11.4.6 Spearman's rho; 11.4.7 Size of r for Various Types of Data. Chapter 11 Association Between Variables 11.4 Correlation 11.4.1 Introduction The measures of association examined so far in this chapter are useful for describing the nature of association between two variables which are measured at no more than the nominal scale of measurement. All of these measures, φ, C, Cramer's V and λ, can also be used to describe association between variables that are measured at the ordinal, interval or ratio level of measurement. But where variables are measured at these higher levels, it is preferable to employ measures of association which use the information contained in these higher levels of measurement. These higher level scales either rank values of the variable, or permit the researcher to measure the distance between values of the variable. The association between rankings of two variables, or the association of distances between values of the variable provides a more detailed idea of the nature of association between variables than do the measures examined so far in this chapter. While there are many measures of association for variables which are measured at the ordinal or higher level of measurement, correlation is the most commonly used approach.