Reproducibility of Statistical Test Results Based on P-Value

Reproducibility of statistical test results based on p-value
Japanese Journal of Biometrics Vol. 40, No. 2, 69-79 (2020), Original Article
Takashi Yanagawa, Biostatistics Center, Kurume University, e-mail: [email protected]

Reproducibility is the essence of scientific research. Focusing on two-sample problems, we discuss in this paper the reproducibility of statistical test results based on p-values. First, by demonstrating the large variability of p-values, it is shown that p-values lack reproducibility, in particular if sample sizes are not large enough. Second, a sample size formula is developed that assures the reproducibility probability of the p-value at a given level, assuming normal distributions with known variance. Finally, the sample size formula for reproducibility in a general framework is shown to be equivalent to the sample size formula that has been developed for the Neyman-Pearson type test of a statistical hypothesis, which employs the level of significance and the size of the power.

Key words: Interquartile range, Median, Lower and upper quartiles, Neyman-Pearson type statistical test, Reproducibility probability

1. Introduction

The p-value was introduced by K. Pearson in his chi-square test (1900), and its use was promoted by R. A. Fisher as a useful tool for statistical inference from data (Fisher, R. A. 1925). Fisher suggested looking into the data to check whether the treatment effect existed if p-value ⩽ 0.05, but not considering the data any further if p-value > 0.05 (Ono, 2018). On the other hand, Neyman and Pearson (1928) introduced a method of testing statistical hypotheses by establishing a null and an alternative hypothesis, introducing errors of the first and second kind, and solving the mathematical problem of maximizing the power, defined as 1 minus the probability of an error of the second kind, while keeping the probability of an error of the first kind below a given level of significance.
Fisher emphasized statistical inference, whereas Neyman and Pearson emphasized statistical decision; the philosophical dispute between Fisher and Neyman over this difference of ideas continued for nearly 30 years, until the death of Fisher in 1962. Their ways of testing can apparently be expressed equivalently as follows: reject the null hypothesis if p-value ⩽ α, and do not reject it if p-value > α, where α is called the level of significance. The Neyman-Pearson approach to testing statistical hypotheses is illustrated almost exclusively in mathematical statistics textbooks; we call it in this paper the Neyman-Pearson type test of a statistical hypothesis. On the other hand, the p-value and statistical tests based on the p-value are illustrated almost exclusively in biostatistics textbooks. The author of the present paper tried to bridge these two extremes by introducing the idea of reproducibility of statistical test results based on the p-value in his booklet (Yanagawa, 2019), written in Japanese and giving no mathematical proofs. The purpose of the present paper is to describe that idea mathematically, with rigorous proof, and in English, so as to convey our findings not only to Japanese but also to overseas professionals. Recently, after the publication of the booklet, Valentin Amrhein, Sander Greenland, Blake McShane and more than 800 signatories called for the entire concept of statistical significance to be abandoned (Amrhein, V., Greenland, S., McShane, B., 2019). I appreciate this call to give up the Neyman-Pearson type test and come back to Fisher's statistical inference. If this happens, the determination of sample sizes, an important topic in designing scientific research that has been undertaken using the concept of statistical power in the framework of the Neyman-Pearson type test, could be left up in the air.

(Received February 2019. Revised June 2019. Accepted August 2019.)
However, it will be demonstrated in the present paper that the concept of reproducibility could replace it. To present our idea simply, we focus the discussion in this paper on two-sample problems consisting of a treatment group of sample size n and a control group of the same sample size; and, letting Δ be the treatment effect, we consider testing the statistical hypothesis H0: Δ = 0 vs. H1: Δ = δ (δ > 0). The p-value is defined in this setup by

p = Pr(T ≥ t0 | H0),  (1)

where T is a test statistic and t0 is the observed value of T. As seen in (1), the p-value is a random variable. The distribution and density functions of the p-value have been developed by several authors; we summarize them in Section 2. Numerical evaluations of location and scale parameters of the distribution of the p-value are given in Section 3 to demonstrate the large variability of p-values. It is also indicated there that statistical test results based on p-values lack reproducibility, in particular if sample sizes are not large enough, because of the large variability of p-values. In Section 4, assuming two-sample normal distributions with known common variance, a sample size formula is developed that assures the reproducibility probability at a given level. Finally, in that section, the sample size formula for reproducibility in a general framework is shown to be equivalent to the sample size formula that has been developed for the Neyman-Pearson type test, which employs the level of significance and the size of the power.

Jpn J Biomet Vol. 40, No. 2, 2020

2. Distribution of p-value

2.1 Distribution function and probability density function of p-value

Suppose a two-sample problem consisting of a treatment group of sample size n and a control group of the same sample size; and, letting Δ be the treatment effect, consider testing the statistical hypothesis H0: Δ = 0 vs. H1: Δ = δ (δ > 0) with test statistic T.
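To make the randomness of the p-value in (1) concrete, the following is a small simulation sketch, not from the paper: it repeats the two-sample experiment under H1 many times and records the one-sided p-value of the known-variance z-test each time. The values n = 25, δ = 0.5, σ = 1 and the number of replications are hypothetical illustration choices.

```python
import random
from statistics import NormalDist

rng = random.Random(0)
Z = NormalDist()                 # standard normal (sigma assumed known, = 1)
n, delta, reps = 25, 0.5, 2000   # hypothetical sample size, effect, replications

pvals = []
for _ in range(reps):
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]    # control group
    y = [rng.gauss(delta, 1.0) for _ in range(n)]  # treatment group under H1
    # T = (Ybar - Xbar) / sqrt(2*sigma^2/n); p = Pr(T >= t0 | H0)
    t0 = (sum(y) / n - sum(x) / n) / (2.0 / n) ** 0.5
    pvals.append(1.0 - Z.cdf(t0))

pvals.sort()
q1, med, q3 = pvals[reps // 4], pvals[reps // 2], pvals[3 * reps // 4]
print(f"median p = {med:.4f}, interquartile range = {q3 - q1:.4f}")
```

Even at a fixed effect size, the replicated p-values spread over a wide range, which is the variability the paper quantifies in the following sections.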
Let F0 and F1 be the probability distribution functions of the test statistic T under H0 and H1, with density functions f0 and f1, respectively. The distribution function and probability density function of the p-value have been given by Hung, O'Neil, Bauer and Kohne (1997); Sackrowitz and Samuel-Cahn (1999); and Donahue (1999). We restate them in the following proposition.

PROPOSITION 2.1
(1) The p-value follows a uniform distribution on (0, 1) under H0 (Hung, O'Neil, Bauer and Kohne, 1997).
(2) Denote by Gp(x) and gp(x) the distribution function and probability density function of the p-value under H1, respectively. Then Gp(x) and gp(x) are given as follows (Sackrowitz and Samuel-Cahn, 1999; Donahue, 1999):

Gp(x) = 1 − F1(F0⁻¹(1−x))  (0 < x < 1),  (2)

gp(x) = f1(F0⁻¹(1−x)) / f0(F0⁻¹(1−x))  (0 < x < 1).  (3)

(3) The distribution function of F0⁻¹(1−p) under H1 is F1(x).

COROLLARY 2.1 Let X1, X2, ..., Xn be independently and identically distributed (i.i.d.) random variables following a normal distribution with mean 0 and variance σ², and let Y1, Y2, ..., Yn be i.i.d. random variables following a normal distribution with mean μ and variance σ², where σ is assumed known and the treatment effect is expressed by δ = μ/σ. Furthermore, let the test statistic T be given by

T = (Ȳ − X̄) / √(2σ²/n).

Then it follows that:
(1) Gp(x) and gp(x) are given as follows, where Φ is the distribution function of the standard normal distribution, Φ⁻¹ is its inverse, φ is its density function, and μn = √(n/2) δ:

Gp(x) = 1 − Φ(Φ⁻¹(1−x) − μn)  (0 < x < 1),

gp(x) = φ(Φ⁻¹(1−x) − μn) / φ(Φ⁻¹(1−x))  (0 < x < 1).  (4)

(2) Φ⁻¹(1−p) follows the normal distribution N(μn, 1) under H1.

Figure 1. Probability density function of the p-value given in (4) when n = 25 and μ = 0.176.

2.2 Location and dispersion parameters of the distribution of p-value

The probability density function given in (4) is illustrated in Figure 1 for n = 25 and μ = 0.176, taking ln(p) = loge(p) on the horizontal axis.
The figure indicates that the distribution of the p-value is not symmetric and has a long right tail under H1. We shall therefore employ the median (the 50% point of the distribution function) as a location parameter, and the interquartile range (the 75% point minus the 25% point) as a dispersion parameter, of the distribution function of the p-value, as employed in the box-and-whisker plot (Tukey, 1977). Putting q1 = 0.25, q2 = 0.50, q3 = 0.75, the lower quartile (25% point), median and upper quartile (75% point) of the distribution of the p-value are given by ξi, i = 1, 2, 3, respectively, satisfying

Gp(ξi) = qi.  (5)

More precisely, they are given in the following proposition.

PROPOSITION 2.2 For given qi, ξi is given by

ξi = 1 − F0(F1⁻¹(1−qi)).

COROLLARY 2.2 In the framework given in Corollary 2.1:
(1) ξi is given by

ξi = 1 − Φ(μn + Φ⁻¹(1−qi)),  (6)

i = 1, 2, 3, where μn = √(n/2) δ.
(2) ξi is a monotone decreasing function of n, i = 1, 2, 3.
(3) The interquartile range is a monotone decreasing function of n.

3. Numerical evaluation of quartiles and the interquartile range

Assuming the framework given in Corollary 2.1, we numerically evaluate in this section the quartiles and the interquartile range of the distribution of the p-value that were given in the previous section.

Figure 2. Median, lower and upper quartiles of the distribution of the p-value when δ = 0.30.

3.1 Medians depend sharply on sample size

Figure 2 gives the curves of the median, lower and upper quartiles of the distribution of the p-value when δ = 0.30, for n = 20 to 200, taking n on the horizontal axis.
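The quartile formula in (6) is easy to evaluate directly. The sketch below uses δ = 0.30, matching Figure 2, with a few illustrative n values; it prints the lower quartile, median, upper quartile, and interquartile range, all of which shrink as n grows, as stated in Corollary 2.2.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: Phi = Z.cdf, Phi^{-1} = Z.inv_cdf
delta = 0.30       # treatment effect, as in Figure 2

def quartile(n, q):
    """xi_i = 1 - Phi(mu_n + Phi^{-1}(1 - q_i)) with mu_n = sqrt(n/2)*delta."""
    mu_n = (n / 2.0) ** 0.5 * delta
    return 1.0 - Z.cdf(mu_n + Z.inv_cdf(1.0 - q))

for n in (20, 50, 100, 200):   # illustrative sample sizes
    lo, med, hi = (quartile(n, q) for q in (0.25, 0.50, 0.75))
    print(f"n={n:3d}  Q1={lo:.4f}  median={med:.4f}  Q3={hi:.4f}  IQR={hi - lo:.4f}")
```

At n = 20 the middle half of the p-value distribution is still spread over a wide interval, which is the lack of reproducibility that motivates the sample size formula of Section 4.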