Principles of Statistical Analyses: Old and New Tools
Total Pages: 16
File Type: PDF, Size: 1020 KB
Recommended publications
How Many Participants Do I Have to Include in Properly Powered Experiments?
How many participants do we have to include in properly powered experiments? A tutorial of power analysis with some simple guidelines

Marc Brysbaert, Department of Experimental Psychology, Ghent University, H. Dunantlaan 2, 9000 Gent, Belgium ([email protected])

Keywords: power analysis, ANOVA, Bayesian statistics, effect size

Abstract
Given that the average effect size of pairwise comparisons in psychology is d = .4, very few studies are properly powered with fewer than 50 participants. For most designs and analyses, numbers of 100, 200, and even more are needed. These numbers become feasible with the recent introduction of internet-based studies and experiments, although they also require a change in the way research is evaluated by supervisors, examiners, reviewers, and editors. The present paper describes the numbers needed for the designs most often used by psychologists, including single-variable between-groups and repeated-measures designs with two and three levels, and two-factor designs involving two repeated-measures variables or one between-groups variable and one repeated-measures variable (split-plot design). The numbers are given for the traditional, frequentist analysis with p < .05 and for Bayesian analysis with BF > 10. These numbers should give a straightforward answer to researchers asking the question: "How many participants do I have to include in my experiment?" We also discuss how researchers can improve the power of their study by including multiple observations per condition per participant.

Statistical packages tend to be used as a kind of oracle. In order to elicit a response from the oracle, one has to click one's way through cascades of menus.
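As a rough illustration of the kind of numbers the tutorial reports, base R's power.t.test() covers the simplest case of a two-group comparison with d = .4. The sketch below is illustrative only and is no substitute for the paper's design-specific tables; the paired-design line assumes the difference scores have unit standard deviation.

# Sketch: sample sizes for a standardized effect of d = .4 (sd = 1),
# using base R's power.t.test(). Illustrative values only.
power.t.test(delta = 0.4, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample")   # roughly 100 participants per group
power.t.test(delta = 0.4, sd = 1, sig.level = 0.05, power = 0.90,
             type = "two.sample")   # roughly 130 per group

# The same standardized difference in a repeated-measures (paired) design,
# assuming the difference scores have sd = 1, needs far fewer people.
power.t.test(delta = 0.4, sd = 1, sig.level = 0.05, power = 0.80,
             type = "paired")       # roughly 50 participants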
Higher-Order Asymptotics
Higher-Order Asymptotics
Todd Kuffner, Washington University in St. Louis, WHOA-PSI 2016

First- and Higher-Order Asymptotics
Classical asymptotics in statistics: available sample size n → ∞.
First-order asymptotic theory: asymptotic statements that are correct to order O(n^{-1/2}).
Higher-order asymptotics: refinements to first-order results.

Error by order of approximation:
1st order: O(n^{-1/2}), or o(1)
2nd order: O(n^{-1}), or o(n^{-1/2})
3rd order: O(n^{-3/2}), or o(n^{-1})
kth order: O(n^{-k/2}), or o(n^{-(k-1)/2})

Why would anyone care? Deeper understanding; more accurate inference; comparison of different approaches (which agree to first order).

Points of Emphasis
Convergence: pointwise or uniform? Error: absolute or relative? Deviation region: moderate or large?

Common Goals
Refinements for better small-sample performance (examples: the Edgeworth expansion, with absolute error; Barndorff-Nielsen's R*).
Accurate approximation (examples: saddlepoint methods, with relative error; the Laplace approximation).
Comparative asymptotics (examples: probability matching priors; conditional vs. unconditional frequentist inference; comparing analytic and bootstrap procedures).
Deeper understanding (examples: sources of inaccuracy in first-order theory; nuisance parameter effects).

Is this relevant for high-dimensional statistical models?
The classical asymptotic regime is when the parameter dimension p is fixed and the available sample size n → ∞. What if p < n or p is close to n?
1. Find a meaningful non-asymptotic analysis of the statistical procedure which works for any n or p (concentration inequalities).
2. Allow both n → ∞ and p → ∞.

Some First-Order Theory
Univariate (classical) CLT: Assume X_1, X_2, … are i.i.d.
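As a concrete illustration of the first- versus second-order distinction (not taken from the slides), the one-term Edgeworth expansion Φ(x) − φ(x)·γ·(x² − 1)/(6√n), with γ the skewness, refines the plain normal approximation to the distribution of a standardized mean. The sketch below checks this for Exp(1) data with an arbitrary n = 10; the evaluation point and simulation size are also arbitrary choices.

# Sketch: first-order (normal) vs. second-order (one-term Edgeworth)
# approximation to P(sqrt(n) * (xbar - mu) / sigma <= x) for Exp(1) data.
set.seed(1)
n      <- 10      # deliberately small so the skewness correction matters
gamma3 <- 2       # skewness of the Exp(1) distribution
x      <- 0.5     # evaluation point (arbitrary)

# Monte Carlo approximation to the true CDF (mu = sigma = 1 for Exp(1))
z     <- replicate(1e5, sqrt(n) * (mean(rexp(n)) - 1))
truth <- mean(z <= x)

first_order  <- pnorm(x)                                                  # error O(n^{-1/2})
second_order <- pnorm(x) - dnorm(x) * gamma3 * (x^2 - 1) / (6 * sqrt(n))  # Edgeworth term

c(truth = truth, normal = first_order, edgeworth = second_order)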
The Method of Maximum Likelihood for Simple Linear Regression
Lecture 6: The Method of Maximum Likelihood for Simple Linear Regression
36-401, Fall 2015, Section B, 17 September 2015
See updates and corrections at http://www.stat.cmu.edu/~cshalizi/mreg/

1 Recapitulation
We introduced the method of maximum likelihood for simple linear regression in the notes for two lectures ago. Let's review. We start with the statistical model, which is the Gaussian-noise simple linear regression model, defined as follows:
1. The distribution of X is arbitrary (and perhaps X is even non-random).
2. If X = x, then Y = β0 + β1 x + ε, for some constants ("coefficients", "parameters") β0 and β1, and some random noise variable ε.
3. ε ~ N(0, σ²), and is independent of X.
4. ε is independent across observations.
A consequence of these assumptions is that the response variable Y is independent across observations, conditional on the predictor X, i.e., Y_1 and Y_2 are independent given X_1 and X_2 (Exercise 1).
As you'll recall, this is a special case of the simple linear regression model: the first two assumptions are the same, but we are now assuming much more about the noise variable ε: it's not just mean zero with constant variance, but it has a particular distribution (Gaussian), and everything we said was uncorrelated before we now strengthen to independence¹. Because of these stronger assumptions, the model tells us the conditional pdf of Y for each x, p(y | X = x; β0, β1, σ²). (This notation separates the random variables from the parameters.) Given any data set (x_1, y_1), (x_2, y_2), …, (x_n, y_n), we can now write down the probability density, under the model, of seeing that data:

∏_{i=1}^{n} p(y_i | x_i; β0, β1, σ²) = ∏_{i=1}^{n} (1/√(2πσ²)) exp( −(y_i − (β0 + β1 x_i))² / (2σ²) )

¹ See the notes for lecture 1 for a reminder, with an explicit example, of how uncorrelated random variables can nonetheless be strongly statistically dependent.
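A small sketch (not from the lecture notes) of what maximizing this likelihood looks like in R: minimize the negative Gaussian log-likelihood with optim() on simulated data and compare with lm(), which returns the same β0 and β1 under this model. The simulated parameter values and sample size below are arbitrary.

# Sketch: maximum likelihood for Gaussian-noise simple linear regression,
# checked against lm(). Data and "true" parameter values are made up.
set.seed(42)
n <- 100
x <- runif(n, 0, 10)
y <- 3 + 2 * x + rnorm(n, sd = 1.5)          # beta0 = 3, beta1 = 2, sigma = 1.5

# negative log-likelihood as a function of (beta0, beta1, log sigma)
negloglik <- function(par) {
  mu    <- par[1] + par[2] * x
  sigma <- exp(par[3])                        # keeps sigma positive
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

fit <- optim(c(0, 0, 0), negloglik, method = "BFGS")
c(beta0 = fit$par[1], beta1 = fit$par[2], sigma = exp(fit$par[3]))

coef(lm(y ~ x))   # least squares gives the same beta0, beta1 under this model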
Use of the Kurtosis Statistic in the Frequency Domain as an Aid in Detecting Random Signals
IEEE Journal of Oceanic Engineering, Vol. OE-9, No. 2, April 1984

Use of the Kurtosis Statistic in the Frequency Domain as an Aid in Detecting Random Signals

Abstract: Power spectral density estimation is often employed as a method for signal detection. For signals which occur randomly, a frequency domain kurtosis estimate supplements the power spectral density estimate and, in some cases, can be employed to detect their presence. This has been verified from experiments with real data of randomly occurring signals. In order to better understand the detection of randomly occurring signals, sinusoidal and narrow-band Gaussian signals are considered, which, when modeled to represent a fading or multipath environment, are received as non-Gaussian in terms of a frequency domain kurtosis estimate. Several fading and multipath propagation probability density distributions of practical interest are considered, including Rayleigh and log-normal. The model is generalized to handle transient and frequency modulated signals by taking into account the probability of the signal being in a specific frequency range over the total data interval.

… could be utilized in signal processing. The objective of this paper is to compare the PSD technique for signal processing with a new method which computes the frequency domain kurtosis (FDK) [2] for the real and imaginary parts of the complex frequency components. Kurtosis is defined as a ratio of a fourth-order central moment to the square of a second-order central moment. Using the Neyman-Pearson theory in the time domain, Ferguson [3] has shown that kurtosis is a locally optimum detection statistic under certain conditions. The reader is referred to Ferguson's work for the details; however, it can be simply said that it is concerned with detecting outliers …
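The excerpt does not give the paper's estimator in detail, so the following is only a toy illustration of the idea in R: compute the kurtosis of the real and imaginary parts of FFT bins across data segments, and compare a noise-only bin with a bin that intermittently carries a tone. The segment length, tone frequency, amplitude, and occurrence rate are arbitrary assumptions.

# Toy illustration (not the paper's estimator): kurtosis of the real and
# imaginary parts of FFT bins across segments. For Gaussian noise the
# kurtosis is near 3; a bin carrying an intermittently present tone departs
# from that even when its average power is modest.
set.seed(7)
n_seg <- 200; seg_len <- 128; f0 <- 20            # arbitrary choices
kurt  <- function(v) mean((v - mean(v))^4) / mean((v - mean(v))^2)^2

re <- im <- matrix(0, n_seg, seg_len)
t  <- 0:(seg_len - 1)
for (s in 1:n_seg) {
  x <- rnorm(seg_len)                              # Gaussian background noise
  if (runif(1) < 0.1)                              # tone present in ~10% of segments
    x <- x + 3 * sin(2 * pi * f0 * t / seg_len + runif(1, 0, 2 * pi))
  X <- fft(x)
  re[s, ] <- Re(X); im[s, ] <- Im(X)
}

fdk_re <- apply(re, 2, kurt)                       # kurtosis per frequency bin
fdk_im <- apply(im, 2, kurt)
out <- rbind(real = fdk_re[c(5, f0 + 1)], imaginary = fdk_im[c(5, f0 + 1)])
colnames(out) <- c("noise bin", "signal bin")
round(out, 1)                                      # signal bin well above 3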
Statistical Models in R: Some Examples
Statistical Models in R: Some Examples
Steven Buechler, Department of Mathematics, 276B Hurley Hall; 1-6233. Fall 2007

Outline: Statistical Models; Linear Models in R

Regression
Regression analysis is the appropriate statistical method when the response variable and all explanatory variables are continuous. Here, we only discuss linear regression, the simplest and most common form. Remember that a statistical model attempts to approximate the response variable Y as a mathematical function of the explanatory variables X_1, …, X_n. This mathematical function may involve parameters. Regression analysis attempts to use sample data to find the parameters that produce the best model.

Linear Models
The simplest such model is a linear model with a unique explanatory variable, which takes the following form:
ŷ = a + bx.
Here, y is the response variable vector, x the explanatory variable, ŷ is the vector of fitted values, and a (intercept) and b (slope) are real numbers. Plotting y versus x, this model represents a line through the points. For a given index i, ŷ_i = a + bx_i approximates y_i. Regression amounts to finding a and b that give the best fit.

Linear Model with 1 Explanatory Variable
[Figure: scatterplot of y against x (x from 0 to 5) with the fitted line; the observed y and the fitted value ŷ are marked at x = 2.]

Plotting Commands
For the record, the plot was generated with test data xR, yR with:
> plot(xR, yR, xlab = "x", ylab = "y")
> abline(v = 2, lty = 2)
> abline(a = -2, b = 2, col = "blue")
> points(c(2), yR[9], pch = 16, col = "red")
> points(c(2), c(2), pch = 16, col = "red")
> text(2.5, -4, "x=2", cex = 1.5)
> text(1.8, 3.9, "y", cex = 1.5)
> text(2.5, 1.9, "y-hat", cex = 1.5)

Linear Regression = Minimize RSS (Least Squares Fit)
In linear regression the best fit is found by minimizing
RSS = Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} (y_i − (a + bx_i))².
This is a Calculus I problem.
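A natural follow-on (not on the slides) is to compute the RSS-minimizing a and b directly and check them against lm(). Since the slides' test data xR and yR are not given, the values below are made-up stand-ins.

# Follow-on sketch: the closed-form least-squares solution, checked against lm().
# xR and yR here are made-up stand-ins for the slides' test data.
set.seed(1)
xR <- 0:10
yR <- -2 + 2 * xR + rnorm(length(xR), sd = 1.5)

b_hat <- sum((xR - mean(xR)) * (yR - mean(yR))) / sum((xR - mean(xR))^2)
a_hat <- mean(yR) - b_hat * mean(xR)

c(a = a_hat, b = b_hat)
coef(lm(yR ~ xR))        # same intercept and slope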
A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications
NIST Special Publication 800-22, Revision 1a
A Statistical Test Suite for Random and Pseudorandom Number Generators for Cryptographic Applications
Andrew Rukhin, Juan Soto, James Nechvatal, Miles Smid, Elaine Barker, Stefan Leigh, Mark Levenson, Mark Vangel, David Banks, Alan Heckert, James Dray, San Vo, and Lawrence E. Bassham III
Statistical Engineering Division and Computer Security Division, Information Technology Laboratory, National Institute of Standards and Technology, Gaithersburg, MD 20899-8930
Revised: April 2010
U.S. Department of Commerce, Gary Locke, Secretary
National Institute of Standards and Technology, Patrick Gallagher, Director

Reports on Computer Systems Technology
The Information Technology Laboratory (ITL) at the National Institute of Standards and Technology (NIST) promotes the U.S. economy and public welfare by providing technical leadership for the nation's measurement and standards infrastructure. ITL develops tests, test methods, reference data, proof of concept implementations, and technical analysis to advance the development and productive use of information technology. ITL's responsibilities include the development of technical, physical, administrative, and management standards and guidelines for the cost-effective security and privacy of sensitive unclassified information in Federal computer systems.
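For a flavor of what the suite's tests look like, here is a sketch modeled on its simplest test, the frequency (monobit) test. The function is a simplified stand-in, not the reference implementation, and the sequences and the 0.01 rejection threshold below are illustrative.

# Sketch modeled on the frequency (monobit) test: under the null of i.i.d.
# fair bits, s_obs = |sum(2*bits - 1)| / sqrt(n) is approximately half-normal
# and the p-value is erfc(s_obs / sqrt(2)).
monobit_test <- function(bits) {
  n     <- length(bits)
  s_obs <- abs(sum(2 * bits - 1)) / sqrt(n)
  p     <- 2 * pnorm(-s_obs)        # equals erfc(s_obs / sqrt(2)) in base R
  c(s_obs = s_obs, p_value = p)     # flag non-randomness if p < 0.01
}

set.seed(3)
monobit_test(rbinom(10000, 1, 0.5))    # fair bits: p roughly uniform on (0, 1)
monobit_test(rbinom(10000, 1, 0.52))   # a slight bias is flagged with high probability

The full suite goes far beyond this, applying a battery of fifteen tests (runs, longest run, spectral, template matching, entropy, and others) to many sequences and then examining the distribution of the resulting p-values.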
Continuous Dependent Variable Models
Chapter 4: Continuous Dependent Variable Models

Chapter 4, Section A: Analysis of Variance

Purpose of Analysis of Variance: Analysis of Variance is used to analyze the effects of one or more independent variables (factors) on the dependent variable. The dependent variable must be quantitative (continuous). The independent variable(s) may be either quantitative or qualitative. Unlike regression analysis, no assumptions are made about the form of the relation between the independent variable(s) and the dependent variable.

The theory behind ANOVA is that a change in the magnitude (factor level) of one or more of the independent variables, or a combination of independent variables (interactions), will influence the magnitude of the response, or dependent variable, and is indicative of differences in the parent populations from which the samples were drawn.

Analysis of Variance is the basic analytical procedure used in the broad field of experimental designs, and can be used to test the difference in population means under a wide variety of experimental settings, ranging from fairly simple to extremely complex experiments. Thus, it is important to understand that the selection of an appropriate experimental design is the first step in an Analysis of Variance. The following section discusses some of the fundamental differences in basic experimental designs, with the intent merely to introduce the reader to some of the basic considerations and concepts involved with experimental designs. The references section points to some more detailed texts and references on the subject, and should be consulted for detailed treatment of both basic and advanced experimental designs.

Examples: An analyst or engineer might be interested to assess the effect of: 1. …
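As a small companion to this description, the sketch below runs a one-way ANOVA in R on made-up data (three groups, a continuous response); the group labels, means, and sample sizes are arbitrary assumptions.

# Sketch: one-way ANOVA with a continuous response and one qualitative factor,
# on made-up data (three groups with slightly different means).
set.seed(10)
group <- factor(rep(c("A", "B", "C"), each = 20))
mu    <- c(A = 10, B = 11, C = 13)                    # true group means (made up)
y     <- rnorm(60, mean = mu[as.character(group)], sd = 2)

fit <- aov(y ~ group)
summary(fit)        # F test of the null that all group means are equal
TukeyHSD(fit)       # pairwise comparisons between factor levels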
Chapter 5 Statistical Models in Simulation
Chapter 5: Statistical Models in Simulation
Banks, Carson, Nelson & Nicol, Discrete-Event System Simulation

Purpose & Overview
The world the model-builder sees is probabilistic rather than deterministic: some statistical model might well describe the variations. An appropriate model can be developed by sampling the phenomenon of interest: select a known distribution through educated guesses, make estimates of the parameter(s), and test for goodness of fit. In this chapter we review several important probability distributions and present some typical applications of these models.

Review of Terminology and Concepts
In this section, we review the following concepts: discrete random variables, continuous random variables, the cumulative distribution function, and expectation.

Discrete Random Variables [Probability Review]
X is a discrete random variable if the number of possible values of X is finite, or countably infinite. Example: consider jobs arriving at a job shop. Let X be the number of jobs arriving each week at the job shop. R_X = possible values of X (range space of X) = {0, 1, 2, …}, and p(x_i) = probability the random variable equals x_i = P(X = x_i). The p(x_i), i = 1, 2, …, must satisfy:
1. p(x_i) ≥ 0, for all i
2. Σ_{i=1}^{∞} p(x_i) = 1
The collection of pairs [x_i, p(x_i)], i = 1, 2, …, is called the probability distribution of X, and p(x_i) is called the probability mass function (pmf) of X.

Continuous Random Variables [Probability Review]
X is a continuous random variable if its range space R_X is an interval or a collection of intervals. The probability that X lies in the interval [a, b] is given by:
P(a ≤ X ≤ b) = ∫_a^b f(x) dx
f(x), denoted as the pdf of X, satisfies:
1. …
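The job-shop example leaves the arrival distribution unspecified in the excerpt above; the sketch below assumes a Poisson model with an arbitrary rate of 4 jobs per week purely to illustrate the pmf bookkeeping, and an exponential density for the continuous case.

# Sketch: an assumed Poisson model (lambda = 4, arbitrary) for weekly arrivals,
# plus one continuous-case probability computed from an assumed Exp(1/4) pdf.
lambda <- 4
x      <- 0:15
pmf    <- dpois(x, lambda)
sum(pmf) + ppois(15, lambda, lower.tail = FALSE)   # total probability is 1

# P(a <= X <= b) for a continuous X, here X ~ Exp(rate = 1/4), a = 2, b = 6
integrate(function(t) dexp(t, rate = 1/4), lower = 2, upper = 6)$value
pexp(6, rate = 1/4) - pexp(2, rate = 1/4)          # same number via the CDF

# "educated guess" check: simulate data and compare empirical and model pmf
set.seed(5)
obs <- rpois(500, lambda)
emp <- as.numeric(table(factor(obs, levels = x))) / 500
round(rbind(empirical = emp, model = pmf), 3)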
Design of Engineering Experiments: The Blocking Principle
Design of Engineering Experiments: The Blocking Principle
• Montgomery text reference, Chapter 4
• Blocking and nuisance factors
• The randomized complete block design, or the RCBD
• Extension of the ANOVA to the RCBD
• Other blocking scenarios… Latin square designs

The Blocking Principle
• Blocking is a technique for dealing with nuisance factors.
• A nuisance factor is a factor that probably has some effect on the response, but it's of no interest to the experimenter; however, the variability it transmits to the response needs to be minimized.
• Typical nuisance factors include batches of raw material, operators, pieces of test equipment, time (shifts, days, etc.), and different experimental units.
• Many industrial experiments involve blocking (or should).
• Failure to block is a common flaw in designing an experiment (consequences?).
• If the nuisance variable is known and controllable, we use blocking.
• If the nuisance factor is known and uncontrollable, sometimes we can use the analysis of covariance (see Chapter 15) to remove the effect of the nuisance factor from the analysis.
• If the nuisance factor is unknown and uncontrollable (a "lurking" variable), we hope that randomization balances out its impact across the experiment.
• Sometimes several sources of variability are combined in a block, so the block becomes an aggregate variable.

The Hardness Testing Example
• Text reference, pg. 120
• We wish to determine whether 4 different tips produce different (mean) hardness readings on a Rockwell hardness tester.
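A minimal sketch of the blocked analysis for the hardness example, with simulated responses standing in for the textbook's data (pg. 120): four tips crossed with four coupons used as blocks, analyzed with and without the block term. The effect sizes below are invented for illustration.

# Sketch of the RCBD analysis: 4 tips tested on 4 coupons (blocks).
# The hardness values are simulated stand-ins, not the textbook's data.
set.seed(4)
tip    <- factor(rep(1:4, times = 4))
coupon <- factor(rep(1:4, each  = 4))              # blocking factor
hardness <- 9.5 + c(0, 0.2, 0.5, 0.9)[tip] +       # tip effects (made up)
            c(-0.5, 0, 0.3, 0.6)[coupon] +         # coupon-to-coupon variation
            rnorm(16, sd = 0.15)

summary(aov(hardness ~ tip + coupon))   # blocked analysis: block variation is
                                        # removed from the error term
summary(aov(hardness ~ tip))            # ignoring the block inflates the error

The side-by-side fits show the point of the design: the tip comparison is judged against a much smaller error mean square once the coupon-to-coupon variability is accounted for.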
Statistical Modeling Methods: Challenges and Strategies
Biostatistics & Epidemiology, ISSN 2470-9360 (Print), 2470-9379 (Online). Journal homepage: https://www.tandfonline.com/loi/tbep20

Statistical modeling methods: challenges and strategies
Steven S. Henley (a, b, c), Richard M. Golden (d) and T. Michael Kashner (a, b, e)

To cite this article: Steven S. Henley, Richard M. Golden & T. Michael Kashner (2019): Statistical modeling methods: challenges and strategies, Biostatistics & Epidemiology, DOI: 10.1080/24709360.2019.1618653. To link to this article: https://doi.org/10.1080/24709360.2019.1618653. Published online: 22 Jul 2019.

Affiliations: (a) Department of Medicine, Loma Linda University School of Medicine, Loma Linda, CA, USA; (b) Center for Advanced Statistics in Education, VA Loma Linda Healthcare System, Loma Linda, CA, USA; (c) Martingale Research Corporation, Plano, TX, USA; (d) School of Behavioral and Brain Sciences, University of Texas at Dallas, Richardson, TX, USA; (e) Department of Veterans Affairs, Office of Academic Affiliations (10A2D), Washington, DC, USA

Article history: Received 13 June 2018.

Abstract: Statistical modeling methods are widely used in clinical science, epidemiology, and health services research to …
Principles of Statistical Inference
Principles of Statistical Inference

In this important book, D. R. Cox develops the key concepts of the theory of statistical inference, in particular describing and comparing the main ideas and controversies over foundational issues that have rumbled on for more than 200 years. Continuing a 60-year career of contribution to statistical thought, Professor Cox is ideally placed to give the comprehensive, balanced account of the field that is now needed. The careful comparison of frequentist and Bayesian approaches to inference allows readers to form their own opinion of the advantages and disadvantages. Two appendices give a brief historical overview and the author's more personal assessment of the merits of different ideas. The content ranges from the traditional to the contemporary. While specific applications are not treated, the book is strongly motivated by applications across the sciences and associated technologies. The underlying mathematics is kept as elementary as feasible, though some previous knowledge of statistics is assumed. This book is for every serious user or student of statistics, in particular for anyone wanting to understand the uncertainty inherent in conclusions from statistical analyses.

Principles of Statistical Inference, D. R. Cox, Nuffield College, Oxford. Cambridge University Press: Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo. The Edinburgh Building, Cambridge CB2 8RU, UK. Published in the United States of America by Cambridge University Press, New York. www.cambridge.org; information on this title: www.cambridge.org/9780521866736. © D. R. Cox 2006. This publication is in copyright. Subject to statutory exception and to the provision of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.
Probability, Algorithmic Complexity, and Subjective Randomness
Probability, algorithmic complexity, and subjective randomness
Thomas L. Griffiths ([email protected]), Department of Psychology, Stanford University
Joshua B. Tenenbaum ([email protected]), Brain and Cognitive Sciences Department, Massachusetts Institute of Technology

Abstract
We present a statistical account of human randomness judgments that uses the idea of algorithmic complexity. We show that an existing measure of the randomness of a sequence corresponds to the assumption that non-random sequences are generated by a particular probabilistic finite state automaton, and use this as the basis for an account that evaluates randomness in terms of the length of programs for machines at different levels of the Chomsky hierarchy. This approach results in a model that predicts human judgments better than the responses of other participants in the same experiment.

The development of information theory prompted cognitive scientists to formally examine how humans encode experience, with a variety of schemes be- …

… accounts that can express the strong prior knowledge that contributes to our inferences. The structures that people find simple form a strict (and flexible) subset of those easily expressed in a computer program. For example, the sequence of heads and tails TTHTTTHHTH appears quite complex to us, even though, as the parity of the first 10 digits of π, it is easily generated by a computer. Identifying the kinds of regularities that contribute to our sense of simplicity will be an important part of any cognitive theory, and is in fact necessary since Kolmogorov complexity is not computable (Kolmogorov, 1965). There is a crucial middle ground between Kolmogorov complexity and the arbitrary encoding schemes to which Simon (1972) objected.