Linear Regression and Correlation
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
A Generalized Linear Model for Principal Component Analysis of Binary Data
A Generalized Linear Model for Principal Component Analysis of Binary Data Andrew I. Schein Lawrence K. Saul Lyle H. Ungar Department of Computer and Information Science University of Pennsylvania Moore School Building 200 South 33rd Street Philadelphia, PA 19104-6389 {ais,lsaul,ungar}@cis.upenn.edu Abstract they are not generally appropriate for other data types. Recently, Collins et al.[5] derived generalized criteria We investigate a generalized linear model for for dimensionality reduction by appealing to proper- dimensionality reduction of binary data. The ties of distributions in the exponential family. In their model is related to principal component anal- framework, the conventional PCA of real-valued data ysis (PCA) in the same way that logistic re- emerges naturally from assuming a Gaussian distribu- gression is related to linear regression. Thus tion over a set of observations, while generalized ver- we refer to the model as logistic PCA. In this sions of PCA for binary and nonnegative data emerge paper, we derive an alternating least squares respectively by substituting the Bernoulli and Pois- method to estimate the basis vectors and gen- son distributions for the Gaussian. For binary data, eralized linear coefficients of the logistic PCA the generalized model's relationship to PCA is anal- model. The resulting updates have a simple ogous to the relationship between logistic and linear closed form and are guaranteed at each iter- regression[12]. In particular, the model exploits the ation to improve the model's likelihood. We log-odds as the natural parameter of the Bernoulli dis- evaluate the performance of logistic PCA|as tribution and the logistic function as its canonical link. -
The Statistical Analysis of Distributions, Percentile Rank Classes and Top-Cited
How to analyse percentile impact data meaningfully in bibliometrics: The statistical analysis of distributions, percentile rank classes and top-cited papers Lutz Bornmann Division for Science and Innovation Studies, Administrative Headquarters of the Max Planck Society, Hofgartenstraße 8, 80539 Munich, Germany; [email protected]. 1 Abstract According to current research in bibliometrics, percentiles (or percentile rank classes) are the most suitable method for normalising the citation counts of individual publications in terms of the subject area, the document type and the publication year. Up to now, bibliometric research has concerned itself primarily with the calculation of percentiles. This study suggests how percentiles can be analysed meaningfully for an evaluation study. Publication sets from four universities are compared with each other to provide sample data. These suggestions take into account on the one hand the distribution of percentiles over the publications in the sets (here: universities) and on the other hand concentrate on the range of publications with the highest citation impact – that is, the range which is usually of most interest in the evaluation of scientific performance. Key words percentiles; research evaluation; institutional comparisons; percentile rank classes; top-cited papers 2 1 Introduction According to current research in bibliometrics, percentiles (or percentile rank classes) are the most suitable method for normalising the citation counts of individual publications in terms of the subject area, the document type and the publication year (Bornmann, de Moya Anegón, & Leydesdorff, 2012; Bornmann, Mutz, Marx, Schier, & Daniel, 2011; Leydesdorff, Bornmann, Mutz, & Opthof, 2011). Until today, it has been customary in evaluative bibliometrics to use the arithmetic mean value to normalize citation data (Waltman, van Eck, van Leeuwen, Visser, & van Raan, 2011). -
Wavelet Entropy Energy Measure (WEEM): a Multiscale Measure to Grade a Geophysical System's Predictability
EGU21-703, updated on 28 Sep 2021 https://doi.org/10.5194/egusphere-egu21-703 EGU General Assembly 2021 © Author(s) 2021. This work is distributed under the Creative Commons Attribution 4.0 License. Wavelet Entropy Energy Measure (WEEM): A multiscale measure to grade a geophysical system's predictability Ravi Kumar Guntu and Ankit Agarwal Indian Institute of Technology Roorkee, Hydrology, Roorkee, India ([email protected]) Model-free gradation of predictability of a geophysical system is essential to quantify how much inherent information is contained within the system and evaluate different forecasting methods' performance to get the best possible prediction. We conjecture that Multiscale Information enclosed in a given geophysical time series is the only input source for any forecast model. In the literature, established entropic measures dealing with grading the predictability of a time series at multiple time scales are limited. Therefore, we need an additional measure to quantify the information at multiple time scales, thereby grading the predictability level. This study introduces a novel measure, Wavelet Entropy Energy Measure (WEEM), based on Wavelet entropy to investigate a time series's energy distribution. From the WEEM analysis, predictability can be graded low to high. The difference between the entropy of a wavelet energy distribution of a time series and entropy of wavelet energy of white noise is the basis for gradation. The metric quantifies the proportion of the deterministic component of a time series in terms of energy concentration, and its range varies from zero to one. One corresponds to high predictable due to its high energy concentration and zero representing a process similar to the white noise process having scattered energy distribution. -
04 – Everything You Want to Know About Correlation but Were
EVERYTHING YOU WANT TO KNOW ABOUT CORRELATION BUT WERE AFRAID TO ASK F R E D K U O 1 MOTIVATION • Correlation as a source of confusion • Some of the confusion may arise from the literary use of the word to convey dependence as most people use “correlation” and “dependence” interchangeably • The word “correlation” is ubiquitous in cost/schedule risk analysis and yet there are a lot of misconception about it. • A better understanding of the meaning and derivation of correlation coefficient, and what it truly measures is beneficial for cost/schedule analysts. • Many times “true” correlation is not obtainable, as will be demonstrated in this presentation, what should the risk analyst do? • Is there any other measures of dependence other than correlation? • Concordance and Discordance • Co-monotonicity and Counter-monotonicity • Conditional Correlation etc. 2 CONTENTS • What is Correlation? • Correlation and dependence • Some examples • Defining and Estimating Correlation • How many data points for an accurate calculation? • The use and misuse of correlation • Some example • Correlation and Cost Estimate • How does correlation affect cost estimates? • Portfolio effect? • Correlation and Schedule Risk • How correlation affect schedule risks? • How Shall We Go From Here? • Some ideas for risk analysis 3 POPULARITY AND SHORTCOMINGS OF CORRELATION • Why Correlation Is Popular? • Correlation is a natural measure of dependence for a Multivariate Normal Distribution (MVN) and the so-called elliptical family of distributions • It is easy to calculate analytically; we only need to calculate covariance and variance to get correlation • Correlation and covariance are easy to manipulate under linear operations • Correlation Shortcomings • Variances of R.V. -
11. Correlation and Linear Regression
11. Correlation and linear regression The goal in this chapter is to introduce correlation and linear regression. These are the standard tools that statisticians rely on when analysing the relationship between continuous predictors and continuous outcomes. 11.1 Correlations In this section we’ll talk about how to describe the relationships between variables in the data. To do that, we want to talk mostly about the correlation between variables. But first, we need some data. 11.1.1 The data Table 11.1: Descriptive statistics for the parenthood data. variable min max mean median std. dev IQR Dan’s grumpiness 41 91 63.71 62 10.05 14 Dan’s hours slept 4.84 9.00 6.97 7.03 1.02 1.45 Dan’s son’s hours slept 3.25 12.07 8.05 7.95 2.07 3.21 ............................................................................................ Let’s turn to a topic close to every parent’s heart: sleep. The data set we’ll use is fictitious, but based on real events. Suppose I’m curious to find out how much my infant son’s sleeping habits affect my mood. Let’s say that I can rate my grumpiness very precisely, on a scale from 0 (not at all grumpy) to 100 (grumpy as a very, very grumpy old man or woman). And lets also assume that I’ve been measuring my grumpiness, my sleeping patterns and my son’s sleeping patterns for - 251 - quite some time now. Let’s say, for 100 days. And, being a nerd, I’ve saved the data as a file called parenthood.csv. -
Krishnan Correlation CUSB.Pdf
SELF INSTRUCTIONAL STUDY MATERIAL PROGRAMME: M.A. DEVELOPMENT STUDIES COURSE: STATISTICAL METHODS FOR DEVELOPMENT RESEARCH (MADVS2005C04) UNIT III-PART A CORRELATION PROF.(Dr). KRISHNAN CHALIL 25, APRIL 2020 1 U N I T – 3 :C O R R E L A T I O N After reading this material, the learners are expected to : . Understand the meaning of correlation and regression . Learn the different types of correlation . Understand various measures of correlation., and, . Explain the regression line and equation 3.1. Introduction By now we have a clear idea about the behavior of single variables using different measures of Central tendency and dispersion. Here the data concerned with one variable is called ‗univariate data‘ and this type of analysis is called ‗univariate analysis‘. But, in nature, some variables are related. For example, there exists some relationships between height of father and height of son, price of a commodity and amount demanded, the yield of a plant and manure added, cost of living and wages etc. This is a a case of ‗bivariate data‘ and such analysis is called as ‗bivariate data analysis. Correlation is one type of bivariate statistics. Correlation is the relationship between two variables in which the changes in the values of one variable are followed by changes in the values of the other variable. 3.2. Some Definitions 1. ―When the relationship is of a quantitative nature, the approximate statistical tool for discovering and measuring the relationship and expressing it in a brief formula is known as correlation‖—(Craxton and Cowden) 2. ‗correlation is an analysis of the co-variation between two or more variables‖—(A.M Tuttle) 3. -
Construct Validity and Reliability of the Work Environment Assessment Instrument WE-10
International Journal of Environmental Research and Public Health Article Construct Validity and Reliability of the Work Environment Assessment Instrument WE-10 Rudy de Barros Ahrens 1,*, Luciana da Silva Lirani 2 and Antonio Carlos de Francisco 3 1 Department of Business, Faculty Sagrada Família (FASF), Ponta Grossa, PR 84010-760, Brazil 2 Department of Health Sciences Center, State University Northern of Paraná (UENP), Jacarezinho, PR 86400-000, Brazil; [email protected] 3 Department of Industrial Engineering and Post-Graduation in Production Engineering, Federal University of Technology—Paraná (UTFPR), Ponta Grossa, PR 84017-220, Brazil; [email protected] * Correspondence: [email protected] Received: 1 September 2020; Accepted: 29 September 2020; Published: 9 October 2020 Abstract: The purpose of this study was to validate the construct and reliability of an instrument to assess the work environment as a single tool based on quality of life (QL), quality of work life (QWL), and organizational climate (OC). The methodology tested the construct validity through Exploratory Factor Analysis (EFA) and reliability through Cronbach’s alpha. The EFA returned a Kaiser–Meyer–Olkin (KMO) value of 0.917; which demonstrated that the data were adequate for the factor analysis; and a significant Bartlett’s test of sphericity (χ2 = 7465.349; Df = 1225; p 0.000). ≤ After the EFA; the varimax rotation method was employed for a factor through commonality analysis; reducing the 14 initial factors to 10. Only question 30 presented commonality lower than 0.5; and the other questions returned values higher than 0.5 in the commonality analysis. Regarding the reliability of the instrument; all of the questions presented reliability as the values varied between 0.953 and 0.956. -
Biostatistics for Oral Healthcare
Biostatistics for Oral Healthcare Jay S. Kim, Ph.D. Loma Linda University School of Dentistry Loma Linda, California 92350 Ronald J. Dailey, Ph.D. Loma Linda University School of Dentistry Loma Linda, California 92350 Biostatistics for Oral Healthcare Biostatistics for Oral Healthcare Jay S. Kim, Ph.D. Loma Linda University School of Dentistry Loma Linda, California 92350 Ronald J. Dailey, Ph.D. Loma Linda University School of Dentistry Loma Linda, California 92350 JayS.Kim, PhD, is Professor of Biostatistics at Loma Linda University, CA. A specialist in this area, he has been teaching biostatistics since 1997 to students in public health, medical school, and dental school. Currently his primary responsibility is teaching biostatistics courses to hygiene students, predoctoral dental students, and dental residents. He also collaborates with the faculty members on a variety of research projects. Ronald J. Dailey is the Associate Dean for Academic Affairs at Loma Linda and an active member of American Dental Educational Association. C 2008 by Blackwell Munksgaard, a Blackwell Publishing Company Editorial Offices: Blackwell Publishing Professional, 2121 State Avenue, Ames, Iowa 50014-8300, USA Tel: +1 515 292 0140 9600 Garsington Road, Oxford OX4 2DQ Tel: 01865 776868 Blackwell Publishing Asia Pty Ltd, 550 Swanston Street, Carlton South, Victoria 3053, Australia Tel: +61 (0)3 9347 0300 Blackwell Wissenschafts Verlag, Kurf¨urstendamm57, 10707 Berlin, Germany Tel: +49 (0)30 32 79 060 The right of the Author to be identified as the Author of this Work has been asserted in accordance with the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior permission of the publisher. -
CONFIDENCE Vs PREDICTION INTERVALS 12/2/04 Inference for Coefficients Mean Response at X Vs
STAT 141 REGRESSION: CONFIDENCE vs PREDICTION INTERVALS 12/2/04 Inference for coefficients Mean response at x vs. New observation at x Linear Model (or Simple Linear Regression) for the population. (“Simple” means single explanatory variable, in fact we can easily add more variables ) – explanatory variable (independent var / predictor) – response (dependent var) Probability model for linear regression: 2 i ∼ N(0, σ ) independent deviations Yi = α + βxi + i, α + βxi mean response at x = xi 2 Goals: unbiased estimates of the three parameters (α, β, σ ) tests for null hypotheses: α = α0 or β = β0 C.I.’s for α, β or to predictE(Y |X = x0). (A model is our ‘stereotype’ – a simplification for summarizing the variation in data) For example if we simulate data from a temperature model of the form: 1 Y = 65 + x + , x = 1, 2,..., 30 i 3 i i i Model is exactly true, by construction An equivalent statement of the LM model: Assume xi fixed, Yi independent, and 2 Yi|xi ∼ N(µy|xi , σ ), µy|xi = α + βxi, population regression line Remark: Suppose that (Xi,Yi) are a random sample from a bivariate normal distribution with means 2 2 (µX , µY ), variances σX , σY and correlation ρ. Suppose that we condition on the observed values X = xi. Then the data (xi, yi) satisfy the LM model. Indeed, we saw last time that Y |x ∼ N(µ , σ2 ), with i y|xi y|xi 2 2 2 µy|xi = α + βxi, σY |X = (1 − ρ )σY Example: Galton’s fathers and sons: µy|x = 35 + 0.5x ; σ = 2.34 (in inches). -
A Logistic Regression Equation for Estimating the Probability of a Stream in Vermont Having Intermittent Flow
Prepared in cooperation with the Vermont Center for Geographic Information A Logistic Regression Equation for Estimating the Probability of a Stream in Vermont Having Intermittent Flow Scientific Investigations Report 2006–5217 U.S. Department of the Interior U.S. Geological Survey A Logistic Regression Equation for Estimating the Probability of a Stream in Vermont Having Intermittent Flow By Scott A. Olson and Michael C. Brouillette Prepared in cooperation with the Vermont Center for Geographic Information Scientific Investigations Report 2006–5217 U.S. Department of the Interior U.S. Geological Survey U.S. Department of the Interior DIRK KEMPTHORNE, Secretary U.S. Geological Survey P. Patrick Leahy, Acting Director U.S. Geological Survey, Reston, Virginia: 2006 For product and ordering information: World Wide Web: http://www.usgs.gov/pubprod Telephone: 1-888-ASK-USGS For more information on the USGS--the Federal source for science about the Earth, its natural and living resources, natural hazards, and the environment: World Wide Web: http://www.usgs.gov Telephone: 1-888-ASK-USGS Any use of trade, product, or firm names is for descriptive purposes only and does not imply endorsement by the U.S. Government. Although this report is in the public domain, permission must be secured from the individual copyright owners to reproduce any copyrighted materials contained within this report. Suggested citation: Olson, S.A., and Brouillette, M.C., 2006, A logistic regression equation for estimating the probability of a stream in Vermont having intermittent -
Linear, Ridge Regression, and Principal Component Analysis
Linear, Ridge Regression, and Principal Component Analysis Linear, Ridge Regression, and Principal Component Analysis Jia Li Department of Statistics The Pennsylvania State University Email: [email protected] http://www.stat.psu.edu/∼jiali Jia Li http://www.stat.psu.edu/∼jiali Linear, Ridge Regression, and Principal Component Analysis Introduction to Regression I Input vector: X = (X1, X2, ..., Xp). I Output Y is real-valued. I Predict Y from X by f (X ) so that the expected loss function E(L(Y , f (X ))) is minimized. I Square loss: L(Y , f (X )) = (Y − f (X ))2 . I The optimal predictor ∗ 2 f (X ) = argminf (X )E(Y − f (X )) = E(Y | X ) . I The function E(Y | X ) is the regression function. Jia Li http://www.stat.psu.edu/∼jiali Linear, Ridge Regression, and Principal Component Analysis Example The number of active physicians in a Standard Metropolitan Statistical Area (SMSA), denoted by Y , is expected to be related to total population (X1, measured in thousands), land area (X2, measured in square miles), and total personal income (X3, measured in millions of dollars). Data are collected for 141 SMSAs, as shown in the following table. i : 1 2 3 ... 139 140 141 X1 9387 7031 7017 ... 233 232 231 X2 1348 4069 3719 ... 1011 813 654 X3 72100 52737 54542 ... 1337 1589 1148 Y 25627 15389 13326 ... 264 371 140 Goal: Predict Y from X1, X2, and X3. Jia Li http://www.stat.psu.edu/∼jiali Linear, Ridge Regression, and Principal Component Analysis Linear Methods I The linear regression model p X f (X ) = β0 + Xj βj . -
Choosing a Coverage Probability for Prediction Intervals
Choosing a Coverage Probability for Prediction Intervals Joshua LANDON and Nozer D. SINGPURWALLA We start by noting that inherent to the above techniques is an underlying distribution (or error) theory, whose net effect Coverage probabilities for prediction intervals are germane to is to produce predictions with an uncertainty bound; the nor- filtering, forecasting, previsions, regression, and time series mal (Gaussian) distribution is typical. An exception is Gard- analysis. It is a common practice to choose the coverage proba- ner (1988), who used a Chebychev inequality in lieu of a spe- bilities for such intervals by convention or by astute judgment. cific distribution. The result was a prediction interval whose We argue here that coverage probabilities can be chosen by de- width depends on a coverage probability; see, for example, Box cision theoretic considerations. But to do so, we need to spec- and Jenkins (1976, p. 254), or Chatfield (1993). It has been a ify meaningful utility functions. Some stylized choices of such common practice to specify coverage probabilities by conven- functions are given, and a prototype approach is presented. tion, the 90%, the 95%, and the 99% being typical choices. In- deed Granger (1996) stated that academic writers concentrate KEY WORDS: Confidence intervals; Decision making; Filter- almost exclusively on 95% intervals, whereas practical fore- ing; Forecasting; Previsions; Time series; Utilities. casters seem to prefer 50% intervals. The larger the coverage probability, the wider the prediction interval, and vice versa. But wide prediction intervals tend to be of little value [see Granger (1996), who claimed 95% prediction intervals to be “embarass- 1.