Logistic Regression (A Type of Generalized Linear Model)


Today

- Review of GLMs
- Logistic Regression

How do we find patterns in data?

- We begin with a model of how the world works
- We use our knowledge of a system to create a model of a Data Generating Process
- We know that there is variation in any relationship due to an Error Generating Process
- We build hypothesis tests on top of this error generating process, assuming our model of the data generating process is accurate

We started Linear. Why?

- Often, our first stab at a hypothesis is that two variables are associated
- Linearity is a naive, but reasonable, first assumption
- Y = a + BX is straightforward to fit

[Figure: scatterplot of y against x with a fitted straight line]

We started Normal. Why?

- It is reasonable to assume that small errors are common
- It is reasonable to assume that large errors are rare
- It is reasonable to assume that error is additive for many phenomena
- Many processes we measure are continuous
- Y = a + BX + e implies additive error
- Y ~ N(mean = a + BX, sd = σ)

[Figure: histogram of rnorm(100), showing deviations from the mean]

Example: Pufferfish Mimics & Predator Approaches

- What assumptions would you make about similarity and predator response?
- How might predators vary in response?
- What kinds of error might we have in measuring predator responses?

Example: A Linear Data Generating Process and Gaussian Error Generating Process

[Figure: number of predators against resemblance, with a fitted straight line]

What if We Have More Information about the Data Generating Process?

- We often have real biological models of a phenomenon! For example?
- Even if we do not, we often know something about the theory, and we know the shape of the data. For example?

Example: Michaelis-Menten Enzyme Kinetics

- We know how enzymes work
- We have no reason to suspect non-normal error
- We build a model that fits the biology
- Even if we had no biological model, saturating data is striking
- We may have fit some other curve - examples?
- We will discuss model selection later

[Figure: reaction rate against substrate concentration, with a fitted saturating Michaelis-Menten curve]

Many Data Types Cannot Have a Normal Error Generating Process

- Count data: discrete, cannot be < 0, variance increases with the mean
  - Poisson
- Overdispersed count data: discrete, cannot be < 0, variance increases faster than the mean
  - Negative Binomial or Quasipoisson
- Multiplicative error: many errors, typically small, but the biological process is multiplicative
  - Log-Normal
- Data describing the distribution of properties of multiple events: cannot be < 0, variance increases faster than the mean
  - Gamma

Example: Wolf Inbreeding and Litter Size

- The Number of Pups is a Count!
- The Number of Pups is Additive!
- No a priori reason to think the relationship nonlinear

[Figure: pups against inbreeding.coefficient, with a fitted line]
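The count data in the wolf example call for a count-appropriate GLM rather than a normal linear model. A minimal R sketch, assuming a data frame wolves holding the pups and inbreeding.coefficient variables from the plot (the data frame name and the canonical log link are assumptions; the slides do not show this fit):

# A Poisson GLM for the wolf litter-size example.
# `wolves` is a hypothetical data frame; the columns `pups` and
# `inbreeding.coefficient` are named after the plot axes above.
wolf.glm <- glm(pups ~ inbreeding.coefficient,
                data = wolves,
                family = poisson(link = "log"))  # log is the canonical link

summary(wolf.glm)

# With a log link, exp(coefficient) is the multiplicative change in
# expected litter size per unit change in the inbreeding coefficient.
exp(coef(wolf.glm))

Using poisson(link = "identity") instead would honor the slide's point that there is no a priori reason to think the relationship nonlinear, at the cost of allowing negative predicted means.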
So what is with this Generalized Linear Modeling Thing?

- Many models have data generating processes that can be linearized
  - E.g., Y = e^(a + BX) → log(Y) = a + BX
- Many error generating processes are in the exponential family
- This is *easy* to fit using Likelihood and IWLS (iteratively weighted least squares) - the glm framework
- We can use other Likelihood functions, or Bayesian methods
- Or Least Squares fits for normal linear models

Can I Stop Now? Are GLMs All I Need?

- NO!
- Many models have data generating processes that cannot be linearized
  - E.g., Y = e^(a + sin(BX))
- Many possible error generating processes
  - My favorite - the Gumbel distribution, for maximum values
- And we haven't even started with mixed models, autocorrelation, etc...
- For these, we use other Likelihood or Bayesian methods
- Some problems have shortcuts, others do not

Logistic Regression!!!

The Logistic Curve (for Probabilities)

[Figure: the logistic curve, probability against X, rising from 0 toward 1]

Binomial Error Generating Process

Possible values are bounded by the probability.

[Figure: four histograms of binomial outcomes, for probabilities 0.01, 0.3, 0.7, and 0.99]

The Logistic Function

p = e^(a + BX) / (1 + e^(a + BX))

logit(p) = a + BX

Generalized Linear Model with a Logit Link

logit(p) = a + BX

Y ~ Binom(Trials, p)

Cryptosporidium

Drug Trial with Mice

Fraction of Mice Infected = Probability of Infection

[Figure: fraction of mice infected against dose (0-400)]

Two Different Ways of Writing the Model

# 1) using successes, failures (heads, tails)
glm(cbind(Y, N - Y) ~ Dose, data = crypto, family = binomial)

# 2) using weights as the size parameter for the Binomial
glm(Y/N ~ Dose, weights = N, data = crypto, family = binomial)

The Fit Model

[Figure: fraction of mice infected against dose, with the fitted logistic curve]

# Call:
# glm(formula = cbind(Y, N - Y) ~ Dose, family = binomial, data = crypto)
#
# Deviance Residuals:
#     Min       1Q   Median       3Q      Max
# -3.9532  -1.2442   0.2327   1.5531   3.6013
#
# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.407769   0.148479  -9.481   <2e-16
# Dose         0.013468   0.001046  12.871   <2e-16
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 434.34  on 67  degrees of freedom
# Residual deviance: 200.51  on 66  degrees of freedom
# AIC: 327.03
#
# Number of Fisher Scoring iterations: 4

The Odds

Odds = p / (1 - p)

Log-Odds = log(p / (1 - p)) = logit(p)

The Meaning of a Logit Coefficient

Logit coefficient: a 1 unit increase in a predictor = an increase of β in the log-odds of the response.

β = logit(p2) - logit(p1)

β = log(p2 / (1 - p2)) - log(p1 / (1 - p1))

We need to know both p1 and β to interpret this.

If p1 = 0.5 and β = 0.01347, then p2 = 0.503.
If p1 = 0.7 and β = 0.01347, then p2 = 0.702.
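Those back-transformations are easy to reproduce in base R; a minimal sketch using qlogis() (the logit) and plogis() (its inverse), with the Dose coefficient taken from the model output above:

# Interpreting the Dose coefficient from the crypto model.
b <- 0.013468  # estimated change in log-odds per unit Dose

# Add beta on the logit scale, then back-transform to a probability
plogis(qlogis(0.5) + b)  # 0.503
plogis(qlogis(0.7) + b)  # 0.702

# Equivalently, one unit of Dose multiplies the odds by exp(beta)
exp(b)  # about 1.014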
What if we Only Have 1's and 0's?

[Figure: Predation (0 or 1) against log.seed.weight]

Seed Predators

http://denimandtweed.com

The GLM

seed.glm <- glm(Predation ~ log.seed.weight,
                data = seeds, family = binomial)

Fitted Seed Predation Plot

[Figure: Predation against log.seed.weight, with the fitted logistic curve]

Diagnostics Look Odd Due to Binned Nature of the Data

[Figure: the four standard glm diagnostic panels - Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance - each showing the banded patterns typical of 0/1 data]

Creating Binned Residuals

[Figure: residuals(seed.glm, type = "deviance") against fitted(seed.glm)]

Binned Residuals Should Look Spread Out

[Figure: binned residuals, 200 bins, against fitted values; the points should scatter evenly around zero]
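The binned residual plot can be built by hand; a minimal sketch, assuming the seed.glm fit above (the 200-bin choice mirrors the final slide, and the arm package's binnedplot() offers a ready-made alternative):

# Binned residuals for seed.glm, computed by hand.
fits <- fitted(seed.glm)
resids <- residuals(seed.glm, type = "deviance")

n.bins <- 200  # arbitrary; matches the final slide
bins <- cut(fits, breaks = n.bins)

# Mean fitted value and mean residual within each bin
binned <- data.frame(
  fitted   = tapply(fits, bins, mean),
  residual = tapply(resids, bins, mean)
)
binned <- binned[complete.cases(binned), ]  # drop empty bins

# Binned residuals should scatter evenly around zero
plot(binned$fitted, binned$residual, xlab = "Fitted", ylab = "Residual")
abline(h = 0, lty = 2)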