
Logistic Regression (a type of Generalized Linear Model)

Today
- Review of GLMs
- Logistic Regression

How do we find patterns in data?
- We begin with a model of how the world works
- We use our knowledge of a system to create a model of a Data Generating Process
- We know that there is variation in any relationship due to an Error Generating Process
- We build hypothesis tests on top of this error generating process, based on assuming our model of the data generating process is accurate

We started Linear. Why?
- Often, our first stab at a hypothesis is that two variables are associated
- Linearity is a naive, but reasonable, first assumption
- Y = a + BX is straightforward to fit

[Figure: scatterplot of y against x]

We started Normal. Why?
- It is reasonable to assume that small errors are common
- It is reasonable to assume that large errors are rare
- It is reasonable to assume that error is additive for many phenomena
- Many processes we measure are continuous
- Y = a + BX + e implies additive error
- Y ~ N(mean = a + BX, sd = σ)

[Figure: histogram of rnorm(100), deviations from the mean]

Example: Pufferfish Mimics & Predator Approaches
- What assumptions would you make about similarity and predator response?
- How might predators vary in response?
- What kinds of error might we have in measuring predator responses?

Example: A Linear Data Generating Process and Gaussian Error Generating Process

[Figure: scatterplot of predators against resemblance]

What if We Have More Information about the Data Generating Process?
- We often have real biological models of a phenomenon! For example?
- Even if we do not, we often know something about theory; we know the shape of the data. For example?
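The linear data generating process with Gaussian error described above can be sketched as a short simulation in base R. This is a minimal illustration, not the lecture's actual pufferfish data: the parameter values, sample size, and the reuse of the `resemblance`/`predators` names are all assumptions.

```r
# Simulate a linear Data Generating Process with an additive
# Gaussian Error Generating Process:
#   predators ~ N(mean = a + B * resemblance, sd = sigma)
set.seed(42)
a <- 2; B <- 3; sigma <- 2                    # illustrative values, not from the slides
resemblance <- runif(100, min = 1, max = 4)   # predictor
predators <- a + B * resemblance + rnorm(100, mean = 0, sd = sigma)

# Least Squares recovers the parameters of the data generating process
fit <- lm(predators ~ resemblance)
coef(fit)  # estimates should land near a = 2 and B = 3
```

Because the error is additive and Gaussian, the Least Squares fit here coincides with the Maximum Likelihood fit, which is the special case the GLM framework generalizes.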
Example: Michaelis-Menten Enzyme Kinetics
- We know how enzymes work
- We have no reason to suspect non-normal error
- We build a model that fits the biology

[Figure: saturating curve of Rate against Concentration]

Example: Michaelis-Menten Enzyme Kinetics
- Even if we had no biological model, saturating data is striking
- We may have fit some other curve. Examples?
- We will discuss model selection later

Many Data Types Cannot Have a Normal Error Generating Process
- Count data: discrete, cannot be < 0, variance increases with mean
  - Poisson
- Overdispersed count data: discrete, cannot be < 0, variance increases faster than mean
  - Negative Binomial or Quasipoisson
- Multiplicative error: many errors, typically small, but the biological process is multiplicative
  - Log-Normal
- Data describing the distribution of properties of multiple events: cannot be < 0, variance increases faster than mean
  - Gamma

Example: Wolf Inbreeding and Litter Size
- The number of pups is a count!
- The number of pups is additive!
- No a priori reason to think the relationship nonlinear

[Figure: scatterplot of pups against inbreeding.coefficient]

So what is with this Generalized Linear Modeling Thing?
- Many models have data generating processes that can be linearized
  - E.g., Y = e^(a + BX) → log(Y) = a + BX
- Many error generating processes are in the exponential family
- This is *easy* to fit using Likelihood and IWLS: the glm framework
- We can use other Likelihood functions, or Bayesian methods
- Or Least Squares fits for normal linear models

Can I Stop Now? Are GLMs All I Need?
- NO!
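A count-data GLM in the spirit of the wolf example can be sketched with base R's `glm()`. The data below are simulated and the coefficients invented; the canonical log link is used here for the sketch, though the slide's point that the relationship may be linear could equally motivate an identity link.

```r
# Sketch: litter size as a Poisson count declining with inbreeding
# (simulated data; variable names and parameter values are assumptions)
set.seed(1)
inbreeding <- runif(50, min = 0, max = 0.4)
pups <- rpois(50, lambda = exp(log(6) - 3 * inbreeding))  # mean falls with inbreeding

# Poisson error generating process, fit by IWLS via the glm framework
wolf_glm <- glm(pups ~ inbreeding, family = poisson(link = "log"))
summary(wolf_glm)$coefficients
```

The `family` argument is what makes this a GLM rather than a linear model: it supplies both the error generating process (Poisson) and the link that linearizes the data generating process.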
- Many models have data generating processes that cannot be linearized
  - E.g., Y = e^(a + sin(BX))
- Many possible error generating processes
  - My favorite: the Gumbel distribution, for maximum values
- And we haven't even started with mixed models, autocorrelation, etc...
- For these, we use other Likelihood or Bayesian methods
- Some problems have shortcuts, others do not

Logistic Regression!!!

The Logistic Curve (for Probabilities)

[Figure: logistic curve of Probability against X]

Binomial Error Generating Process
- Possible values bounded by probability

[Figure: histograms of binomial counts for Probability = 0.01, 0.3, 0.7, and 0.99]

The Logistic Function

p = e^(a + BX) / (1 + e^(a + BX))

logit(p) = a + BX

Generalized Linear Model with a Logit Link

logit(p) = a + BX
Y ~ Binom(Trials, p)

Cryptosporidium

Drug Trial with Mice

Fraction of Mice Infected = Probability of Infection

[Figure: fraction of mice infected against Dose]

Two Different Ways of Writing the Model

# 1) using Heads, Tails
glm(cbind(Y, N - Y) ~ Dose, data = crypto, family = binomial)

# 2) using weights as size parameter for Binomial
glm(Y/N ~ Dose, weights = N, data = crypto, family = binomial)

The Fit Model

[Figure: fitted logistic curve over fraction of mice infected against Dose]

The Fit Model

# Call:
# glm(formula = cbind(Y, N - Y) ~ Dose, family = binomial, data = crypto)
#
# Deviance Residuals:
#     Min       1Q   Median       3Q      Max
# -3.9532  -1.2442   0.2327   1.5531   3.6013
#
# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.407769   0.148479  -9.481   <2e-16
# Dose         0.013468   0.001046  12.871   <2e-16
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 434.34  on 67  degrees of freedom
# Residual deviance: 200.51  on 66  degrees of freedom
# AIC: 327.03
#
# Number of Fisher Scoring iterations: 4

The Odds

Odds = p / (1 - p)

Log-Odds = log(p / (1 - p)) = logit(p)

The Meaning of a Logit Coefficient
- Logit Coefficient: a 1 unit increase in a predictor = an increase of β in the log-odds of the response.

β = logit(p2) - logit(p1)

β = log(p2 / (1 - p2)) - log(p1 / (1 - p1))

- We need to know both p1 and β to interpret this.
- If p1 = 0.5 and β = 0.01347, then p2 = 0.503
- If p1 = 0.7 and β = 0.01347, then p2 = 0.702

What if we Only Have 1's and 0's?

[Figure: 0/1 Predation against log.seed.weight]

Seed Predators

http://denimandtweed.com

The GLM

seed.glm <- glm(Predation ~ log.seed.weight, data = seeds, family = binomial)

Fitted Seed Predation Plot

[Figure: fitted logistic curve of Predation against log.seed.weight]

Diagnostics Look Odd Due to Binned Nature of the Data

[Figure: standard glm diagnostic panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage]

Creating Binned Residuals

[Figure: deviance residuals of seed.glm against fitted values]

Binned Residuals Should Look Spread Out

[Figure: binned residuals (200 bins) against fitted values]
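The logit arithmetic above can be checked with base R's built-in logit and inverse-logit functions, `qlogis()` and `plogis()`. This is a quick sketch; `beta` is set to the approximate Dose coefficient reported in the crypto fit, and `p2_from` is a helper name introduced here, not from the lecture.

```r
# logit(p) = log(p / (1 - p)) is qlogis(); its inverse e^x / (1 + e^x) is plogis()
beta <- 0.01347  # roughly the fitted Dose coefficient from the slides

# A 1-unit increase in the predictor adds beta to the log-odds,
# so p2 = inverse-logit(logit(p1) + beta)
p2_from <- function(p1, beta) plogis(qlogis(p1) + beta)

round(p2_from(0.5, beta), 3)  # 0.503, as on the slide
p2_from(0.7, beta)            # ~0.7028: the shift in p shrinks away from p = 0.5
```

This is why a logit coefficient cannot be read as a fixed change in probability: the same β moves p by different amounts depending on where p1 sits on the logistic curve.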