Logistic Regression (a type of Generalized Linear Model)
Today
- Review of GLMs
- Logistic Regression
How do we find patterns in data?
- We begin with a model of how the world works
- We use our knowledge of a system to create a model of a Data Generating Process
- We know that there is variation in any relationship due to an Error Generating Process
- We build hypothesis tests on top of this error generating process, based on assuming our model of the data generating process is accurate
We started Linear. Why?
- Often, our first stab at a hypothesis is that two variables are associated
- Linearity is a naive, but reasonable, first assumption
- Y = a + BX is straightforward to fit

[Figure: scatterplot of y against x with a fitted line]

We started Normal. Why?
- It is reasonable to assume that small errors are common
- It is reasonable to assume that large errors are rare
- It is reasonable to assume that error is additive for many phenomena
- Many processes we measure are continuous
- Y = a + BX + e implies additive error
- Y ∼ N(mean = a + BX, sd = σ)
[Figure: histogram of rnorm(100), deviation from mean]
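The model above, Y ∼ N(mean = a + BX, sd = σ), is easy to simulate directly. A minimal sketch in Python (the course uses R; the parameter values here are made up for illustration):

```python
import random

random.seed(42)

# Hypothetical parameters: intercept, slope, and error SD
a, B, sigma = 1.0, 2.0, 1.5

# Data generating process: a line. Error generating process: additive Gaussian noise.
xs = [x / 10 for x in range(-20, 41)]
ys = [a + B * x + random.gauss(0, sigma) for x in xs]

# Residuals around the true line recover the error generating process:
# roughly symmetric, centered on zero, small errors common, large errors rare.
residuals = [y - (a + B * x) for x, y in zip(xs, ys)]
mean_resid = sum(residuals) / len(residuals)
```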
Example: Pufferfish Mimics & Predator Approaches
- What assumptions would you make about similarity and predator response?
- How might predators vary in response?
- What kinds of error might we have in measuring predator responses?
Example: A Linear Data Generating Process and Gaussian Error Generating Process
[Figure: number of predators vs. resemblance, with a fitted line]
What if We Have More Information about the Data Generating Process?
- We often have real biological models of a phenomenon!
- For example?
- Even if we do not, we often know something from theory - we know the shape of the data
- For example?
Example: Michaelis-Menten Enzyme Kinetics
- We know how Enzymes work
- We have no reason to suspect non-normal error
- We build a model that fits biology
Example: Michaelis-Menten Enzyme Kinetics

- We know how Enzymes work
- We have no reason to suspect non-normal error
- We build a model that fits biology

[Figure: reaction Rate vs. Concentration, saturating curve]
Example: Michaelis-Menten Enzyme Kinetics

- Even if we had no biological model, saturating data is striking
- We may have fit some other curve - examples?
- We will discuss model selection later

[Figure: reaction Rate vs. Concentration, saturating curve]
Many Data Types Cannot Have a Normal Error Generating Process
- Count data: discrete, cannot be <0, variance increases with mean
  - Poisson
- Overdispersed count data: discrete, cannot be <0, variance increases faster than mean
  - Negative Binomial or Quasipoisson
- Multiplicative error: many errors, typically small, but the biological process is multiplicative
  - Log-Normal
- Data describes the distribution of properties of multiple events: cannot be <0, variance increases faster than mean
  - Gamma
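The mean-variance claims above can be checked by simulation. A Python sketch, using Knuth's Poisson sampler since the standard library has no built-in one, showing that for a Poisson the variance tracks the mean:

```python
import math
import random

random.seed(1)

def rpois(lam):
    # Knuth's algorithm: multiply uniforms until the product drops below e^-lam
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

results = {}
for lam in (2.0, 8.0):
    draws = [rpois(lam) for _ in range(20000)]
    m = sum(draws) / len(draws)
    v = sum((d - m) ** 2 for d in draws) / len(draws)
    results[lam] = (m, v)  # both should be close to lam
```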
Example: Wolf Inbreeding and Litter Size
- The Number of Pups is a Count!
- The Number of Pups is Additive!
- No a priori reason to think the relationship nonlinear
Example: Wolf Inbreeding and Litter Size

- The Number of Pups is a Count!
- The Number of Pups is Additive!
- No a priori reason to think the relationship nonlinear

[Figure: pups vs. inbreeding.coefficient, with a fitted line]
So what is with this Generalized Linear Modeling Thing?
- Many models have data generating processes that can be linearized
- E.g., Y = e^(a+BX) → log(Y) = a + BX
- Many error generating processes are in the exponential family
- This is *easy* to fit using Likelihood and IWLS - the glm framework
- We can use other Likelihood functions, or Bayesian methods
- Or Least Squares fits for normal linear models
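As a sketch of the linearization idea, here is an exponential data generating process with multiplicative (log-normal) error, fit by ordinary least squares after taking logs. Python, with hypothetical parameter values:

```python
import math
import random

random.seed(0)

a, B = 0.5, 1.2  # hypothetical true parameters

# Y = e^(a + BX) * multiplicative error  =>  log(Y) = a + BX + additive error
xs = [x / 10 for x in range(1, 31)]
ys = [math.exp(a + B * x) * math.exp(random.gauss(0, 0.1)) for x in xs]

# Ordinary least squares on the log scale recovers a and B
lys = [math.log(y) for y in ys]
n = len(xs)
mx, my = sum(xs) / n, sum(lys) / n
B_hat = sum((x - mx) * (ly - my) for x, ly in zip(xs, lys)) / sum((x - mx) ** 2 for x in xs)
a_hat = my - B_hat * mx
```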
Can I Stop Now? Are GLMs All I Need?
- NO!
- Many models have data generating processes that cannot be linearized
- E.g., Y = e^(a+sin(BX))
- Many possible error generating processes
- My favorite - the Gumbel distribution, for maximum values
- And we haven't even started with mixed models, autocorrelation, etc...
- For these, we use other Likelihood or Bayesian methods
- Some problems have shortcuts, others do not
Logistic Regression!!!
The Logistic Curve (for Probabilities)

[Figure: sigmoid curve of Probability vs. X]
Binomial Error Generating Process

Possible values are bounded by the probability.

[Figure: histograms of binomial draws at Probability = 0.01, 0.3, 0.7, 0.99]
The Logistic Function
p = e^(a+BX) / (1 + e^(a+BX))
logit(p) = a + BX
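The two forms are inverses: the logistic curve maps the linear predictor onto (0, 1), and the logit maps a probability back onto the real line. A small Python sketch:

```python
import math

def logistic(x):
    # p = e^x / (1 + e^x): squashes the linear predictor a + BX into (0, 1)
    return math.exp(x) / (1 + math.exp(x))

def logit(p):
    # logit(p) = log(p / (1 - p)): stretches (0, 1) back onto the real line
    return math.log(p / (1 - p))

# Applying one after the other returns the original value
check = logit(logistic(1.7))
```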
Generalized Linear Model with a Logit Link
logit(p) = a + BX
Y ∼ Binom(Trials, p)
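This two-part model is easy to simulate. A Python sketch, with coefficients loosely based on the crypto fit later in the slides (values illustrative, not the actual data):

```python
import math
import random

random.seed(7)

a, B = -1.41, 0.0135  # roughly the crypto estimates; illustrative only
trials = 10           # mice per dose group

def logistic(x):
    return math.exp(x) / (1 + math.exp(x))

doses = list(range(0, 401, 20))
infected = []
for d in doses:
    p = logistic(a + B * d)  # link: logit(p) = a + B * Dose
    y = sum(random.random() < p for _ in range(trials))  # Y ~ Binom(trials, p)
    infected.append(y)
```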
Cryptosporidium
Drug Trial with Mice
Fraction of Mice Infected = Probability of Infection
[Figure: fraction of mice infected vs. Dose]
Two Different Ways of Writing the Model
# 1) using Heads, Tails
glm(cbind(Y, N - Y) ~ Dose, data = crypto, family = binomial)

# 2) using weights as size parameter for Binomial
glm(Y / N ~ Dose, weights = N, data = crypto, family = binomial)
The Fit Model
[Figure: fitted logistic curve of fraction of mice infected vs. Dose]
The Fit Model
#
# Call:
# glm(formula = cbind(Y, N - Y) ~ Dose, family = binomial, data = crypto)
#
# Deviance Residuals:
#     Min      1Q  Median      3Q     Max
# -3.9532 -1.2442  0.2327  1.5531  3.6013
#
# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.407769   0.148479  -9.481   <2e-16
# Dose         0.013468   0.001046  12.871   <2e-16
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 434.34 on 67 degrees of freedom
# Residual deviance: 200.51 on 66 degrees of freedom
# AIC: 327.03
#
# Number of Fisher Scoring iterations: 4
The Odds
Odds = p / (1 − p)

Log-Odds = Log(p / (1 − p)) = logit(p)
The Meaning of a Logit Coefficient
Logit Coefficient: A 1 unit increase in a predictor = an increase of β in the log-odds of the response.
β = logit(p2) − logit(p1)
β = Log(p2 / (1 − p2)) − Log(p1 / (1 − p1))
We need to know both p1 and β to interpret this.
If p1 = 0.5, β = 0.01347, then p2 = 0.503
If p1 = 0.7, β = 0.01347, then p2 = 0.702
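That back-calculation - add β to the log-odds of p1, then invert the logit - can be written out directly. A Python sketch:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return math.exp(x) / (1 + math.exp(x))

def shift_probability(p1, beta):
    # A 1-unit increase in the predictor adds beta on the log-odds scale
    return inv_logit(logit(p1) + beta)

p2_mid = shift_probability(0.5, 0.01347)   # a tiny shift from 0.5
p2_high = shift_probability(0.7, 0.01347)  # a tiny shift from 0.7
```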
What if we Only Have 1’s and 0’s?
[Figure: Predation (0/1) vs. log.seed.weight]
Seed Predators
http://denimandtweed.com
The GLM
seed.glm <- glm(Predation ~ log.seed.weight, data = seeds, family = binomial)
Fitted Seed Predation Plot
[Figure: fitted logistic curve of Predation vs. log.seed.weight]
Diagnostics Look Odd Due to Binned Nature of the Data
[Figure: standard glm diagnostic plots - Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance]
Creating Binned Residuals

[Figure: deviance residuals vs. fitted values from seed.glm]
Binned Residuals Should Look Spread Out
[Figure: binned residuals vs. fitted values, 200 bins]
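Binned residuals are simple to compute by hand: sort observations by fitted value, cut them into equal-size bins, and average the residuals within each bin. A Python sketch with simulated stand-ins for fitted(seed.glm) and residuals(seed.glm):

```python
import random
from statistics import mean

random.seed(3)

# Simulated stand-ins for fitted values and residuals from a fitted glm
fitted = [random.uniform(0.05, 0.55) for _ in range(400)]
resid = [random.gauss(0, 1) for _ in fitted]

def binned_residuals(fitted, resid, n_bins=20):
    # Sort observations by fitted value, then average within equal-size bins
    order = sorted(range(len(fitted)), key=lambda i: fitted[i])
    size = len(order) // n_bins
    out = []
    for b in range(n_bins):
        idx = order[b * size:(b + 1) * size]
        out.append((mean(fitted[i] for i in idx), mean(resid[i] for i in idx)))
    return out

bins = binned_residuals(fitted, resid)
```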