Logistic Regression (a type of Generalized Linear Model)
Today
- Review of GLMs
- Logistic Regression
How do we find patterns in data?
- We begin with a model of how the world works
- We use our knowledge of a system to create a model of a Data Generating Process
- We know that there is variation in any relationship due to an Error Generating Process
- We build hypothesis tests on top of this error generating process, based on assuming our model of the data generating process is accurate
We started Linear. Why?
- Often, our first stab at a hypothesis is that two variables are associated
- Linearity is a naive, but reasonable, first assumption
- Y = a + BX is straightforward to fit

[Figure: scatterplot of y against x with a fitted line]

We started Normal. Why?
- It is reasonable to assume that small errors are common
- It is reasonable to assume that large errors are rare
- It is reasonable to assume that error is additive for many phenomena
- Many processes we measure are continuous
- Y = a + BX + e implies additive error
- Y ∼ N(mean = a + BX, sd = σ)
[Figure: histogram of rnorm(100), deviation from mean]
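The model above, Y ∼ N(mean = a + BX, sd = σ), is easy to simulate directly. A minimal sketch in Python (the course uses R; the parameter values here are made up for illustration):

```python
import random

random.seed(42)

# Hypothetical parameters: intercept, slope, and error SD
a, B, sigma = 1.0, 2.0, 1.5

# Data generating process: a line. Error generating process: additive Gaussian noise.
xs = [x / 10 for x in range(-20, 41)]
ys = [a + B * x + random.gauss(0, sigma) for x in xs]

# Residuals around the true line recover the error generating process:
# roughly symmetric, centered on zero, small errors common, large errors rare.
residuals = [y - (a + B * x) for x, y in zip(xs, ys)]
mean_resid = sum(residuals) / len(residuals)
```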
Example: Pufferfish Mimics & Predator Approaches
- What assumptions would you make about similarity and predator response?
- How might predators vary in response?
- What kinds of error might we have in measuring predator responses?
Example: A Linear Data Generating Process and Gaussian Error Generating Process
[Figure: number of predators vs. resemblance, with a fitted line]
What if We Have More Information about the Data Generating Process?
- We often have real biological models of a phenomenon!
- For example?
- Even if we do not, we often know something from theory - we know the shape of the data
- For example?
Example: Michaelis-Menten Enzyme Kinetics
- We know how Enzymes work
- We have no reason to suspect non-normal error
- We build a model that fits biology
Example: Michaelis-Menten Enzyme Kinetics

- We know how Enzymes work
- We have no reason to suspect non-normal error
- We build a model that fits biology

[Figure: reaction Rate vs. Concentration, saturating curve]
Example: Michaelis-Menten Enzyme Kinetics

- Even if we had no biological model, saturating data is striking
- We may have fit some other curve - examples?
- We will discuss model selection later

[Figure: reaction Rate vs. Concentration, saturating curve]
Many Data Types Cannot Have a Normal Error Generating Process
- Count data: discrete, cannot be <0, variance increases with mean
  - Poisson
- Overdispersed count data: discrete, cannot be <0, variance increases faster than mean
  - Negative Binomial or Quasipoisson
- Multiplicative error: many errors, typically small, but the biological process is multiplicative
  - Log-Normal
- Data describes the distribution of properties of multiple events: cannot be <0, variance increases faster than mean
  - Gamma
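The mean-variance claims above can be checked by simulation. A Python sketch, using Knuth's Poisson sampler since the standard library has no built-in one, showing that for a Poisson the variance tracks the mean:

```python
import math
import random

random.seed(1)

def rpois(lam):
    # Knuth's algorithm: multiply uniforms until the product drops below e^-lam
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

results = {}
for lam in (2.0, 8.0):
    draws = [rpois(lam) for _ in range(20000)]
    m = sum(draws) / len(draws)
    v = sum((d - m) ** 2 for d in draws) / len(draws)
    results[lam] = (m, v)  # both should be close to lam
```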
Example: Wolf Inbreeding and Litter Size
- The Number of Pups is a Count!
- The Number of Pups is Additive!
- No a priori reason to think the relationship nonlinear
Example: Wolf Inbreeding and Litter Size

- The Number of Pups is a Count!
- The Number of Pups is Additive!
- No a priori reason to think the relationship nonlinear

[Figure: pups vs. inbreeding.coefficient, with a fitted line]
So what is with this Generalized Linear Modeling Thing?
- Many models have data generating processes that can be linearized
- E.g., Y = e^(a+BX) → log(Y) = a + BX
- Many error generating processes are in the exponential family
- This is *easy* to fit using Likelihood and IWLS - the glm framework
- We can use other Likelihood functions, or Bayesian methods
- Or Least Squares fits for normal linear models
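As a sketch of the linearization idea, here is an exponential data generating process with multiplicative (log-normal) error, fit by ordinary least squares after taking logs. Python, with hypothetical parameter values:

```python
import math
import random

random.seed(0)

a, B = 0.5, 1.2  # hypothetical true parameters

# Y = e^(a + BX) * multiplicative error  =>  log(Y) = a + BX + additive error
xs = [x / 10 for x in range(1, 31)]
ys = [math.exp(a + B * x) * math.exp(random.gauss(0, 0.1)) for x in xs]

# Ordinary least squares on the log scale recovers a and B
lys = [math.log(y) for y in ys]
n = len(xs)
mx, my = sum(xs) / n, sum(lys) / n
B_hat = sum((x - mx) * (ly - my) for x, ly in zip(xs, lys)) / sum((x - mx) ** 2 for x in xs)
a_hat = my - B_hat * mx
```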
Can I Stop Now? Are GLMs All I Need?
- NO!
- Many models have data generating processes that cannot be linearized
- E.g., Y = e^(a+sin(BX))
- Many possible error generating processes
- My favorite - the Gumbel distribution, for maximum values
- And we haven't even started with mixed models, autocorrelation, etc...
- For these, we use other Likelihood or Bayesian methods
- Some problems have shortcuts, others do not
Logistic Regression!!!
The Logistic Curve (for Probabilities)

[Figure: sigmoid curve of Probability vs. X]
Binomial Error Generating Process

Possible values are bounded by the probability.

[Figure: histograms of binomial draws at Probability = 0.01, 0.3, 0.7, 0.99]
The Logistic Function
p = e^(a+BX) / (1 + e^(a+BX))
logit(p) = a + BX
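The two forms are inverses: the logistic curve maps the linear predictor onto (0, 1), and the logit maps a probability back onto the real line. A small Python sketch:

```python
import math

def logistic(x):
    # p = e^x / (1 + e^x): squashes the linear predictor a + BX into (0, 1)
    return math.exp(x) / (1 + math.exp(x))

def logit(p):
    # logit(p) = log(p / (1 - p)): stretches (0, 1) back onto the real line
    return math.log(p / (1 - p))

# Applying one after the other returns the original value
check = logit(logistic(1.7))
```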
Generalized Linear Model with a Logit Link
logit(p) = a + BX
Y ∼ Binom(Trials, p)
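This two-part model is easy to simulate. A Python sketch, with coefficients loosely based on the crypto fit later in the slides (values illustrative, not the actual data):

```python
import math
import random

random.seed(7)

a, B = -1.41, 0.0135  # roughly the crypto estimates; illustrative only
trials = 10           # mice per dose group

def logistic(x):
    return math.exp(x) / (1 + math.exp(x))

doses = list(range(0, 401, 20))
infected = []
for d in doses:
    p = logistic(a + B * d)  # link: logit(p) = a + B * Dose
    y = sum(random.random() < p for _ in range(trials))  # Y ~ Binom(trials, p)
    infected.append(y)
```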
Cryptosporidium
Drug Trial with Mice
Fraction of Mice Infected = Probability of Infection
[Figure: fraction of mice infected vs. Dose]
Two Different Ways of Writing the Model
# 1) using Heads, Tails
glm(cbind(Y, N - Y) ~ Dose, data = crypto, family = binomial)

# 2) using weights as size parameter for Binomial
glm(Y / N ~ Dose, weights = N, data = crypto, family = binomial)
The Fit Model
[Figure: fitted logistic curve of fraction of mice infected vs. Dose]
The Fit Model
#
# Call:
# glm(formula = cbind(Y, N - Y) ~ Dose, family = binomial, data = crypto)
#
# Deviance Residuals:
#     Min      1Q  Median      3Q     Max
# -3.9532 -1.2442  0.2327  1.5531  3.6013
#
# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.407769   0.148479  -9.481   <2e-16
# Dose         0.013468   0.001046  12.871   <2e-16
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 434.34 on 67 degrees of freedom
# Residual deviance: 200.51 on 66 degrees of freedom
# AIC: 327.03
#
# Number of Fisher Scoring iterations: 4
The Odds
Odds = p / (1 − p)

Log-Odds = Log(p / (1 − p)) = logit(p)
The Meaning of a Logit Coefficient
Logit Coefficient: A 1 unit increase in a predictor = an increase of β in the log-odds of the response.
β = logit(p2) − logit(p1)
β = Log(p2 / (1 − p2)) − Log(p1 / (1 − p1))
We need to know both p1 and β to interpret this.
If p1 = 0.5, β = 0.01347, then p2 = 0.503
If p1 = 0.7, β = 0.01347, then p2 = 0.702
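That back-calculation - add β to the log-odds of p1, then invert the logit - can be written out directly. A Python sketch:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return math.exp(x) / (1 + math.exp(x))

def shift_probability(p1, beta):
    # A 1-unit increase in the predictor adds beta on the log-odds scale
    return inv_logit(logit(p1) + beta)

p2_mid = shift_probability(0.5, 0.01347)   # a tiny shift from 0.5
p2_high = shift_probability(0.7, 0.01347)  # a tiny shift from 0.7
```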
What if we Only Have 1’s and 0’s?
[Figure: Predation (0/1) vs. log.seed.weight]
Seed Predators
http://denimandtweed.com
The GLM
seed.glm <- glm(Predation ~ log.seed.weight, data = seeds, family = binomial)
Fitted Seed Predation Plot
[Figure: fitted logistic curve of Predation vs. log.seed.weight]
Diagnostics Look Odd Due to Binned Nature of the Data
[Figure: standard glm diagnostic plots - Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance]
Creating Binned Residuals

[Figure: deviance residuals vs. fitted values from seed.glm]
Binned Residuals Should Look Spread Out
[Figure: binned residuals vs. fitted values, 200 bins]
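Binned residuals are simple to compute by hand: sort observations by fitted value, cut them into equal-size bins, and average the residuals within each bin. A Python sketch with simulated stand-ins for fitted(seed.glm) and residuals(seed.glm):

```python
import random
from statistics import mean

random.seed(3)

# Simulated stand-ins for fitted values and residuals from a fitted glm
fitted = [random.uniform(0.05, 0.55) for _ in range(400)]
resid = [random.gauss(0, 1) for _ in fitted]

def binned_residuals(fitted, resid, n_bins=20):
    # Sort observations by fitted value, then average within equal-size bins
    order = sorted(range(len(fitted)), key=lambda i: fitted[i])
    size = len(order) // n_bins
    out = []
    for b in range(n_bins):
        idx = order[b * size:(b + 1) * size]
        out.append((mean(fitted[i] for i in idx), mean(resid[i] for i in idx)))
    return out

bins = binned_residuals(fitted, resid)
```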