Logistic Regression
(a type of Generalized Linear Model)

Today

- Review of GLMs
- Logistic Regression

How do we find patterns in data?

- We begin with a model of how the world works
- We use our knowledge of a system to create a model of a Data Generating Process
- We know that there is variation in any relationship due to an Error Generating Process
- We build hypothesis tests on top of this Error Generating Process, assuming our model of the Data Generating Process is accurate


We started Linear. Why?

- Often, our first stab at a hypothesis is that two variables are associated
- Linearity is a naive, but reasonable, first assumption
- Y = a + BX is straightforward to fit

[Figure: scatterplot of y vs. x]

We started Normal. Why?

- It is reasonable to assume that small errors are common
- It is reasonable to assume that large errors are rare
- It is reasonable to assume that error is additive for many phenomena
- Many processes we measure are continuous
- Y = a + BX + e implies additive error
- Y ∼ N(mean = a + BX, sd = σ), as in the simulation sketch below
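A minimal simulation of this data generating process; all parameter values here are made up for illustration:

# simulate Y ~ N(a + BX, sigma) with hypothetical parameters
set.seed(42)
a <- 1; B <- 2; sigma <- 1
x <- runif(100, -2, 3)
y <- rnorm(100, mean = a + B * x, sd = sigma)
hist(y - (a + B * x), xlab = "Deviation from Mean")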

[Figure: Histogram of rnorm(100); x-axis: Deviation from Mean]

Example: Pufferfish Mimics & Predator Approaches

- What assumptions would you make about similarity and predator response?
- How might predators vary in response?
- What kinds of error might we have in measuring predator responses?

Example: A Linear Data Generating Process and Gaussian Error Generating Process

[Figure: scatterplot of predators vs. resemblance]

What if We Have More Information about the Data Generating Process?

- We often have real biological models of a phenomenon!
  - For example?
- Even if we do not, we often know from theory something about the shape of the data
  - For example?

Example: Michaelis-Menten Enzyme Kinetics

- We know how enzymes work
- We have no reason to suspect non-normal error
- We build a model that fits biology


Example: Michaelis-Mented Enzyme Kinetics

0.8

I We know how 0.6 Enzymes work

I We have no reason 0.4 to suspect Rate non-normal error 0.2

I We build a model 0.0 that fits biology 0 1 2 Concentration
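A sketch of fitting that biological model directly with nonlinear least squares; the enzyme data frame and the starting values are hypothetical, eyeballed from the plot:

# fit Rate = Vmax * C / (Km + C) by nonlinear least squares;
# 'enzyme' is a hypothetical data frame with Rate and Concentration columns
mm_fit <- nls(Rate ~ Vmax * Concentration / (Km + Concentration),
              data = enzyme,
              start = list(Vmax = 0.8, Km = 0.3))
summary(mm_fit)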

- Even if we had no biological model, saturating data is striking
- We may have fit some other curve - examples?
- We will discuss this later


Many Data Types Cannot Have a Normal Error Generating Process

- Count data: discrete, cannot be < 0, variance increases with mean
  - Poisson
- Overdispersed count data: discrete, cannot be < 0, variance increases faster than the mean
  - Negative Binomial or Quasipoisson
- Multiplicative error: many errors, typically small, but the biological process is multiplicative
  - Log-Normal
- Data describing the distribution of properties of multiple events: cannot be < 0, variance increases faster than the mean
  - Gamma
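Each of these pairs with a standard R modeling call; a sketch using placeholder names (y, x, and dat are not from any dataset here):

glm(y ~ x, data = dat, family = poisson)               # count data
glm(y ~ x, data = dat, family = quasipoisson)          # overdispersed counts
MASS::glm.nb(y ~ x, data = dat)                        # negative binomial counts
lm(log(y) ~ x, data = dat)                             # multiplicative (log-normal) error
glm(y ~ x, data = dat, family = Gamma(link = "log"))   # gamma-distributed data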

Example: Wolf Inbreeding and Litter Size

- The number of pups is a count!
- The number of pups is additive!
- No a priori reason to think the relationship nonlinear


[Figure: pups vs. inbreeding.coefficient]

So what is with this Generalized Linear Modeling Thing?

- Many models have data generating processes that can be linearized
  - E.g., Y = e^(a+BX) → log(Y) = a + BX
- Many error generating processes are in the exponential family
- This is *easy* to fit using likelihood and IWLS (iteratively weighted least squares) - the glm framework
- We can use other likelihood functions, or Bayesian methods
- Or least squares fits for normal linear models
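To make the linearization concrete, here are two routes to the exponential model above, sketched with placeholder names (y, x, dat):

# Route 1: linearize by hand - transform the response
lm(log(y) ~ x, data = dat)

# Route 2: keep the response, transform the prediction via a link function
glm(y ~ x, data = dat, family = gaussian(link = "log"))

The two are not identical: the first assumes multiplicative (log-normal) error, the second additive error around an exponential curve.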


Can I Stop Now? Are GLMs All I Need?

- NO!
- Many models have data generating processes that cannot be linearized
  - E.g., Y = e^(a+sin(BX))
- Many possible error generating processes
  - My favorite - the Gumbel distribution, for maximum values
- And we haven't even started with mixed models, etc...
- For these, we use other likelihood or Bayesian methods
- Some problems have shortcuts, others do not

Logistic Regression!!!


The Logistic Curve (for probabilities)

[Figure: logistic curve; the response rises from 0 to 1 as X runs from -4 to 4]

Binomial Error Generating Process

Possible values are bounded by the probability:

[Figure: frequency histograms of binomial draws for Probability = 0.01, 0.3, 0.7, and 0.99]


The Logistic Function

p = e^(a+BX) / (1 + e^(a+BX))

logit(p) = a + BX

A Generalized Linear Model with a Logit Link

logit(p) = a + BX

Y ∼ Binom(Trials, p)
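A simulation sketch of this data generating process; the parameter values and trial count below are made up:

# simulate binomial data through a logit link
set.seed(1)
a <- -1.4; B <- 0.013; trials <- 10
x <- seq(0, 400, length.out = 50)
p <- plogis(a + B * x)   # inverse logit: e^(a+BX) / (1 + e^(a+BX))
y <- rbinom(length(x), size = trials, prob = p)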


Cryptosporidium

Drug Trial with Mice


Fraction of Mice Infected = Probability of Infection

[Figure: Fraction of Mice Infected vs. Dose (0-400)]

Two Different Ways of Writing the Model

# 1) using successes and failures (heads, tails)
glm(cbind(Y, N - Y) ~ Dose, data = crypto, family = binomial)

# 2) using weights as the size parameter for the Binomial
glm(Y / N ~ Dose, weights = N, data = crypto, family = binomial)
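Both forms specify the same likelihood, so they yield identical coefficients and standard errors; the choice simply depends on how your data are stored.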


The Fit Model

[Figure: fitted logistic curve over Fraction of Mice Infected vs. Dose]

26/36 The Fit Model

# # Call: # glm(formula = cbind(Y, N - Y) ˜ Dose, family = binomial, data = crypto) # # Residuals: # Min 1Q 3Q Max # -3.9532 -1.2442 0.2327 1.5531 3.6013 # # Coefficients: # Estimate Std. Error z value Pr(>|z|) # (Intercept) -1.407769 0.148479 -9.481 <2e-16 # Dose 0.013468 0.001046 12.871 <2e-16 # # (Dispersion parameter for binomial family taken to be 1) # # Null deviance: 434.34 on 67 degrees of freedom # Residual deviance: 200.51 on 66 degrees of freedom # AIC: 327.03

# 27/36 # Number of Fisher Scoring iterations: 4

The Logit

Odds = p / (1 − p)

Log-Odds = Log( p / (1 − p) ) = logit(p)

The Meaning of a Logit Coefficient

Logit Coefficient: a one unit increase in a predictor corresponds to a β increase in the log-odds of the response.

β = logit(p2) − logit(p1)

β = Log( p2 / (1 − p2) ) − Log( p1 / (1 − p1) )

We need to know both p1 and β to interpret this.
If p1 = 0.5 and β = 0.01347, then p2 = 0.503.
If p1 = 0.7 and β = 0.01347, then p2 = 0.703.
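The same arithmetic in R: qlogis() is the logit, plogis() its inverse.

beta <- 0.01347
p1 <- c(0.5, 0.7)
p2 <- plogis(qlogis(p1) + beta)  # add beta on the log-odds scale, back-transform
round(p2, 3)                     # 0.503 0.703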


What if We Only Have 1's and 0's?

[Figure: Predation (0/1) vs. log.seed.weight]

Seed Predators

[Image credit: http://denimandtweed.com]

The GLM

seed.glm <- glm(Predation ~ log.seed.weight, data = seeds, family = binomial)
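A typical follow-up, assuming the seeds data from this slide: inspect the coefficients (on the logit scale) and push predictions back to the probability scale.

summary(seed.glm)

# predicted probability of predation across seed sizes
newdat <- data.frame(log.seed.weight = seq(0, 8, by = 0.5))
predict(seed.glm, newdata = newdat, type = "response")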

Fitted Seed Predation Plot

[Figure: fitted logistic curve of Predation vs. log.seed.weight]


Diagnostics Look Odd Due to Binned Nature of the Data

[Figure: the four standard glm diagnostic plots for seed.glm - Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage]

Creating Binned Residuals

[Figure: deviance residuals of seed.glm plotted against fitted(seed.glm)]
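One way to construct such a plot (a sketch, not necessarily the approach used here): binnedplot() from the arm package bins observations by fitted value and plots the mean residual per bin.

# install.packages("arm")  # if not already installed
library(arm)
binnedplot(x = fitted(seed.glm),
           y = resid(seed.glm, type = "response"))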


Binned Residuals Should Look Spread Out

[Figure: binned residual plot (200 bins), residual vs. fitted]