
Logistic Regression, Part II

The Multivariate Problem

Q: What is the relationship between one (or more) exposure variables, E, and a binary disease or illness outcome, Y, while adjusting for potential confounding effects of C1, C2, …?

Example:
• Y is coronary heart disease (CHD): Y = 1 is “with CHD” and Y = 0 is “without CHD”.
• Our exposure of interest is smoking: E = 1 for smokers (current or ever), E = 0 for non-smokers.
• What is the extent of association between smoking and CHD?
• We want to “account for”, or control for, other variables (potential confounders) such as age, race and gender.

E, C1, C2, C3 (“independent”)  ⇒  Y (“dependent”)

Spring 2013 Biostat 513

Logistic Regression: The Multivariate Problem

• Independent variables: X = X1, X2, …, Xp. We have a flexible choice for the type of independent variables: they may be continuous, categorical, or binary.
• Dependent variable: Y, binary.
• We can adopt a mathematical model to structure the systematic variation in the response variable Y as a function of X.
• We can adopt a probability model to represent the random variation in the response.

Recall linear regression:

  Y = β0 + β1X1 + … + βpXp + e,  e ~ N(0, σ²)

or, equivalently, Y ~ N(µ(X), σ²) with µ(X) = β0 + β1X1 + … + βpXp.

Binary Response Regression

For binary Y (= 0/1): µ = E(Y) = P(Y = 1) = π.
• The mean of a binary variable is a probability, i.e. π ∈ (0, 1).
• The mean may depend on covariates (binary, multi-categorical, or continuous X).
• This suggests considering π(X) = β0 + β1X1 + … + βpXp.
• Can we use linear regression for π(X)? Two issues:
  - If we model the mean for Y we’ll need to impose the constraint 0 < π(X) < 1 for all X.
  - What is σ² for a binary Y?

Instead, consider the equivalent representation …

The Logistic Function

The logistic function is given by:

  f(z) = exp(z) / [1 + exp(z)] = 1 / [1 + exp(−z)]

The Logistic Regression Model

Define a “linear predictor” by Xβ = β0 + β1X1 + … + βpXp. Then the model for π(X) is:

  π(X) = exp(Xβ) / [1 + exp(Xβ)]   — this is called the “expit” transform

or, equivalently,

  log{ π(X) / [1 − π(X)] } = Xβ   — this is called the log odds or “logit” transform

Properties: an “S”-shaped curve with:

  lim f(z) = 1/(1+0) = 1 as z → +∞
  lim f(z) = 1/(1+∞) = 0 as z → −∞
  f(0) = 1/2

[Figure: f(z) plotted against z for z from −6 to 6; the curve rises from 0 to 1 in an “S” shape.]

The Logistic Regression Model

Q: Why is logistic regression so popular?
1. Dichotomous outcomes are common.
2. Logistic regression ensures that predicted probabilities lie between 0 and 1.
3. Regression parameters are log odds ratios – hence, estimable from case-control studies.

Logistic Regression: Some Special Cases – Binary Exposure

Q: What is the logistic regression model for a simple binary exposure variable, E, and a binary disease or illness outcome, D?

Example: Pauling (1971)

                 E=1 (Vit C)   E=0 (placebo)   Total
  D (cold=Yes)        17             31           48
  D (cold=No)        122            109          231
  Total              139            140          279

Binary Exposure

X1: exposure                              Y: outcome
  X1 = 1 if in group E (vitamin C)          Y = 1 if in group D (cold)
  X1 = 0 if in group E (placebo)            Y = 0 if in group D (no cold)

Data:

  Y   X1   count
  1    1     17
  0    1    122
  1    0     31
  0    0    109

equivalently:

         X1=1   X1=0
  Y=1      17     31    48
  Y=0     122    109   231
          139    140   279

  P̂[Y=1 | X1=1] = 17/139 = 0.122
  P̂[Y=1 | X1=0] = 31/140 = 0.221

  RR̂ = 0.122 / 0.221 = 0.552
  OR̂ = [0.122/(1−0.122)] / [0.221/(1−0.221)] = 0.490

Q: How would we approach these data using logistic regression?

Model: logit(π(X)) = β0 + β1X1, where π(X) = probability that Y = 1 given X.
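The hand calculations for the Pauling table can be checked in a few lines. A minimal sketch in Python (the slides use Stata; this is purely to verify the arithmetic):

```python
# Pauling (1971) vitamin C trial as a 2x2 table:
#            X1=1 (Vit C)   X1=0 (placebo)
# Y=1 (cold)     17              31
# Y=0 (none)    122             109
a, b = 17, 31    # cases:    exposed, unexposed
c, d = 122, 109  # noncases: exposed, unexposed

p1 = a / (a + c)   # P(Y=1 | X1=1) = 17/139
p0 = b / (b + d)   # P(Y=1 | X1=0) = 31/140
rr = p1 / p0       # risk ratio
orr = (p1 / (1 - p1)) / (p0 / (1 - p0))  # odds ratio, equals ad/bc

print(round(p1, 3), round(p0, 3), round(rr, 3), round(orr, 3))
```

These reproduce the slide values 0.122, 0.221, RR ≈ 0.552 and OR ≈ 0.490.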

Binary Exposure – Parameter Interpretation

• Model: logit(π(X1)) = β0 + β1X1
• Probabilities:
  P(Y=1 | X1=0) = π(X1=0) = exp(β0) / [1 + exp(β0)]
  P(Y=1 | X1=1) = π(X1=1) = exp(β0+β1) / [1 + exp(β0+β1)]
• Odds:
  odds of disease when X1 = 0: exp(β0)
  odds of disease when X1 = 1: exp(β0+β1)
• Odds ratio:
  OR = Odds(X1=1) / Odds(X1=0) = exp(β1)
• Log odds ratio: β1

Binary Exposure – Parameter Estimates

Q: How can we estimate the logistic regression model parameters?

In this simple case we can calculate by hand. We know that

  π̂(X1=0) = 0.221  and  log[ π̂(X1=0) / (1 − π̂(X1=0)) ] = β̂0,  so
  log( 0.221 / (1 − 0.221) ) = −1.260.

We also know that

  π̂(X1=1) = 0.122  and  log[ π̂(X1=1) / (1 − π̂(X1=1)) ] = β̂0 + β̂1,  so
  log( 0.122 / (1 − 0.122) ) = −1.974.

Hence β̂0 = −1.260, β̂1 = −1.974 − (−1.260) = −0.713, and OR̂ = exp(−0.713) = 0.490.
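The same hand estimation, as a short Python sketch (empirical logits from the two group proportions; again just a check on the arithmetic, not part of the slides):

```python
from math import log, exp

p0 = 31 / 140   # P-hat(Y=1 | X1=0), placebo group
p1 = 17 / 139   # P-hat(Y=1 | X1=1), vitamin C group

b0 = log(p0 / (1 - p0))        # intercept: log odds in the X1=0 group
b1 = log(p1 / (1 - p1)) - b0   # slope: difference in log odds

print(round(b0, 3), round(b1, 3), round(exp(b1), 3))
```

These agree with the Stata fit shown later (−1.257361 and −.713447).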

Maximum Likelihood Estimation

Q: How can we estimate the logistic regression model parameters?
A: More generally, for models with multiple covariates, the computer implements an estimation method known as “maximum likelihood estimation”.

• Maximum likelihood is the “best” method of estimation for any situation in which you are willing to write down a probability model.
• Generic idea: find the value of the parameter (β) where the log-likelihood function, l(β; data), is maximum.
• In simple cases it corresponds with our common-sense estimates, but it applies to complex problems as well.
• We can use computers to find these estimates by maximizing a particular function, known as the likelihood function.
• We use comparisons in the value of the (log) likelihood function as a preferred method for testing whether certain variables (coefficients) are significant (i.e. to test H0: β = 0).

[Figure: log-likelihood plotted against β, maximized at β̂max.]

• For multiple parameters β1, β2, … imagine a likelihood “mountain”.
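To make the “maximize the likelihood” idea concrete, here is a minimal Newton-Raphson maximization of the Bernoulli log-likelihood for the vitamin C data, in pure Python. This is a sketch of what Stata's logit does internally, not its actual implementation:

```python
from math import exp, log

# (y, x, count) rows from the vitamin C study
data = [(1, 1, 17), (0, 1, 122), (1, 0, 31), (0, 0, 109)]

b0, b1 = 0.0, 0.0
for _ in range(25):  # Newton-Raphson iterations
    # score vector U and information matrix I for (b0, b1)
    u0 = u1 = i00 = i01 = i11 = 0.0
    for y, x, n in data:
        p = 1.0 / (1.0 + exp(-(b0 + b1 * x)))  # fitted probability
        u0 += n * (y - p)
        u1 += n * (y - p) * x
        w = n * p * (1.0 - p)
        i00 += w
        i01 += w * x
        i11 += w * x * x
    det = i00 * i11 - i01 * i01
    b0 += ( i11 * u0 - i01 * u1) / det  # solve I @ step = U (2x2 system)
    b1 += (-i01 * u0 + i00 * u1) / det

# maximized log likelihood at the fitted values
ll = sum(n * (y * log(1/(1+exp(-(b0+b1*x)))) +
              (1-y) * log(1 - 1/(1+exp(-(b0+b1*x)))))
         for y, x, n in data)
print(round(b0, 4), round(b1, 4), round(ll, 4))
```

The iteration converges to β̂0 ≈ −1.2574, β̂1 ≈ −0.7134 with maximized log likelihood ≈ −125.656, matching the Stata output on the next slide.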


Logistic Regression – Vitamin C Study Example (Stata)

. input count y x
  1. 17 1 1
  2. 31 1 0
  3. 122 0 1
  4. 109 0 0
  5. end

. expand count
(275 observations created)

. logistic y x

Logistic regression                               Number of obs =        279
                                                  LR chi2(1)    =       4.87
                                                  Prob > chi2   =     0.0273
Log likelihood = -125.6561                        Pseudo R2     =     0.0190

------------------------------------------------------------------------------
           y | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .4899524   .1613518    -2.17   0.030     .2569419    .9342709
------------------------------------------------------------------------------

. logit y x

Logit estimates                                   Number of obs =        279
                                                  LR chi2(1)    =       4.87
                                                  Prob > chi2   =     0.0273
Log likelihood = -125.6561                        Pseudo R2     =     0.0190

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   -.713447   .3293214    -2.17   0.030    -1.358905   -.0679889
       _cons |  -1.257361   .2035494    -6.18   0.000     -1.65631   -.8584111
------------------------------------------------------------------------------

Alternatives:
. glm y x, family(binomial) link(logit) eform
. binreg y x, or

. cs y x, or

                 | Exposed   Unexposed |      Total
-----------------+---------------------+-----------
           Cases |      17          31 |         48
        Noncases |     122         109 |        231
-----------------+---------------------+-----------
           Total |     139         140 |        279
                 |                     |
            Risk |  .1223022  .2214286 |    .172043
                 |                     |
                 |   Point estimate    | [95% Conf. Interval]
                 |---------------------+----------------------
 Risk difference |     -.0991264       | -.1868592   -.0113937
      Risk ratio |      .5523323       |  .3209178    .9506203
 Prev. frac. ex. |      .4476677       |  .0493797    .6790822
 Prev. frac. pop |      .2230316       |
      Odds ratio |      .4899524       |  .2588072    .9282861  (Cornfield)
                 +----------------------------------------------
                     chi2(1) = 4.81    Pr>chi2 = 0.0283

Logistic Regression – Single Binary Predictor

Model: logit[π(X1)] = β0 + β1X1

  X1   logit(π(X1))
   0   β0
   1   β0 + β1

• logit(P(Y=1 | X1=0)) = β0
• logit(P(Y=1 | X1=1)) = β0 + β1
• logit(P(Y=1 | X1=1)) − logit(P(Y=1 | X1=0)) = β1

β1 is the log odds ratio of “success” (Y=1) comparing two groups with X1=1 (first group) and X1=0 (second group).
β0 is the log odds (of Y) for X1=0. Also, expit(β̂0) is the estimate of P(Y=1 | X1=0).

Table of probabilities and odds:

                        X1 = 1                         X1 = 0
  Probability   exp(β0+β1) / [1+exp(β0+β1)]    exp(β0) / [1+exp(β0)]
  Odds          exp(β0+β1)                     exp(β0)

• exp(β1) is the odds ratio that compares the odds of a success (Y=1) in the “exposed” (X1=1) group to the “unexposed” (X1=0) group.
• The logistic regression OR and the simple 2×2 OR are identical: OR̂ = exp(β̂1) = ad/bc.
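A quick numeric check that the fitted log odds ratio reproduces ad/bc exactly (Python sketch using the vitamin C counts):

```python
from math import log, exp

a, b, c, d = 17, 31, 122, 109  # 2x2 cell counts (rows: Y=1, Y=0)

logit1 = log((a / (a + c)) / (c / (a + c)))  # log odds in the X1=1 group
logit0 = log((b / (b + d)) / (d / (b + d)))  # log odds in the X1=0 group
b1 = logit1 - logit0                         # fitted log odds ratio

# exp(b1) and ad/bc agree to machine precision
print(round(exp(b1), 6), round(a * d / (b * c), 6))
```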

Logistic Regression – Case-Control Studies

Recall:
• In case-control studies we sample cases (Y = 1) and controls (Y = 0) and then ascertain covariates (X).
• From this study design we cannot estimate disease risk, P(Y=1|X), nor relative risk, but we can estimate exposure odds ratios.
• Exposure odds ratios are equal to disease odds ratios.
• The result is that we can use case-control data to estimate disease odds ratios, which for rare outcomes approximate relative risks.

The case-control study design is particularly effective when “disease” is rare:
• If the “disease” affects only 1 person per 10,000 per year, we would need a very large prospective study.
• But if we consider a large urban area of 1,000,000, we would expect to see 100 cases a year.
• We could sample all 100 cases and 100 random controls.
• The sampling fractions, f, for cases and controls are then very different:

  f0 = 100 / 999,900 ≈ .0001 for controls
  f1 = 100 / 100 = 1 for cases

Q: Can one do any regression modeling using data from a case-control study?
A: Yes, one can use standard logistic regression to estimate ORs, but not disease risk probabilities.

Logistic Regression – Case-Control Studies

Example 2: Keller (AJPH, 1965)

              Case   Control   Total
  Smoker       484       385     869
  Non-smoker    27        90     117
  Total        511       475     986

. logit cancer smoke

Logistic regression                               Number of obs =        986
                                                  LR chi2(1)    =      45.78
                                                  Prob > chi2   =     0.0000
Log likelihood = -659.89728                       Pseudo R2     =     0.0335

------------------------------------------------------------------------------
      cancer |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   1.432814   .2298079     6.23   0.000     .9823992     1.88323
       _cons |  -1.203973   .2194269    -5.49   0.000    -1.634042   -.7739041
------------------------------------------------------------------------------

Interpret: OR = exp(1.43) = 4.19 (compare to the hand calculation).

Key points:
1. We can “pretend” that the case-control data were collected prospectively and use standard logistic regression (outcome = disease; covariate = exposure) to estimate regression coefficients and obtain standard errors.
2. We need to be careful not to use the intercept, β̂0, to estimate risk probabilities. In fact, β̂0 = β0 + ln(f1/f0), where β0 is the true value you would get from a random sample of the population (e.g. if f1 = 1 and f0 = .0001, then β̂0 = β0 + 9.21).
3. A key assumption is that the probability of being sampled, for both cases and controls, does not depend on the covariate of interest.

β0 is never directly interpretable in a case-control study!
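Both the hand-calculated OR and the intercept shift can be checked numerically (a Python sketch; the sampling fractions are those from the rare-disease example above):

```python
from math import log

# Keller (1965): smoking and cancer, case-control counts
#            case  control
# smoker      484     385
# nonsmoker    27      90
or_hand = (484 * 90) / (27 * 385)  # exposure OR = disease OR
print(round(or_hand, 2))           # ~4.19, i.e. exp(1.4328) from the logit fit

# the intercept, by contrast, is shifted by the sampling fractions
f1, f0 = 1.0, 0.0001               # case / control sampling fractions
offset = log(f1 / f0)              # beta0_hat = beta0 + ln(f1/f0)
print(round(offset, 2))            # ~9.21
```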


Logistic Regression – Case-Control Studies

Disease OR = Exposure OR:

. logit cancer smoke

Logistic regression                               Number of obs =        986
                                                  LR chi2(1)    =      45.78
                                                  Prob > chi2   =     0.0000
Log likelihood = -659.89728                       Pseudo R2     =     0.0335

------------------------------------------------------------------------------
      cancer |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       smoke |   1.432814   .2298079     6.23   0.000     .9823992     1.88323
       _cons |  -1.203973   .2194269    -5.49   0.000    -1.634042   -.7739041
------------------------------------------------------------------------------

. logit smoke cancer

Logistic regression                               Number of obs =        986
                                                  LR chi2(1)    =      45.78
                                                  Prob > chi2   =     0.0000
Log likelihood = -336.26115                       Pseudo R2     =     0.0637

------------------------------------------------------------------------------
       smoke |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      cancer |   1.432814   .2298079     6.23   0.000     .9823992     1.88323
       _cons |   1.453434   .1170834    12.41   0.000     1.223954    1.682913
------------------------------------------------------------------------------

The slope coefficient is identical either way: the exposure OR equals the disease OR.

Binary Regression – Other “links”

1) “logit” link: logit[π(X)] = β0 + β1X1
   • most common form of binary regression
   • guarantees that π(X) is between 0 and 1
   • coefficients are log(OR) (for a 0/1 covariate)
   • computationally stable

2) “identity” link: π(X) = β0 + β1X1
   • predicted π(X) can be −∞ to ∞
   • coefficients are RD (for a 0/1 covariate)
   • computationally stable

3) “log” link: log[π(X)] = β0 + β1X1
   • predicted π(X) can be 0 to ∞
   • coefficients are log(RR) (for a 0/1 covariate)
   • less computationally stable

Binary Regression – Other “links”: Example 1 – Vitamin C Study

.* logistic link regression
. binreg y x, or
. glm y x, family(binomial) link(logit) eform

             |        OIM
           y | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .4899524   .1613518    -2.17   0.030     .2569419    .9342709

.* identity link regression
. binreg y x, rd
. glm y x, family(binomial) link(identity)

             |        OIM
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |  -.0991264   .0447624    -2.21   0.027    -.1868592   -.0113937
       _cons |   .2214286   .0350915     6.31   0.000     .1526505    .2902067

Coefficients are the risk differences.

.* log link regression
. binreg y x, rr
. glm y x, family(binomial) link(log) eform

             |        OIM
           y | Risk Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .5523323   .1530115    -2.14   0.032     .3209178    .9506203

Summary

1. For 2×2 tables, logistic regression fits a pair of probabilities: π̂(X=1) and π̂(X=0).
2. Model: logit(π(X)) = β0 + β1X.
3. β0 represents the reference “log odds” when X = 0.
4. β1 is the log odds ratio comparing the odds of response among the “exposed” (X = 1) to the odds of response among the “unexposed” (X = 0).
5. The logistic regression odds ratio and the simple 2×2 odds ratio are identical.
6. Note: the estimated standard errors (95% CI) from logistic regression may be slightly different from those for the 2×2 table analysis.
7. Other links are possible.
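Each link's slope is a familiar effect measure; a Python check against the Stata output, computed directly from the 2×2 proportions:

```python
from math import log

p1 = 17 / 139   # risk in the vitamin C (x=1) group
p0 = 31 / 140   # risk in the placebo (x=0) group

log_or = log(p1 / (1 - p1)) - log(p0 / (1 - p0))  # logit link slope
rd     = p1 - p0                                  # identity link slope
log_rr = log(p1) - log(p0)                        # log link slope

print(round(log_or, 4), round(rd, 4), round(log_rr, 4))
```

These match exp-transformed Stata results: exp(log_or) ≈ .4900, rd ≈ −.0991, exp(log_rr) ≈ .5523.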


Applying the Multiple Logistic Model

Example: from Kleinbaum (1994)

  Y = D = CHD (0, 1)
  X1 = CAT, catecholamine level (1 = high, 0 = low)
  X2 = AGE, in years
  X3 = ECG (1 = abnormal, 0 = normal)

n = 609 males with 9-year follow-up.

Model:

  P(D=1 | X1, X2, X3) = π(X) = exp(β0 + β1X1 + β2X2 + β3X3) / [1 + exp(β0 + β1X1 + β2X2 + β3X3)]

Estimates via maximum likelihood:

  β̂0 = −3.911,  β̂1 = 0.652,  β̂2 = 0.029,  β̂3 = 0.342

Q: What is the estimated probability of CHD for an individual with (CAT = 1, AGE = 40, ECG = 0)?

  π̂(X(1)) = exp(−3.911 + 0.652·1 + 0.029·40 + 0.342·0) / [1 + exp(−3.911 + 0.652·1 + 0.029·40 + 0.342·0)]
           = exp(−2.10) / [1 + exp(−2.10)]
           = 0.109

Q: What is the estimated probability of CHD for an individual with (CAT = 0, AGE = 40, ECG = 0)?

  π̂(X(0)) = exp(−3.911 + 0.652·0 + 0.029·40 + 0.342·0) / [1 + exp(−3.911 + 0.652·0 + 0.029·40 + 0.342·0)]
           = exp(−2.751) / [1 + exp(−2.751)]
           = 0.060
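The two probability calculations can be scripted; a Python sketch (the expit/p_chd helpers are ours, not from the slides):

```python
from math import exp

def expit(z):
    return 1.0 / (1.0 + exp(-z))

b = (-3.911, 0.652, 0.029, 0.342)  # (intercept, CAT, AGE, ECG) estimates

def p_chd(cat, age, ecg):
    eta = b[0] + b[1] * cat + b[2] * age + b[3] * ecg  # linear predictor
    return expit(eta)

p_hi = p_chd(1, 40, 0)  # CAT=1, AGE=40, ECG=0 -> ~0.109
p_lo = p_chd(0, 40, 0)  # CAT=0, AGE=40, ECG=0 -> ~0.060
print(round(p_hi, 3), round(p_lo, 3))
```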

Applying the Multiple Logistic Model

Compare CHD risk by CAT for persons with AGE = 40 and ECG = 0:

  RR̂ = π̂(X(1)) / π̂(X(0)) = 0.109 / 0.060 = 1.817

  OR̂ = [π̂(X(1)) / (1 − π̂(X(1)))] / [π̂(X(0)) / (1 − π̂(X(0)))]
      = [0.109/(1−0.109)] / [0.060/(1−0.060)] = 1.917

Note:
1) log(1.917) = 0.652 = β̂1.
2) Does the estimated OR associated with CAT change if AGE or ECG changes?
3) Does the estimated RR associated with CAT change if AGE or ECG changes?

Logit Coefficients

We can represent the logistic model for CHD as:

  logit[π(X)] = β0 + β1·CAT + β2·AGE + β3·ECG

Then we can evaluate X(1) = (1, 40, 0) and X(0) = (0, 40, 0) as before:

  logit[π(X(1))] = β0 + β1 + 40β2
  logit[π(X(0))] = β0 + 40β2
  logit[π(X(1))] − logit[π(X(0))] = β1

Q: What is the interpretation of β1?
We can say: “β1 represents the change in the log odds when CAT changes from 0 to 1 and AGE and ECG are fixed.”

Q: What is the interpretation of β2?
β2 represents the change in the log odds when AGE changes by one year and CAT and ECG are fixed.


Logit Coefficients

Q: What is the interpretation of β0?
• If all the Xj = 0 then logit(π(X)) = β0.
• Therefore, β0 is the log odds for an individual with all covariates equal to zero.

Q: Does this make sense? Sometimes – but not in our CHD example: CAT = 0 is meaningful and ECG = 0 is meaningful, but AGE = 0 is not.

Recall: β0 is never directly interpretable in a case-control study.

Odds Ratios

We can also consider an odds ratio between two individuals (or populations) characterized by two covariate values, X(1) and X(0):

  odds(X(1)) = exp(X(1)β)
  odds(X(0)) = exp(X(0)β)

  OR(X(1), X(0)) = odds(X(1)) / odds(X(0)) = exp(X(1)β) / exp(X(0)β)
                 = exp( Σ_{j=1..p} βj (Xj(1) − Xj(0)) )

Spring 2013 Biostat 513 162 Spring 2013 Biostat 513 163 Odds Ratios Summary Multiple Logistic Regression In the CHD example we have: X(1) = (1, 40, 1) 1. Define: logistic function. X(2) = (0, 40, 0) 2. Properties of the logistic function. And we can obtain the odds ratio comparison as: 3. Multiple covariates. 3 (1) (0) = β (1)− (0) OR( X , X ) exp(∑ jj (XX j)) 4. Applying the logistic model to obtain probabilities. j=1 = exp(β1+β3) 5. Interpreting the parameters in the logistic model. , = exp 0.652 + 0.342 6. Obtaining RRs and ORs. 1 0 푂푂� 푋 푋 = 2.702 Interpret:


Logistic Regression: 2 Binary Covariates

Q: What is the logistic regression model for a simple binary “exposure” variable, E, a simple binary stratifying variable, C, and a binary outcome, Y?

Example: HIVNET Vaccine Trial – willingness to participate

  Y = willingness to participate
  E = knowledge
  C = cohort (recruitment time: previous study / new recruits)

Previous study:
                 High Knowledge (≥8)   Low Knowledge (≤7)   Total
  Willing               22                    32               54
  Not Willing          112                   153              265
  Total                134                   185              319

New recruits:
                 High Knowledge (≥8)   Low Knowledge (≤7)   Total
  Willing               39                    67              106
  Not Willing          146                   179              325
  Total                185                   246              431

• Crude table – willingness vs knowledge: OR = 0.79
• For the previous study cohort: OR = 0.94
• For the new recruits cohort: OR = 0.71
• Cohort as confounder?
• Cohort as effect modifier (on the OR scale)?
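The crude, stratum-specific, and Mantel-Haenszel combined odds ratios quoted above can all be reproduced from the two tables (Python sketch, purely for checking; the M-H estimate appears in the Stata output a few slides on):

```python
# (a, b, c, d) per stratum: rows willing / not willing,
# columns high / low knowledge
strata = {
    "previous": (22, 32, 112, 153),
    "new":      (39, 67, 146, 179),
}

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

or_prev = odds_ratio(*strata["previous"])  # ~0.94
or_new  = odds_ratio(*strata["new"])       # ~0.71

# crude table collapses over cohort
a = 22 + 39; b = 32 + 67; c = 112 + 146; d = 153 + 179
or_crude = odds_ratio(a, b, c, d)          # ~0.79

# Mantel-Haenszel combined OR: sum(a_i*d_i/n_i) / sum(b_i*c_i/n_i)
num = sum(a * d / (a + b + c + d) for a, b, c, d in strata.values())
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata.values())
or_mh = num / den                          # ~0.79

print(round(or_prev, 3), round(or_new, 3), round(or_crude, 3), round(or_mh, 3))
```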

Logistic Regression: Defining Variables

• Let X1 be a dummy variable for “exposure” groups:
  X1 = 1 if the subject has knowledge score ≥ 8
  X1 = 0 if the subject has knowledge score ≤ 7
• Let X2 be a dummy variable for “stratification” groups:
  X2 = 1 if the subject is a new recruit
  X2 = 0 if the subject is from a previous study (rollover)
• Let Y be an indicator (binary) outcome variable:
  Y = 1 if the subject is definitely willing to participate
  Y = 0 if the subject is not definitely willing to participate

HIVNET Vaccine Trial Data

  Y   X1 (know)   X2 (cohort)   count
  1       1           0            22
  0       1           0           112
  1       0           0            32
  0       0           0           153
  1       1           1            39
  0       1           1           146
  1       0           1            67
  0       0           1           179

(In what follows, assume that the data have been expanded on “count”.)


Crude Analysis Using Stata

. cc will know    /* Combined Data */

                 | Exposed   Unexposed |      Total    Proportion exposed
-----------------+---------------------+---------------------------------
           Cases |      61          99 |        160    0.3812
        Controls |     258         332 |        590    0.4373
-----------------+---------------------+---------------------------------
           Total |     319         431 |        750    0.4253
                 |                     |
                 |   Point estimate    | [95% Conf. Interval]
                 |---------------------+----------------------
      Odds ratio |      .7928901       |  .5440409   1.150404  (exact)
 Prev. frac. ex. |      .2071099       | -.1504037   .4559591  (exact)
 Prev. frac. pop |      .0905667       |
                 +----------------------------------------------
                     chi2(1) = 1.62    Pr>chi2 = 0.2035

. logistic will know

Logistic regression                               Number of obs =        750
                                                  LR chi2(1)    =       1.63
                                                  Prob > chi2   =     0.2018
Log likelihood = -387.94                          Pseudo R2     =     0.0021

------------------------------------------------------------------------------
        will | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        know |   .7928901   .1448677    -1.27   0.204     .5542317    1.134318
------------------------------------------------------------------------------

Possible Approaches/Models

Adjustment for confounding by stratification:
• Mantel-Haenszel estimate of a common OR
• Logistic regression using an additive model on the logit scale:
    logit(π(X)) = β0 + β1 KNOW + β2 COHORT

Effect modification or interaction:
• Separate odds ratio estimates for each stratum
• Logistic regression using an interaction model:
    logit(π(X)) = β0 + β1 KNOW + β2 COHORT + β3 KNOW × COHORT

Adjusted Analyses Using Stata

/* cohort as confounder */
. cc will know, by(cohort) bd

      cohort |        OR   [95% Conf. Interval]   M-H Weight
-------------+----------------------------------------------
           0 |  .9391741   .4916112   1.769653    11.23511  (exact)
           1 |  .7136577   .4413983   1.145675    22.69606  (exact)
-------------+----------------------------------------------
       Crude |  .7928901   .5440409   1.150404    (exact)
M-H combined |  .7883295   .5504114   1.129089
-------------+----------------------------------------------
Test of homogeneity (M-H)  chi2(1) = 0.52   Pr>chi2 = 0.4711
Test of homogeneity (B-D)  chi2(1) = 0.52   Pr>chi2 = 0.4708

Test that combined OR = 1:  Mantel-Haenszel chi2(1) = 1.69   Pr>chi2 = 0.1941

. logistic will know cohort

Logistic regression                               Number of obs =        750
                                                  LR chi2(2)    =       8.24
                                                  Prob > chi2   =     0.0163
Log likelihood = -384.63529                       Pseudo R2     =     0.0106

------------------------------------------------------------------------------
        will | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        know |   .7879471   .1445962    -1.30   0.194     .5499117    1.129019
      cohort |   1.605702    .299915     2.54   0.011     1.113465    2.315546
------------------------------------------------------------------------------

. logit

Logit estimates                                   Number of obs =        750
                                                  LR chi2(2)    =       8.24
                                                  Prob > chi2   =     0.0163
Log likelihood = -384.63529                       Pseudo R2     =     0.0106

------------------------------------------------------------------------------
        will |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        know |  -.2383243   .1835101    -1.30   0.194    -.5979975    .1213488
      cohort |   .4735612   .1867812     2.54   0.011     .1074767    .8396457
       _cons |  -1.495195   .1650607    -9.06   0.000    -1.818708   -1.171682
------------------------------------------------------------------------------


Parameter Interpretation: “Additive” Model (on the logit scale)

Model: logit[π(X)] = β0 + β1X1 + β2X2

  X1   X2   logit(π(X))
   0    0   β0
   1    0   β0 + β1
   0    1   β0 + β2
   1    1   β0 + β1 + β2

• Odds ratios for willingness to participate:

  Odds(X1=1, X2=0) / Odds(X1=0, X2=0) = exp(β1):  log odds ratio for KNOW, previous study = β1
  Odds(X1=1, X2=1) / Odds(X1=0, X2=1) = exp(β1):  log odds ratio for KNOW, new recruits = β1

Adjusted Analyses Using Stata: Additive Model

. predict logodds, xb
. gen estprob=invlogit(logodds)
. predict stderr, stdp
. gen lodds_lb=logodds-1.96*stderr
. gen lodds_ub=logodds+1.96*stderr
. gen prob_lb=invlogit(lodds_lb)
. gen prob_ub=invlogit(lodds_ub)
. collapse (mean) estprob lodds_lb lodds_ub prob_lb prob_ub, by(will know cohort)
. list will know cohort estprob prob_lb prob_ub, noobs

  +---------------------------------------------------------+
  | will   know   cohort    estprob    prob_lb    prob_ub   |
  |---------------------------------------------------------|
  |    0      0        0   .1831433   .1395883   .2365522   |
  |    0      1        0    .150138   .1089749   .2033019   |
  |    1      0        0   .1831433   .1395883   .2365522   |
  |    1      1        0    .150138   .1089749   .2033019   |
  |    0      0        1   .2647093   .2168735   .3188011   |
  |    0      1        1   .2209811   .1725128   .2784841   |
  |    1      0        1   .2647093   .2168735   .3188011   |
  |    1      1        1   .2209811   .1725128   .2784841   |
  +---------------------------------------------------------+

Q: Suppose I gave you the logit output on slide 173 and asked you to estimate π(1,0).
A: logit(π̂(1,0)) = −1.495 − .238 = −1.733  ⟹  π̂(1,0) = .150
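All four fitted probabilities from the additive model follow the same recipe as the π(1,0) calculation; a Python sketch using the logit-scale coefficients (the invlogit helper mirrors Stata's invlogit()):

```python
from math import exp

def invlogit(z):
    # inverse logit, as in Stata's invlogit()
    return 1.0 / (1.0 + exp(-z))

# coefficients from the additive logit fit
b0, b_know, b_cohort = -1.495195, -0.2383243, 0.4735612

probs = {}
for know in (0, 1):
    for cohort in (0, 1):
        probs[(know, cohort)] = invlogit(b0 + b_know * know + b_cohort * cohort)

print({k: round(v, 3) for k, v in probs.items()})
```

These reproduce the estprob column: .183, .150, .265, .221.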

Parameter Interpretation: Effect Modification

Model: logit[π(X)] = β0 + β1X1 + β2X2 + β3X1X2

  X1   X2   logit(π(X))
   0    0   β0
   1    0   β0 + β1
   0    1   β0 + β2
   1    1   β0 + β1 + β2 + β3

• Odds ratios for willingness to participate:

  Odds(X1=1, X2=0) / Odds(X1=0, X2=0) = exp(β1):  log odds ratio for knowledge, previous study = β1
  Odds(X1=1, X2=1) / Odds(X1=0, X2=1) = exp(β1 + β3):  log odds ratio for knowledge, new recruits = β1 + β3

Effect Modification Model

. xi: logistic will i.know*i.cohort

Logistic regression                               Number of obs =        750
                                                  LR chi2(3)    =       8.76
                                                  Prob > chi2   =     0.0327
Log likelihood = -384.37645                       Pseudo R2     =     0.0113

------------------------------------------------------------------------------
        will | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Iknow_1 |   .9391741   .2851272    -0.21   0.836     .5179965    1.702807
  _Icohort_1 |    1.78963   .4321054     2.41   0.016     1.114913    2.872668
_IknoXcoh_~1 |    .759878   .2895239    -0.72   0.471     .3601011    1.603479
------------------------------------------------------------------------------

. logit

------------------------------------------------------------------------------
        will |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Iknow_1 |  -.0627544   .3035936    -0.21   0.836    -.6577869    .5322781
  _Icohort_1 |   .5820088   .2414496     2.41   0.016     .1087763    1.055241
_IknoXcoh_~1 |  -.2745974   .3810136    -0.72   0.471     -1.02137    .4721756
       _cons |  -1.564702   .1943861    -8.05   0.000    -1.945692   -1.183712
------------------------------------------------------------------------------

Notes:
1) The interaction (β3) is not significant: p = 0.47.
2) Care must be taken in interpreting the main effects (β1 and β2) in a model with interactions.


Effect Modification Model

Calculate estimated probabilities and CIs as before:

. list will know cohort estprob prob_lb prob_ub, noobs

  +---------------------------------------------------------+
  | will   know   cohort    estprob    prob_lb    prob_ub   |
  |---------------------------------------------------------|
  |    0      0        0    .172973   .1250231   .2343866   |
  |    0      1        0   .1641791   .1106093   .2367862   |
  |    1      0        0    .172973   .1250231   .2343866   |
  |    1      1        0   .1641791   .1106093   .2367862   |
  |    0      0        1   .2723577   .2203884   .3313729   |
  |    0      1        1   .2108108   .1579798   .2755281   |
  |    1      0        1   .2723577   .2203884   .3313729   |
  |    1      1        1   .2108108   .1579798   .2755281   |
  +---------------------------------------------------------+

Q: Suppose I gave you the logit output on slide 177 and asked you to estimate π(1,1).
A: logit(π̂(1,1)) = −1.565 − .063 + .582 − .274 = −1.320  ⟹  π̂(1,1) = .211

Logistic Regression vs Mantel-Haenszel: Effect Modification

. cc will know, by(cohort) bd

      cohort |        OR   [95% Conf. Interval]   M-H Weight
-------------+----------------------------------------------
  old cohort |  .9391741   .4916112   1.769653    11.23511  (exact)
  new cohort |  .7136577   .4413983   1.145675    22.69606  (exact)
-------------+----------------------------------------------
       Crude |  .7928901   .5440409   1.150404    (exact)
M-H combined |  .7883295   .5504114   1.129089
-------------+----------------------------------------------
Test of homogeneity (M-H)  chi2(1) = 0.52   Pr>chi2 = 0.4711
Test of homogeneity (B-D)  chi2(1) = 0.52   Pr>chi2 = 0.4708

Test that combined OR = 1:  Mantel-Haenszel chi2(1) = 1.69   Pr>chi2 = 0.1941

. xi: logistic will i.know*i.cohort

Logistic regression                               Number of obs =        750
                                                  LR chi2(3)    =       8.76
                                                  Prob > chi2   =     0.0327
Log likelihood = -384.37645                       Pseudo R2     =     0.0113

------------------------------------------------------------------------------
        will | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Iknow_1 |   .9391741   .2851272    -0.21   0.836     .5179965    1.702807
  _Icohort_1 |    1.78963   .4321054     2.41   0.016     1.114913    2.872668
_IknoXcoh_~1 |    .759878   .2895239    -0.72   0.471     .3601011    1.603479
------------------------------------------------------------------------------

Q: What is the analogue of the B-D test for homogeneity?

Additive vs Effect Modification Model

Estimated probabilities:

  X1   X2   Additive   Interaction
   0    0     .183        .173
   1    0     .150        .164
   0    1     .265        .272
   1    1     .221        .211

Estimated odds ratios:

  Contrast                  Additive   Interaction
  X1=1 vs X1=0; X2=0          .788        .939
  X1=1 vs X1=0; X2=1          .788        .714

Notes:
1) In the additive model, the estimated probabilities do not exactly match the observed probabilities for the 4 combinations of X1 and X2.
2) Here, the effect modification model is an example of a “saturated” model, so estimated probabilities exactly match the observed probabilities.


Regression or Stratification?

“Recently there has been some dispute between ‘modellers’, who support the use of regression models, and ‘stratifiers’ who argue for a return to the methods described in Part I of this book. Logically this dispute is based on a false distinction – there is no real difference between the methods. In practice the difference lies in the inflexibility of the older methods which thereby imposes a certain discipline on the analyst. Firstly, since stratification methods treat exposures and confounders differently, any change in the role of a variable requires a new set of computations. This forces us to keep in touch with the underlying scientific questions. Secondly, since strata must be defined by cross classification, relatively few confounders can be dealt with and we are forced to control only for confounders of a priori importance. These restraints can be helpful in keeping a data analyst on the right track but once the need for such a discipline is recognized, there are significant advantages to the regression modelling approach.”
— Clayton and Hills (1993), Statistical Methods in Epidemiology, page 273

SUMMARY

1. With two binary covariates we can model 4 probabilities:
   π(X1=0, X2=0), π(X1=1, X2=0), π(X1=0, X2=1), π(X1=1, X2=1)
2. We model two odds ratios:
   Odds(X1=1, X2=0) / Odds(X1=0, X2=0)
   Odds(X1=1, X2=1) / Odds(X1=0, X2=1)
3. The “interaction” coefficient β3 is the difference between these log odds ratios.
4. How do we estimate and interpret coefficients in the simpler model that doesn’t contain the interaction term, X1X2?

SUMMARY: Additive Model

Table of odds:
              X2 = 1            X2 = 0
  X1 = 1   exp(β0+β1+β2)     exp(β0+β1)
  X1 = 0   exp(β0+β2)        exp(β0)

Table of odds ratios (relative to X1=0, X2=0):
              X2 = 1            X2 = 0
  X1 = 1   exp(β1)exp(β2)    exp(β1)
  X1 = 0   exp(β2)           1.0

SUMMARY: Interaction Model

Table of odds:
              X2 = 1               X2 = 0
  X1 = 1   exp(β0+β1+β2+β3)     exp(β0+β1)
  X1 = 0   exp(β0+β2)           exp(β0)

Table of odds ratios (relative to X1=0, X2=0):
              X2 = 1               X2 = 0
  X1 = 1   exp(β1+β2+β3)        exp(β1)
  X1 = 0   exp(β2)              1.0


Logistic Regression – by the way…

. logistic will know cohort

Logistic regression                               Number of obs =        750
                                                  LR chi2(2)    =       8.24
                                                  Prob > chi2   =     0.0163
Log likelihood = -384.63529                       Pseudo R2     =     0.0106

------------------------------------------------------------------------------
        will | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        know |   .7879471   .1445962    -1.30   0.194     .5499117    1.129019
      cohort |   1.605702    .299915     2.54   0.011     1.113465    2.315546
------------------------------------------------------------------------------

. logistic know will cohort

Logistic regression                               Number of obs =        750
                                                  LR chi2(2)    =       1.77
                                                  Prob > chi2   =     0.4133
Log likelihood = -510.58282                       Pseudo R2     =     0.0017

------------------------------------------------------------------------------
        know | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        will |    .787947   .1445962    -1.30   0.194     .5499117    1.129019
      cohort |    1.05728   .1588732     0.37   0.711     .7875594    1.419373
------------------------------------------------------------------------------

Note: the estimated OR relating will and know is the same whichever of the two is treated as the outcome.

The “dreaded” zero

Suppose the previous-study stratum had contained no willing, high-knowledge subjects:

X2 = 0:
             Y = 1   Y = 0
  X1 = 1       0      112     112
  X1 = 0      32      153     185
              32      265     297

X2 = 1:
             Y = 1   Y = 0
  X1 = 1      39      146     185
  X1 = 0      67      179     246
             106      325     431

  P(Y=1 | X1=1, X2=0) = π(1, 0) = exp(β0+β1) / [1 + exp(β0+β1)] = 0

Q: What value of β0 + β1 will give π(1,0) = 0?
• Implication for confounder adjustment in a case-control study?
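To see why the zero cell pushes the fit toward the boundary: π(1,0) = expit(β0+β1) equals 0 only in the limit β0+β1 → −∞, so the fitted value just keeps drifting downward. A small numeric illustration (Python; the value −18.79 is roughly β̂0 + β̂1 from the [freq=count] fit shown on the next slide):

```python
from math import exp

def expit(z):
    return 1.0 / (1.0 + exp(-z))

# pi(1,0) = expit(b0 + b1) equals 0 only as b0 + b1 -> -infinity
for s in (-2.0, -5.0, -10.0, -18.79):
    print(s, expit(s))  # monotonically approaching 0, never reaching it
```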

Spring 2013 Biostat 513 186 Spring 2013 Biostat 513 187 The Logistic Regression Model The “dreaded” zero Formally Defined

. logit will know cohort [freq=count] The binomial model that we use for binary outcome data can be specified Logistic regression Number of obs = 728 LR chi2(2) = 36.75 by the following statements: Prob > chi2 = 0.0000 Log likelihood = -335.13143 Pseudo R2 = 0.0520

------will | Coef. Std. Err. z P>|z| [95% Conf. Interval] Systematic component: PY(= 1) X = π () X ------+------know | -.75103 .2100932 -3.57 0.000 -1.162805 -.339255 logit[π (X )] = ββ + XX ++ β cohort | 1.047106 .2202911 4.75 0.000 .6153435 1.478869 0 11 pp _cons | -1.880522 .1952178 -9.63 0.000 -2.263142 -1.497902 ------

. xi:logit will i.know*i.cohort [freq=count] Logistic regression Number of obs = 728 Random component: Y ~ Bin (1, π ( X )) LR chi2(3) = 57.93 Prob > chi2 = 0.0000 Log likelihood = -324.54073 Pseudo R2 = 0.0819 Assume that Y’s are independent ------will | Coef. Std. Err. z P>|z| [95% Conf. Interval] ------+------_Iknow_1 | -17.22683 .2302223 -74.83 0.000 -17.67805 -16.7756 The statement ~ 1, means: _Icohort_1 | .5820088 .2414496 2.41 0.016 .1087763 1.055241 _IknoXcoh_~1 | 16.88948 . . . . . _cons | -1.564702 .1943861 -8.05 0.000 -1.945692 -1.183712 “the variable Y is distributed as a Binomial random variable with n = 1 ------푌 퐵퐵퐵 휋 푋 Note: 112 failures and 0 successes completely determined. “trials” and success probability π(X)” This special case of the Binomial is • Point estimates “okay”; SE and CI not okay called a Bernoulli random variable. • Use exact logistic regression - Stata exlogistic


Using the Logistic Regression Model

Evaluation of the appropriateness of statistical models is an important feature of data analysis. For logistic regression this includes checking two components.

SYSTEMATIC:
• Which X’s should be included in the model? The choice of variables depends on the goals of the analysis:
  - Prediction
  - Hypothesis generating (EDA)
  - Hypothesis confirmation (analytic study)
• Appropriate model forms (linear, quadratic, smooth) for continuous variables.

RANDOM:
• Are the observations independent?

Note: there is less concern about “error” terms than with linear regression.

Violations of Independence

Examples:
• “Longitudinal data” – repeated outcomes (over time) from a sample of individuals.
• Clustered sampling (e.g. select female rats and measure an outcome on each pup in a litter).
• Community randomized studies (analyze at the unit of randomization).
• “Multilevel data” – e.g. patients, nested within doctors, nested within hospitals.

Formally: after knowing the covariates, X, there is still some information (a variable W) that leads one to expect that outcomes Yi and Yj will be more similar if they share the same value of W than if they do not. This variable, W, is a “cluster” identifier.

Examples:
• Longitudinal data – subject ID
• Multilevel data – doctor name, hospital name

Impacts of Dependence

1. In general, β̂ is consistent (dependence doesn't affect modeling of the means, π(X)).
2. Estimates of precision (standard errors of β̂) may be wrong. They may be too large or too small depending on the situation.
3. Since the standard errors are not valid (the random part of the model is mis-specified) we will have incorrect inference (hypothesis tests, confidence intervals).

Corrections for Dependence

1. Finesse it away! Take a summary for each person or cluster (if reasonable).
2. Find methods that can handle it appropriately:
• GEE methods (Liang and Zeger, 1986)
  SAS: proc genmod; STATA: xtgee, xtlogit; R/S: gee functions
• Mixed Models / Multilevel Models
  SAS: proc mixed, nlmixed, glimmix; STATA: xtlogit, gllamm, cluster option; R/S: lmer, nlmer

Many of these topics are covered in Biostat 536 (Autumn quarter) and 540 (Spring Quarter).


SUMMARY

1. The RANDOM part of the logistic model is simple: binary outcomes.
2. The assumption of independence is important.
3. Deciding which X's to include in the model depends on the scientific question you are asking.
4. The logistic regression coefficient estimate is obtained via maximum likelihood.
5. We will discuss how the maximized log likelihood, log L, is used for statistical inference.

Wald Tests

Most statistical packages produce tables of the form:

  parameter   estimate   s.e.
  β0          β̂0         s0
  β1          β̂1         s1
  β2          β̂2         s2
  …           …          …
  βp          β̂p         sp

from which we can obtain the following CIs and "Wald tests":
• β̂j ± 1.96 sj is a 95% confidence interval for βj.
• 2 × P[Z > |β̂j / sj|] = p-value for testing H0: βj = 0.
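The Wald interval and test are easy to reproduce by hand from any such table. A Python sketch (the coefficient and standard error below are illustrative values, not tied to any particular fit; the normal tail is computed via the complementary error function so no extra packages are needed):

```python
import math

def wald(beta_hat, se):
    """95% Wald CI and two-sided p-value for H0: beta = 0."""
    z = beta_hat / se
    ci = (beta_hat - 1.96 * se, beta_hat + 1.96 * se)
    # Two-sided normal p-value 2*P(Z > |z|); erfc(|z|/sqrt(2)) = 2*(1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, ci, p

z, ci, p = wald(0.934421, 0.2149493)  # illustrative estimate and s.e.
```

With these inputs the interval is roughly (0.513, 1.356) and the p-value is far below 0.001.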

Q: What about hypotheses on multiple parameters e.g. H0: β1 = β2 = … = 0? A: 1) Extend Wald test to multivariate situation 2) Likelihood ratio tests

"Nested" Models

When a scientific hypothesis can be formulated in terms of restrictions on a set of parameters (e.g. some β's equal to 0), we can formulate a pair of models: one that imposes the restriction (null model) and one that does not impose the restriction (alternative model). For example:

Model 1: logit[π(X)] = β0 + β1X1
Model 2: logit[π(X)] = β0 + β1X1 + β2X2 + β3X3

• Model 1 ("reduced") is a special case of Model 2 ("full").
• Model 1 is said to be nested within Model 2.
• Model 1 has a subset of the variables contained in Model 2.
• Model 1 is formed from Model 2 by the constraint H0: β2 = β3 = 0.

By looking at the relative goodness-of-fit (as measured by log-likelihood) of these two models we can judge whether the additional flexibility in Model 2 is important.

Likelihood Ratio Tests

We can use the maximum likelihood fits from nested models to test whether the "difference" between these models is significant. From the fitting of these models we obtain maximized log likelihoods:

Model 1: log L1
Model 2: log L2

We can then use the likelihood ratio statistic

  LR = 2 × (log L2 − log L1)

which under the null hypothesis has a χ²(d) distribution, where d is the difference in the number of parameters for the two models.


Likelihood Ratio Tests: STATA Example

HIV vaccine trial data: willingness to participate in an HIV vaccine trial.

. logistic will cohort
Logistic regression                       Number of obs =    750
                                          LR chi2(1)    =   6.54
                                          Prob > chi2   = 0.0106
Log likelihood = -385.48728               Pseudo R2     = 0.0084
------------------------------------------------------------------------------
        will | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      cohort |    1.60057   .2985952     2.52   0.012     1.110397    2.307124
------------------------------------------------------------------------------

. estimates store model1

. logistic will cohort know
Logistic regression                       Number of obs =    750
                                          LR chi2(2)    =   8.24
                                          Prob > chi2   = 0.0163
Log likelihood = -384.63529               Pseudo R2     = 0.0106
------------------------------------------------------------------------------
        will | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      cohort |   1.605702    .299915     2.54   0.011     1.113465    2.315546
        know |   .7879471   .1445962    -1.30   0.194     .5499117    1.129019
------------------------------------------------------------------------------

. estimates store model2

. lrtest model2 model1
Likelihood-ratio test                            LR chi2(1) =  1.70
(Assumption: model1 nested in model2)            Prob > chi2 = 0.1918

Logistic Regression: More Complex Covariates

Types of covariates:
• Nominal/qualitative variable (unordered categories)
  o stratified adjustment with dummy variables:

              X1   X2   logit(π(X))
   Stratum 1   0    0   β0
   Stratum 2   1    0   β0 + β1
   Stratum 3   0    1   β0 + β2

• Ordinal variable (ordered categories)
  o stratified adjustment via dummy variables
  o linear adjustment with an assigned score – implies a linear increase in the log odds ratio across categories (multiplicative increase in the odds ratio):

   Xj = 1 (stratum 1), 2 (stratum 2), 3 (stratum 3)

  o linear + quadratic or more complicated adjustment using scores
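The LR statistic that lrtest reports can be reproduced directly from the two maximized log likelihoods. A Python sketch using the values above (the 1-df χ² tail area is computed via erfc, since for 1 df it equals P(|Z| > √x)):

```python
import math

def lr_statistic(loglik_reduced, loglik_full):
    """Likelihood ratio statistic for nested models: 2*(log L_full - log L_reduced)."""
    return 2.0 * (loglik_full - loglik_reduced)

def chi2_sf_1df(x):
    """P(chi-square with 1 df > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2.0))

lr = lr_statistic(-385.48728, -384.63529)  # model1 vs model2 log likelihoods
p = chi2_sf_1df(lr)
```

This reproduces Stata's LR chi2(1) = 1.70, Prob > chi2 = 0.1918.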

Logistic Regression: More Complex Covariates

• Quantitative (continuous) variable.
  o Group and then treat as an ordinal variable, e.g.
      AGE = 1 (<40 years), 2 (40-49 years), 3 (50-59 years), 4 (60+ years)
  o Group and treat as nominal (more flexibility).
  o Linear fit to measured values – implies the log odds increases by β for each one-unit change in the covariate.
  o Quadratic or more complicated adjustment?

The Framingham Study

• 5209 subjects identified in 1948 in a small Massachusetts town
• Biennial exams for BP, serum cholesterol, relative weight
• Major endpoints include occurrence of coronary heart disease (CHD) and deaths from
  - CHD including sudden death (MI)
  - Cerebrovascular accident (CVA)
  - Cancer (CA)
  - Other causes
• We will look at follow-up from the first twenty years.


The Framingham Study

"It is the function of longitudinal studies, like that of coronary heart disease in Framingham, to investigate the effects of a large variety of variables, both singly and jointly, on the effects of disease. The traditional approach of the epidemiologist, multiple cross-classification, quickly becomes impracticable as the number of variables to be investigated increases. Thus if 10 variables are under consideration, and each is to be studied at only 3 levels … there would be 59,049 cells in the multiple cross-classification." Truett, Cornfield & Kannel, 1967, p 511.

. infile lexam surv cause cexam chd cva ca oth sex age ht wt sc1 sc2 dbp sbp mrw smoke using "C:\fram.dat"
. mvdecode ht wt sc1 sc2 mrw smoke, mv(-1)
ht: 6 missing values generated
wt: 6 missing values generated
sc1: 2037 missing values generated
sc2: 626 missing values generated
mrw: 6 missing values generated
smoke: 36 missing values generated

. summarize
    Variable |   Obs       Mean    Std. Dev.    Min    Max
-------------+--------------------------------------------
       lexam |  5209   14.03532    3.449479      2      16
        surv |  5209   .3822231    .4859773      0       1
       cause |  5209   1.563064    2.368246      0       9
       cexam |  5209   2.561912    4.739291      0      16
         chd |  5209   .1161451    .3204296      0       1
         cva |  5209   .0725667    .2594489      0       1
          ca |  5209   .1034748    .3046072      0       1
         oth |  5209   .2660779    .4419479      0       1
         sex |  5209   1.551545    .4973837      1       2
         age |  5209   44.06873    8.574954     28      62
          ht |  5203   64.81318    3.582707   51.5    76.5
          wt |  5203   153.0867    28.91543     67     300
         sc1 |  3172   221.2393    45.01786     96     503
         sc2 |  4583   228.1778    44.81669    115     568
         dbp |  5209   85.35861    12.97309     50     160
         sbp |  5209   136.9096     23.7396     82     300
         mrw |  5203   119.9575     19.9834     67     268
       smoke |  5173   9.366518    12.03145      0      60

High Blood Pressure and CHD

Outcome: CHD in the first 20 years of the Framingham study

Exposure:
  BPHI = 0 if SBP < 167 mm Hg
         1 if SBP ≥ 167 mm Hg

Potential confounder:
  AGE = 1 (40-44 years), 2 (45-49 years), 3 (50-54 years), 4 (55-59 years), 5 (60+ years)

The Framingham Study

Restrict analysis to males, 40+, with known values of baseline serum cholesterol, smoking and relative weight, who had no evidence of CHD at the first exam:

. drop if sex>1 | age<40 | sc1==. | mrw==. | cexam==1 | smoke==.

Create categorical variables for BP and age:

. gen bpg=sbp
. recode bpg min/126=1 127/146=2 147/166=3 167/max=4
. gen agp=age
. recode agp 40/44=1 45/49=2 50/54=3 55/59=4 60/62=5

. tab chd agp
           |                    agp
       chd |      1       2       3       4       5 |  Total
-----------+----------------------------------------+-------
         0 |    226     174     153     138      41 |    732
         1 |     41      35      48      38      16 |    178
-----------+----------------------------------------+-------
     Total |    267     209     201     176      57 |    910

. tab chd bpg
           |               bpg
       chd |      1       2       3       4 |  Total
-----------+--------------------------------+-------
         0 |    213     302     140      77 |    732
         1 |     32      63      42      41 |    178
-----------+--------------------------------+-------
     Total |    245     365     182     118 |    910


Framingham: Logistic Regression Model

• Model 0: first fit the "null" model (constant term only), logit[π(X)] = β0

. logit chd
Iteration 0: log likelihood = -449.76578
Iteration 1: log likelihood = -449.76578
Logistic regression                       Number of obs =    910
                                          LR chi2(0)    =   0.00
                                          Prob > chi2   =      .
Log likelihood = -449.76578               Pseudo R2     = 0.0000
------------------------------------------------------------------------------
         chd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |  -1.413997   .0835709   -16.92   0.000    -1.577793   -1.250201
------------------------------------------------------------------------------

. estimates store model0

• Model 1: next fit a model with binary BP only, logit[π(X)] = β0 + β1·BPHI

. gen bphi=bpg
. recode bphi 1/3=0 4=1
. logit chd bphi
Logistic regression                       Number of obs =    910
                                          LR chi2(1)    =  17.55
                                          Prob > chi2   = 0.0000
Log likelihood = -440.99047               Pseudo R2     = 0.0195
------------------------------------------------------------------------------
         chd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        bphi |    .934421   .2149493     4.35   0.000      .513128    1.355714
       _cons |  -1.564654   .0939467   -16.65   0.000    -1.748787   -1.380522
------------------------------------------------------------------------------

. estimates store model1

• Model 2: next fit a model with age categories only,
  logit[π(X)] = β0 + Σ_{j=1}^{4} βj AGP(j+1)

. xi: logit chd i.agp
Logistic regression                       Number of obs =    910
                                          LR chi2(4)    =   9.38
                                          Prob > chi2   = 0.0522
Log likelihood = -445.07467               Pseudo R2     = 0.0104
------------------------------------------------------------------------------
         chd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Iagp_2 |   .1032557    .251264     0.41   0.681    -.3892126     .595724
     _Iagp_3 |    .547726   .2370323     2.31   0.021     .0831513    1.012301
     _Iagp_4 |   .4172954   .2497543     1.67   0.095    -.0722139    .9068048
     _Iagp_5 |   .7659796   .3401548     2.25   0.024     .0992885    1.432671
       _cons |  -1.706963   .1697499   -10.06   0.000    -2.039667   -1.374259
------------------------------------------------------------------------------

. est store model2
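Stata's logit reports log-odds coefficients while logistic reports odds ratios; exponentiating the coefficient and its CI endpoints converts one scale to the other. A Python sketch using the Model 1 bphi row above:

```python
import math

def coef_to_or(beta_hat, se, z_crit=1.96):
    """Convert a log-odds coefficient and s.e. to an odds ratio with a 95% CI."""
    return (math.exp(beta_hat),
            math.exp(beta_hat - z_crit * se),
            math.exp(beta_hat + z_crit * se))

or_bphi, lo, hi = coef_to_or(0.934421, 0.2149493)  # Model 1: chd ~ bphi
```

This gives the unadjusted high-BP odds ratio of about 2.55 with 95% CI roughly (1.67, 3.88).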

Framingham: Logistic Regression Model

• LR test for the (unadjusted) effect of high blood pressure on CHD:

. lrtest model0 model1
Likelihood-ratio test                            LR chi2(1) = 17.55
(Assumption: model0 nested in model1)            Prob > chi2 = 0.0000

• LR test for the age effect, H0: β1 = β2 = β3 = β4 = 0:

. lrtest model0 model2
Likelihood-ratio test                            LR chi2(4) =  9.38
(Assumption: model0 nested in model2)            Prob > chi2 = 0.0522

• It is also possible to test H0: β1 = β2 = β3 = β4 = 0 using a Wald-type test (the command must immediately follow the model fitting):

. test _Iagp_2 _Iagp_3 _Iagp_4 _Iagp_5
 ( 1) _Iagp_2 = 0
 ( 2) _Iagp_3 = 0
 ( 3) _Iagp_4 = 0
 ( 4) _Iagp_5 = 0
          chi2( 4) =  9.38
        Prob > chi2 = 0.0523

• Model 3: BP and age groups

. xi: logit chd bphi i.agp, or
Logistic regression                       Number of obs =    910
                                          LR chi2(5)    =  24.93
                                          Prob > chi2   = 0.0001
Log likelihood = -437.29993               Pseudo R2     = 0.0277
------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        bphi |   2.429429   .5289034     4.08   0.000     1.585593    3.722345
     _Iagp_2 |   1.061081   .2689046     0.23   0.815     .6457025     1.74367
     _Iagp_3 |   1.626755   .3897112     2.03   0.042     1.017198    2.601591
     _Iagp_4 |   1.353106   .3438848     1.19   0.234     .8222487    2.226692
     _Iagp_5 |   2.006542   .6909787     2.02   0.043     1.021706    3.940674
------------------------------------------------------------------------------

. estimates store model3
. lrtest model2 model3
Likelihood-ratio test                            LR chi2(1) = 15.55
(Assumption: model2 nested in model3)            Prob > chi2 = 0.0001
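The 4-df age test above can be checked by hand: for an even number of degrees of freedom the χ² tail probability has a closed form, P(χ²(2k) > x) = e^(−x/2) Σ_{i<k} (x/2)^i / i!. A Python sketch reproducing the lrtest model0 model2 result from the two log likelihoods:

```python
import math

def chi2_sf_even(x, df):
    """Survival function of a chi-square with an even number of df (Poisson series)."""
    assert df % 2 == 0
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

lr = 2.0 * (-445.07467 - (-449.76578))  # 2*(log L_model2 - log L_model0)
p = chi2_sf_even(lr, 4)
```

This matches Stata's LR chi2(4) = 9.38, Prob > chi2 = 0.0522.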


Framingham: Logistic Regression Model

• Model 3 on the coefficient scale:

. xi: logit chd bphi i.agp
Logistic regression                       Number of obs =    910
                                          LR chi2(5)    =  24.93
                                          Prob > chi2   = 0.0001
Log likelihood = -437.29993               Pseudo R2     = 0.0277
------------------------------------------------------------------------------
         chd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        bphi |   .8876561   .2177069     4.08   0.000     .4609584    1.314354
     _Iagp_2 |   .0592878   .2534252     0.23   0.815    -.4374165    .5559921
     _Iagp_3 |   .4865875   .2395635     2.03   0.042     .0170517    .9561233
     _Iagp_4 |   .3024023   .2541448     1.19   0.234    -.1957124    .8005171
     _Iagp_5 |   .6964127    .344363     2.02   0.043     .0214736    1.371352
       _cons |  -1.797441   .1729173   -10.39   0.000    -2.136352   -1.458529
------------------------------------------------------------------------------

. predict logit3, xb

Model 3:  logit[π(X)] = β0 + β1·BPHI + Σ_{j=2}^{5} βj AGP(j)

logit structure:

  AGE           SBP<167      SBP≥167
  40-44 years   β0           β0 + β1
  45-49 years   β0 + β2      β0 + β1 + β2
  50-54 years   β0 + β3      β0 + β1 + β3
  55-59 years   β0 + β4      β0 + β1 + β4
  60+ years     β0 + β5      β0 + β1 + β5
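The fitted values that predict logit3, xb produces can be reconstructed by summing the relevant cells of the logit-structure table; applying the expit transform then gives a fitted CHD probability. A sketch for a man aged 60+ with SBP ≥ 167 (coefficients copied from the Model 3 output; the helper name is ours):

```python
import math

b0, b_bphi = -1.797441, 0.8876561                      # _cons and bphi
b_agp = {2: 0.0592878, 3: 0.4865875, 4: 0.3024023, 5: 0.6964127}

def model3_logit(bphi, agp):
    """Linear predictor for Model 3: b0 + b1*BPHI + age-group dummy."""
    return b0 + b_bphi * bphi + (b_agp[agp] if agp >= 2 else 0.0)

xb = model3_logit(bphi=1, agp=5)        # 60+ years, SBP >= 167
prob = 1.0 / (1.0 + math.exp(-xb))      # fitted probability of CHD
```

Here xb = β0 + β1 + β5 ≈ −0.213, i.e. a fitted CHD probability of about 0.45.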

Framingham: Logistic Regression Model

. twoway (scatter logit3 agp if bphi==0, mcol(red) msize(large) msymbol(circle) mlwidth(medthick)) (scatter logit3 agp if bphi==1, msize(large) msymbol(circle) mlwidth(medthick) mcol(green)), ytitle(logit(P(X))) xtitle(Age Group) legend(off) scheme(s1mono) text(-1.2 4.8 "BPHI=0") text(-.3 4.8 "BPHI=1")

[Figure: fitted logits from Model 3 plotted against age group (1-5), one series for BPHI=0 and one for BPHI=1; the two series are parallel on the logit scale.]

Model 4:  logit[π(X)] = β0 + β1·BPHI + Σ_{j=2}^{5} βj AGP(j) + Σ_{k=2}^{5} β_{k+4} BPHI·AGP(k)

logit structure:

  AGE           SBP<167      SBP≥167
  40-44 years   β0           β0 + β1
  45-49 years   β0 + β2      β0 + β1 + β2 + β6
  50-54 years   β0 + β3      β0 + β1 + β3 + β7
  55-59 years   β0 + β4      β0 + β1 + β4 + β8
  60+ years     β0 + β5      β0 + β1 + β5 + β9


Framingham: Logistic Regression Model

. xi: logit chd i.bphi*i.agp
Logistic regression                       Number of obs =    910
                                          LR chi2(9)    =  26.40
                                          Prob > chi2   = 0.0018
Log likelihood = -436.56689               Pseudo R2     = 0.0293
------------------------------------------------------------------------------
         chd | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Ibphi_1 |   1.822917   .9901638     1.11   0.269     .6286587    5.285897
     _Iagp_2 |   1.047009   .2860925     0.17   0.866     .6128603    1.788706
     _Iagp_3 |   1.490268   .3897851     1.53   0.127     .8925433     2.48828
     _Iagp_4 |   1.369565    .383166     1.12   0.261     .7914808    2.369873
     _Iagp_5 |   1.734234   .6727305     1.42   0.156     .8108049    3.709361
_IbphXagp_~2 |   1.188571   .8698337     0.24   0.813     .2831975    4.988399
_IbphXagp_~3 |   1.744653   1.196498     0.81   0.417     .4549333    6.690683
_IbphXagp_~4 |    1.11746   .7702859     0.16   0.872     .2893899    4.315001
_IbphXagp_~5 |   2.306494   2.142652     0.90   0.368     .3734357    14.24586
------------------------------------------------------------------------------

. estimates store model4

. lrtest model4 model3, stats
Likelihood-ratio test                            LR chi2(4) =  1.47
(Assumption: model3 nested in model4)            Prob > chi2 = 0.8326
------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)    df       AIC       BIC
-------------+----------------------------------------------------------
      model3 |    910   -449.7658   -437.2999     6  886.5999  915.4805
      model4 |    910   -449.7658   -436.5669    10  893.1338  941.2682
------------------------------------------------------------------------

. predict logit4, xb

. twoway (scatter logit3 agp if bphi==0, mcol(red) msize(large) msymbol(circle) mlwidth(medthick)) (scatter logit3 agp if bphi==1, msize(large) msymbol(circle) mlwidth(medthick) mcol(green)) (line logit4 agp if bphi==0, sort lcol(red) lpat(dash)) (line logit4 agp if bphi==1, sort lcol(green) lpat(dash)), ytitle(logit(P(X))) xtitle(Age Group) legend(off) scheme(s1mono) text(-1.25 4.8 "BPHI=0") text(-.35 4.8 "BPHI=1")

[Figure: Model 3 fitted logits (points) with Model 4 fitted logits (dashed lines) overlaid against age group, separately for BPHI=0 and BPHI=1.]
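The AIC column in the lrtest ..., stats output is just −2 log L + 2 × (number of fitted parameters), so smaller is better. A quick check against the model3/model4 rows above:

```python
def aic(loglik, n_params):
    """Akaike information criterion: -2 log L + 2 * number of fitted parameters."""
    return -2.0 * loglik + 2.0 * n_params

aic3 = aic(-437.29993, 6)   # Model 3: BP + factor age
aic4 = aic(-436.56689, 10)  # Model 4: adds the BP x age interaction
```

Model 3 wins on AIC (886.6 vs 893.1): the four extra interaction parameters buy too little log likelihood.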


Framingham: Model Summary

Summary of models:

Model  Description          #pars  BP OR        logL    AIC    LR Test  Statistic (df, p)
0      Intercept only         1    -            -449.8  901.5  -        -
1      BP only                2    2.55         -441.0  886.0  1 vs 0   17.55 (1, <0.00005)
2      Factor Age             5    -            -445.1  900.1  2 vs 0    9.38 (4, .052)
3      BP + Factor Age        6    2.43         -437.3  886.6  3 vs 2   15.55 (1, .0001)
4      BP + Factor Age        10   1.82, 2.17,  -436.6  893.1  4 vs 3    1.47 (4, .83)
       + Interaction               3.18, 2.04,
                                   4.20

Conclusion?

Ille-et-Vilaine Case-control Study

Cases: 200 males diagnosed in one of the regional hospitals in the French department of Ille-et-Vilaine (Brittany) between Jan 1972 and Apr 1974.

Controls: random sample of 778 adult males from electoral lists in each commune (775 with usable data).

Exposures: detailed dietary interview on consumption of various foods, tobacco and alcoholic beverages.

Background: Brittany was a known "hot spot" of esophageal cancer in France and also had high levels of alcohol consumption, particularly of the local (often homemade) apple brandy known as Calvados.

Reference: Tuyns AJ, Pequinot G, Jensen OM. (1977) Le cancer de l'oesophage en Ille-et-Vilaine en fonction des niveaux de consommation d'alcool et de tabac. Bull Canc 64: 45-60.


Ille-et-Vilaine Case-control Study: Using STATA

. use "ille-et-vilaine.dta", clear

. tabodds case tob [freq=count], or
---------------------------------------------------------------------------
         tob | Odds Ratio       chi2   P>chi2     [95% Conf. Interval]
-------------+-------------------------------------------------------------
         0-9 |   1.000000          .        .            .           .
       10-19 |   1.867329      10.46   0.0012     1.271188    2.743040
       20-29 |   1.910256       7.72   0.0055     1.200295    3.040153
         30+ |   3.483409      25.31   0.0000     2.074288    5.849783
---------------------------------------------------------------------------
Test of homogeneity (equal odds): chi2(3)  =  29.33   Pr>chi2 = 0.0000
Score test for trend of odds:     chi2(1)  =  26.93   Pr>chi2 = 0.0000

. logistic case tob [freq=count]
Logistic regression                       Number of obs =    975
                                          LR chi2(1)    =  25.37
                                          Prob > chi2   = 0.0000
Log likelihood = -482.05896               Pseudo R2     = 0.0256
------------------------------------------------------------------------------
        case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         tob |   1.474687   .1123596     5.10   0.000     1.270121    1.712201
------------------------------------------------------------------------------

The estimated esophageal cancer odds ratio comparing (any) adjacent groups of tobacco consumption is 1.47. How do we interpret this odds ratio?

Ille-et-Vilaine Case-control Study

For this logistic model, how would you estimate the esophageal cancer OR comparing tobacco consumption:
• 10-19 with 0-9 g/day?   1.47
• 20-29 with 0-9 g/day?   1.47² = 2.17
• 30+ with 0-9 g/day?     1.47³ = 3.21
• 30+ with 10-19 g/day?   1.47² = 2.17
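Under the linear-score model the log odds increases by β for each one-category step, so the OR comparing categories that are k steps apart is exp(β)^k. A Python sketch checking the quiz answers with the fitted per-category OR (the helper name is ours):

```python
or_per_step = 1.474687  # fitted per-category OR from `logistic case tob`

def or_between(step_count, or_step=or_per_step):
    """OR comparing two tobacco categories `step_count` levels apart."""
    return or_step ** step_count

or_2_steps = or_between(2)  # 20-29 vs 0-9 g/day (and 30+ vs 10-19)
or_3_steps = or_between(3)  # 30+ vs 0-9 g/day
```

Rounding reproduces the 2.17 and 3.21 quoted above.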

Ille-et-Vilaine Case-control Study

. logit case
…
. estimates store model0

. xi: logistic case i.age
Logistic regression                       Number of obs =    975
                                          LR chi2(5)    = 121.04
                                          Prob > chi2   = 0.0000
Log likelihood = -434.22195               Pseudo R2     = 0.1223
------------------------------------------------------------------------------
        case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Iage_2 |   5.447368   5.777946     1.60   0.110     .6812858    43.55562
     _Iage_3 |   31.67665   32.24812     3.39   0.001     4.307063    232.9685
     _Iage_4 |    52.6506   53.37904     3.91   0.000     7.218137    384.0445
     _Iage_5 |   59.66981   60.74305     4.02   0.000     8.114154    438.7995
     _Iage_6 |   48.22581   50.98864     3.67   0.000     6.071737    383.0417
------------------------------------------------------------------------------

. logit
Logit estimates                           Number of obs =    975
                                          LR chi2(5)    = 121.04
                                          Prob > chi2   = 0.0000
Log likelihood = -434.22195               Pseudo R2     = 0.1223
------------------------------------------------------------------------------
        case |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Iage_2 |   1.695133   1.060686     1.60   0.110    -.3837734    3.774039
     _Iage_3 |    3.45558   1.018041     3.39   0.001     1.460256    5.450903
     _Iage_4 |   3.963678   1.013835     3.91   0.000     1.976597    5.950758
     _Iage_5 |   4.088826   1.017986     4.02   0.000      2.09361    6.084043
     _Iage_6 |   3.875894    1.05729     3.67   0.000     1.803645    5.948144
       _cons |  -4.744932   1.004331    -4.72   0.000    -6.713384    -2.77648
------------------------------------------------------------------------------

. estimates store model1
. lrtest model0 model1
Likelihood-ratio test                            LR chi2(5) = 121.04
(Assumption: model0 nested in model1)            Prob > chi2 = 0.0000

Ille-et-Vilaine Case-control Study

. xi: logistic case i.age i.tob
Logistic regression                       Number of obs =    975
                                          LR chi2(8)    = 157.68
                                          Prob > chi2   = 0.0000
Log likelihood = -415.90235               Pseudo R2     = 0.1594
------------------------------------------------------------------------------
        case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Iage_2 |   6.035932   6.433579     1.69   0.092     .7472494    48.75544
     _Iage_3 |   36.20831   37.10768     3.50   0.000     4.858071    269.8688
     _Iage_4 |   61.79318   63.10318     4.04   0.000     8.350141    457.2853
     _Iage_5 |   83.56952   85.86284     4.31   0.000     11.15546    626.0487
     _Iage_6 |   60.45383   64.52342     3.84   0.000     7.463141    489.6954
     _Itob_2 |   1.835482   .3781838     2.95   0.003     1.225655    2.748731
     _Itob_3 |   1.945172   .4877329     2.65   0.008     1.189947    3.179717
     _Itob_4 |   5.706139   1.725687     5.76   0.000     3.154398     10.3221
------------------------------------------------------------------------------

. logit
Logit estimates                           Number of obs =    975
                                          LR chi2(8)    = 157.68
                                          Prob > chi2   = 0.0000
Log likelihood = -415.90235               Pseudo R2     = 0.1594
------------------------------------------------------------------------------
        case |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Iage_2 |    1.79773    1.06588     1.69   0.092    -.2913563    3.886817
     _Iage_3 |   3.589289   1.024839     3.50   0.000     1.580642    5.597936
     _Iage_4 |   4.123793     1.0212     4.04   0.000     2.122278    6.125308
     _Iage_5 |   4.425679   1.027442     4.31   0.000     2.411929    6.439428
     _Iage_6 |    4.10188   1.067317     3.84   0.000     2.009976    6.193784
     _Itob_2 |   .6073073   .2060406     2.95   0.003     .2034753    1.011139
     _Itob_3 |   .6653504   .2507403     2.65   0.008     .1739085    1.156792
     _Itob_4 |   1.741543   .3024264     5.76   0.000     1.148798    2.334287
       _cons |  -5.367792   1.017907    -5.27   0.000    -7.362852   -3.372731
------------------------------------------------------------------------------

. estimates store model2


Ille-et-Vilaine Case-control Study

. lrtest model1 model2
Likelihood-ratio test                            LR chi2(3) = 36.64
(Assumption: model1 nested in model2)            Prob > chi2 = 0.0000

. xi: logistic case i.age i.tob i.alc
Logistic regression                       Number of obs =    975
                                          LR chi2(11)   = 285.62
                                          Prob > chi2   = 0.0000
Log likelihood = -351.93592               Pseudo R2     = 0.2887
------------------------------------------------------------------------------
        case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     _Iage_2 |   7.249153   8.003167     1.79   0.073     .8328154    63.09948
     _Iage_3 |   43.65363   46.62157     3.54   0.000      5.38204    354.0738
     _Iage_4 |   76.33883   81.30049     4.07   0.000     9.467161    615.5611
     _Iage_5 |    133.808   144.0209     4.55   0.000     16.22978    1103.193
     _Iage_6 |   124.7787   139.9078     4.30   0.000     13.85905    1123.434
     _Itob_2 |   1.549686   .3538287     1.92   0.055     .9905925    2.424334
     _Itob_3 |   1.669657   .4557781     1.88   0.060     .9778419    2.850925
     _Itob_4 |   5.160313   1.775732     4.77   0.000     2.628853    10.12945
     _Ialc_2 |   4.198086   1.049782     5.74   0.000     2.571568    6.853377
     _Ialc_3 |    7.24794   2.063937     6.96   0.000     4.147867    12.66497
     _Ialc_4 |   36.70338   14.13218     9.36   0.000     17.25685    78.06397
------------------------------------------------------------------------------

. est store model3
. lrtest model2 model3, stats
Likelihood-ratio test                            LR chi2(3) = 127.93
(Assumption: model2 nested in model3)            Prob > chi2 = 0.0000
------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)    df       AIC       BIC
-------------+----------------------------------------------------------
      model2 |    975   -494.7442   -415.9023     9  849.8047  893.7466
      model3 |    975   -494.7442   -351.9359    12  727.8718  786.4611
------------------------------------------------------------------------

Ille-et-Vilaine Case-control Study

• Do we need a more complex model?

. xi: logistic case i.age i.tob*i.alc
. estimates store model4
. lrtest model3 model4
Likelihood-ratio test                            LR chi2(9) =  5.45
(Assumption: model3 nested in model4)            Prob > chi2 = 0.7934

• Can we simplify the model?

. xi: logistic case i.age tob alc
. estimates store model5
. lrtest model3 model5
Likelihood-ratio test                            LR chi2(4) =  8.78
(Assumption: model5 nested in model3)            Prob > chi2 = 0.0667

Note: not obvious, but model 5, which treats the categories of tob and alc as linear, is nested in model 3.

Ille-et-Vilaine Case-control Study: Goodness-of-Fit

To check the goodness-of-fit of Model 3, compare the expected number of cases with the observed:

. xi: logistic case i.age i.tob i.alc
. predict fitted, pr
. preserve
. collapse (sum) fitted case, by(alc tob)
. list, noobs
  +------------------------------------+
  |    alc     tob     fitted    case  |
  |------------------------------------|
  |   0-39     0-9   13.68551       9  |
  |   0-39   10-19   7.091438      10  |
  |   0-39   20-29    3.44484       5  |
  |   0-39     30+   4.778213       5  |
  |  40-79     0-9   30.96526      34  |
  |  40-79   10-19   18.84762      17  |
  |  40-79   20-29   16.02676      15  |
  |  40-79     30+   9.160364       9  |
  | 80-119     0-9   17.83326      19  |
  | 80-119   10-19   19.91053      19  |
  | 80-119   20-29   6.668531       6  |
  | 80-119     30+   6.587675       7  |
  |   120+     0-9   15.51597      16  |
  |   120+   10-19   12.15041      12  |
  |   120+   20-29   6.859873       7  |
  |   120+     30+   10.47375      10  |
  +------------------------------------+

For grouped data the Pearson chi-squared statistic has a chi-squared distribution with J − p degrees of freedom, where J = number of covariate patterns and p = number of fitted coefficients in the model (including the intercept).

. lfit
Logistic model for case, goodness-of-fit test
 number of observations       =   975
 number of covariate patterns =    88
 Pearson chi2(76)             = 86.56
 Prob > chi2                  = 0.1913
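The lfit tail probability can be verified with the closed-form χ² survival function for even degrees of freedom, P(χ²(2k) > x) = e^(−x/2) Σ_{i<k} (x/2)^i / i!: here J = 88 covariate patterns and p = 12 coefficients, so df = 76. A Python sketch:

```python
import math

def chi2_sf_even(x, df):
    """Survival function of a chi-square with an even number of df (Poisson series)."""
    assert df % 2 == 0
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

df = 88 - 12                       # J covariate patterns minus p fitted coefficients
p_value = chi2_sf_even(86.56, df)  # Pearson chi2 from lfit
```

This reproduces Prob > chi2 ≈ 0.19: no evidence of lack of fit.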


Ille-et-Vilaine Case-control Study: Goodness-of-Fit

When J increases with N (e.g. a quantitative covariate): the Hosmer-Lemeshow test is based on collapsing the data into G groups based on ordering of the fitted probabilities and then calculating the Pearson chi-square statistic:

. lfit, group(10) table
Logistic model for case, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)
  +--------------------------------------------------------------+
  | Group |   Prob | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total |
  |-------+--------+-------+-------+-------+-------+-------|
  |     1 | 0.0070 |     0 |   0.3 |    99 |  98.7 |    99 |
  |     2 | 0.0299 |     1 |   1.9 |   125 | 124.1 |   126 |
  |     3 | 0.0456 |     4 |   3.4 |    76 |  76.6 |    80 |
  |     4 | 0.0717 |     4 |   6.7 |   100 |  97.3 |   104 |
  |     5 | 0.1193 |    12 |  12.1 |    96 |  95.9 |   108 |
  |     6 | 0.1790 |    12 |  10.8 |    56 |  57.2 |    68 |
  |     7 | 0.2450 |    25 |  24.8 |    82 |  82.2 |   107 |
  |     8 | 0.3590 |    34 |  31.4 |    59 |  61.6 |    93 |
  |     9 | 0.4891 |    44 |  39.9 |    49 |  53.1 |    93 |
  |    10 | 0.9625 |    64 |  68.7 |    33 |  28.3 |    97 |
  +--------------------------------------------------------------+
 number of observations   =   975
 number of groups         =    10
 Hosmer-Lemeshow chi2(8)  =  4.31
 Prob > chi2              = 0.8284

Ille-et-Vilaine Case-control Study

Using model 3 we can construct a table of estimated esophageal cancer ORs by alcohol and tobacco, adjusted for age:

  Alcohol                Tobacco (g/day)
  (g/day)      0-9     10-19    20-29      30+
  0-39         1.0      1.55     1.67     5.16
  40-79        4.20     6.51     7.01    21.66
  80-119       7.25    11.23    12.10    37.40
  120+        36.70    56.88    61.28   189.40

Note: each odds ratio in the interior of the table is obtained by multiplying together the respective marginal odds ratios.
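The Hosmer-Lemeshow statistic can be recomputed from the printed table: HL = Σ over groups of (O1−E1)²/E1 + (O0−E0)²/E0, referred to a χ² with G−2 df. A Python sketch using the rounded expected counts above (so the result only approximately matches Stata's 4.31):

```python
import math

# (Obs_1, Exp_1, Obs_0, Exp_0) per decile group, copied from the lfit table
groups = [(0, 0.3, 99, 98.7), (1, 1.9, 125, 124.1), (4, 3.4, 76, 76.6),
          (4, 6.7, 100, 97.3), (12, 12.1, 96, 95.9), (12, 10.8, 56, 57.2),
          (25, 24.8, 82, 82.2), (34, 31.4, 59, 61.6), (44, 39.9, 49, 53.1),
          (64, 68.7, 33, 28.3)]

hl = sum((o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0 for o1, e1, o0, e0 in groups)

def chi2_sf_even(x, df):
    """Chi-square survival function for an even number of df."""
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))

p_value = chi2_sf_even(hl, 10 - 2)  # G - 2 = 8 degrees of freedom
```

The statistic comes out near 4.3 with p around 0.83, agreeing with lfit up to rounding of the expected counts.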

Linear Combinations

In some situations we want to compute confidence intervals and/or tests for combinations of the βj's.

Linear combinations are expressions of the form

  a1β1 + a2β2 + a3β3 + …

For example: suppose, in model 1, we wish to consider pooling data from 65-74 year olds with those 75+. This is equivalent to testing H0: βIage_6 − βIage_5 = 0.

Interpretation of βIage_6 − βIage_5?

Linear Combinations – Example

H0: βIage_6 − βIage_5 = 0. Note that

. test _Iage_6 - _Iage_5 = 0
 ( 1) - _Iage_5 + _Iage_6 = 0
          chi2( 1) =  0.33
        Prob > chi2 = 0.5648

. lincom _Iage_6 - _Iage_5, or
 ( 1) - _Iage_5 + _Iage_6 = 0
------------------------------------------------------------------------------
        case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .8082111   .2989255    -0.58   0.565     .3914703    1.668595
------------------------------------------------------------------------------

Conclusion?
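The odds ratio that lincom reports is just the exponentiated difference of the two coefficients (its standard error, by contrast, needs the full covariance matrix, which only the fitted model provides). A sketch of the point estimate using the age-only model's coefficients:

```python
import math

b_iage_5 = 4.088826   # _Iage_5 from the age-only logit fit
b_iage_6 = 3.875894   # _Iage_6

# OR comparing the top two age groups: exp(beta_Iage_6 - beta_Iage_5)
or_6_vs_5 = math.exp(b_iage_6 - b_iage_5)
```

This reproduces lincom's point estimate of .8082111; since the CI comfortably covers 1, pooling the two oldest age groups looks defensible.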


Regression Analysis & Model Building

• The first step is to identify the scientific question.
• A priori specification of the goals of analysis is crucial, particularly for interpretation of p-values.

Q: What are the statistical goals of a regression analysis?

A: Estimation, Testing and/or Prediction
• Estimation of the "effect" of one variable (exposure), called the predictor of interest (POI), after "adjusting" or controlling for other measured variables:
  - remove confounding effects
  - identify effect modification
• Testing of a limited number of hypotheses, specified a priori in the study protocol, regarding the association of response and POI.
• Prediction of outcomes for "new" subjects from the same population, given covariates.

Modeling Goals

1. CDA: "Effect" estimation (exposure → disease)
• Logistic regression for inference
• Parsimonious description of associations
• Parameter interpretation desired
• Avoid automated methods

2. Good prediction of the outcome
• Logistic regression for prediction
• Less focus on interpreting parameters (black box)
• Tradeoff between bias (from underfitting) and variance (from overfitting)
• Automated methods useful
• Now widely used with microarrays: identify gene expression "signatures" for diagnosis or prognosis

3. "Data mining"
• Exploratory Data Analysis (EDA)
• Hypothesis generating
• Write / create study design and protocol
• Confirmatory studies to follow

Guidelines for CDA – Effect Estimation

Outcome: D
Exposure of interest: E

• Carefully decide the scientific question that you want answered.
• Carefully (and parsimoniously) model the exposure effect.
• Restrict attention to clinically or biologically meaningful variables:
  - study goals
  - literature review
  - theoretical basis
  Define these variables as C1, C2, …, Cp.
• Use a "rich" model for confounder control.
• Structure interaction analyses a priori.
• Make sure that your model is "hierarchically well-formulated", i.e. don't include interactions without the corresponding main effects.

Model Building Strategies

Kleinbaum, Logistic Regression, Chapter 6:

"Most epidemiologic research studies in the literature … provide a minimum of information about modeling methods used in the data analysis."

"Without meaningful information about the modeling strategy used, it is difficult to assess the validity of the results provided. Thus, there is need for guidelines regarding modeling strategy to help researchers know what information to provide."

"In practice, most modeling strategies are ad hoc; in other words, researchers often make up a strategy as they go along in their analysis. The general guidelines that we recommend here encourage more consistency in the strategy used by different researchers."

Information often not provided:
1. how variables were chosen / selected
2. how confounders were assessed
3. how effect modifiers were assessed


Model Building Strategies

Classification of variables:
• Response variable: dependent or outcome variable.
• Predictor of interest (POI): exposure variable; treatment assignment.
• Confounding variables: associated with response and POI; not intermediate.
• Design variables: used for (frequency) matching or stratification; must be considered in the analysis.
• Effect modifiers: identify subgroups.

Guidelines for Effect Estimation

• Plan a short (2-3) series of analyses:
  - main effect of exposure adjusted for
    • no confounders
    • primary confounders
    • primary and secondary confounders
  - (possibly) interaction of the exposure effect with a short list of potential effect modifiers
• Write down the plan and adhere to it.
• Having completed these steps you will have identified the statistical model that is of interest. There is no data-dependent variable selection.

Spring 2013 Biostat 513 234 Spring 2013 Biostat 513 235 Guidelines for Effect Estimation Guidelines for Effect Estimation Model Simplification Model Simplification

Controversy remains regarding next step: • Plan how you will simplify the model e.g., 1) Check for interaction between POI and other covariates 1. Use “backwards elimination” to discard 2) Check for confounding of POI o interactions that are not statistically significant (LR test) o do not assess confounding for EM o confounders that do not change effect estimates by specified o generally difficult to assess confounding in presence of any amount EM

3) Discard covariates which are not EM or confounders! or • Decide on strategy for dropping variables a priori (i.e., “I will drop 2. Stay put! interactions if …”, “I will include a confounder if …”) o Effect modification decision often based on p-value o Confounding decision based on change in coefficient o Dropping variables may increase precision • Write the plan down and stick to it


Guidelines for Effect Estimation: Model Simplification

Kleinbaum (1994):

"In general, …, the method for assessing confounding when there is no interaction is to monitor changes in the effect measure corresponding to different subsets of potential confounders in the model."

"To evaluate how much of a change is a meaningful change when considering the collection of coefficients … is quite subjective."

Recommendations:
• 10% change in the measure of interest
  o Mickey & Greenland (1989) AJE
  o Maldonado & Greenland (1993) AJE

Statistical Thinking

"As a medical statistician, I am appalled by the large number of irreproducible results published in the medical literature. There is a general, and likely correct, perception that this problem is associated more with statistical, as opposed to laboratory, research. I am convinced, however, that results of clinical and epidemiological investigations could become more reproducible if only the investigators would apply more rigorous statistical thinking and adhere more closely to well established principles of the scientific method. While I agree that the investigative cycle is an iterative process, I believe that it works best when it is hypothesis driven."

"The epidemiology literature is replete with irreproducible results stemming from the failure to clearly distinguish between analyses that were specified in the protocol and that test the a priori hypotheses whose specification was needed to secure funding, and those that were performed post-hoc as part of a serendipitous process of data exploration."

Breslow (1999)

Statistical Thinking

"It will rarely be necessary to include a large number of variables in the analysis, because only a few exposures are of genuine scientific interest in any one study, and there are usually very few variables of sufficient a priori importance for their potential confounding effect to be controlled for. Most scientists are aware of the dangers of analyses which search a long list of potentially relevant exposures. These are known as data dredging or blind fishing and carry considerable danger of false positive findings. Such analyses are as likely to impede scientific progress as to advance it. There are similar dangers if a long list of potential confounders is searched, either with a view to explaining the observed relationship between disease and exposure or to enhancing it – findings will inevitably be biased. Confounders should be chosen a priori and not on the basis of statistical significance."

Statistical Thinking

• "When you go looking for something specific, your chances of finding it are very bad, because of all the things in the world, you're only looking for one of them.
• "When you go looking for anything at all, your chances of finding it are very good, because of all the things in the world, you're sure to find some of them."

Daryl Zero in "The Zero Effect"

Clayton and Hills, Statistical Methods in Epidemiology, 1993, p. 273


Multiple Comparisons Problem

• "When you go looking for something specific, your chances of finding it [a spurious association by chance] are very bad, because of all the things in the world, you're only looking for one of them.
• "When you go looking for anything at all, your chances of finding it [a spurious association by chance] are very good, because of all the things in the world, you're sure to find some of them."

Logistic Regression Model Building: Low Birthweight Case Study

Is maternal smoking a risk factor for having a child born with low birth weight? The variables identified in the table below have been shown to be associated with low birth weight in the obstetrical literature.

Table: Code Sheet for the Variables in the Low Birth Weight Data Set.

Columns  Variable                                               Abbreviation
-------  -----------------------------------------------------  ------------
2-4      Identification Code                                    ID
10       Low Birth Weight (0 = Birth Weight >= 2500g,           LBW
         1 = Birth Weight < 2500g)
17-18    Age of the Mother in Years                             AGE
23-25    Weight in Pounds at the Last Menstrual Period          LWT
32       Race (1 = White, 2 = Black, 3 = Other)                 RACE
40       Smoking Status During Pregnancy (1 = Yes, 0 = No)      SMOKE
48       History of Premature Labor (0 = None, 1 = One, etc.)   PTL
55       History of Hypertension (1 = Yes, 0 = No)              HYPER
61       Presence of Uterine Irritability (1 = Yes, 0 = No)     URIRR
67       Number of Physician Visits During the First Trimester  PVFT
         (0 = None, 1 = One, 2 = Two, etc.)
73-76    Birth Weight in Grams                                  BWT

Logistic Regression Model Building: Low Birthweight Case Study

Scientific question: Is maternal smoking a risk factor for having a child born with low birth weight?

Outcome: LBW
Exposure: smoking during pregnancy
Potential confounders: mother's age, weight, race, history of premature labor, history of hypertension
Potential effect modifier: history of hypertension

. infile id lbw age lwt race smoke ptl hyper urirr pvft weight using "lowbwt.dat"
. generate agecat=age
. recode agecat min/19=1 20/24=2 25/29=3 30/max=4
. label define newgps 1 "<20" 2 "20-24" 3 "25-29" 4 "30+"
. label values agecat newgps
. generate wcat=lwt
. recode wcat min/105=1 106/120=2 121/130=3 131/150=4 151/max=5
. label define wgps 1 "<=105" 2 "106-120" 3 "121-130" 4 "131-150" 5 "151+"
. label values wcat wgps
. generate anyptl=ptl
. recode anyptl 1/max=1
. label define agps 0 "0" 1 "1+"
. label values anyptl agps

. tabulate lbw smoke, chi2 row col

           |        smoke
       lbw |         0          1 |     Total
-----------+----------------------+----------
         0 |        86         44 |       130
           |     66.15      33.85 |    100.00
           |     74.78      59.46 |     68.78
-----------+----------------------+----------
         1 |        29         30 |        59
           |     49.15      50.85 |    100.00
           |     25.22      40.54 |     31.22
-----------+----------------------+----------
     Total |       115         74 |       189
           |     60.85      39.15 |    100.00
           |    100.00     100.00 |    100.00

Pearson chi2(1) = 4.9237   Pr = 0.026
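The Pearson statistic Stata reports for this table can be recomputed from the counts alone. A small Python sketch (the slides use Stata; this is only a hand check of the formula chi2 = sum (O-E)^2/E with expected counts from the margins):

```python
# Recompute the Pearson chi-square statistic for the 2x2 lbw-by-smoke table
# above (rows lbw = 0, 1; columns smoke = 0, 1).
def pearson_chi2(table):
    """table: list of rows of observed counts; returns the chi-square stat."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n   # expected count under independence
            chi2 += (obs - exp) ** 2 / exp
    return chi2

stat = pearson_chi2([[86, 44], [29, 30]])
print(round(stat, 4))   # 4.9237, matching the Stata output
```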


Logistic Regression Model Building: Low Birthweight Case Study

Crude associations of each covariate with the outcome (counts; Pearson chi-square):

. tabulate lbw race, chi2 row
       lbw |     1     2     3 | Total
         0 |    73    15    42 |   130
         1 |    23    11    25 |    59
     Total |    96    26    67 |   189
Pearson chi2(2) = 5.0048   Pr = 0.082

. tabulate lbw agecat, chi2 row
       lbw |   <20  20-24  25-29   30+ | Total
         0 |    36     44     27    23 |   130
         1 |    15     25     15     4 |    59
     Total |    51     69     42    27 |   189
Pearson chi2(3) = 4.6641   Pr = 0.198

. tabulate lbw wcat, chi2 row
       lbw | <=105  106-120  121-130  131-150  151+ | Total
         0 |    17       41       20       23    29 |   130
         1 |    20       14       11        7     7 |    59
     Total |    37       55       31       30    36 |   189
Pearson chi2(4) = 13.2923   Pr = 0.010

. tabulate lbw anyptl, chi2 row
       lbw |     0    1+ | Total
         0 |   118    12 |   130
         1 |    41    18 |    59
     Total |   159    30 |   189
Pearson chi2(1) = 13.7590   Pr = 0.000

. tabulate lbw hyper, chi2 row col
       lbw |     0     1 | Total
         0 |   125     5 |   130
         1 |    52     7 |    59
     Total |   177    12 |   189
Pearson chi2(1) = 4.3880   Pr = 0.036

Associations of each covariate with the exposure (smoking):

. tabulate smoke wcat, chi2 row
     smoke | <=105  106-120  121-130  131-150  151+ | Total
         0 |    19       35       18       19    24 |   115
         1 |    18       20       13       11    12 |    74
     Total |    37       55       31       30    36 |   189
Pearson chi2(4) = 2.2704   Pr = 0.686

. tabulate smoke agecat, chi2 row
     smoke |   <20  20-24  25-29   30+ | Total
         0 |    28     44     26    17 |   115
         1 |    23     25     16    10 |    74
     Total |    51     69     42    27 |   189
Pearson chi2(3) = 1.0742   Pr = 0.783

. tabulate smoke race, chi2 row
     smoke |     1     2     3 | Total
         0 |    44    16    55 |   115
         1 |    52    10    12 |    74
     Total |    96    26    67 |   189
Pearson chi2(2) = 21.7790   Pr = 0.000

Logistic Regression Model Building: Low Birthweight Case Study

. tabulate smoke anyptl, chi2 row
     smoke |     0    1+ | Total
         0 |   103    12 |   115
         1 |    56    18 |    74
     Total |   159    30 |   189
Pearson chi2(1) = 6.5050   Pr = 0.011

. tabulate smoke hyper, chi2 row
     smoke |     0     1 | Total
         0 |   108     7 |   115
         1 |    69     5 |    74
     Total |   177    12 |   189
Pearson chi2(1) = 0.0340   Pr = 0.854

Crude analysis:

. logit lbw smoke, or

Logit estimates                              Number of obs =       189
                                             LR chi2(1)    =      4.87
                                             Prob > chi2   =    0.0274
Log likelihood = -114.9023                   Pseudo R2     =    0.0207

       lbw | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------+----------------------------------------------------------------
     smoke |   2.021944   .6462912     2.20   0.028     1.080668    3.783083

. estimates store model1
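For a single binary covariate, the fitted crude OR is just the cross-product ratio of the 2x2 table, and the Wald interval can be built from the usual log-OR standard error. A Python hand check (the slides use Stata) against the `logit lbw smoke, or` row above:

```python
# Reproduce the crude smoking OR and its 95% CI from the lbw-by-smoke counts.
import math

def odds_ratio(a, b, c, d):
    """OR for table rows [a, b] (Y=0) and [c, d] (Y=1), columns E=0, E=1."""
    return (d * a) / (c * b)

a, b, c, d = 86, 44, 29, 30
or_hat = odds_ratio(a, b, c, d)
se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)     # SE of the log OR (Woolf formula)
lo = math.exp(math.log(or_hat) - 1.96 * se_log)
hi = math.exp(math.log(or_hat) + 1.96 * se_log)
print(round(or_hat, 4), round(lo, 2), round(hi, 2))   # 2.0219 1.08 3.78
```

This matches the Stata output (OR 2.021944, 95% CI 1.080668 to 3.783083) up to rounding.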

Logistic Regression Model Building: Low Birthweight Case Study

Effect of smoking, adjusted for confounding:

. xi: logit lbw smoke i.agecat i.wcat i.race anyptl hyper, or

Logistic regression                          Number of obs =       189
                                             LR chi2(12)   =     42.38
                                             Prob > chi2   =    0.0000
Log likelihood = -96.147927                  Pseudo R2     =    0.1806

       lbw | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------+----------------------------------------------------------------
     smoke |   2.262309   .9348211     1.98   0.048      1.00652     5.08489
_Iagecat_2 |   1.392663   .6378679     0.72   0.470     .5675163     3.41754
_Iagecat_3 |   1.471076   .7585693     0.75   0.454     .5354361    4.041684
_Iagecat_4 |   .4195708   .3026273    -1.20   0.229     .1020592    1.724878
  _Iwcat_2 |   .3322387   .1645385    -2.22   0.026     .1258635    .8770022
  _Iwcat_3 |   .4294435   .2488233    -1.46   0.145     .1379471    1.336902
  _Iwcat_4 |   .3403724   .2053118    -1.79   0.074     .1043546     1.11019
  _Iwcat_5 |   .1483812   .0991536    -2.86   0.004     .0400476     .549771
  _Irace_2 |   3.349107   1.834262     2.21   0.027      1.14482    9.797629
  _Irace_3 |   2.041873   .9271941     1.57   0.116     .8385072    4.972226
    anyptl |   3.908954   1.864624     2.86   0.004     1.534709    9.956236
     hyper |   5.126167   3.631541     2.31   0.021     1.278717    20.54997

. est store model3

Assess potential effect modification:

. xi: logit lbw i.smoke*hyper i.agecat i.wcat i.race anyptl, or

Logistic regression                          Number of obs =       189
                                             LR chi2(13)   =     43.47
                                             Prob > chi2   =    0.0000
Log likelihood = -95.602101                  Pseudo R2     =    0.1852

         lbw | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Ismoke_1 |   2.033485   .8714053     1.66   0.098     .8779652    4.709824
       hyper |   2.823694   2.529321     1.16   0.247     .4879269    16.34107
_IsmoXhype~1 |   4.358472   6.180988     1.04   0.299     .2705195    70.22147
  _Iagecat_2 |   1.460021   .6726449     0.82   0.411     .5918377    3.601765
  _Iagecat_3 |   1.614789   .8454232     0.92   0.360     .5787204    4.505703
  _Iagecat_4 |   .4043117   .2968872    -1.23   0.217     .0958687    1.705124
    _Iwcat_2 |    .325607   .1617741    -2.26   0.024     .1229657    .8621908
    _Iwcat_3 |   .3948697   .2325164    -1.58   0.115     .1245172    1.252213
    _Iwcat_4 |   .3531531   .2120446    -1.73   0.083     .1088614     1.14565
    _Iwcat_5 |   .1226656   .0865586    -2.97   0.003     .0307663    .4890693
    _Irace_2 |   3.678809   2.054487     2.33   0.020     1.231235    10.99192
    _Irace_3 |   2.063504   .9400548     1.59   0.112     .8449476    5.039422
      anyptl |   4.214317   2.045116     2.96   0.003     1.628012     10.9093

. est store model4

Logistic Regression Model Building: Low Birthweight Case Study

Assess potential effect modification:

. lrtest model3 model4
Likelihood-ratio test                        LR chi2(1)  =      1.09
(Assumption: model3 nested in model4)        Prob > chi2 =    0.2961

. lincom _Ismoke_1 + _IsmoXhyper_1

       lbw |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------+----------------------------------------------------------------
       (1) |   2.181873   1.381393     1.58   0.114    -.5256074    4.889353

. lincom _Ismoke_1 + _IsmoXhyper_1, or

       lbw | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------+----------------------------------------------------------------
       (1) |   8.862888   12.24313     1.58   0.114     .5911962    132.8675

Conclusions?

Guidelines for Prediction

• Confounding is not an issue (we won't interpret coefficients)
• Automated methods (forward or backward selection; all possible regressions) are appropriate
o Hosmer & Lemeshow (1989): "the analyst, not the computer, is ultimately responsible for the review and evaluation of the model."
o Draper & Smith (1981): "The screening of variables should never be left to the sole discretion of any statistical procedure."
• Estimate error rates
o Internal (same dataset as used for fitting)
o External (reserved dataset; split sample; cross-validation)
• Still important to have a plan!
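The `lincom` step above is just arithmetic on the fitted coefficients: the log OR for smoking among women with hypertension is the smoke main effect plus the interaction, and the OR row is the exponentiated estimate with exponentiated Wald limits. A Python hand check using the coefficient and SE from the lincom output:

```python
# Reproduce the lincom OR row from the combined coefficient and its SE.
import math

coef = 2.181873    # _Ismoke_1 + _IsmoXhyper_1, from the lincom output above
se = 1.381393      # standard error of the linear combination

or_est = math.exp(coef)
ci = (math.exp(coef - 1.96 * se), math.exp(coef + 1.96 * se))
print(round(or_est, 2), round(ci[0], 2), round(ci[1], 1))   # 8.86 0.59 132.9
```

The very wide interval (0.59 to about 133) reflects how little information the 12 hypertensive women carry about the smoke-by-hyper interaction.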

Guidelines for Prediction

There are a number of different approaches to variable selection, once the form of the "full" model has been determined:

• Purposeful selection
• Stepwise selection algorithms
• Best subsets algorithms

All possible models may be compared based on a formal, explicit criterion. For instance:

• Akaike's information criterion (AIC) = -2 logL + 2p, where p is the number of parameters estimated.

Low Birthweight Study: Prediction Accuracy

The variables identified in the code sheet shown earlier have been shown to be associated with low birth weight in the obstetrical literature. The goal of the study was to ascertain if these variables were important in the population being served by the medical center where the data were collected.
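The AIC formula above is simple enough to evaluate directly. As an illustration in Python (the slides use Stata), applied to the 7-predictor model fitted later in this section, with logL = -98.925785 and p = 8 parameters counting the intercept:

```python
# AIC = -2 logL + 2p, as defined above; smaller is better when comparing
# models fitted to the same data.
def aic(loglik: float, n_params: int) -> float:
    return -2.0 * loglik + 2.0 * n_params

print(round(aic(-98.925785, 8), 2))   # 213.85
```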


Stepwise Algorithms

Variables are included in or excluded from the model based on their statistical significance. Suppose we have k variables for consideration in the model. A forward selection algorithm would proceed as follows:

Step 0: Fit a model with the intercept only and evaluate the likelihood, L0. Fit each of the k possible univariate models, evaluate their likelihoods, Lj0, j = 1, 2, …, k, and carry out the LRT comparing L0 and Lj0. The variable (say the 1st) with the smallest p-value is included in the model, provided this p-value is less than some pre-specified pE.

Step 1: All k-1 models containing the intercept, the 1st variable, and one of the remaining variables are fitted. The log-likelihoods are compared with those from the model containing just the intercept and the 1st variable. Say the 2nd variable has the smallest p-value, p2. It is then included in the model, provided p2 < pE.

Step 2: Carry out a LRT to assess whether the 1st variable, given the presence of the 2nd variable, can be dropped from the model. Compare the p-value from the LRT with a pre-specified p-value, pR. Etc.

Step S: All k variables have been included in the model, or all variables in the model have p-values less than pR and all variables not in the model have p-values greater than pE.

The same principles can be applied working backward from the "full" model with all k variables.

Low Birthweight Study: Example of Variable Selection – Forward Stepwise

. gen newpvft= pvft
. recode newpvft 2/max=2
. xi: sw, forw pe(.2) pr(.25) lr: logit lbw lwt smoke anyptl i.race newpvft hyper urirr age

LR test                begin with empty model
p = 0.0004 < 0.2000    adding anyptl
p = 0.0320 < 0.2000    adding age
p = 0.0410 < 0.2000    adding hyper
p = 0.0171 < 0.2000    adding lwt
p = 0.1048 < 0.2000    adding _Irace_2
p = 0.1071 < 0.2000    adding urirr
p = 0.1492 < 0.2000    adding smoke
p = 0.0696 < 0.2000    adding _Irace_3
p = 0.3130 >= 0.2500   removing age

Logistic regression                          Number of obs =       189
                                             LR chi2(7)    =     36.82
                                             Prob > chi2   =    0.0000
Log likelihood = -98.925785                  Pseudo R2     =    0.1569

       lbw |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------+----------------------------------------------------------------
    anyptl |   1.128857   .4503896     2.51   0.012     .2461093    2.011604
  _Irace_3 |   .8544142   .4409119     1.94   0.053    -.0097573    1.718586
     hyper |   1.866895   .7073782     2.64   0.008     .4804594    3.253331
       lwt |  -.0159185   .0069539    -2.29   0.022    -.0295478   -.0022891
  _Irace_2 |   1.300856    .528489     2.46   0.014     .2650362    2.336675
     urirr |   .7506488   .4588171     1.64   0.102    -.1486163    1.649914
     smoke |   .8665818   .4044737     2.14   0.032      .073828    1.659336
     _cons |   -.125326   .9675725    -0.13   0.897    -2.021733    1.771081
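One forward step of the algorithm described above can be sketched in a few lines of Python, with the model fitting abstracted away. Here `loglik` is a hypothetical callable (an assumption of this sketch, not a real library API) that returns the maximized log-likelihood for a model containing the given variables; the LRT p-value for adding one variable uses the chi-square(1) survival function.

```python
# Sketch of one forward-selection step, LRT-based, as in the slides.
import math

def chi2_sf_1df(x: float) -> float:
    """P(X > x) for a chi-square variable with 1 df: erfc(sqrt(x/2))."""
    return math.erfc(math.sqrt(x / 2.0))

def forward_step(current, candidates, loglik, p_enter=0.2):
    """Try adding each candidate; return (variable, p-value) if one enters."""
    base = loglik(current)
    best = None
    for v in candidates:
        lrt = 2.0 * (loglik(current + [v]) - base)   # LR chi-square, 1 df
        p = chi2_sf_1df(max(lrt, 0.0))
        if best is None or p < best[1]:
            best = (v, p)
    return best if best and best[1] < p_enter else None

# Toy log-likelihoods, made up purely for illustration.
toy = {(): -118.7, ("anyptl",): -112.4, ("smoke",): -116.3}
ll = lambda vs: toy[tuple(sorted(vs))]
print(forward_step([], ["anyptl", "smoke"], ll))
```

With these toy values, "anyptl" enters first (its LRT p-value is smallest and below pE = 0.2), mirroring the first step of the Stata `sw` log above.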

Low Birthweight Study: Example of Variable Selection – Backward Stepwise

. xi: sw, pe(.2) pr(.25) lr: logit lbw lwt smoke anyptl i.race newpvft hyper urirr age

LR test                begin with full model
p = 0.8616 >= 0.2500   removing newpvft
p = 0.3130 >= 0.2500   removing age

Logistic regression                          Number of obs =       189
                                             LR chi2(7)    =     36.82
                                             Prob > chi2   =    0.0000
Log likelihood = -98.925785                  Pseudo R2     =    0.1569

       lbw |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------+----------------------------------------------------------------
       lwt |  -.0159185   .0069539    -2.29   0.022    -.0295478   -.0022891
     smoke |   .8665818   .4044737     2.14   0.032      .073828    1.659336
    anyptl |   1.128857   .4503896     2.51   0.012     .2461093    2.011604
  _Irace_2 |   1.300856    .528489     2.46   0.014     .2650362    2.336675
  _Irace_3 |   .8544142   .4409119     1.94   0.053    -.0097573    1.718586
     urirr |   .7506488   .4588171     1.64   0.102    -.1486163    1.649914
     hyper |   1.866895   .7073782     2.64   0.008     .4804594    3.253331
     _cons |   -.125326   .9675725    -0.13   0.897    -2.021733    1.771081

Logistic Regression & Prediction of Binary Outcomes

1. We can use logistic regression to obtain estimates of probabilities, π(X)
2. Assessment of accuracy:
• Comparison of fitted and observed counts within subgroups to assess goodness-of-fit of the model
• Prediction of individual outcomes for future subjects (screening):
- Sensitivity
- Specificity
- ROC curve


Using Modeling Results for Prediction

The predicted probabilities, P̂[Y = 1 | X] = π̂(X), may be used to predict the outcome of a subject with covariates X using a decision rule such as:

Predict Y = 1 whenever π̂(X) > 1/2

or, more generally,

Predict Y = 1 whenever Xβ̂ > c

where c is a constant and Xβ̂ is the linear predictor (LP). Define, for any LP:

Sensitivity: P[LP > c | Y = 1]
Specificity: P[LP ≤ c | Y = 0]

The ROC Curve

For each value of the "cutoff" or criterion, c, we have an associated sensitivity and specificity. Which threshold c to choose can depend on such factors as the "costs" assigned to the two different types of error (falsely predicting Y = 0 when, in fact, Y = 1, and vice versa) and the "benefits" of correct assignments.

Define:
Sensitivity(c): P[Test > c | Y = 1]
Specificity(c): P[Test ≤ c | Y = 0]

Then the "ROC" (Receiver Operating Characteristic) curve plots the values

[1 - Specificity(c), Sensitivity(c)] = (False Positive rate using c, True Positive rate using c)

for all possible values of c.

Sensitivity and Specificity

[Figure: distributions of the linear predictor Xβ̂ among Y = 0 and Y = 1 subjects, e.g. assuming normal distributions with equal variance; different choices of the cutoff c give different error rates.]


ROC (Receiver Operating Characteristic) Curve

By varying c over its entire range (-∞, +∞) and plotting Sensitivity vs 1 - Specificity we obtain the ROC curve.

[Figure: ROC curve, True Positive Rate (sensitivity) vs False Positive Rate (1 - specificity), both axes running from 0 to 1.]

An overall summary of an ROC curve is the area under the curve. This is attractive since:

1. A perfect test would have area equal to 1.0.
2. A (basically) worthless test would have an area of 0.5.
3. Interpretation: the area under the ROC curve corresponds to the probability that a randomly chosen case (Y = 1) would have a higher "test" value (higher LP value) than a randomly chosen control (Y = 0).
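Interpretation 3 above gives a direct way to compute the area without drawing the curve: count, over all case-control pairs, how often the case scores higher (with ties counted as one half). A Python sketch, with made-up linear-predictor values purely for illustration:

```python
# AUC as the probability that a random case outscores a random control.
def auc_rank(case_scores, control_scores):
    wins = sum((c > d) + 0.5 * (c == d)
               for c in case_scores for d in control_scores)
    return wins / (len(case_scores) * len(control_scores))

cases = [0.9, 0.7, 0.4]          # hypothetical LP values for Y = 1 subjects
controls = [0.8, 0.3, 0.2, 0.1]  # hypothetical LP values for Y = 0 subjects
print(auc_rank(cases, controls))
```

This rank-based computation is the same quantity Stata's `lroc` reports as "area under ROC curve".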

Accurate Estimates of Prediction Error

• If we use the same data to fit a model and to assess it, we generally obtain biased, overly optimistic prediction summaries (unless we explicitly correct the summaries).
• Use training data to build the model and test data to evaluate it.
• Akaike's Information Criterion (AIC) approximates the MSE of prediction and is useful for comparing models fitted to the same data. In particular, AIC tries to pick the model that would have the lowest prediction error when applied to new data.

Low Birthweight Study: Prediction Accuracy

. xi: logit lbw lwt smoke anyptl i.race hyper urirr

Logistic regression                          Number of obs =       189
                                             LR chi2(7)    =     36.82
                                             Prob > chi2   =    0.0000
Log likelihood = -98.925785                  Pseudo R2     =    0.1569

       lbw |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------+----------------------------------------------------------------
       lwt |  -.0159185   .0069539    -2.29   0.022    -.0295478   -.0022891
     smoke |   .8665818   .4044737     2.14   0.032      .073828    1.659336
    anyptl |   1.128857   .4503896     2.51   0.012     .2461093    2.011604
  _Irace_2 |   1.300856    .528489     2.46   0.014     .2650362    2.336675
  _Irace_3 |   .8544142   .4409119     1.94   0.053    -.0097573    1.718586
     hyper |   1.866895   .7073782     2.64   0.008     .4804594    3.253331
     urirr |   .7506488   .4588171     1.64   0.102    -.1486163    1.649914
     _cons |   -.125326   .9675725    -0.13   0.897    -2.021733    1.771081

. predict fitp
(option p assumed; Pr(lbw))

. summarize fitp

    Variable |  Obs        Mean    Std. Dev.        Min        Max
-------------+-----------------------------------------------------
        fitp |  189    .3121693    .2025102    .0341958   .8862907

. summarize lbw

    Variable |  Obs        Mean    Std. Dev.        Min        Max
-------------+-----------------------------------------------------
         lbw |  189    .3121693    .4646093           0          1


Low Birthweight Study: Prediction Accuracy

. list fitp lbw smoke anyptl lwt race hyper urirr if age==15, table compress noobs

  +-------------------------------------------------------------+
  |     fitp   lbw   smoke   any~l   lwt   race   hyper   urirr |
  |-------------------------------------------------------------|
  | .4050466     0       0       0    98      2       0       0 |
  | .1328077     1       0       0   110      1       0       0 |
  | .4131672     1       0       0   115      3       0       1 |
  +-------------------------------------------------------------+

. sort lbw
. graph box fitp, medtype(line) by(lbw) intensity(0)

[Figure: box plots of the fitted probabilities Pr(lbw), by observed lbw (0 vs 1); vertical axis from 0 to .8.]

Results of prediction when cutoff = 0.5:

. estat classification

Logistic model for lbw

              -------- True --------
Classified |        D            ~D |      Total
-----------+------------------------+-----------
     +     |       26            13 |         39
     -     |       33           117 |        150
-----------+------------------------+-----------
   Total   |       59           130 |        189

Classified + if predicted Pr(D) >= .5
True D defined as lbw != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   44.07%
Specificity                     Pr( -|~D)   90.00%
Positive predictive value       Pr( D| +)   66.67%
Negative predictive value       Pr(~D| -)   78.00%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)   10.00%
False - rate for true D         Pr( -| D)   55.93%
False + rate for classified +   Pr(~D| +)   33.33%
False - rate for classified -   Pr( D| -)   22.00%
--------------------------------------------------
Correctly classified                        75.66%
--------------------------------------------------
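Every summary in the `estat classification` output is a simple ratio of cells from the 2x2 classification table. A Python sketch recovering the main ones from the counts above (TP = 26, FP = 13, FN = 33, TN = 117 at cutoff 0.5):

```python
# Recover Stata's classification summaries from the confusion-matrix counts.
def classification_summary(tp, fp, fn, tn):
    return {
        "sensitivity": tp / (tp + fn),   # Pr(+ | D)
        "specificity": tn / (tn + fp),   # Pr(- | ~D)
        "ppv": tp / (tp + fp),           # Pr(D | +)
        "npv": tn / (tn + fn),           # Pr(~D | -)
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

s = classification_summary(tp=26, fp=13, fn=33, tn=117)
print({k: f"{100 * v:.2f}%" for k, v in s.items()})
```

The printed values (44.07%, 90.00%, 66.67%, 78.00%, 75.66%) match the Stata output line for line.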

Low Birthweight Study: Prediction Accuracy

. lsens

[Figure: sensitivity and specificity plotted against the probability cutoff, both axes from 0 to 1.]

. lroc

Logistic model for lbw
number of observations = 189
area under ROC curve = 0.7580

[Figure: ROC curve, Sensitivity vs 1 - Specificity; area under ROC curve = 0.7580.]


Low Birthweight Study: Prediction Accuracy

Using a training sample:

. set seed 4514
. gen uni=uniform()
. sort uni
. gen sample=0
. replace sample=1 if _n>100

. tab sample

     sample |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        100       52.91       52.91
          1 |         89       47.09      100.00
------------+-----------------------------------
      Total |        189      100.00

. xi: logit lbw lwt smoke anyptl i.race hyper urirr if sample==0

Logistic regression                          Number of obs =       100
                                             LR chi2(7)    =     22.31
                                             Prob > chi2   =    0.0022
Log likelihood = -51.53012                   Pseudo R2     =    0.1780

       lbw |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-----------+----------------------------------------------------------------
       lwt |  -.0009498    .008417    -0.11   0.910    -.0174468    .0155472
     smoke |   1.033608   .5691081     1.82   0.069    -.0818234    2.149039
    anyptl |   .5845931   .6443695     0.91   0.364    -.6783479    1.847534
  _Irace_2 |     1.6412   .7254575     2.26   0.024     .2193291     3.06307
  _Irace_3 |   1.176396   .6257745     1.88   0.060    -.0500994    2.402891
     hyper |   1.744725   1.031183     1.69   0.091    -.2763558    3.765806
     urirr |   1.800833   .6206835     2.90   0.004     .5843157     3.01735
     _cons |  -2.346305   1.234023    -1.90   0.057    -4.764946    .0723361

. estat classification

Logistic model for lbw

              -------- True --------
Classified |        D            ~D |      Total
-----------+------------------------+-----------
     +     |       16             9 |         25
     -     |       16            59 |         75
-----------+------------------------+-----------
   Total   |       32            68 |        100

Classified + if predicted Pr(D) >= .5
True D defined as lbw != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   50.00%
Specificity                     Pr( -|~D)   86.76%
Positive predictive value       Pr( D| +)   64.00%
Negative predictive value       Pr(~D| -)   78.67%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)   13.24%
False - rate for true D         Pr( -| D)   50.00%
False + rate for classified +   Pr(~D| +)   36.00%
False - rate for classified -   Pr( D| -)   21.33%
--------------------------------------------------
Correctly classified                        75.00%
--------------------------------------------------

. lroc

Logistic model for lbw
number of observations = 100
area under ROC curve = 0.7576

Low Birthweight Study: Prediction Accuracy

. estat classification if sample==1

Logistic model for lbw

              -------- True --------
Classified |        D            ~D |      Total
-----------+------------------------+-----------
     +     |       10            11 |         21
     -     |       17            51 |         68
-----------+------------------------+-----------
   Total   |       27            62 |         89

Classified + if predicted Pr(D) >= .5
True D defined as lbw != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   37.04%
Specificity                     Pr( -|~D)   82.26%
Positive predictive value       Pr( D| +)   47.62%
Negative predictive value       Pr(~D| -)   75.00%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)   17.74%
False - rate for true D         Pr( -| D)   62.96%
False + rate for classified +   Pr(~D| +)   52.38%
False - rate for classified -   Pr( D| -)   25.00%
--------------------------------------------------
Correctly classified                        68.54%
--------------------------------------------------

. lroc if sample==1

Logistic model for lbw
number of observations = 89
area under ROC curve = 0.7004

Assessing prediction via the validation sample with a different threshold:

. estat classification if sample==1, cutoff(.2)

Logistic model for lbw

              -------- True --------
Classified |        D            ~D |      Total
-----------+------------------------+-----------
     +     |       21            35 |         56
     -     |        6            27 |         33
-----------+------------------------+-----------
   Total   |       27            62 |         89

Classified + if predicted Pr(D) >= .2
True D defined as lbw != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   77.78%
Specificity                     Pr( -|~D)   43.55%
Positive predictive value       Pr( D| +)   37.50%
Negative predictive value       Pr(~D| -)   81.82%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)   56.45%
False - rate for true D         Pr( -| D)   22.22%
False + rate for classified +   Pr(~D| +)   62.50%
False - rate for classified -   Pr( D| -)   18.18%
--------------------------------------------------
Correctly classified                        53.93%
--------------------------------------------------


Low Birthweight Study: Prediction Accuracy

. lroc if sample==1, scheme(s1mono)

Logistic model for lbw
number of observations = 89
area under ROC curve = 0.7004

[Figure: ROC curve for the validation sample, Sensitivity vs 1 - Specificity; area under ROC curve = 0.7004.]

Note: Highly statistically significant risk factors (predictors) do not guarantee successful prediction!

Guidelines for Statistical Analysis Strategy: Cox and Wermuth (1996), section 7.2

1. Establish the main scientific research question.
2. Check data quality:
• Look for:
o Possible errors in coding.
o Outliers.
o Missing values.
• Produce univariate summaries.
3. Classification of variables based on substantive grounds: outcome, known important variables, others.
4. Document pairwise associations:
• Correlations
• Mean differences with SEs
• Log odds ratios with SEs
• Additional stratified analyses for key variables.

Guidelines for Statistical Analysis Strategy: Cox and Wermuth (1996), section 7.2

5. Develop regression models.
6. Presentation of model(s): coefficients and SEs, graphical displays.
7. There is no best model.

George Box: All models are wrong, but some are useful!


Other Issues in Model Building

• Multiple comparisons
o If we conduct enough significance tests then we are bound to find a significant association
o Having a systematic plan at least allows a reviewer to understand the risk
o Hilsenbeck, Clark and McGuire (1992). "Why do so many prognostic factors fail to pan out?" Breast Cancer Research and Treatment 22: 197-206
• Multicollinearity
o One or more covariates are highly correlated
o Yields unstable coefficient estimates
• Influential observations
o Check delta-betas

Additional Topics

• Conditional logistic regression
o many strata
o matched data

• Ordinal and polytomous logistic regression

• Analysis with correlated data
o GEE
o hierarchical models (random effects)
