
Logistic Regression, Part II

Logistic Regression: The Multivariate Problem

Q: What is the relationship between one (or more) exposure variables, E, and a binary disease or illness outcome, Y, while adjusting for the potential confounding effects of C1, C2, …?

Example:
• Y is coronary heart disease (CHD): Y = 1 is "with CHD" and Y = 0 is "without CHD".
• Our exposure of interest is smoking: E = 1 for smokers (current or ever) and E = 0 for non-smokers.
• What is the extent of association between smoking and CHD?
• We want to "account for," or control for, other variables (potential confounders) such as age, race and gender.

E, C1, C2, C3 ("independent") ⇒ Y ("dependent")

Spring 2013 Biostat 513 132

Logistic Regression: The Multivariate Problem

• Independent variables: X = X1, X2, …, Xp. We have a flexible choice for the type of independent variables:
  o binary X
  o multi-categorical X
  o continuous X
• Dependent variable: Y, binary.
• We can adopt a mathematical model to structure the systematic variation in the response variable Y as a function of X.
• We can adopt a probability model to represent the random variation in the response.

Recall: Linear regression
Y = β0 + β1X1 + … + βpXp + e,   e ~ N(0, σ²)
Instead, consider the equivalent representation:
Y ~ N(µ(X), σ²),   µ(X) = β0 + β1X1 + … + βpXp

Binary Response Regression

Recall: For binary Y (= 0/1),
µ = E(Y) = P(Y = 1) = π
• The mean of a binary variable is a probability, i.e. π ∈ (0, 1).
• The mean may depend on covariates. This suggests considering:
  π(X) = β0 + β1X1 + … + βpXp
• Can we use linear regression for π(X)? Two issues:
  o If we model the mean for Y we'll need to impose the constraint 0 < π(X) < 1 for all X.
  o What is σ² for binary data?
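The two issues just raised can be made concrete with a few lines of arithmetic. The sketch below is illustrative Python (the course software is Stata), with made-up coefficients: a linear model for π(X) escapes the (0, 1) range, and the binary-outcome variance π(1 − π) changes with π rather than being a constant σ².

```python
# Illustrative sketch (made-up coefficients, not from the slides).

# Issue 1: a linear model pi(X) = b0 + b1*X can leave the (0, 1) range.
b0, b1 = 0.1, 0.05            # hypothetical linear-model coefficients
fitted = [round(b0 + b1 * x, 2) for x in (0, 10, 25)]
print(fitted)                 # [0.1, 0.6, 1.35]: the last is an impossible "probability"

# Issue 2: for binary Y with mean pi, Var(Y) = pi*(1 - pi),
# so the variance is not a single constant sigma^2.
variances = [round(p * (1 - p), 4) for p in (0.1, 0.5, 0.9)]
print(variances)              # largest at pi = 0.5, shrinking toward the boundaries
```

This is why the slides move to a model that constrains the mean to (0, 1) and uses a Bernoulli rather than normal probability model.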
The Logistic Function

The logistic function is given by:
f(z) = exp(z) / (1 + exp(z)) = 1 / (1 + exp(−z))

Properties: an "S"-shaped curve with
lim(z → +∞) f(z) = 1 / (1 + 0) = 1
lim(z → −∞) f(z) = 1 / (1 + ∞) = 0
f(0) = 1/2

[Figure: f(z) plotted for z from −6 to 6, rising from near 0 to near 1.]

The Logistic Regression Model

Define a "linear predictor" by
β0 + β1X1 + … + βpXp
Then the model for π(X) is:
π(X) = exp(β0 + β1X1 + … + βpXp) / (1 + exp(β0 + β1X1 + … + βpXp))
(this is called the "expit" transform)
or, equivalently,
log[ π(X) / (1 − π(X)) ] = β0 + β1X1 + … + βpXp
(this is called the log odds or "logit" transform)

Q: Why is logistic regression so popular?
1. Dichotomous outcomes are common.
2. Logistic regression ensures that predicted probabilities lie between 0 and 1.
3. Regression parameters are log odds ratios and are hence estimable from case-control studies.

Logistic Regression: Some Special Cases – Binary Exposure

Q: What is the logistic regression model for a simple binary exposure variable, E, and a binary disease or illness outcome, D?

Example: Pauling (1971)
                 E=1 (Vit C)   E=0 (placebo)   Total
D (cold = Yes)        17             31           48
D̄ (cold = No)        122            109          231
Total                139            140          279

Binary Exposure

X1: exposure                          Y: outcome
X1 = 1 if in group E (vitamin C)      Y = 1 if in group D (cold)
X1 = 0 if in group Ē (placebo)        Y = 0 if in group D̄ (no cold)

Q: How would we approach these data using logistic regression?

Data (grouped):
Y    X1   count
1     1     17
0     1    122
1     0     31
0     0    109

Model:
logit(π(X)) = β0 + β1X1,   where π(X) is the probability that Y = 1 given X.

From the table:
P̂[Y=1 | X1=1] = 17/139 = 0.122
P̂[Y=1 | X1=0] = 31/140 = 0.221
R̂R = 0.122 / 0.221 = 0.552
ÔR = [0.122 / (1 − 0.122)] / [0.221 / (1 − 0.221)] = 0.490
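The expit/logit pair and the hand calculations from the Pauling table can be checked with a short script. Python is used for illustration (the course itself uses Stata); the function names expit and logit simply follow the slides' terminology.

```python
import math

def expit(z):
    # Logistic function: f(z) = exp(z) / (1 + exp(z))
    return math.exp(z) / (1.0 + math.exp(z))

def logit(p):
    # Log odds: the inverse of expit
    return math.log(p / (1.0 - p))

# Properties from the slide: f(0) = 1/2 and f stays strictly inside (0, 1)
print(expit(0.0))                                 # 0.5
print(round(expit(-6), 4), round(expit(6), 4))    # 0.0025 0.9975

# Pauling (1971) 2x2 table
a, b = 17, 31        # cold:    vitamin C, placebo
c, d = 122, 109      # no cold: vitamin C, placebo

p1 = a / (a + c)     # P(Y=1 | X1=1) = 17/139
p0 = b / (b + d)     # P(Y=1 | X1=0) = 31/140
rr = p1 / p0
orr = (p1 / (1 - p1)) / (p0 / (1 - p0))
print(round(p1, 3), round(p0, 3), round(rr, 3), round(orr, 3))  # 0.122 0.221 0.552 0.49
```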
Binary Exposure – Parameter Interpretation

• Model: logit(π(X1)) = β0 + β1X1
• Probabilities:
  P(Y=1 | X1=0) = π(X1=0) = exp(β0) / (1 + exp(β0))
  P(Y=1 | X1=1) = π(X1=1) = exp(β0 + β1) / (1 + exp(β0 + β1))
• Odds:
  Odds of disease when X1 = 0: exp(β0)
  Odds of disease when X1 = 1: exp(β0 + β1)
• Odds ratio:
  OR = Odds(X1=1) / Odds(X1=0) = exp(β1)
• Log odds ratio: β1

Binary Exposure – Parameter Estimates

Q: How can we estimate the logistic regression model parameters?

In this simple case we could calculate by hand. We know that
π̂(X1=0) = 0.221   and   log[ π̂(X1=0) / (1 − π̂(X1=0)) ] = β̂0,
so
log[ 0.221 / (1 − 0.221) ] = −1.260.
We also know that
π̂(X1=1) = 0.122   and   log[ π̂(X1=1) / (1 − π̂(X1=1)) ] = β̂0 + β̂1,
so
log[ 0.122 / (1 − 0.122) ] = −1.974.
Hence:
β̂0 = −1.260
β̂1 = −1.974 − (−1.260) = −0.713
ÔR = exp(−0.713) = 0.490

Maximum Likelihood Estimation

Q: How can we estimate the logistic regression model parameters?
A: More generally, for models with multiple covariates, the computer implements an estimation method known as "maximum likelihood estimation".
• In simple cases maximum likelihood corresponds with our common-sense estimates, but it applies to complex problems as well.
• Maximum likelihood is the "best" method of estimation for any situation in which you are willing to write down a probability model.
• Generic idea: find the value of the parameter (β) where the log-likelihood function, l(β; data), is maximized.
• We can use computers to find these estimates by maximizing a particular function, known as the likelihood function.
• We use comparisons of the value of the (log) likelihood function as a preferred method for testing whether certain variables (coefficients) are significant (i.e., to test H0: β = 0).
• For multiple parameters β1, β2, …, imagine a likelihood "mountain".

[Figure: the log-likelihood as a function of β, with its maximum at βmax.]
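The hand calculation above can be reproduced directly: β̂0 is the sample log odds in the placebo group, and β̂0 + β̂1 is the sample log odds in the vitamin C group. The Python check below is illustrative (the course software is Stata). Note that using the unrounded proportions gives −1.257 and −0.713, matching the Stata output for these data; the slide's −1.260 and −1.974 arise from first rounding the proportions to three decimals.

```python
import math

# Sample proportions from the Pauling 2x2 table
p0 = 31 / 140    # P^(Y=1 | X1=0), placebo
p1 = 17 / 139    # P^(Y=1 | X1=1), vitamin C

b0 = math.log(p0 / (1 - p0))           # log odds, unexposed
b0_plus_b1 = math.log(p1 / (1 - p1))   # log odds, exposed
b1 = b0_plus_b1 - b0                   # log odds ratio

print(round(b0, 3), round(b1, 3), round(math.exp(b1), 3))  # -1.257 -0.713 0.49
```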
Vitamin C Study Example – STATA

. input count y x
  1. 17 1 1
  2. 31 1 0
  3. 122 0 1
  4. 109 0 0
  5. end

. expand count
(275 observations created)

. logit y x

Logit estimates                             Number of obs   =        279
                                            LR chi2(1)      =       4.87
                                            Prob > chi2     =     0.0273
Log likelihood = -125.6561                  Pseudo R2       =     0.0190

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   -.713447   .3293214    -2.17   0.030    -1.358905   -.0679889
       _cons |  -1.257361   .2035494    -6.18   0.000     -1.65631   -.8584111
------------------------------------------------------------------------------

Alternatives:
. glm y x, family(binomial) link(logit) eform
. binreg y x, or

Logistic Regression – Vitamin C Study Example

. logistic y x

Logistic regression                         Number of obs   =        279
                                            LR chi2(1)      =       4.87
                                            Prob > chi2     =     0.0273
Log likelihood = -125.6561                  Pseudo R2       =     0.0190

------------------------------------------------------------------------------
           y | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .4899524   .1613518    -2.17   0.030     .2569419   .9342709
------------------------------------------------------------------------------

. cs y x, or

                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+----------
           Cases |        17          31  |        48
        Noncases |       122         109  |       231
-----------------+------------------------+----------
           Total |       139         140  |       279
                 |                        |
            Risk |  .1223022    .2214286  |   .172043
                 |                        |
                 |      Point estimate    |  [95% Conf. Interval]
                 |------------------------+----------------------
 Risk difference |        -.0991264       |  -.1868592   -.0113937
      Risk ratio |         .5523323       |   .3209178    .9506203
 Prev. frac. ex. |         .4476677       |   .0493797    .6790822
 Prev. frac. pop |         .2230316       |
      Odds ratio |         .4899524       |   .2588072    .9282861 (Cornfield)
                 +-----------------------------------------------
                             chi2(1) =     4.81  Pr>chi2 = 0.0283

Logistic Regression – Single Binary Predictor

Model: logit[π(X1)] = β0 + β1X1

X1    logit(π(X1))    Probability                           Odds
0     β0              exp(β0) / (1 + exp(β0))               exp(β0)
1     β0 + β1         exp(β0 + β1) / (1 + exp(β0 + β1))     exp(β0 + β1)

• logit(P(Y=1 | X1=0)) = β0
• logit(P(Y=1 | X1=1)) = β0 + β1
• logit(P(Y=1 | X1=1)) − logit(P(Y=1 | X1=0)) = β1
• β1 is the log odds ratio of "success" (Y=1) comparing the group with X1=1 to the group with X1=0, so exp(β1) is the odds ratio that compares the odds of a success in the "exposed" (X1=1) group to the "unexposed" (X1=0) group.
• β0 is the log odds of Y=1 when X1=0; also, expit(β̂0) estimates P(Y=1 | X1=0).
• The logistic regression OR and the simple 2×2 OR are identical: ÔR = exp(β̂1) = ad/bc.

Logistic Regression – Case-Control Studies

• In case-control studies we sample cases (Y=1) and controls (Y=0) and then ascertain covariates (X).
• From this study design we cannot estimate disease risk, P(Y=1|X), nor relative risk, but we can estimate exposure odds ratios.
• Exposure odds ratios are equal to disease odds ratios.
• The result is that we can use case-control data to estimate disease odds ratios, which for rare outcomes approximate relative risks.

Recall: the case-control study design is particularly effective when "disease" is rare.
• If the "disease" affects only 1 person per 10,000 per year, we would need a very large prospective study.
• But if we consider a large urban area of 1,000,000, we would expect to see 100 cases a year.
• We could sample all 100 cases and 100 random controls.
• Sampling fractions, f, for cases and controls are then very different:
  f0 = 100 / 999,900 ≈ .0001 for controls
  f1 = 100 / 100 = 1 for cases

Q: Can one do any regression modeling using data from a case-control study?
A: Yes. One can use standard logistic regression to estimate ORs, but not disease risk probabilities.

Logistic Regression – Case-Control Studies

Key points:
1.

Example 2: Keller (AJPH, 1965)
          Case   Control   Total
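The claim that exposure odds ratios survive case-control sampling can be checked with a toy calculation. The population counts below are hypothetical (not from the slides), chosen so that the disease is rare: sampling every case but only a small fraction of controls scales both control counts by the same factor, which cancels out of the odds ratio while making any apparent "risk" meaningless. Python is used for illustration.

```python
# Hypothetical source population (illustration only):
#                 exposed   unexposed
cases    = [         80,         20]
controls = [    400_000,    599_900]

def odds_ratio(case_row, control_row):
    # OR = (a*d) / (b*c) for a 2x2 table with cases on top
    a, b = case_row
    c, d = control_row
    return (a * d) / (b * c)

or_pop = odds_ratio(cases, controls)

# Case-control sampling: keep every case (f1 = 1),
# sample controls with fraction f0 = 1/10,000
f0 = 1 / 10_000
sampled_controls = [round(n * f0) for n in controls]   # [40, 60]
or_cc = odds_ratio(cases, sampled_controls)

print(or_pop, or_cc)   # equal up to rounding of the sampled counts

# But the apparent "risk" among the exposed in the case-control sample
# is ~0.67, nothing like the true exposed-group risk of 80/400,080:
risk_cc = cases[0] / (cases[0] + sampled_controls[0])
```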