
HST 190: Introduction to Biostatistics

Lecture 6: Methods for binary data

HST 190: Intro to Biostatistics

Binary data

• So far, we have focused on settings where the outcome is continuous
• Now, we consider the setting where our outcome of interest is binary, meaning it takes values 1 or 0.
  § In particular, we consider the 2x2 table tabulating pairs of binary observations $(x_1, y_1), \dots, (x_n, y_n)$

• Consider two populations
  § IV drug users who report sharing needles
  § IV drug users who do not report sharing needles
• Is the rate of positive tuberculin skin test equal in both populations?
  § To address this question, we sample 40 patients who report sharing and 60 patients who do not, and compare rates of positive tuberculin test
  § Data are cross-classified according to these two binary variables

2x2 table               Positive   Negative   Total
Report sharing              12         28        40
Don’t report sharing        11         49        60
Total                       23         77       100

Chi-square test for contingency tables

• The Chi-square test is a test of association between two categorical variables.
• In general, its null and alternative hypotheses are
  § $H_0$: the relative proportions of individuals in each category of variable #1 are the same across all categories of variable #2; that is, the variables are not associated (i.e., statistically independent).
  § $H_1$: the variables are associated
    o Notice the alternative is always two-sided
• In our example, this means
  § $H_0$: reported needle sharing is not associated with PPD (tuberculin skin test) result

• The Chi-square test compares observed counts in the table to the counts expected if there is no association (i.e., under $H_0$)
  § Expected counts are obtained using the marginal totals of the table.
• Recall the independence rule $P(A \cap B) = P(A)\,P(B)$, so from 100 people, assuming independence, we expect

$$P(\text{share} \cap \text{positive}) = P(\text{share})\,P(\text{positive}) = \frac{40}{100} \cdot \frac{23}{100} = 0.092$$

  § Then, we’d expect $0.092 \times 100 = 9.2$ positive sharers, instead of the 12 observed

2x2 table               Positive   Negative   Total
Report sharing              12         28        40
Don’t report sharing        11         49        60
Total                       23         77       100
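The expected-count rule can be sketched in a few lines of Python. This is only an illustration: the function name `expected_counts` and the list-of-rows layout are choices made here, not part of the lecture.

```python
# Observed counts from the needle-sharing example
# (rows: report sharing / don't report; columns: positive / negative).
observed = [[12, 28],
            [11, 49]]

def expected_counts(table):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
for row in expected:
    print([round(e, 1) for e in row])
```

The first entry is the 9.2 positive sharers computed above; the other three cells follow from the same margins.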

• Similarly, there will likely be some discrepancy between observed and expected counts for the other three cells in the table.
  § Chi-square test assesses: are these differences too large to be the result of chance variability?
• Steps of Chi-square test
  1) Complete the observed-data table
  2) Compute table of expected counts
  3) Calculate the $X^2$ statistic
  4) Get p-value from the chi-square table
• This method is valid only if all expected counts are ≥ 5
  § The test relies on an approximation that does not hold in small samples

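1) Complete observed data table
2) Complete table of expected counts. Writing the observed cells as a, b (first row) and c, d (second row), for example

$$E_{11} = \frac{O_{1\cdot} \times O_{\cdot 1}}{n} = \frac{(a+b)(a+c)}{n}$$

Observed              Expected
O11   O12   O1.       E11   E12   E1.
O21   O22   O2.       E21   E22   E2.
O.1   O.2   n         E.1   E.2   n

3) Calculate chi-square statistic

$$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(O_{11}-E_{11})^2}{E_{11}} + \frac{(O_{12}-E_{12})^2}{E_{12}} + \frac{(O_{21}-E_{21})^2}{E_{21}} + \frac{(O_{22}-E_{22})^2}{E_{22}}$$

  § swap $(O_{ij} - E_{ij})^2$ with $(|O_{ij} - E_{ij}| - 0.5)^2$ for Yates continuity correction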

4) Get p-value from chi-square distribution
  § Under null hypothesis $H_0$: no association between the two factors, the $X^2$ statistic follows a chi-square distribution with 1 degree of freedom. This is often written as $X^2 \sim \chi^2_1$
    o continuous and positive-valued, defined by one parameter df
  § p-value comes from right tail, but is inherently ‘two-sided’
    o matlab: 1-chi2cdf(x,1)

[Figure: chi-square density with 1 df; the area to the right of $\chi^2_{1,0.05} = 3.84$ is 0.05]
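The four steps can be sketched in Python for the needle-sharing table. This is a minimal sketch using only the standard library: the function name `chi_square_2x2` is hypothetical, the clamp at zero in the Yates branch is an added safeguard, and the p-value uses the identity $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$, which holds only for 1 degree of freedom.

```python
import math

def chi_square_2x2(a, b, c, d, yates=False):
    """Return (X^2, p-value) for the 2x2 table [[a, b], [c, d]], 1 df."""
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    n = a + b + c + d
    x2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected < 5:
                raise ValueError("expected count < 5: approximation not valid")
            diff = abs(observed[i][j] - expected)
            if yates:
                # Yates continuity correction; max(..., 0) is an assumption
                # added here so tiny differences do not inflate the statistic.
                diff = max(diff - 0.5, 0.0)
            x2 += diff ** 2 / expected
    # Right-tail area of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(x2 / 2))
    return x2, p_value

x2_plain, p_plain = chi_square_2x2(12, 28, 11, 49)
x2_yates, p_yates = chi_square_2x2(12, 28, 11, 49, yates=True)
print(round(x2_plain, 2), round(p_plain, 2))
print(round(x2_yates, 2), round(p_yates, 2))
```

The corrected statistic is smaller than the uncorrected one, so the Yates version is the more conservative of the two.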

• Thus, at the $\alpha$ level, $H_0$ is rejected if $X^2 > \chi^2_{1,\alpha}$
• Using the 2x2 contingency table, an alternate formula for the Yates corrected test statistic is

$$X^2 = \frac{n\left(|ad - bc| - \frac{n}{2}\right)^2}{(a+b)(c+d)(a+c)(b+d)}$$

$$X^2 = \frac{100\left(|12(49) - 28(11)| - 50\right)^2}{(40)(60)(23)(77)} = 1.24 < 3.84 = \chi^2_{1,0.05}$$

• ⇒ Fail to reject $H_0$

2x2 table               Positive       Negative       Total
Report sharing          a = 12         b = 28         a + b = 40
Don’t report sharing    c = 11         d = 49         c + d = 60
Total                   a + c = 23     b + d = 77     n = 100

Fisher’s exact test

What happens if some expected counts are < 5? Instead of the chi-square test, use Fisher’s exact test (see Rosner 10.3)
• Like the chi-square test, Fisher's exact test examines the significance of the association (contingency) between the two kinds of classification – rows and columns.
• Both row and column totals (a+c, b+d, a+b, c+d) are assumed to be fixed, not random.
• We then consider all possible tables that could give the observed row and column totals, and the corresponding probability of each configuration (it helps to realize that the first count, a, has a hypergeometric distribution under the null)
• Finally, the p-value is computed by adding up the probabilities of the tables as extreme or more extreme than the observed one.
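A rough Python sketch of this enumeration follows. Two assumptions are made here rather than taken from the lecture: "as extreme or more extreme" is read as "probability no larger than that of the observed table" (one common two-sided convention), and the small table used in the example is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k
        # when all four margins are held fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    p_observed = prob(a)
    # Sum over tables at least as unlikely as the observed one
    # (small tolerance guards against floating-point ties).
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_observed * (1 + 1e-9))

# Hypothetical small table where every expected count is 2, so < 5.
p_small = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_small, 4))
```

For this table the admissible values of a are 0 to 4, and the tables with a = 0, 1, 3, 4 are all at least as unlikely as the observed one.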

Chi-square test for contingency tables, RxC

What if we are interested in a variable that has more than two categories?

Example: Test for association between eye color and presence or absence of a mutant allele at some genetic locus.

Eye color categories: blue, green, brown, hazel, gray Genetic categories: 0 copies mutant allele, ≥ 1 copy mutant allele

The chi-square test can be used for variables with more than two categories. Data are presented in an RxC table, a generalization of the 2x2 table:

                        blue   green   brown   hazel   gray   Total
Mutant allele absent       3       7      21      15     15      61
Mutant allele present      6      10      18      14     17      65
Total                      9      17      39      29     32     126

R = # rows, C = # columns (doesn’t matter which variable is which)

Chi-square test for RxC table same as for 2x2 table except:
• This method can only be used if no more than 1/5 of cells have expected count < 5 AND if no cell has expected count < 1.

$$X^2 = \frac{(O_{11}-E_{11})^2}{E_{11}} + \frac{(O_{12}-E_{12})^2}{E_{12}} + \dots + \frac{(O_{RC}-E_{RC})^2}{E_{RC}}$$

• Under $H_0$, the $X^2$ test statistic follows a chi-square distribution on (R-1)(C-1) degrees of freedom

$$X^2 \sim \chi^2_{(R-1)(C-1)}$$

Again, we have to obtain marginal totals to determine expected count for each cell. For example…

                        blue    green   brown   hazel   gray    Total
Mutant allele absent    4.36     8.23   18.88   14.04   15.49      61
Mutant allele present   4.64     8.77   20.12   14.96   16.51      65
Total                      9       17      39      29      32     126

The expected counts would be calculated as follows

$$E_{11} = \frac{61 \times 9}{126} = 4.36, \quad \dots, \quad E_{RC} = \frac{65 \times 32}{126} = 16.51$$

$$X^2 = \frac{(3 - 4.36)^2}{4.36} + \frac{(7 - 8.23)^2}{8.23} + \dots + \frac{(17 - 16.51)^2}{16.51} = 1.80$$

• Under $H_0$, $X^2 \sim \chi^2_4$
• MATLAB: 1-chi2cdf(1.8,4)
  p-value = 0.77

Conclusion: No evidence for association between eye color and mutant alleles.
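The eye-color calculation can be checked with a short Python sketch. The function names are hypothetical, and the tail probability uses a closed form that is valid only for even degrees of freedom (here df = 4); a general implementation would need a chi-square CDF from a statistics library.

```python
import math

observed = [[3, 7, 21, 15, 15],    # mutant allele absent
            [6, 10, 18, 14, 17]]   # mutant allele present

def chi_square_rxc(table):
    """Return (X^2, df) for an R x C table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (obs - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, df

def chi2_upper_tail_even_df(x, df):
    """P(chi-square_df > x) for even df: exp(-x/2) * sum_{i<df/2} (x/2)^i / i!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2
    term = total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

x2, df = chi_square_rxc(observed)
p_value = chi2_upper_tail_even_df(x2, df)
print(round(x2, 2), df, round(p_value, 2))
```

The statistic computed from unrounded expected counts differs from the slide's 1.80 only in the second decimal place, because the slide rounds the expected counts first.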

Two-sample comparison of proportions

What if we are interested in estimating and quantifying uncertainty about the difference in proportions between two groups?
• e.g., we want an estimate and CI of the difference in proportions of positive tuberculosis skin tests between needle sharers and non-sharers

The approach is similar to two-sample estimation for continuous data, with subtle differences!


• Whereas we have previously considered the difference in means of continuous two-sample data, we now compare two populations’ unknown proportions $p_1$ and $p_2$.
• Suppose we want to know whether two communities have the same obesity rate.
  § You draw random samples from both; in the first city, 20 out of 100 are obese, while in the second 24 out of 150 are obese.
• Goals:
  § estimate and compute the 95% C.I. for the difference in proportions
  § conduct a significance test at level $\alpha = 0.05$ for a difference

• Before, we saw that if a random experiment has two possible outcomes, “success” and “failure”, and we do $n$ independent repetitions with identical success probability $p$, then $X \sim \text{Bin}(n, p)$ is the number of successes.
  § Now, we observe $X_1 \sim \text{Bin}(n_1, p_1)$ and $X_2 \sim \text{Bin}(n_2, p_2)$ and then make inference about $p_1 - p_2$.
• Estimation is identical to the two-sample continuous case: difference of sample proportions, $\hat{p}_1 - \hat{p}_2$
• If $n_1\hat{p}_1(1 - \hat{p}_1) \ge 5$ and $n_2\hat{p}_2(1 - \hat{p}_2) \ge 5$, the associated $100(1 - \alpha)\%$ CI is given by

$$\hat{p}_1 - \hat{p}_2 \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$

• For example, consider two samples
  § $n_1 = 100$, $x_1 = 20$, $\hat{p}_1 = \frac{20}{100} = 0.20$, $n_1\hat{p}_1(1-\hat{p}_1) = 16 \ge 5$
  § $n_2 = 150$, $x_2 = 24$, $\hat{p}_2 = \frac{24}{150} = 0.16$, $n_2\hat{p}_2(1-\hat{p}_2) = 20.16 \ge 5$
• Then the 95% CI for the difference is

$$(0.20 - 0.16) \pm 1.96\sqrt{\frac{0.2(0.8)}{100} + \frac{0.16(0.84)}{150}} = 0.04 \pm 1.96(0.050) = 0.04 \pm 0.10 = (-0.06, 0.14)$$
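The interval above can be reproduced with a small Python sketch. The function name `diff_proportion_ci` is illustrative, and the critical value 1.96 is hard-coded as in the worked example rather than looked up.

```python
import math

def diff_proportion_ci(x1, n1, x2, n2, z=1.96):
    """Estimate and normal-approximation CI for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    # Sample-size condition stated in the lecture.
    if n1 * p1 * (1 - p1) < 5 or n2 * p2 * (1 - p2) < 5:
        raise ValueError("normal approximation condition not met")
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, (diff - z * se, diff + z * se)

diff, (lower, upper) = diff_proportion_ci(20, 100, 24, 150)
print(round(diff, 2), round(lower, 2), round(upper, 2))
```

Because the interval contains 0, it is consistent with no difference in obesity rates between the two cities.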

Hypothesis testing for difference of proportions

• Now, consider $H_0: p_1 = p_2$ versus $H_1: p_1 \neq p_2$
  § Under $H_0$, we can pool the two samples to calculate a common proportion, letting $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$
• Then, if $n_1\hat{p}(1 - \hat{p}) \ge 5$ and $n_2\hat{p}(1 - \hat{p}) \ge 5$, under $H_0$ we form the Z-test statistic

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

• It has an approximate N(0,1) distribution when the null is true.

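• Continuing the same example,
  § $n_1 = 100$, $x_1 = 20$, $\hat{p}_1 = \frac{20}{100} = 0.20$
  § $n_2 = 150$, $x_2 = 24$, $\hat{p}_2 = \frac{24}{150} = 0.16$
  § $\hat{p} = \frac{20 + 24}{100 + 150} = \frac{44}{250} = 0.176$, so $n_1\hat{p}(1-\hat{p}) = 14.5 \ge 5$ and $n_2\hat{p}(1-\hat{p}) = 21.75 \ge 5$
• Test statistic is then

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.20 - 0.16}{\sqrt{0.176(0.824)\left(\frac{1}{100} + \frac{1}{150}\right)}} = 0.81$$

• From table or MATLAB, $P(Z > 0.81) = 0.21$, so the p-value is $2(0.21) = 0.42 > 0.05$ ⇒ do not reject $H_0$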
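A minimal Python sketch of the pooled Z-test, under the same assumptions as the slide. The function name is hypothetical; the two-sided p-value uses the standard-library identity $P(|Z| > z) = \operatorname{erfc}(z/\sqrt{2})$ instead of a table.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-sample Z-test of H0: p1 = p2; returns (Z, two-sided p)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    # Condition from the lecture, evaluated at the pooled proportion.
    if n1 * p_pool * (1 - p_pool) < 5 or n2 * p_pool * (1 - p_pool) < 5:
        raise ValueError("normal approximation condition not met")
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

z_stat, p_value = two_proportion_z_test(20, 100, 24, 150)
print(round(z_stat, 2), round(p_value, 2))
```

Note that the standard error here uses the pooled proportion, whereas the confidence interval uses the two separate sample proportions.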

Odds ratio and relative risk

• Chi-square tests for contingency tables allow us to test for association between two categorical variables.
  § “Is there statistical evidence of an association between daily aspirin and peptic ulcer disease?”
• How do we estimate the magnitude of the association between two categorical variables?
  § “How much higher is the rate of peptic ulcer disease among daily aspirin users?”

• Consider two categorical variables:
  § “disease” vs “no disease”
  § “exposure” vs “no exposure”
• “Exposure” could be a treatment, risk factor, or other factor
  § no assumption is made about whether it increases or decreases disease risk
• Prospective study: Suppose for now that we enroll patients based on exposure status (vs. based on disease status)
  § e.g., 100 smokers and 100 nonsmokers

Measures of Effect for Categorical Data

After we sample a specified number of exposed and unexposed individuals, we classify them by disease status as shown below

                Disease +   Disease -   Total
Exposure +          a           b        a+b
Exposure -          c           d        c+d
Total              a+c         b+d        n

Three ways to quantify magnitude of association:
1. Risk difference (RD) = same as difference of proportions
2. Relative risk (RR) or ‘risk ratio’
3. Odds ratio

Risk Difference

Risk Difference = p1 – p2, where

p1 = P(disease | exposed)

p2 = P(disease | unexposed)

With a, b the exposed counts with and without disease, and c, d the unexposed counts with and without disease:

$$\text{estimated Risk Difference} = \frac{a}{a+b} - \frac{c}{c+d}$$

Risk Ratio

$$\text{Relative Risk (Risk Ratio)} = \frac{p_1}{p_2}, \qquad \text{estimated Relative Risk} = \frac{a/(a+b)}{c/(c+d)}$$

Risk Difference vs. Ratio

Suppose that you enroll 100 smokers and 100 nonsmokers in your study:

            disease +   disease -   Total
smoke +         30          70       100
smoke -         15          85       100
Total           45         155       200

$$\text{Risk difference} = \frac{30}{100} - \frac{15}{100} = 0.15, \qquad \text{Relative risk} = \frac{30/100}{15/100} = 2$$

Complicating factors

Measuring association: why does it get more complicated?
• Time
  § We often measure a rate ratio instead of a risk ratio
  § More on this aspect later
• Effect modification and confounding
  § Our estimates typically need to be adjusted for other factors
• Sampling
  § Depending on how you enroll patients in your study, it may not be possible to estimate a risk difference or risk ratio even in principle

Risk Difference vs. Ratio

Suppose you conduct a case-control study by enrolling 100 patients with disease and 100 without, and then determine which have smoked:

            disease +   disease -   Total
smoke +         25          10        35
smoke -         75          90       165
Total          100         100       200

• Can’t estimate p1 & p2 if you pre-specify the number of subjects with disease → can’t estimate RD or RR.
• Need to know how data in your table were sampled!

Retrospective sampling

• A case-control study (or retrospective study) samples patients based on disease status, then classifies them according to exposure

• Case-control studies are often performed for cost and efficiency, particularly when the disease or outcome is rare – no need to follow subjects through their entire lifetime and collect huge samples.

• There is a measure of effect size that can be computed regardless of whether patients are enrolled based on exposure status or disease status…

Odds

• If $p = P(\text{event})$, then define the odds of the event as $\text{odds} = \frac{p}{1-p}$
  § Probability = 0.2 ⇒ Odds = 0.2/0.8 = 0.25
  § Probability = 0.5 ⇒ Odds = 1
  § Probability = 0.75 ⇒ Odds = 0.75/0.25 = 3
  § Probability = 0.99 ⇒ Odds = 0.99/0.01 = 99
• Odds can range from 0 to infinity
  § When we randomly sample patients based on exposure status, we can estimate $P(\text{disease}|\text{exposed})$ and $P(\text{disease}|\text{unexposed})$
  § If we instead perform a case-control study, we can’t. We can only estimate $P(\text{exposed}|\text{disease})$ and $P(\text{exposed}|\text{no disease})$
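The probability-to-odds conversion is a one-liner; a small Python sketch with illustrative function names is given here, together with the inverse map.

```python
def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)

def prob_from_odds(o):
    """Inverse map: probability corresponding to odds o >= 0."""
    return o / (1 + o)

for p in (0.2, 0.5, 0.75, 0.99):
    print(p, round(odds(p), 2))
```

The loop reproduces the four conversions listed above and shows how quickly the odds grow as p approaches 1.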

Odds ratio

Imagine a table showing all individuals in the population (the table you “wish” you could see)

                Disease +   Disease -   Total
Exposure +          a           b        a+b
Exposure -          c           d        c+d
Total              a+c         b+d        n

Let $p_1 = P(\text{disease}|\text{exposed})$ and $p_2 = P(\text{disease}|\text{unexposed})$, then the ratio of both exposure groups’ odds of disease is:

$$OR = \frac{\text{Odds of disease for exposed}}{\text{Odds of disease for unexposed}} = \frac{p_1/(1 - p_1)}{p_2/(1 - p_2)} = \frac{\dfrac{a/(a+b)}{b/(a+b)}}{\dfrac{c/(c+d)}{d/(c+d)}} = \frac{ad}{bc}$$

If we instead consider $P(\text{exposed}|\text{disease})$ and $P(\text{exposed}|\text{no disease})$, with cells a, b, c, d as in the table above, then the ratio of both disease groups’ odds of exposure is:

$$OR = \frac{\text{Odds of exposure for diseased}}{\text{Odds of exposure for nondiseased}} = \frac{\dfrac{a/(a+c)}{c/(a+c)}}{\dfrac{b/(b+d)}{d/(b+d)}} = \frac{ad}{bc}$$

Therefore, the OR is a measure of association that is numerically identical in either study design.
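The equivalence of the two derivations can be checked numerically. The sketch below uses the case-control smoking table from earlier in the lecture; the function names are illustrative only.

```python
def odds(p):
    return p / (1 - p)

def or_from_disease_odds(a, b, c, d):
    """Odds of disease among exposed divided by odds among unexposed."""
    return odds(a / (a + b)) / odds(c / (c + d))

def or_from_exposure_odds(a, b, c, d):
    """Odds of exposure among diseased divided by odds among non-diseased."""
    return odds(a / (a + c)) / odds(b / (b + d))

def or_cross_product(a, b, c, d):
    """Shortcut formula ad / bc."""
    return a * d / (b * c)

# Case-control smoking table: a=25, b=10 (smokers), c=75, d=90 (nonsmokers).
table = (25, 10, 75, 90)
print(or_from_disease_odds(*table),
      or_from_exposure_odds(*table),
      or_cross_product(*table))
```

All three routes give the same number (up to floating-point rounding), even though in a case-control sample only the exposure odds are directly estimable.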

• Therefore, sampling by exposure, estimating $p_1$ and $p_2$, and computing the odds ratio estimates the same quantity as the odds ratio (of “exposure probabilities”) in a case-control study.
• So what if RR is of interest?
  § If disease is rare, $p_1$, $p_2$ are small, so $p/(1 - p) \approx p$ for small $p$ and $1 - p \approx 1$ ⇒

$$OR = \frac{p_1/(1 - p_1)}{p_2/(1 - p_2)} \approx \frac{p_1}{p_2} = RR$$

  § OR approximates RR for rare outcome

[Figure: $p/(1 - p)$ plotted against $p$, for $p$ from 0 to 0.8]

Takeaways

• Cannot estimate RR and RD in a case-control study (unless you have additional data).
• Can estimate odds ratio from either “prospective” or case-control study, and we estimate it the same way in either one.
• Odds ratio approximates RR for rare disease.
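The last takeaway can be illustrated numerically. In the sketch below the common-disease risks come from the smoking example (30% vs. 15%), while the rare-disease risks (0.2% vs. 0.1%) are hypothetical values chosen only to show the approximation.

```python
def odds(p):
    return p / (1 - p)

def relative_risk(p1, p2):
    return p1 / p2

def odds_ratio(p1, p2):
    return odds(p1) / odds(p2)

# Common outcome (smoking example) vs. a hypothetical rare outcome.
rr_common, or_common = relative_risk(0.30, 0.15), odds_ratio(0.30, 0.15)
rr_rare, or_rare = relative_risk(0.002, 0.001), odds_ratio(0.002, 0.001)

print(round(rr_common, 3), round(or_common, 3))
print(round(rr_rare, 3), round(or_rare, 3))
```

Both settings have RR = 2, but the OR is noticeably larger than the RR when the outcome is common and nearly equal to it when the outcome is rare.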

Interpreting odds ratio

• Difficult to give an “everyday” interpretation of what the odds ratio’s precise value means
• $OR > 1$ → exposure associated with higher disease risk
• $OR < 1$ → exposure associated with lower disease risk
• $OR = 1$ → no association of exposure and disease status

Inference on odds ratio

• To perform a hypothesis test or generate a CI for OR, we
  1) Compute logarithm of estimated OR [ln(OR)]
  2) Make inference on ln(OR)
  3) Translate conclusions into statements about OR
• Why the log of the OR?
  § The sampling distribution of ln(OR) approximates a normal distribution more closely than that of OR itself
    o Hence, methods based on normal approximation work better for ln(OR)
  § To see this, compare sampling distributions of OR vs. ln(OR): on the next slide we simulate a population with fixed rates of exposure and disease. For three different sample sizes, we randomly draw 1,000 samples and compute OR and ln(OR) for each

[Figure: histograms of simulated OR (left) and ln(OR) (right), one row per sample size]

Code to recreate in Matlab

Sample_Size = [50,200,1000];  % Define the sample sizes
Prob1 = 0.75;                 % Binomial success probability for X
Prob2 = 0.5;                  % Binomial success probability for U
figure;

for i = 1:length(Sample_Size)
    % Each column is one simulated sample of Sample_Size(i) Bernoulli draws
    % (binornd requires the Statistics and Machine Learning Toolbox)
    X = binornd(1,Prob1,Sample_Size(i),10000);  % 10,000 simulated samples of X
    U = binornd(1,Prob2,Sample_Size(i),10000);  % 10,000 simulated samples of U

    % Odds ratio for each simulated sample (column sums count successes and failures)
    OR = (sum(X,1).*sum(1-U,1))./(sum(U,1).*sum(1-X,1));
    LOR = log(OR);  % Log of the odds ratio

    % Left panel: histogram of the odds ratio
    subplot(length(Sample_Size),2,2*i-1);
    hist(OR,20); xlim([min(OR) max(OR)]);
    xlabel('Odds Ratio'); ylabel(['Sample Size ' num2str(Sample_Size(i))]);

    % Right panel: histogram of the log odds ratio
    subplot(length(Sample_Size),2,2*i);
    hist(LOR,20); xlim([min(LOR) max(LOR)]);
    xlabel('Log Odds Ratio');
end

% Title for the whole figure (sgtitle needs R2018b or later; older releases can use suptitle)
sgtitle('Odds Ratio Demonstration');

Confidence interval for OR

• If the expected count in each cell of the 2x2 table is ≥ 5, then the sample estimate of the true population ln(OR) approximately follows the distribution

$$\ln(\widehat{OR}) \sim N\left(\ln(OR),\ \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}\right)$$

  where a, b, c, d are the four cell counts of the exposure-by-disease table
• Another way of writing this result is

$$\text{Var}\left[\ln(\widehat{OR})\right] \approx \frac{1}{n_1\hat{p}_1(1 - \hat{p}_1)} + \frac{1}{n_2\hat{p}_2(1 - \hat{p}_2)}$$

• Therefore, to get a $100(1 - \alpha)\%$ CI for the population OR we use a two-step process:

  1) CI for ln(OR): $\ln(\widehat{OR}) \pm z_{1-\alpha/2}\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}} = (c_1, c_2)$
  2) CI for OR: $(e^{c_1}, e^{c_2})$

• Importantly, the CI is not symmetric around the estimated OR

• Consider an outbreak of gastroenteritis in a school following lunch. 263 students ate lunch in the cafeteria that day. Sandwiches are suspected
  § How strong is the association, if any, between consumption of the sandwich and illness? Provide a 95% CI for the odds ratio

                        Ill? Yes   Ill? No   Total
Ate sandwich? Yes          109       116      225
Ate sandwich? No             4        34       38
Total                      113       150      263

  § $\widehat{OR} = \frac{ad}{bc} = \frac{109(34)}{116(4)} = 7.99$ ⇒ $\ln(\widehat{OR}) = \ln 7.99 = 2.078$
  § Step 1: $2.078 \pm 1.96\sqrt{\frac{1}{109} + \frac{1}{116} + \frac{1}{4} + \frac{1}{34}} = (1.01, 3.146)$
  § Step 2: 95% CI for OR is $(e^{1.01}, e^{3.146}) = (2.75, 23.2)$
  § Because the CI does not contain 1, reject the null of no association at the 0.05 level
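The two-step interval can be sketched in Python for the sandwich data. The function name `odds_ratio_ci` is hypothetical and the critical value 1.96 is hard-coded for a 95% interval.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Estimated OR and CI built on the log scale, then exponentiated."""
    or_hat = a * d / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    log_lower = math.log(or_hat) - z * se_log
    log_upper = math.log(or_hat) + z * se_log
    return or_hat, (math.exp(log_lower), math.exp(log_upper))

# a = ate and ill, b = ate and not ill, c = did not eat and ill, d = neither.
or_hat, (lower, upper) = odds_ratio_ci(109, 116, 4, 34)
print(round(or_hat, 2), round(lower, 2), round(upper, 1))
```

The interval is symmetric around ln(OR) but not around the OR itself: the upper limit lies much further from the estimate than the lower limit does.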

Multiple 2x2 tables

• What if we have a confounding variable associated with exposure and outcome, such that there are several 2x2 tables, each corresponding to one level of the confounding variable?
• Can we pool the counts in the tables into one table?
  § Not so fast. This can seriously bias our results…

• For example, Percutaneous Nephrolithotomy (PN) was compared with several other procedures, classified as “open” procedures (OP), for treatment of renal calculi

         Successful   Unsuccessful   Total
PN           289            61        350
OP           273            77        350
Total        562           138        700

• Percutaneous treatment clearly looks superior; the estimated odds ratio for success based on having (vs. not having) percutaneous treatment is

$$\widehat{OR} = \frac{289(77)}{61(273)} = 1.34 > 1$$

  § 289/350 = 0.826 chance of success for PN
  § 273/350 = 0.780 chance of success for OP

• However, if results are stratified based on stone size, percutaneous treatment looks worse!
  § Large stones: $\widehat{OR} = \frac{55(71)}{25(192)} = 0.81 < 1$
  § Small stones: $\widehat{OR} = \frac{234(6)}{36(81)} = 0.48 < 1$

All stones
         Suc.   Unsuc.   Total
PN        289      61     350
OP        273      77     350
Total     562     138     700

Large stones                        Small stones
         Suc.   Unsuc.   Total               Suc.   Unsuc.   Total
PN         55      25      80       PN        234      36     270
OP        192      71     263       OP         81       6      87
Total     247      96     343       Total     315      42     357

• Percutaneous treatment is associated with a higher success rate (OR > 1) overall, yet with a lower success rate (OR < 1) for each type of stone separately
  § How is that possible?
• This is the result of confounding by a factor associated with both the treatment and the outcome (what is it?)
  § PN was used mostly for small stones, which had a higher success rate in general (88%). OPs were used mostly for large stones, which had a lower success rate (72%)
  § Pooling the data allowed the stone-size effect to mask the difference in treatment effectiveness
• Confounding may occur whenever there is a factor that is associated with both treatment assignment and outcome
  § Confounding that leads to the opposite conclusion in aggregated data is called Simpson’s Paradox (closely related to the ecological fallacy).
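The reversal can be verified directly from the stratified counts. The following Python sketch (names are illustrative) pools the two stone-size tables and compares the pooled and stratum-specific odds ratios.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio ad / bc."""
    return a * d / (b * c)

# (PN success, PN failure, OP success, OP failure) for each stratum.
large = (55, 25, 192, 71)
small = (234, 36, 81, 6)

# Pooling simply adds the stratum tables cell by cell.
pooled = tuple(x + y for x, y in zip(large, small))

or_pooled = odds_ratio(*pooled)
or_large = odds_ratio(*large)
or_small = odds_ratio(*small)
print(pooled, round(or_pooled, 2), round(or_large, 2), round(or_small, 2))

# Share of each treatment's patients who had small stones.
pn_small_share = (small[0] + small[1]) / (pooled[0] + pooled[1])
op_small_share = (small[2] + small[3]) / (pooled[2] + pooled[3])
print(round(pn_small_share, 2), round(op_small_share, 2))
```

The last two numbers show the imbalance driving the paradox: most PN patients had small stones, while most OP patients had large ones.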

• No statistical procedure “automatically” protects you from confounding. Adjustment for confounding requires understanding of the science
• After a study is conducted, certain statistical techniques can be used to adjust for it (discussed over next two lectures)
  § Stratification
  § Matching
  § (Logistic) Regression adjustment

Stratification

• If you stratify data into multiple 2x2 tables (strata) based on a confounder, and believe they share a common OR, you can estimate this OR using the Mantel-Haenszel Method (MH)
• This method is valid if the relationship between exposure and disease is the same in each stratum (even though baseline risk may differ)
  § If the relationship is not the same in each stratum, then it does not make sense to combine the data for doing inference
• Follow two steps:
  1) Test whether the ORs are the same in each stratum
  2) If so, proceed with inference for the common OR, using all the tables

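Chi-square test for homogeneity

• To see if the ORs are the same in each stratum, we use the chi-square test for homogeneity
• Given $k$ strata (tables), we test the hypotheses
  § $H_0$: $OR_1 = OR_2 = \dots = OR_k$ (homogeneity)
  § $H_1$: at least one of the ORs is different
• Test statistic is

$$X^2_{HOM} = \sum_{i=1}^{k} w_i\left(\ln \widehat{OR}_i - \overline{\ln OR}\right)^2$$

  § $w_i = \left(\frac{1}{a_i} + \frac{1}{b_i} + \frac{1}{c_i} + \frac{1}{d_i}\right)^{-1}$, $\quad \overline{\ln OR} = \frac{\sum_i w_i \ln \widehat{OR}_i}{\sum_i w_i}$
  § Under the null, $X^2_{HOM} \sim \chi^2_{k-1}$
• If we reject $H_0$, stop here. Otherwise, estimate common OR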

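• In the renal calculi example, test of homogeneity by stone size
  § Large stones: $\ln \widehat{OR}_1 = \ln\frac{55(71)}{25(192)} = -0.206$
    o $w_1 = \left(\frac{1}{55} + \frac{1}{25} + \frac{1}{192} + \frac{1}{71}\right)^{-1} = 12.91$
  § Small stones: $\ln \widehat{OR}_2 = \ln\frac{234(6)}{36(81)} = -0.731$
    o $w_2 = \left(\frac{1}{234} + \frac{1}{36} + \frac{1}{81} + \frac{1}{6}\right)^{-1} = 4.74$
  § $\overline{\ln OR} = \frac{12.91(-0.206) + 4.74(-0.731)}{12.91 + 4.74} = -0.347$

$$X^2_{HOM} = 12.91(-0.206 + 0.347)^2 + 4.74(-0.731 + 0.347)^2 = 0.956 < 3.84 = \chi^2_{1,0.05}$$

  § We fail to reject the null hypothesis that the odds ratios are equal, and continue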
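A Python sketch of the homogeneity test for the renal calculi strata follows. The function name is hypothetical, and the p-value line assumes exactly two strata, so that the statistic has 1 degree of freedom and the `erfc` identity for $\chi^2_1$ applies.

```python
import math

# (a, b, c, d) = (PN success, PN failure, OP success, OP failure)
strata = [(55, 25, 192, 71),   # large stones
          (234, 36, 81, 6)]    # small stones

def homogeneity_test(strata):
    """Return (X^2, df, weighted mean log OR) for the test of equal ORs."""
    log_ors = [math.log(a * d / (b * c)) for a, b, c, d in strata]
    # Inverse-variance weights for each stratum's log odds ratio.
    weights = [1 / (1 / a + 1 / b + 1 / c + 1 / d) for a, b, c, d in strata]
    mean_log_or = sum(w * l for w, l in zip(weights, log_ors)) / sum(weights)
    x2 = sum(w * (l - mean_log_or) ** 2 for w, l in zip(weights, log_ors))
    return x2, len(strata) - 1, mean_log_or

x2_hom, df, mean_log_or = homogeneity_test(strata)
# Valid only because df == 1 here (two strata).
p_value = math.erfc(math.sqrt(x2_hom / 2))
print(round(x2_hom, 2), df, round(mean_log_or, 3), round(p_value, 2))
```

Working with unrounded weights gives a statistic of about 0.95, marginally below the 0.956 obtained from rounded intermediate values; the conclusion is unchanged.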

Mantel-Haenszel odds ratio estimator

• If we conclude homogeneity across strata, then the Mantel-Haenszel Estimator of the common Odds Ratio is

$$\widehat{OR}_{MH} = \frac{\sum_i a_i d_i / n_i}{\sum_i b_i c_i / n_i}$$

• We can now use hypothesis tests and confidence intervals for the common OR (via the ln(OR)). First, check that
  § $\sum_i (a_i + b_i)(a_i + c_i)/n_i \ge 5$
  § $\sum_i (a_i + b_i)(b_i + d_i)/n_i \ge 5$
  § $\sum_i (c_i + d_i)(a_i + c_i)/n_i \ge 5$
  § $\sum_i (c_i + d_i)(b_i + d_i)/n_i \ge 5$

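• Under these conditions, the $100(1 - \alpha)\%$ CI for ln(OR) is

$$\ln(\widehat{OR}_{MH}) \pm z_{1-\alpha/2}\sqrt{\frac{1}{\sum_i w_i}} = (c_1, c_2)$$

  § Where $w_i = \left(\frac{1}{a_i} + \frac{1}{b_i} + \frac{1}{c_i} + \frac{1}{d_i}\right)^{-1}$
• The CI for the OR is then $(e^{c_1}, e^{c_2})$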

Hypothesis testing for MH

• Finally, we may wish to test the null hypothesis of no association between two variables, controlling for a confounder: $H_0: OR = 1$ versus $H_1: OR \neq 1$
• To do the test, we need to calculate 3 quantities:
  § $O = \sum_i O_i = \sum_i a_i$
  § $E = \sum_i E_i = \sum_i \frac{(a_i + b_i)(a_i + c_i)}{n_i}$
  § $V = \sum_i V_i = \sum_i \frac{(a_i + b_i)(c_i + d_i)(a_i + c_i)(b_i + d_i)}{n_i^2 (n_i - 1)}$ (must be ≥ 5)
• $X^2_{MH} = \frac{(|O - E| - 0.5)^2}{V}$, which follows a $\chi^2_1$ distribution if $H_0$ is true

• Returning to the renal calculi example,

$$\widehat{OR}_{MH} = \frac{\frac{55(71)}{343} + \frac{234(6)}{357}}{\frac{25(192)}{343} + \frac{36(81)}{357}} = 0.69$$

  § compromise between the two stratum-specific ORs (0.81 and 0.48)
• To compute the 95% CI, first verify the conditions given previously (they are messy to show, but in this case met)

$$\ln(\widehat{OR}_{MH}) \pm 1.96\sqrt{\frac{1}{12.91 + 4.74}} = (-0.84, 0.10)$$

• Thus, the 95% CI for OR is $(e^{-0.84}, e^{0.10}) = (0.43, 1.10)$
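The Mantel-Haenszel calculations for this example can be sketched in Python as follows. The function names are hypothetical; the test uses the standard Mantel-Haenszel expressions for O, E and V with a 0.5 continuity correction, and the p-value again relies on the `erfc` identity for 1 degree of freedom.

```python
import math

# (a, b, c, d) = (PN success, PN failure, OP success, OP failure)
strata = [(55, 25, 192, 71),   # large stones
          (234, 36, 81, 6)]    # small stones

def mh_odds_ratio(strata):
    """Mantel-Haenszel estimate of the common odds ratio."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

def mh_confidence_interval(strata, z=1.96):
    """CI for the common OR, built on the log scale with weights w_i."""
    weights = [1 / (1 / a + 1 / b + 1 / c + 1 / d) for a, b, c, d in strata]
    half_width = z * math.sqrt(1 / sum(weights))
    log_or = math.log(mh_odds_ratio(strata))
    return math.exp(log_or - half_width), math.exp(log_or + half_width)

def mh_test(strata):
    """Mantel-Haenszel chi-square test of H0: common OR = 1 (1 df)."""
    O = E = V = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        O += a
        E += (a + b) * (a + c) / n
        V += (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))
    x2 = (abs(O - E) - 0.5) ** 2 / V
    return x2, math.erfc(math.sqrt(x2 / 2))

or_mh = mh_odds_ratio(strata)
ci_lower, ci_upper = mh_confidence_interval(strata)
x2_mh, p_mh = mh_test(strata)
print(round(or_mh, 2), round(ci_lower, 2), round(ci_upper, 2))
print(round(x2_mh, 2), round(p_mh, 2))
```

The interval contains 1 and the test statistic is below 3.84, so after adjusting for stone size the data do not show a significant difference between PN and OP at the 0.05 level.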
