HST 190: Introduction to Biostatistics
Lecture 6: Methods for binary data
Binary data
• So far, we have focused on settings where the outcome is continuous
• Now, we consider the setting where our outcome of interest is binary, meaning it takes the value 1 or 0
  § In particular, we consider the 2x2 contingency table tabulating pairs of binary observations $(X_1, Y_1), \dots, (X_n, Y_n)$
• Consider two populations
  § IV drug users who report sharing needles
  § IV drug users who do not report sharing needles
• Is the rate of positive tuberculin skin tests equal in both populations?
  § To address this question, we sample 40 patients who report sharing and 60 patients who do not, and compare their rates of positive tuberculin tests
  § Data are cross-classified according to these two binary variables

2x2 table              Positive   Negative   Total
Report sharing               12         28      40
Don’t report sharing         11         49      60
Total                        23         77     100
Chi-square test for contingency tables
• The chi-square test is a test of association between two categorical variables.
• In general, its null and alternative hypotheses are
  § $H_0$: the relative proportions of individuals in each category of variable #1 are the same across all categories of variable #2; that is, the variables are not associated (i.e., statistically independent)
  § $H_1$: the variables are associated
    o Notice the alternative is always two-sided
• In our example, this means
  § $H_0$: reported needle sharing is not associated with the PPD (tuberculin skin test) result
  § $H_1$: reported needle sharing is associated with the PPD result
• The chi-square test compares the observed counts in the table to the counts expected if there is no association (i.e., under $H_0$)
  § Expected counts are obtained using the marginal totals of the table
• Recall the independence rule $P(A \cap B) = P(A)\,P(B)$, so from 100 people, assuming independence, we expect
  $$P(\text{share} \cap \text{positive}) = P(\text{share})\,P(\text{positive}) = \frac{40}{100} \cdot \frac{23}{100} = 0.092$$
  § Then, we’d expect $0.092 \times 100 = 9.2$ positive sharers, instead of the 12 observed
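The expected-count calculation above can be sketched in a few lines of Python. This is only an illustration under the independence assumption: the function name `expected_counts` and the list-of-rows layout are choices made here, not part of the lecture.

```python
def expected_counts(observed):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Needle-sharing table from the lecture:
# rows = report / don't report sharing, columns = positive / negative test.
observed = [[12, 28], [11, 49]]
expected = expected_counts(observed)

for row in expected:
    print([round(e, 1) for e in row])
```

The top-left entry reproduces the 9.2 expected positive sharers, and the other three cells follow from the same row-total times column-total rule.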
• Similarly, there will likely be some discrepancy between the observed and expected counts for the other three cells in the table.
  § The chi-square test assesses: are these differences too large to be the result of sampling variability?
• Steps of the chi-square test
  1) Complete the observed-data table
  2) Compute the table of expected counts
  3) Calculate the $X^2$ statistic
  4) Get the p-value from the chi-square table
• This method is valid only if all expected counts are ≥ 5
  § The test relies on an approximation that does not hold in small samples
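1) Complete the observed-data table
2) Complete the table of expected counts
   $$E_{11} = \frac{O_{1\cdot} \times O_{\cdot 1}}{n} = \frac{(O_{11} + O_{12})(O_{11} + O_{21})}{n}$$

   Observed                  Expected
   O11   O12   O1.           E11   E12   E1.
   O21   O22   O2.           E21   E22   E2.
   O.1   O.2   n             E.1   E.2   n

3) Calculate the chi-square test statistic
   $$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}}$$
   § Swap $(O_{ij} - E_{ij})$ with $(|O_{ij} - E_{ij}| - 0.5)$ for the Yates continuity correction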
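As a rough sketch of step 3, the statistic can be accumulated cell by cell, with an optional flag for the Yates correction. The function name `chi_square_statistic` and the `yates` flag are hypothetical names chosen for this example; the correction is applied exactly as described above (replace the difference by its absolute value minus 0.5).

```python
def chi_square_statistic(observed, yates=False):
    """Sum of (observed - expected)^2 / expected over all cells of the table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    stat = 0.0
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            e = r * c / n                      # expected count for cell (i, j)
            diff = abs(observed[i][j] - e)     # |O - E|
            if yates:
                diff -= 0.5                    # Yates continuity correction
            stat += diff ** 2 / e
    return stat


# Needle-sharing table from the lecture.
table = [[12, 28], [11, 49]]
x2_plain = chi_square_statistic(table)
x2_yates = chi_square_statistic(table, yates=True)
print(round(x2_plain, 2), round(x2_yates, 2))
```

For the needle-sharing data, the uncorrected statistic is about 1.84 and the Yates-corrected one about 1.24; the correction always pulls the statistic toward zero.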
4) Get the p-value from the chi-square distribution
   § Under the null hypothesis $H_0$: no association between the two factors, the $X^2$ statistic follows a chi-square distribution with 1 degree of freedom. This is often written as $X^2 \sim \chi^2_1$
     o The chi-square distribution is continuous and positive-valued, defined by one parameter, df
   § The p-value comes from the right tail, but is inherently ‘two-sided’
     o matlab: 1-chi2cdf(x,1)
   [Figure: chi-square density with 1 df; the area to the right of $\chi^2_{1,0.95} = 3.84$ is 0.05]
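For readers without MATLAB, the right-tail area for 1 degree of freedom can be sketched with Python's standard library alone. The sketch relies on the fact that a chi-square variable with 1 df is the square of a standard normal Z, which is also why the right-tail p-value is inherently two-sided. The function name `chi2_sf_1df` is made up for this example, and the formula applies only when df = 1.

```python
import math


def chi2_sf_1df(x):
    """Right-tail area P(X^2 > x) for a chi-square distribution with 1 df.

    With 1 df, X^2 = Z^2 for a standard normal Z, so
    P(X^2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2))


# Area to the right of the critical value 3.84 quoted in the lecture.
print(round(chi2_sf_1df(3.84), 3))

# Right-tail area for a statistic of 1.24 (the Yates-corrected value
# for the needle-sharing table).
print(round(chi2_sf_1df(1.24), 3))
```

The first call should give an area very close to 0.05, matching the figure; the second gives a p-value well above 0.05.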
• Thus, at the $\alpha$ level, $H_0$ is rejected if $X^2 > \chi^2_{1,1-\alpha}$
• Using the 2x2 contingency table below, an alternate formula for the Yates-corrected test statistic is
  $$X^2 = \frac{n\left(|ad - bc| - \frac{n}{2}\right)^2}{(a+b)(c+d)(a+c)(b+d)}$$
  $$X^2 = \frac{100\left(|12(49) - 28(11)| - 50\right)^2}{(40)(60)(23)(77)} = 1.24 < 3.84 = \chi^2_{1,0.95}$$
• ⇒ Fail to reject $H_0$

2x2 table              Positive      Negative      Total
Report sharing         a = 12        b = 28        a + b = 40
Don’t report sharing   c = 11        d = 49        c + d = 60
Total                  a + c = 23    b + d = 77    n = 100

Fisher’s exact test
What happens if some expected counts are < 5? Instead of the chi-square test, use Fisher’s exact test (see Rosner 10.3)
• Like the chi-square test, Fisher’s exact test examines the significance of the association (contingency) between the two kinds of classification: rows and columns.
• Both the row totals (a+b, c+d) and the column totals (a+c, b+d) are assumed to be fixed, not random.
• We then consider all possible tables that could give the observed row and column totals, and the corresponding probability of each configuration (it helps to realize that the first count, a, has a hypergeometric distribution under the null)
• Finally, the p-value is computed by adding up the probabilities of the tables as extreme as or more extreme than the observed one.
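The enumeration described above can be sketched with the hypergeometric probabilities written out directly. One assumption needs flagging: "as extreme or more extreme" is interpreted here as "tables whose probability is no larger than that of the observed table", which is one common two-sided convention but not the only one. The function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumption: a table counts as 'as extreme or more extreme' when its
    hypergeometric probability is <= that of the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(k):
        # Probability that the top-left cell equals k, given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo = max(0, col1 - row2)   # smallest top-left count the margins allow
    hi = min(row1, col1)       # largest top-left count the margins allow
    p_obs = prob(a)

    # Small relative tolerance so ties are not lost to floating-point rounding.
    total = sum(p for p in (prob(k) for k in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Needle-sharing table from the lecture (a=12, b=28, c=11, d=49).
p_value = fisher_exact_two_sided(12, 28, 11, 49)
print(round(p_value, 3))
```

For the needle-sharing data the exact p-value is also well above 0.05, in line with the chi-square conclusion. Because every probability is computed from integer binomial coefficients, no large-sample approximation is involved, which is why the method remains usable when expected counts are small.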
Chi-square test for contingency tables, RxC
What if we are interested in a variable that has more than two categories?
Example: Test for association between eye color and presence or absence of a mutant allele at some genetic locus.
Eye color categories: blue, green, brown, hazel, gray
Genetic categories: 0 copies of the mutant allele, ≥ 1 copy of the mutant allele
The chi-square test can be used for variables with more than two categories. Data are presented in an RxC table, a generalization of the 2x2 table:

                        blue   green   brown   hazel   gray   Total
Mutant allele absent       3       7      21      15     15      61
Mutant allele present      6      10      18      14     17      65
Total                      9      17      39      29     32     126
R = # rows, C = # columns (doesn’t matter which variable is which)
The chi-square test for an RxC table is the same as for the 2x2 table, except:
• This method can only be used if no more than 1/5 of the cells have an expected count < 5 AND no cell has an expected count < 1.
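A minimal sketch of the RxC calculation for the eye-color table follows, including a check of the expected-count rule just stated. The function names `chi_square_rxc` and `expected_counts_ok` are invented for this example. The degrees of freedom, (R - 1)(C - 1), are the standard result for an RxC table and are not stated in this excerpt, so treat that line as an added assumption.

```python
def chi_square_rxc(observed):
    """Return (statistic, df, expected) for an RxC table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for o_row, e_row in zip(observed, expected)
               for o, e in zip(o_row, e_row))
    # Standard df for an RxC table (assumption: not given in the excerpt).
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected


def expected_counts_ok(expected):
    """Rule from the lecture: at most 1/5 of cells < 5, and no cell < 1."""
    cells = [e for row in expected for e in row]
    small = sum(1 for e in cells if e < 5)
    return 5 * small <= len(cells) and min(cells) >= 1


# Eye-color table: rows = mutant allele absent / present,
# columns = blue, green, brown, hazel, gray.
eye_table = [[3, 7, 21, 15, 15],
             [6, 10, 18, 14, 17]]
stat, df, expected = chi_square_rxc(eye_table)
print(round(stat, 2), df, expected_counts_ok(expected))
```

In this table only the two blue cells have expected counts below 5 (about 4.4 and 4.6), which is exactly 1/5 of the 10 cells, so the rule is just satisfied.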