HST 190: Introduction to Biostatistics
Lecture 6: Methods for binary data

Binary data
• So far, we have focused on settings where the outcome is continuous.
• Now, we consider the setting where our outcome of interest is binary, meaning it takes values 1 or 0.
  § In particular, we consider the 2x2 contingency table tabulating pairs of binary observations $(X_1, Y_1), \ldots, (X_n, Y_n)$.

• Consider two populations
  § IV drug users who report sharing needles
  § IV drug users who do not report sharing needles
• Is the rate of positive tuberculin skin tests equal in both populations?
  § To address this question, we sample 40 patients who report sharing and 60 patients who do not, and compare their rates of positive tuberculin tests.
  § Data are cross-classified according to these two binary variables.

2x2 table
                        Positive   Negative   Total
  Report sharing              12         28      40
  Don't report sharing        11         49      60
  Total                       23         77     100

Chi-square test for contingency tables
• The chi-square test is a test of association between two categorical variables.
• In general, its null and alternative hypotheses are
  § $H_0$: the relative proportions of individuals in each category of variable #1 are the same across all categories of variable #2; that is, the variables are not associated (i.e., statistically independent).
  § $H_1$: the variables are associated
    o Notice the alternative is always two-sided
• In our example, this means
  § $H_0$: reported needle sharing is not associated with the PPD (tuberculin skin test) result

• The chi-square test compares the observed counts in the table to the counts expected if there is no association (i.e., under $H_0$).
  § Expected counts are obtained using the marginal totals of the table.
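As a rough illustration of how expected counts follow from the marginal totals, the sketch below computes row total × column total / n for each cell of the needle-sharing table. The function name `expected_counts` and the list-of-lists layout are assumptions made for this example, not part of the lecture.

```python
# Observed 2x2 table from the needle-sharing example.
# Rows: report sharing, don't report sharing. Columns: positive, negative.
observed = [[12, 28],
            [11, 49]]


def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n.

    `expected_counts` is a hypothetical helper name used only in this sketch.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for obs_row, exp_row in zip(observed, expected):
    print("observed:", obs_row, " expected:", [round(e, 1) for e in exp_row])
```

For this table the first cell works out to 40 × 23 / 100 = 9.2 expected positive sharers, compared with the 12 observed.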
• Recall the independence rule $P(A \cap B) = P(A)\,P(B)$. So, from 100 people, assuming independence, we expect
  $$P(\text{share} \cap \text{positive}) = P(\text{share})\,P(\text{positive}) = \frac{40}{100} \cdot \frac{23}{100} = 0.092$$
  § Then we'd expect $0.092 \times 100 = 9.2$ positive sharers, instead of the 12 observed.

2x2 table
                        Positive   Negative   Total
  Report sharing              12         28      40
  Don't report sharing        11         49      60
  Total                       23         77     100

• Similarly, there will likely be some discrepancy between observed and expected counts for the other three cells in the table.
  § The chi-square test assesses: are these differences too large to be the result of sampling variability?
• Steps of the chi-square test
  1) Complete the observed-data table
  2) Compute the table of expected counts
  3) Calculate the $X^2$ statistic
  4) Get the p-value from the chi-square table
• This method is valid only if all expected counts are ≥ 5
  § The test relies on an approximation that does not hold in small samples

1) Complete observed data table

     O11   O12   O1.
     O21   O22   O2.
     O.1   O.2   n

2) Complete table of expected counts
  $$E_{ij} = \frac{O_{i\cdot} \times O_{\cdot j}}{n} = \frac{(O_{i1} + O_{i2})(O_{1j} + O_{2j})}{n}$$

     E11   E12   E1.
     E21   E22   E2.
     E.1   E.2   n

3) Calculate chi-square test statistic
  $$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}}$$
  § Swap $(O_{ij} - E_{ij})$ with $(|O_{ij} - E_{ij}| - 0.5)$ for the Yates continuity correction

4) Get p-value from chi-square distribution
  § Under the null hypothesis $H_0$: no association between the two factors, the $X^2$ statistic follows a chi-square distribution with 1 degree of freedom.
  This is often written as $X^2 \sim \chi^2_1$.
    o The chi-square distribution is continuous and positive-valued, and is defined by one parameter (df)
  § The p-value comes from the right tail, but is inherently 'two-sided'
    o MATLAB: 1-chi2cdf(x,1)
  (Figure: chi-square density with 1 df; the area to the right of $\chi^2_{1,0.95} = 3.84$ is 0.05.)

• Thus, at the $\alpha$ level, $H_0$ is rejected if $X^2 > \chi^2_{1,1-\alpha}$
• Using the 2x2 contingency table, an alternate formula for the Yates-corrected test statistic is
  $$X^2 = \frac{n\left(|ad - bc| - \frac{n}{2}\right)^2}{(a+b)(c+d)(a+c)(b+d)}$$
  $$X^2 = \frac{100\left(|12(49) - 28(11)| - 50\right)^2}{(40)(60)(23)(77)} = 1.24 < 3.84 = \chi^2_{1,0.95}$$
• ⇒ Fail to reject $H_0$

2x2 table
                        Positive      Negative      Total
  Report sharing        a = 12        b = 28        a + b = 40
  Don't report sharing  c = 11        d = 49        c + d = 60
  Total                 a + c = 23    b + d = 77    n = 100

Fisher's exact test
What happens if not all expected counts are ≥ 5? Instead of the chi-square test, use Fisher's exact test (see Rosner 10.3).
• Like the chi-square test, Fisher's exact test examines the significance of the association (contingency) between the two kinds of classification (rows and columns).
• Both row and column totals (a+c, b+d, a+b, c+d) are assumed to be fixed, not random.
• We then consider all possible tables that could give the observed row and column totals, and the corresponding probability of each configuration (it helps to realize that the first count, a, has a hypergeometric distribution under the null).
• Finally, the p-value is computed by adding up the probabilities of the tables as extreme as or more extreme than the observed one.

Chi-square test for contingency tables, RxC
What if we are interested in a variable that has more than two categories?
Example: Test for association between eye color and presence or absence of a mutant allele at some genetic locus.
  Eye color categories: blue, green, brown, hazel, gray
  Genetic categories: 0 copies mutant allele, ≥ 1 copy mutant allele

The chi-square test can be used for variables with more than two categories.
Data presented in an RxC table, a generalization of the 2x2 table:

                          blue   green   brown   hazel   gray   Total
  Mutant allele absent       3       7      21      15     15      61
  Mutant allele present      6      10      18      14     17      65
  Total                      9      17      39      29     32     126

R = # rows, C = # columns (doesn't matter which variable is which)

Chi-square test for RxC table
Same as for the 2x2 table, except:
• This method can only be used if no more than 1/5 of cells have expected count < 5 AND if no cell has expected count < 1.
  $$X^2 = \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \cdots + \frac{(O_{RC} - E_{RC})^2}{E_{RC}}$$
• Under $H_0$, the $X^2$ test statistic follows a chi-square distribution on (R-1)(C-1) degrees of freedom:
  $$X^2 \sim \chi^2_{(R-1)(C-1)}$$

Again, we have to obtain marginal totals to determine the expected count for each cell. For example:

                          blue   green   brown   hazel    gray   Total
  Mutant allele absent    4.36    8.23   18.88   14.04   15.49      61
  Mutant allele present   4.64    8.77   20.12   14.96   16.51      65
  Total                      9      17      39      29      32     126

The expected counts would be calculated as follows:
  $$E_{11} = \frac{61 \times 9}{126} = 4.36, \quad \ldots, \quad E_{RC} = \frac{65 \times 32}{126} = 16.51$$

  $$X^2 = \frac{(3 - 4.36)^2}{4.36} + \frac{(7 - 8.23)^2}{8.23} + \cdots + \frac{(17 - 16.51)^2}{16.51} = 1.80$$
• Under $H_0$, $X^2 \sim \chi^2_4$
• MATLAB: 1-chi2cdf(1.8,4)
  p-value = 0.77
Conclusion: No evidence for association between eye color and mutant alleles.

Two-sample comparison of proportions
What if we are interested in estimating and quantifying uncertainty about the difference in proportions between two groups?
• e.g., we want an estimate and CI of the difference in proportions of positive tuberculosis skin tests between needle sharers and non-sharers
The approach is similar to two-sample estimation for continuous data questions, with subtle differences!

Two-sample comparison of proportions
• Whereas we have previously considered the difference in means of continuous two-sample data, we now compare two populations' unknown proportions $p_1$ and $p_2$.
• Suppose we want to know whether two communities have the same obesity rate.
  § You draw random samples from both; in the first city, 20 out of 100 are obese, while in the second, 24 out of 150 are obese.
• Goals:
  § estimate and compute the 95% C.I. for the difference in proportions
  § conduct a significance test at level $\alpha = 0.05$ for a difference

• Before, we saw that if a random experiment has two possible outcomes, "success" and "failure", and we do $n$ independent repetitions with identical success probability $p$, then $X \sim \text{Bin}(n, p)$ is the number of successes.
  § Now, we observe $X_1 \sim \text{Bin}(n_1, p_1)$ and $X_2 \sim \text{Bin}(n_2, p_2)$ and then make inference about $p_1 - p_2$.
• Estimation is identical to the two-sample continuous case: difference of sample proportions, $\hat{p}_1 - \hat{p}_2$
• If $n_1\hat{p}_1(1 - \hat{p}_1) \ge 5$ and $n_2\hat{p}_2(1 - \hat{p}_2) \ge 5$, the associated $100(1 - \alpha)\%$ CI is given by
  $$\hat{p}_1 - \hat{p}_2 \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$

• For example, consider two samples
  § $n_1 = 100$, $X_1 = 20$, $\hat{p}_1 = \frac{20}{100} = 0.20$, $n_1\hat{p}_1(1 - \hat{p}_1) = 16 \ge 5$
  § $n_2 = 150$, $X_2 = 24$, $\hat{p}_2 = \frac{24}{150} = 0.16$, $n_2\hat{p}_2(1 - \hat{p}_2) = 20.16 \ge 5$
• Then the 95% CI for the difference is
  $$(0.20 - 0.16) \pm 1.96\sqrt{\frac{0.2(0.8)}{100} + \frac{0.16(0.84)}{150}} = 0.04 \pm 1.96(0.050) = 0.04 \pm 0.10 = (-0.06,\ 0.14)$$

Hypothesis testing for difference of proportions
• Now, consider $H_0: p_1 = p_2$ versus $H_1: p_1 \ne p_2$
  § Under $H_0$, we can pool the two samples to calculate the standard error, letting
    $$\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$$
• Then, if $n_1\hat{p}_1(1 - \hat{p}_1) \ge 5$ and $n_2\hat{p}_2(1 - \hat{p}_2) \ge 5$, under $H_0$ we form the Z-test statistic
  $$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
• It has an approximate N(0,1) distribution when the null is true.
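The confidence interval and pooled Z-test described above can be sketched in a few lines of Python using only the standard library. This is an illustrative sketch, not the course's own code: the names `normal_sf` and `two_proportion_summary` are hypothetical, and the critical value 1.96 is hard-coded on the assumption that a 95% interval is wanted.

```python
import math


def normal_sf(z):
    """Upper-tail probability P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def two_proportion_summary(x1, n1, x2, n2, z_crit=1.96):
    """Return (CI, Z, two-sided p-value) for the difference p1 - p2.

    z_crit=1.96 is an assumed default corresponding to a 95% interval.
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2

    # Confidence interval: unpooled standard error.
    se_ci = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    ci = (diff - z_crit * se_ci, diff + z_crit * se_ci)

    # Hypothesis test: standard error pooled under H0: p1 = p2.
    p_pool = (x1 + x2) / (n1 + n2)
    se_test = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = diff / se_test
    p_value = 2 * normal_sf(abs(z))
    return ci, z, p_value


# Obesity example: 20 of 100 in the first sample, 24 of 150 in the second.
ci, z, p_value = two_proportion_summary(20, 100, 24, 150)
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"Z = {z:.2f}, two-sided p-value = {p_value:.2f}")
```

Note that the interval uses the unpooled standard error while the test uses the pooled one, mirroring the two formulas above. For the obesity data the interval should agree with the (−0.06, 0.14) computed by hand.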
• Continuing the same example,
  § $n_1 = 100$, $X_1 = 20$, $\hat{p}_1 = \frac{20}{100} = 0.20$, $n_1\hat{p}_1(1 - \hat{p}_1) = 16 \ge 5$
  § $n_2 = 150$, $X_2 = 24$, $\hat{p}_2 = \frac{24}{150} = 0.16$, $n_2\hat{p}_2(1 - \hat{p}_2) = 20.16 \ge 5$
  § $\hat{p} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{20 + 24}{100 + 150} = 0.176$
• The test statistic is then
  $$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} = \frac{0.20 - 0.16}{\sqrt{0.176(0.824)\left(\frac{1}{100} + \frac{1}{150}\right)}} = 0.81$$
• From a table or MATLAB, $P(Z > 0.81) = 0.21$, so the p-value is $2(0.21) = 0.42 > 0.05$ ⇒ do not reject $H_0$

Odds ratio and relative risk
Chi-square tests for contingency tables allow us to test for association between two categorical variables.
How do we estimate the magnitude of the association between two categorical variables?