Stat 5120 – Categorical Data Analysis
Dr. Corcoran, Fall 2004

IV. Comparing Two Proportions

Readings for Section IV: Agresti, Chapter 2 Introduction, Sections 2.2 and 2.3.

We now consider the problem of deciding whether two binomial proportions π1 and π2 are the same, when both π1 and π2 are unknown.

Example IV.A

In an experiment involving 427 genetically engineered mice, 77 mice were exposed to the synthetic form of the investigative fiber, and 350 were exposed to the natural form of the fiber. After some time, the mice were examined for the presence of tumors. Do the results contained in the table below indicate that mice exposed to the synthetic fiber have a higher probability of tumor?

                     Tumor
Fiber Type      Yes      No     Total
Synthetic        40      37        77
Natural         138     212       350
Total           178     249       427


Example IV.A (cont’d)

The number of tumors in each of the fiber categories can be thought of as a binomial random variable: the number of tumors in the synthetic group is Binomial(77, πS) and the number of tumors in the natural fiber group is Binomial(350, πN), where πS is the probability that a mouse exposed to synthetic fiber develops a tumor, and πN is the tumor probability for a mouse exposed to natural fiber.

Formally speaking, we are interested in a test of H0: πS = πN versus HA: πS ≠ πN.

Measuring the Difference Between Proportions

Given two population proportions π1 and π2, how do we quantify the disparity between them? There are three common metrics for accomplishing this:

1. The risk difference, given by π1 – π2.

2. The relative risk, given by π1/π2.

3. The odds ratio, given by Odds(π1)/Odds(π2), where Odds(π) = π/(1 – π).

Of these three measures, the odds ratio is by far the most widely used in practice.


Inference for the Risk Difference

Given two population proportions π1 and π2, the MLE of the risk difference is given by RD = p1 – p2, where p1 and p2 are the observed proportions in each of the two groups. The variance of the RD is given by Var(p1) + Var(p2), where Var(pi) = πi(1 – πi)/ni, for i = 1, 2, and ni is the sample size in group i. Since π1 and π2 are unknown, we estimate the variance of the RD by substituting p1 and p2.

Provided that n1 and n2 are sufficiently large, by the CLT the approximate distribution of RD is normal with mean π1 – π2 and standard deviation s.e.(RD), where

\[ \text{s.e.}(RD) = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}. \]

Note that testing H0: π1 = π2 versus HA: π1 ≠ π2 is equivalent to testing H0: Risk Difference = 0 versus HA: Risk Difference ≠ 0.

Hence, under the null, the statistic Z = (RD – 0)/s.e.(RD) is approximately N(0,1). Moreover, a (1 – α)100% confidence interval for the true risk difference is given by

\[ RD \pm z_{1-\alpha/2}\,\text{s.e.}(RD). \]
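The risk-difference formulas above can be checked numerically. The following is a minimal Python sketch (Python is used here only as a cross-check; it is not part of the course's SAS material), applied to the counts from Example IV.A. The function name `risk_difference` is an illustrative choice, and the standard error is the unpooled one given above.

```python
from math import sqrt
from statistics import NormalDist

def risk_difference(x1, n1, x2, n2, alpha=0.05):
    """Large-sample (Wald) inference for p1 - p2 using the unpooled s.e."""
    p1, p2 = x1 / n1, x2 / n2
    rd = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = rd / se                                    # test of H0: RD = 0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided P-value
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # z_{1 - alpha/2}
    return rd, se, z, p_value, (rd - z_crit * se, rd + z_crit * se)

# Example IV.A: 40 of 77 synthetic-fiber mice and 138 of 350 natural-fiber mice had tumors.
rd, se, z, p_value, ci = risk_difference(40, 77, 138, 350)
print(f"RD = {rd:.4f}, s.e. = {se:.4f}, Z = {z:.3f}, P = {p_value:.4f}")
print(f"95% CI for the risk difference: ({ci[0]:.4f}, {ci[1]:.4f})")
```

The estimate, standard error, and interval should agree, up to rounding, with the "Difference" row of the Column 1 risk estimates in the SAS output (0.1252, 0.0626, and limits 0.0024 to 0.2480).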


Example IV.B

For the data in Example IV.A, the observed proportions of mice with tumors are pS = 0.5195 for the synthetic treatment group and pN = 0.3943 for the natural treatment group.

What is the RD? What is s.e.(RD)?

Give the value of a statistic to test the null hypothesis in IV.A. What is the P-value of this test?

Give a 95% CI for π1 – π2.

Is there evidence that exposure to the synthetic fiber leads to greater tumor risk?

Inference for the Relative Risk

The MLE of the relative risk is given by RR = p1/p2, where p1 and p2 are the observed proportions in each of the two groups.

For the purpose of inference, we normally focus on the log(RR) (by convention, “log” in this class refers to the natural log, or log with base e). This is because the distribution of the log(RR) is approximately normal, provided that the sample is sufficiently large.

The MLE of the log relative risk is given by log(RR) = log(p1/p2) = log(p1) – log(p2). Negative values of log(RR) indicate that p1 < p2, positive values indicate the reverse. When log(RR) = 0, this indicates that p1 = p2.


Inference for the Relative Risk

The large-sample standard error of the log(RR) is given by

\[ \text{s.e.}(\log[RR]) = \sqrt{\frac{n_1 - x_1}{x_1 n_1} + \frac{n_2 - x_2}{x_2 n_2}}, \]

where ni and xi are respectively the sample size and number of responses in the ith group, for i = 1, 2.

Note that testing H0: π1 = π2 versus HA: π1 ≠ π2 is equivalent to testing H0: log(π1/π2) = 0 versus HA: log(π1/π2) ≠ 0.

Hence, under the null, the statistic Z = (log[RR] – 0)/s.e.(log[RR]) is approximately N(0,1). Moreover, a (1 – α)100% confidence interval for the true risk ratio is given by

\[ \exp\{\log(RR) \pm z_{1-\alpha/2}\,\text{s.e.}[\log(RR)]\}. \]

Example IV.C

For the data in Example IV.A, what is the RR? What is the log(RR)? What is s.e.(log[RR])?

Using the RR, give the value of a statistic to test the null hypothesis in IV.A. What is the P-value of this test?

Give a 95% CI for π1/π2.

Is this evidence that exposure to the synthetic fiber leads to greater tumor risk?
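The relative-risk formulas above can be sketched in Python in the same way (again only as a cross-check, not as part of the SAS material). The function name `relative_risk` is illustrative; the inputs are the counts from Example IV.A.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def relative_risk(x1, n1, x2, n2, alpha=0.05):
    """Large-sample inference for the relative risk, carried out on the log scale."""
    rr = (x1 / n1) / (x2 / n2)
    se_log = sqrt((n1 - x1) / (x1 * n1) + (n2 - x2) / (x2 * n2))
    z = log(rr) / se_log                           # test of H0: log(RR) = 0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided P-value
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Build the interval for log(RR), then exponentiate the endpoints.
    ci = (exp(log(rr) - z_crit * se_log), exp(log(rr) + z_crit * se_log))
    return rr, se_log, z, p_value, ci

rr, se_log_rr, z_rr, p_rr, ci_rr = relative_risk(40, 77, 138, 350)
print(f"RR = {rr:.4f}, log(RR) = {log(rr):.4f}, s.e. = {se_log_rr:.4f}")
print(f"Z = {z_rr:.3f}, P = {p_rr:.4f}, 95% CI: ({ci_rr[0]:.4f}, {ci_rr[1]:.4f})")
```

The point estimate and interval should match, up to rounding, the "Cohort (Col1 Risk)" line of the SAS output (1.3175 with limits 1.0250 and 1.6935).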


Inference for the Odds Ratio

The MLE of the odds ratio is given by

OR = Odds(p1)/Odds(p2) = {p1/(1 – p1)}/{p2/(1 – p2)} = p1(1 – p2)/[p2(1 – p1)] = x1(n2 – x2)/[x2(n1 – x1)],

where p1 and p2 are the observed proportions in each of the two groups, and ni and xi are respectively the sample size and number of responses in the ith group, for i = 1, 2.

As with the relative risk, we carry out inference for the odds ratio via the log(OR), due to the more desirable distributional properties of the log(OR).

The MLE of the log odds ratio is given by log(OR) = log{x1(n2 – x2)/[x2(n1 – x1)]}. Negative values of log(OR) indicate that p1 < p2, positive values indicate the reverse. When log(OR) = 0, this indicates that p1 = p2.

The large-sample standard error of the log(OR) is given by

\[ \text{s.e.}(\log[OR]) = \sqrt{\frac{1}{x_1} + \frac{1}{n_1 - x_1} + \frac{1}{x_2} + \frac{1}{n_2 - x_2}}. \]

Note that testing H0: π1 = π2 versus HA: π1 ≠ π2 is equivalent to testing H0: log[Odds(π1)/Odds(π2)] = 0 versus HA: log[Odds(π1)/Odds(π2)] ≠ 0.

Hence, under the null, the statistic Z = (log[OR] – 0)/s.e.(log[OR]) is approximately N(0,1). Moreover, a (1 – α)100% confidence interval for the true odds ratio is given by

\[ \exp\{\log(OR) \pm z_{1-\alpha/2}\,\text{s.e.}[\log(OR)]\}. \]
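As with the other two measures, the odds-ratio formulas can be checked with a short Python sketch (a cross-check only; the function name `odds_ratio` is illustrative). It works directly from the four cell counts of Example IV.A.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def odds_ratio(x1, n1, x2, n2, alpha=0.05):
    """Large-sample inference for the odds ratio, carried out on the log scale."""
    or_hat = (x1 * (n2 - x2)) / (x2 * (n1 - x1))
    # s.e. of log(OR): square root of the sum of reciprocal cell counts.
    se_log = sqrt(1 / x1 + 1 / (n1 - x1) + 1 / x2 + 1 / (n2 - x2))
    z = log(or_hat) / se_log                       # test of H0: log(OR) = 0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided P-value
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (exp(log(or_hat) - z_crit * se_log), exp(log(or_hat) + z_crit * se_log))
    return or_hat, se_log, z, p_value, ci

or_hat, se_log_or, z_or, p_or, ci_or = odds_ratio(40, 77, 138, 350)
print(f"OR = {or_hat:.4f}, log(OR) = {log(or_hat):.4f}, s.e. = {se_log_or:.4f}")
print(f"Z = {z_or:.3f}, P = {p_or:.4f}, 95% CI: ({ci_or[0]:.4f}, {ci_or[1]:.4f})")
```

The estimate and interval should agree, up to rounding, with the asymptotic odds ratio results in the SAS output (1.6608 with limits 1.0116 and 2.7267).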


Example IV.D

For the data in Example IV.A, what is the OR? What is the log(OR)? What is s.e.(log[OR])?

Using the OR, give the value of a statistic to test the null hypothesis in IV.A. What is the P-value of this test?

Give a 95% CI for Odds(π1)/Odds(π2).

Is this evidence that exposure to the synthetic fiber leads to greater tumor risk?

How do the results based on the three measures compare?

Practical Advantages of the Odds Ratio

Of the three measures discussed, the odds ratio is the most widely used, mainly for the following reasons:

1. The odds ratio is invariant with respect to whether a study is prospective or retrospective.

A prospective study is one in which sample subjects are grouped in some way, and then followed forward in time to determine whether the subjects experience the outcome of interest.

A retrospective study is one in which subjects are sampled on the basis of the outcome, and then grouped according to their history. This is an efficient design when the outcome is relatively rare (i.e., you would not observe a sufficient number of outcomes if you followed the subjects prospectively).


Invariance of the Odds Ratio

Whether the design is prospective or retrospective, we want to know how the probability of experiencing the outcome differs by group. Suppose, for instance, that S represents the event that an individual smokes, and D represents the event that the individual has a disease. Note that under the prospective design, we can determine P(D|S). Under the retrospective design, however, we have sampled according to disease status, so we can compute P(S|D), but not P(D|S).

The odds ratio comparing risk of disease among smokers to nonsmokers is given by

\[ \frac{P(D \mid S)/P(D' \mid S)}{P(D \mid S')/P(D' \mid S')} = \frac{P(D \mid S)\,P(D' \mid S')}{P(D \mid S')\,P(D' \mid S)}. \]

However, since P(S|D) = P(D|S)P(S)/P(D) by Bayes' rule (and similarly for the other conditional probabilities), the factors P(D) and P(D') cancel, and we can see that

\[ \frac{P(S \mid D)/P(S \mid D')}{P(S' \mid D)/P(S' \mid D')} = \frac{P(S \mid D)\,P(S' \mid D')}{P(S' \mid D)\,P(S \mid D')} = \frac{P(D \mid S)P(S)\,P(D' \mid S')P(S')}{P(D \mid S')P(S')\,P(D' \mid S)P(S)} = \frac{P(D \mid S)\,P(D' \mid S')}{P(D \mid S')\,P(D' \mid S)}. \]

The odds ratio is the same regardless of which way it is computed; it therefore does not depend on the design!

Practical Advantages of the Odds Ratio, continued

2. We interpret the parameters of the logistic regression model in terms of the odds ratio.

3. When the outcome is relatively rare, the odds ratio and relative risk are approximately equal.

To understand why this last fact is so, consider again the example on the previous slide involving smoking status and disease. The relative risk of disease, comparing smokers to nonsmokers, is given by P(D|S)/P(D|S'). From the odds ratio given on the previous slide, this implies that

\[ \text{Odds Ratio} = \frac{P(D \mid S)\,P(D' \mid S')}{P(D \mid S')\,P(D' \mid S)} = (\text{Relative Risk}) \times \frac{P(D' \mid S')}{P(D' \mid S)}. \]

If the disease is rare among both smokers and nonsmokers, then the fraction at the end of this equation will be approximately equal to 1. Hence, the odds ratio and relative risk will be approximately equal.
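Both properties of the odds ratio discussed above can be illustrated numerically. The sketch below (Python, as a cross-check only) uses the mouse counts from Example IV.A for the invariance property, and a pair of hypothetical risks, 0.002 and 0.001, chosen purely for illustration, for the rare-outcome approximation.

```python
from math import isclose

# 2x2 counts from Example IV.A: rows = fiber type, columns = tumor (yes, no).
a, b = 40, 37      # synthetic: tumor, no tumor
c, d = 138, 212    # natural:   tumor, no tumor

# Invariance: comparing rows (prospective view) or columns (retrospective view)
# gives the same cross-product ratio ad/(bc).
or_by_rows = (a / b) / (c / d)   # odds of tumor, synthetic vs. natural
or_by_cols = (a / c) / (b / d)   # odds of synthetic exposure, tumor vs. no tumor
print(f"OR by rows = {or_by_rows:.4f}, OR by columns = {or_by_cols:.4f}")

def odds(p):
    return p / (1 - p)

def or_and_rr(p1, p2):
    """Return (odds ratio, relative risk) for two outcome probabilities."""
    return odds(p1) / odds(p2), p1 / p2

# Rare outcome (hypothetical risks): OR and RR nearly coincide.
or_rare, rr_rare = or_and_rr(0.002, 0.001)
# Common outcome (observed mouse proportions): OR and RR differ noticeably.
or_mice, rr_mice = or_and_rr(a / (a + b), c / (c + d))
print(f"rare outcome:   OR = {or_rare:.4f}, RR = {rr_rare:.4f}")
print(f"mouse tumors:   OR = {or_mice:.4f}, RR = {rr_mice:.4f}")
```

In the mouse experiment tumors are far from rare (roughly 40 to 50 percent in each group), which is why the odds ratio and relative risk there are not close.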


Exact Inference for the Odds Ratio

In a manner similar to what was described for a single proportion, we can also obtain an exact confidence interval for an odds ratio when we are comparing two proportions. The details are not important for the purpose of its practical use, but we determine the bounds of the interval by using the hypergeometric distribution (introduced in Section II of the lecture notes). In Section V we will discuss further how the hypergeometric distribution is used to derive an exact hypothesis test for an arbitrary two-way table.

Note: the same issues we discussed in Section III regarding exact versus large-sample procedures apply here.

The SAS example that follows includes commands to compute an exact 95% confidence interval for the odds ratio based on the experiment in Example IV.A.

Inference for Two Proportions in SAS

The following SAS program shows how to read the mouse data into SAS, and then obtain the risk difference, relative risk, and odds ratio, with the results of the associated inferential procedures. The program also requests the exact CI for the OR, along with output regarding the chi-square and likelihood ratio tests (we will discuss these latter two procedures shortly).

    options ls=79 nodate;

    /* One record per cell of the 2x2 table; count is the cell frequency. */
    data;
    input synthetic tumor count;
    cards;
    1 1 40
    1 0 37
    0 1 138
    0 0 212
    ;;

    /* Sort so that synthetic = 1 and tumor = 1 come first; with order=data,
       PROC FREQ then uses them as row 1 and column 1 of the table. */
    proc sort;
    by descending synthetic descending tumor;
    run;

    proc freq order=data;
    exact fisher or;    /* Fisher's exact test and exact CI for the OR */
    tables synthetic*tumor / riskdiff relrisk;
    weight count;       /* each record represents count mice */
    run;
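To make the role of the hypergeometric distribution concrete, here is a minimal Python sketch (Python rather than SAS, offered as a cross-check under stated assumptions, not as a description of SAS's internal algorithm). It conditions on the table margins, computes Fisher's exact P-values from the hypergeometric distribution, and finds exact limits for the odds ratio by assuming the usual tail method: each limit is the odds ratio value at which the relevant conditional tail probability equals α/2. The names `cond_probs`, `upper_tail`, `lower_tail`, and `solve` are illustrative.

```python
from math import comb, exp, log

# Example IV.A: x = number of tumors in the synthetic group (cell (1,1)).
x_obs, n1, n2, m1 = 40, 77, 350, 178          # m1 = total number of tumors
support = range(max(0, m1 - n2), min(n1, m1) + 1)
log_w = [log(comb(n1, x) * comb(n2, m1 - x)) for x in support]

def cond_probs(psi):
    """P(X = x | margins) when the true odds ratio is psi.
    psi = 1 gives the ordinary hypergeometric distribution."""
    logs = [lw + x * log(psi) for lw, x in zip(log_w, support)]
    top = max(logs)                            # subtract the max to avoid overflow
    weights = [exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def upper_tail(psi):                           # P(X >= x_obs), increasing in psi
    return sum(p for p, x in zip(cond_probs(psi), support) if x >= x_obs)

def lower_tail(psi):                           # P(X <= x_obs), decreasing in psi
    return sum(p for p, x in zip(cond_probs(psi), support) if x <= x_obs)

def solve(f, target, increasing, a=1e-3, b=1e3):
    """Bisection on the log scale for f(psi) = target, f monotone in psi."""
    for _ in range(200):
        mid = (a * b) ** 0.5
        if (f(mid) < target) == increasing:
            a = mid
        else:
            b = mid
    return (a * b) ** 0.5

# Fisher's exact test (psi = 1).
null = cond_probs(1.0)
p_table = null[x_obs - support[0]]
p_right, p_left = upper_tail(1.0), lower_tail(1.0)
p_two_sided = sum(p for p in null if p <= p_table * (1 + 1e-7))

# Exact 95% limits for the odds ratio (tail method, alpha/2 in each tail).
alpha = 0.05
or_lower = solve(upper_tail, alpha / 2, increasing=True)
or_upper = solve(lower_tail, alpha / 2, increasing=False)

print(f"right-sided P = {p_right:.4f}, two-sided P = {p_two_sided:.4f}")
print(f"exact 95% CI for OR: ({or_lower:.4f}, {or_upper:.4f})")
```

If the tail-method assumption matches what SAS does, these values should agree with the Fisher's Exact Test and Exact Conf Limits portions of the SAS output. Note that the exact interval contains 1 while the asymptotic interval does not, which illustrates the exact versus large-sample issues mentioned above.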

SAS Output

The FREQ Procedure

Table of synthetic by tumor

synthetic     tumor

Frequency|
Percent  |
Row Pct  |
Col Pct  |       1|       0|  Total
---------+--------+--------+
       1 |     40 |     37 |     77
         |   9.37 |   8.67 |  18.03
         |  51.95 |  48.05 |
         |  22.47 |  14.86 |
---------+--------+--------+
       0 |    138 |    212 |    350
         |  32.32 |  49.65 |  81.97
         |  39.43 |  60.57 |
         |  77.53 |  85.14 |
---------+--------+--------+
Total         178      249      427
            41.69    58.31   100.00

Statistics for Table of synthetic by tumor

Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     1      4.0695    0.0437
Likelihood Ratio Chi-Square    1      4.0207    0.0449
Continuity Adj. Chi-Square     1      3.5708    0.0588
Mantel-Haenszel Chi-Square     1      4.0600    0.0439
Phi Coefficient                       0.0976
Contingency Coefficient               0.0972
Cramer's V                            0.0976

SAS Output (cont’d)

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)        40
Left-sided Pr <= F          0.9836
Right-sided Pr >= F         0.0299

Table Probability (P)       0.0135
Two-sided Pr <= P           0.0551

Column 1 Risk Estimates

                                (Asymptotic) 95%        (Exact) 95%
               Risk      ASE    Confidence Limits     Confidence Limits
-----------------------------------------------------------------------
Row 1        0.5195   0.0569    0.4079     0.6311     0.4026     0.6348
Row 2        0.3943   0.0261    0.3431     0.4455     0.3427     0.4476
Total        0.4169   0.0239    0.3701     0.4636     0.3696     0.4652

Difference   0.1252   0.0626    0.0024     0.2480

Difference is (Row 1 - Row 2)

Column 2 Risk Estimates

                                (Asymptotic) 95%        (Exact) 95%
               Risk      ASE    Confidence Limits     Confidence Limits
-----------------------------------------------------------------------
Row 1        0.4805   0.0569    0.3689     0.5921     0.3652     0.5974
Row 2        0.6057   0.0261    0.5545     0.6569     0.5524     0.6573
Total        0.5831   0.0239    0.5364     0.6299     0.5348     0.6304

Difference  -0.1252   0.0626   -0.2480    -0.0024

Difference is (Row 1 - Row 2)

SAS Output (cont’d)

The FREQ Procedure

Statistics for Table of synthetic by tumor

Estimates of the Relative Risk (Row1/Row2)

Type of Study                   Value       95% Confidence Limits
-----------------------------------------------------------------
Case-Control (Odds Ratio)      1.6608        1.0116        2.7267
Cohort (Col1 Risk)             1.3175        1.0250        1.6935
Cohort (Col2 Risk)             0.7933        0.6196        1.0157

Odds Ratio (Case-Control Study)
-----------------------------------
Odds Ratio                   1.6608

Asymptotic Conf Limits
95% Lower Conf Limit         1.0116
95% Upper Conf Limit         2.7267

Exact Conf Limits
95% Lower Conf Limit         0.9804
95% Upper Conf Limit         2.8135

Sample Size = 427
