1430 Cook • UP WITH ODDS RATIOS!

SPECIAL CONTRIBUTIONS

Advanced Statistics: Up with Odds Ratios! A Case for Odds Ratios When Outcomes Are Common

Thomas D. Cook, PhD

Abstract

Treatment comparisons from clinical studies involving dichotomous outcomes are often summarized using risk ratios. Risk ratios are typically used because the underlying statistical model is often consistent with the underlying biological mechanism of the treatment and they are easily interpretable. The use of odds ratios to summarize treatment effects has been discouraged, especially in studies in which outcomes are common, largely because odds ratios differ from risk ratios and are frequently interpreted incorrectly as risk ratios. In this article, the author contends that risk ratios can be easily misinterpreted and that, in many cases, odds ratios should be preferred, especially in studies in which outcomes are common.

Key words: odds ratios; risk ratios; statistics; differences; outcomes.

ACADEMIC EMERGENCY MEDICINE 2002; 9:1430–1434.

From the Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI (TDC).
Received October 8, 2001; revision received May 7, 2002; accepted July 17, 2002.
Series editor: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor–UCLA Medical Center, Torrance, CA.
Address for correspondence and reprints: Thomas D. Cook, PhD, 209 WARF Building, 610 Walnut Street, Madison, WI 53705. Fax: 608-263-0415; e-mail: [email protected].
ACAD EMERG MED • December 2002, Vol. 9, No. 12 • www.aemj.org

In a 1999 article, Schwartz and colleagues [1] comment on the reporting of a study of the effect of gender and race on physicians’ recommendations for cardiac catheterization [2]. They contend that, as reported, the differences between African Americans and whites, and between men and women, in rates of referral for cardiac catheterization were overstated. (Schwartz et al. indicate other reasons why this finding may be misleading, but they are not relevant to this discussion. As the authors suggest, formal comparisons of rates in this setting may not even be meaningful.) Schwartz et al. [1] blame this overstatement, in part, on the use of odds ratios (ORs) to summarize the results and argue against such use. Additionally, a number of other articles have appeared in recent years discouraging the use of ORs for reporting the results of medical studies [3–9], especially when outcomes are common. Very little has been published since then to refute this recommendation [10]. As is argued below, in the study cited above and when properly interpreted, ORs may be the most meaningful summary measures of the differences observed.

As an illustration, consider the results of a recent study of diaspirin cross-linked hemoglobin (DCLHb) in patients suffering from severe traumatic hemorrhagic shock [11]. (For this study, the effect of treatment, whether overall, within subgroups, or covariate-adjusted, was reported using ORs.) Overall mortality in patients receiving DCLHb (see the first two columns of Table 1) was 46% (24/52) and mortality in patients receiving normal saline (control) was 17% (8/46). The risk ratio (RR) is the ratio of these mortality rates, RR = 2.65 = 46%/17%. Conversely, the odds of mortality is the ratio of the mortality rate to the survival rate, or equivalently, the ratio of the number of deaths to the number of survivors. In the DCLHb study, the odds are 0.857 = 24/28 and 0.211 = 8/38 in the DCLHb and control groups, respectively. The odds ratio (OR) is the ratio of the odds in the two groups, OR = 4.07 = 0.857/0.211. These measures are numerically quite different and, hence, must be interpreted differently. As discussed below, each requires careful consideration and each can be easily misinterpreted.

TABLE 1. Mortality in the Diaspirin Cross-linked Hemoglobin (DCLHb) Study [11] Overall and by Baseline Predicted Probability of Death Using the TRISS Method

                                 TRISS-predicted Probability of Survival
              Overall            80%–100%           20%–80%            0%–20%
              DCLHb    Control   DCLHb    Control   DCLHb    Control   DCLHb    Control
Dead          24       8         5        1         5        1         12       6
(Mortality)   (46.2%)  (17.4%)   (21.7%)  (4.5%)    (38.5%)  (8.3%)    (92.3%)  (60.0%)
Alive         28       38        18       21        8        11        1        4
Total         52       46        23       22        13       12        13       10

Note that five patients had insufficient baseline data upon which to compute a TRISS score.

Criticisms of ORs fall principally into two categories: 1) ORs are not as intuitive as RRs and, therefore, are difficult to understand and easily misinterpreted and misapplied, and 2) ORs often differ significantly from RRs. Arguments of the first category are important, but they suffer from a major flaw. Risk ratios may seem intuitive and easily applied; however, they are easily misapplied and the conclusions drawn from their use may be inappropriate. An intuitive, easily understood summary measure is worthwhile only to the extent that it results in valid conclusions.

Arguments in the second category appear to be based implicitly on two assumptions. The first is that the most appropriate summary of differences between groups is the RR and that this measure should be reported whenever possible. The second is that, since ORs, especially when the underlying risk is high, are more extreme than RRs (larger than RR when RR > 1 and smaller than RR when RR < 1), they overstate the differences between treatment groups. Again, a case can be made that precisely the opposite is true: when they differ, RRs actually understate treatment differences.

The purpose of this article is to argue that in many cases the OR is a more appropriate summary measure that can be applied to a broader population of patients than the RR. In such cases ORs should be preferred, especially when ORs and RRs differ, i.e., when outcomes are common. Notwithstanding the errors in the interpretation of the results reported by Schulman et al. [2], there is no evidence that in practice, errors resulting from the misinterpretation of ORs are more frequent than errors resulting from the misinterpretation of RRs. We suggest, and illustrate below, that practitioners who are likely to misinterpret or misuse ORs are also likely to misinterpret or misuse RRs.

RISK RATIOS VERSUS ODDS RATIOS

Given an outcome of interest (considered a failure), the risk of failure is the probability that a patient will experience failure. For a given population, the risk is usually estimated by the proportion of the population observed to fail. It is important to keep in mind, however, that there is likely to be variation in risk within the population. The observed population risk is actually an average of the risks for the individuals in the population, and therefore, the average risk may not necessarily apply to individuals within the population. Again, this can be illustrated by considering data from the DCLHb study shown in Table 1. Three subpopulations are defined by the probability of survival using the TRISS method [12]. We consider a low-risk group (45 patients), a middle-risk group (25 patients), and a high-risk group (23 patients). Note that five patients had insufficient baseline data to compute the TRISS score.

Now, given two groups of patients, for example, treated and control, the (unadjusted) RR is the ratio of the risks in the two groups. As above, because the average risk does not necessarily represent the risk for any particular individual, the RR calculated using the average risk may not represent the RR for any particular individual. Assuming that the two groups are balanced with respect to underlying patient risk (no confounding), the aggregate unadjusted RR will apply to individuals only if there is a common RR over the population (homogeneous RR assumption). Understanding this fact is critical to the correct application of RRs in practice. Considering Table 2, the overall observed RR of 2.65 likely does not represent the RR for any of the subgroups, especially the high-risk group (it is outside the 95% confidence interval for the RR for this group). It is also well below the observed RR in the other two groups (although well within the corresponding confidence intervals). These differences suggest that the homogeneous RR assumption does not hold. (This example is primarily for purposes of illustration and no attempt at statistical inference is intended. Because of the relatively small numbers of patients in this study, observed differences among groups, here and in what follows, may not reach statistical significance. This fact should have no bearing on the principles being illustrated.)

TABLE 2. Risk Ratios (RRs) and Odds Ratios (ORs) in the Diaspirin Cross-linked Hemoglobin (DCLHb) Study [11] Overall and by Baseline Predicted Probability of Death Using the TRISS Method

TRISS-predicted
Probability of Survival   RR     95% CI         OR     95% CI
Overall                   2.65   (1.32, 5.32)   4.07   (1.59, 10.4)
80%–100%                  4.78   (0.61, 37.7)   5.83   (0.62, 54.7)
20%–80%                   4.62   (0.63, 34.1)   6.88   (0.67, 70.8)
0%–20%                    1.54   (0.91, 2.61)   8.00   (0.73, 88.2)
TRISS-adjusted            2.07   (1.22, 4.50)   7.15   (2.18, 23.5)

For the TRISS-adjusted RR, the confidence interval (CI) was computed using a bootstrap method.
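The unadjusted entries of Table 2 can be reproduced from the counts in Table 1. The Python sketch below is offered only as an illustration of that arithmetic. The data layout and function names are hypothetical, and the log-scale Wald intervals are an assumption: the article states only that the TRISS-adjusted RR interval was obtained by a bootstrap and does not say how the other intervals were computed. The TRISS-adjusted row is not attempted here.

```python
from math import exp, log, sqrt

# Counts transcribed from Table 1 as (deaths, survivors) per arm.
# The dictionary layout and all function names are illustrative assumptions.
TABLE_1 = {
    "Overall":  {"dclhb": (24, 28), "control": (8, 38)},
    "80%-100%": {"dclhb": (5, 18),  "control": (1, 21)},
    "20%-80%":  {"dclhb": (5, 8),   "control": (1, 11)},
    "0%-20%":   {"dclhb": (12, 1),  "control": (6, 4)},
}


def risk_ratio(treated, control):
    """Ratio of the mortality risks (deaths / total) in the two arms."""
    (d1, a1), (d0, a0) = treated, control
    return (d1 / (d1 + a1)) / (d0 / (d0 + a0))


def odds_ratio(treated, control):
    """Ratio of the mortality odds (deaths / survivors) in the two arms."""
    (d1, a1), (d0, a0) = treated, control
    return (d1 / a1) / (d0 / a0)


def log_wald_ci(estimate, se_log, z=1.96):
    """Approximate 95% interval built on the log scale (assumed method)."""
    centre = log(estimate)
    return exp(centre - z * se_log), exp(centre + z * se_log)


def rr_ci(treated, control):
    """Interval for the RR using the usual large-sample SE of log(RR)."""
    (d1, a1), (d0, a0) = treated, control
    se = sqrt(1 / d1 - 1 / (d1 + a1) + 1 / d0 - 1 / (d0 + a0))
    return log_wald_ci(risk_ratio(treated, control), se)


def or_ci(treated, control):
    """Interval for the OR using the usual large-sample SE of log(OR)."""
    (d1, a1), (d0, a0) = treated, control
    se = sqrt(1 / d1 + 1 / a1 + 1 / d0 + 1 / a0)
    return log_wald_ci(odds_ratio(treated, control), se)


# Print one line per stratum in the same order as Table 2.
for stratum, arms in TABLE_1.items():
    t, c = arms["dclhb"], arms["control"]
    rr_lo, rr_hi = rr_ci(t, c)
    or_lo, or_hi = or_ci(t, c)
    print(f"{stratum:9s} RR {risk_ratio(t, c):.2f} ({rr_lo:.2f}, {rr_hi:.2f})"
          f"  OR {odds_ratio(t, c):.2f} ({or_lo:.2f}, {or_hi:.2f})")
```

Under these assumptions the point estimates agree with Table 2 to the precision reported there: the RR ranges from about 4.8 in the lowest-risk stratum to about 1.5 in the highest-risk stratum, while the OR stays between roughly 5.8 and 8.0.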

Conversely, the odds of failure is the ratio of the failure probability to the success probability. In a population, the odds can usually be estimated by the number (or proportion) observed to fail divided by the number (or proportion) observed not to fail. While the risk must be between 0 and 100%, the odds can be any number greater than or equal to zero. Given two groups of patients, the (unadjusted) OR is the ratio of the odds of failure in the two groups. The OR is always more extreme (further from 1) than the RR, but when the risks are small (less than, say, 10%), the RR and the OR will roughly agree. As the underlying risks increase, however, the difference between the RR and the OR can become quite large. As with risk, the true odds of failure may vary significantly among members of the population and, again, the unadjusted OR may not represent the OR for any particular individual. The principal benefit of ORs is that the assumption of homogeneity of ORs is more tenable than the assumption of homogeneity of RRs, and thus it is more likely that an estimate of the OR can be reliably applied to all individuals within a population.

We now illustrate how the use of RRs can be misleading. Note that the overall RR of 2.65 can be expressed by the following statement:

(1) Mortality in patients receiving DCLHb was 2.65 times higher than in patients receiving normal saline.

We first note two things regarding statement 1, above. First, it is a summary of the aggregate results observed in this study, neither inferring causation nor explicitly quantifying the effect of DCLHb. Second, as discussed above, it is a statement about the population under study in aggregate, and does not directly apply to individual patients.

Given statement 1, it may be natural for a reader to infer a statement such as the following:

(2) DCLHb increases the risk of death 2.65 times relative to normal saline.

Statement 2 differs from statement 1 in two immediate ways. First, it directly addresses the effect of DCLHb, and second, it makes sense only when applied to individual patients. It also differs from statement 1 in that it is false, at least to the extent that it does not hold for a significant number of patients. In particular, it cannot hold for those patients whose underlying (saline) risk is above 38% (estimated to be about 16% of the study population based on TRISS-predicted survival probability). For these patients it is nonsensical to suggest that they would have risk of more than 100% (38% × 2.65 = 101%) if given DCLHb. It is also likely to be false for patients with very low risk. In fact, for low-risk patients (about half of the DCLHb study population has TRISS-predicted baseline risk below 6%), the OR is a good approximation to the RR. In addition, to the extent that the aggregate (population) OR reflects the common individual OR (assuming that a common OR exists), the RR for low-risk patients is probably more accurately estimated by the crude OR of 4.07. In reality, and as suggested by Table 2, given the heterogeneity of the population, the crude OR most likely underestimates the true OR, but the crude OR is sufficient for this discussion. Indeed, the observed RR for the low-risk group in Table 2 is 4.78. Thus, statement 2 is likely to hold for only a relatively small number of patients. A reader who is likely to misinterpret ORs is also likely to believe, incorrectly, that statement 2 follows directly from statement 1. Clearly, without additional consideration of the underlying assumptions (in particular the assumption of homogeneous RR), RRs can be easily misinterpreted, obscuring the actual effect of the treatment. In cases where there is large variation in risk, RRs can be uninterpretable.

In contrast, the ORs computed for the three risk categories are not too dissimilar and the assumption of homogeneity of ORs is quite reasonable. Under this assumption, the TRISS-adjusted OR of 7.15 shown in Table 2 may be used as the estimate of a common OR for all patients in the study. Furthermore, this may represent the most reliable estimate of the RR for the lowest-risk patients. (Note that the estimate in the low-risk group of 4.78 is based on only one death in the saline group and has a much wider confidence interval.) Given that the homogeneous RR assumption cannot possibly hold, the TRISS-adjusted RR shown in Table 2 is probably meaningless, especially since predicted risk for a number of patients under this model is greater than 100%.

This leads to a second issue: RRs may not adequately summarize the effect of treatment when outcomes are common. To illustrate, consider subgroups of patients defined by baseline risk in the DCLHb study. From Tables 1 and 2, in the high-risk group (predicted survival probability < 20%) the observed saline (control) mortality rate is 60.0% and the observed DCLHb mortality rate is 92.3%. (Recall that in this study, treatment was observed to have an adverse effect on mortality.) Thus, the observed RR in this subgroup is 1.54 and the observed OR is 8.0. Given the high control group rate, the largest possible observed RR (assuming 60.0% mortality in the control group and 100% mortality in the DCLHb group) is 1.67 (=100%/60.0%). On the other hand, the observed RR in the low-risk group is 4.78 and the OR = 5.83.

This brings us to a potential contradiction arising from the use of RRs. Even though it is quite plausible that the biological effect of DCLHb is as great or greater in the higher-risk patients, it would be literally impossible to conclude this using the RR no matter what the data show unless the RRs for the lower-risk groups were below 1.67. In fact, to take the extreme case, had the observed mortality in patients in the high-risk group receiving DCLHb been 100%, and ignoring random error, it would be reasonable to argue that the effect of DCLHb is actually much greater in this subgroup (it would be 100% fatal!), despite the lower RR. Conversely, this extreme effect would be reflected by a very large (infinite) OR. When outcomes are common, the RR can be seriously misleading regarding the clinical significance of observed treatment differences.

These arguments apply equally to situations where the treatment has a beneficial effect (OR < 1 or RR < 1). Again, if events are common, and the RR is less than 1, say, 0.6, then this implies that the highest possible risk for a treated patient is 60% (=100% × 0.6), which would result if a patient had a baseline risk of 100%. It is highly unlikely that any treatment could have such a dramatic effect on patients who are otherwise certain to either die or experience failure. Again, if outcomes are sufficiently common, it is almost certain that there will be a substantial and unidentified subset of patients for whom the RR does not apply. On the other hand, given any estimated OR, we would still conclude that a patient who is at 100% risk without treatment will also be at 100% risk when treated, and others who are at very high risk will remain at very high risk. This behavior is likely to be more consistent with the true effect of treatment.

In part, as pointed out by Senn [10], the difficulty with RRs results from the somewhat arbitrary decision to use either the rate of success or the rate of failure as the summary outcome measure. Since the OR accounts for both failure and success rates symmetrically, it does not suffer from this difficulty. To illustrate, consider the gender and race study discussed earlier [2]. The results were reported by Schulman et al. as ORs, which were interpreted incorrectly by some as RRs. For example, an OR of .57 was interpreted as "Blacks and women with chest pain are 40 percent less likely than whites or men to be referred for cardiac catheterization." [13] The actual referral rates cited by Schulman et al. are 90.6% for whites and men and 84.7% for African Americans and women. Schwartz et al. argue correctly that this represents only a 7% reduction in the rate of referral based on a "risk ratio" of 0.93. What is peculiar about this conclusion is that, while in most settings "risk" refers to the probability of failure, the authors use "risk" to refer to the probability of referral, which in this study is implicitly viewed as success. Based on its more traditional meaning, it would seem more appropriate that "risk" refer to the probability of non-referral. In fact, the RR for non-referral is 1.63 (15.3/9.4). That is, African Americans and women are 63% more likely to not be referred than are whites and men. This RR is not too dissimilar to the OR for non-referral of 1.74 (=1/.57) and suggests a far more pronounced difference than does the 7% reduction in rate of referral. The apparent discrepancy results from the arbitrary decision to use either the rate of success or the rate of failure as the outcome measure. The perception of the stated difference can be heavily influenced by this choice. Since the OR combines the two outcomes (success and failure) symmetrically, it is not subject to such arbitrariness, and therefore is usually a more robust measure of treatment differences.

CONCLUSIONS

The assertion that odds ratios can mislead is true only when odds ratios are misinterpreted. The best solution is to encourage the proper use of odds ratios, especially when outcomes are common or when the range of underlying risks in the populations is large, rather than to discourage their use in the reporting of clinical studies. Furthermore, readers need to be educated in their proper interpretation. One can be misled by odds ratios only when they are applied as if they were risk ratios.

The assumption of homogeneity of odds ratios is far more tenable in most situations than the implicit assumption of homogeneity of risk ratios. The apparent ease of application of risk ratios is negated by the fact that they are not as well understood as many believe, and naive applications may be incorrect.

This paper has benefited from the helpful comments of Michael Kosorok, the senior statistical editor, and the reviewers.

References

1. Schwartz LM, Woloshin S, Welch HG. Misunderstandings about the effects of race and sex on physicians’ referrals for cardiac catheterization. N Engl J Med. 1999; 341:279–83.
2. Schulman KA, Berlin JA, Harless W, et al. The effect of race and sex on physicians’ recommendations for cardiac catheterization. N Engl J Med. 1999; 340:618–26.

3. Davies HTO, Crombie IK, Tavakoli M. When can odds ratios mislead? BMJ. 1998; 316:989–91.
4. Bracken MB, Sinclair JC. When can odds ratios mislead? Avoidable systematic error in estimating treatment effects must not be tolerated [letter; comment]. BMJ. 1998; 317:1156–7.
5. Deeks JJ. When can odds ratios mislead? Odds ratios should be used only in case–control studies and logistic regression analyses [letter; comment]. BMJ. 1998; 317:1156–7.
6. Altman DG, Deeks JJ, Sackett DL. Odds ratios should be avoided when events are common [letter]. BMJ. 1998; 317:1318.
7. Taeger D, Sun Y, Straif K. On the use, misuse and interpretation of odds ratios. eBMJ, 1998. Website: http://bmj.com/cgi/eletters/316/7136/989.
8. Sackett DL, Deeks JJ, Altman DG. Down with odds ratios. Evid Based Med. 1996; 1:164–6.
9. Zhang J, Yu KF. What’s the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA. 1998; 280:1690–1.
10. Senn S. Rare distinction and common fallacy [letter]. eBMJ, 1999. Website: http://bmj.com/cgi/eletters/317/7168/1318.
11. Sloan EP, Koenigsberg M, Gens D, et al. Diaspirin cross-linked hemoglobin (DCLHb) in the treatment of severe traumatic hemorrhagic shock, a randomized controlled efficacy trial. JAMA. 1999; 282:1857–63.
12. Boyd CR, Tolson MA, Copes WS. Evaluating trauma care: the TRISS method. J Trauma. 1987; 27:370–8.
13. Rubin R. Heart care reflects race and sex, not symptoms. USA Today. Feb 25, 1999:1A.