
ROC curve analysis and medical decision making: what’s the evidence that matters for evidence-based diagnosis?

Piergiorgio Duca
Biometria e Statistica Medica – Dipartimento di Scienze Cliniche Luigi Sacco – Università degli Studi – Via GB Grassi 74 – 20157 MILANO (ITALY)

1) The ROC (Receiver Operating Characteristic) curve and the Area Under the Curve (AUC)

The ROC curve is the statistical tool used to analyse the accuracy of a diagnostic test with multiple cut-off points. The test may be based on a continuous diagnostic indicant, such as a serum enzyme level, or on an ordinal one, such as a classification based on radiological imaging. The ROC curve is built from the probability density distributions, in actually diseased and non-diseased patients, of the diagnostician’s confidence in a positive diagnosis, together with a set of cut-off points that separate “positive” from “negative” test results (Egan, 1968; Bamber, 1975; Swets, Pickett, 1982; Metz, 1986; Zhou et al, 2002). The Sensitivity (SE), the proportion of diseased patients who test positive, and the Specificity (SP), the proportion of non-diseased patients who test negative, depend on the particular confidence threshold the observer applies to partition the continuously distributed perceptions of evidence into positive and negative test results. The ROC curve is the plot of all the pairs of True Positive Rate (TPR = SE), as ordinate, and False Positive Rate (FPR = 1 – SP), as abscissa, obtained over all possible cut-off points. It therefore represents all the compromises between SE and SP that can be achieved by changing the confidence threshold. The discrimination capacity of the test depends on how far apart the probability distributions of the perceived evidence are in the two populations, and the Area Under the Curve (AUC) is the best summary measure of this discriminant accuracy. The AUC corresponds to the probability that a patient chosen at random from the diseased population yields greater perceived evidence of disease than a patient chosen at random from the disease-free population. It depends only on the overlap of the two distributions of perceived evidence.
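The probabilistic reading of the AUC can be checked on a small sample. The sketch below is only an illustration: the ratings are hypothetical (not taken from any study), the function names are ours, and ties are counted as one half, an assumption that suits ordinal ratings. It computes the AUC both as the proportion of correctly ordered diseased/non-diseased pairs and as the trapezoidal area under the empirical ROC points.

```python
from itertools import product


def empirical_auc(diseased, healthy):
    """Proportion of (diseased, non-diseased) pairs in which the diseased
    patient has the higher score; a tie counts as one half."""
    wins = 0.0
    for d, h in product(diseased, healthy):
        if d > h:
            wins += 1.0
        elif d == h:
            wins += 0.5
    return wins / (len(diseased) * len(healthy))


def roc_points(diseased, healthy):
    """(FPR, TPR) pairs for every observed cut-off; a result is 'positive'
    when the score is greater than or equal to the cut-off."""
    cutoffs = sorted(set(diseased) | set(healthy), reverse=True)
    points = [(0.0, 0.0)]
    for c in cutoffs:
        tpr = sum(x >= c for x in diseased) / len(diseased)
        fpr = sum(x >= c for x in healthy) / len(healthy)
        points.append((fpr, tpr))
    return points


def trapezoid_area(points):
    """Area under the polygonal line joining consecutive ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical ordinal ratings: 1 = definitely normal ... 5 = definitely diseased
diseased_scores = [5, 4, 4, 3, 5, 2]
healthy_scores = [1, 2, 3, 1, 2, 4]

auc_pairs = empirical_auc(diseased_scores, healthy_scores)
auc_area = trapezoid_area(roc_points(diseased_scores, healthy_scores))
print(f"AUC from pairs: {auc_pairs:.4f}, AUC from trapezoids: {auc_area:.4f}")
```

On these hypothetical ratings both routes give the same AUC.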
The AUC takes a value between 0.5 (chance performance, with completely overlapping distributions) and 1.0 (perfect performance, with completely disjoint distributions). Reliable methods for estimating the AUC require some assumption about the functional form of the ROC curve, and the binormal form has been used in the diagnostic context. This functional form has been found to provide satisfactory fits to data generated in a very broad variety of real situations. It also has the convenient property that every ROC curve becomes a straight line when plotted on a “normal deviates” system of axes. Assuming this functional form and plotting the curve on normal-deviate axes, an estimate of the AUC can be derived by estimating the parameters α and β of the straight line that corresponds to the ROC curve in the ordinary [SE; (1 – SP)] space. α is the difference between the means of the perceived evidence in the two populations, ∆ = µD − µd, expressed in units of the standard deviation of the diseased population (σD); β is the standard deviation of the actually negative population (σd) expressed in units of the standard deviation of the actually positive population.

In the ordinary ROC space, assuming µD > µd and a cut-off XC:

$$SE = f(1 - SP), \qquad Z_{SE} = -\frac{X_C - \mu_D}{\sigma_D}, \qquad Z_{1-SP} = -\frac{X_C - \mu_d}{\sigma_d}$$

and in the “normal deviates” space:

$$Z_{SE} = \alpha + \beta\, Z_{1-SP}, \qquad \alpha = \frac{\Delta}{\sigma_D}, \qquad \beta = \frac{\sigma_d}{\sigma_D}$$

$$AUC = \Phi(Z_A), \qquad Z_A = \frac{\alpha}{(1 + \beta^2)^{1/2}}$$

The task of curve fitting becomes one of choosing numerical values a and b as estimates of α and β, for which a maximum likelihood estimation scheme can be applied (Dorfman, Alf, 1969; Metz, Herman, Shen 1998; Pepe 2003).
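Under the binormal assumption these relations reduce to a few lines of arithmetic. The following is a minimal sketch, not a fitting procedure: the means and standard deviations are hypothetical values treated as known, whereas in practice a and b would be estimated, for example by maximum likelihood. It computes α, β and the AUC, and checks that a point of the ROC curve lies on the straight line in normal-deviate coordinates.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def binormal_params(mu_D, sigma_D, mu_d, sigma_d):
    """alpha = (mu_D - mu_d) / sigma_D and beta = sigma_d / sigma_D."""
    alpha = (mu_D - mu_d) / sigma_D
    beta = sigma_d / sigma_D
    return alpha, beta


def binormal_auc(alpha, beta):
    """AUC = Phi(Z_A) with Z_A = alpha / sqrt(1 + beta^2)."""
    return PHI.cdf(alpha / sqrt(1 + beta ** 2))


def roc_point(x_c, mu_D, sigma_D, mu_d, sigma_d):
    """(1 - SP, SE) when values above the cut-off x_c are called positive."""
    se = 1 - NormalDist(mu_D, sigma_D).cdf(x_c)
    fpr = 1 - NormalDist(mu_d, sigma_d).cdf(x_c)
    return fpr, se


# Hypothetical population parameters (assumed known, for illustration only)
MU_D, SIGMA_D, MU_d, SIGMA_d = 2.0, 1.5, 0.0, 1.0

alpha, beta = binormal_params(MU_D, SIGMA_D, MU_d, SIGMA_d)
auc = binormal_auc(alpha, beta)

# A point of the curve, mapped to normal-deviate coordinates
fpr, se = roc_point(1.0, MU_D, SIGMA_D, MU_d, SIGMA_d)
z_se, z_fpr = PHI.inv_cdf(se), PHI.inv_cdf(fpr)

print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, AUC = {auc:.3f}")
print(f"Z_SE = {z_se:.4f}, alpha + beta * Z_(1-SP) = {alpha + beta * z_fpr:.4f}")
```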
Sources of variability to be considered are: 1) patient variability, due to random variation in the diagnostic difficulty of the cases and non-cases included in the sample; 2) between-observer variability, due to random variation in the skills of the observers included in the study; 3) within-observer variability, due to the inability of each observer to reproduce exactly his or her own diagnosis on repeated readings of the same case.

The AUC has only a limited meaning in clinical practice. To illustrate this point, consider a dichotomous test with SE = SP = 0.90. What is the corresponding AUC? There are three possible answers:

$$AUC_B = \Phi\left[\frac{Z_{SP} + Z_{SE}}{2^{1/2}}\right], \qquad AUC_C = \frac{SE + SP}{2}, \qquad AUC_D = SE \times SP$$

In our example:

$$AUC_B = \Phi\left[\frac{1.282 + 1.282}{2^{1/2}}\right] = \Phi(1.813) = 0.965$$

$$AUC_C = SE \times SP + \frac{SE\,(1 - SP) + SP\,(1 - SE)}{2} = \frac{SE + SP}{2} = 0.90$$

$$AUC_D = SE \times SP = 0.81$$

Which is the correct one? It depends on the context: 1) if we assume an observable continuous underlying variable that is binormal with equal variances, the first value is correct; 2) if, given a pair made up of one diseased and one non-diseased patient, we take the test-positive patient as the diseased one and flip a coin whenever both are test positive or both are test negative, the second value is correct; 3) if we are interested only in the probability of correctly identifying the diseased patient in a pair of patients with the same a priori probability of disease, without knowing in advance that exactly one of them is diseased, the third value is correct. Following Hilden (1991): “Knowing AUC might be useful in diagnosing an accidental interchange of two blood samples, from a D and d patient, but I have never been able to see that AUC has any operational meaning in the context of diagnosing one patient as D or d”.
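The three candidate values for the dichotomous test with SE = SP = 0.90 can be reproduced with the short sketch below. The function names are ours, and the first function assumes an equal-variance binormal model, which is the assumption that turns (ZSP + ZSE) into the distance between the two means.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def auc_binormal_equal_variance(se, sp):
    """AUC_B: Phi[(Z_SP + Z_SE) / sqrt(2)], assuming beta = 1."""
    return PHI.cdf((PHI.inv_cdf(sp) + PHI.inv_cdf(se)) / sqrt(2))


def auc_coin_flip(se, sp):
    """AUC_C: the pair is ordered correctly with probability SE*SP;
    when both patients get the same result a coin decides."""
    return se * sp + (se * (1 - sp) + sp * (1 - se)) / 2


def auc_product(se, sp):
    """AUC_D: probability that both patients are classified correctly."""
    return se * sp


SE, SP = 0.90, 0.90
auc_b = auc_binormal_equal_variance(SE, SP)
auc_c = auc_coin_flip(SE, SP)
auc_d = auc_product(SE, SP)
print(f"AUC_B = {auc_b:.3f}, AUC_C = {auc_c:.2f}, AUC_D = {auc_d:.2f}")
```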

2) Medical decision making and likelihood ratios (LR)

Although an ROC curve describes all the trade-offs that can be realized among the frequencies of TP, FN, TN and FP, a clinician is interested in defining the particular cut-off point that gives the most efficacious compromise in each case, considering the prevalence (P) of the disease (its a priori probability) and the utilities (U), numerical measures of the desirability of each consequence of a decision, preferably expressed as numbers between 0 and 1. The optimal operating point on the ROC curve has to be considered in terms of feasibility (a test is not always based on an observable continuous variable that can be measured with high precision) and of expected net benefit. In a clinical setting, a test is used to choose among alternative courses of action: it is a decision-making tool. The clinical decision maker is not content just to be right more often than wrong, nor just to know the true state of the patient: the aim is to do more good than harm to the patient. In the clinical decision context the inherent discrimination capacity of the test is just one of the things a clinician must consider. The utilities of the different conditions and the pre-test disease probability, both largely dependent on subjective judgment, must be considered too. In a very simple situation, with a choice among treatment (T), no treatment (t) and testing (A) for a patient who has the disease D with a priori probability P, the clinical decision depends on which is the greatest of three quantities: the expected utility of treating, E(U)T; the expected utility of not treating, E(U)t; and the expected utility of testing and treating only the positives, E(U)A.
If we take the maximization of expected utility as the criterion for choosing the best strategy, we will decide to treat, not to treat, or to test the patient according to which decision has the maximum expected utility, computed as follows (Pauker, Kassirer, 1980):

$$E(U)_T = P\,U_{DT} + (1 - P)\,U_{dT}$$

$$E(U)_t = P\,U_{Dt} + (1 - P)\,U_{dt}$$

$$E(U)_A = P\,SE\,U_{DT} + P\,(1 - SE)\,U_{Dt} + (1 - P)\,SP\,U_{dt} + (1 - P)\,(1 - SP)\,U_{dT}$$

Obviously, the utility of each condition, for example “to be diseased and treated”, depends both on the effectiveness of the treatment and on the characteristics of the disease. If 100% of patients are cured, UDT is set to 1, the maximum possible value. If the treatment is not 100% effective, UDT is defined in terms of effectiveness, residual disability, quality of life, life expectancy and so on. The utility of the condition “to be disease-free and treated” (UdT) depends instead mainly on the adverse effects of the treatment. The expected utility of a decision depends on the utilities defined above and on the probability of the disease. The expected utility of the decision “to test, and to treat the positives only”, E(U)A, depends also on the SE and SP of the test. To identify the best strategy, we can look for the conditions under which the choice between two decisions is indifferent. Starting with the dichotomy “T” vs “t”, there is a value P* of the disease probability at which either can be chosen, because E(U)T = E(U)t:

$$P^* = \frac{C/B}{C/B + 1}$$

where C = (Udt – UdT) is the cost of the treatment, in terms of adverse effects on disease-free patients, and B = (UDT – UDt) is the benefit of the treatment, in terms of effectiveness for diseased patients. Only if the disease probability is greater than P* is “to treat” the best decision.
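The expected utilities and the treatment threshold can be written out as a short sketch. The utility values below are hypothetical, chosen only so that the arithmetic is easy to follow; they are not taken from Pauker and Kassirer or from any clinical source, and the function names are ours.

```python
def expected_utilities(p, se, sp, u_DT, u_Dt, u_dT, u_dt):
    """Expected utility of treating (T), not treating (t), and
    testing then treating only the positives (A)."""
    eu_treat = p * u_DT + (1 - p) * u_dT
    eu_no_treat = p * u_Dt + (1 - p) * u_dt
    eu_test = (p * se * u_DT + p * (1 - se) * u_Dt
               + (1 - p) * sp * u_dt + (1 - p) * (1 - sp) * u_dT)
    return {"T": eu_treat, "t": eu_no_treat, "A": eu_test}


def treatment_threshold(u_DT, u_Dt, u_dT, u_dt):
    """P* = (C/B) / (C/B + 1), with C = U_dt - U_dT and B = U_DT - U_Dt."""
    cost = u_dt - u_dT        # harm done by treating a disease-free patient
    benefit = u_DT - u_Dt     # gain from treating a diseased patient
    ratio = cost / benefit
    return ratio / (ratio + 1)


# Hypothetical utilities on a 0-1 scale
UTILS = dict(u_DT=0.8, u_Dt=0.4, u_dT=0.9, u_dt=1.0)

p_star = treatment_threshold(**UTILS)
at_threshold = expected_utilities(p_star, se=0.90, sp=0.90, **UTILS)
print(f"P* = {p_star:.3f}")
print(f"E(U)T = {at_threshold['T']:.3f}, E(U)t = {at_threshold['t']:.3f}")
```

With these hypothetical utilities C/B = 0.25, and at P = P* the expected utilities of treating and of not treating coincide, as the indifference argument requires.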
To choose between “t” and “A” we can show, analogously, that there is a threshold value PA at which either can be chosen, because E(U)t = E(U)A:

$$P_A = \frac{C/B}{C/B + LR^+}$$

where LR+ = SE/(1 – SP) is the likelihood ratio of a positive test. When P is less than P* but greater than PA, the patient must be tested in order to do more good than harm. Finally, to choose between “A” and “T”, there is a threshold value PT at which either can be chosen, because E(U)T = E(U)A:

$$P_T = \frac{C/B}{C/B + LR^-}$$

where LR– = (1 – SE)/SP is the likelihood ratio of a negative test. When P is greater than P* but less than PT, the patient must be tested and treated only if the test is positive. The correct clinical decision, in terms of expected utility, therefore depends on the relationship between P, the pre-test probability of disease for the particular patient, and the thresholds computed above. These thresholds are defined by the cost/benefit ratio, which depends on the natural history of the disease and on the effectiveness and harmfulness of the treatment, and by the accuracy of the diagnostic test, through its LRs.
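The three thresholds partition the range of pre-test probabilities into a "do not treat", a "test" and a "treat" region. The sketch below is an illustration under assumed inputs: SE, SP, C and B are hypothetical, and the function names are ours. The two gain functions come from subtracting the expected utilities term by term, which gives E(U)A − E(U)t = P·SE·B − (1 − P)(1 − SP)·C and E(U)A − E(U)T = (1 − P)·SP·C − P(1 − SE)·B.

```python
def decision_thresholds(se, sp, cost, benefit):
    """Return (P_A, P*, P_T) for a test with the given SE and SP."""
    ratio = cost / benefit                 # C/B
    lr_pos = se / (1 - sp)                 # LR+
    lr_neg = (1 - se) / sp                 # LR-
    p_a = ratio / (ratio + lr_pos)         # no treatment vs test
    p_star = ratio / (ratio + 1)           # no treatment vs treatment
    p_t = ratio / (ratio + lr_neg)         # test vs treatment
    return p_a, p_star, p_t


def gain_test_vs_no_treatment(p, se, sp, cost, benefit):
    """E(U)A - E(U)t expressed through C and B."""
    return p * se * benefit - (1 - p) * (1 - sp) * cost


def gain_test_vs_treatment(p, se, sp, cost, benefit):
    """E(U)A - E(U)T expressed through C and B."""
    return (1 - p) * sp * cost - p * (1 - se) * benefit


def best_strategy(p, se, sp, cost, benefit):
    """'t' below P_A, 'T' above P_T, 'A' (test) in between."""
    p_a, _, p_t = decision_thresholds(se, sp, cost, benefit)
    if p < p_a:
        return "t"
    if p > p_t:
        return "T"
    return "A"


# Hypothetical inputs: a test with SE = SP = 0.90 and C/B = 0.25
SE, SP, COST, BENEFIT = 0.90, 0.90, 0.1, 0.4
p_a, p_star, p_t = decision_thresholds(SE, SP, COST, BENEFIT)
print(f"P_A = {p_a:.3f}, P* = {p_star:.3f}, P_T = {p_t:.3f}")
for p in (0.01, 0.30, 0.90):
    print(p, best_strategy(p, SE, SP, COST, BENEFIT))
```

The ordering PA < P* < PT holds whenever LR+ > 1 > LR–, that is, whenever the test is informative.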

3) LR and AUC

From the ROC curve we can derive three differently defined Likelihood Ratios (Choi, 1998): 1) the slope of the tangent to the ROC curve at a point, which is the likelihood ratio of the particular x value that generates the pair (SE, SP) corresponding to that point; 2) the slope of the segment connecting two points of the curve, which is the likelihood ratio of an x value falling inside that interval of values; 3) the slope of the line connecting the origin to a point on the ROC curve and the slope of the line connecting the same point to the upper right corner of the plot, which are, respectively, the LR+ and the LR– computed by treating the test as dichotomous. The ROC slope decreases along the whole extent of the curve, from infinity to zero. Although this holds strictly only for binormal ROCs with β = 1 (σD = σd), its failure is confined to the extremes of the curve and need not concern us, since we are interested in the portion of the ROC curve with (1 – SP) below 0.3. Note also that, for a binormal ROC curve with β = 1, the point with tangent slope LR = 1 corresponds to a dichotomous cut-off with SE = SP, (SE + SP) at its maximum, LR+ = 1/LR–, and

$$LR^+ = \frac{\Phi(Z_A / 2^{1/2})}{\Phi(-Z_A / 2^{1/2})}$$

The tangent slope has no general application in itself in the clinical context, where the LR generally used is that of type 3, even for a test based on a continuous variable. In practice the clinician defines a single cut-off point or a set of 2 to 4 points, and chooses among them either the most stringent one, to confirm a disease (SpIn test: a high specificity test, informative if positive), or the most lenient one, to exclude a disease (SnOut test: a high sensitivity test, informative if negative). Any intermediate value is inconclusive and gives little information about the true state of the patient.
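The three kinds of likelihood ratio can be compared on a single curve. The sketch below assumes a hypothetical equal-variance binormal model (β = 1) with means chosen only for illustration; the function names are ours. It also checks, at the symmetric cut-off, the properties listed for the point where the tangent slope equals 1.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()                          # standard normal distribution
diseased = NormalDist(mu=2.0, sigma=1.0)    # hypothetical diseased population
healthy = NormalDist(mu=0.0, sigma=1.0)     # hypothetical disease-free population


def roc_point(x_c):
    """(1 - SP, SE) when values above the cut-off x_c are called positive."""
    return 1 - healthy.cdf(x_c), 1 - diseased.cdf(x_c)


def lr_tangent(x):
    """Type 1: tangent slope, the ratio of the two densities at the value x."""
    return diseased.pdf(x) / healthy.pdf(x)


def lr_interval(x_low, x_high):
    """Type 2: chord slope, the LR of a value lying between x_low and x_high."""
    fpr_low, se_low = roc_point(x_low)
    fpr_high, se_high = roc_point(x_high)
    return (se_low - se_high) / (fpr_low - fpr_high)


def lr_dichotomous(x_c):
    """Type 3: LR+ (slope from the origin) and LR- (slope to the corner (1, 1))."""
    fpr, se = roc_point(x_c)
    return se / fpr, (1 - se) / (1 - fpr)


x_sym = (diseased.mean + healthy.mean) / 2   # cut-off at which SE = SP
z_a = (diseased.mean - healthy.mean) / sqrt(diseased.stdev ** 2 + healthy.stdev ** 2)
lr_pos, lr_neg = lr_dichotomous(x_sym)
lr_pos_formula = PHI.cdf(z_a / sqrt(2)) / PHI.cdf(-z_a / sqrt(2))

print(f"tangent LR at x_sym = {lr_tangent(x_sym):.3f}")
print(f"interval LR on [0.5, 1.5] = {lr_interval(0.5, 1.5):.3f}")
print(f"LR+ = {lr_pos:.3f}, LR- = {lr_neg:.3f}, formula LR+ = {lr_pos_formula:.3f}")
```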
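With binormal distributions we have:

1) $$LR^+ = \frac{SE}{1 - SP} = \frac{\Phi(Z_{SE})}{\Phi(-Z_{SP})}, \qquad LR^- = \frac{1 - SE}{SP} = \frac{\Phi(-Z_{SE})}{\Phi(Z_{SP})}$$

2) $$LR = \beta \exp\left[\tfrac{1}{2}\left(Z_{SP}^2 - Z_{SE}^2\right)\right], \qquad AUC = \Phi\left[\frac{\mu_D - \mu_d}{(\sigma_D^2 + \sigma_d^2)^{1/2}}\right]$$

and, assuming homoscedasticity (σD = σd = σ):

3) $$LR = \exp\left[\frac{Z_A\,(Z_{SP} - Z_{SE})}{2^{1/2}}\right], \qquad X_C = \frac{\mu_D + \mu_d}{2} + \frac{\sigma \ln(LR)}{2^{1/2}\, Z_A}$$

The relationship between LR and AUC is not a simple one (Johnson, 2004).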
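Relations 2) and 3) can be checked numerically against the ratio of the two normal densities, which is the tangent-slope LR by definition. The sketch below uses hypothetical population parameters and our own function names; in relation 3) the factor 2^(1/2) is placed so that Z_A = (µD − µd)/(σ 2^(1/2)), an assumption consistent with the AUC formula in relation 2).

```python
from math import exp, log, sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def lr_from_deviates(beta, z_sp, z_se):
    """Relation 2: LR = beta * exp[(Z_SP^2 - Z_SE^2) / 2]."""
    return beta * exp(0.5 * (z_sp ** 2 - z_se ** 2))


def cutoff_for_lr(lr, mu_D, mu_d, sigma):
    """Relation 3 (equal variances): the cut-off X_C whose tangent LR is lr."""
    z_a = (mu_D - mu_d) / (sigma * sqrt(2))
    return (mu_D + mu_d) / 2 + sigma * log(lr) / (sqrt(2) * z_a)


# Hypothetical unequal-variance populations, to check relation 2
diseased = NormalDist(mu=2.0, sigma=1.5)
healthy = NormalDist(mu=0.0, sigma=1.0)
x_c = 1.0
z_se = PHI.inv_cdf(1 - diseased.cdf(x_c))   # normal deviate of SE
z_sp = PHI.inv_cdf(healthy.cdf(x_c))        # normal deviate of SP
beta = healthy.stdev / diseased.stdev
lr_formula = lr_from_deviates(beta, z_sp, z_se)
lr_density = diseased.pdf(x_c) / healthy.pdf(x_c)

# Hypothetical equal-variance populations, to check relation 3
x_for_lr3 = cutoff_for_lr(3.0, mu_D=2.0, mu_d=0.0, sigma=1.0)
lr_check = NormalDist(2.0, 1.0).pdf(x_for_lr3) / NormalDist(0.0, 1.0).pdf(x_for_lr3)

print(f"relation 2: formula {lr_formula:.4f} vs density ratio {lr_density:.4f}")
print(f"relation 3: cut-off {x_for_lr3:.4f} gives LR {lr_check:.4f}")
```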

4) Conclusions

A diagnosis is relevant insofar as it directs the treatment and, therefore, the prognosis of a particular patient. The a priori probability of the disease, the availability of an informative diagnostic test, the availability of an effective treatment, and the knowledge of the prognosis of the untreated diseased patient are all embedded in the complex clinical decision process. Clinicians managing individual patients have to consider the trade-offs that influence the decision whether to withhold therapy, to order a test, or to administer a therapy. The evidence that has to be incorporated into such a strategy includes the reliability and risk of the test, the effectiveness and harmfulness of the treatment, the a priori probability of the disease and the knowledge of its prognosis if untreated. In particular, the usefulness of the test depends on its LRs, which define the decision thresholds PA and PT. When the disease probability falls outside the PA – PT interval, the test result will not alter the optimal therapeutic action, so the test is not useful. Only if the disease probability falls inside the interval between the thresholds can the test result change the clinical strategy, and only in this case is performing the test the best choice. For these reasons, the evidence clinicians need in clinical decision making consists of precisely estimated LRs (Black et al. 1999), for well defined cut-off points, reproducible within and between diagnosticians, and adjusted for relevant patient characteristics.

5) References

1) Bamber D (1975): The area above the ordinal dominance graph and the area below the receiver operating graph. J Math Psychol 12: 387–415
2) Black ER, Bordley DR, Tape TG, Panzer (1999): Diagnostic strategies for common medical problems. American College of Physicians, Philadelphia
3) Choi BCK (1998): Slopes of a Receiver Operating Characteristics Curve and Likelihood Ratios for a Diagnostic Test. Am J Epidemiol 148: 1127–1132
4) Dorfman DD, Alf E (1969): Maximum likelihood estimation of parameters of signal detection theory and determination of confidence intervals – rating method data. J Math Psychol 6: 487–96
5) Egan JM (1968): Signal detection theory and ROC analysis. Academic Press, New York
6) Hilden J (1991): The area under the ROC curve and its competitors. Med Dec Making 11: 95–101
7) Johnson NP (2004): Advantages to transforming the receiver operating characteristic (ROC) curve into likelihood ratio co-ordinates. Stat Med 23: 2257 - 2566
8) Metz CE (1986): ROC methodology in radiologic imaging. Invest Radiol 21: 720–733
9) Metz CE, Herman BA, Shen J-H (1998): Maximum likelihood estimation of Receiver Operating Characteristic (ROC) curves from continuously-distributed data. Statist Med 17: 1033–1053
10) Pauker SG, Kassirer JP (1980): The threshold approach to clinical decision making. NEJM 302: 1109–1117
11) Pepe MS (2003): The statistical evaluation of medical tests for classification and prediction. Oxford University Press, Oxford
12) Swets JA, Pickett RM (1982): Evaluation of diagnostic systems: methods from signal detection theory. Academic Press, New York
13) Zhou X-H, Obuchowski NA, McClish DK (2002): Statistical methods in diagnostic medicine. Wiley Interscience, New York