ROC Curve Analysis and Medical Decision Making

ROC curve analysis and medical decision making: what's the evidence that matters for evidence-based diagnosis?

Piergiorgio Duca
Biometria e Statistica Medica – Dipartimento di Scienze Cliniche Luigi Sacco – Università degli Studi – Via GB Grassi 74 – 20157 MILANO (ITALY)
[email protected]

1) The ROC (Receiver Operating Characteristic) curve and the Area Under the Curve (AUC)

The ROC curve is the statistical tool used to analyse the accuracy of a diagnostic test with multiple cut-off points. The test may be based on a continuous diagnostic indicant, such as a serum enzyme level, or on an ordinal one, such as a classification based on radiological imaging. The ROC curve is based on the probability density distributions, in actually diseased and non-diseased patients, of the diagnostician's confidence in a positive diagnosis, and on a set of cut-off points separating "positive" and "negative" test results (Egan, 1968; Bamber, 1975; Swets, Pickett, 1982; Metz, 1986; Zhou et al, 2002).

The Sensitivity (SE), the proportion of diseased patients who test positive, and the Specificity (SP), the proportion of non-diseased patients who test negative, depend on the particular confidence threshold the observer applies to partition the continuously distributed perceptions of evidence into positive and negative test results. The ROC curve is the plot of all the pairs of True Positive Rates (TPR = SE), as ordinate, and False Positive Rates (FPR = 1 – SP), as abscissa, corresponding to all the possible cut-off points. An ROC curve represents all of the compromises between SE and SP that can be achieved by changing the confidence threshold.

The discrimination capacity of the test depends on the extent to which the probability distributions of the perceived evidence of the true state of the patients are separated, and the Area Under the Curve (AUC) is the best summary measure of this discriminant accuracy.
The AUC corresponds to the probability that a patient randomly chosen from the diseased population will give a perceived evidence of disease greater than that of a patient randomly chosen from the disease-free population. It depends only on the overlap of the distributions of perceived evidence. It takes a value between 0.5 (chance performance, with completely overlapping distributions) and 1.0 (perfect performance, with completely disjoint distributions).

Reliable methods for estimating the AUC require some assumption about the functional form of the ROC curve, and the binormal form has been used in the diagnostic context. This functional form has been found to provide satisfactory fits to data generated in a very broad variety of real situations. It also has the convenient property that all possible ROC curves are transformed into straight lines if plotted on a "normal deviates" system of axes. Assuming this functional form and plotting the curve on normal-deviates axes, an estimate of the AUC can be derived by estimating the parameters α, β of the straight line corresponding to the ROC curve in the ordinary [SE; (1 – SP)] space. α is the difference of the means of the perceived evidence in the two populations, ∆ = µD − µd, expressed in units of the standard deviation of the diseased population (σD); β is the standard deviation of the actually negative population (σd) expressed in units of the standard deviation of the actually positive population.

In the ordinary ROC space, assuming µD > µd and a cut-off point XC:

$$ SE = f(1 - SP), \qquad Z_{SE} = -\frac{X_C - \mu_D}{\sigma_D}, \qquad Z_{1-SP} = -\frac{X_C - \mu_d}{\sigma_d} $$

and in the "normal deviates" space:

$$ Z_{SE} = \alpha + \beta \, Z_{1-SP}, \qquad \alpha = \frac{\Delta}{\sigma_D}, \qquad \beta = \frac{\sigma_d}{\sigma_D} $$

$$ AUC = \Phi(Z_A), \qquad Z_A = \frac{\alpha}{(1 + \beta^2)^{1/2}} $$

The task of curve fitting becomes one of choosing numerical values a and b as estimates of α and β, and a maximum likelihood parameter estimation scheme can be applied (Dorfman, Alf, 1969; Metz, Herman, Shen, 1998; Pepe, 2003).
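The binormal relations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`binormal_auc`, `binormal_sensitivity`, `empirical_auc`) and the example means and standard deviations are hypothetical and do not come from the paper. The empirical function is the plain pairwise (Mann-Whitney type) estimate of the probability interpretation of the AUC, not the maximum likelihood fit cited in the text.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: Phi is .cdf, its inverse is .inv_cdf


def binormal_auc(alpha: float, beta: float) -> float:
    """AUC = Phi(alpha / sqrt(1 + beta^2)) under the binormal model."""
    return STD_NORMAL.cdf(alpha / (1.0 + beta ** 2) ** 0.5)


def binormal_sensitivity(alpha: float, beta: float, fpr: float) -> float:
    """SE at a given FPR = 1 - SP, from Z_SE = alpha + beta * Z_(1-SP)."""
    return STD_NORMAL.cdf(alpha + beta * STD_NORMAL.inv_cdf(fpr))


def empirical_auc(diseased, healthy) -> float:
    """Fraction of (diseased, healthy) pairs in which the diseased patient
    scores higher; ties count one half."""
    wins = 0.0
    for x in diseased:
        for y in healthy:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(diseased) * len(healthy))


# Hypothetical populations (illustrative values only).
mu_D, sigma_D = 2.0, 1.0   # diseased
mu_d, sigma_d = 0.0, 1.0   # disease-free
alpha = (mu_D - mu_d) / sigma_D
beta = sigma_d / sigma_D
auc = binormal_auc(alpha, beta)
chance_auc = binormal_auc(0.0, 1.0)  # identical distributions
print(round(auc, 3), chance_auc)
```

With identical distributions (α = 0, β = 1) the sketch returns 0.5, the chance value mentioned in the text, and the binormal curve coincides with the diagonal SE = 1 – SP.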
Sources of variability to be considered are:
1) The patient variability, due to random variation in the diagnostic difficulty of the cases and non-cases included in the sample;
2) The observer variability, due to random variation in the skills of the observers included in the study;
3) The within-observer variability, due to the inability of each observer to reproduce exactly his own diagnosis on repeated occasions on the same case.

The AUC has only a limited meaning in clinical practice. To illustrate this point, consider a dichotomous test with SE = SP = 0.90. What is the corresponding AUC? There are three possible answers:

$$ AUC_B = \Phi\left[ \frac{Z_{SP} + Z_{SE}}{2^{1/2}} \right], \qquad AUC_C = \frac{SE + SP}{2}, \qquad AUC_D = SE \times SP $$

In our example:

$$ AUC_B = \Phi\left[ \frac{1.282 + 1.282}{2^{1/2}} \right] = \Phi(1.813) = 0.965 $$

$$ AUC_C = SE \times SP + \frac{SE\,(1 - SP) + SP\,(1 - SE)}{2} = \frac{SE + SP}{2} = 0.90 $$

$$ AUC_D = SE \times SP = 0.81 $$

Which is the correct one? It depends on the context:
1) assuming an observable, continuous, binormal underlying variable, the correct value of the AUC is the first one;
2) if the interest is in guessing which patient of a pair is actually the diseased one, flipping a coin whenever both test positive or both test negative, the correct answer is the second one;
3) if the interest is only in the probability of correctly identifying the positive patient in a pair of patients with the same a priori probability of disease, without knowing in advance that exactly one of them is diseased, the third value is correct.

Following Hilden (1991): "Knowing AUC might be useful in diagnosing an accidental interchange of two blood samples, from a D and d patient, but I have never been able to see that AUC has any operational meaning in the context of diagnosing one patient as D or d".
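The three candidate values for the SE = SP = 0.90 example can be recomputed with a short Python sketch. The function name `three_aucs` is hypothetical, and the formula for the first value assumes the equal-variance binormal case (β = 1), which is what the worked example in the text implies.

```python
from statistics import NormalDist


def three_aucs(se: float, sp: float) -> tuple[float, float, float]:
    """Return (AUC_B, AUC_C, AUC_D) for a dichotomous test with given SE, SP."""
    std = NormalDist()
    z_se = std.inv_cdf(se)  # normal deviate of the sensitivity
    z_sp = std.inv_cdf(sp)  # normal deviate of the specificity
    # B: equal-variance binormal curve through the point (1 - SP, SE).
    auc_b = std.cdf((z_se + z_sp) / 2 ** 0.5)
    # C: concordant pairs count 1, tied pairs (both + or both -) count 1/2.
    auc_c = se * sp + (se * (1 - sp) + sp * (1 - se)) / 2
    # D: only fully concordant pairs count.
    auc_d = se * sp
    return auc_b, auc_c, auc_d


auc_b, auc_c, auc_d = three_aucs(0.90, 0.90)
print(round(auc_b, 3), round(auc_c, 2), round(auc_d, 2))
```

The second value always reduces to (SE + SP)/2, the area under the two-segment curve joining (0, 0), (1 – SP, SE) and (1, 1).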
2) Medical decision making and likelihood ratios (LR)

Although an ROC curve describes all the tradeoffs that can be realized among the frequencies of TP, FN, TN, and FP, a clinician is interested in defining the particular cut-off point that gives the most efficacious compromise in each particular case, considering the prevalence (P) of the disease (the a priori probability of disease) and the utilities (U): numerical measures of the desirability of each consequence of a decision, preferably expressed as a number between 0 and 1. The optimal operating point on the ROC curve has to be considered in terms of feasibility (a test is not always based on an observable continuous variable measurable with high precision) and of expected net benefit.

In a clinical setting, a test is used for choosing among alternative courses of action. It is a decision making tool. The clinical decision maker would not be content just to be right more often than wrong. He does not want merely to know the true state of the patient; he wants to do more good than harm to his patient. In the clinical decision context the inherent discrimination capacity of the test is just one of the things a clinician must consider. The utilities of the different conditions and the pre-test probability of disease, both largely dependent on subjective judgment, must be considered too.

In a very simple situation, having to choose among treatment (T), no treatment (t), and testing (A) for a patient affected by the disease D with a priori probability P, the clinical decision will depend on which of three expected utilities is the greatest: the expected utility of treatment, E(U)T; the expected utility of no treatment, E(U)t; and the expected utility of testing and treating only the positives, E(U)A.
If we adopt maximization of the expected utility as the criterion for choosing the best strategy, we will decide to treat, not to treat, or to test the patient according to which decision has the maximum expected utility, computed as follows (Pauker, Kassirer, 1980):

$$ E(U)_T = P \, U_{DT} + (1 - P) \, U_{dT} $$

$$ E(U)_t = P \, U_{Dt} + (1 - P) \, U_{dt} $$

$$ E(U)_A = P \, SE \, U_{DT} + P \, (1 - SE) \, U_{Dt} + (1 - P) \, SP \, U_{dt} + (1 - P) \, (1 - SP) \, U_{dT} $$

Obviously, the utility of each condition, for example "to be diseased and treated", will depend both on the effectiveness of the treatment and on the characteristics of the disease. If 100% of patients are cured, UDT will be set to 1, the maximum possible value. If the treatment is not 100% effective, UDT will be defined in terms of effectiveness, residual disability, quality of life, life expectancy and so on. The utility of the condition "to be disease-free and treated" (UdT) will instead depend mainly on the adverse effects of the treatment. The expected utility of a decision depends on the utilities defined above and on the probability of disease. The expected utility of the decision "to test, and to treat the positives only", E(U)A, depends also on the SE and SP of the test.

In order to identify the best strategy, we can look for the conditions under which the choice between two decisions is a matter of indifference. Starting with the dichotomy "T" vs "t", we can show that there is a value of the disease probability, P*, at which either can be chosen because E(U)T = E(U)t:

$$ P^{*} = \frac{C/B}{C/B + 1} $$

where C = (Udt – UdT) is the cost of the treatment, in terms of adverse effects on disease-free patients, and B = (UDT – UDt) is the benefit of the treatment, in terms of effectiveness for diseased patients. Only if the disease probability is greater than P* will the best decision be "to treat".
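The three expected utilities and the treatment threshold P* can be illustrated with a minimal Python sketch. The function names and all the utility values below are hypothetical, chosen only to make the arithmetic visible; they are not taken from the paper or from Pauker and Kassirer.

```python
def expected_utilities(p, se, sp, u_DT, u_Dt, u_dT, u_dt):
    """Expected utility of treating (T), not treating (t) and
    testing-then-treating-positives (A) at disease probability p."""
    eu_treat = p * u_DT + (1 - p) * u_dT
    eu_no_treat = p * u_Dt + (1 - p) * u_dt
    eu_test = (p * se * u_DT + p * (1 - se) * u_Dt
               + (1 - p) * sp * u_dt + (1 - p) * (1 - sp) * u_dT)
    return {"T": eu_treat, "t": eu_no_treat, "A": eu_test}


def treatment_threshold(u_DT, u_Dt, u_dT, u_dt):
    """P* = (C/B) / (C/B + 1), with C = U_dt - U_dT and B = U_DT - U_Dt."""
    cost = u_dt - u_dT        # harm of treating a disease-free patient
    benefit = u_DT - u_Dt     # gain from treating a diseased patient
    ratio = cost / benefit
    return ratio / (ratio + 1)


# Hypothetical utilities on a 0-1 scale (illustrative assumptions).
utilities = dict(u_DT=0.9, u_Dt=0.5, u_dT=0.9, u_dt=1.0)
p_star = treatment_threshold(**utilities)

# At P* the strategies "treat" and "do not treat" have equal expected utility.
at_threshold = expected_utilities(p_star, se=0.9, sp=0.9, **utilities)

# Pick the strategy with the maximum expected utility at some probability.
at_half = expected_utilities(0.5, se=0.9, sp=0.9, **utilities)
best_at_half = max(at_half, key=at_half.get)
print(p_star, at_threshold, best_at_half)
```

With these assumed utilities C = 0.1 and B = 0.4, so C/B = 0.25 and P* = 0.2.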
In order to choose between "t" and "A" we can show, analogously, that there exists a threshold value PA at which either can be chosen because E(U)t = E(U)A:

$$ P_A = \frac{C/B}{C/B + LR^{+}} $$

where LR+ = SE/(1 – SP) is the likelihood ratio of a positive test result.
