Receiver Operating Characteristic (ROC) Curves

Receiver operating characteristic (ROC) analysis was developed to summarize data from signal detection experiments in psychophysics [21]. Today, the term refers to a method of quantifying how accurately experimental subjects, professional diagnosticians and prognosticators (and their various tools: tests or instruments yielding numerical results, combinations of data-collection and data-display devices, different amounts and types of information ...) perform when they are required to make a series of fine discriminations or to say which of two conditions or states of nature, confusable at the moment of decision, exists or will exist [50].

In biomedical applications, the two states are often referred to as diseased and nondiseased, or D+ and D− for short. Central to this analysis is the ROC curve, which displays diagnostic accuracy as a series of pairs of performance measures. Each pair consists of a true positive fraction (TPF) and the corresponding false positive fraction (FPF) for a given definition of "test" (t) positivity, t+. These fractions are calculated from the D+ and D− groups respectively: TPF as the proportion of (t+, D+) among those D+, and FPF as the proportion of (t+, D−) among those D−. In the medical literature, the TPF is called sensitivity and the complement of the FPF is called specificity. For the performance of statistical tests, the term power, rather than sensitivity, tends to be used. Equivalently, one can use its complement, the false negative fraction (FNF, the complement of TPF), or the frequency (β) of a so-called type II error. There is no direct statistical term for specificity. Instead, statisticians again focus on the complement, using the false positive fraction (FPF, the complement of the true negative fraction, TNF) to denote the frequency (α) of what they call type I error.

Diagnostic performance is sometimes naively characterized using a single overall index of "accuracy", calculated as the sum of two proportions, i.e. the proportion of (t+, D+) instances plus the proportion of (t−, D−) instances, where the proportions are based on all patients undergoing the test. This index is a weighted average of sensitivity and specificity, using as weights the (particularistic) relative frequencies of the D+ and D− states. Using two measures, namely a (TPF, FPF) pair, avoids this arbitrariness.

Although a (TPF, FPF) pair is a big improvement over an overall accuracy index, it is often not sufficient. A single (TPF, FPF) pair still does not allow meaningful comparison of the performance of one diagnostic test with another, or even with the same test performed in another setting or by another observer, when different criteria for test positivity are used in the two instances compared. The ROC curve, in the form of a series of (TPF, FPF) pairs (see Figure 1), isolates a test's capacity to discriminate between a given disease and its absence from the confounding influence of the decision criterion (confidence level or cutting score) that is adopted for test positivity [37, 52, 58]. A more accurate test will be located on an ROC curve closer to the top left corner than a less accurate one. A noninformative test will have an ROC curve that lies along the diagonal.

Statistical techniques to handle the full range of ROC study designs continue to be developed [3, 5, 9, 26]. Analyses can vary in complexity from deriving an ROC curve for a single diagnostic test involving numerical values derived from patients at a single institution, to complex multi-institution studies to compare two or more imaging modalities. The complexity also depends on the purpose of the discrimination test, the setting and context to which it refers, whether in the study interpretations are performed individually in real time [18] or later in "batch" mode [52], and whether the tests under study and the procedures for independent definitive determination of the true state of nature (the gold standard) are costly, invasive, uncomfortable or dangerous.

This latter issue can create special problems since ROC curves are strongly influenced by the source of the test material used [6]. Distortions occur when the result of the test being studied affects the subsequent work-up needed to establish a definitive diagnosis. Information available on the distribution of test results and clinical indicants in the source population can be used to remove quantitatively this "verification bias" from ROC curves [20, 30]. Other biases in the assessment of diagnostic tests, and guidelines for circumventing the problems in prospective studies, have been described [4, 7].
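The definitions above translate directly into a few lines of code. A minimal sketch (the counts and function name are illustrative, not from the article):

```python
def tpf_fpf(tp, fn, fp, tn):
    """Return (TPF, FPF) from a 2x2 table of test (t+/t-) by disease (D+/D-).

    TPF (sensitivity) is computed within the D+ group only, and
    FPF (the complement of specificity) within the D- group only,
    so neither depends on the relative frequencies of D+ and D-.
    """
    tpf = tp / (tp + fn)   # proportion of (t+, D+) among those D+
    fpf = fp / (fp + tn)   # proportion of (t+, D-) among those D-
    return tpf, fpf

# Hypothetical table: 45 of 50 D+ cases test positive, 10 of 100 D- cases do
tpf, fpf = tpf_fpf(tp=45, fn=5, fp=10, tn=90)

# The "naive" overall accuracy index is a prevalence-weighted average of
# sensitivity and specificity: here (45 + 90) / 150
accuracy = (45 + 90) / 150
```

Note that `accuracy` changes if the same test is applied to a population with a different D+/D− mix, whereas the (TPF, FPF) pair does not.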
The complexity of the test material can have an important bearing on the ability of a study to compare tests. Cases resulting in an ROC curve that is midway between the diagonal (subtle or completely obscure ones) and the upper left corner (all 'obvious' ones) allow for sizable differences in performance; however, the closer the curve is to the upper left corner, the narrower is the sampling distribution of the various indices derived from the curve [39].

For clinical imaging studies involving interpretations, the most economical method of collecting a reader's impression of each case is through the use of a rating scale, i.e. graded levels of confidence that the case is D+. A discrete five-point scale – 1 = "definitely not diseased", 2 = "probably not diseased", 3 = "possibly diseased", 4 = "probably diseased", 5 = "definitely diseased" – is commonly used. Getting a reader to use all of the rating categories provided yields a more stable ROC curve estimate, but is not always easy to accomplish without causing other problems [23]. Use of ratings from the continuous 0–100% confidence scale [31, 49] has several advantages: it more closely resembles readers' clinical thinking and reporting; its use of a finer scale leads to somewhat smaller standard errors of estimated indices of accuracy; and it increases the possibility that the data will allow parametric curve fitting.

Obtaining an ROC Curve and Summary Indices Derived from it

For rating scale data, the 2 (D states) × k (rating categories) frequency table of the ratings yields k − 1 empirical (TPF, FPF) ROC points. As shown in Figure 1, these are obtained from the k − 1 possible two-by-two tables formed by different re-expressions of the 2 × k data table. After TPF = 0 at FPF = 0, the lowest, leftmost ROC point is derived using the strictest cutpoint, where only the most positive category would be regarded as positive; each subsequent point towards the top right ROC corner (TPF = 1, FPF = 1) is obtained by employing successively laxer criteria for test positivity. For objective tests that yield numerical data, the same procedure – with each distinct observed numerical test value as a category boundary, and with k no longer fixed a priori but rather determined by the numbers of 'runs' of D+ and D− in the aggregated data – is used to calculate the series of empirical ROC data points.

[Figure 1: plot of TPF (vertical axis, 0 to 1) against FPF (horizontal axis, 0 to 1), showing the empirical points (2/58, 33/51), ((11+2)/58, (11+33)/51), ((6+11+2)/58, (2+11+33)/51) and ((6+6+11+2)/58, (2+2+11+33)/51), the fitted smooth curve through them, and, inset at lower right, the fitted D− and D+ distributions on the latent scale.]

Figure 1  Example of empirical ROC points and smooth curve fitted to them. The empirical points are calculated from successively more liberal definitions of test positivity applied to the 2 × 5 table (inset) of disease status (D+ or D−) and rating category (−− to ++). The smooth ROC curve is derived from the fitted binormal model (inset, lower right, with parameters a = 1.657 and b = 0.713 on a continuous latent scale) by using all possible scale values for test positivity. The fitted parameters a and b, together with the four estimated cutpoints, produce fitted frequencies of {32.9, 6.4, 5.9, 10.7, 2.1} and {3.2, 1.5, 2.1, 11.2, 32.9} for the D− and D+ rows of the 2 × 5 table. Note that a monotonic transformation of the latent axis may produce overlapping distributions with nonbinormal shapes, but will yield the same multinomial distributions and the same fitted ROC curve.

Rating category    −−    −    +/−    +    ++    Total
D−                 33    6     6    11     2      58
D+                  3    2     2    11    33      51
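The cumulation just described can be sketched directly from the 2 × 5 table of Figure 1: cumulating category counts from the most positive category downward gives the successive (FPF, TPF) pairs (function and variable names are illustrative):

```python
def empirical_roc(neg_counts, pos_counts):
    """Empirical (FPF, TPF) points from a 2 x k rating table.

    Categories are ordered from least ("--") to most ("++") suspicious
    of disease. The strictest cutpoint calls only the top category
    positive; each successively laxer cutpoint adds the next category.
    """
    n_neg, n_pos = sum(neg_counts), sum(pos_counts)
    points = [(0.0, 0.0)]
    fp = tp = 0
    for d_minus, d_plus in zip(reversed(neg_counts), reversed(pos_counts)):
        fp += d_minus                       # D- cases now called positive
        tp += d_plus                        # D+ cases now called positive
        points.append((fp / n_neg, tp / n_pos))
    return points                           # ends at (1.0, 1.0)

# 2 x 5 table from Figure 1: counts for categories -- to ++
neg = [33, 6, 6, 11, 2]    # D- row, total 58
pos = [3, 2, 2, 11, 33]    # D+ row, total 51
pts = empirical_roc(neg, pos)
# strictest cutpoint gives (2/58, 33/51), i.e. roughly (0.034, 0.647)
```

With k = 5 categories this yields the k − 1 = 4 interior points of Figure 1, plus the fixed corners (0, 0) and (1, 1).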
The sequence of points can then be joined to form the empirical ROC curve, or a smooth curve can be fitted.

As a summary measure of accuracy, one can use: (i) TPF[FPF], the TPF corresponding to a single selected FPF; (ii) the area under the ROC curve; or (iii) the area under a selected portion of the curve, often called the partial area. Summary (i) is readily understood and most clinically pertinent. However, reported TPFs are often in reference to different FPF values, and it may be unclear whether a reference FPF was chosen in advance or after inspection of the curve. Moreover, its statistical reliability tends to be lower than that of other summary indices.

Summary (ii) has been recommended as an alternative [52]. It has an interpretation in signal detection theory as the proportion of correct choices in a two-alternative forced choice experiment [21], i.e. an experiment where in each trial the subject is presented with a pair of stimuli, one from a randomly chosen D+ and one from a randomly chosen D−, and is asked to decide which derives from which. This method of reporting judgments is common in psychophysics and has statistical advantages when using synthetic images, where observer time is the limiting factor [8].

To some, the area index has a serious limitation. Since a large part of the area comes from the rightmost part of the curve, it includes FPFs of no clinical relevance, and so can be insensitive when used to compare the performance of two tests. One curve may have higher TPFs than another in the region of relevant FPFs, but the curves could conceivably cross. Since the area under the entire curve averages the sensitivity over the full (0, 1) range of FPFs, any superiority in the relevant FPF region may be lost, or even reversed, when the curves are ranked on the basis of the entire area.
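The forced-choice interpretation of summary (ii) can be checked numerically on the Figure 1 table: scoring every (D−, D+) pair of cases as 1 when the D+ case received the higher rating, and 1/2 when the ratings tie, reproduces the trapezoidal area under the empirical curve. A sketch of this Mann-Whitney-style calculation (illustrative code, not the article's software):

```python
def auc_mann_whitney(neg_counts, pos_counts):
    """Area under the empirical ROC curve via its two-alternative
    forced-choice interpretation: the probability that a randomly
    chosen D+ case outranks a randomly chosen D- case, ties half."""
    u = 0.0
    for i, n in enumerate(neg_counts):
        for j, m in enumerate(pos_counts):
            if j > i:
                u += n * m          # D+ rated higher: correct choice
            elif j == i:
                u += 0.5 * n * m    # tied ratings: credit one half
    return u / (sum(neg_counts) * sum(pos_counts))

neg = [33, 6, 6, 11, 2]    # D- row of the Figure 1 table
pos = [3, 2, 2, 11, 33]    # D+ row
auc = auc_mann_whitney(neg, pos)    # about 0.893
```

The same value is obtained by applying the trapezoidal rule to the empirical (FPF, TPF) points, which is why the U-statistic methods cited below for the area index are "essentially equivalent" to the general one.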
The average sensitivity (TPF) over a range of relevant FPFs, summary index (iii) [34, 53, 60], is a compromise between (i) and (ii).

All three indices can be calculated either from the empirical curve or from a (parametrically fitted) smooth curve. The statistical precision of nonparametric estimates can be calculated using a general method applicable to all three indices [60]; in the case of (ii), other essentially equivalent but less cumbersome methods, based on U-statistics, are also available [11, 28]. The case of (i) is more subtle than most realize: the standard error of an estimated TPF must include, in addition to its own obvious binomial variation, the uncertainty associated with determining the position of the FPF point [32].

A smooth ROC curve can be fitted to rating scale data by fitting two overlapping distributions on a continuous but "latent" scale underlying the results for D− and D+ cases [12, 36]. In the most commonly used, "binormal", model, the two distributions are taken to be, without loss of generality, N(0, 1) for D− and N(µ, σ) for D+. The distributions of the ratings are thus multinomial, with expectations that are functions of the k − 1 cutpoints and the two parameters µ and σ, allowing the k + 1 parameters (two relevant and k − 1 nuisance) to be fitted to the observed data table (2k − 2 degrees of freedom in total, leaving k − 3 degrees of freedom to test fit) using the criterion of maximum likelihood. Small additions to empty cells can be used to avoid "degenerate" situations [13]. The extent to which the normal deviate (see Normal Scores) transformations (z[TPF], z[FPF]) of the empirical (TPF, FPF) pairs are linear provides a visual test of the fit, since under the binormal model their expectations satisfy

    z(TPF) = (1/σ) z(FPF) − µ/σ = b z(FPF) − a.

When b = 1, the curve in (TPF, FPF) space is symmetric about the negative diagonal, while b < 1 produces a curve which rises more steeply at first and "flattens out" at the end.

Since the various summary indices derived from the fitted curve are functions of the estimates of a and b, their statistical precision – used in tests (see Hypothesis Testing) and confidence intervals – can be calculated from the corresponding variances and covariances provided by the maximum likelihood procedure. Confidence intervals can also be calculated for the entire curve [33].

For rating scale data, the popularity of the binormal model over bilogistic [22, 47] or other competitors [16] is more historical than theoretical. Use of a binormal model for rating data does not imply that, if one could observe the latent distributions, they would have this exact form [38]. Rather, the working assumption is that the two overlapping multinomial distributions can be mathematically predicted from the discretization of two normal distributions on some unspecified latent scale. Whereas any two overlapping distributions will uniquely determine a specific ROC curve, the reverse is not true: the "binormal" assumption concerns only the functional form of the ROC curve, which can always be examined empirically, and not the form of the underlying distributions themselves, which cannot be determined in many applications of ROC analysis [38]. Use of a small number of rating categories, with few degrees of freedom to distinguish the fit of one specific form over another, leaves considerable freedom to fit different distributional forms. This freedom is not a function of sample sizes (numbers of cases) but of the number of rating categories [25].

More important than the choice of distributional family seems to be the need to allow for unequal variances (b ≠ 1). Empirically, b tends to be less than 1 [51], possibly because of the presence of unidentified subtypes in the D+ sample. Thus, whereas one-parameter models, with b = 1, would have practical advantages – particularly for meta-analyses and for fitting an entire (but symmetric) ROC curve to a single empirical (TPF, FPF) data point – they are not supported by empirical findings.
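Given fitted values of a and b, the binormal curve can be traced pointwise. A stdlib-only sketch: flipping both tails in the displayed relation z(TPF) = b z(FPF) − a (which uses upper-tail deviates) gives the equivalent lower-tail form TPF = Φ(a + b Φ⁻¹(FPF)). The parameter values are those quoted in the Figure 1 caption; the closed-form area Φ(a/√(1 + b²)) is a standard consequence of the binormal model, not stated explicitly in the text:

```python
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution

def binormal_tpf(fpf, a, b):
    """TPF on the fitted binormal ROC curve at a given FPF.

    Equivalent to z(TPF) = b*z(FPF) - a with upper-tail normal
    deviates z(p); with lower-tail deviates both signs flip, giving
    TPF = Phi(a + b * Phi^{-1}(FPF)).
    """
    return PHI.cdf(a + b * PHI.inv_cdf(fpf))

def binormal_auc(a, b):
    """Area under the full binormal curve: Phi(a / sqrt(1 + b^2))."""
    return PHI.cdf(a / (1 + b * b) ** 0.5)

a, b = 1.657, 0.713           # fitted values from Figure 1
tpf_at_10 = binormal_tpf(0.10, a, b)   # TPF when FPF = 0.10, about 0.77
auc = binormal_auc(a, b)               # about 0.91
```

Because b < 1 here, the curve rises steeply at first and flattens towards (1, 1), consistent with the empirical finding discussed below.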
Care must be taken in the fitting of parametric curves to results recorded on a numerical scale: directly fitting N(µ1, σ1) for D− and N(µ2, σ2) for D+ can yield severe distortions when the data do not arise from normal distributions [19]. The method in which the raw data are first categorized and the categorized data analyzed as if they were rating data [41] – with the assumption of overlapping normal distributions on an unspecified transform of the actual measurement scale – is a much more robust approach.

Comparison of ROC Curves

Because of the need to account for large differences in case difficulty, compared curves are usually based on the same set of cases, and one must therefore take the correlation of the estimated curves – and the summaries derived from them – into account when calculating the standard error of differences.

Parametric methods for comparing two curves are based on the estimates of the binormal (or bilogistic, or other model) parameters (a, b) associated with each curve, and their variances and covariances [34, 42]. The equality of two curves can be assessed by testing the equality of the two vectors (a1, b1) and (a2, b2). Comparisons involving a summary index are made by computing, for each curve, the appropriate function of the parameter estimates, then using the delta method to calculate the standard error of the difference in indices.

A nonparametric method is now available to compare two curves based on continuous data from the same set of cases [59]. A criterion for positivity that is common to the two tests is induced using the ranks in the combined D+ and D− data for each test. Using this calibration, one first computes the difference in the numbers of errors made by the two tests at each possible level of test positivity, and then calculates the average of the absolute differences over the different levels. The test statistic is referred to the permutation distribution obtained by randomly interchanging pairs of ranks. Nonparametric comparisons of the areas under two curves are based on correlated U-statistics [11] or, equivalently, on the jackknife method [27], while partial areas and – ultimately – sensitivity at a single specificity value can be compared with a more general method [60].

Guidelines for sample size determination and power calculations are available for both parametric [software program ROCPWR from Charles Metz at the University of Chicago, or the article by Obuchowski & McClish [43]] and nonparametric approaches [29].

Comparison of Accuracy of Imaging Procedures

The methods just described deal only with simple comparisons of two tests that yield objective numerical results, and are not sufficient for imaging studies (see Image Analysis and Tomography), which produce interpretations of each case by multiple readers. For example, the performance with conventional versus laser-printed films might be studied by having several readers interpret each image; a comparison of computed tomography, magnetic resonance and ultrasound images might involve different readers for each modality [46].

Several refinements and some alternatives to the method initially proposed for dealing with these more complex comparisons have been suggested. The challenge is to include and estimate properly each of the several relevant components of variance and covariance, since the comparison is necessarily an average over cases, readers and (possibly) rereadings [40, 52]. Two methods [15, 44] deal with the problem by modeling the variation in the summary index in question, while another [54] models the raw rating data responses. From the investigations thus far [14, 55], both modeling approaches appear to give comparable answers, but commentators [35, 48] have called for further work to investigate the performance of analysis strategies that use statistical tests to decide what is the appropriate error term and denominator degrees of freedom when reader × modality interactions are involved.

When studies of imaging procedures involve multiple centers, complex procedures, ethical concerns, "real-time" readings, and different experts in the different imaging modalities [18], the data can quickly become imbalanced and/or incomplete. Analysis problems are thus aggravated by the subjective (and thus possibly nonpoolable) nature of the ratings, the often large numbers of case–reader sets, each containing too few observations to allow parametric fitting of separate ROC curves, and the fact that, unlike the usual response measures in clinical trials, the elemental ROC data are not absolute numbers that can be easily averaged or displayed individually in a descriptive way. If all available data are to be used, the only logical approach is to use regression methods. Since the first regression work in this area [57], there has been considerable activity in developing parsimonious approaches to the problem, including random effects models [2], Gibbs sampling (see Markov Chain Monte Carlo) [17], and generalized estimating equations [56, 61].

In some situations, the study material may involve more than one region of interest on an image. Sample reuse methods can help to calculate the precision with which statistical contrasts are made [1, 10, 24, 45] (see Bootstrap Method).

Software

Programs for parametric estimation are available from the WWW location www-radiology.uchicago.edu/cgi-bin/software.cgi maintained by the developer, Charles Metz. Special-purpose programs for nonparametric inference are more numerous, but most of the tasks can be accomplished using a spreadsheet [27]. Software for the multireader, multicase approach to imaging data is available from the authors of ref. [15]; software for the other approaches is still evolving, and interested users should contact the various authors cited above.

Future Developments

Whereas methods for comparing ROC curves associated with tests that yield objective test results have now become routine, solutions to the complex analytic problems involved in the comprehensive comparison of accuracy of imaging procedures have not been completely achieved. The methods proposed in the last 5 years need further testing; the links between them need to be better understood; and user-friendly software to implement these newest approaches remains to be developed.

References

[1] Baum, R.A., Rutter, C.M., Sunshine, J.H., Blebea, J.S., Carpenter, J.P., Dickey, K.W., Quinn, S.F., Gomes, A.S., Grist, T.M. et al. (1995). Multicenter trial to evaluate vascular magnetic resonance angiography of the lower extremity. American College of Radiology Rapid Technology Assessment Group, Journal of the American Medical Association 274, 875–880.
[2] Beam, C. (1995). Random effects models in the ROC curve-based assessment of the effectiveness of diagnostic imaging technology: concepts, approaches and issues, Academic Radiology 2, Supplement 1, S4–S13.
[3] Begg, C.B. (1986). Statistical methods in medical diagnosis, Critical Reviews in Medical Informatics 1, 1–22.
[4] Begg, C.B. (1987). Biases in the assessment of diagnostic tests, Statistics in Medicine 6, 411–424.
[5] Begg, C.B. (1991). Advances in statistical methodology for diagnostic medicine in the 1980s, Statistics in Medicine 10, 1887–1895.
[6] Begg, C.B. & Greenes, R.A. (1983). Assessment of diagnostic tests when disease verification is subject to verification bias, Biometrics 39, 207–215.
[7] Begg, C.B. & McNeil, B.J. (1988). Assessment of radiologic tests: control of bias and other design considerations, Radiology 167, 565–569.
[8] Burgess, A.E. (1995). Comparison of receiver operating characteristic and forced choice observer performance measurement methods, Medical Physics 22, 643–655.
[9] Campbell, G. (1994). Advances in statistical methodology for the evaluation of diagnostic and laboratory tests, Statistics in Medicine 13, 499–508.
[10] Dagirmanjian, A., Ross, J.S., Obuchowski, N., Lewin, J.S., Tkach, J.A., Ruggieri, P.M. & Masaryk, T.J. (1995). High resolution, magnetization transfer saturation, variable flip angle, time-of-flight MRA in the detection of intracranial vascular stenoses, Journal of Computer Assisted Tomography 19, 700–706.
[11] DeLong, E.R., DeLong, D.M. & Clarke-Pearson, D.L. (1988). Comparing the areas under two or more correlated receiver-operating characteristic curves: a nonparametric approach, Biometrics 44, 837–845.
[12] Dorfman, D.D. & Alf, E. (1969). Maximum likelihood estimation of parameters of signal detection theory and determination of confidence intervals – rating method data, Journal of Mathematical Psychology 6, 487–496.
[13] Dorfman, D.D. & Berbaum, K.S. (1995). Degeneracy and discrete receiver operating characteristic rating data, Academic Radiology 2, 907–915.
[14] Dorfman, D.D. & Metz, C.E. (1995). Rejoinder in Symposium on Advances in Statistical Methods for Diagnostic Radiology, Academic Radiology 2, Supplement 1, S76–S78.
[15] Dorfman, D.D., Berbaum, K.S. & Metz, C.E. (1992). Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method, Investigative Radiology 27, 723–731.
[16] Egan, J.P. (1975). Signal Detection Theory and ROC Analysis. Academic Press, New York.
[17] Gatsonis, C. (1995). Random-effects models for diagnostic accuracy data, Academic Radiology 2, Supplement 1, S14–S21.
[18] Gatsonis, C. & McNeil, B.J. (1990). Collaborative evaluation of diagnostic tests: experience of the Radiologic Diagnostic Oncology Group, Radiology 175, 571–575.
[19] Goddard, M.J. & Hinberg, I. (1990). Receiver operating characteristic (ROC) curves and non-normal data: an empirical study, Statistics in Medicine 9, 325–337.
[20] Gray, R. & Begg, C.B. (1984). Construction of receiver operating characteristic curves when disease verification is subject to selection bias, Medical Decision Making 4, 151–164.
[21] Green, D.M. & Swets, J. (1966). Signal Detection Theory and Psychophysics. Wiley, New York.
[22] Grey, D.R. & Morgan, B.J.T. (1972). Some aspects of ROC curve fitting: normal and logistic models, Journal of Mathematical Psychology 9, 128–139.
[23] Gur, D., Rockette, H.E., Good, W.F., Slasky, B.S., Cooperstein, L.A., Straub, W.H., Obuchowski, N.A. & Metz, C.E. (1990). Effect of observer instruction on ROC study of chest images, Investigative Radiology 25, 230–234.
[24] Hajian-Tilaki, K.O., Hanley, J.A., Joseph, L. & Collet, J.P. (1997). Extension of receiver operating characteristic analysis to data concerning multiple signal detection tasks, Academic Radiology 4, 222–229.
[25] Hanley, J.A. (1988). The robustness of the binormal model used to fit ROC curves, Medical Decision Making 8, 197–203.
[26] Hanley, J.A. (1989). Receiver operating characteristic (ROC) methodology: the state of the art, Critical Reviews in Diagnostic Imaging 29, 307–335.
[27] Hanley, J.A. & Hajian-Tilaki, K.O. (1997). Sampling variability of nonparametric estimates of the areas under receiver operating characteristic curves: an update, Academic Radiology 4, 49–58.
[28] Hanley, J.A. & McNeil, B.J. (1982). The meaning and use of the area under an ROC curve, Radiology 143, 29–36.
[29] Hanley, J.A. & McNeil, B.J. (1983). A method of comparing the areas under receiver operating characteristic curves derived from the same set of cases, Radiology 148, 839–843.
[30] Hunink, M.G., Richardson, D.K., Doubilet, P.M. & Begg, C.B. (1990). Testing for fetal pulmonary maturity: ROC analysis involving covariates, verification bias, and combination testing, Medical Decision Making 10, 201–211.
[31] King, J.L., Britton, C.A., Gur, D., Rockette, H.E. & Davis, P.L. (1993). On the validity of the continuous and discrete confidence rating scales in receiver operating characteristic studies, Investigative Radiology 28, 962–963.
[32] Linnet, K. (1987). Comparison of quantitative diagnostic tests: type I error, power and sample size, Statistics in Medicine 6, 147–158.
[33] Ma, G. & Hall, W.J. (1993). Confidence bands for receiver operating characteristic curves, Medical Decision Making 13, 191–197.
[34] McClish, D.K. (1989). Analyzing a portion of the ROC curve, Medical Decision Making 9, 190–195.
[35] McClish, D.K. (1995). Invited discussion in Symposium on Advances in Statistical Methods for Diagnostic Radiology, Academic Radiology 2, Supplement 1, S61–S64.
[36] McCullagh, P. (1980). Regression models for ordinal data (with discussion), Journal of the Royal Statistical Society, Series B 42, 109–142.
[37] Metz, C.E. (1978). Basic principles of ROC analysis, Seminars in Nuclear Medicine 8, 283–298.
[38] Metz, C.E. (1986). ROC methodology in radiological imaging, Investigative Radiology 21, 720–733.
[39] Metz, C.E. (1989). Some practical issues of experimental design and data analysis in radiological ROC studies, Investigative Radiology 24, 234–245.
[40] Metz, C.E. & Shen, J.H. (1992). Gains in accuracy from replicated readings of diagnostic images: prediction and assessment in terms of ROC analysis, Medical Decision Making 12, 60–75.
[41] Metz, C.E., Shen, J.H. & Herman, B.A. (1990). New methods for estimating a binormal ROC curve from continuously distributed test results. Paper presented at the annual meeting of the American Statistical Association, Anaheim.
[42] Metz, C.E., Wang, P.-L. & Kronman, H.B. (1984). A new approach for testing the significance of differences between ROC curves from correlated data, in Information Processing in Medical Imaging, F. Deconinck, ed. Martinus Nijhoff, The Hague, pp. 432–445.
[43] Obuchowski, N.A. & McClish, D.K. (1997). Sample size determination for diagnostic accuracy studies involving binormal ROC curve indices, Statistics in Medicine 16, 1529–1542.
[44] Obuchowski, N.A. (1995). Multireader, multimodality receiver operating characteristic curve studies: hypothesis testing and sample size estimation using analysis of variance with dependent observations, Academic Radiology 2, Supplement 1, S22–S29.
[45] Obuchowski, N.A. (1996). Nonparametric analysis of clustered ROC data. Presentation at the Eastern North American Biometrics Meeting.
[46] Obuchowski, N.A. & Zepp, R.C. (1996). Simple steps for improving multiple-reader studies in radiology, American Journal of Roentgenology 166, 517–521.
[47] Ogilvie, J.C. & Creelman, C.D. (1968). Maximum likelihood estimation of ROC curve parameters, Journal of Mathematical Psychology 5, 377–391.
[48] Rockette, H.E. (1995). Contributed comments in Symposium on Advances in Statistical Methods for Diagnostic Radiology, Academic Radiology 2, Supplement 1, S70–S71.
[49] Rockette, H.E., Gur, D. & Metz, C.E. (1992). The use of continuous and discrete confidence judgments in receiver operating characteristic studies of diagnostic imaging techniques, Investigative Radiology 27, 169–172.
[50] Swets, J.A. (1986). Indices of discrimination or diagnostic accuracy: their ROCs and implied models, Psychological Bulletin 99, 100–117.
[51] Swets, J.A. (1986). Form of empirical ROCs in discrimination and diagnostic tasks: implications for theory and measurement of performance, Psychological Bulletin 99, 181–198.
[52] Swets, J.A. & Pickett, R.M. (1982). Evaluation of Diagnostic Systems: Methods from Signal Detection Theory. Academic Press, New York.
[53] Thompson, M.L. & Zucchini, W. (1989). On the statistical analysis of ROC curves, Statistics in Medicine 8, 1277–1290.
[54] Toledano, A. & Gatsonis, C.A. (1995). Regression analysis of correlated receiver operating characteristic data, Academic Radiology 2, Supplement 1, S30–S36.
[55] Toledano, A. & Gatsonis, C.A. (1995). Rejoinder in Symposium on Advances in Statistical Methods for Diagnostic Radiology, Academic Radiology 2, Supplement 1, S81–S82.
[56] Toledano, A.Y. & Gatsonis, C. (1996). Ordinal regression methodology for ROC curves derived from correlated data, Statistics in Medicine 15, 1807–1826.
[57] Tosteson, A.N.A. & Begg, C.B. (1988). A general regression methodology for ROC curve estimation, Medical Decision Making 8, 204–215.
[58] Turner, D.A. (1978). An intuitive approach to receiver operating characteristic curve analysis, Journal of Nuclear Medicine 19, 213–220.
[59] Venkatraman, E.S. & Begg, C.B. (1996). A distribution-free procedure for comparing receiver operating characteristic curves from a paired experiment, Biometrika 83, 835–848.
[60] Wieand, S., Gail, M.H., James, K.L. & James, B.R. (1989). A family of nonparametric statistics for comparing diagnostic tests with paired or unpaired data, Biometrika 76, 585–592.
[61] Zhou, X.H. (1996). Empirical Bayes combination of estimated areas under ROC curves using estimating equations, Medical Decision Making 16, 24–28.

(See also Diagnostic Test Evaluation Without a Gold Standard; Diagnostic Tests, Evaluation of)

J.A. Hanley
Record Linkage

At the core of all descriptive epidemiology studies lies a data set, with many variables, which has been gathered to answer a specific hypothesis. Often it is only as the project develops that the researcher realizes the potential of exploring alternate study endpoints by adding in other data about the same respondents. The tried and tested technique is for a clerk to look at the individual records, sorted in some logical order, and put the records together, applying intuitive decision rules based on human judgment. As record systems have been computerized over the past 20 years, one of the greatest impacts of increased processing power has been to facilitate linkages between related data sets, even when they do not share a unique identifier.

Three main techniques are used for record linkage; Newcombe [13] and Jamieson et al. [9] describe in detail the technical issues relating to exact matching and probability linkage (see Matching, Probabilistic). They can be summarized as follows:

1. Unique. Records are linked together where unique identifiers such as insurance number or health service number match exactly. The files of records are computer sorted into the same order, and matched together within blocks. It is a fairly simple process, but may only identify 80%–85% of true matches due to errors in recording of identifiers.

2. Fuzzy. For data sets which do not have unique identifiers, key identifiers such as surname, date of birth, sex, date of interview/treatment, and postal district are used for linkage. To cope with coding errors, fuzzy matching identifies records which are "almost" the same, such as surname spelling incorrect, or year and month of birth correct, but day wrong. Computer programs either present a choice of matches for the user to choose the best match, or have incorporated a simple scoring system and determine the best match from the score. Computer algorithms are well developed for matching on individual variables, and this technique provides 85%–90% of true matches. It requires human intervention and there may be operator bias.

3. Probability. This is the most sophisticated form of linkage, in which decision rules on records matching are programmed based on the probability of two records being from different people having the same identifier. These probabilities are aggregated to a score and checked against a threshold to determine whether a match is made. The computer system needs to be tailored for the data sets to be matched and is processor intensive, but provides linkages of 95%–99% true matches with false positive rates of 1%–2%.

The following are examples of the uses made of record linkage within healthcare systems, and demonstrate the value of this powerful technique. Many of the examples come from uses made of the Scottish Medical Record Linkage Database [7], which contains morbidity records from Scottish hospitals, and mortality records from the General Register Office (Scotland) from 1968 onwards – almost 4 million people with 12 million episodes.

Medical record linkage poses problems of data confidentiality and privacy, because the linked data are comprehensive and the techniques use personal identifying data. Most analytic studies do not require access to patient-identifiable data once the linkages have been made. For administrative data sets, strict controls need to be in place to ensure that the data are not released to individuals and used for purposes other than those registered in government legislation.

The issue of infringing civil liberties, by invasion of privacy through wrongful use of information, is currently taxing most governments. Researchers need to be aware of legislation and appropriate use of data. For example, in Scotland, access to identifiable data is controlled by medically qualified data holders, and a Privacy Advisory Committee [10] has been established, with membership drawn from senior medical officers, legal professions, and the public, to ensure that ethical approval is in place for record linkage studies.

Evidence-Based Medicine

The perception remains that descriptive epidemiology has little to contribute to the development of evidence-based medicine, with its focus on randomized clinical trials (see McPherson [12]). Probability-based record linkage techniques can make a major contribution in assessing the efficacy of treatment regimes at the macro level. For example, using exact matching on health service administration numbers, Evans et al. [4] at the Tayside Medicines Monitoring Unit are able to use case–control methodology to review the association of topical nonsteroidal anti-inflammatory drugs with hospital admission for upper gastrointestinal bleeding and perforation. The Scottish Health Service are now automating the linkage between prescribing data, patient hospitalization, and death profiles.

... on the measures and to assess the feasibility of using them to monitor the effectiveness and appropriateness of health purchasing strategies. Without a unique patient identifier, probability linkage is the key to determining readmission rates, including to other institutions, and postoperation survival after discharge. While the measures tend to be presented as inter-
This will establish a facility for post- val estimates (see Estimation, Interval), standard- marketing surveillance, in which possible adverse ized for confounding factors, such as age, sex, drug reactions can be quickly analyzed and assessed deprivation, and co-morbidities, administrative data before the public are alarmed by the media (see Phar- do not yet contain robust measures of severity of macoepidemiology, Overview). disease. As with all descriptive epidemiology tech- Another area in which linkage is being used effec- niques, the outcome measures highlight topics for tively is the follow-up of very low birthweight chil- more detailed investigation via randomized controlled dren and the impact of their improved survival on trials or clinical audit. health care costs. In California [2], probability-linked data from the California Birth Cohort and Medicaid Survival Rates claims in years after birth have been used to evaluate competing hypotheses for racial and ethnic differ- As the search continues for new, meaningful outcome ences (see Ethnic Groups) in mortality and health measures, one of the main uses of record linkage has care costs, and to assess the need for hospital ser- been analysis of survival patterns for disease, espe- vices from the improved neonatal survival of these cially for cancer (see Survival Analysis, Overview). children. Most civil registration authorities provide an exact matching service for bona fide researchers. How- ever, increased computing power has meant that this Outcome Measurement process can now be automated to include probabil- ity or “fuzzy” matching techniques, which increase Evaluating the effects of medical care is not a new the reliability of the links. 
Within the Scottish Can- idea, but it has received increased emphasis over cer Registry, we found that exact matching with the past decade because of concerns for the qual- manual techniques under-ascertained almost 5% of ity and cost of medical care (see Quality of Care). deaths [16], because the procedures were built on While most attention has been placed on determining zero tolerance of false positive rates. This resulted the effectiveness of new treatment regimes through in one study for the nuclear industry showing a randomized control trials, the inclusion criterion for “healthy worker” effect (see Occupational Epidemi- patients can be so selective (see Eligibility and ology), until another three deaths were determined by Exclusion Criteria) that the true efficacy of the automated probability linkage among the cohort of treatment can only be assessed when it comes into employees. general usage. Application of record linkage tech- The availability of population-based data in spe- niques using administrative databases for follow-up cific diseases registers, such as cancer, diabetes, of cohorts of patients with specific disease patterns, and renal failure, with linkage to death registrations or procedures, permits analysis of outcomes mea- enables the development of survival tables [16] (see sures which would otherwise be prohibitively expen- Life Table), which are of use not only to the profes- sive. sional dealing with individual patients but also to the The Clinical Resource and Audit Group of the patients and their carers. Society is becoming more Scottish Office Department of Health have pioneered attuned to the concepts of risk, and one of the most the publication of routine clinical outcome measures common questions asked when life-threatening dis- in the UK since 1993. The three reports [17] to date ease is diagnosed is “What is my chance of surviving have been produced following detailed consultation 1 year, 5 years, or 10 years?”. 
Insurance compa- with health service professionals, to gain consensus nies are very interested in improved estimates of
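The score-and-threshold logic behind the probability technique described earlier can be sketched in a few lines of code. This is an illustration only, not a procedure from the article: the identifier fields, the m- and u-probabilities (the chance a field agrees on a true match versus on a pair of records from different people, in the Fellegi–Sunter tradition of probabilistic linkage), and the decision threshold are all hypothetical values chosen for the example.

```python
import math

# Hypothetical (m, u) probabilities per identifier field:
# m = P(field agrees | records belong to the same person)
# u = P(field agrees | records belong to different people)
FIELDS = {
    "surname":         (0.95, 0.01),
    "date_of_birth":   (0.97, 0.003),
    "postal_district": (0.90, 0.05),
}

def match_score(rec_a, rec_b):
    """Aggregate log2 agreement/disagreement weights over the fields."""
    score = 0.0
    for field, (m, u) in FIELDS.items():
        if rec_a[field] == rec_b[field]:
            score += math.log2(m / u)              # agreement: positive weight
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement: negative weight
    return score

def is_match(rec_a, rec_b, threshold=8.0):
    """Declare a link when the aggregated score clears the threshold."""
    return match_score(rec_a, rec_b) >= threshold

a = {"surname": "MACDONALD", "date_of_birth": "1951-03-07", "postal_district": "EH4"}
b = {"surname": "MACDONALD", "date_of_birth": "1951-03-07", "postal_district": "G12"}
# Agreement on surname and date of birth outweighs the differing postal
# district, so this pair is accepted as a link.
print(match_score(a, b), is_match(a, b))
```

In practice the weights are estimated from the data sets being linked, candidate pairs are first restricted by blocking, and borderline scores are referred for clerical review.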