
Estimating and Comparing Diagnostic Tests’ Accuracy When the Gold Standard Is Not Binary¹

Nancy A. Obuchowski, PhD

Rationale and Objectives. Investigators often need to assess the accuracies of diagnostic tests when the gold standard is not binary-scale. The objective of this article is to describe nonparametric estimators of diagnostic test accuracy when the gold standard is continuous, ordinal, and nominal scale.

Materials and Methods. A nonparametric method of estimating and comparing the area under receiver operating characteristic (ROC) curves, proposed by DeLong et al, is extended to situations in which the gold standard is not binary. Two examples illustrate the methods.

Results. Measures of diagnostic test accuracy, their variance, and tests for comparing two diagnostic tests’ accuracies in paired designs are presented for situations in which the gold standard is continuous, ordinal, and nominal scale. These summary measures of diagnostic test accuracy are analogous in form and interpretation to the area under the ROC curve.

Conclusion. Dichotomizing the outcomes of a gold standard so that traditional ROC methods can be applied can lead to bias. The methods described here are useful for assessing and comparing summary test accuracy when the gold standard is not binary scale. They have limitations similar to other summary indices.

Key Words. Diagnostic accuracy; gold standard; nonparametric statistics; receiver operating characteristic curve; ROC analysis.

© AUR, 2005

Acad Radiol 2005; 12:1198–1204 (Academic Radiology, Vol 12, No 9, September 2005)
¹ From the Department of Quantitative Health Sciences/Wb4, The Cleveland Clinic Foundation, 9500 Euclid Ave, Cleveland, OH 44195. Received March 3, 2005; revision received and accepted May 17, 2005. Address correspondence to: N.A.O. e-mail: [email protected]
doi:10.1016/j.acra.2005.05.013

Receiver operating characteristic (ROC) analysis has become the standard method for characterizing a diagnostic test’s accuracy and for comparing accuracies of competing diagnostic tests (1–4). In traditional ROC analysis, there exists a gold or reference standard, independent of the diagnostic test(s), which provides a binary result as to the presence or absence of disease. Patients undergo both the gold standard procedure and the diagnostic test(s). The diagnostic test results are then compared with the results of the binary-scale gold standard to estimate accuracy (ie, sensitivity and specificity) at various cutpoints of the diagnostic test’s results.

Some examples of binary-scale gold standards are surgery for determining whether or not a cancer has spread to the lymph nodes, invasive catheter angiography for determining if a significant stenosis is present or absent, and biopsy for determining if a lesion is benign or malignant. There is a vast literature on parametric, nonparametric, and semiparametric statistical methods for estimating and comparing ROC curves and their associated indices when the gold standard is binary scale (5–12).

The application of ROC curves has expanded to many fields. Even in the diagnostic radiology setting, there are many applications in which we want to characterize and compare the accuracies of diagnostic tests, but the gold standard is not binary scale. One approach might be to construct a binary scale from the nominal-, ordinal- or continuous-scale gold standard so that traditional ROC methods can be used. This approach, however, often dilutes the measure of accuracy, can conceal important relationships between the gold standard and diagnostic test, and can lead to biased estimates of accuracy (13,14).

In this article, we describe nonparametric methods for estimating and comparing the accuracies of diagnostic tests when the gold standard is not binary scale. Accuracy is defined using indices analogous to the traditional area under the ROC curve.

We illustrate the methods using two examples. In the first example, magnetic resonance imaging (MRI) is used to score regions of the heart according to the amount of visible scarring. The amount of scar is compared to the results of the gold standard, positron emission tomography (PET), where each region was read as normal, hibernating, ischemic, or necrotic (ordinal-scale gold standard). In the second example, computed tomography (CT) is used to measure the size of renal tumors. Patients also underwent surgery during which the renal tumor was measured. The surgical measurement is considered the gold standard (continuous-scale gold standard).

ESTIMATING AND COMPARING DIAGNOSTIC TEST ACCURACY

We first review the nonparametric estimation of diagnostic test accuracy when the gold standard is binary. We use the methods of DeLong et al (6) to estimate the variance and covariance of the estimated accuracy because we can easily extend the method to situations where the gold standard is not binary-scale. Table 1 summarizes the various related measures of accuracy described in this article.

Table 1
Comparison of Summary Measures of Diagnostic Test Accuracy

| Gold Standard | Estimator of Accuracy |
|---|---|
| Binary | $\hat{\theta} = \frac{1}{n_t n_s} \sum_{i=1}^{n_t} \sum_{j=1}^{n_s} \Psi(X_{it}, X_{js})$, where $\Psi = 1$ if $X_{it} > X_{js}$; $\Psi = 0.5$ if $X_{it} = X_{js}$; $\Psi = 0$ if $X_{it} < X_{js}$. |
| Continuous | $\hat{\theta} = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1}^{N} \Psi(X_{it}, X_{js})$, where $\Psi = 1$ if $t > s$ and $X_{it} > X_{js}$, or $s > t$ and $X_{js} > X_{it}$; $\Psi = 0.5$ if $t = s$ or $X_{js} = X_{it}$; $\Psi = 0$ otherwise. |
| Ordinal | $\hat{\theta} = 1.0 - \sum_{t=1}^{T} \sum_{s>t}^{T} w_{ts} \cdot L(t,s) \cdot (1 - \hat{\theta}_{ts})$, where $\hat{\theta}_{ts}$ is the binary-scale estimator. |
| Nominal | Same as ordinal, where $\hat{\theta}_{ts} = \frac{1}{n_t \cdot n_s} \sum_{j=1}^{n_t} \sum_{k=1}^{n_s} \Psi(D(t-s)_{tj}, D(t-s)_{sk})$, with $\Psi = 1$ if $D(t-s)_{tj} > D(t-s)_{sk}$; $\Psi = 1/2$ if $D(t-s)_{tj} = D(t-s)_{sk}$; $\Psi = 0$ if $D(t-s)_{tj} < D(t-s)_{sk}$. |

Binary-Scale Gold Standard

Let $X_{it}$ denote the result of the diagnostic test for the ith patient with disease status t (as determined by the gold standard), and let $X_{js}$ denote the result of the diagnostic test for the jth patient with disease status s. We can think of status t as being “disease present” and status s as “disease absent.” X can be a rating-scale confidence score assigned by a reader according to the perceived likelihood that disease is present (eg, 1 = “definitely no disease,” 2 = “probably no disease,” 3 = “possibly disease,” 4 = “probably disease,” or 5 = “definitely disease”), a percent confidence score assigned by a reader (ie, a value between 0 and 100, where 0% means no confidence in the presence of disease and 100% is complete confidence in the presence of disease), or an objective measurement (eg, attenuation value, serum iron concentration).

A nonparametric estimator of diagnostic test accuracy is

$$\hat{\theta} = \frac{1}{n_t n_s} \sum_{i=1}^{n_t} \sum_{j=1}^{n_s} \Psi(X_{it}, X_{js}) \tag{1}$$

where $n_t$ and $n_s$ are the number of patients with disease status t and s in the study sample, respectively. We assume that higher diagnostic test results are indicative of greater suspicion for disease status t; then the kernel is defined as

$$\Psi = \begin{cases} 1 & \text{if } X_{it} > X_{js} \\ 0.5 & \text{if } X_{it} = X_{js} \\ 0 & \text{if } X_{it} < X_{js} \end{cases}$$

$\hat{\theta}$ is an estimator of the area under the ROC curve (5,6). It is interpreted as: of two randomly chosen patients, one with disease and one without, $\theta$ is the probability that the patient with disease will have a higher result on the diagnostic test than the patient without disease (5).

DeLong et al (6) used structural components to estimate the variance of $\hat{\theta}$ and the covariance between the accuracies of two diagnostic tests. Details are given in the appendix.

Continuous-Scale Gold Standard

Suppose now that the gold standard is continuous scale. For example, tumor diameter in centimeters is measured noninvasively with CT (the diagnostic test) and compared with the diameter measured at surgery (the continuous-scale gold standard). Here, $X_{it}$ denotes the result of the diagnostic test for the ith patient who has a continuous-scale gold-standard outcome of t. As with a binary-scale gold standard, X can be a subjective confidence score assigned by a reader or an objective measurement.

[...] higher gold standard outcome (eg, larger renal tumor) has a higher diagnostic test result than the patient with a lower gold standard outcome (eg, smaller tumor).

The estimator of the variance of $\hat{\theta}$ is

$$\widehat{\mathrm{Var}}(\hat{\theta}) = \frac{1}{(N/2)\,((N/2) - 1)} \sum_{i=1}^{N} \left[ V(X_{it}) - \hat{\theta} \right]^2 \tag{3}$$

where the structural components are now defined as

$$V(X_{it}) = \frac{1}{N - 1} \sum_{j=1}^{N} \Psi(X_{it}, X_{js}) \quad (i \neq j) \tag{4}$$

Note that the variance estimator in Eq 3 is similar to the variance estimator for binary-scale gold standards (see Eq A2), but the estimator in Eq A2 has two sums of squares terms, one for diseased and one for nondiseased patients.

To compare two diagnostic tests’ accuracies, $\hat{\theta}^1$ and $\hat{\theta}^2$, we use the test statistic given in Eq A6, where the estimator of the covariance is

$$\widehat{\mathrm{Cov}}(\hat{\theta}^1, \hat{\theta}^2) = \frac{1}{(N/2)\,((N/2) - 1)} \sum_{i=1}^{N} \left[ V(X_{it}^1) - \hat{\theta}^1 \right] \left[ V(X_{it}^2) - \hat{\theta}^2 \right] \tag{5}$$

Ordinal-Scale Gold Standard

With an ordinal-scale gold standard, we assume that
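
As an illustration only, the binary-scale estimator (Eq 1), the continuous-scale estimator from Table 1, and the variance estimator of Eq 3 and 4 can be sketched in Python. This is not code from the article. The function names and the sample values in the usage note are hypothetical, and the sketch assumes that the double sum for the continuous-scale estimator excludes pairs with i = j, as the condition attached to Eq 4 suggests.

```python
from itertools import product


def kernel_binary(x_t, x_s):
    """Kernel of Eq 1: 1 if the status-t result is higher, 0.5 if tied, 0 otherwise."""
    if x_t > x_s:
        return 1.0
    if x_t == x_s:
        return 0.5
    return 0.0


def theta_binary(x_t, x_s):
    """Eq 1: average the kernel over all (status t, status s) patient pairs."""
    total = sum(kernel_binary(a, b) for a, b in product(x_t, x_s))
    return total / (len(x_t) * len(x_s))


def kernel_continuous(x_i, t_i, x_j, t_j):
    """Table 1, continuous row: 1 if the pair is concordant,
    0.5 if tied on the gold standard or on the test, 0 otherwise."""
    if t_i == t_j or x_i == x_j:
        return 0.5
    concordant = (t_i > t_j and x_i > x_j) or (t_j > t_i and x_j > x_i)
    return 1.0 if concordant else 0.0


def structural_components(x, gold):
    """Eq 4: mean kernel value of patient i against every other patient j != i."""
    n = len(x)
    return [
        sum(kernel_continuous(x[i], gold[i], x[j], gold[j])
            for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def theta_continuous(x, gold):
    """Return (theta_hat, var_hat) for a continuous-scale gold standard.

    theta_hat is the mean of the structural components, which equals the
    double sum over i != j divided by N(N - 1) (assumption stated above).
    var_hat follows Eq 3 and needs N >= 3 so the denominator is nonzero.
    """
    n = len(x)
    v = structural_components(x, gold)
    theta = sum(v) / n
    half = n / 2
    var = sum((v_i - theta) ** 2 for v_i in v) / (half * (half - 1))
    return theta, var
```

For example, with hypothetical test results `x = [2, 1, 3, 4]` and gold standard values `gold = [1, 2, 3, 4]`, only the first two patients form a discordant pair, so `theta_continuous(x, gold)` gives an accuracy of 5/6 with an estimated variance of 1/18.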
