Confidence Intervals for the Area Under the Receiver Operating Characteristic Curve in the Presence of Ignorable Missing Data
Total Page:16
File Type:pdf, Size:1020Kb
Loyola University Chicago Loyola eCommons Mathematics and Statistics: Faculty Faculty Publications and Other Works by Publications and Other Works Department 4-2019 Confidence Intervals for the Area Under the Receiver Operating Characteristic Curve in the Presence of Ignorable Missing Data Hunyong Cho University of North Carolina at Chapel Hill Gregory J. Matthews Loyola University Chicago, [email protected] Ofer Harel University of Connecticut Follow this and additional works at: https://ecommons.luc.edu/math_facpubs Part of the Mathematics Commons Author Manuscript This is a pre-publication author manuscript of the final, published article. Recommended Citation Cho, Hunyong; Matthews, Gregory J.; and Harel, Ofer. Confidence Intervals for the Area Under the Receiver Operating Characteristic Curve in the Presence of Ignorable Missing Data. International Statistical Review, 87, 1: 152-177, 2019. Retrieved from Loyola eCommons, Mathematics and Statistics: Faculty Publications and Other Works, http://dx.doi.org/10.1111/insr.12277 This Article is brought to you for free and open access by the Faculty Publications and Other Works by Department at Loyola eCommons. It has been accepted for inclusion in Mathematics and Statistics: Faculty Publications and Other Works by an authorized administrator of Loyola eCommons. For more information, please contact [email protected]. This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License. © The Authors, 2019. HHS Public Access Author manuscript Author ManuscriptAuthor Manuscript Author Int Stat Manuscript Author Rev. Author manuscript; Manuscript Author available in PMC 2019 April 18. Published in final edited form as: Int Stat Rev. 2019 April ; 87(1): 152–177. doi:10.1111/insr.12277. Confidence Intervals for the Area Under the Receiver Operating Characteristic Curve in the Presence of Ignorable Missing Data Hunyong Cho1, Gregory J. Matthews2, and Ofer Harel3 1Department of Biostatistics, University of North Carolina at Chapel Hill, 135 Dauer Drive, Chapel Hill, NC 27599, USA 2Department of Mathematics and Statistics, Loyola University Chicago, 1032 W. Sheridan Road, Chicago, IL 60660, USA 3Department of Statistics, University of Connecticut, 215 Glenbrook Road U-4120, Storrs, CT 06269, USA Summary Receiver operating characteristic curves are widely used as a measure of accuracy of diagnostic tests and can be summarised using the area under the receiver operating characteristic curve (AUC). Often, it is useful to construct a confidence interval for the AUC; however, because there are a number of different proposed methods to measure variance of the AUC, there are thus many different resulting methods for constructing these intervals. In this article, we compare different methods of constructing Wald-type confidence interval in the presence of missing data where the missingness mechanism is ignorable. We find that constructing confidence intervals using multiple imputation based on logistic regression gives the most robust coverage probability and the choice of confidence interval method is less important. However, when missingness rate is less severe (e.g. less than 70%), we recommend using Newcombe’s Wald method for constructing confidence intervals along with multiple imputation using predictive mean matching. Keywords ROC curve; AUC; Mann–Whitney statistic; confidence interval; multiple imputation; predictive mean matching; missing data; logistic regression 1 Introduction Diagnostic tests are often used to classify subjects as either diseased or non-diseased. Assuming that large values of test results are indicative of diseased status, a subject will be classified as diseased or non-diseased if their test results are above or below a certain threshold, respectively. Then, for a specific threshold value, the performance of the test can be evaluated using measures such as sensitivity and specificity, where sensitivity is the probability of a true positive or true positive rate (TPR) (i.e. given a positive test result, the Supporting Information Additional supporting information may be found online in the Supporting Information section at the end of the article. Cho et al. Page 2 subject actually has the disease) and specificity is the probability of a true negative or true Author ManuscriptAuthor Manuscript Author Manuscript Author Manuscript Author negative rate (TNR) (i.e. given a negative test results, the subject is actually non-diseased). A more comprehensive wayto measure the accuracy of a diagnostic test is to use the receiver operating characteristic (ROC) curve (Pepe, 2003), which graphically compares specificity and sensitivity levels; specifically, the ROC curve is a plot of sensitivity against 1 – specificity across all possible cut-off values. When comparing two diagnostic tests with one another, if the ROC curve for one test is located uniformly higher than the other test across all possible thresholds, the test is said to be strictly more accurate and therefore preferred over the other test. It is, however, not straight forward to compare two ROC curves that intersect at some location other than the end points. One common way to summarise the ROC curve is to compute the area under the ROC curve (AUC). The AUC ranges in value from 0.5 (essentially a meaningless test) to 1 (a perfect test) and can be interpreted as the probability that a randomly chosen diseased subject has a higher test value than that of a randomly chosen healthy subject (given higher test value is indicative of the disease) (Hanley & McNeil, 1982). The AUC is, in most cases, calculated empirically by calculating the area under the sample ROC curve, and it is often of interest to present an interval estimate of the AUC rather than simply a point estimate. There exist a number of methods that have been proposed to compute an interval estimator for the AUC. Many of these have been derived based on the equivalence relation between the AUC and Wilcoxon’s rank-sum test statistic (Hollander et al., 2014) such as Bamber (1975), Hanley & McNeil (1982), DeLong et al. (1988), Mee (1990) and Newcombe (2006). Other proposals, such as Reiser & Guttman (1986) and Newcombe (2006), developed confidence interval (CI) estimators of the AUC by assuming a parametric distribution for the test scores and deriving the interval in that manner. Cortes & Mohri (2005) suggested a CI method using a combinatoric approach, and Obuchowski & Lieber (2002) introduced an exact method, which can be used when the estimated AUC is close to 1. These methods can be easily applied when the true disease status and the test values for all subjects in the sample are known. However, in practice, data may be incomplete. This often happens when the true disease status of subjects is unknown, which commonly occurs when a patient does not go through a so-called gold standard test that, in many cases, is invasive and/or expensive. Missing data can also occur on patients’ test scores for many different reasons including they simply did not have the test performed. When observations with missing values are removed from analysis (i.e. complete case analysis), the estimator is potentially subject to bias, and when bias is caused specifically by missing disease status, this is referred to as verification bias. Many corrected estimators of AUC have been proposed to adjust biases caused by missing values. Begg & Greenes (1983), Alonzo & Pepe (2005), He et al. (2009), along with others, came up with AUC estimators correcting verification bias directly (i.e. ‘direct approach’). In order to construct CIs based on those methods, bootstrap techniques are used. Alternatively, CI methods for complete data settings can be used after applying multiple imputation (MI) techniques to the incomplete data set (i.e. ‘MI approach’) as in Harel & Zhou (2006), which sought to correct the verification bias for measuring sensitivity and specificity, and Harel & Int Stat Rev. Author manuscript; available in PMC 2019 April 18. Cho et al. Page 3 Zhou (2007a). To address the problem of missing biomarker values, Long et al. (2011a) and Author ManuscriptAuthor Manuscript Author Manuscript Author Manuscript Author Long et al. (2011b) came up with a robust approach and an approach using non-parametric MI, respectively. All the estimators correcting for bias introduced here are based on the assumption that the data are missing at random (MAR) (Rubin, 1976) and exploit the covariates to predict the missing values. Although quite a few estimators have been proposed, relatively little work has been done reviewing the performance of the CI methods for the AUC in the presence of missing data. In this article, we compare the performance of several Wald-type interval estimators for the AUC in data sets where missing data exist. In our example, the incomplete data will be true disease statuses, and it will be assumed that the missingness mechanism is MAR. 2 Wald-type Confidence Interval Methods for the Area Under the Receiver Operating Characteristic Curve Consider a test such that higher test scores are associated with a higher probability that the subject is diseased and vice versa. By definition, the AUC (θ) is the integral of the TPR of the theoretical ROC curve by its false positive rate (FPR = 1 – TNR), that is, θ ∞ t d t where TPR(t) and FPR(t) are the TPR and FPR for a threshold = ∫ −∞ TPR ( ) FPR ( ), value t. As the AUC can be interpreted as the probability that a randomly chosen diseased subject has higher biomarker value than a randomly chosen healthy subject’s biomarker value (plus half of the probability of ties is added, if any), it can also be expressed as 1 θ = P[Y > X] + P[Y = X], where X and Y are the biomarkers of the randomly chosen non- 2 diseased and diseased subjects, respectively. So it is straightforward to have an unbiased estimator of the AUC as n n Y X 1 ∑ ∑ I y > x + I y = x i = 1 j = 1 i j 2 i j θ = , nYnX where xj and yi are the test scores for the j -th and i-th individual in the non-diseased and diseased groups, respectively, and j = 1, …, nX and i = …, nY.