Statistical Evaluation of Diagnostic Tests (Part 1): Sensitivity, Specificity, Positive and Negative Predictive Values NJ Gogtay, UM Thatte

Home , Diagnostic odds ratio, Gold standard (test), Prevalence

80 Journal of The Association of Physicians of India ■ Vol. 65 ■ June 2017

Statistics for Researchers Statistical Evaluation of Diagnostic Tests (Part 1): Sensitivity, Specificity, Positive and Negative Predictive Values NJ Gogtay, UM Thatte

Introduction to Screening a definitive diagnosis and are often evening rise of temperature and called confirmatory tests. cough for over 3 weeks [wherein and Diagnostic Tests he suspects tuberculosis], but The Challenge of n clinical practice, two broad would never treat the patient for Itypes of tests are used- screening Interpreting Screening and tuberculosis simply based on the and diagnostic tests. Screening Diagnostic Tests results of an elevated ESR. tests are those that are used on a Thus, the question that the large population [usually healthy The interpretation of screening clinician is really trying to answer individuals or patients who are and diagnostic tests can be is “Given a test result [positive or yet asymptomatic] to identify challenging. For instance, for negative], what is the probability those likely to need intervention the diagnosis of malaria, the that the patient has [or does not] or identify disease early. Examples peripheral smear still remains the have the disease?’ From the point of screening tests would include “gold standard” or the “reference of view of the patient, his/her routine blood pressure monitoring standard” test for identifying the thoughts are likely to be a) the test for diagnosing hypertension, a malarial parasite. For a patient who is negative so should I be reassured Pap smear for early diagnosis of presents with fever, a physician or continue to worry? b) the test is cervical cancer, Prostate Specific thinks of the probability of malaria positive- should I worry or simply Antigen [PSA] estimation for and orders the peripheral smear. ignore it? detection of prostate cancer or a The test result is binary - either Diagnostic and screening tests, mammogram for early detection positive or negative. The easiest thus, should be used correctly and of breast cancer. In real life, these approach for a clinician would be interpreted appropriately to make are usually done, for example, for to simply classify the patient into a diagnosis or aid in one. This is purposes of obtaining an insurance one of two groups- “the test is dependent upon the discriminative or a routine health checkup or positive and hence the patient has ability of the test, i.e., the ability as part of evidence based health the disease and thus I should treat to make a distinction between the policy recommendations in a him for malaria” OR that “the test two conditions of interest- health given population. Screening tests is negative and hence the patient and disease. should be easy to use, relatively does not have malaria and so I need inexpensive and ensure that they to look for alternate diagnoses”. The Discriminative Ability do not miss patients with disease, But does this really happen in the of a Test – Metrics of nor misclassify those without. clinical setting? The answer is Diagnostic Accuracy The second type of test is the maybe not! A physician may still prescribe anti-malarials to a patient diagnostic test. This is an aid to The following metrics are used who is smear-negative, because the clinical decision-making, done on to assess the diagnostic accuracy of signs and symptoms are classically patients who are symptomatic and a new test (regardless of whether it that of malaria or because the is usually more expensive and can is a screening or a new diagnostic patient has had a past episode of carry more risks than screening test, as defined above) 2 tests [a trans-rectal biopsy for malaria. Similarly, a clinician may • Sensitivity confirmation of prostate cancer for ask for a complete blood count instance carries greater risk than a along with ESR in a patient with • Specificity screening blood test for the PSA].1 Diagnostic tests are also done after a positive screening test to establish Department of Clinical Pharmacology, Seth GS Medical College & KEM Hospital, Mumbai, Maharashtra Received: 06.05.2017; Accepted: 10.05.2017 Journal of The Association of Physicians of India ■ Vol. 65 ■ June 2017 81

Table 1: A 2 x 2 table of depicting the results of a new test vis à vis a gold standard that a disease-free individual Disease Status as confirmed by the gold standard will have a negative test and Present Absent represents the “true negativity Test Result Positive True Positive [TP] a False positive [FP] b a+b rate” or TNR of the test. Simply put, Test Result Negative False Negative [FN] c True Negative [TN] d c+d this would be the ability of a test to a+c b+d correctly classify an individual as being disease- free.2 • Positive and Negative • Negative and the individual predictive values does not have the disease [TN]- Mathematically, this would be • Likelihood ratio d Those WITHOUT the disease who test negative [d] • Area under the Receiver • Negative, but the individual Operating Characteristic [ROC] has the disease [FN] -c ALL those WITHOUT the disease Curve and Youden’s index For the purpose of calculating [b+d] • Diagnostic odds ratio the various metrices described OR expressed mathematically as above, the data is summated In this first article in the TN conventionally as follows: Diagnostic tests series, we will TN + FP • a+c gives the total number of understand the concepts of Thus, when we say that a test individuals WITH the disease sensitivity, specificity and positive has a specificity of 90%, it means and negative predictive values • b+d is the total number of that of the 100 individuals who along with the mathematical individuals WITHOUT the do not have the disease and are formulae to compute them as disease tested, the test would show 90/100 also their limitations and clinical • a+b gives the total number of [90%] individuals as not having applications. positive test results the disease. The corollary would Beginning the Assessment • c + d gives the total of negative be that 10/100 [10%] of them would test results be wrongly picked up as having the of a New Test - The 2 X 2 disease [false positive]. Table Evaluating and Sensitivity and specificity Understanding the Metrics primarily address the question The assessment of any new “How accurately does the test test [also called as the index test] Sensitivity being evaluated discriminate begins with testing in two groups This is defined as the probability between individuals with disease of individuals - those who have the that an individual with disease will and without?” Both are test disease and those who do not. It is have a positive test and represents characteristics or test properties good to remember here that any the “true positivity rate” or TPR of and are independent of the disease test can return results as binary the test. Simply put, it is the ability prevalence of the population where [positive or negative as seen with of a test to correctly classify an they are tested as we will see a peripheral smear for malaria] or individual as ′diseased′.2 little later. on a continuous scale as seen with Mathematically, this would be Positive and Negative Predictive blood sugar, serum cholesterol or values plasma phenytoin levels. Those WITH the disease who test positive [a] As stated earlier, what the The next step would be the clinician wants to know is “Given ALL those WITH the disease construction of the two by two [2x2] a certain test result, what is the [a+c] table [Table 1]. The disease status probability of the disease?” which (as assessed with the Gold Standard OR brings us to understanding the {see below}) is conventionally put TP “predictive value” concept. in the top row and the test result in Positive predictive value (PPV) is the first column. TP + FN When we say that a test has a the probability that an individual Table 1 represents the four with a positive test truly has distinct possibilities that follow sensitivity of 90%, it means that of the 100 individuals who have the disease. In other words, an after a test is conducted on an individual has a positive test; individual. The test is the disease and are tested, the test will pick up 90/100 [90%] with the how worried should he be? • Positive and the individual has disease. The corollary would be Mathematically, this would be the disease [TP] - a that 10/100 [10%] would be missed a ratio and expressed as the • Positive, but the individual [false negative]. proportion of all those tested who have the disease AND a positive does not have the disease [FP] Specificity - b test [a] to all those screened who This is defined as the probability return a positive test [a+b]. Thus th Journal of The Association of Physicians of India ■ Vol. 65 ■ June 2017 8 May 2017 1030 am Final 82

Fig. 1: An ideal test that clearly discriminates between those with disease and those without with a clear cutoff Table 2: Calculation of sensitivity, specificity, positive and negative predictive Without disease With disease values of a test using the 2x2 table3

Posi�ve predic�ve value = a/a +b

Disease Test P resent Absent

Sensi�vity = a/a +c Posi�ve True Posi�ve [TP] a False posi�ve [FP] b a+b Speciﬁcity = d/b+d +c Nega�ve F alse Nega�ve [FN] c True Nega�ve [TN] d c+d

Nega�ve predic�ve value = d/d +c

Blood sugar = 126mg/dl Table 3: Diagnosis of microfilaraemia in a village with a prevalence of 5% using a test with 90% sensitivity and 90% specificity This however rarely happens in clinical practice and let us understand this with an example. Measuring Fig. 1: An ideal test that clearly fasting blood sugar is one of the screening tests that is used to make the provisional diagnosis of Test Disease discriminates between diabetes. Let us say we have a pre-identified sample of n = 20 patients who have diabetes and n = 20 Present Absent those with disease and who do not, based on a gold standard test [the oral glucose tolerance test; OGTT]. We have with us a Positive True Positive [45] False positive [95] a+b [140] those without with a clear Negative False Negative [5] True Negative [855] c+d [860] new screening test that doescut-off not need a finger prick but rather measures fasting blood sugar by the principle of infra-red light passing through the skin and is thus noninvasive. When this test is used for 50 950 1000 measuring blood sugarPositive in this sample predictiveof patients we find that itvalue has a sensitivity = of 85% and specificity of Table 4: Diagnosis of microfilaraemia in a village with a prevalence of 20% using a30%. 180/260 or 0.69 or 69% test with 90% sensitivity and 90% specificity In village 2, the same test is a Test Disease much better test! Present Absent Thus, it is now clear that while Positive True Positive [180] False positive [80] a+b [260] sensitivity and specificity remain Negative False Negative [20] True Negative [720] c+d [740] unaltered as they are properties 200 800 1000 The 2 x 2 table for this new test would be as follows: of the test itself, the positive and Table 3 Results with a new screening test for diabetes with 85% sensitivity and 30% specificity mathematically, Positive predictive village 1 with a prevalence of 5%. negative predictive values are Disease value is given by a/a+b. This means that 50 individuals in not fixed and vary with variation Test Present Absent this village have the disease, and Negative predictive value (NPV) Positive in the Truedisease Positive n = 17prevalence. False positive n The= 14 a+b n = 31 is the probability that individuals 950 are disease free. A test used corollary to this statement is that to diagnose microfilaraemia has a with a negative screening test truly because sensitivity and specificity 7 do not have the disease. In other sensitivity of 90% and specificity are only test performance features words, the individual has tested of 90%. Thus, of the 50 individuals and do not address the problem of negative, so how reassured should with the disease, the test would prevalence in diverse populations; he be? Mathematically, this would correctly identify 45 as having the positive and negative predictive be expressed as the proportion of disease [and miss 5] and of the 950 values become important in all those tested who DO NOT have without the disease, the test would interpretation of test results. the disease AND are negative [d] correctly identify 855 as not having to ALL those who test negative [d/ the disease [and falsely label 95 as The Tradeoff between c+d].2 having the disease]. Let us now fit Sensitivity and Specificity the 2 x 2 table with the actual values In the 2 x 2 table presented earlier, based on this information. When we finally choose a test, the four concepts of sensitivity, we often have to often accept a specificity, positive predictive Microfilaraemia prevalence – trade-off between sensitivity and value and negative predictive value 5%, Test sensitivity 90% and test specificity. Figure 1 depicts an can be easily understood with the specificity 90% ideal scenario where a fasting four arrows and the direction of Positive predictive value = plasma glucose of 126mg/dl {based their movement (Table 2). 45/140 or 0.32 or 32% on the guidelines of the American The Relationship of Thus, in this population, this test Diabetes Association [ADA]} may not be such a good test. clearly identifies individuals with Predictive Values with In village 2, the prevalence of diabetes and those without. Prevalence microfilaraemia is much higher at This however rarely happens 20%. Then, the 2 x 2 table would in clinical practice and let us Unlike sensitivity and look as given below for the same understand this with an example. specificity, PPV and NPV are not sensitivity and specificity Measuring fasting blood sugar fixed characteristics of the test, Microfilaraemia prevalence is one of the screening tests that but depend upon the prevalence – 20%, test sensitivity 90%, test is used to make the provisional of the disease.3,4 Let us say that specificity 90% diagnosis of diabetes. Let us say we testing for microfilaraemia in we have a pre-identified sample of a population of 1000 patients in Journal of The Association of Physicians of India ■ Vol. 65 ■ June 2017 83

Table 3: Results with a new screening test for diabetes with 85% sensitivity and disease as a positive test indicates 30% specificity that the patient has the disease Test Disease in all likelihood. There are two Present Absent clinical scenarios where a highly Positive True Positive n = 17 False positive n = 14 a+b n = 31 specific test is useful Negative False Negative n = 3 True Negative n = 6 c+d n = 9 a. When a false positive test can 20 20 harm the patient physically Table 4: Results with a new screening test for diabetes with 25% sensitivity and or emotionally [for example 90% specificity declaring a person to be HIV positive or declaring the Test Disease diagnosis of cancer. Here the Present Absent clinician has to be absolutely Positive True Positive n = 5 False positive n = 02 a+b n = 07 sure that the patient does Negative False Negative n = 15 True Negative n = 18 c+d n = 33 indeed have the disease] 20 20 b. To rule in a diagnosis suggested n = 20 patients who have diabetes very few]. Thus, a test with high by other tests [for example a and n = 20 who do not, based sensitivity actually rules out the biopsy that will rule in the final on a gold standard test [the oral disease as a negative test in fact diagnosis of breast or prostate glucose tolerance test; OGTT]. We indicates absence of the disease. 3, 4, 5 cancer that has been suggested have with us a new screening test There are three clinical scenarios by a mammogram or PSA test] that does not need a finger prick where high sensitivity is required The serum ferritin test [90% but rather measures fasting blood for a test specificity] for iron deficiency sugar by the principle of infra-red a. When there is an important anemia is an example of a test with light passing through the skin and penalty for missing a patient high specificity.5 is thus noninvasive. When this test with the disease. For example, is used for measuring blood sugar in a blood bank, a highly How Varying the Cut- in this sample of patients we find sensitive test like the ELISA that it has a sensitivity of 85% and Off Points can Impact is needed as missing an HIV specificity of 30% (Table 3). Sensitivity and Specificity positive donor can have serious Using this test, a majority (17/20) consequences for the recipient. Sensitivity and specificity are of those who have diabetes have b. When the probability of the inversely proportional, meaning been picked up correctly, but disease is low and the sole that as the sensitivity increases, the there are also a large number of purpose of the test is to discover specificity decreases and vice versa. individuals who have been falsely asymptomatic individuals. Let us understand this with an labelled as being diabetic (14/20, This is classically seen with example of Prostate specific antigen because of the low specificity 30%). screening for diabetes in [PSA] in the diagnosis of prostate Subsequently, we decide to use diabetes detection “camps” cancer. Worldwide, most studies another test that is also noninvasive where apparently normal have used a PSA cut off of 4ng/ and uses saliva to measure glucose. individuals are screened en ml for the diagnosis of the disease This test is found to have a masse. [i.e., those below are likely not have sensitivity of 25% and a specificity c. In early stages for the work up the disease, and those above likely of 90% (Table 4). 6 of a disease. Here a “negative” to]. At this cut off, the sensitivity As seen from Table 4, the test test tells the clinician that a of the test is approximately 20% has labelled a majority without particular disease is highly and specificity 90%. Table 5 depicts the disease correctly as not having unlikely in the patient and this information for 50 patients the disease, but has missed a large that he should be looking at with proven prostate cancer and 50 number of patients who actually differential diagnoses. individuals who are disease free. have the disease. Examples of tests with high When the PSA cut off is lowered from 4ng/ml to 3 ng/ml, logically, When do you Need a Test sensitivity include a positive D-Dimer test for deep vein more patients with the disease will to be Highly Sensitive or thrombosis [sensitivity 89%] or the be picked, but more individuals Highly Specific? positive corneal reflex [sensitivity without disease will now be 92%] for favorable prognosis labelled as having the disease. When a test has high sensitivity, following non-traumatic coma.5 Table 6 depicts this information. the maximum number of patients Likewise, tests with high The lowering of the PSA cut off with the disease are picked up, specificity actually “rule in” the from 4ng/ml to 3 ng/ml has resulted [meaning the false negatives are in the following: an increase in 84 Journal of The Association of Physicians of India ■ Vol. 65 ■ June 2017

Table 5: Evaluation of prostate cancer with a PSA cut off of 4ng/ml with test screening and diagnostic tests need sensitivity of 20% and test specificity of 90% to be interpreted in the context of PSA [test] 4ng/ml Prostate cancer diagnosed by the current gold standard performance of the test [as assessed Present Absent by the metrics of sensitivity and Positive True Positive n = 10 False positive n = 05 a+b n = 55 specificity] and disease prevalence Negative False Negative n = 40 True Negative n = 45 c+d n = 45 [as assessed by the positive and 50 50 negative predictive values]. Given that both benefit [identifying Table 6: Evaluation of prostate cancer with a PSA cut-off of 3ng/ml individuals with disease and PSA [test] 3ng/ml Prostate cancer diagnosed by the current gold standard ruling out those without] and harm Present Absent [falsely labelling an individual as Positive True Positive n = 15 False positive n = 10 a+b n = 55 having disease or missing disease Negative False Negative n = 35 True Negative n = 40 c+d n = 45 in others] can accrue with the use 50 50 of these tests, their use by clinicians should be judicious and made detection of true positives [from patient’s blood] give the diagnosis with a clear understanding and 10 to 15 individuals]. However, the with ease and great rapidity. appreciation of the implications number of false positives has also Now, these “new” tests have to be for diagnosis and subsequent gone up [from 5 individuals to 10]. “compared” for their performance management, test limitations, This results in a test sensitivity of with the existing gold standard. financial considerations and 30%. Similarly, the reduction of A gold standard is the one that is finally, impact on the patient’s detection of true negatives [from universally accepted as being the quality of life. 45 to 40 individuals] and improved benchmark test for that condition false negative rate [which has to make a definitive diagnosis or Acknowledgements dropped from 40 to 35 individuals] the most accurate test at that point The authors are grateful to giving a test specificity of 80%. in time. For example, the gold Dr. Seema Kembhavi, Radiation Thus, when we vary cut offs or standard test for the diagnosis Oncologist from the Tata Memorial boundaries, the following aspects of prostate cancer as stated Hospital, Mumbai for her helpful need to be borne in mind earlier would be the trans-rectal inputs on the manuscript. ultrasound guided biopsy and that • Different cut off points or for coronary artery disease would References boundaries will yield different be a coronary angiography. A gold sensitivities and specificities 1. Streiner DL. Diagnosing tests: using and standard test may be a “single” best misusing diagnostic and screening tests. J • The cutoff point is crucial in test or a combination of tests. Pers Assess 2003; 81:209-19. that it labels patients as having 2. Šimundić A-M. Measures of Diagnostic the disease or otherwise Use of Multiple Tests Accuracy: Basic Definitions. EJIFCC 2009; • A cutoff point that identifies 19:203-211. More often than not, in clinical more true negatives, will also 3. Schulz KF, Grimes DA. Uses and abuses of practice, clinicians tend to use yield more false negatives screening tests. In The Lancet Handbook multiple rather than single tests and of essential concepts in clinical research. • A cutoff point that identifies this needs to be remembered. For Elsevier. 2006. more true positives, will also example, for primary open angle 4. Parikh R, Mathai A, Parikh S, et al. yield more false positives glaucoma, the most prevalent form Understanding and using sensitivity, What a Gold Standard is of glaucoma, the diagnosis is made specificity and predictive values. Indian on a combination of measuring Journal of Ophthalmology 2008; 56:45-50. If we go back to the example of intra ocular pressure [IOP] and 5. http://learn.chm.msu.edu/epi/Coursepack/ assessing optic disc changes with EPI546_Lecture_5_course_notes.pdf, peripheral smear for the diagnosis accessed on 24th April 2017. of malaria, the test is available a slit lamp examination. Similarly, 6. Ankerst DP, Thompson IM. Sensitivity and only in select centers, and requires the initial diagnosis of prostate cancer is usually made with a specificity of prostate-specific antigen for considerable technical expertise, prostate cancer detection with high rates time and skill in identifying the combination of Digital rectal of biopsy verification. Arch Ital Urol Androl parasite. Hence, several rapid examination [DRE] and the serum 2006; 78:125-9. diagnostic tests that use malarial Prostate specific antigen estimation 7. Mistry K, Cable G. Meta-analysis of antigens have been introduced [PSA] and confirmed by trans- prostate-specific antigen and digital 7 where the presence or absence of a rectal ultrasound guided biopsy. rectal examination as screening tests for prostate carcinoma. J Am Board Fam Pract “line” along with the control line Conclusions 2003; 16:95-101. on the test strip [which requires In summary, results of both nothing more than a drop of the