
DIAGNOSTIC TESTS: STATISTICAL EVALUATION TO DETERMINE SUITABILITY FOR DIAGNOSIS

Michael J. Campbell and Jenny V. Freeman demonstrate how statistical methods should be used to evaluate the suitability of a diagnostic or screening test.

IN THIS TUTORIAL we will examine how to evaluate a diagnostic test. Initially we will consider the case when there is a binary measure (two categories: disease present / disease absent). We will then look at how to define a suitable cut-off for an ordinal or continuous measurement scale, and we will finish with a short discussion contrasting diagnostic tests with screening tests.

When evaluating any diagnostic test one should have a definitive method for deciding whether the disease is present, in order to see how well the test performs. For example, to diagnose a cancer one could take a biopsy, to diagnose depression one could ask a psychiatrist to interview a patient, and to diagnose a walking problem one could video a patient and have the recording viewed by an expert. This is sometimes called the 'gold standard'. Often the gold standard test is expensive and difficult to administer, and thus a test is required that is cheaper and easier to use.

BINARY SITUATION
Let us consider first the simple binary situation in which both the gold standard and the diagnostic test have either a positive or negative outcome (disease is present or absent). The situation is best summarised by a 2 × 2 table (table 1). In writing this table, always put the gold standard on the top and the results of the test on the side. The numbers 'a' and 'd' are the numbers of true positives and true negatives, respectively. The number 'b' is the number of false positives, because although the test is positive the patients do not have the disease, and similarly 'c' is the number of false negatives. The prevalence of the disease is the proportion of people diagnosed by the gold standard and is given by (a + c)/n, although this is often expressed as a percentage.

In order to assess how good the test is we can calculate the sensitivity and specificity, and the positive and negative predictive values. The sensitivity of the test is the proportion of people with the disease who are correctly identified as having the disease. This is given by a/(a + c) and is usually presented as a percentage. Suppose a test is 100 per cent sensitive. Then the number of false negatives is zero and we would expect table 2. From table 2 we can see that if a patient has a negative test result we can be certain that the patient does not have the disease. Sackett et al.1 refer to this as SnNout, i.e. for a test with a high sensitivity (Sn), a Negative result rules out the disease.

The specificity of a test is the proportion of people without the disease who are correctly identified as not having the disease. This is given by d/(b + d) and, as with sensitivity, is usually presented as a percentage. Now suppose a test is 100 per cent specific. Then the number of false positives is zero and we would expect table 3. From table 3 we can see that if a patient has a positive test we can be certain the patient has the disease. Sackett et al. refer to this as SpPin, i.e. for a test with a high specificity (Sp), a Positive test rules in the disease.

USEFUL MNEMONIC
SeNsitivity = 1 − proportion false Negatives (n in each side)
SPecificity = 1 − proportion false Positives (p in each side)

What patients really want to know, however, is 'if I have a positive test, what are the chances I have the disease?' This is given by the positive predictive value (PPV), which is a/(a + b). One way of looking at the test is that before the test the chance of having the disease is (a + c)/n. After the test it is either a/(a + b) or c/(c + d), depending on whether the result was positive or negative. The negative predictive value is the proportion of those whose test result is negative who do not have the disease, and is given by d/(c + d).

It should be noted that whilst sensitivity and specificity are independent of the prevalence, positive and negative predictive values are not. Sensitivity and specificity are characteristics of the test and will be valid for populations with different prevalences. Thus we could use them in populations with high prevalence, such as elderly people, as well as in populations with low prevalence, such as young people. However, the PPV is a characteristic of the population and so will vary depending on the prevalence. To show this, suppose that in a different population the prevalence of the disease is double that of the current population (assume the prevalence is low, so that a and c are much smaller than b and d, and thus the results for those without the disease are much the same as in the earlier table). The situation is given in table 4. The sensitivity is now 2a/(2a + 2c) = a/(a + c), as before. The specificity is unchanged. However, the positive predictive value is now 2a/(2a + b), which is greater than the earlier value of a/(a + b).

LIKELIHOOD RATIO
It is common to prefer a single summary measure, and for a diagnostic test this is given by the likelihood ratio for a positive test, LR(+), defined as:

LR(+) = Probability of a positive test given the disease / Probability of a positive test without the disease
      = Sensitivity / (1 − Specificity)
      = a(b + d) / [b(a + c)]

One reason why this is useful is that it can be used to calculate the odds of having the disease given a positive result.
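As a concrete illustration (an editorial sketch, not part of the original article), the short Python function below computes these quantities directly from the four cell counts a, b, c and d of table 1. The counts plugged in at the end are those of the GAD2 example in table 5, discussed later, so the printed values can be checked against the worked example; the function name and output format are arbitrary choices.

# Minimal sketch: diagnostic-test summary measures from a 2 x 2 table laid out
# as in table 1. The example counts are those of table 5 (the GAD2 study below):
# a = 63, b = 152, c = 10, d = 740.

def diagnostic_summary(a, b, c, d):
    """Return the summary measures defined in the text for a 2 x 2 table."""
    n = a + b + c + d
    sensitivity = a / (a + c)      # proportion of diseased correctly identified
    specificity = d / (b + d)      # proportion of non-diseased correctly identified
    return {
        "prevalence": (a + c) / n,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "PPV": a / (a + b),        # P(disease | positive test)
        "NPV": d / (c + d),        # P(no disease | negative test)
        "LR+": sensitivity / (1 - specificity),
        # post-test odds for a positive result (discussed in the next paragraph):
        # pre-test odds multiplied by LR(+), which simplifies to a/b
        "post-test odds": ((a + c) / (b + d)) * (sensitivity / (1 - specificity)),
    }

for name, value in diagnostic_summary(63, 152, 10, 740).items():
    print(f"{name}: {value:.3f}")

Running this reproduces the figures quoted in the Example section: sensitivity 86 per cent, specificity 83 per cent, PPV 29 per cent and NPV 98.7 per cent.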


The odds of an event are defined as the ratio of the probability of the event occurring to the probability of the event not occurring, i.e. p/(1 − p), where p is the probability of the event. Before the test is conducted the probability of having the disease is just the prevalence, and the odds are simply {(a + c)/n}/{(b + d)/n} = (a + c)/(b + d). The odds of having the disease after a positive test are given by:

Odds of disease after positive test = odds of disease before test × LR(+) = a/b

We can also get the odds of disease after a positive test directly from the PPV, since the odds of disease after a positive test are PPV/(1 − PPV).

TABLE 1. Standard table for diagnostic tests.

Diagnostic test    Gold standard positive    Gold standard negative    Total
Positive           a                         b                         a + b
Negative           c                         d                         c + d
Total              a + c                     b + d                     n

TABLE 2. Results of a diagnostic test with 100 per cent sensitivity.

Diagnostic test    Gold standard positive    Gold standard negative    Total
Positive           a                         b                         a + b
Negative           0                         d                         d
Total              a                         b + d                     n

TABLE 3. Results of a diagnostic test with 100 per cent specificity.

Diagnostic test    Gold standard positive    Gold standard negative    Total
Positive           a                         0                         a
Negative           c                         d                         c + d
Total              a + c                     d                         n

TABLE 4. Standard situation but with a doubling of the prevalence.

Diagnostic test    Gold standard positive    Gold standard negative    Total
Positive           2a                        b                         2a + b
Negative           2c                        d                         2c + d
Total              2(a + c)                  b + d                     n

EXAMPLE
A recent study by Kroenke et al.2 surveyed 965 people attending primary care centres in the US. They were interested in whether a family practitioner could diagnose Generalised Anxiety Disorder (GAD) by asking two simple questions (the GAD2 questionnaire): 'Over the last two weeks, how often have you been bothered by the following problems? (1) Feeling nervous, anxious or on edge; (2) not being able to stop or control worrying'. The patients answered each question with 'not at all', 'several days', 'more than half the days' or 'nearly every day', scoring 0, 1, 2 or 3, respectively. The scores for the two questions were summed and a score of 3 or more was considered positive. Two mental health professionals then held structured psychiatric interviews with the subjects over the telephone to diagnose GAD. The professionals were ignorant of the result of the GAD2 questionnaire. The results are given in table 5.

The prevalence of the disease is given by (a + c)/n = 73/965 = 0.076 = 7.6 per cent.
The sensitivity of the test is given by a/(a + c) = 63/73 = 0.86 = 86 per cent.
The specificity of the test is given by d/(b + d) = 740/892 = 0.83 = 83 per cent.
The positive predictive value (PPV) is a/(a + b) = 63/215 = 0.29 = 29 per cent.
The negative predictive value is d/(c + d) = 740/750 = 0.987 = 98.7 per cent.

Thus before the test the chances of having GAD were 7.6 per cent. After the test they are either 29 per cent or 1.3 per cent (i.e. 100 × (1 − 0.987)), depending on the result. Note that even with a positive test the chances of having GAD are still less than 1/3. For the GAD example we find that LR(+) = 0.86/(1 − 0.83) = 5.06 and the odds = 0.29/(1 − 0.29) = 0.41.

ROC CURVES
For a diagnostic test that produces results on a continuous or ordinal measurement scale, a convenient cut-off level needs to be selected to calculate the sensitivity and specificity. For example, the GAD2 has possible values from 0 to 6. Why should one choose the value of 3 as the cut-off? For a cut-off of 2 the sensitivity is 0.95, the specificity is 0.64 and the LR(+) is 2.6 (reference 2). One might argue that since a cut-off of 3 has a better LR(+) one should use it. However, a cut-off of 2 gives a higher sensitivity, which might be important. It should be noted that a sensitivity of 100 per cent is always achievable by stating that everyone has the disease, but this is at the expense of a poor specificity (similarly, a 100 per cent specificity can be achieved by stating that no-one has the disease; if the prevalence is low, this tactic will have a high accuracy, i.e. it will be right most of the time, but sadly wrong for the important cases). A discussion of the different scenarios for preferring a high specificity or sensitivity is given in the next section.

A simple graphical device for displaying the trade-off between sensitivity and specificity for tests on a continuous or ordinal scale is a receiver operating characteristic (ROC) curve (the unusual name originates from electrical engineering). This is a plot of sensitivity versus one minus specificity for different cut-off values of the diagnostic test. ROC curves for two theoretical tests are shown in figure 1, together with the line of equality, which is what we would expect if a test had no power to detect disease. A perfect diagnostic test would be one with no false negatives (i.e. sensitivity of 1) or false positives (i.e. specificity of 1) and would be represented by a line starting at the origin, travelling vertically up the Y-axis to a sensitivity of 1 and then horizontally across to a false positive rate of 1. Any diagnostic test that was reasonable would produce a ROC curve in the upper left-hand triangle of figure 1. The selection of the optimal cut-off will depend upon the relative medical consequences and costs of false positive and false negative errors.
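To make the construction of an ROC curve concrete, the Python sketch below (again an editorial illustration, not part of the original article) sweeps every possible cut-off of an ordinal score, calling a subject test-positive when their score is at or above the cut-off, computes the (1 − specificity, sensitivity) point at each cut-off, and estimates the area under the curve (discussed below) by the trapezium rule. The scores and gold-standard diagnoses are invented purely for illustration.

def roc_points(scores, diseased):
    """Return (1 - specificity, sensitivity) pairs for each possible cut-off."""
    points = []
    # Include one cut-off above the maximum so that 'call nobody positive' is represented.
    for cut in sorted(set(scores)) + [max(scores) + 1]:
        tp = sum(1 for s, d in zip(scores, diseased) if s >= cut and d)
        fp = sum(1 for s, d in zip(scores, diseased) if s >= cut and not d)
        fn = sum(1 for s, d in zip(scores, diseased) if s < cut and d)
        tn = sum(1 for s, d in zip(scores, diseased) if s < cut and not d)
        points.append((fp / (fp + tn), tp / (tp + fn)))  # (1 - specificity, sensitivity)
    return sorted(points)

def auc(points):
    """Area under the ROC curve by the trapezium rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Invented GAD2-style scores (0 to 6) and gold-standard diagnoses (1 = disease present).
scores   = [0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 0, 1]
diseased = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0]
pts = roc_points(scores, diseased)
print(pts)
print("AUC =", round(auc(pts), 2))

Plotting the resulting points gives the kind of curve shown in figure 1.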

ROC curves are particularly useful for comparing different diagnostic tests: when more than one test is available they can be compared by plotting both on the same axes. A test for which the plot is consistently nearer the left-hand side and the top is to be preferred. In addition, the area under the curve (AUC) for each plot can be calculated. For the perfect test outlined above the AUC is 1 and represents the total area of the panel (i.e. 1 × 1). For the two curves displayed it is obvious that the best test is the one represented by the dashed line on the left of the figure. This has an AUC value of 0.95, compared with the other, much poorer, curve, which has an AUC value of 0.59.

DISTINCTION BETWEEN DIAGNOSIS AND SCREENING
It is important to understand the difference between diagnosing a disease and screening for it. In the former case there are usually some symptoms, and so there may already be a suspicion that something is wrong. If a test is positive some action will be taken. In the latter case there are usually no symptoms, and so if the test is negative the person will have no further tests. Recalling Sackett's mnemonics SpPin and SnNout, for diagnosis we want a positive test to rule people in, so we want a high specificity. For screening we want a negative test to rule people out, so we want a high sensitivity. Thus mass mammography will have a fairly low threshold of suspicion, to ensure a high sensitivity and reduce the chances of missing someone with breast cancer. The subsequent biopsy of positive results will have a high specificity, to ensure that if, say, mastectomy is to be considered, the doctor is almost certain that the patient has breast cancer.

SUMMARY
This tutorial has summarised the methods used for examining the suitability of a particular test for diagnosing disease. In addition it has highlighted the difference between diagnostic and screening tests. In reality the same methods are used to evaluate both diagnostic and screening tests, the important difference being the emphasis that is placed on the sensitivity and specificity. Further details are given in Chapter 4 of Campbell et al.3

TABLE 5. Results from Kroenke et al.2

GAD2      Diagnosis by mental health worker
          Positive    Negative    Total
≥3        63          152         215
<3        10          740         750
Total     73          892         965

FIGURE 1. Example ROC curves, showing also the line of equality.

REFERENCES
1 Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-based Medicine. New York: Churchill Livingstone, 1997.
2 Kroenke K, Spitzer RL, Williams JB, Monahan PO, Löwe B. Anxiety disorders in primary care: prevalence, impairment, comorbidity and detection. Ann Intern Med 2007; 146: 317–25.
3 Campbell MJ, Machin D, Walters SJ. Medical Statistics: A Textbook for the Health Sciences. Chichester: Wiley, 2007.
