Mathematics of Diagnostic Testing Steven Goodman, MD, PhD

Evaluating Diagnostic Tests

What kind of tests are these?

Most generally, a procedure that provides information that places a probability on a subject having:

• a disease (diagnostic test)
• a pre-clinical condition (screening)
• a high-risk state (screening)
• a clinical event (prediction)

What kinds of tests are these?

• Devices: CT scans, ultrasound, thermography, FOBT, colonoscopy, oral rinse
• Biomarkers: PSA, BMI, GTT, d-Dimer, pregnancy test
• Genes: Huntington’s, BrCA, gene expression
• Clinical hx, signs and symptoms

For us to care, this information must be actionable.


Differences from explanatory or etiologic epi

• Who is being tested
• Where they are being tested
• Why they are being tested
• How they are being tested
• The consequences of testing:
  - to the one tested
  - to society

We will care about…

Subject -> Test setting -> Test -> Result -> Action -> Consequences

We will care about…

• Individual classification, not group differences

[Figure: example group comparisons with panels labeled N=1000, t=15, P<0.0001, OR=2.7; N=70, t=4, P=0.001, OR=2.5; and N=40, P=0.15; AUC = 0.76.]


We will care about…

• Absolute, not relative risks.

Diagnostic study design

Courtesy of Milo A. Puhan, MD, PhD, JHU Dept. of Epidemiology

Key messages

• Diagnostic research goes far beyond sensitivity and specificity
• Diagnostic test evaluation ranges from determination of test accuracy to impact on health outcomes and costs
• The diagnostic research question must take the context of the diagnostic work-up into consideration
• Diagnostic study design varies greatly across the phases of diagnostic test evaluation

Learning objectives

• To describe the use of diagnostic tests in practice
• To frame a diagnostic research question
• To know about the phases of diagnostic test evaluation
• To know the different diagnostic study designs


Brain natriuretic peptide (BNP) for dx’ing CHF

Cut-offs between 15 and 100 pg/ml used

Use of BNP in practice

Stage of clinical management     Setting                        Available information                Purpose of BNP test
Early diagnostic work-up         Primary care, ER               Patient history, physical exam       Triage
Diagnosis not yet established    ER                             + Chest X-ray, ECG                   Add-on, replacement
Diagnosis established            ER, specialized care           + Echocardiography                   Prognostic information
Under treatment                  Primary or specialized care    + Diagnostic work-up, treatments     Monitor disease process

Toma et al Cardiovascular Medicine 2007;10:27–33

BNP for triage to avoid unneeded work-up

[Figure: Heart failure among differential diagnoses after taking patient history and physical exam. Probability scale from 0% to 100%. Positive test result: heart failure work-up ± treatment. Negative test result: consider other diagnoses. Both branches lead to health outcomes.]

BNP as add-on test

[Figure: Heart failure suggested after patient history, physical exam, ECG and chest X-ray. Probability scale from 0% to 100%. Positive test result: heart failure work-up ± treatment. Negative test result: consider other diagnoses. Both branches lead to health outcomes.]


BNP used as replacement for echocardiography

[Figure: Heart failure suggested after patient history, physical exam, ECG and chest X-ray. Probability scale from 0% to 100%. Positive test result: treatment. Negative test result: consider other diagnoses. Both branches lead to health outcomes.]

BNP for monitoring

[Figure: Heart failure diagnosis and treatment established. Not in therapeutic range: adapt treatment. In therapeutic range: treatment unchanged.]

Diagnostic research questions

[Figure: the diagnostic pathway, annotated with a research question at each step.]

• Information before testing: What is the prior probability?
• Positive or negative test result: How accurate is the test at what stage of diagnostic work-up?
• Heart failure work-up ± treatment: Is decision making influenced?
• Health outcomes: Is the health outcome influenced? At acceptable costs?

Framing a diagnostic research question

PICO

• Population (setting, patient characteristics, stage of diagnostic work-up)
• Index test (test of interest)
• Control (reference) test (determines disease status)
• Outcome (test accuracy measures, management decisions, health outcomes)


Phases in the development of a diagnostic test

Test phases for diagnostic tests

Phase I     Investigates whether test results are different for patients with and without disease
Phase II    Investigates whether patients with disease are more likely to have positive test results compared to patients without disease
Phase III   Investigates how well the test distinguishes between patients with and without disease among patients suspected of having the disease
Phase IV    Investigates how informative a test is considering additional information available at the moment of testing
Phase V     Investigates whether using the test leads to better health outcomes
Phase VI    Investigates whether using the test leads to better health outcomes at acceptable costs

Phases of diagnostic test evaluation

Phase I: patients with heart failure vs. healthy subjects

Phase II: patients with heart failure vs. healthy subjects

                  Patient with      Healthy
                  heart failure     subjects
Test positive     90                20            PPV: 82%
Test negative     10                80            NPV: 89%
                  Sens: 90%         Spec: 80%

+LR: 4.5   -LR: 0.13   DOR: 36

Phase III: patients with heart failure vs. patients without heart failure

                  Patient with      Patient without
                  heart failure     heart failure
Test positive     85                250           PPV: 25%
Test negative     15                650           NPV: 98%
                  Sens: 85%         Spec: 72%

+LR: 3.0   -LR: 0.21

Phase IV: patients suspected of having disease (± disease). Probability of heart failure? Age, gender, smoking and coronary heart disease status known.
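The diagnostic odds ratio (DOR) quoted for the Phase II table can be rechecked from the four cell counts. The following is a minimal sketch; the function name `diagnostic_odds_ratio` and its argument order are illustrative assumptions, not something the slides define.

```python
def diagnostic_odds_ratio(tp, fp, fn, tn):
    """Odds of a positive test in the diseased divided by the odds in the non-diseased.

    Algebraically the same as LR(+) / LR(-). Undefined when fp or fn is zero.
    """
    return (tp * tn) / (fp * fn)


# Cell counts from the Phase II and Phase III tables.
phase2_dor = diagnostic_odds_ratio(tp=90, fp=20, fn=10, tn=80)
phase3_dor = diagnostic_odds_ratio(tp=85, fp=250, fn=15, tn=650)

print(f"Phase II DOR:  {phase2_dor:.1f}")
print(f"Phase III DOR: {phase3_dor:.1f}")
```

The Phase III value is lower because the comparison group is patients suspected of disease rather than healthy subjects, which is the spectrum effect the phases are meant to expose.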


Phases of diagnostic test evaluation: Phase V

[Figure: Randomized trial. Patients suspected of having heart failure are randomized (R) to a (+) arm and a (-) arm; each arm: ± treat and follow-up, then health outcome.]

[Figure: Before-after study. Until 2000: patients suspected of having heart failure, (-), ± treat and follow-up. From 2001: patients suspected of having heart failure, (+), treat and follow-up. Health outcome is compared between the two periods.]

Study designs for the evaluation of diagnostic tests

Phase I     Cross-sectional case-control study (pro- or retrospective)
Phase II    Cross-sectional case-control study (pro- or retrospective)
Phase III   Cross-sectional study of patients suspected of having disease (pro- or retrospective)
Phase IV    Cross-sectional study of patients suspected of having disease
Phase V     Randomized trial or before-after study
Phase VI    Cost effectiveness study

• Systematic review

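Diagnostic test framework

[Figure: framework linking the population, the dx test and its reader, and the reference standard (ref. test) and its reader, with outcomes labeled GOOD and BAD.]

[Figure: sources of bias in dx testing, placed on the framework. Population: spectrum, selection, context (prevalence). Test and test reading: measurement, imperfect test, reader bias, reading order, outlier/no result. Reference standard: imperfect gold standard, verification, workup, incorporation. Information bias: test-review, diagnostic review.]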


Is Sensitivity a Property of the Test Alone?

Diagnostic test framework

[Figure: framework linking the population, the dx test and its reader, and the reference standard (ref. test) and its reader, with outcomes labeled GOOD and BAD.]

Is Specificity a Property of the Test Alone?

Spectrum Bias - Sensitivity

Sensitivity of a test changes as the composition of the case population changes, with different proportions of mild, moderate and severe cases.

[Figure: distributions of test value for normal subjects and for mild, moderate and severe cases, with the cutoff marked on the test-value axis.]


Spectrum Bias - Specificity

Specificity of a test usually goes down when the non-diseased population includes more people with symptoms that mimic the disease.

[Figure: distributions of test value for normal subjects, subjects with sx that mimic dz, and mild dz, with the cutoff marked.]

Diagnostic test biases

• Spectrum bias
  - Spectrum of disease severity/subtype in cases varies from target population
  - Spectrum of comorbidities in controls varies from target population
• Information bias
  - Interpretation of dx test is affected by gold standard (or patient correlates), or vice versa.
• [Imperfect] Measurement bias
  - Bias introduced by imperfect dx or referent test reading.

Mathematics of Diagnostic Testing

Topics

• Basic goals of diagnostic testing
  - Threshold model
  - Link to decision analysis
• Mathematics of diagnostic testing (review)
  - Bayes theorem
  - Sensitivity, Specificity, Odds, Likelihood ratio, PV
• Tests with multiple cutpoints
  - ROC curves
  - Choosing optimal cutpoints
  - Distinction between binary vs. continuous LRs


Goal of a diagnostic test?

• NOT just to distinguish between diseased and non-diseased patients.
• To improve the overall health outcomes (or reduce cost or suffering) in a group in whom it is applied.
• To do more good than harm.

Threshold model of testing

[Figure: probability of disease axis from 0 to 100%, divided into three regions: Go home; More testing (reference standard); Treat.]

Basic quantitative concepts

• Sensitivity: Proportion of persons with disorder who will test positive
• Specificity: Proportion of persons without disorder who will test negative
• Likelihood ratio: Factor by which the disease odds change from before to after the test
• Predictive value: Probability of disease (positive test) or non-disease (negative test) after test
• Posterior probability: Probability of disease after test

Sensitivity and Specificity

[Figure: overlapping distributions along the test-value axis, with the areas corresponding to specificity and sensitivity marked.]


The 2x2 Nightmare

                   Test positive     Test negative
Disease present    True positive     False negative    Sensitivity = True Pos / Tot. Disease
Disease absent     False positive    True negative     Specificity = True Neg / Tot. Well

Pos. Pred. Value: PV(+) = True Pos / Tot. Pos
Neg. Pred. Value: PV(-) = True Neg / Tot. Neg

Diagnostic calculations

                   Test positive    Test negative    Total
Disease present    95               5                100      Sensitivity = 95/100 = 95%
Disease absent     80               320              400      Specificity = 320/400 = 80%

PV(+) = 95/175 = 54%
PV(-) = 320/325 = 98.5%
Pr(Dis | -) = 1.5%
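The four indices in the 2x2 table can be recomputed directly from the cell counts. This is a small sketch rather than a prescribed method; the function name `two_by_two` and the dictionary keys are assumptions made for illustration.

```python
def two_by_two(tp, fn, fp, tn):
    """Return the basic accuracy indices of a 2x2 diagnostic table as fractions."""
    return {
        "sensitivity": tp / (tp + fn),  # True Pos / Tot. Disease
        "specificity": tn / (tn + fp),  # True Neg / Tot. Well
        "ppv": tp / (tp + fp),          # True Pos / Tot. Pos
        "npv": tn / (tn + fn),          # True Neg / Tot. Neg
    }


# Counts from the worked table: 95 and 5 with disease, 80 and 320 without.
result = two_by_two(tp=95, fn=5, fp=80, tn=320)
for name, value in result.items():
    print(f"{name}: {value:.1%}")
```

Note that sensitivity and specificity are computed within rows of the table, while the predictive values are computed within columns, which is why the predictive values depend on how many diseased and well subjects were sampled.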

Bayes Theorem

Post-test odds = pre-test odds x likelihood ratio (a.k.a. Bayes factor)

Semantic confusion

• Positive predictive value = Pr(Dis | T+) = the posterior (post-test) probability of disease after a positive test. It is the probability that you are right if you classify someone as diseased after a positive test.
• Negative predictive value = Pr(No Dis | T-) = 1 - Pr(Dis | T-), i.e. the complement of the posterior probability of disease after a negative test. It is the probability that you are right if you classify someone as non-diseased after a negative test.


Conceptual/Computational Model for Analyzing Diagnostic Test Information

[Figure: odds scale running from 0 (well) to infinite odds (has disease). Starting from the odds of disease before testing (prevalence, clinical suspicion), LR(-) moves to the odds of disease after a negative test and LR(+) moves to the odds of disease after a positive test.]

From probability to odds, and back

Bayes theorem

[Figure: in the probability world, pre-test probability leads to post-test probability via Bayes theorem. In the odds world, pre-test odds x LR = post-test odds.]

Odds Practice

• p = .1: Odds = .1/(1-.1) = .1/.9 = 0.11
• p = .5: Odds = .5/(1-.5) = 1
• p = .9: Odds = .9/.1 = 9
• p = .95: Odds = .95/.05 = 19
• Range of odds is 0 to infinity
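Moving between the probability world and the odds world takes one line in each direction. A minimal sketch, with the helper names `prob_to_odds` and `odds_to_prob` chosen here for illustration:

```python
def prob_to_odds(p):
    """Convert a probability in [0, 1) to odds in [0, infinity)."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


def odds_to_prob(odds):
    """Convert odds in [0, infinity) back to a probability in [0, 1)."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)


for p in (0.1, 0.5, 0.9, 0.95):
    print(f"p = {p}: odds = {prob_to_odds(p):.2f}")

for odds in (0.1, 0.5, 5, 24):
    print(f"odds = {odds}: p = {odds_to_prob(odds):.2f}")
```

The two functions are inverses of each other, so converting to odds, multiplying by a likelihood ratio, and converting back is all that Bayes theorem requires in this form.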


Higher Odds

• Odds = 0.1: p = 0.1/(1+0.1) = 0.09
• Odds = 0.5: p = 0.5/1.5 = 0.33
• Odds = 5: p = 5/6 = 0.83
• Odds = 24: p = 24/25 = 0.96
• Range of p is 0 to 1

Likelihood Ratio Defn.

• The likelihood ratio (LR) is the probability of observing a given result when the patient has disease, divided by the probability of observing that result if they are well.

Likelihood Ratio Eqns.

LR(+) = Sens/(1-Spec)
LR(-) = (1-Sens)/Spec

Diagnostic Tests acronyms

• SnNout: A highly Sensitive test that, when Negative, rules out the disorder (i.e. LR(-) small).
• SpPin: A highly Specific test that, when Positive, rules in the disorder (i.e. LR(+) large).

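The two likelihood ratio equations translate directly into code. The sketch below assumes sensitivity and specificity are given as fractions; `likelihood_ratios` is a name introduced here for illustration, and the four (sens, spec) pairs are example values.

```python
def likelihood_ratios(sens, spec):
    """Return (LR(+), LR(-)) for a dichotomous test.

    LR(+) is undefined when spec == 1 and LR(-) when spec == 0.
    """
    lr_pos = sens / (1 - spec)
    lr_neg = (1 - sens) / spec
    return lr_pos, lr_neg


examples = [(0.95, 0.90), (0.80, 0.95), (0.65, 0.35), (0.90, 0.995)]
for sens, spec in examples:
    lr_pos, lr_neg = likelihood_ratios(sens, spec)
    print(f"Sens {sens:.1%}, Spec {spec:.1%}: "
          f"LR(+) = {lr_pos:.3g}, LR(-) = 1/{1 / lr_neg:.3g}")
```

The third pair shows the uninformative case: when Sens = 1 - Spec, both likelihood ratios equal 1 and the test result leaves the odds unchanged.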

Calculation of LRs

LR(+) = Sens/(1-Spec)
LR(-) = (1-Sens)/Spec

• Sens 95%, Spec 90% ---> LR(+) = 9.5, LR(-) = 1/18
• Sens 80%, Spec 95% ---> LR(+) = 16, LR(-) = 1/4.8
• Sens 65%, Spec 35% ---> LR(+) = 1, LR(-) = 1
• Sens 90%, Spec 99.5% ---> LR(+) = 180, LR(-) = 1/10

Value of LRs

• The degree to which dz probability is affected by a test result.
• Tell us exactly how to combine sensitivity and specificity to inform us about dx.
• Are a measure of the “strength of the evidence” of a test result for the hypothesis that the patient has dz vs. patient does not.
• Do not require a dichotomous test outcome; can be calculated for any test result.

What is a “good” LR?

• 3-5: “Fair”
• 8-12: “Moderate”
• 12-20: “Strong”
• > 20: V. Strong

Diagnostic calculations

Prevalence = 33%, Sensitivity = 80%, Specificity = 90%

What is the probability of disease after a positive test?

• Prevalence odds = 33/67 = 0.5
• LR(+) = 80/(100-90) = 8
• Odds of disease (+) = LR(+) x Prev. Odds = 8 x 0.5 = 4
• Probability of disease (+) = 4/(1+4) = 80%


Computational Model for Analyzing Diagnostic Test Information

[Figure: odds scale from 0 (well) to infinite odds (has disease). Odds of disease before testing = 0.5; LR(+) = 8; odds of disease after positive test = 8 x 0.5 = 4.]

Diagnostic calculations

Prevalence = 33%, Sensitivity = 80%, Specificity = 90%

What is the probability of disease after a negative test?

• LR(-) = 20/90 = 0.22
• Prevalence odds = 33/67 = 0.5
• Odds of disease (-) = LR(-) x Prev. Odds = 0.22 x 0.5 = 0.11
• Probability of disease (-) = 0.11/1.11 = 10%

Conceptual/Computational Model for Analyzing Diagnostic Test Information

[Figure: same odds scale. Odds of disease = 0.5; LR = .22; disease odds = 0.11, disease prob = 0.10.]
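The odds form of Bayes theorem used in these calculations can be sketched as a single function. The name `post_test_probability` and its `positive` flag are assumptions for illustration; prevalence is taken as exactly 1/3, which the slides round to 33%.

```python
def post_test_probability(pre_test_prob, sens, spec, positive=True):
    """Probability of disease after a test result, via pre-test odds x LR."""
    lr = sens / (1 - spec) if positive else (1 - sens) / spec
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)


prevalence = 1 / 3  # assumption: the 33% on the slide is treated as one third
after_pos = post_test_probability(prevalence, sens=0.80, spec=0.90, positive=True)
after_neg = post_test_probability(prevalence, sens=0.80, spec=0.90, positive=False)

print(f"Probability of disease after a positive test: {after_pos:.0%}")
print(f"Probability of disease after a negative test: {after_neg:.0%}")
```

Keeping the likelihood ratio separate from the pre-test odds makes it easy to reuse the same test characteristics at a different prevalence.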

• LR(+) = Sens/(1-Spec) = 35.8/18.1 = 1.98
• LR(-) = (1-Sens)/Spec = 64.2/81.9 = 0.78

Predictive value calculation

• Pre-test probability = 405/(405+914) = 30.7%
• Prior odds = 30.7/69.3 = 0.44
• Post-test odds = 0.44 x LR(+) = 0.44 x 1.98 = 0.87
• Post-test probability = 0.87/1.87 = 47%
• Direct check: 145/(145+165) = 47%

Bayes theorem

[Figure: probability world: prior prob. 0.307 leads to post-test prob. 0.47. Odds world: prior odds 0.44 x LR 1.98 = post-test odds 0.87 (Bayes theorem).]

Negative predictive value calculation

• Pre-test probability = 405/(405+914) = 30.7%
• Prior odds = 30.7/69.3 = 0.44
• Post-test odds = 0.44 x LR(-) = 0.44 x 0.78 = 0.34
• Post-test probability = 0.34/1.34 = 26%
• Predictive value (-) = 100 - 26 = 74%

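The worked example reaches the same predictive values two ways: through prior odds and likelihood ratios, and directly from the counts. The sketch below checks that agreement. The false-negative and true-negative counts are not printed on the slides; they are derived here from the totals (405 with disease, 914 without) and the positives (145 and 165), and the function names are illustrative.

```python
def predictive_values_via_odds(tp, fn, fp, tn):
    """Return (PV(+), PV(-)) computed as prior odds x LR, then back to probability."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    prior_odds = (tp + fn) / (fp + tn)
    odds_pos = prior_odds * sens / (1 - spec)   # odds of disease after a positive test
    odds_neg = prior_odds * (1 - sens) / spec   # odds of disease after a negative test
    ppv = odds_pos / (1 + odds_pos)
    npv = 1 - odds_neg / (1 + odds_neg)
    return ppv, npv


tp, fp = 145, 165
fn = 405 - tp  # derived: diseased subjects who tested negative
tn = 914 - fp  # derived: non-diseased subjects who tested negative

ppv_odds, npv_odds = predictive_values_via_odds(tp, fn, fp, tn)
ppv_direct = tp / (tp + fp)
npv_direct = tn / (tn + fn)

print(f"PV(+): {ppv_odds:.1%} via odds, {ppv_direct:.1%} directly")
print(f"PV(-): {npv_odds:.1%} via odds, {npv_direct:.1%} directly")
```

The two routes agree exactly because the prior odds here are the study's own case-to-control ratio; with a different pre-test probability only the odds route would still apply.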

Should we be testing?

Decisions about testing

Diagnostic setting: Sensitivity = 95%, Specificity = 95%, Prevalence = 50%

• Pre-test odds = 50/50 = 1
• LR(+) = 95/5 = 19
• Disease odds(+) = 1 x 19 = 19
• PV(+) = 19/20 = 95%
• For every 19 true cases worked up, one patient will have an unnecessary work-up.

For this test to be worth using in this setting:
19 x Benefit (true positive) > Harm (false positive)

Should we test? (Cont.)

Screening setting: Sensitivity = 95%, Specificity = 95%, Prevalence = 1/500

• Pre-test odds ≈ 1/500
• LR(+) = 95/5 = 19
• Disease odds(+) = (1/500) x 19 = 0.038 ≈ 0.04
• PV(+) = 0.038/1.038 ≈ 4%
• For every single true case worked up, about 26 patients (1/0.038) will have an unnecessary work-up.

For this test to be worth using in this setting:
Benefit (true positive) > 26 x Harm (false positive)

Effect of lowering prevalence

[Figure: positive test region on the test-value axis as prevalence is lowered.]

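The diagnostic and screening settings differ only in prevalence, so the trade-off between true and false positives can be sketched with one function. The names `false_pos_per_true_pos` and `worth_testing` are hypothetical, and the benefit and harm figures in the usage lines are arbitrary illustrative units, not values from the slides.

```python
def false_pos_per_true_pos(prevalence, sens, spec):
    """Expected number of false positives for each true positive."""
    true_pos = prevalence * sens
    false_pos = (1 - prevalence) * (1 - spec)
    return false_pos / true_pos


def worth_testing(prevalence, sens, spec, benefit_tp, harm_fp):
    """Decision rule: total benefit to true positives must exceed total harm to false positives."""
    true_pos = prevalence * sens
    false_pos = (1 - prevalence) * (1 - spec)
    return true_pos * benefit_tp > false_pos * harm_fp


diagnostic = false_pos_per_true_pos(0.5, sens=0.95, spec=0.95)
screening = false_pos_per_true_pos(1 / 500, sens=0.95, spec=0.95)
print(f"Diagnostic setting: {diagnostic:.3f} false positives per true positive")
print(f"Screening setting:  {screening:.1f} false positives per true positive")

# Illustrative units only: a true positive is worth 10, a false positive costs 1.
print(worth_testing(0.5, 0.95, 0.95, benefit_tp=10, harm_fp=1))
print(worth_testing(1 / 500, 0.95, 0.95, benefit_tp=10, harm_fp=1))
```

With identical sensitivity and specificity, the same benefit-to-harm ratio that easily justifies testing at 50% prevalence fails to justify it at 1 in 500.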

Decisions about screening

• There must be a clear benefit that results from screening, and this must outweigh the cost of testing (infrastructure, and harm to false positives and false negatives).
• Alternatives to screening/testing include:
  - Doing nothing: waiting for disease to become manifest (e.g. PSA controversy)
  - Doing “next step” on everyone (e.g. Strep throat)
  - Screening only subsets at higher risk

Key Dx Test Questions

• “Phase” of study? Retrospective vs. prospective?
• Valid reference standard, done in all patients, or in a defined random sample of cases and controls.
• Description of disease spectrum, comorbidities and other clinical characteristics in all patients.
• Clear test procedures, inc. training, interpretations and reproducibility. Done blind to reference standard?
• Comparison with alternatives. Incremental value.
• Appropriate indices, w/variability.
• How is it proposed to be used, and has this use been shown to impact clinical outcomes?

Puzzler #1

• PET scanning is being assessed as a Dx test for schizophrenia.
• 20 schizophrenics have already had the test as part of a pilot feasibility study.
• 20 normal controls are chosen among general medicine clinic patients for comparison.
• Results: PET scan positive in 17/20 schizophrenics, 9/20 controls.
• Conclusion: Useful test.

Puzzler #2

• New radioisotope is being evaluated for use in diagnosing MIs.
• Test is applied to large ER population presenting with angina.
• Gold standard is combination of ECG changes and cardiac enzyme elevation.
• Sensitivity = 96%, Specificity = 93%
• Conclusion: Useful screening test.
• Recommendation: Implement in all ERs.


Liquid Crystal Thermography as a Screening Test For Deep-Vein Thrombosis
Sandler DA and Martin JF, Lancet, 1985; 1:665-7

• Background: Poor correlation between clinical signs and objectively proven DVT. Venograms have significant expense and morbidity, making a universal screening test for DVTs desirable.
• Patients: “Patients with sx or signs suggesting unilateral, lower limb, DVT were studied with a standardized clinical exam and LCT before X-ray venography.”
• Diagnostic test: “Criteria have been described...” “If there is a homogenous area of increased temperature in the symptomatic limb...This was done w/o knowledge of either the side of the suspected thrombus or the result of the venogram.”

Liquid Crystal Thermography as a Screening Test

• Sensitivity (34/35) = 97.1%
• Specificity (28/45) = 62.2%
• LR(+) = 2.6
• LR(-) = 1/22
• PV(+) = 34/51 = 66.7%
• PV(-) = 28/29 = 96.6% [n.b., 95% CI: 82% to 99.9%]
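The thermography indices and the interval around PV(-) can be recomputed from the reported counts. The slides do not say how the interval was obtained; the sketch below assumes an exact (Clopper-Pearson) binomial interval, found by bisection with the standard library only. The function names are illustrative.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def exact_ci(x, n, alpha=0.05):
    """Clopper-Pearson two-sided confidence interval for the proportion x/n."""

    def root(f):
        # f is increasing in p on (0, 1); bisect for the p where f(p) = 0.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) < 0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit: P(X >= x | p) = alpha/2. Upper limit: P(X <= x | p) = alpha/2.
    lower = 0.0 if x == 0 else root(lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2)
    upper = 1.0 if x == n else root(lambda p: alpha / 2 - binom_cdf(x, n, p))
    return lower, upper


sens = 34 / 35
spec = 28 / 45
lr_pos = sens / (1 - spec)
lr_neg = (1 - sens) / spec
npv_lo, npv_hi = exact_ci(28, 29)

print(f"LR(+) = {lr_pos:.2f}, LR(-) = 1/{1 / lr_neg:.1f}")
print(f"PV(-) = {28 / 29:.1%}, 95% CI {npv_lo:.1%} to {npv_hi:.1%}")
```

With only 29 negative thermograms, the lower limit is the number to weigh before withholding anticoagulants on the basis of a negative result.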

Liquid Crystal Thermography as a Screening Test

• Conclusions: “We would propose that any patients in whom DVT is clinically suspected should undergo LCT. Anticoagulants may be withheld from those with a negative thermogram. Since a positive thermogram still has a 1/3 chance of being a false-positive, patients with a positive LCT could undergo [technetium] venoscanning, which would give a correct diagnosis in 2/3rds of them. The remainder would then undergo the definitive test, X-ray venography. This diagnostic approach...would reduce morbidity, time and cost; the chance of missing a DVT because of a false-negative thermogram on initial screening would be 1.25%.”

PET in Breast Cancer Staging
Smith, et al. Annals of Surgery, 1998

• Goal: Find non-invasive means to dx spread of breast ca to local nodes
• Patients: 50 women w/breast cancer in Aberdeen Royal Infirmary Breast Clinic.
  - Referred? Consecutive?
• Test: PET scan, read by two observers, blinded to clinical information and surgical staging.
• Reference test: Surgical exploration of axilla with nodal pathology or FNA.


PET in Breast Cancer Staging
Smith, et al. Annals of Surgery, 1998

• Results of pathology
  - Total nodes examined: 425
  - Total # of patients with involved nodes: 21
• Results of PET
  - Sensitivity: 90% (19/21) [95% CI 70% to 99%]
  - Specificity: 97% (28/29) [95% CI 82% to 99.9%]
  - PPV: 95%
  - NPV: 96%

PET in Breast Cancer Staging
Smith, et al. Annals of Surgery, 1998

TNM Stage     Sensitivity, % (n/N) [CI]       Specificity, % (n/N) [CI]
T1            100 (1/1)  [2.5% - 100%]        100 (7/7)  [82% - 99%]
T2            78 (7/9)   [40% - 97%]          92 (11/12) [66% - 99+%]
T3            100 (5/5)  [48% - 100%]         100 (4/4)  [40% - 100%]
T4            100 (6/6)                       100 (4/4)
N1 and N2     91 (10/11)                      100 (3/3)

Claims…

• “The results from this study show that PET can accurately (94%) and reliably (PV>94%) stage the axilla…”
• “…a preponderance of patients with large or locally advanced cancer.”
• “PET…had a sensitivity and specificity of 100% in seven patients with T1N0 disease.”
• “On the evidence obtained…PET could help determine treatment options in postmenopausal women….PET may have the advantage of obviating the need for additional staging investigations.”

Key Dx Test Questions

• Retrospective vs. prospective study.
• Valid reference standard, done in all patients, i.e. not affected by dx. test result.
• Description of disease spectrum, comorbidities and other clinical characteristics in all patients.
• Clear test procedures, inc. interpretations and reproducibility. Done blind to reference standard?
• Comparison with alternatives.
• Appropriate indices, w/variability.
• How is it proposed to be used, and has this use been shown to impact clinical outcomes?
