Critical Appraisal of Medical Literature – Short course

What is Critical appraisal?

Critical appraisal is the systematic evaluation of clinical research papers in order to establish:

• Does the study address a clearly focused question?
• Were valid methods used to address this question?
• Are the study results important?
• Are these results applicable to my patient or population?

Resources for critical appraisal and reporting guidelines are available from the EQUATOR Network, the JAMA Users' Guides, the Oxford Centre for Evidence-Based Medicine, SQUIRE and TREND.

How to do Critical appraisal?

The steps involved in critical appraisal are:

1. Choose a journal article relevant to your research question
2. Read the article in detail, especially the section on methods
3. Answer the critical appraisal questions
4. Make a presentation on critical appraisal of the article

Quick references for each study design:

The critical appraisal questions for each study design are given below. The questions are based on the JAMA Users' Guides to the Medical Literature. We strongly recommend reading the guide in detail before carrying out a critical appraisal.

DIAGNOSTIC TEST

How serious is the risk of bias?

Did participating patients constitute a representative sample of those presenting with a diagnostic dilemma?

Did investigators compare the test to an appropriate, independent reference standard?

Were those interpreting the test and reference standard blind to the other result?

Did all patients receive the same reference standard irrespective of the test results?

What are the results?

What likelihood ratios were associated with the range of possible test results?

How can I apply the results to patient care?

Will the reproducibility of the test results and their interpretation be satisfactory in my clinical setting?

Are the study results applicable to the patients in my practice?

Will the test results change my management strategy?

Will patients be better off as a result of the test?

CASP Checklist: 12 questions to help you make sense of a Diagnostic Test study

How to use this appraisal tool: Three broad issues need to be considered when appraising a trial:

Are the results of the study valid? (Section A)
What are the results? (Section B)
Will the results help locally? (Section C)

The 12 questions on the following pages are designed to help you think about these issues systematically. The first two questions are screening questions and can be answered quickly. If the answer to both is “yes”, it is worth proceeding with the remaining questions. There is some degree of overlap between the questions. You are asked to record a “yes”, “no” or “can’t tell” to most of the questions. A number of italicised prompts are given after each question. These are designed to remind you why the question is important. Record your reasons for your answers in the spaces provided.

About: These checklists were designed to be used as educational pedagogic tools, as part of a workshop setting, therefore we do not suggest a scoring system. The core CASP checklists (randomised controlled trial & systematic review) were based on the JAMA 'Users’ guides to the medical literature' 1994 (adapted from Guyatt GH, Sackett DL, and Cook DJ), and piloted with health care practitioners. For each new checklist, a group of experts was assembled to develop and pilot the checklist and the workshop format with which it would be used. Over the years overall adjustments have been made to the format, but a recent survey of checklist users reiterated that the basic format continues to be useful and appropriate.

Referencing: we recommend using the Harvard style citation, e.g.: Critical Appraisal Skills Programme (2018). CASP (insert name of checklist, e.g. Diagnostic Test Study) Checklist. [online] Available at: URL. Accessed: Date Accessed.

©CASP This work is licensed under the Creative Commons Attribution – Non-Commercial – Share Alike licence. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ www.casp-uk.net

Critical Appraisal Skills Programme (CASP), part of Oxford Centre for Triple Value Healthcare Ltd
www.casp-uk.net

Paper for appraisal and reference:

Section A: Are the results of the trial valid?

1. Was there a clear question for the study to address?
Yes / Can’t Tell / No
HINT: A question should include information about
• the population
• the test
• the setting
• the outcomes

Comments:

2. Was there a comparison with an appropriate reference standard?
Yes / Can’t Tell / No
HINT: Is this reference test(s) the best available indicator in the circumstances

Comments:

Is it worth continuing?

3. Did all patients get the diagnostic test and reference standard?
Yes / Can’t Tell / No
HINT: Consider
• were both received regardless of the results of the test of interest
• Check the 2x2 table (verification bias)

Comments:


4. Could the results of the test have been influenced by the results of the reference standard?
Yes / Can’t Tell / No
HINT: Consider
• was there blinding
• were the tests performed independently
• review bias

Comments:

5. Is the disease status of the tested population clearly described?
Yes / Can’t Tell / No
HINT: Consider
• presenting symptoms
• disease stage or severity
• co-morbidity
• differential diagnoses (spectrum bias)

Comments:

6. Were the methods for performing the test described in sufficient detail?
Yes / Can’t Tell / No
HINT: Consider
• was a protocol followed

Comments:

Section B: What are the results?


7. What are the results?
HINT: Consider
• are the sensitivity and specificity and/or likelihood ratios presented
• are the results presented in such a way that we can work them out

Comments:

8. How sure are we about the results? Consequences and cost of alternatives performed?
HINT: Consider
• could they have occurred by chance
• are there confidence limits
• what are they

Comments:
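Question 8 asks whether confidence limits are reported. When a paper gives only the raw counts, an interval can be worked out by hand. The sketch below uses the Wilson score interval, which is one common choice for a proportion such as sensitivity or specificity; the checklist itself does not prescribe a method, and the function name and the counts (45 positives among 50 diseased patients) are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion.

    successes / n is the observed proportion (e.g. sensitivity = TP / (TP + FN)).
    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: 45 of 50 patients with the disease tested positive.
low, high = wilson_interval(45, 50)
print(f"sensitivity 0.90, 95% CI {low:.2f} to {high:.2f}")
```

A wide interval suggests the study was too small to pin down the test's accuracy, which bears directly on how much weight the result deserves.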

Section C: Will the results help locally?

Consider whether you are primarily interested in the impact on a population or individual level

9. Can the results be applied to your patients/the population of interest?
Yes / Can’t Tell / No
HINT: Do you think your patients/population are so different from those in the study that the results cannot be applied, such as age, sex, ethnicity and spectrum bias

Comments:

10. Can the test be applied to your patient or population of interest?
Yes / Can’t Tell / No
HINT: Consider
• resources and opportunity costs
• level and availability of expertise required to interpret the tests
• current practice and availability of services

Comments:

11. Were all outcomes important to the individual or population considered?
Yes / Can’t Tell / No
HINT: Consider
• will the knowledge of the test result improve patient wellbeing
• will the knowledge of the test result lead to a change in patient management

Comments:

12. What would be the impact of using this test on your patients/population?

Comments:


Critical Appraisal of a Diagnostic Test Study

Article · February 2016


1 author:

Leonardo Roever (BRAMETIS) Brazilian Network of Research in Meta-analysis


Roever, Evidence Based Medicine and Practice 2015, 1:1
http://dx.doi.org/10.4172/EBMP.1000e104

Editorial Open Access Journal

Critical Appraisal of a Diagnostic Test Study

Leonardo Roever*
Department of Clinical Research, Federal University of Uberlândia, Uberlândia, Brazil

*Corresponding author: Leonardo Roever, Department of Clinical Research, Av Pará, 1720 - Bairro Umuarama, Uberlândia-MG-CEP 38400-902, Brazil, Tel: +553488039878; E-mail: [email protected]

Rec Date: 30 November, 2015; Acc Date: 07 December, 2015; Pub Date: 14 December, 2015

Copyright: © 2015 Roever L. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

Being able to evaluate a diagnostic test is not an easy task. Diagnostic tests are invaluable tools used to distinguish between patients having a disease and those who have not. It is essential to be able to critically appraise published articles on a diagnostic test. The list of questions below can help you appreciate and understand diagnostic studies better. Table 1 shows the checklist needed to make a critical analysis of a diagnostic test study [1-12].

Appraisal questions

Was there a clear question for the study to address? A question should include information about: population, test, setting and outcomes. A consecutive sequence or random selection of patients is enrolled. Inappropriate exclusions are avoided. Included patients and settings match the key question.

Was the diagnostic test evaluated in a representative spectrum of patients (like those in whom it would be used in practice)? Was the reference standard applied regardless of the diagnostic test result? Was the test (or cluster of tests) validated in a second, independent group of patients?

Was there a comparison with an appropriate reference standard? Is this reference test(s) the best available indicator in the circumstances? Was there an independent, blind comparison between the index test and an appropriate reference ('gold') standard of diagnosis? The reference standard is likely to correctly identify the target condition. Reference standard results are interpreted without knowledge of the results of the index test. The target condition as defined by the reference standard matches that found in the target population of the guideline.

Could the results of the test have been influenced by the results of the reference standard? Was there blinding? Were the tests performed independently (review bias)? Were the methods for performing the test described in sufficient detail to permit replication?

Is the disease status of the tested population clearly described?
• Presenting symptoms
• Disease stage or severity
• Co-morbidity
• Differential diagnoses (spectrum bias)

Were the methods for performing the test described in sufficient detail? Was a protocol followed?

What are the results? Are the sensitivity and specificity and/or likelihood ratios presented? Are the results presented in such a way that we can work them out?

Did all patients get the diagnostic test and reference standard? Were both received regardless of the results of the test of interest? Check the 2 × 2 table (verification bias). Was the reference standard applied regardless of the index test result? Is the diagnostic test available, affordable, accurate, and precise in your setting? The index test results are interpreted without knowledge of the results of the reference standard. If a threshold is used, it is pre-specified. The index test, its conduct, and its interpretation are similar to that used in practice with the target population of the guideline. There is an appropriate interval between the index test and reference standard. All patients receive the same reference standard. All patients recruited into the study are included in the analysis.

Are test characteristics presented? What is the measure? What does it mean? Can you generate a clinically sensible estimate of your patient’s pre-test probability (from personal experience, statistics, practice databases, or primary studies)? Are the study patients similar to your own? Is it unlikely that the disease possibilities or probabilities have changed since the evidence was gathered?

Are the valid results of this diagnostic study important?

Sample Calculations

                                   Target disorder
                                   Present      Absent       Totals
Diagnostic test result  Positive   a (TP)       b (FP)       a+b
                        Negative   c (FN)       d (TN)       c+d
                        Totals     a+c          b+d          a+b+c+d

Sensitivity = a/(a+c)
Specificity = d/(b+d)
Likelihood ratio for a positive test result = LR+ = sens/(1-spec)
Likelihood ratio for a negative test result = LR- = (1-sens)/spec
Positive Predictive Value = a/(a+b)
Negative Predictive Value = d/(c+d)
Pre-test probability (prevalence) = (a+c)/(a+b+c+d)
Pre-test odds = prevalence/(1-prevalence)
Post-test odds = pre-test odds × LR
Post-test probability = post-test odds/(post-test odds+1)
Accuracy = (TP+TN)/(TP+FP+TN+FN)
Diagnostic odds ratio = LR+/LR- = (TP/FN)/(FP/TN)
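The sample calculations above translate directly into a few lines of Python. The following is a minimal sketch, not part of the original article: the function name and the illustrative counts (a=90, b=30, c=10, d=170) are hypothetical, and the code assumes no cell of the 2 × 2 table is zero.

```python
def diagnostic_measures(a, b, c, d):
    """Accuracy measures from a 2 x 2 table.

    a = true positives (TP), b = false positives (FP),
    c = false negatives (FN), d = true negatives (TN).
    A zero cell can raise ZeroDivisionError; this sketch does not handle that.
    """
    total = a + b + c + d
    sens = a / (a + c)
    spec = d / (b + d)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "lr_pos": sens / (1 - spec),
        "lr_neg": (1 - sens) / spec,
        "ppv": a / (a + b),
        "npv": d / (c + d),
        "prevalence": (a + c) / total,
        "accuracy": (a + d) / total,
        "diagnostic_odds_ratio": (a / c) / (b / d),
    }


# Hypothetical counts, for illustration only.
for name, value in diagnostic_measures(90, 30, 10, 170).items():
    print(f"{name}: {value:.3f}")
```

Note that the diagnostic odds ratio computed from the cell counts equals LR+ divided by LR-, which gives a quick internal check on figures reported in a paper.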

Citation: Roever L (2015) Critical Appraisal of a Diagnostic Test Study. Evidence Based Medicine and Practice 1: 1000e104. doi:10.4172/EBMP.1000e104

Evidence Based Medicine and Practice, Volume 1 • Issue 1 • e104

How sure are we about the results? Consequences and cost of alternatives performed? Could they have occurred by chance? Are there confidence limits? What are they?

Can the results be applied to your patients/the population of interest? Do you think your patients/population are so different from those in the study that the results cannot be applied, for example in age, sex, ethnicity and spectrum bias?

Can the test be applied to your patient or population of interest?
• Resources and opportunity costs
• Level and availability of expertise required to interpret the tests
• Current practice and availability of services

Will the resulting post-test probabilities affect your management and help your patient? Could it move you across a test-treatment threshold? Would your patient be a willing partner in carrying it out?

Were all outcomes important to the individual or population considered? Will the knowledge of the test result improve patient wellbeing? Will the knowledge of the test result lead to a change in patient management?

What would be the impact of using this test on your patients/population? How well was the study done to minimise bias? What is your assessment of the applicability of this study to our target population? Would the consequences of the test help your patient?

Table 1: Critical appraisal of a diagnostic test study.

Using this checklist can improve the evaluation of diagnostic test studies.

References

1. Manikandan R, Dorairajan LN (2011) How to appraise a diagnostic test. Indian J Urol 27: 513-519.
2. Leeflang MM, Deeks JJ, Gatsonis C, Bossuyt PM; Cochrane Diagnostic Test Accuracy Working Group (2008) Systematic reviews of diagnostic test accuracy. Ann Intern Med 149: 889-897.
3. Eusebi P (2013) Diagnostic accuracy measures. Cerebrovasc Dis 36: 267-272.
4. http://www.cebm.net/wp-content/uploads/2014/04/diagnostic-study-appraisal-worksheet.pdf
5. http://www.caspinternational.org/
6. http://www.caspinternational.org/mod_product/uploads/CASP_Diagnostic_Checklist_14.10.10.pdf
7. http://media.wix.com/ugd/dded87_3815f02af1b34c21b8c3b2b5020024c3.pdf
8. Koffijberg H, van Zaane B, Moons KGM (2013) From accuracy to patient outcome and cost-effectiveness evaluations of diagnostic tests and biomarkers: an exemplary modelling study. BMC Med Res Methodol 13: 12.
9. Montori VM, Wyer P, Newman TB, Keitz S, Guyatt G; Evidence-Based Medicine Teaching Tips Working Group (2005) Tips for learners of evidence-based medicine: 5. The effect of spectrum of disease on the performance of diagnostic tests. CMAJ 173: 385-390.
10. Altman DG, Bland JM (1994) Diagnostic tests 1: Sensitivity and specificity. BMJ 308: 1552.
11. Altman DG, Bland JM (1994) Diagnostic tests 2: Predictive values. BMJ 309: 102.
12. Deeks JJ, Altman DG (2004) Diagnostic tests 4: likelihood ratios. BMJ 329: 168-169.


Diagnostic test studies: assessment and critical appraisal – BMJ Best Practice (12/7/2019)
https://bestpractice.bmj.com/info/toolkit/learn-ebm/diagnostic-test-studies-assessment-and-critical-appraisal/

There are many checklists available for the assessment and critical appraisal of diagnostic test studies, as reporting is frequently inadequate.[1][2] However, they all include some variation of three critical questions;[2][3] these are:

• Is this study valid?
• Does the diagnostic test under assessment accurately distinguish between people who do and do not have the specific disorder?
• Can I apply this valid, accurate diagnostic test to a specific patient?

Assessment

How do we assess if a diagnostic test study is valid? We can assess whether our study is valid by considering these questions:

1. Was there an independent, blind comparison with a reference (gold) standard of diagnosis? What does that mean?

• That patients in the study should have undergone both the index diagnostic test and the reference (gold) standard. Why? To confirm or refute the findings of the index test. The accuracy of the test can be overestimated if you perform the index test initially in people that you know have the disease and then separately in healthy people (case-control studies do this) rather than performing both the index and reference tests in the same group of people without knowing whether or not they have the disease you are trying to diagnose.[4]

• That the people assessing the results of the index test are blind to the results of the reference standard. Why? To avoid biasing the results of the index test or the reference standard. Interpreting the results of the reference test while already knowing the results of the index test can lead to an overestimation of the index test accuracy, especially if the reference test is open to subjective interpretation.[4] Blinding is less important if the results of the test are objective (e.g., serodiagnostic tests for tuberculosis where sputum culture results are analysed) than if results require clinical interpretation (e.g., MRI images for diagnosing rotator cuff injury).

2. Was the diagnostic test evaluated in an appropriate spectrum of patients (like those a clinician would see in practice)? What does that mean?

• Did the study include people with all the common presentations of the target disorder, with symptoms of early manifestations as well as more severe symptoms, and/or people with other disorders that are commonly confused with the target disorder when diagnosing? Why? Studies that only include people with obvious symptoms versus people with no symptoms are not very useful! If you can diagnose something by eye, why would you need a diagnostic test?

3. Was the reference standard applied regardless of the index diagnostic test result? What does that mean?

• If the patient has a negative index test result, the investigators sometimes do not carry out the reference standard test to confirm the negative result, especially if the test is invasive or risky, as this may be unethical. To overcome this, investigators employ an alternative reference standard for proving that the patient does not have the target disorder, which is long-term follow-up to assess that there are no adverse effects associated with the target disorder present without any treatment. Why? To confirm the accuracy of the index test: in other words, that the negative result of the index test is in fact the correct result for the patient and he/she definitely doesn’t have the disease.

4. Was the test validated in a second independent group of patients? What does that mean?

• When a new diagnostic test is evaluated, there is a risk that the results in the initial assessment are caused by other factors: for example, something about that specific group of patients included in the study (e.g., they represent only patients with advanced symptoms of the disease). So, to prove the results are reliable and replicable, the new diagnostic test should be evaluated in a second independent (or test) group of patients. Why? If the results in this second group of patients are similar to the results in the first group of patients, then we can be reassured about the test accuracy. If no test set study has been carried out, then maybe we need to reserve judgement.

In conclusion: If the study that we are evaluating fails any of these 4 criteria, we need to consider whether the flaws of the study make the results invalid.

How do we assess the results of the test?

There are two types of result commonly reported in diagnostic test studies. One concerns the accuracy of the test and is reflected in the sensitivity and specificity, often defined as the test’s ability to find true positives for the disorder (sensitivity) or true negatives for the disorder (specificity). An ideal diagnostic test finds no false positives but at the same time misses no one with the disease (finds no false negatives): much easier said than done!

The other concerns how the test performs in the population being tested and is reflected in predictive values (also called post-test probabilities) and likelihood ratios. To give brief definitions of these terms consider this example (based on reference[5]):

1000 elderly people with suspected dementia undergo an index test and a reference standard. The prevalence of dementia in this group is 25%. 240 people tested positive on both the index test and the reference standard and 600 people tested negative on both tests. The remaining 160 people had inaccurate test results.

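The first step is to draw a 2×2 table. We are told that the prevalence of dementia is 25%; therefore, we can fill in the last row of totals (25% of 1000 people is 250), so 250 people will have dementia and 750 will be free of dementia. We also know the number of people testing positive and negative on both tests and so we can fill in two more cells of the table:

                        Dementia present   Dementia absent   Total
Index test positive           240                 ?             ?
Index test negative            ?                 600            ?
Total                         250                750          1000

By subtraction we can easily complete the table:

                        Dementia present   Dementia absent   Total
Index test positive           240                150           390
Index test negative            10                600           610
Total                         250                750          1000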

Now we are ready to calculate the various measures.
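The subtraction step described above can be written out as a short Python sketch. The function name and dictionary keys are hypothetical choices made for this illustration; the input figures (1000 people, 25% prevalence, 240 true positives, 600 true negatives) come from the worked example.

```python
def complete_two_by_two(total, prevalence, true_pos, true_neg):
    """Fill in the missing cells of a 2x2 table by subtraction."""
    diseased = round(total * prevalence)   # column total: condition present
    healthy = total - diseased             # column total: condition absent
    false_neg = diseased - true_pos        # people with the condition the test missed
    false_pos = healthy - true_neg         # people without it that the test flagged
    return {
        "TP": true_pos,
        "FP": false_pos,
        "FN": false_neg,
        "TN": true_neg,
        "test_positive": true_pos + false_pos,
        "test_negative": false_neg + true_neg,
    }


table = complete_two_by_two(total=1000, prevalence=0.25, true_pos=240, true_neg=600)
print(table)
```

The 160 people with inaccurate results mentioned in the example are the false positives and false negatives added together, which is a useful cross-check on the completed table.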

Term, definition and example for each measure:

Pre-test probability = (true positive + false negative)/total number of people
Definition: This measure tells us the probability of having a target condition before a diagnostic test.
Example: In this example: 250/1000 = 0.25. What does this mean? This is the probability of a patient in this study having dementia before the tests are run.

Sensitivity (Sn) = the proportion of people with the condition who have a positive test result
Definition: The sensitivity tells us how well the test identifies people with the condition. A highly sensitive test will not miss many people.
Example: In our example, the Sn = 240/250 = 0.96. What does that mean? 10 (4%) people with dementia were falsely identified as not having it, as opposed to the 240 (96%) people who were correctly identified as having dementia. This means the test is fairly good at identifying people with the condition.

Specificity (Sp) = the proportion of people without the condition who have a negative test result
Definition: The specificity tells us how well the test identifies people without the condition. A highly specific test will not falsely identify many people as having the condition.
Example: In our example, the Sp = 600/750 = 0.80. What does that mean? 150 (20%) people without dementia were falsely identified as having it. This means the test is only moderately good at identifying people without the condition.

Positive predictive value (PPV) = the proportion of people with a positive test who have the condition
Definition: This measure tells us how well the test performs in this population. It is dependent on the accuracy of the test (primarily specificity) and the prevalence of the condition.
Example: In our example, the PPV = 240/390 = 0.62. What does that mean? Of the 390 people who had a positive test result, 62% will actually have dementia.

Negative predictive value (NPV) = the proportion of people with a negative test who do not have the condition
Definition: This measure tells us how well the test performs in this population. It is dependent on the accuracy of the test and the prevalence of the condition.
Example: In our example, the NPV = 600/610 = 0.98. What does that mean? Of the 610 people with a negative test, 98% will not have dementia.

Likelihood ratio for positive results (LR+) = sensitivity/the % of people falsely identified as having the disorder
Definition: This measure tells us how well the test performs in this population. It is dependent on the accuracy of the test for positive results (sensitivity) and the proportion of people falsely identified as having the target condition. A likelihood ratio of >1 indicates the test result is associated with the disease.
Example: In this example the LR+ = 96/20 = 4.8. What does that mean? People with dementia are 4.8 times more likely to have a positive test result than someone without dementia.

Likelihood ratio for negative results (LR–) = the % of people with the disorder identified as not having it/% specificity
Definition: This measure tells us how well the test performs in this population. It is dependent on the accuracy of the test for negative results (specificity) and the proportion of people with the target condition falsely identified as not having the target condition. A likelihood ratio < 1 indicates that the result is associated with absence of the disease.
Example: In this example LR– = 4/80 = 0.05. What does that mean? A negative test result is 0.05 times as likely (that is, 20 times less likely) in someone with dementia as in someone without dementia.
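Likelihood ratios become clinically useful once they are combined with a pre-test probability. The sketch below is an illustration rather than part of the BMJ page: it converts the 25% prevalence stated in the worked example into odds, multiplies by a likelihood ratio, and converts back to a probability. The function name is hypothetical.

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via odds."""
    if not 0 < pre_test_prob < 1:
        raise ValueError("pre_test_prob must be strictly between 0 and 1")
    pre_test_odds = pre_test_prob / (1 - pre_test_prob)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (post_test_odds + 1)


# Dementia example: prevalence 25%, LR+ = 4.8, LR- = 0.05.
after_positive = post_test_probability(0.25, 4.8)
after_negative = post_test_probability(0.25, 0.05)
print(f"after a positive test: {after_positive:.3f}")
print(f"after a negative test: {after_negative:.3f}")
```

With these inputs the probability after a positive test matches the PPV (240/390), and the probability after a negative test matches 1 minus the NPV (10/610), because the pre-test probability used here is the study's own prevalence. For a patient with a different pre-test probability the same likelihood ratios give different post-test probabilities, which is why likelihood ratios transfer between settings more readily than predictive values.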

How to apply the diagnostic test to a specific patient:

Having found a valid diagnostic test study, and decided that its accuracy is sufficiently high to make it a useful tool, here are some useful points to consider when applying the test to a specific patient:

• Is the test available, affordable, and accurate in our setting?
• Can a clinically sensible estimate of the pre-test probabilities of the patient be made from personal experience, prevalence statistics, practice databases, or primary studies?
• Are the study patients similar to the patient in question?
• How current is the study we are analysing: has evidence moved on since the publication of the study?

Will the post-test probability affect the management of the specific patient?

• Could the result move the clinician across a test-treatment threshold: for example, could the results of the test stop all further testing? That is, rule the target disorder out so the clinician would stop pursuing that possibility, or make a firm diagnosis of the target disorder and move on to choosing appropriate treatment options.
• Will the patient be willing to have the test carried out?
• Will the results of the test help the patient reach their goals?
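The idea of a test-treatment threshold can be made concrete with a small sketch. The two threshold values below (5% to stop testing, 60% to start treatment) are purely hypothetical and are not taken from any source; in practice they depend on the condition, the harms of treatment and the patient's preferences. The function name is also an assumption.

```python
def next_step(post_test_prob, test_threshold=0.05, treatment_threshold=0.60):
    """Classify a post-test probability against two decision thresholds.

    Both thresholds are hypothetical placeholders for illustration only.
    """
    if not 0 <= post_test_prob <= 1:
        raise ValueError("post_test_prob must be between 0 and 1")
    if post_test_prob < test_threshold:
        return "rule out: stop testing for this disorder"
    if post_test_prob >= treatment_threshold:
        return "rule in: move on to treatment options"
    return "uncertain: further testing needed"


# Post-test probabilities from the dementia example (240/390 and 10/610).
print(next_step(240 / 390))
print(next_step(10 / 610))
print(next_step(0.25))
```

A test is only worth ordering if at least one of its possible results would move the probability across one of these thresholds; otherwise management stays the same whatever the result.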

Critical appraisal

Based on the information given in the Assessment section above, the table below gives some basic check points to look for when critically appraising a diagnostic test study. This list is by no means comprehensive, but should cover all the main issues. The main focus of the list is the first two questions based on validity and the importance of the results.


References

1. Bossuyt PM, Reitsma JB, Bruns DE, et al. Towards a complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Clin Chem 2003;49:1–6. https://www.ncbi.nlm.nih.gov/pubmed/12507953
2. CASP UK. Critical Appraisal Skills Programme (CASP). https://www.casp-uk.net (last accessed 9 March 2017).
3. Sackett DL, Straus SE, Richardson ES, et al. Evidence-based medicine; how to practice and teach EBM. 2nd ed. Edinburgh: Churchill Livingstone, 2000.
4. Lijmer JG, Mol BW, Heisterkamp S, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999;282:1061–1066. https://www.ncbi.nlm.nih.gov/pubmed/10493205
5. Centre for Evidence Based Medicine. https://www.cebm.net/likelihood-ratios/ (last accessed 9 March 2017).

Learn EBM

What is EBM? (https://bestpractice.bmj.com/info/toolkit/learn-ebm/what-is-ebm/)

How to clarify a clinical question (https://bestpractice.bmj.com/info/toolkit/learn-ebm/how-to-clarify-a-clinical-question/)

Design the search (https://bestpractice.bmj.com/info/toolkit/learn-ebm/design-the-search/)

Where to look for research evidence (https://bestpractice.bmj.com/info/toolkit/learn-ebm/where-to-look-for-research-evidence/)

Study design search filters (https://bestpractice.bmj.com/info/toolkit/learn-ebm/study-design-search-filters/)

Case study of a search (https://bestpractice.bmj.com/info/toolkit/learn-ebm/case-study-of-a-search/)

Appraise the evidence (https://bestpractice.bmj.com/info/toolkit/learn-ebm/appraise-the-evidence/)

Appraising 2-armed randomised controlled trials (https://bestpractice.bmj.com/info/toolkit/learn-ebm/appraising-2-armed-randomised-controlled-trials/)

Appraising multiple-armed RCTs (https://bestpractice.bmj.com/info/toolkit/learn-ebm/appraising-multiple-armed-rcts/)

Diagnostic test studies: assessment and critical appraisal (https://bestpractice.bmj.com/info/toolkit/learn-ebm/diagnostic-test-studies-assessment-and-critical-appraisal/)

Appraising systematic reviews (https://bestpractice.bmj.com/info/toolkit/learn-ebm/appraising-systematic-reviews/)

Multiple systematic reviews on the same question (https://bestpractice.bmj.com/info/toolkit/learn-ebm/multiple-systematic-reviews-on-the-same-question/)

Synthesise the evidence (https://bestpractice.bmj.com/info/toolkit/learn-ebm/synthesise-the-evidence/)

What is GRADE? (https://bestpractice.bmj.com/info/toolkit/learn-ebm/what-is-grade/)

Understanding statistics: other resources (https://bestpractice.bmj.com/info/toolkit/learn-ebm/understanding-statistics-other-resources/)

How to calculate risk (https://bestpractice.bmj.com/info/toolkit/learn-ebm/how-to-calculate-risk/)

EBM Toolkit home Learn, Practise, Discuss, Tools (https://bestpractice.bmj.com/info/toolkit/)

Diagnostic Study Appraisal Worksheet

DIAGNOSTIC ACCURACY STUDIES

Step 1: Are the results of the study valid?

Was the diagnostic test evaluated in a representative spectrum of patients (like those in whom it would be used in practice)?

What is best? It is ideal if the diagnostic test is applied to the full spectrum of patients - those with mild, severe, early and late cases of the target disorder. It is also best if the patients are randomly selected or consecutive admissions so that selection bias is minimized.

Where do I find the information? The Methods section should tell you how patients were enrolled and whether they were randomly selected or consecutive admissions. It should also tell you where patients came from and whether they are likely to be representative of the patients in whom the test is to be used.

This paper: Yes / No / Unclear    Comment:

Was the reference standard applied regardless of the index test result?

What is best? Ideally both the index test and the reference standard should be carried out on all patients in the study. In some situations where the reference standard is invasive or expensive there may be reservations about subjecting patients with a negative index test result (and thus a low probability of disease) to the reference standard. An alternative reference standard is to follow up people for an appropriate period of time (dependent on the disease in question) to see if they are truly negative.

Where do I find the information? The Methods section should indicate whether or not the reference standard was applied to all patients or if an alternative reference standard (e.g., follow-up) was applied to those who tested negative on the index test.

This paper: Yes / No / Unclear    Comment:

Was there an independent, blind comparison between the index test and an appropriate reference ('gold') standard of diagnosis?

What is best? There are two issues here. First, the reference standard should be appropriate - as close to the 'truth' as possible. Sometimes there may not be a single reference test that is suitable and a combination of tests may be used to indicate the presence of disease. Second, the reference standard and the index test being assessed should be applied to each patient independently and blindly. Those who interpreted the results of one test should not be aware of the results of the other test.

Where do I find the information? The Methods section should have a description of the reference standard used, and if you are unsure whether or not this is an appropriate reference standard you may need to do some background searching in the area. The Methods section should also describe who conducted the two tests and whether each was conducted independently and blinded to the results of the other.

This paper: Yes / No / Unclear    Comment:

Centre for Evidence-Based Medicine, University of Oxford, 2010

Step 2: What were the results?

Are test characteristics presented?

There are two types of results commonly reported in diagnostic test studies. One concerns the accuracy of the test and is reflected in the sensitivity and specificity. The other concerns how the test performs in the population being tested and is reflected in predictive values (also called post-test probabilities).

To explore the meaning of these terms, consider a study in which 1000 elderly people with suspected dementia undergo an index test and a reference standard. The prevalence of dementia in this group is 25%. 240 people tested positive on both the index test and the reference standard and 600 people tested negative on both tests.

The first step is to draw a 2 x 2 table as shown below. We are told that the prevalence of dementia is 25%, therefore we can fill in the last row of totals - 25% of 1000 people is 250 - so 250 people will have dementia and 750 will be free of dementia. We also know the number of people testing positive and negative on both tests and so we can fill in two more cells of the table.

                     Reference Standard
                     +ve       -ve       Total
Index test   +ve     240
             -ve               600
Total                250       750       1000

By subtraction we can easily complete the table:

                     Reference Standard
                     +ve       -ve       Total
Index test   +ve     240       150       390
             -ve     10        600       610
Total                250       750       1000

Now we are ready to calculate the various measures.

Sensitivity (Sn) = the proportion of people with the condition who have a positive test result.
What does it mean? The sensitivity tells us how well the test identifies people with the condition. A highly sensitive test will not miss many people.
In our example, the Sn = 240/250 = 0.96. 10 people (4%) with dementia were falsely identified as not having it. This means the test is fairly good at identifying people with the condition.

Specificity (Sp) = the proportion of people without the condition who have a negative test result.
What does it mean? The specificity tells us how well the test identifies people without the condition. A highly specific test will not falsely identify many people as having the condition.
In our example, the Sp = 600/750 = 0.80. 150 people (20%) without dementia were falsely identified as having it. This means the test is only moderately good at identifying people without the condition.

Positive Predictive Value (PPV) = the proportion of people with a positive test who have the condition.
What does it mean? This measure tells us how well the test performs in this population. It is dependent on the accuracy of the test (primarily specificity) and the prevalence of the condition.
In our example, the PPV = 240/390 = 0.62. Of the 390 people who had a positive test result, 62% will actually have dementia.

Negative Predictive Value (NPV) = the proportion of people with a negative test who do not have the condition.
What does it mean? This measure tells us how well the test performs in this population. It is dependent on the accuracy of the test and the prevalence of the condition.
In our example, the NPV = 600/610 = 0.98. Of the 610 people with a -ve test, 98% will not have dementia.
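The arithmetic in the dementia example can be checked with a short script. The following is a minimal sketch rather than part of the worksheet: the names TwoByTwo, complete_table and ppv_at_prevalence are hypothetical helpers introduced here for illustration. The last function uses Bayes' theorem (an addition, not stated in the worksheet) to illustrate the worksheet's remark that predictive values depend on prevalence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cell counts of a 2 x 2 table (index test vs reference standard)."""
    tp: int  # index test +ve, reference standard +ve
    fp: int  # index test +ve, reference standard -ve
    fn: int  # index test -ve, reference standard +ve
    tn: int  # index test -ve, reference standard -ve

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)


def complete_table(total: int, prevalence: float, tp: int, tn: int) -> TwoByTwo:
    """Fill the missing cells by subtraction, as in the worked example."""
    with_condition = round(total * prevalence)
    without_condition = total - with_condition
    return TwoByTwo(tp=tp, fp=without_condition - tn,
                    fn=with_condition - tp, tn=tn)


def ppv_at_prevalence(sn: float, sp: float, prevalence: float) -> float:
    """PPV for a given prevalence via Bayes' theorem (illustrative addition)."""
    true_pos = sn * prevalence
    false_pos = (1 - sp) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Figures from the worksheet: 1000 people, 25% prevalence,
# 240 positive on both tests, 600 negative on both tests.
table = complete_table(total=1000, prevalence=0.25, tp=240, tn=600)
print(table)
print(f"Sn  = {table.sensitivity():.2f}")
print(f"Sp  = {table.specificity():.2f}")
print(f"PPV = {table.ppv():.2f}")
print(f"NPV = {table.npv():.2f}")
# Same test, hypothetical lower prevalence of 5%: the PPV falls.
print(f"PPV at 5% prevalence = {ppv_at_prevalence(0.96, 0.80, 0.05):.2f}")
```

Under these assumptions the script reproduces the worksheet's figures (Sn 0.96, Sp 0.80, PPV 0.62, NPV 0.98). The 5% prevalence in the last line is an invented value, used only to show that the same test gives a much lower PPV when the condition is rarer.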

Step 3: Applicability of the results

Were the methods for performing the test described in sufficient detail to permit replication?

What is best? The article should have sufficient description of the test to allow its replication and also interpretation of the results.

Where do I find the information? The Methods section should describe the test in detail.

This paper: Yes / No / Unclear    Comment:

QUADAS-2

Phase 1: State the review question:

Patients (setting, intended use of index test, presentation, prior testing):

Index test(s):

Reference standard and target condition:

Phase 2: Draw a flow diagram for the primary study

Phase 3: Risk of bias and applicability judgments QUADAS-2 is structured so that 4 key domains are each rated in terms of the risk of bias and the concern regarding applicability to the research question (as defined above). Each key domain has a set of signalling questions to help reach the judgments regarding bias and applicability.

DOMAIN 1: PATIENT SELECTION

A. Risk of Bias
Describe methods of patient selection:

• Was a consecutive or random sample of patients enrolled? Yes/No/Unclear
• Was a case-control design avoided? Yes/No/Unclear
• Did the study avoid inappropriate exclusions? Yes/No/Unclear

Could the selection of patients have introduced bias? RISK: LOW/HIGH/UNCLEAR

B. Concerns regarding applicability
Describe included patients (prior testing, presentation, intended use of index test and setting):

Is there concern that the included patients do not match the review question? CONCERN: LOW/HIGH/UNCLEAR

DOMAIN 2: INDEX TEST(S)
If more than one index test was used, please complete for each test.

A. Risk of Bias
Describe the index test and how it was conducted and interpreted:

• Were the index test results interpreted without knowledge of the results of the reference standard? Yes/No/Unclear
• If a threshold was used, was it pre-specified? Yes/No/Unclear

Could the conduct or interpretation of the index test have introduced bias? RISK: LOW/HIGH/UNCLEAR

B. Concerns regarding applicability
Is there concern that the index test, its conduct, or interpretation differ from the review question? CONCERN: LOW/HIGH/UNCLEAR

DOMAIN 3: REFERENCE STANDARD

A. Risk of Bias
Describe the reference standard and how it was conducted and interpreted:

• Is the reference standard likely to correctly classify the target condition? Yes/No/Unclear
• Were the reference standard results interpreted without knowledge of the results of the index test? Yes/No/Unclear

Could the reference standard, its conduct, or its interpretation have introduced bias? RISK: LOW/HIGH/UNCLEAR

B. Concerns regarding applicability
Is there concern that the target condition as defined by the reference standard does not match the review question? CONCERN: LOW/HIGH/UNCLEAR

DOMAIN 4: FLOW AND TIMING

A. Risk of Bias
Describe any patients who did not receive the index test(s) and/or reference standard or who were excluded from the 2x2 table (refer to flow diagram):

Describe the time interval and any interventions between index test(s) and reference standard:

• Was there an appropriate interval between index test(s) and reference standard? Yes/No/Unclear
• Did all patients receive a reference standard? Yes/No/Unclear
• Did patients receive the same reference standard? Yes/No/Unclear
• Were all patients included in the analysis? Yes/No/Unclear

Could the patient flow have introduced bias? RISK: LOW/HIGH/UNCLEAR

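When several papers are appraised, it can help to record the QUADAS-2 signalling answers in a small structure. The sketch below is one possible way to do that and is not part of the QUADAS-2 form: DomainRecord and its prompt method are hypothetical names. The mapping from answers to a prompt is an assumption made for illustration: all "yes" suggests LOW, any "no" flags potential bias for the reviewer to judge, anything else is UNCLEAR. The final risk-of-bias rating remains the reviewer's judgment.

```python
from dataclasses import dataclass, field

VALID_ANSWERS = {"yes", "no", "unclear"}


@dataclass
class DomainRecord:
    """Signalling-question answers for one QUADAS-2 domain (hypothetical helper)."""
    name: str
    answers: dict[str, str] = field(default_factory=dict)

    def answer(self, question: str, value: str) -> None:
        """Store one Yes/No/Unclear answer, rejecting anything else."""
        value = value.lower()
        if value not in VALID_ANSWERS:
            raise ValueError(f"answer must be one of {sorted(VALID_ANSWERS)}")
        self.answers[question] = value

    def prompt(self) -> str:
        """Suggest a starting point for the rating (assumed heuristic, not a rule)."""
        values = set(self.answers.values())
        if values == {"yes"}:
            return "LOW"
        if "no" in values:
            return "REVIEWER TO JUDGE"  # potential for bias has been flagged
        return "UNCLEAR"


# Example with made-up answers for Domain 4 of an imaginary paper.
flow = DomainRecord("Flow and timing")
flow.answer("Was there an appropriate interval between index test(s) "
            "and reference standard?", "Yes")
flow.answer("Did all patients receive a reference standard?", "No")
flow.answer("Did patients receive the same reference standard?", "Unclear")
flow.answer("Were all patients included in the analysis?", "Yes")
print(flow.name, "->", flow.prompt())
```

The example answers are invented. Keeping the heuristic separate from the stored answers means the reviewer's final LOW/HIGH/UNCLEAR rating can still be recorded independently of what the prompt suggests.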
BMJ 2015;351:h5527 doi: 10.1136/bmj.h5527 (Published 28 October 2015)

RESEARCH METHODS & REPORTING

STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies

OPEN ACCESS

Incomplete reporting has been identified as a major source of avoidable waste in biomedical research. Essential information is often not provided in study reports, impeding the identification, critical appraisal, and replication of studies. To improve the quality of reporting of diagnostic accuracy studies, the Standards for Reporting Diagnostic Accuracy (STARD) statement was developed. Here we present STARD 2015, an updated list of 30 essential items that should be included in every report of a diagnostic accuracy study. This update incorporates recent evidence about sources of bias and variability in diagnostic accuracy and is intended to facilitate the use of STARD. As such, STARD 2015 may help to improve completeness and transparency in reporting of diagnostic accuracy studies.

Patrick M Bossuyt1, Johannes B Reitsma2, David E Bruns3, Constantine A Gatsonis4, Paul P Glasziou5, Les Irwig6, Jeroen G Lijmer7, David Moher8,9, Drummond Rennie10,11, Henrica C W de Vet12, Herbert Y Kressel13,14, Nader Rifai15,16, Robert M Golub17,18, Douglas G Altman19, Lotty Hooft20, Daniël A Korevaar1, Jérémie F Cohen1,21, for the STARD Group

1Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Academic Medical Centre, University of Amsterdam, Amsterdam, the Netherlands; 2Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, University of Utrecht, Utrecht, the Netherlands; 3Department of Pathology, University of Virginia School of Medicine, Charlottesville, VA, USA; 4Center for Statistical Sciences, Brown University School of Public Health, Providence, RI, USA; 5Centre for Research in Evidence-Based Practice, Faculty of Health Sciences and Medicine, Bond University, Gold Coast, Queensland, Australia; 6Screening and Diagnostic Test Evaluation Program, School of Public Health, University of Sydney, Sydney, New South Wales, Australia; 7Department of Psychiatry, Onze Lieve Vrouwe Gasthuis, Amsterdam, the Netherlands; 8Clinical Epidemiology Program, Ottawa Hospital Research Institute, Ottawa, Canada; 9School of Epidemiology, Public Health and Preventive Medicine, University of Ottawa, Ottawa, Canada; 10Peer Review Congress, Chicago, IL, USA; 11Philip R Lee Institute for Health Policy Studies, University of California, San Francisco, CA, USA; 12Department of Epidemiology and Biostatistics, EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, the Netherlands; 13Department of Radiology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA; 14Radiology Editorial Office, Boston, MA, USA; 15Department of Laboratory Medicine, Boston Children’s Hospital, Harvard Medical School, Boston, MA, USA; 16Clinical Chemistry Editorial Office, Washington, DC, USA; 17Division of General Internal Medicine and Geriatrics and Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA; 18JAMA Editorial Office, Chicago, IL, USA; 19Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, UK; 20Dutch Cochrane Centre, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, University of Utrecht, Utrecht, the Netherlands; 21INSERM UMR 1153 and Department of Pediatrics, Necker Hospital, AP-HP, Paris Descartes University, Paris, France.

As researchers, we talk and write about our studies, not just because we are happy, or disappointed, with the findings, but also to allow others to appreciate the validity of our methods, to enable our colleagues to replicate what we did, and to disclose our findings to clinicians, other health care professionals, and decision makers, all of whom rely on the results of strong research to guide their actions.

Unfortunately, deficiencies in the reporting of research have been highlighted in several areas of clinical medicine.1 Essential elements of study methods are often poorly described and sometimes completely omitted, making both critical appraisal and replication difficult, if not impossible. Sometimes study results are selectively reported, and other times researchers cannot resist unwarranted optimism in interpretation of their findings.2-4 These practices limit the value of the research and any downstream products or activities, such as systematic reviews and clinical practice guidelines.

Correspondence to: P M Bossuyt [email protected]

Open Access: Reuse allowed Subscribe: http://www.bmj.com/subscribe

Reports of studies of medical tests are no exception. A growing number of evaluations have identified deficiencies in the reporting of test accuracy studies.5 These are studies in which a test is evaluated against a clinical reference standard, or gold standard; the results are typically reported as estimates of the test’s sensitivity and specificity, which express how good the test is in correctly identifying patients as having the target condition. Other accuracy statistics can be used as well, such as the area under the receiver operating characteristics (ROC) curve or positive and negative predictive values.

Despite their apparent simplicity, such studies are at risk of bias.6 7 If not all patients undergoing testing are included in the final analysis, for example, or if only healthy controls are included, the estimates of test accuracy may not reflect the performance of the test in clinical applications. Yet such crucial information is often missing from study reports.

It is now well established that sensitivity and specificity are not fixed test properties. The relative number of false positive and false negative test results varies across settings, depending on how patients present and which tests they have already undergone. Unfortunately, many authors also fail to completely report the clinical context and when, where, and how they identified and recruited eligible study participants.8 In addition, sensitivity and specificity estimates can differ because of variable definitions of the reference standard against which the test is being compared. Thus this information should be available in the study report.

The 2003 STARD statement

To assist in the completeness and transparency of reporting diagnostic accuracy studies, a group of researchers, editors, and other stakeholders developed a minimum list of essential items that should be included in every study report. The guiding principle for developing the list was to select items that, if described, would help readers to judge the potential for bias in the study and appraise the applicability of the study findings and the validity of the authors’ conclusions and recommendations.

The resulting Standards for Reporting Diagnostic Accuracy (STARD) statement appeared in 2003 in two dozen journals.9 It was accompanied by editorials and commentaries in several other publications and endorsed by many more.

Since the publication of STARD, several evaluations have pointed to small but statistically significant improvements in reporting accuracy studies (mean gain 1.4 items (95% confidence interval 0.7 to 2.2)).5 10 Gradually, more of the essential items are being reported, but the situation remains far from optimal.

Methods for developing STARD 2015

The STARD steering committee periodically reviews the literature for potentially relevant studies to inform a possible update. In 2013, the steering committee decided that the time was right to update the checklist.

Updating had two major goals: first, to incorporate recent evidence about sources of bias, applicability concerns, and factors facilitating generous interpretation in test accuracy research, and, second, to make the list easier to use. In making modifications, we also considered harmonization with other reporting guidelines, such as Consolidated Standards of Reporting Trials (CONSORT) 2010.11

A complete description of the updating process and the justification for the changes are available on the Enhancing the Quality and Transparency of Health Research (EQUATOR) website at www.equator-network.org/reporting-guidelines/stard.

In short, we invited the 2003 STARD group members to participate in the updating process, nominate new members, and comment on the general scope of the update. Suggested new members were contacted. As a result, the STARD group has now grown to 85 members that include researchers, editors, journalists, evidence synthesis professionals, funders, and other stakeholders.

STARD group members were then asked to suggest, and later to endorse, proposed changes in a two round, web based survey. This served to prepare a draft list of essential items, which was discussed in the steering committee in a two day meeting in Amsterdam in September 2014. The list was then piloted in different groups: starting and advanced researchers, peer reviewers, and editors.

The general structure of STARD 2015 is similar to that of STARD 2003. A one page document presents 30 items, grouped under sections that follow the introduction, methods, results, and discussion (IMRAD) structure of a scientific article (see table 1). Several of the STARD 2015 items are identical to the ones in the 2003 version. Others have been reworded, combined, or (if complex) split. A few have been added (see table 2 for a summary of new items and table 3 for key terms). A diagram to describe the flow of participants through the study is now expected in all reports (figure).

Scope

STARD 2015 replaces the original version published in 2003; those who would like to refer to STARD are invited to cite this article. The list of essential items can be seen as a minimum set, and an informative study report will typically present more information. Yet we hope to find all applicable items in a well prepared report of a diagnostic accuracy study.

Authors are invited to use STARD when preparing their study reports. Reviewers can use the list to verify that all essential information is available in a submitted manuscript and suggest changes if key items are missing.

We trust that journals that endorsed STARD in 2003 or later will recommend the use of this updated version and encourage compliance in submitted manuscripts. We hope that even more journals, and journal organizations, will promote the use of this and comparable reporting guidelines. Funders and research institutions may promote or mandate adherence to STARD as a way to maximize the value of research and downstream products or activities.

STARD may also be beneficial for reporting other studies that evaluate the performance of tests. This includes prognostic studies, which can classify patients on the basis of whether a future event happens; monitoring studies, in which tests are supposed to detect or predict an adverse event or lack of response; studies evaluating treatment selection markers; and more. We and others have found most of the STARD items useful when reporting and examining such studies, although STARD primarily targets diagnostic accuracy studies.

Diagnostic accuracy is not the only expression of test performance, nor is it always the most meaningful.12 Incremental accuracy from combining tests, relative to a single test, can be more informative, for example.13 For continuous tests, dichotomization into test positives and negatives may not always be indicated. In such cases, the desirable computational and graphical methods for expressing test performance are different, although many of the methodological precautions would be the same, and STARD can help in reporting the study in an informative way. Other reporting guidelines target more specific forms of tests, such as Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) for multivariable prediction models.14

Although STARD focuses on full study reports of test accuracy studies, the items can also be helpful when writing conference abstracts, including information in trial registries, and developing protocols for such studies. Additional initiatives are underway to provide more specific guidance for each of these applications.

STARD extensions and applications

The STARD statement was designed to apply to all types of medical tests. The STARD group believed that a single checklist, for all diagnostic accuracy studies, would be more widely disseminated and more easily accepted by authors, peer reviewers, and journal editors than separate lists for different types of tests such as imaging, biochemistry, or histopathology.

Having a general list may necessitate additional instructions for informative reporting, with more information for specific types of tests, specific applications, or specific forms of analysis. Such guidance could describe the preferred methods for studying and reporting measurement uncertainty, for example, without changing any of the other STARD items. The STARD group welcomes the development of such STARD extensions and invites interested groups to contact the STARD executive committee before developing them.

Other groups may want to develop additional guidance to facilitate the use of STARD for specific applications. An example of such a STARD application was prepared for history taking and physical examination.15 Another type of application is the use of STARD for specific target conditions such as dementia.16

Availability

The new STARD 2015 list and all related documents can be found on the STARD pages of the EQUATOR website. EQUATOR is an international initiative that seeks to improve the value of published health research literature by promoting transparent and accurate reporting and wider use of robust reporting guidelines.17 18 The STARD group believes that working more closely with EQUATOR and other reporting guideline developers will help us to better reach shared objectives. We have updated the 2003 explanation and elaboration document, which can also be found at the EQUATOR website. This document explains the rationale for each item and gives examples.

The STARD list is released under a Creative Commons license. This allows everyone to use and distribute the work if they acknowledge the source. The STARD statement was originally reported in English, but several groups have worked on translations in other languages. We welcome such translations, which are preferably developed by groups of researchers, by use of a cyclical development process, with back-translation to the original language and user testing.19 We have also applied for a trademark for STARD to ensure that the steering committee has the exclusive right to use the word “STARD” to identify goods or services.

Increasing value, reducing waste

The STARD steering committee is aware that building a list of essential items is not sufficient to achieve substantial improvements in reporting completeness, as the modest improvement after introduction of the 2003 list has shown. We see this list not as the final product, but as the starting point for building more specific instruments to stimulate complete and transparent reporting, such as a checklist and a writing aid for authors, tools for reviewers and editors, instruction videos, and teaching materials, all based on this STARD list of essential items.

Incomplete reporting has been identified as one of the sources of avoidable waste in biomedical research.1 Since STARD was initiated, several other initiatives have been undertaken to enhance the reproducibility of research and promote greater transparency.20 Multiple factors are at stake, but incomplete reporting is one of them. We hope that this update of STARD, together with additional implementation initiatives, will help authors, editors, reviewers, readers, and decision makers to collect, appraise, and apply the evidence needed to strengthen decisions and recommendations about medical tests. In the end, we are all to benefit from more informative and transparent reporting: as researchers, as healthcare professionals, as payers, and as patients.

This article is being simultaneously published in October 2015 by The BMJ, Radiology, and Clinical Chemistry. This article is published under the Creative Commons CC BY-NC license http://creativecommons.org/licenses/by-nc/4.0.

STARD Group collaborators: Todd Alonzo, Douglas G Altman, Augusto Azuara-Blanco, Lucas Bachmann, Jeffrey Blume, Patrick M Bossuyt, Isabelle Boutron, David Bruns, Harry Büller, Frank Buntinx, Sarah Byron, Stephanie Chang, Jérémie F Cohen, Richelle Cooper, Joris de Groot, Henrica C W de Vet, Jon Deeks, Nandini Dendukuri, Jac Dinnes, Kenneth Fleming, Constantine A Gatsonis, Paul P Glasziou, Robert M Golub, Gordon Guyatt, Carl Heneghan, Jørgen Hilden, Lotty Hooft, Rita Horvath, Myriam Hunink, Chris Hyde, John Ioannidis, Les Irwig, Holly Janes, Jos Kleijnen, André Knottnerus, Daniël A Korevaar, Herbert Y Kressel, Stefan Lange, Mariska Leeflang, Jeroen G Lijmer, Sally Lord, Blanca Lumbreras, Petra Macaskill, Erik Magid, Susan Mallett, Matthew McInnes, Barbara McNeil, Matthew McQueen, David Moher, Karel Moons, Katie Morris, Reem Mustafa, Nancy Obuchowski, Eleanor Ochodo, Andrew Onderdonk, John Overbeke, Nitika Pai, Rosanna Peeling, Margaret Pepe, Steffen Petersen, Christopher Price, Philippe Ravaud, Johannes B. Reitsma, Drummond Rennie, Nader Rifai, Anne Rutjes, Holger Schunemann, David Simel, Iveta Simera, Nynke Smidt, Ewout Steyerberg, Sharon Straus, William Summerskill, Yemisi Takwoingi, Matthew Thompson, Ann van de Bruel, Hans van Maanen, Andrew Vickers, Gianni Virgili, Stephen Walter, Wim Weber, Marie Westwood, Penny Whiting, Nancy Wilczynski, Andreas Ziegler.

Contributors: All authors confirm they have contributed to the intellectual content of this paper and have met the following 3 requirements: (a) significant contributions to the conception and design, acquisition of data, or analysis and interpretation of data; (b) drafting or revising the article for intellectual content; and (c) final approval of the published article.

Funding: There was no explicit funding for the development of STARD 2015. The Academic Medical Center of the University of Amsterdam, the Netherlands, partly funded the meeting of the STARD steering group but had no influence on the development or dissemination of the list of essential items. STARD steering group members and STARD group members covered additional personal costs individually.


Competing interests: All authors have completed the Clinical Chemistry author disclosure form: N Rifai works for Clinical Chemistry, AACC; C A Gatsonis is a member of RSNA Research Development Committee.

1 Glasziou P, Altman DG, Bossuyt P, et al. Reducing waste from incomplete or unusable reports of biomedical research. Lancet 2014;383:267-76.
2 Boutron I, Dutton S, Ravaud P, Altman DG. Reporting and interpretation of randomized controlled trials with statistically nonsignificant results for primary outcomes. JAMA 2010;303:2058-64.
3 Ochodo EA, de Haan MC, Reitsma JB, et al. Overinterpretation and misreporting of diagnostic accuracy studies: evidence of “spin.” Radiology 2013;267:581-8.
4 Mathieu S, Boutron I, Moher D, Altman DG, Ravaud P. Comparison of registered and published primary outcomes in randomized controlled trials. JAMA 2009;302:977-84.
5 Korevaar DA, Wang J, van Enst WA, et al. Reporting diagnostic accuracy studies: some improvements after 10 years of STARD. Radiology 2015;274:781-9.
6 Lijmer JG, Mol BW, Heisterkamp S, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999;282:1061-6.
7 Whiting PF, Rutjes AW, Westwood ME, Mallett S. A systematic review classifies sources of bias and variation in diagnostic test accuracy studies. J Clin Epidemiol 2013;66:1093-104.
8 Irwig L, Bossuyt P, Glasziou P, Gatsonis C, Lijmer J. Designing studies to ensure that estimates of test accuracy are transferable. BMJ 2002;324:669-71.
9 Bossuyt PM, Reitsma JB, Bruns DE, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD Initiative. Radiology 2003;226:24-8.
10 Korevaar DA, van Enst WA, Spijker R, Bossuyt PM, Hooft L. Reporting quality of diagnostic accuracy studies: a systematic review and meta-analysis of investigations on adherence to STARD. Evid Based Med 2014;19:47-54.
11 Schulz KF, Altman DG, Moher D, Group C. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. J Clin Epidemiol 2010;63:834-40.
12 Bossuyt PM, Reitsma JB, Linnet K, Moons KG. Beyond diagnostic accuracy: the clinical utility of diagnostic tests. Clin Chem 2012;58:1636-43.
13 Moons KG, de Groot JA, Linnet K, Reitsma JB, Bossuyt PM. Quantifying the added value of a diagnostic test or marker. Clin Chem 2012;58:1408-17.
14 Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ 2015;350:g7594.
15 Simel DL, Rennie D, Bossuyt PM. The STARD statement for reporting diagnostic accuracy studies: application to the history and physical examination. J Gen Intern Med 2008;23:768-74.
16 Noel-Storr AH, McCleery JM, Richard E, et al. Reporting standards for studies of diagnostic test accuracy in dementia: the STARDdem Initiative. Neurology 2014;83:364-73.
17 Altman DG, Simera I, Hoey J, Moher D, Schulz K. EQUATOR: reporting guidelines for health research. Lancet 2008;371:1149-50.
18 Simera I, Moher D, Hirst A, et al. Transparent and accurate reporting increases reliability, utility, and impact of your research: reporting guidelines and the EQUATOR Network. BMC Med 2010;8:24.
19 Beaton DE, Bombardier C, Guillemin F, Ferraz MB. Guidelines for the process of cross-cultural adaptation of self-report measures. Spine 2000;25:3186-91.
20 Collins FS, Tabak LA. Policy: NIH plans to enhance reproducibility. Nature 2014;505:612-3.

Accepted: 18 September 2015

Cite this as: BMJ 2015;351:h5527

© Bossuyt et al 2015

This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 4.0) license, which permits others to distribute, remix, adapt and build upon this work, for commercial use, provided the original work is properly cited. See: http://creativecommons.org/licenses/by/4.0/.

Open Access: Reuse allowed Subscribe: http://www.bmj.com/subscribe BMJ 2015;351:h5527 doi: 10.1136/bmj.h5527 (Published 28 October 2015) Page 5 of 9

Tables

Table 1| The STARD 2015 list*

Section and topic No Item

Title or abstract
1 Identification as a study of diagnostic accuracy using at least one measure of accuracy (such as sensitivity, specificity, predictive values, or AUC)

Abstract
2 Structured summary of study design, methods, results, and conclusions (for specific guidance, see STARD for Abstracts)

Introduction
3 Scientific and clinical background, including the intended use and clinical role of the index test
4 Study objectives and hypotheses

Methods

Study design
5 Whether data collection was planned before the index test and reference standard were performed (prospective study) or after (retrospective study)

Participants
6 Eligibility criteria
7 On what basis potentially eligible participants were identified (such as symptoms, results from previous tests, inclusion in registry)
8 Where and when potentially eligible participants were identified (setting, location, and dates)
9 Whether participants formed a consecutive, random, or convenience series

Test methods
10a Index test, in sufficient detail to allow replication
10b Reference standard, in sufficient detail to allow replication
11 Rationale for choosing the reference standard (if alternatives exist)
12a Definition of and rationale for test positivity cut-offs or result categories of the index test, distinguishing pre-specified from exploratory
12b Definition of and rationale for test positivity cut-offs or result categories of the reference standard, distinguishing pre-specified from exploratory
13a Whether clinical information and reference standard results were available to the performers or readers of the index test
13b Whether clinical information and index test results were available to the assessors of the reference standard

Analysis
14 Methods for estimating or comparing measures of diagnostic accuracy
15 How indeterminate index test or reference standard results were handled
16 How missing data on the index test and reference standard were handled
17 Any analyses of variability in diagnostic accuracy, distinguishing pre-specified from exploratory
18 Intended sample size and how it was determined

Results

Participants
19 Flow of participants, using a diagram
20 Baseline demographic and clinical characteristics of participants
21a Distribution of severity of disease in those with the target condition
21b Distribution of alternative diagnoses in those without the target condition
22 Time interval and any clinical interventions between index test and reference standard

Test results
23 Cross tabulation of the index test results (or their distribution) by the results of the reference standard
24 Estimates of diagnostic accuracy and their precision (such as 95% confidence intervals)
25 Any adverse events from performing the index test or the reference standard

Discussion
26 Study limitations, including sources of potential bias, statistical uncertainty, and generalisability
27 Implications for practice, including the intended use and clinical role of the index test

Other information
28 Registration number and name of registry
29 Where the full study protocol can be accessed
30 Sources of funding and other support; role of funders


*At the start of each item row, authors should specify the page number of the manuscript where the item can be found.

Table 2| Summary of new items in STARD 2015

No Item Rationale

2 Structured abstract: Abstracts are increasingly used to identify key elements of study design and results.
3 Intended use and clinical role of the test: Describing the targeted application of the test helps readers to interpret the implications of reported accuracy estimates.
4 Study hypotheses: Not having a specific study hypothesis may invite generous interpretation of the study results and “spin” in the conclusions.
18 Sample size: Readers want to appreciate the anticipated precision and power of the study and whether authors were successful in recruiting the targeted number of participants.
26-27 Structured discussion: To prevent jumping to unwarranted conclusions, authors are invited to discuss study limitations and draw conclusions keeping in mind the targeted application of the evaluated tests (see item 3).
28 Registration: Prospective test accuracy studies are trials, and, as such, they can be registered in clinical trial registries, such as ClinicalTrials.gov, before their initiation, facilitating identification of their existence and preventing selective reporting.
29 Protocol: The full study protocol, with more information about the predefined study methods, may be available elsewhere, to allow more fine grained critical appraisal.
30 Sources of funding: Awareness of the potentially compromising effects of conflicts of interest between researchers’ obligations to abide by scientific and ethical principles and other goals, such as financial ones; test accuracy studies are no exception.

Table 3| Key STARD terminology

Term Explanation

Medical test: Any method for collecting additional information about the current or future health status of a patient
Index test: The test under evaluation
Target condition: The disease or condition that the index test is expected to detect
Clinical reference standard: The best available method for establishing the presence or absence of the target condition; a gold standard would be an error-free reference standard
Sensitivity: Proportion of those with the target condition who test positive with the index test
Specificity: Proportion of those without the target condition who test negative with the index test
Intended use of the test: Whether the index test is used for diagnosis, screening, staging, monitoring, surveillance, prediction, prognosis, or other reasons
Role of the test: The position of the index test relative to other tests for the same condition (for example, triage, replacement, add-on, new test)
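The definitions of sensitivity and specificity above, together with the checklist's request for estimates with their precision, can be illustrated with a short calculation. The following is a minimal sketch, not part of STARD itself: the 2 × 2 counts are hypothetical, the function names are made up for illustration, and the Wilson score interval is only one common choice of 95% confidence interval (STARD does not prescribe a method).

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need total > 0 and 0 <= successes <= total")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


def accuracy_summary(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity and specificity, each with a Wilson interval, from a 2 x 2 table.

    tp/fn: index test positive/negative among those WITH the target condition.
    fp/tn: index test positive/negative among those WITHOUT the target condition.
    """
    return {
        "sensitivity": (tp / (tp + fn), wilson_interval(tp, tp + fn)),
        "specificity": (tn / (tn + fp), wilson_interval(tn, tn + fp)),
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    for name, (estimate, (low, high)) in accuracy_summary(tp=45, fp=10, fn=5, tn=90).items():
        print(f"{name}: {estimate:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Note that each interval is computed on its own denominator: sensitivity uses only participants with the target condition, and specificity only those without it.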

Figure

Prototypical STARD diagram to report flow of participants through the study.

Indian J Orthop. 2011 Sep-Oct; 45(5): 389–391. PMCID: PMC3162672 doi: 10.4103/0019-5413.80317 PMID: 21886917

The 3-min appraisal of a diagnostic test

Teresa Chien, Rajesh Malhotra,1 and Mohit Bhandari

McMaster University, Hamilton, Canada 1Department of Orthopaedics, AIIMS, Delhi, India Address for correspondence: Dr. Mohit Bhandari, 2309 Hoover Court, Hamilton, L7P 4V2, ON, Canada. E-mail: [email protected]

Copyright © Indian Journal of Orthopaedics

This is an open-access article distributed under the terms of the Creative Commons Attribution-Noncommercial- Share Alike 3.0 Unported, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

Diagnostic tests are invaluable tools used in various healthcare settings to distinguish between patients who have a disease and those who do not. It is essential for surgeons to be skilled in critically appraising published papers about a diagnostic test.1 This article will provide some simple and quick guidelines to assist in the 3-min critical appraisal process of the literature on diagnostic tests.

Key Concepts in Critical Appraisal

Many aspects of diagnostics need to be evaluated, and there are three specific areas that should be critically appraised in diagnostic test studies as described by Guyatt in his User's Guide to the Medical Literature: validity of the study, results, and the applicability of the diagnostic test [Table 1].

Are the results of the study valid? The validity of a diagnostic test study can be critically appraised through examining the study design. The patient population of the study should include a wide spectrum of patients with varying disease conditions and stages of treatment to ensure that there is genuine diagnostic uncertainty.2,3 Diagnostic uncertainty increases when symptoms of the target condition are also characteristic of other diseases.3 To minimize misjudgment of the study results, there should be a variety of patients.

Another essential component to analyze is whether an independent (blind) comparison between the diagnostic test and an appropriate reference standard was done for each patient. For example, a surgeon wishing to understand whether the “Lachman” test for the cruciate ligament is predictive of a tear could confirm the accuracy of the test by ensuring all patients tested received an independent “standard” confirmatory test (magnetic resonance imaging (MRI) or arthroscopic visualization). To increase the study validity and minimize potential bias in overestimating test outcomes, those interpreting the test results should be blinded to the reference standard result and should not be the same people who interpret the reference standard.3 In this case, the surgeon performing the physical exam manoeuvre and grading the presence/absence of a positive Lachman would not be the same surgeon who reviewed the gold standard confirmatory test (MRI or arthroscopy).

The results of a diagnostic test should not influence the decision to perform the reference standard. When they do, the resulting distortion is referred to as verification bias.3 For example, verification bias results when only patients with a positive “Lachman” test get an MRI (or arthroscopy), but those with a negative test do not. A good study design should take verification bias into account and take measures to prevent it.

What are the results? The terms sensitivity and specificity are often used to describe the effectiveness of a test, but the literature has shown that likelihood ratios are a better statistical tool in aiding clinical decision making.2 Sensitivity is the proportion of individuals with a target condition who test positive and specificity is the proportion of individuals without a target condition who test negative. These dichotomous measures indicate either positive or negative tests, while likelihood ratios account for the cases in the middle of a wide spectrum of patients.2,3

Likelihood ratios combine sensitivity and specificity into a single measure that links the pretest probability (often estimated from the prevalence of disease) to the post-test probability (the probability that the patient has the target condition given the test result).3,4 A likelihood ratio is the proportion of patients with the target condition who have a given test result divided by the proportion of patients without the target condition who have the same result.5 A likelihood ratio equal or close to 1 means that the test has minimal value because it cannot differentiate between those who have the target condition and those who do not.4 A likelihood ratio greater than 1 means the result occurs more often in patients with the target condition, whereas a ratio less than 1 means the result occurs more often in patients without it.5 The ability to understand and interpret these ratios is essential in understanding quantitative features of a study.
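For a test with only two outcomes, the likelihood ratios follow directly from sensitivity and specificity. The sketch below is illustrative only: the function name and the input values (sensitivity 0.90, specificity 0.80) are assumptions chosen for the example and do not come from any study discussed here.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (LR+, LR-) for a dichotomous test.

    LR+ = P(test positive | condition present) / P(test positive | condition absent)
        = sensitivity / (1 - specificity)
    LR- = P(test negative | condition present) / P(test negative | condition absent)
        = (1 - sensitivity) / specificity
    """
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must lie between 0 and 1")
    if not 0.0 < specificity < 1.0:
        # specificity of exactly 0 or 1 would divide by zero in one of the ratios
        raise ValueError("specificity must lie strictly between 0 and 1")
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


if __name__ == "__main__":
    # Hypothetical test characteristics, for illustration only.
    lr_pos, lr_neg = likelihood_ratios(sensitivity=0.90, specificity=0.80)
    print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}")
```

A test whose sensitivity equals 1 minus its specificity gives both ratios equal to 1, which is the uninformative case described above.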

Are the results of this study useful in my practice? The last component to consider is the applicability of the study results to current practice. It is crucial to compare one's patient to the patient population in the study. The greater the similarities, the more appropriate and relevant the study is to one's practice. An accurate diagnostic test with minimal risks to the patient can be an invaluable stepping stone to improving healthcare. A diagnostic test that is easy to conduct and reproduce allows for a better integration into clinical practice.3

A Practical Example

In the study by Iannotti et al., office-based ultrasonography was examined for its accuracy in diagnosing rotator cuff tears.6

Are the results of the study valid? The patients in Iannotti et al.'s study had all been clinically diagnosed with rotator cuff symptoms. This inclusion criterion narrows the spectrum of patients, which may reduce the study's validity.

This study clearly defined the diagnostic test (ultrasonography) and compared it with an appropriate reference standard (operative findings); however, some areas of the study design allowed for potential bias. The study design was complex, involving consecutive radiographs, clinical examinations, ultrasounds, and preoperative MRI scans prior to the study. The orthopedic surgeon did the initial physical examinations and interpreted the radiographs. The surgeon's involvement in the preliminary assessment of the patient might have biased his/her diagnosis after evaluating the ultrasound and preoperative MRI results. The authors’ rationale behind their design was that the diagnostic study should be carried out in a way similar to typical clinical practice, where patients will undergo similar diagnostic procedures. The authors also suggested that the surgeon's bias from a previous involvement with the patient might actually lead to greater test accuracy than that of a blinded radiologist.

What are the results? The ultrasound had an accuracy of 80%, sensitivity of 88%, positive predictive value of 79%, negative predictive value of 90%, and a false-positive rate of 21%. These results were reported in a 3 × 3 table comparing three categories: no tear, partial-thickness tear only, and full-thickness tear with or without partial-thickness tear. If the latter two categories are combined to represent the target condition, and the data configured into a 2 × 2 contingency table, the likelihood ratios can be calculated.

The process of calculating likelihood ratios is shown in Table 2. The positive likelihood ratio is 4.8 and the negative likelihood ratio is 0.05. This indicates that a positive ultrasound is 4.8 times more likely to occur in patients with a rotator cuff tear (either partial or full-thickness) than in patients without one. Likelihood ratios are better measures for interpreting the results.
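To see what likelihood ratios of 4.8 and 0.05 would mean for an individual patient, they can be applied to a pretest probability by converting to odds and back. This is a minimal sketch: the two likelihood ratios are the values reported above, but the 50% pretest probability and the function name are assumptions made only for illustration.

```python
def post_test_probability(pretest_probability: float, likelihood_ratio: float) -> float:
    """Convert a pretest probability to a post-test probability via odds.

    pretest odds   = p / (1 - p)
    post-test odds = pretest odds * likelihood ratio
    post-test prob = post-test odds / (1 + post-test odds)
    """
    if not 0.0 < pretest_probability < 1.0:
        raise ValueError("pretest probability must lie strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood ratio cannot be negative")
    pretest_odds = pretest_probability / (1.0 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1.0 + post_test_odds)


if __name__ == "__main__":
    pretest = 0.50  # assumed for illustration; not taken from the study
    after_positive = post_test_probability(pretest, 4.8)   # LR+ reported above
    after_negative = post_test_probability(pretest, 0.05)  # LR- reported above
    print(f"After a positive ultrasound: {after_positive:.1%}")
    print(f"After a negative ultrasound: {after_negative:.1%}")
```

The same likelihood ratios shift the probability by different amounts at different starting points, which is why the pretest estimate for one's own patients matters when applying a study's results.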

Are the results of this study useful in my practice? While this study had some limitations in its design, it concluded that office-based ultrasounds can be used effectively to diagnose rotator cuff tears when there are well-trained staff, preliminary clinical examinations, and radiographs.

Summary

The ability to critically appraise the literature on diagnostic tests is essential for interpreting and understanding results of a study. The validity of the study, the results, and applicability of results to your patient population must be considered appropriately before applying the new knowledge in evidence-based orthopedics.

Footnotes

Source of Support: Nil

Conflict of Interest: None.

References

1. Mundi R, Chaudhry H, Singh I, Bhandari M. Checklists to improve the quality of the orthopaedic literature. Indian J Orthop. 2008;42:150–64. [PMCID: PMC2759624] [PubMed: 19826520]

2. Greenhalgh T. How to read a paper: Papers that report diagnostic or screening tests. BMJ. 1997;315:540–3. [PMCID: PMC2127365] [PubMed: 9329312]

3. Bhandari M, Montori VM, Swiontkowski MF, Guyatt GH. User's guide to the surgical literature: How to use an article about a diagnostic test. J Bone Joint Surg Am. 2003;85:1133–40. [PubMed: 12784015]

4. Dujardin B, Van den Ende J, Van Gompel A, Ugner JP, Van der Stuyft P. Likelihood ratios: A real improvement for clinical decision making? Eur J Epidemiol. 1994;10:29–36. [PubMed: 7957786]

5. Richardson WS, Wilson MC, Keitz SA, Wyer PC. EBM Teaching Scripts Working Group. Tips for Teachers of evidence-based medicine: Making sense of diagnostic test results using likelihood ratios. J Gen Intern Med. 2008;23:87–92. [PMCID: PMC2173935] [PubMed: 18064524]

6. Iannotti JP, Ciccone J, Buss D, Visotsky JL, Mascha E, Cotman K, et al. Accuracy of office-based ultrasonography of the shoulder for the diagnosis of rotator cuff tears. J Bone Joint Surg Am. 2005;87:1305–11. [PubMed: 15930541]

Figures and Tables

Table 1 Guidelines to critically appraising literature on diagnostic tests

Table 2 Likelihood ratio calculations from the practical example

Novel lipoarabinomannan point-of-care tuberculosis test for people with HIV: a diagnostic accuracy study

Tobias Broger*, Bianca Sossen*, Elloise du Toit, Andrew D Kerkhoff, Charlotte Schutz, Elena Ivanova Reipold, Amy Ward, David A Barr, Aurélien Macé, Andre Trollip, Rosie Burton, Stefano Ongarello, Abraham Pinter, Todd L Lowary, Catharina Boehme, Mark P Nicol, Graeme Meintjes†, Claudia M Denkinger†

Lancet Infect Dis 2019
Published Online May 30, 2019
http://dx.doi.org/10.1016/S1473-3099(19)30001-5
See Online/Comment http://dx.doi.org/10.1016/S1473-3099(19)30053-2

*Contributed equally
†Contributed equally

FIND, Geneva, Switzerland (T Broger MSc, E Ivanova Reipold PhD, A Macé PhD, S Ongarello PhD, C Boehme MD, C M Denkinger MD); Department of Medicine, Faculty of Health Sciences (B Sossen MBChB, C Schutz MBChB, A Ward MBChB, R Burton MBChB, Prof G Meintjes PhD), Wellcome Center for Infectious Diseases Research in Africa, Institute of Infectious Disease and Molecular Medicine (B Sossen, C Schutz, A Ward, Prof G Meintjes), and Division of Medical Microbiology (E du Toit PhD, Prof M P Nicol MBChB), University of Cape Town, Cape Town, South Africa; National Health Laboratory Service, Cape Town, South Africa (E du Toit, Prof M P Nicol); Division of HIV, Infectious Diseases and Global Medicine, Zuckerberg San Francisco General Hospital, University of California, San Francisco, San Francisco, CA, USA (A D Kerkhoff MD); Wellcome Trust Liverpool Glasgow Centre for Global Health Research, University of Liverpool, Liverpool, UK (D A Barr MBChB); FIND, Cape Town, South Africa (A Trollip PhD); Southern African Medical Unit, Médecins sans Frontières, Cape Town, South Africa (R Burton); Public Health Research Institute Center, New Jersey Medical School, Rutgers University, Newark, NJ, USA (Prof A Pinter PhD); and Department of Chemistry and Alberta Glycomics Centre, University of Alberta, Edmonton, AB, Canada (Prof T L Lowary PhD)

Correspondence to: Dr Claudia M Denkinger, FIND, 1202 Geneva, Switzerland [email protected]

Summary

Background Most tuberculosis-related deaths in people with HIV could be prevented with earlier diagnosis and treatment. The only commercially available tuberculosis point-of-care test (Alere Determine TB LAM Ag [AlereLAM]) has suboptimal sensitivity, which restricts its use in clinical practice. The novel Fujifilm SILVAMP TB LAM (FujiLAM) assay has been developed to improve the sensitivity of AlereLAM. We assessed the diagnostic accuracy of the FujiLAM assay for the detection of tuberculosis in hospital inpatients with HIV compared with the AlereLAM assay.

Methods For this diagnostic accuracy study, we assessed biobanked urine samples obtained from the FIND Specimen Bank and the University of Cape Town Biobank, which had been collected from hospital inpatients (aged ≥18 years) with HIV during three independent prospective cohort studies done at two South African hospitals. Urine samples were tested using FujiLAM and AlereLAM assays. The conduct and reporting of each test was done blind to other test results. The primary objective was to assess the diagnostic accuracy of FujiLAM compared with AlereLAM, against microbiological and composite reference standards (including clinical diagnoses).

Findings Between April 18, 2018, and May 3, 2018, urine samples from 968 hospital inpatients with HIV were evaluated. The prevalence of microbiologically-confirmed tuberculosis was 62% and the median CD4 count was 86 cells per µL. Using the microbiological reference standard, the estimated sensitivity of FujiLAM was 70·4% (95% CI 53·0 to 83·1) compared with 42·3% (31·7 to 51·8) for AlereLAM (difference 28·1%) and the estimated specificity of FujiLAM was 90·8% (86·0 to 94·4) and 95·0% (87·7–98·8) for AlereLAM (difference –4·2%). Against the composite reference standard, the specificity of both assays was higher (95·7% [92·0 to 98·0] for FujiLAM vs 98·2% [95·7 to 99·6] for AlereLAM; difference –2·5%), but the sensitivity of both assays was lower (64·9% [50·1 to 76·7] for FujiLAM vs 38·2% [28·1 to 47·3] for AlereLAM; difference 26·7%).

Interpretation In comparison to AlereLAM, FujiLAM offers superior diagnostic sensitivity, while maintaining specificity, and could transform rapid point-of-care tuberculosis diagnosis for hospital inpatients with HIV. The applicability of FujiLAM for settings of intended use requires prospective assessment.

Funding Global Health Innovative Technology Fund, UK Department for International Development, Dutch Ministry of Foreign Affairs, Bill & Melinda Gates Foundation, German Federal Ministry of Education and Research, Australian Department of Foreign Affairs and Trade, Wellcome Trust, Department of Science and Technology and National Research Foundation of South Africa, and South African Medical Research Council.

Copyright © 2019 The Author(s). Published by Elsevier Ltd. This is an Open Access article under the CC BY 4.0 license.

Research in context

Evidence before this study
WHO recommended the use of the rapid, point-of-care Alere Determine TB LAM Ag assay (AlereLAM) for the diagnosis of tuberculosis in people with HIV. This recommendation was informed by a Cochrane systematic review and meta-analysis of 12 cross-sectional or cohort studies that showed a relatively low pooled sensitivity of 45% (95% CI 29–63) and specificity of 92% (80–97) against a microbiological reference standard. In a subgroup of patients with HIV and CD4 counts less than or equal to 100 cells per µL, pooled sensitivity was 56% (41–70). We searched PubMed for articles published between Feb 5, 2015, and Sept 17, 2018, evaluating the diagnostic accuracy or diagnostic yield of AlereLAM using the search terms (tuberculosis or TB) AND (lipoarabinomannan or LAM) AND (test OR assay OR antigen OR Ag OR lateral flow assay* OR urine antigen OR point of care) AND (accuracy OR sensitivity OR specificity OR yield OR diagnos* OR screening). Our search yielded an additional 23 relevant studies, which confirmed the moderate clinical sensitivity of AlereLAM. The search also identified two randomised trials that demonstrated reduced mortality with AlereLAM point-of-care testing for tuberculosis among severely ill inpatients with HIV.

The novel urine-based Fujifilm SILVAMP TB LAM (FujiLAM) assay was developed to overcome the limited sensitivity of AlereLAM and increase the diagnostic yield of rapid urinary lipoarabinomannan testing. On Sept 17, 2018, we did a second PubMed search using the search term (tuberculosis or TB) AND (lipoarabinomannan or LAM) AND (Fuji*), but no articles were identified.

Added value of this study
This is the first study to assess the accuracy and diagnostic yield of the FujiLAM assay for the diagnosis of active tuberculosis. Diagnostic accuracy was compared with rigorously defined microbiological and composite reference standards in three cohorts of hospital inpatients with HIV. The findings from this study show that FujiLAM is substantially more sensitive than AlereLAM, while maintaining specificity, for the diagnosis of active tuberculosis in hospital inpatients with HIV.

Implications of all the available evidence
Considering the substantially improved sensitivity of FujiLAM compared with AlereLAM and the high diagnostic yield compared with sputum-based diagnostics, the FujiLAM assay has the potential to substantially improve rapid diagnosis of tuberculosis in patients with HIV who are admitted to hospital and potentially people with HIV in the general population. Since AlereLAM has demonstrated survival benefit, FujiLAM might potentially further reduce tuberculosis-related mortality in people with HIV. These findings will inform a WHO policy review for lipoarabinomannan-based diagnostic tests of active tuberculosis. Further research, including prospective and operational studies on the FujiLAM assay in settings of intended use and in additional patient populations, including outpatients with HIV, populations without HIV, and paediatric populations, are needed.

Introduction

Tuberculosis is the leading infectious cause of death globally and remains the most common cause of mortality in people with HIV, causing an estimated 300 000 deaths in 2017.1 Most tuberculosis-related deaths in people with HIV could be prevented with earlier diagnosis and treatment.1 Extrapulmonary tuberculosis is more common in people with HIV who are severely immunocompromised than in immunocompetent people and therefore sputum might not represent the ideal diagnostic sample.2 Furthermore, independent of location of disease, producing a sputum sample is often difficult for patients with advanced HIV who are severely ill. As a result, non-sputum-based tests have been identified as an urgent unmet clinical need by WHO.3

The Alere Determine TB LAM Ag assay (AlereLAM; Abbott, Chicago, IL, USA) detects the presence of the mycobacterial cell wall component, lipoarabinomannan (LAM), in a urine sample. However, in a meta-analysis by WHO,4,5 the sensitivity of this test was only 45% in people with HIV, with higher sensitivity (56%) in patients with CD4 counts equal to or less than 100 cells per µL. Despite suboptimal sensitivity, the test reduces mortality when implemented for immunocompromised hospital inpatients with HIV.6 On this basis, WHO recommends the use of AlereLAM for people with HIV and a CD4 count equal to or less than 100 cells per µL and in those defined as seriously ill according to WHO criteria (respiratory rate >30 breaths per min, body temperature >39°C, heart rate >120 beats per min, or unable to walk unaided).4 A more sensitive, rapid urine-based test could widen the indication for testing and improve the diagnosis of tuberculosis and associated outcomes in people with HIV.3

A novel urine-based assay, Fujifilm SILVAMP TB LAM (FujiLAM; Fujifilm, Tokyo, Japan), has been developed that also detects lipoarabinomannan on an instrument-free platform, with results available in less than 1 h. This assay combines a pair of high affinity monoclonal antibodies directed towards largely Mycobacterium tuberculosis-specific lipoarabinomannan epitopes7–9 and a silver-amplification step10 that increases the visibility of test and control lines. This enables the detection of urinary lipoarabinomannan concentrations that are approximately 30 times lower than that detected by AlereLAM and improved analytical specificity compared with AlereLAM, which in contrast, uses conventional lateral flow immunoassay technology and polyclonal antibodies.7

In this study, we aimed to assess the diagnostic accuracy of FujiLAM for the detection of active tuberculosis compared with AlereLAM in three cohorts of hospital inpatients with HIV, in whom the AlereLAM assay is recommended for use.

Methods

Study participants

For more on the FIND Specimen Bank see https://www.finddx.org/specimen-bank/

In this diagnostic accuracy study, we assessed urine samples from the FIND Specimen Bank and the University of Cape Town Biobank obtained from inpatients (aged ≥18 years) with HIV, collected in three independent prospective cohort studies (two unpublished and one published2,11) done at two district hospitals in South Africa (appendix). These cohorts were selected for inclusion on the basis of the availability of frozen urine samples for a full cohort of hospital inpatients with HIV in tuberculosis endemic settings, in whom a comprehensive work-up was done to identify tuberculosis or alternative diagnoses. Standard national guidelines12 for tuberculosis and HIV management were used across all three cohorts. Cohort 1 included adults with tuberculosis symptoms who were able to produce sputum, and were enrolled regardless of CD4 count on admission to Khayelitsha Hospital (Cape Town, South Africa) between Feb 22, 2017, and August 31, 2017. Patients with extrapulmonary disease without respiratory symptoms were excluded. Cohort 2 included adults with HIV who were admitted to medical


wards at GF Jooste Hospital (Cape Town, South Africa) between June 6, 2012, and Oct 4, 2013, regardless of CD4 count, their ability to produce sputum, or whether or not they reported tuberculosis symptoms.2,11 Study staff systematically attempted to collect urine, blood, and two sputum samples for testing within 24 h of hospital admission. Cohort 3 included adults with HIV with a CD4 count equal to or less than 350 cells per µL in whom tuberculosis was considered the most likely diagnosis at presentation, who were admitted to Khayelitsha Hospital between Jan 16, 2014, and Oct 19, 2016. All cohorts excluded patients who were already receiving tuberculosis therapy (appendix). In cohorts 1 and 2, enrolment was done consecutively. In cohort 3, patients were randomly enrolled using a dice on a daily basis after all potentially eligible patients were identified. In all cohorts, patients were enrolled on admission to hospital. Sputum, blood, and urine specimens for M tuberculosis reference standard testing were collected at enrolment and additional clinical samples were obtained during hospital admission and at follow-up. Follow-up was 8 weeks for cohort 1, and 12 weeks for cohorts 2 and 3 (appendix).

All studies were approved by the Human Research Ethics Committee of the University of Cape Town (Cape Town, South Africa). Written informed consent was obtained from patients, as per study protocols. Study participation did not affect standard of care. This study is reported in accordance with the Standards for Reporting of Diagnostic Accuracy Studies guidelines.13 Retrospective urine lipoarabinomannan testing was supervised by the study sponsor (FIND, Geneva, Switzerland) and was done at the University of Cape Town.

Procedures

Frozen urine aliquots of unprocessed urine were thawed to ambient temperature and mixed manually. Samples that were not immediately used for testing were stored at 4°C for a maximum of 4 h.

AlereLAM testing was done according to the manufacturer's instructions. Briefly, 60 µL urine was applied to the sample pad. After 25 min, test strips were read using the test's reference scale card for grading. In parallel, testing with the FujiLAM was done according to manufacturer's instructions using urine from the same aliquots. The five-step test procedure (figure 1; video) took 50–60 min from sample collection to result. Briefly, urine was added to the reagent tube up to the indicator line (approximately 200 µL), mixed, and incubated for 40 min at ambient temperature. After mixing again, two drops of urine were added to the test strip. Following this, button 2 was immediately pressed to release a reducing agent for silver amplification. After the Go Next colour indicator mark turned orange (within 3–10 min), button 3 was pressed to release a silver-ion solution to activate the silver amplification reaction. The result was read within 10 min. The FujiLAM assay does not use a reference scale card and any line identified on the test line was deemed positive.

Both AlereLAM and FujiLAM were independently read by two readers masked to the test results of the index or comparator test, respectively, patient status, and all other test results. After interpretation of the initial independent test, readers compared results and, in case of discordance, re-inspected the test strip to establish the final consensus result (through mutual agreement) that was used for analysis. In the case of FujiLAM or AlereLAM assay failure, the test was repeated once.

For reference standard testing, the specimens were processed using standardised protocols at centralised accredited laboratories of the South African National Health Laboratory Service (Cape Town, South Africa). The number of samples tested and testing flow for each cohort are shown in the appendix.

Sputum collection in all cohorts was done by an experienced nurse or trained clinical research worker and sputum induction was done, when required, as described previously.11 Reference standard testing was done on all available sputum specimens and included Xpert MTB/RIF (Xpert; Cepheid, Sunnyvale, USA; testing pre-dated rollout of Xpert Ultra MTB/RIF), smear fluorescence microscopy after auramine O staining, mycobacteria growth indicator tube liquid culture (Becton Dickinson, Franklin Lakes, NJ, USA) and solid culture on Löwenstein-Jensen medium. The presence of M tuberculosis complex in solid and liquid culture was confirmed with MPT64 antigen detection or MTBDRplus line probe assays (Hain Lifesciences, Nehren, Germany). Blood cultures from all participants were done in BACTEC Myco/F Lytic culture vials (Becton Dickinson) and WHO prequalified in-vitro diagnostic tests were used for HIV testing (rapid diagnostic tests) and CD4 cell counting (flow cytometry). For urinary Xpert testing, 20–40 mL urine was centrifuged and following removal of supernatant the pellet was re-suspended in the residual urine volume and a 0·75 mL sample was used for testing.2 For cohorts 2 and 3, additional respiratory and non-respiratory samples such as pleural fluid, cerebrospinal fluid, and tissue fine needle aspirates were obtained, when clinically indicated, and tested using MGIT culture or Xpert. Clinical information and FujiLAM and AlereLAM results were not available to the assessors of the reference standard at the time of testing.

Before data analysis, clinical investigators, who were masked to index test results, categorised patients as having definite tuberculosis, possible tuberculosis, not tuberculosis, and unclassifiable using a combination of clinical and laboratory findings (appendix). Patients with definite tuberculosis had microbiologically confirmed M tuberculosis (any culture or any Xpert positive result for M tuberculosis during admission). Patients defined as not-tuberculosis had negative microscopy, cultures, and Xpert test results for M tuberculosis (and at least one non-contaminated negative culture result), had not started

tuberculosis treatment, and were alive or had improvement in clinical tuberculosis symptoms at 3 months' follow-up. Patients defined as possible tuberculosis did not satisfy the criteria for definite tuberculosis, but had clinical or radiological features suggestive of tuberculosis and were started on tuberculosis treatment. Patients that did not fall into any of these categories were defined as unclassifiable and were removed from the main analyses (appendix).

[Figure 1 panels. Tuberculosis test device: sample port, button 2 (releases reducing reagent for amplification), button 3 (releases silver ion reagent for amplification), Go-next colour indicator, and control and test line reading window. Tuberculosis test procedure (60 min from sample collection to result): 1. Add urine to the tube; 2. Incubate for 40 min at position 1; 3. Add two drops and press 2; 4. On orange press 3; 5. Interpret result. Tuberculosis test principle: Au-conjugated primary antibody captures MTX-LAM in patient urine; formation of the sandwich immune-complex through binding to the immobilised secondary antibody; silver formation around the Au particle (gold particle 0·05 µm, silver particle 10 µm) amplifies band intensity.]

Figure 1: Fujifilm SILVAMP TB LAM test device, procedure, and principle
One antibody binds to tetra-arabinoside and hexa-arabinoside structures in the arabinan domain of lipoarabinomannan and the other antibody targets MTX-Man capping motifs of lipoarabinomannan (MTX-Man refers to mannose caps further modified with a 5-methylthio-D-xylofuranose residue).7–9 Au=gold. C=control line. MTX-LAM=5-methylthio-D-xylofuranose-lipoarabinomannan. T=test line.

Statistical analysis

For the primary analysis, we calculated the point estimates and 95% CIs for the sensitivity, specificity, positive predictive value, negative predictive value, positive likelihood ratio, and negative likelihood ratios of FujiLAM and AlereLAM assays by comparison with a microbiological reference standard and a composite reference standard. Definite tuberculosis versus not-tuberculosis diagnostic classifications were used to allocate patients into reference standard positive versus reference standard negative groups. The possible tuberculosis group was deemed negative within a microbiological reference standard but positive within a composite reference standard. Diagnostic accuracy was determined separately for each cohort as per protocol. In a sensitivity analysis, unclassifiable patients were included to assess the effect of exclusions on diagnostic accuracy (appendix). Heterogeneity between cohorts was assessed using Cochran's Q test (appendix).15

We did a post-hoc analysis to estimate pooled sensitivity and specificity across cohorts and CD4 strata, using a Bayesian bivariate random-effects model to account for differences in study design.16 Simple pooling estimates, as planned a priori, are presented in the appendix with 95% CIs based on Wilson's score method.17 The 95% CIs of the sensitivity and specificity differences of the three cohorts for FujiLAM compared with AlereLAM were computed using Tango's score method.18 The difference between two tests was considered to be significant if the 95% CIs did not overlap. Cohen's κ statistic19 was used to calculate agreement of positive and negative results between the two independent readers of the lipoarabinomannan tests.

In an additional post-hoc analysis, we used the total number of microbiologically confirmed tuberculosis patients (defined as the detection of M tuberculosis by culture or Xpert in at least one clinical specimen of any type) to calculate the comparative diagnostic yield of a single FujiLAM, AlereLAM, sputum Xpert (version G4), urine Xpert, and sputum smear microscopy test from samples collected within the first 24 h of presentation (appendix). This analysis only included patient samples from cohort 2, because the systematic collection of blood, urine, and two sputum diagnostic samples was attempted whenever possible in all patients in this cohort within the first 24 h of admission.

The study protocol and statistical analysis plan are available in the appendix. All data analysis was done using R (version 3.5.1) and Matlab (version 2017b).

Role of the funding source

The funders of the study had no role in study design, data collection, data analysis, data interpretation, or writing of the manuscript. The corresponding author had full access to all the data in the study and had final responsibility for the decision to submit for publication.

Results

We evaluated urine samples between April 18, 2018, and May 3, 2018. Of the 1840 patients included in the three independent cohort studies, 1188 patients were eligible for retrospective urinary lipoarabinomannan testing. Of the 1188 eligible patients, 220 patients were excluded from the main analysis because of unavailability of a urine sample (n=93), failed FujiLAM tests (n=6), or unclassifiable diagnostic status (n=121; figure 2). The


primary reasons for which patients were deemed unclassifiable were death before diagnosis (n=62) and loss to follow-up where a vital status or an improvement in clinical status was required for diagnosis (n=17; appendix). Of the 968 patients included in the main analysis (96 patients from cohort 1, 364 patients from cohort 2, and 508 patients from cohort 3), 600 (62%) were classified as definite tuberculosis, 91 (9%) as possible tuberculosis, and 277 (29%) as not-tuberculosis (table; appendix). The microbiological reference standard for tuberculosis diagnosis was based on a total of 6397 culture and Xpert tests (mean 6·2 tests per patient) and included 3261 tests on sputum samples and 3136 tests on non-sputum samples (appendix). 236 (24%) of 968 patients could not provide a sputum sample. Definite tuberculosis diagnosis was based on the results from non-sputum samples for 117 (20%) of 600 patients. Most patients were young immunocompromised adults (median age 35 years [IQR 30–42]), with a median CD4 count of 113 cells per µL (IQR 40–262) in cohort 1, 153 cells per µL (53–313) in cohort 2, and 59 cells per µL (23–122) in cohort 3. 439 (45%) of 968 patients had a history of previous tuberculosis treatment and all patients in cohort 1 and 3 and 329 (90%) of 364 patients in cohort 2 had a positive WHO symptom screen for tuberculosis.

[Figure 2 flow diagram]
1840 potentially eligible patients from three independent cohort studies screened
    140 patients included in cohort 1
    1018 patients included in cohort 2
    682 patients included in cohort 3
652 patients not eligible
    31 patients from cohort 1 excluded
        28 HIV negative
        3 HIV status unknown
    598 patients from cohort 2 excluded
        404 HIV negative
        5 HIV status unknown
        165 pre-existing tuberculosis
        15 refused
        5 transferred to another hospital
        3 readmissions
        1 died
    23 patients from cohort 3 excluded
        1 HIV negative
        18 CD4 count >350 cells per µL
        2 CD4 count unknown
        2 withdrew consent
1188 patients with HIV eligible for retrospective urinary lipoarabinomannan testing
    109 patients from cohort 1
    420 patients from cohort 2
    659 patients from cohort 3
220 patients excluded
    13 patients from cohort 1
        10 unclassifiable
        3 no urine sample
    56 patients from cohort 2
        46 unclassifiable
        9 no urine sample
        1 missing index test
    151 patients from cohort 3
        65 unclassifiable
        81 no urine sample
        5 missing index test
968 patients eligible and tested
    96 patients from cohort 1
        47 definite tuberculosis
        3 possible tuberculosis
        46 not tuberculosis
    364 patients from cohort 2
        138 definite tuberculosis
        37 possible tuberculosis
        189 not tuberculosis
    508 patients from cohort 3
        415 definite tuberculosis
        51 possible tuberculosis
        42 not tuberculosis
600 definite tuberculosis, 91 possible tuberculosis, 277 not tuberculosis
    691 CRS positive, 277 CRS negative
    600 MRS positive, 368 MRS negative

Figure 2: Study flow diagram
Details of patients are provided in the appendix. CRS=composite reference standard. MRS=microbiological reference standard.

Overall, compared with the microbiological reference standard, the sensitivity of FujiLAM was 70·4% (95% CI 53·0–83·1) and 42·3% (31·7–51·8) for AlereLAM (difference 28·1%; figure 3). In comparison to the microbiological reference standard, the highest FujiLAM sensitivity was observed in cohort 3 (81·0% [76·9–84·5]), which enrolled patients with more advanced HIV-related immunosuppression (ie, more patients with a CD4 count less than 100 cells per µL) than cohort 2 (65·9% [57·7–73·3]) and cohort 1 (59·6% [45·3–72·4]). The sensitivity of both assays was higher in patients with lower CD4 counts. In patients with a CD4 count less than 100 cells per µL, FujiLAM had a sensitivity of 84·2% (71·4–91·4) compared with 57·3% (42·2–69·6) for AlereLAM (difference 26·9%). For patients with a CD4 count of more than 200 cells per µL sensitivity was 44·0% (29·7–58·5) for FujiLAM and 12·2% (4·6–23·7) for AlereLAM (difference 31·8%).

Using the composite reference standard, the sensitivity of both assays was slightly lower than that compared with the microbiological reference standard (64·9% [50·1–76·7] for FujiLAM vs 38·2% [28·1–47·3] for AlereLAM; difference 26·7%). Since the 95% CIs of the differences around sensitivity between FujiLAM and AlereLAM did not overlap, FujiLAM was considered to have significantly higher sensitivity than AlereLAM for all analyses, with the exception of cohort 1, in which the 95% CIs overlapped due to the small sample size (figure 3).

Compared with the microbiological reference standard, the specificity of FujiLAM was 90·8% (95% CI 86·0 to 94·4) and 95·0% (87·7 to 98·8) for AlereLAM, with no significant difference (–4·2%). Using the composite reference standard, overall estimates of specificity were increased to

95·7% (92·0 to 98·0) for FujiLAM and 98·2% (95·7 to 99·6) for AlereLAM, with no significant difference (–2·5%). Using the composite reference standard, specificity of the FujiLAM assay was lower among patients with CD4 counts equal to or less than 100 cells per µL (91·2%, 95% CI 83·1 to 96·3) than those with CD4 counts of 101–200 cells per µL (97·8% [95% CI 90·6 to 99·9]) and higher than 200 cells per µL (98·2% [93·8 to 99·9]; figure 3). Eight of the 11 FujiLAM false positive samples, using the composite reference standard, were from patients with CD4 counts equal to or less than 100 cells per µL. Additional information on the 11 FujiLAM false positive results is available in the appendix.

Using the composite reference standard, the positive predictive value for the three different cohorts ranged from 90·6–99·4% for FujiLAM and 93·8–100·0% for AlereLAM. The negative predictive value ranged from 24·8–71·8% for FujiLAM and 13·7–62·5% for AlereLAM. Positive likelihood ratios ranged from 8·9–18·5 for FujiLAM and 13·8–17·3 for AlereLAM and negative likelihood ratios ranged from 0·3–0·4 for FujiLAM and 0·6–0·7 for AlereLAM (appendix).

Table: Demographic and clinical characteristics
Characteristic | Cohort 1 (n=96) | Cohort 2 (n=364) | Cohort 3 (n=508) | All patients (n=968)
Age, years | 35 (31–43) | 36 (29–42) | 35 (30–43) | 35 (30–42)
Sex
  Women | 51 (53%) | 218 (60%) | 254 (50%) | 523 (54%)
  Men | 45 (47%) | 146 (40%) | 254 (50%) | 445 (46%)
Positive WHO tuberculosis symptom screen | 96 (100%) | 329 (90%) | 508 (100%) | 933 (96%)
History of tuberculosis | 52 (54%) | 162 (45%) | 225 (44%) | 439 (45%)
Antiretroviral therapy | 64 (67%) | 153 (42%) | 177 (35%) | 394 (41%)
CD4 count, cells per µL | 113 (40–262) | 153 (53–313) | 59 (23–122) | 86 (33–190)
Diagnosis
  Definite tuberculosis | 47 (49%) | 138 (38%) | 415 (82%) | 600 (62%)
  Possible tuberculosis | 3 (3%) | 37 (10%) | 51 (10%) | 91 (9%)
  Not tuberculosis | 46 (48%) | 189 (52%) | 42 (8%) | 277 (29%)
CD4 count, cells per µL
  0–100 | 44 (46%) | 135 (37%) | 337 (66%) | 516 (53%)
  101–200 | 19 (20%) | 82 (23%) | 115 (23%) | 216 (22%)
  >200 | 30 (31%) | 145 (40%) | 56 (11%) | 231 (24%)
  Unknown | 3 (3%) | 2 (1%) | 0 | 5 (1%)
Outcome at 3 months
  Died within 3 months | 1 (1%) | 19 (5%) | 85 (17%) | 105 (11%)
  Alive | 58 (60%) | 336 (92%) | 416 (82%) | 810 (84%)
  Lost to follow-up | 0 | 9 (2%) | 7 (1%) | 16 (2%)
  No follow-up | 37 (39%) | 0 | 0 | 37 (4%)
Data are median (IQR), or n (%).

All patients in cohort 2 (n=420) were eligible for the analysis of diagnostic yield (figure 2). Among the 420 eligible patients, only 153 (36%) could produce a sputum sample within the first 24 h of admission, whereas 418 (100%) were able to provide a urine sample, as described previously11 (appendix). 141 (34%) of 420 patients had microbiologically confirmed tuberculosis. 84 (60%) of 141 tuberculosis cases could be diagnosed with rapid tests using samples collected in the first 24 h of admission: 37 (26%) from sputum Xpert and 59 (42%) from urine Xpert using 1 mL urine. 57 (40%) of 141 tuberculosis diagnoses were not achieved in the first 24 h and were established by mycobacterial culture on any specimen collected at any point during patient admission, diagnosed by Xpert using concentrated samples from 20–40 mL urine or diagnosed by Xpert testing of specimens collected after the first 24 h. The additional specimens collected for culture and Xpert testing included ascitic fluid, blood, urine, sputum, cerebrospinal fluid, gastric lavage, pus, or pleural fluid (appendix).

Figure 4 shows the diagnostic yield of FujiLAM and AlereLAM compared with other rapid diagnostic tests done within the first 24 h of hospital admission. 91 (65%) of 141 tuberculosis cases could have been diagnosed within a few hours of presentation with FujiLAM, compared with 61 (43%) of 141 cases with AlereLAM. A combination of sputum Xpert and FujiLAM within the first 24 h of admission would have been able to diagnose 102 (72%) of 141 microbiologically confirmed cases. A combination of sputum smear microscopy and FujiLAM would have yielded 98 (70%) of 141 diagnoses.

Overall, 18 (2%) of 1095 FujiLAM tests failed on the first attempt. Of the 15 tests that could be repeated, three failed on the second attempt resulting in a total overall error rate of 1·9% (21 of 1110 tests). FujiLAM failure rates are summarised in the appendix. The error rate for AlereLAM on the first attempt was 0·4% (four of 1095 tests) and all four repeat tests provided a result on the second attempt.

Inter-reader agreement was high for both FujiLAM and AlereLAM tests (appendix): 97·0% (938 of 967 reads; κ coefficient 0·94) for FujiLAM, and 96·7% (934 of 966 reads; κ coefficient 0·92) for AlereLAM.

Discussion

In this study of 968 hospital inpatients with HIV in a high-burden setting, the FujiLAM point-of-care assay identified a significantly higher proportion of patients with tuberculosis than did the AlereLAM assay, while maintaining comparable specificity. In all sub-analyses, the sensitivity of FujiLAM was significantly higher (range 22–35%) than AlereLAM. FujiLAM had the highest sensitivity (84·2%) in patients with the highest risk of mortality (CD4 ≤100 cells per µL), which was 26·9% higher than that with AlereLAM. Combined with sputum Xpert, FujiLAM could diagnose nearly three-quarters of microbiologically confirmed tuberculosis within 24 h of hospital admission. The meta-analysis4,5 that formed the basis of WHO recommendations for the use of AlereLAM reported an overall sensitivity of 45% in patients with HIV, which is similar to the AlereLAM sensitivity observed in this study (42%), suggesting that our study


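Panel A: all cohorts combined
Reference standard | Test | n | TP | FP | FN | TN | Sensitivity (95% CI) | Specificity (95% CI)
MRS | FujiLAM | 968 | 455 | 33 | 145 | 335 | 70·4% (53·0 to 83·1) | 90·8% (86·0 to 94·4)
MRS | AlereLAM | 968 | 268 | 18 | 332 | 350 | 42·3% (31·7 to 51·8) | 95·0% (87·7 to 98·8)
MRS | Difference | | | | | | 28·1% | –4·2%
CRS | FujiLAM | 968 | 477 | 11 | 214 | 266 | 64·9% (50·1 to 76·7) | 95·7% (92·0 to 98·0)
CRS | AlereLAM | 968 | 281 | 5 | 410 | 272 | 38·2% (28·1 to 47·3) | 98·2% (95·7 to 99·6)
CRS | Difference | | | | | | 26·7% | –2·5%

Panel B: by cohort
Reference standard | Cohort | Test | n | TP | FP | FN | TN | Sensitivity (95% CI) | Specificity (95% CI)
MRS | Cohort 1 | FujiLAM | 96 | 28 | 4 | 19 | 45 | 59·6% (45·3 to 72·4) | 91·8% (80·8 to 96·8)
MRS | Cohort 1 | AlereLAM | 96 | 15 | 1 | 32 | 48 | 31·9% (20·4 to 46·2) | 98·0% (89·3 to 99·6)
MRS | Cohort 1 | Difference | | | | | | 27·7% (16·9 to 41·8) | –6·2% (–17·6 to 3·9)
MRS | Cohort 2 | FujiLAM | 364 | 91 | 18 | 47 | 208 | 65·9% (57·7 to 73·3) | 92·0% (87·8 to 94·9)
MRS | Cohort 2 | AlereLAM | 364 | 61 | 7 | 77 | 219 | 44·2% (36·2 to 52·5) | 96·9% (93·7 to 98·5)
MRS | Cohort 2 | Difference | | | | | | 21·7% (14·7 to 29·7) | –4·9% (–9·3 to –1·0)
MRS | Cohort 3 | FujiLAM | 508 | 336 | 11 | 79 | 82 | 81·0% (76·9 to 84·5) | 88·2% (80·1 to 93·3)
MRS | Cohort 3 | AlereLAM | 508 | 192 | 10 | 223 | 83 | 46·3% (41·5 to 51·1) | 89·2% (81·3 to 94·1)
MRS | Cohort 3 | Difference | | | | | | 34·7% (30·1 to 39·5) | –1·0% (–9·4 to 7·2)
CRS | Cohort 1 | FujiLAM | 96 | 29 | 3 | 21 | 43 | 58·0% (44·2 to 70·6) | 93·5% (82·5 to 97·8)
CRS | Cohort 1 | AlereLAM | 96 | 15 | 1 | 35 | 45 | 30·0% (19·1 to 43·8) | 97·8% (88·7 to 99·6)
CRS | Cohort 1 | Difference | | | | | | 28·0% (17·5 to 41·7) | –4·3% (–15·8 to 5·9)
CRS | Cohort 2 | FujiLAM | 364 | 103 | 6 | 72 | 183 | 58·9% (51·5 to 65·9) | 96·8% (93·2 to 98·5)
CRS | Cohort 2 | AlereLAM | 364 | 64 | 4 | 111 | 185 | 36·6% (29·8 to 43·9) | 97·9% (94·7 to 99·2)
CRS | Cohort 2 | Difference | | | | | | 22·3% (15·8 to 29·4) | –1·1% (–4·9 to 2·6)
CRS | Cohort 3 | FujiLAM | 508 | 345 | 2 | 121 | 40 | 74·0% (69·9 to 77·8) | 95·2% (84·2 to 98·7)
CRS | Cohort 3 | AlereLAM | 508 | 202 | 0 | 264 | 42 | 43·3% (38·9 to 47·9) | 100·0% (91·6 to 100)
CRS | Cohort 3 | Difference | | | | | | 30·7% (26·2 to 35·3) | –4·8% (–15·8 to 4·0)

Panel C: by CD4 count
Reference standard | CD4 count | Test | n | TP | FP | FN | TN | Sensitivity (95% CI) | Specificity (95% CI)
MRS | 0–100 cells per μL | FujiLAM | 516 | 332 | 20 | 49 | 115 | 84·2% (71·4 to 91·4) | 85·0% (75·8 to 91·7)
MRS | 0–100 cells per μL | AlereLAM | 516 | 221 | 8 | 160 | 127 | 57·3% (42·2 to 69·6) | 94·1% (88·3 to 97·7)
MRS | 0–100 cells per μL | Difference | | | | | | 26·9% | –9·1%
MRS | 101–200 cells per μL | FujiLAM | 216 | 83 | 9 | 49 | 75 | 60·6% (44·4 to 72·5) | 89·6% (78·5 to 98·1)
MRS | 101–200 cells per μL | AlereLAM | 216 | 35 | 7 | 97 | 77 | 26·4% (15·2 to 38·9) | 92·8% (69·2 to 99·9)
MRS | 101–200 cells per μL | Difference | | | | | | 34·2% | –3·2%
MRS | >200 cells per μL | FujiLAM | 231 | 37 | 4 | 46 | 144 | 44·0% (29·7 to 58·5) | 97·0% (92·5 to 99·3)
MRS | >200 cells per μL | AlereLAM | 231 | 10 | 3 | 73 | 145 | 12·2% (4·6 to 23·7) | 97·2% (88·4 to 100)
MRS | >200 cells per μL | Difference | | | | | | 31·8% | –0·2%
CRS | 0–100 cells per μL | FujiLAM | 516 | 344 | 8 | 76 | 88 | 80·6% (72·0 to 86·7) | 91·2% (83·1 to 96·3)
CRS | 0–100 cells per μL | AlereLAM | 516 | 226 | 3 | 194 | 93 | 53·1% (40·7 to 63·6) | 96·6% (91·0 to 99·3)
CRS | 0–100 cells per μL | Difference | | | | | | 27·5% | –5·4%
CRS | 101–200 cells per μL | FujiLAM | 216 | 91 | 1 | 66 | 58 | 55·7% (39·9 to 67·6) | 97·8% (90·6 to 99·9)
CRS | 101–200 cells per μL | AlereLAM | 216 | 41 | 1 | 116 | 58 | 25·3% (15·3 to 35·6) | 97·8% (90·6 to 99·9)
CRS | 101–200 cells per μL | Difference | | | | | | 30·4% | 0·0%
CRS | >200 cells per μL | FujiLAM | 231 | 39 | 2 | 71 | 119 | 35·5% (22·4 to 50·4) | 98·2% (93·8 to 99·9)
CRS | >200 cells per μL | AlereLAM | 231 | 12 | 1 | 98 | 120 | 11·3% (2·3 to 28·7) | 98·9% (95·0 to 100)
CRS | >200 cells per μL | Difference | | | | | | 24·2% | –0·7%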

Figure 3: Sensitivity and specificity of FujiLAM versus AlereLAM against MRS and CRS Sensitivity and specificity of FujiLAM and AlereLAM assays for all cohorts combined (A), by cohort (B), and by CD4 count (C). Sensitivity and specificity estimates for (A) and (C) were based on analysis using a bivariate random-effects model. AlereLAM=Alere Determine TB LAM Ag assay. CRS=composite reference standard. FP=false positive. FN=false negative. FujiLAM=Fujifilm SILVAMP TB LAM assay. MRS=microbiological reference standard. TP=true positive. TN=true negative.
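For readers working through the appraisal questions on results and likelihood ratios, the per-cohort 2×2 counts (TP, FP, FN, TN) in Figure 3 are enough to recompute the accuracy measures by hand. The following is a minimal Python sketch, not the authors' code (the paper reports using R and Matlab). The names `TwoByTwo` and `wilson_interval` are hypothetical, and the use of a Wilson score interval for the per-cohort estimates is an assumption based on the method named in the statistical analysis section.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wilson_interval(successes: int, total: int, z: float = Z_95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed CI method)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


@dataclass(frozen=True)
class TwoByTwo:
    """Counts of an index test against a reference standard (hypothetical helper)."""
    tp: int
    fp: int
    fn: int
    tn: int

    def sensitivity(self) -> float:
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        return self.tn / (self.tn + self.fp)

    def ppv(self) -> float:
        return self.tp / (self.tp + self.fp)

    def npv(self) -> float:
        return self.tn / (self.tn + self.fn)

    def lr_positive(self) -> float:
        # LR+ = sensitivity / (1 - specificity); infinite when there are no false positives
        false_positive_rate = 1 - self.specificity()
        return math.inf if false_positive_rate == 0 else self.sensitivity() / false_positive_rate

    def lr_negative(self) -> float:
        # LR- = (1 - sensitivity) / specificity
        return (1 - self.sensitivity()) / self.specificity()

    def sensitivity_ci(self) -> Tuple[float, float]:
        return wilson_interval(self.tp, self.tp + self.fn)

    def specificity_ci(self) -> Tuple[float, float]:
        return wilson_interval(self.tn, self.tn + self.fp)


if __name__ == "__main__":
    # FujiLAM, cohort 2, microbiological reference standard (counts from Figure 3B)
    fujilam = TwoByTwo(tp=91, fp=18, fn=47, tn=208)
    lo, hi = fujilam.sensitivity_ci()
    print(f"Sensitivity {fujilam.sensitivity():.1%} ({lo:.1%} to {hi:.1%})")
    lo, hi = fujilam.specificity_ci()
    print(f"Specificity {fujilam.specificity():.1%} ({lo:.1%} to {hi:.1%})")
    print(f"LR+ {fujilam.lr_positive():.2f}, LR- {fujilam.lr_negative():.2f}")
```

With the cohort 2 counts against the MRS, this gives a sensitivity of 65·9% (57·7 to 73·3) and a specificity of 92·0% (87·8 to 94·9), in line with the values in Figure 3B, and likelihood ratios of about 8·3 (positive) and 0·37 (negative). The pooled estimates in panels A and C come from a Bayesian bivariate random-effects model and cannot be reproduced from the summed counts this way.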


[Figure 4 panels: Venn diagrams of diagnoses by test; the per-region counts are not recoverable from the extracted text]
Diagnostic yield | A: Patients with microbiologically confirmed tuberculosis (n=141) | B: Patients with microbiologically confirmed tuberculosis and CD4 ≤100 cells per µL (n=74)
Urine FujiLAM | 64·5% (91/141) | 79·7% (59/74)
Urine AlereLAM | 43·3% (61/141) | 64·9% (48/74)
Urine Xpert | 41·8% (59/141) | 58·1% (43/74)
Sputum Xpert | 26·2% (37/141) | 24·3% (18/74)
Sputum smear microscopy | 19·1% (27/141) | 18·9% (14/74)
Tuberculosis cases missed | 18·4% (26/141) | 9·5% (7/74)

Figure 4: Diagnostic yields for cohort 2
Number of microbiologically confirmed tuberculosis diagnoses in all patients (A), and patients with CD4 counts of equal to or less than 100 cells per µL (B), detected by each diagnostic test on samples obtained within 24 h of hospital admission. Numbers represent the number of tuberculosis cases diagnosed by a given assay or assays. Tuberculosis cases missed includes diagnoses made by positive mycobacterial culture on any specimen collected at any point during patient admission or diagnoses made on the basis of Xpert testing of any specimen collected after the first 24 h of hospital admission. AlereLAM=Alere Determine TB LAM Ag assay. FujiLAM=Fujifilm SILVAMP TB LAM assay.

populations were similar to the populations included in the WHO meta-analysis. Collectively, these results suggest that, if implemented in clinical practice and linked with appropriate treatment, the FujiLAM point-of-care assay might be able to save lives by enabling earlier diagnosis of HIV-associated tuberculosis in a large proportion of hospital inpatients.6,20,21

The point estimates of FujiLAM specificity were lower than those for AlereLAM. Although the differences in specificity between FujiLAM and AlereLAM were not significant, the reduced specificity of both AlereLAM and FujiLAM could be partly explained by the use of an imperfect reference standard that lacks complete sensitivity. The existing reference standard is especially limited in its ability to identify tuberculosis in immunocompromised patients with HIV,22 since these patients are more likely to have paucibacillary disease or extrapulmonary tuberculosis than immunocompetent patients, making diagnosis more difficult. An imperfect reference standard could disproportionally affect a more sensitive test and result in increased false positives (ie, lower specificity with the more sensitive FujiLAM assay). The decreasing specificity observed with decreasing CD4 cell count in this study and the improved specificity observed with the composite reference standard in comparison to the microbiological reference standard further support this explanation. Cross-reactivity of the antibodies used in FujiLAM to common urinary tract pathogens and fast-growing non-tuberculous mycobacteria has been excluded in previous studies.7

Our study has limitations that indicate further research is warranted. FujiLAM testing was done in a research laboratory setting using biobanked specimens collected from hospital inpatients. Although no technical reason exists as to why the test would perform differently in fresh versus frozen samples, this needs to be investigated. A previous study23 suggests that early morning urine collection could further improve the sensitivity of urinary lipoarabinomannan-based tuberculosis testing. This aspect could have important implications for clinical practice and should be addressed in future studies. FujiLAM has the potential to be implemented as a true point-of-care assay, but the feasibility of this approach and its effect on patient outcomes requires prospective assessment in relevant clinical settings. Both AlereLAM and FujiLAM cannot discern drug-resistant tuberculosis from drug-sensitive tuberculosis and therefore it is important that these rapid diagnostic tools are supplemented with sample collection for drug susceptibility testing.

Difficulties with regard to the assignment of diagnostic categories have been reported in previous literature24 and 10% of all eligible patients could not be classified in this study. The higher sensitivity of FujiLAM compared with AlereLAM was maintained when the unclassifiable group was included in a sensitivity analysis (appendix). 18 of the 121 unclassifiable patients had positive FujiLAM results and nine of these 18 patients died within 3 months of enrolment. Six of the nine FujiLAM positive patients who died were not started on tuberculosis treatment; assuming the assay had 100% specificity, these patients might have been true positives and thus might not have died if they were treated for tuberculosis.

The inclusion of patients from three cohorts from similar inpatient settings with different pretest probability of tuberculosis (appendix) provided an overview of the performance of FujiLAM in hospital inpatients with advanced HIV. However, the heterogeneity of these cohorts is also a limitation. As a result, we presented results per cohort and by CD4 strata. Grouped analysis with a Bayesian bivariate random-effects model was used to account for heterogeneity across the cohorts.

Renal tuberculosis infection has been proposed as the main cause of urinary lipoarabinomannan antigenuria and positive AlereLAM results.25 However, in this study, the more sensitive FujiLAM assay detected a number of urine Xpert-negative patients in whom renal tuberculosis is unlikely since Xpert would detect intact M tuberculosis bacteria in urine. This suggests that other mechanisms, such as passage of lipoarabinomannan or lipoarabinomannan fragments through the glomerular basement membrane, potentially exacerbated by HIV-associated nephropathy, might be more important than originally proposed. This is supported by findings from our previous study,26 which showed that blood and urine


lipoarabinomannan concentrations correlate in patients with tuberculosis, and that of a 2018 study,7 which found that low urine lipoarabinomannan concentrations are detectable in immunocompetent HIV-negative patients with tuberculosis. The use of the more sensitive FujiLAM assay could help to increase the mechanistic understanding of how lipoarabinomannan enters urine.

The diagnostic yield analysis showed that only a minority of patients were able to produce sputum within the first 24 h of admission. Although this has been demonstrated in other studies of hospital inpatients,27,28 the percentage of patients able to provide sputum was particularly low in cohort 2,11 despite substantial efforts to obtain the sample by a trained nurse, including access to sputum induction facilities. The inability to provide a sputum sample is likely a reflection of the severity of illness in this cohort and will be less pronounced in outpatients with HIV and tuberculosis symptoms, who also typically have higher CD4 cell counts.29 Studies of tuberculosis diagnostic assessments often exclude patients who cannot produce sputum, which can, particularly in the case of inpatients who are severely ill, lead to a biased study population and test assessment and exclude the population that would benefit most from a non-sputum based test.30

In conclusion, considering the higher sensitivity and rapid, point-of-care design of FujiLAM, this assay has the potential to transform the diagnosis of tuberculosis in hospital inpatients with HIV and, potentially, for people with HIV in the general population.

Contributors
TB, BS, ADK, CS, EI, MPN, GM, and CMD designed the study and TB, BS, ET, AT, MPN, GM, and CMD oversaw the study. BS, ADK, CS, AW, DAB, RB, MPN, and GM coordinated the individual study sites. TB, TLL, and AP contributed to assay and reagent development. TB, BS, AM, and SO did the statistical analysis and BS and TB wrote the first manuscript draft. All authors contributed to interpretation of data and editing of the article and approved the final version of the manuscript.

Declaration of interests
TB, EI, AM, SO, AT, CB, and CMD are employed by the Foundation for Innovative New Diagnostics (FIND). FIND is a not-for-profit foundation that supports the evaluation of publicly prioritised tuberculosis assays and the implementation of WHO-approved (guidance and prequalification) assays using donor grants. FIND has product evaluation agreements with several private sector companies that design diagnostics for tuberculosis and other diseases. These agreements strictly define FIND’s independence and neutrality with regard to these private sector companies. TB and AP report patents in the field of lipoarabinomannan detection. DAB reports grants from Wellcome Trust during the conduct of the study. GM reports grants from Wellcome Trust, during the conduct of the study; and grants from South African Government (Medical Research Council, National Research Foundation, and Department of Science and Technology), The European & Developing Countries Clinical Trials Partnership, National Institutes of Health Fogarty Center, and the Bill & Melinda Gates Foundation, outside the submitted work. MPN reports grants from the National Institutes of Health, outside the submitted work. TB, CB, and CMD report grants from the Global Health Innovative Technology Fund, UK Department for International Development, the Dutch Ministry of Foreign Affairs, the German Federal Ministry of Education and Research, the Australian Department of Foreign Affairs and Trade, and the Bill & Melinda Gates Foundation. All other authors declare no competing interests.

Acknowledgments
The authors thank the late Stephen D Lawn who designed and led the cohort 2 study, Mark D Perkins, and Ranald Sutherland for helping with the conceptualisation of this work, Anna Mantsoki for data management, and the clinical and laboratory teams at the partner sites for their efforts in the implementation, conduct, and timely completion of the study. This work was funded by Global Health Innovative Technology Fund (grant number G2015-201), UK Department for International Development (grant number 300341-102), the Dutch Ministry of Foreign Affairs (grant number PDP15CH14), the Bill & Melinda Gates Foundation (grant number OPP1105925), the Australian Department of Foreign Affairs and Trade (grant number 70957), and the German Federal Ministry of Education and Research through Kreditanstalt für Wiederaufbau. The cohort 2 study was funded by the Wellcome Trust (088590 and 085251). GM was supported by the Wellcome Trust (098316 and 203135/Z/16/Z), the South African Research Chairs Initiative of the Department of Science and Technology and National Research Foundation of South Africa (grant number 64787), NRF incentive funding (UID: 85858), and the South African Medical Research Council through its Tuberculosis and HIV Collaborating Centres Programme, with funds received from the National Department of Health (RFA#SAMRC-RFA-CC:TB/HIV/AIDS-01-2014). BS received salary support from the Wellcome Trust (grant number 088316). CS received funding from the South African Medical Research Council through the National Health Scholarship Programme. ADK received funding from the National Institute of Allergy and Infectious Diseases (grant number T32 AI060530). The opinions, findings and conclusions expressed in this manuscript reflect those of the authors alone.

References
1 WHO. Global Tuberculosis Report 2018. Geneva: World Health Organization, 2018.
2 Lawn SD, Kerkhoff AD, Burton R, et al. Rapid microbiological screening for tuberculosis in HIV-positive patients on the first day of acute hospital admission by systematic testing of urine samples using Xpert MTB/RIF: a prospective cohort in South Africa. BMC Med 2015; 13: 192.
3 WHO. High-priority target product profiles for new tuberculosis diagnostics: report of a consensus meeting. Geneva: World Health Organization, 2014.
4 WHO. The use of lateral flow urine lipoarabinomannan assay (LF-LAM) for the diagnosis and screening of active tuberculosis in people living with HIV. Geneva: World Health Organization, 2015.
5 Shah M, Hanrahan C, Wang ZY, et al. Lateral flow urine lipoarabinomannan assay for detecting active tuberculosis in HIV-positive adults. Cochrane Database Syst Rev 2016; 5: CD011420.
6 Peter JG, Zijenah LS, Chanda D, et al. Effect on mortality of point-of-care, urine-based lipoarabinomannan testing to guide tuberculosis treatment initiation in HIV-positive hospital inpatients: a pragmatic, parallel-group, multicountry, open-label, randomised controlled trial. Lancet 2016; 387: 1187–97.
7 Sigal GB, Pinter A, Lowary TL, et al. A novel sensitive immunoassay targeting the 5-methylthio-d-xylofuranose–lipoarabinomannan epitope meets the WHO’s performance target for tuberculosis diagnosis. J Clin Microbiol 2018; 56: e01338–18.
8 Choudhary A, Patel D, Honnen W, et al. Characterization of the antigenic heterogeneity of lipoarabinomannan, the major surface glycolipid of Mycobacterium tuberculosis, and complexity of antibody specificities toward this antigen. J Immunol 2018; 200: 3053–66.
9 Zheng RB, Jégouzo SAF, Joe M, et al. Insights into interactions of mycobacteria with the host innate immune system from a novel array of synthetic mycobacterial glycans. ACS Chem Biol 2017; 12: 2990–3002.
10 Mitamura K, Shimizu H, Yamazaki M, et al. Clinical evaluation of highly sensitive silver amplification immunochromatography systems for rapid diagnosis of influenza. J Virol Methods 2013; 194: 123–28.
11 Lawn SD, Kerkhoff AD, Burton R, et al. Diagnostic accuracy, incremental yield and prognostic value of Determine TB-LAM for routine diagnostic testing for tuberculosis in HIV-infected patients requiring acute hospital admission in South Africa: a prospective cohort. BMC Med 2017; 15: 67.


12 Southern African HIV Clinicians Society. Society current guidelines. https://sahivsoc.org/SubHeader?slug=sahcs-guidelines (accessed May 14, 2019).
13 Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. Clin Chem 2015; 61: 1446–52.
14 Abbott Laboratories. ALERE Determine TB LAM AG. https://www.alere.com/en/home/product-details/determine-tb-lam.html (accessed May 14, 2019).
15 Cochran WG. The comparison of percentages in matched samples. Biometrika 1950; 37: 256–66.
16 Guo J, Riebler A. meta4diag: Bayesian bivariate meta-analysis of diagnostic test studies for routine practice. J Stat Softw 2018; 83: 1–31.
17 Newcombe RG. Two-sided confidence intervals for the single proportion: comparison of seven methods. Stat Med 1998; 17: 857–72.
18 Tango T. Equivalence test and confidence interval for the difference in proportions for the paired-sample design. Stat Med 1998; 17: 891–908.
19 Cohen J. Weighted κ: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 1968; 70: 213–20.
20 Gupta-Wright A, Peters JA, Flach C, Lawn SD. Detection of lipoarabinomannan (LAM) in urine is an independent predictor of mortality risk in patients receiving treatment for HIV-associated tuberculosis in sub-Saharan Africa: a systematic review and meta-analysis. BMC Med 2016; 14: 53.
21 Gupta-Wright A, Corbett EL, van Oosterhout JJ, et al. Rapid urine-based screening for tuberculosis in HIV-positive patients admitted to hospital in Africa (STAMP): a pragmatic, multicentre, parallel-group, double-blind, randomised controlled trial. Lancet 2018; 392: 292–301.
22 Lawn SD, Kerkhoff AD, Nicol MP, Meintjes G. Underestimation of the true specificity of the urine lipoarabinomannan point-of-care diagnostic assay for HIV-associated tuberculosis. J Acquir Immune Defic Syndr 2015; 69: e144–46.
23 Gina P, Randall PJ, Muchinga TE, et al. Early morning urine collection to improve urinary lateral flow LAM assay sensitivity in hospitalised patients with HIV-TB co-infection. BMC Infect Dis 2017; 17: 339.
24 Boehme CC, Nabeta P, Hillemann D, et al. Rapid molecular detection of tuberculosis and rifampin resistance. N Engl J Med 2010; 363: 1005–15.
25 Cox JA, Lukande RL, Kalungi S, et al. Is urinary lipoarabinomannan the result of renal tuberculosis? Assessment of the renal histology in an autopsy cohort of Ugandan HIV-infected adults. PLoS One 2015; 10: e0123323.
26 Broger T, Tsionksy M, Mathew A, et al. Sensitive electrochemiluminescence (ECL) immunoassays for detecting lipoarabinomannan (LAM) and ESAT-6 in urine and serum from tuberculosis patients. PLoS One 2019; 14: e0215443.
27 Huerga H, Ferlazzo G, Bevilacqua P, et al. Incremental yield of including Determine-TB LAM Assay in diagnostic algorithms for hospitalized and ambulatory HIV-positive patients in Kenya. PLoS One 2017; 12: e0170976.
28 Boyles TH, Griesel R, Stewart A, Mendelson M, Maartens G. Incremental yield and cost of urine Determine TB-LAM and sputum induction in seriously ill adults with HIV. Int J Infect Dis 2018; 75: 67–73.
29 Calligaro GL, Zijenah LS, Peter JG, et al. Effect of new tuberculosis diagnostic technologies on community-based intensified case finding: a multicentre randomised controlled trial. Lancet Infect Dis 2017; 17: 441–50.
30 Lawn SD, Kerkhoff AD, Burton R, Meintjes G. Underestimation of the incremental diagnostic yield of HIV-associated tuberculosis in studies of the Determine TB-LAM Ag urine assay. AIDS 2014; 28: 1846–48.


Xpert MTB/RIF Ultra for detection of Mycobacterium tuberculosis and rifampicin resistance: a prospective multicentre diagnostic accuracy study

Susan E Dorman*, Samuel G Schumacher*, David Alland, Pamela Nabeta, Derek T Armstrong, Bonnie King, Sandra L Hall, Soumitesh Chakravorty, Daniela M Cirillo, Nestani Tukvadze, Nino Bablishvili, Wendy Stevens, Lesley Scott, Camilla Rodrigues, Mubin I Kazi, Moses Joloba, Lydia Nakiyingi, Mark P Nicol, Yonas Ghebrekristos, Irene Anyango, Wilfred Murithi, Reynaldo Dietze, Renata Lyrio Peres, Alena Skrahina, Vera Auchynka, Kamal Kishore Chopra, Mahmud Hanif, Xin Liu, Xing Yuan, Catharina C Boehme, Jerrold J Ellner, Claudia M Denkinger, on behalf of the study team†

Lancet Infect Dis 2018; 18: 76–84
Published Online November 30, 2017
http://dx.doi.org/10.1016/S1473-3099(17)30691-6
This online publication has been corrected. The corrected version first appeared at thelancet.com/infection on February 21, 2018.
See Comment page 8
*Contributed equally
†Listed at the end of this paper

Johns Hopkins University School of Medicine, Baltimore, MD, USA (Prof S E Dorman MD, D T Armstrong MHS, B King MHS); FIND, Geneva, Switzerland (S G Schumacher PhD, P Nabeta MD, C C Boehme MD, C M Denkinger MD); Division of Infectious Diseases, Rutgers-New Jersey Medical School, Newark, NJ, USA (Prof D Alland MD, S Chakravorty PhD); Boston Medical Center and Boston University School of Medicine, Boston, MA, USA (S L Hall MPH, Prof J J Ellner MD); IRCCS San Raffaele Scientific Institute, Milan, Italy (D M Cirillo MD); National Center for Tuberculosis and Lung Diseases, Tbilisi, Georgia (N Tukvadze MD, N Bablishvili PhD); Department of Molecular Medicine and Haematology, Faculty of Health Science, School of Pathology and the National Priority Program of the National Health Laboratory Service, Johannesburg, South Africa (Prof W Stevens MBBCh, L Scott PhD); PD Hinduja Hospital and Medical Research Centre, Mumbai, India (C Rodrigues MD, M I Kazi MTech-Biotech); Mycobacteriology Laboratory, Department of Microbiology, School of Biomedical Sciences (Prof M Joloba PhD) and Infectious Disease Institute (L Nakiyingi PhD), Makerere University, Kampala, Uganda; Division of Medical Microbiology and Institute for Infectious Diseases and Molecular Medicine, University of Cape Town (Prof M P Nicol PhD, Y Ghebrekristos BSc); National Health Laboratory Service, Groote Schuur Hospital, Cape Town, South Africa (Prof M P Nicol, Y Ghebrekristos); Kenya Medical Research Institute, Center for Global Health Research, Kisumu, Kenya (I Anyango BSN, W Murithi BS); Universidade Federal do Espirito Santo, Vitoria, Brazil (Prof R Dietze MD, R Lyrio Peres PhD); National Reference Laboratory, Republican Scientific and Practical Centre for Pulmonology and Tuberculosis, Minsk, Belarus (A Skrahina MD, V Auchynka MD); State TB Training & Demonstration Centre, New Delhi, India (K K Chopra MD, M Hanif PhD); and Henan Provincial Chest Hospital, Zhengzhou, Henan Province, China (X Liu MD, Prof X Yuan MD)

Correspondence to: Dr Claudia M Denkinger, FIND, 1202 Geneva, Switzerland [email protected]

Summary

Background The Xpert MTB/RIF assay is an automated molecular test that has improved the detection of tuberculosis and rifampicin resistance, but its sensitivity is inadequate in patients with paucibacillary disease or HIV. Xpert MTB/RIF Ultra (Xpert Ultra) was developed to overcome this limitation. We compared the diagnostic performance of Xpert Ultra with that of Xpert for detection of tuberculosis and rifampicin resistance.

Methods In this prospective, multicentre, diagnostic accuracy study, we recruited adults with pulmonary tuberculosis symptoms presenting at primary health-care centres and hospitals in eight countries (South Africa, Uganda, Kenya, India, China, Georgia, Belarus, and Brazil). Participants were allocated to the case detection group if no drugs had been taken for tuberculosis in the past 6 months or to the multidrug-resistance risk group if drugs for tuberculosis had been taken in the past 6 months, but drug resistance was suspected. Demographic information, medical history, chest imaging results, and HIV test results were recorded at enrolment, and each participant gave at least three sputum specimens on 2 separate days. Xpert and Xpert Ultra diagnostic performance in the same sputum specimen was compared with culture tests and drug susceptibility testing as reference standards. The primary objectives were to estimate and compare the sensitivity of Xpert Ultra test with that of Xpert for detection of smear-negative tuberculosis and rifampicin resistance and to estimate and compare Xpert Ultra and Xpert specificities for detection of rifampicin resistance. Study participants in the case detection group were included in all analyses, whereas participants in the multidrug-resistance risk group were only included in analyses of rifampicin-resistance detection.

Findings Between Feb 18, and Dec 24, 2016, we enrolled 2368 participants for sputum sampling. 248 participants were excluded from the analysis, and 1753 participants were distributed to the case detection group (n=1439) and the multidrug-resistance risk group (n=314). Sensitivities of Xpert Ultra and Xpert were 63% and 46%, respectively, for the 137 participants with smear-negative and culture-positive sputum (difference of 17%, 95% CI 10 to 24); 90% and 77%, respectively, for the 115 HIV-positive participants with culture-positive sputum (13%, 6·4 to 21); and 88% and 83%, respectively, across all 462 participants with culture-positive sputum (5·4%, 3·3 to 8·0). Specificities of Xpert Ultra and Xpert for case detection were 96% and 98% (–2·7%, –3·9 to –1·7) overall, and 93% and 98% for patients with a history of tuberculosis. Xpert Ultra and Xpert performed similarly in detecting rifampicin resistance.

Interpretation For tuberculosis case detection, sensitivity of Xpert Ultra was superior to that of Xpert in patients with paucibacillary disease and in patients with HIV. However, this increase in sensitivity came at the expense of a decrease in specificity.

Funding Government of Netherlands, Government of Australia, Bill & Melinda Gates Foundation, Government of the UK, and the National Institute of Allergy and Infectious Diseases.

Copyright © The Author(s). Published by Elsevier Ltd. This is an Open Access article under the CC BY 4.0 license.

Research in context

Evidence before this study
In 2010, WHO endorsed the Xpert MTB/RIF assay for initial diagnostic testing of individuals suspected of multidrug-resistant tuberculosis or HIV-associated tuberculosis. In 2014, WHO expanded this recommendation for use in all patients. The diagnostic accuracy of Xpert for pulmonary tuberculosis and rifampicin resistance has been assessed in Cochrane systematic reviews. The most recent update included studies described in any language until Feb 7, 2013. In 27 studies with nearly 10 000 participants, the pooled sensitivities of Xpert for pulmonary tuberculosis were 98% in those who were positive by sputum smear microscopy but only 67% in those who were negative by sputum smear microscopy. Pooled sensitivity was 79% in HIV-positive patients independent of sputum smear status, and pooled specificity was 99%. Performance characteristics for rifampicin resistance were 95% sensitivity and 98% specificity. The suboptimal detection of rifampicin resistance by Xpert in mixed populations containing rifampicin-resistant plus rifampicin-susceptible bacilli and some silent mutations and the consequent false determinations of rifampicin resistance have been confirmed in subsequent reports.

The Xpert MTB/RIF Ultra (Xpert Ultra) assay was developed to overcome the limited sensitivity of Xpert in the detection of pulmonary tuberculosis and limited accuracy of rifampicin resistance detection. We searched PubMed with the term “Xpert MTB/RIF Ultra” for articles in any language published until Oct 18, 2017. Other than two commentaries, we found two primary research articles describing the limit of detection and the performance of Xpert Ultra for detection of Mycobacterium tuberculosis in cerebrospinal fluid. Findings from analytical laboratory studies showed that Xpert Ultra had a lower limit of bacillary detection and was more accurate for detection of rifampicin resistance than Xpert. Xpert Ultra was also found to have higher sensitivity than Xpert and culture in paucibacillary specimens of cerebrospinal fluid.

Added value of this study
This is the first prospective study on the accuracy of Xpert Ultra for pulmonary tuberculosis. We did this study in eight countries with high burdens of tuberculosis or drug-resistant tuberculosis, and we applied a rigorous reference standard to assure generalisability of the data to the tuberculosis epidemic worldwide. Our findings suggest that Xpert Ultra is substantially more sensitive than Xpert for detection of pulmonary tuberculosis, especially for paucibacillary specimens (ie, smear-negative specimens and specimens from HIV-positive individuals in whom available tests work least well). However, the increased sensitivity of Xpert Ultra came at the expense of a loss of specificity. For detection of rifampicin resistance, Xpert Ultra and Xpert performed comparably.

Implications of all the available evidence
The improved sensitivity of the Xpert Ultra assay relative to the Xpert assay should permit more evidence-based treatment decisions and at earlier stages of disease, even in people with HIV who can have high morbidity and mortality from tuberculosis despite relatively low bacillary burdens in sputum. On the basis of these findings, WHO has concluded that Xpert Ultra can be used as an alternative to Xpert for initial testing in adults with signs or symptoms of tuberculosis. Further research in different epidemiological settings and patient populations is needed to clarify the implications of the trade-off between increased sensitivity and decreased specificity and to determine the biological basis for Xpert Ultra-positive and culture-negative results.

Introduction

An estimated 10·4 million new tuberculosis cases occurred in 2015, but only 6·1 million (59%) were diagnosed.1 That same year, an estimated 580 000 rifampicin-resistant cases occurred, but only 125 000 (20%) were identified.1 These diagnostic gaps are caused mostly by the lack of highly sensitive, rapid, accessible diagnostics.2 WHO recommended the Xpert MTB/RIF assay (Cepheid, Sunnyvale, CA, USA), an automated, integrated, cartridge-based molecular assay, as the initial test for tuberculosis to increase case detection and improve identification of rifampicin resistance directly from sputum.3–5 Xpert is used in tuberculosis programmes in more than 120 countries.6 However, Xpert’s sensitivity for tuberculosis detection is inadequate when few bacilli are present in a clinical specimen. This limits the usefulness of Xpert in patients with sputum smear-negative or extrapulmonary tuberculosis. This is particularly relevant for people with HIV and for children, in whom tuberculosis is often difficult to diagnose and morbidity can be high.7–9 One possible consequence of imperfect test sensitivity is lack of confidence in a negative test result, leading to empiric treatment and possibly overtreatment that might undermine clinical effect.10,11 For detection of rifampicin resistance, Xpert can give a false-positive result for strains that carry phenotypically silent mutations or if the bacillary burden is very low, although this is rare.12,13

The Xpert MTB/RIF Ultra assay (Xpert Ultra) was developed to overcome the limitations of the Xpert assay. To improve assay sensitivity in the detection of Mycobacterium tuberculosis complex, Xpert Ultra incorporates two different multicopy amplification targets (IS6110 and IS1081) and uses improved assay chemistry and cartridge design.14 These revisions resulted in an approximately 1-log improvement in the lower limit of detection compared with Xpert.14 Analytical laboratory data also demonstrated improved differentiation of certain silent mutations, improved detection of rifampicin resistance in mixed infections, and avoidance of false-positive results for detection of rifampicin resistance in paucibacillary specimens.14

We compared the diagnostic accuracy of Xpert Ultra with that of Xpert for the detection of pulmonary tuberculosis and rifampicin resistance in a multicentre study in geographically diverse settings, representative of the intended target population for the assay.

Methods

Study design and participants
The primary objectives of this initial clinical diagnostic accuracy study were to estimate and compare the sensitivity of a single Xpert Ultra test with that of a single Xpert test of the same raw sputum specimen for detection of smear-negative tuberculosis and rifampicin resistance, and to estimate and compare Xpert Ultra and Xpert specificities for detection of rifampicin resistance. We hypothesised that the sensitivity of a single Xpert Ultra test for detection of smear-negative tuberculosis was non-inferior to that of a

single Xpert, and that the sensitivity and the specificity of Xpert Ultra for rifampicin resistance detection were non-inferior to those of Xpert. The study was done at ten reference laboratories in eight countries (South Africa, Uganda, Kenya, India, China, Georgia, Belarus, and Brazil). Eligible study participants were adults presenting at primary health-care centres and hospitals with pulmonary tuberculosis symptoms and who were willing to provide up to four sputum specimens at study enrolment. Participants were recruited prospectively into one of two groups: the case detection group or the multidrug-resistance risk group. Participation in the case detection group required willingness to attend study follow-up visits 42–70 days after enrolment and that no tuberculosis drugs had been taken in the past 6 months. Participants assigned to the multidrug-resistance risk group were at high risk of drug resistance on the basis of one or more of the following criteria: (1) microbiologically confirmed pulmonary tuberculosis with documented rifampicin resistance and tuberculosis treatment received for 31 days or less; (2) known pulmonary tuberculosis with suspected treatment failure; and (3) history of drug-resistant tuberculosis and off tuberculosis treatment for at least 3 months. The study protocol (appendix) was reviewed and approved by ethics committees at study sites and supervising organisations. Written informed consent was obtained from all study participants. Study participation did not affect the standard of care.

Procedures
Demographic information, medical history, chest imaging results, and HIV test results (at sites where part of routine care) were recorded at enrolment. Participants were asked to provide a minimum of three sputum specimens on two separate days. Xpert and Xpert Ultra assays, smear microscopy, culture testing, and phenotypic drug susceptibility tests for rifampicin were done on site. Study-specified laboratory quality assurance included the use of external controls (positive and negative) and swab testing of the specimen processing area and of GeneXpert instrument surfaces.

Xpert and Xpert Ultra assays were done by adding sample reagent to the first collected sputum specimen in a 2:1 dilution, and 2·0 mL of the resulting mixture was added to one Xpert and one Xpert Ultra cartridge. Samples were analysed using standard four-module GeneXpert instruments with automated readouts for M tuberculosis detection (invalid [no internal assay control detected]; not detected; or detected [with semiquantitation]) and rifampicin resistance (detected, not detected, or indeterminate). The semiquantitative scale for Xpert Ultra results was trace, very low, low, medium, or high. The semiquantitative scale for Xpert results was very low, low, medium, or high.

For reference standard testing, the second and third sputum specimens were first digested with N-acetyl-L-cysteine and sodium hydroxide and concentrated using standard methods.15 Smear microscopy was done using Ziehl-Neelsen (Belarus site) or auramine-rhodamine staining (all other sites). 0·5 mL of the resuspended pellet was inoculated into liquid culture using mycobacteria growth indicator tube (MGIT) with a BACTEC 960 instrument (BD Microbiology Systems, Sparks, MD, USA), and 0·2 mL was inoculated on Löwenstein-Jensen medium. Cultures positive for growth of acid-fast bacilli underwent confirmation of M tuberculosis complex by MPT64/MPB64 antigen detection15 or line probe assays. Phenotypic drug susceptibility testing was done from the first positive M tuberculosis culture using the BACTEC MGIT 960 system and a rifampicin critical concentration of 1·0 µg/mL.15 Genetic drug susceptibility testing by Sanger DNA sequencing or pyrosequencing of the 81-bp rpoB core region was done for cultured isolates from all participants with discordant results between phenotypic drug susceptibility and Xpert Ultra readouts and for a subset of participants with concordant results. Next-generation sequencing or pyrosequencing of IS6110, IS1081, and rpoB from the Xpert Ultra cartridge amplicon was done on specimens for which Xpert Ultra results were positive, but no culture was positive (appendix p 2).

See Online for appendix

Case definitions for the primary analyses were based on four culture results from sputum specimens two and three (figure 1). A culture-positive tuberculosis case was defined as a participant with at least one culture positive for M tuberculosis. Culture-positive cases were considered smear-positive if they had at least one positive smear (inclusive of scanty positive smears). A culture-negative participant had no culture positive for M tuberculosis and at least two cultures negative for M tuberculosis.

Staff doing Xpert or Xpert Ultra assays were blinded to results of other study tests through use of specimen codes and through staffing assignments. Data were captured through dedicated data-entry systems that were password-protected.

Statistical analysis
Sample size was calculated by Monte Carlo simulation (appendix p 3). Sensitivity was defined as the proportion of patients testing positive with the reference standard who tested positive by the index test (Xpert Ultra) or comparator test (Xpert). Specificity was the proportion of patients testing negative with the reference standard who tested negative by the index test or comparator test. The primary analysis was based on results from initial testing of the first sputum specimen with Xpert and Xpert Ultra. Participants in the case detection group were included in all analyses, whereas participants in the multidrug-resistance risk group were only included in analyses of rifampicin-resistance detection. Patients were excluded from the analysis if culture contamination did not allow application of the case definition or if results of Xpert or Xpert Ultra were indeterminate or missing on initial testing. The proportion testing indeterminate is reported separately.
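The definitions of sensitivity and specificity given under Statistical analysis reduce to simple ratios over a 2×2 table of test results against the reference standard. The following is a minimal Python sketch under stated assumptions: the class name `TwoByTwo` and every count are hypothetical and chosen for illustration only; they are not data from the study.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Index (or comparator) test results cross-tabulated against the reference standard."""
    true_pos: int   # reference positive, test positive
    false_neg: int  # reference positive, test negative
    true_neg: int   # reference negative, test negative
    false_pos: int  # reference negative, test positive


def sensitivity(t: TwoByTwo) -> float:
    # Proportion of reference-positive patients who test positive.
    return t.true_pos / (t.true_pos + t.false_neg)


def specificity(t: TwoByTwo) -> float:
    # Proportion of reference-negative patients who test negative.
    return t.true_neg / (t.true_neg + t.false_pos)


# Hypothetical counts, for illustration only (not study data).
example = TwoByTwo(true_pos=80, false_neg=20, true_neg=90, false_pos=10)
print(f"sensitivity = {sensitivity(example):.2f}")
print(f"specificity = {specificity(example):.2f}")
```

Indeterminate or missing results are left out of the table here, mirroring the exclusion rule stated in the text; a real analysis would report them separately.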


2368 participants eligible for enrolment
  367 participants with incomplete specimens (early exclusions as per protocol)
2001 participants enrolled
  Specimen testing panel (sputum 1 to sputum 4): NALC-NaOH processing, smear, MGIT and Löwenstein-Jensen culture, drug susceptibility tests, Xpert, and Xpert Ultra
  248 excluded from analysis*
    4 missing data to be classified into enrolment group
    39 non-determinate Xpert result
    79 non-determinate Xpert Ultra result
    1 missing Xpert and Xpert Ultra results
    114 missing or outstanding complete case definition
    25 smear-positive with all cultures negative
1753 participants included in the analyses
  1439 participants in case detection group
    462 culture-positive
      323 culture-positive and smear-positive
      137 culture-positive and smear-negative
      2 smear result missing
    977 culture-negative
  314 participants in multidrug resistance risk group
    215 culture-positive
      172 culture-positive and smear-positive
      43 culture-positive and smear-negative
    99 culture-negative

Figure 1: Specimen laboratory testing, participant flow, and exclusions from analysis eligibility
Eligible participants were asked to provide four sputum specimens (sputum 1–4) on 2 separate days. Xpert MTB/RIF Ultra assay (Xpert Ultra) on the first sputum specimen was the index test, and Xpert MTB/RIF assay on the first sputum specimen was the comparator test. When possible, a fourth sputum specimen was obtained for additional solid and liquid cultures in cases with Xpert and Xpert Ultra discrepant results on sputum specimen 1. Sputum 4 results were only used for secondary analyses. NALC-NaOH=N-acetyl-L-cysteine and sodium hydroxide. MGIT=mycobacteria growth indicator tube. *Some reasons for exclusion overlap.
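When appraising a flow diagram, it helps to confirm that the reported counts add up. The sketch below checks the arithmetic of the participant numbers reported in figure 1; the function name `flow_is_consistent` is a hypothetical helper, and the exclusion reasons are not summed because the figure states that they overlap.

```python
def flow_is_consistent() -> bool:
    """Cross-check the participant counts reported in figure 1."""
    eligible, incomplete, enrolled = 2368, 367, 2001
    excluded, analysed = 248, 1753
    case_detection, mdr_risk = 1439, 314

    checks = [
        eligible - incomplete == enrolled,      # early exclusions
        enrolled - excluded == analysed,        # exclusions from analysis
        case_detection + mdr_risk == analysed,  # two enrolment groups
        462 + 977 == case_detection,            # culture-positive + culture-negative
        323 + 137 + 2 == 462,                   # smear-positive, smear-negative, smear missing
        215 + 99 == mdr_risk,                   # culture-positive + culture-negative
        172 + 43 == 215,                        # smear-positive + smear-negative
    ]
    return all(checks)


print(flow_is_consistent())
```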

Results for simple proportions are presented with sample size of enrolled participants with smear-negative Clopper-Pearson 95% CI. The 95% CI around differences pulmonary tuberculosis. We reasoned that assurance in proportions (for paired specimens in non-inferiority from a diagnostic accuracy study that Xpert Ultra was at analyses) was computed using Tango’s score method.16,17 least as good as Xpert would be useful to clinicians and A non-inferiority endpoint, rather than a superiority policy makers and provide rationale for a larger study to endpoint, was selected for this initial clinical diagnostic assess superiority. Superiority is demonstrated if it can accuracy study of Xpert Ultra because a superiority be shown that sensitivity of Xpert Ultra is superior to endpoint would have required a prohibitively large Xpert beyond what could occur by chance alone. To www.thelancet.com/infection Vol 18 January 2018 79 Articles

Minsk, Vitoria, Cape Town, Zheng-zhou, Tbilisi, Johannesburg, Nairobi, Mumbai, New Delhi, Kampala, All participants Belarus Brazil South Africa China Georgia South Africa Kenya India India Uganda (N=1753) (N=121) (N=128) (N=152) (N=101) (N=372) (N=234) (N=135) (N=213) (N=116) (N=181) Demographic or clinical characteristics Age, years 42 50 41 47 45 34 33 31 30 30 38 (28–56) (37–59) (34–49) (34–57) (33–57) (30–43) (26–44) (23–45) (21–45) (26–39) (28–50) Female sex 50/121 47/128 89/152 25/101 105/372 87/234 66/135 110/213 50/116 65/181 694/1753 (41%) (37%) (59%) (25%) (28%) (37%) (49%) (52%) (43%) (36%) (40%) HIV infection 7/8 7/128 87/152 0/101 7/13 157/214 78/135 8/10 7/54 83/181 441/996 (≤4%*) (5%) (57%) (≤4·0%*) (73%†) (58%) (≤4%*) (≤4%*) (46%) (25%*) History of tuberculosis‡ 5/48 10/128 59/150 1/133 95/348 55/234 20/135 7/64 28/115 15/181 295/1436 (10%) (8%) (39%) (3%) (27%) (24%) (15%) (11%)§ (24%) (8%) (21%)§ Enrolment group¶ Case detection group 48/121 128/128 150/152 33/101 348/372 234/234 135/135 67/213 115/116 181/181 1439/1753 (40%) (100%) (99%) (33%) (94%) (100%) (100%) (31%) (99%) (100%) (82%) Multidrug-resistance risk group 73/121 0/128 2/152 68/101 24/372 0/234 0/135 146/213 1/116 0/181 314/1753 (60%) (1%) (67%) (6%) (69%) (1%) (18%) Distribution in diagnostic categories Culture-positive sputum‡ 25/48 34/128 27/150 26/33 95/348 74/234 28/135 44/67 43/115 67/181 462/1439 (52%) (27%) (18%) (79%) (27%) (32%) (21%) (66%) (37%) (37%) (32%) Proportion of participants with 14/25 5/34 14/27 4/26 39/95 18/74 6/28 14/44 8/41 16/6 137/460 culture-positive sputum that (56%) (15%) (52%) (15%) (41%) (24%) (21%) (32%) (20%)|| (24%) (30%)|| was smear-negative‡ Proportion of participants with 30/51 1/35 2/27 46/89 28/100 2/67 0/26 92/168 11/43 0/78 213/684 culture-positive sputum that (59%) (3%) (7%) (52%) (29%) (3%) (55%) (26%) (31%) was rifampicin resistant**

Data are median (IQR) or n/N (%). *For sites where HIV infection status was unknown for more than 50% of study participants, we show country-level HIV prevalence among tuberculosis cases. †HIV-infection status was unknown for 20 individuals; data are percentage of patients with known HIV status. ‡Numbers shown for study participants in the case detection group. §Data were missing for three patients at the Mumbai site. ¶At each site, study participants were enrolled in one of two possible (mutually exclusive) enrolment groups: the case detection group (based on suspicion of tuberculosis) or the multidrug-resistance risk group (based on suspicion of multidrug-resistant tuberculosis). ||Smear results were missing for two participants. **Calculated as percentage of the total number of culture-positive study participants in the case detection group and the multidrug-resistance risk group with available phenotypic drug-susceptibility test results.

Table 1: Demographic and clinical characteristics, enrolment group, and distribution in diagnostic categories of the study participants

assess non-inferiority, the lower limit of the CI of the difference in sensitivity (∆) was compared with the predefined non-inferiority margin; non-inferiority is achieved if the lower limit of the CI of ∆ is no lower than the non-inferiority margin. Non-inferiority margins for comparison between Xpert Ultra and Xpert were set at –7% for sensitivity to detect smear-negative tuberculosis, and at –3% for sensitivity and specificity to detect rifampicin resistance (appendix, p 3). A margin was not predefined for specificity of tuberculosis detection. We used Stata version 12 and R version 3.2.4 for statistical analyses.

Role of the funding source

The funders of the study had no role in study design, data collection, data analysis, data interpretation, or writing of the report. The corresponding author had full access to all the data in the study and had final responsibility for the decision to submit for publication.

Results

Between Feb 18, and Dec 24, 2016, we enrolled 2368 participants in the study (figure 1). 1753 participants met inclusion criteria and were included in the analyses. Of the 1439 participants in the case detection group, 462 (32%) participants had culture-positive sputum and 137 (30%) participants had smear-negative sputum. Of the 1753 participants in the case detection and multidrug-resistance risk groups, 684 were culture-positive and 213 (31%) of these were rifampicin-resistant on the basis of phenotypic drug susceptibility testing (table 1).

Results of the comparison between Xpert and Xpert Ultra sensitivity and specificity are shown in table 2 (appendix p 4). The increase in sensitivity of Xpert Ultra relative to Xpert was larger than the remaining sensitivity gap between Xpert Ultra and a single liquid culture (appendix p 5). Xpert Ultra and Xpert sensitivities using alternative tuberculosis case definitions, as used in previous studies,3,4 are shown in the appendix (p 6).

684 participants had culture-positive sputum and had phenotypic drug susceptibility test results. Xpert Ultra provided interpretable rifampicin drug susceptibility test results for 588 participants (86%), whereas Xpert provided results for 580 participants (85%; appendix p 14). The comparison of sensitivity and specificity between Xpert and Xpert Ultra in the detection of rifampicin resistance is shown in table 2. Incorporating sequencing data for specimens that tested positive for rifampicin resistance by Xpert or Xpert Ultra but rifampicin-susceptible by phenotypic drug susceptibility


Columns, in order (each cell is % (95% CI; n/N)): Tuberculosis detection*: sensitivity, all culture-positive | sensitivity, smear-negative | sensitivity, HIV-negative‡ | sensitivity, HIV-positive‡ | specificity. Detection of rifampicin resistance†: sensitivity | specificity

Xpert: 83% (79 to 86; 383/462) | 46% (37 to 55; 63/137)§ | 90% (84 to 94; 143/159) | 77% (68 to 84; 88/155) | 98% (97 to 99; 960/977) | 95% (91 to 98; 167/175) | 98% (96 to 99; 369/376)
Xpert Ultra: 88% (85 to 91; 408/462) | 63% (54 to 71; 86/137)§ | 91% (86 to 95; 145/159) | 90% (83 to 95; 103/115) | 96% (94 to 97; 934/977) | 95% (91 to 98; 166/175) | 98% (97 to 99; 370/376)
Difference (Xpert Ultra minus Xpert): 5·4% (3·3 to 8·0; 25/162) | 17% (10 to 24; 23/137) | 1·3% (–1·8 to 4·9; 2/159) | 13% (6·4 to 21; 15/115) | –2·7% (–3·9 to –1·7; 36/977) | –0·6% (–3·2 to 1·6; 1/175) | 0·3% (–0·7 to 1·5; 1/376)
Non-inferiority margin: Not predefined | –7% | Not predefined | Not predefined | Not predefined | –3% | –3%

Results are based on initial testing of the first sample with Xpert MTB/RIF and Xpert MTB/RIF Ultra (Xpert Ultra) assays. Uninterpretable results (contaminated cultures or non-determinate Xpert or Ultra results) were excluded from the analysis. Culture contamination averaged 4·3–7·8%, depending on sample and culture type. Non-determinate results (invalid, error, no result) are reported in the main text. Sensitivities of Xpert and Xpert Ultra for detection of smear-positive tuberculosis (n=323) were 99% (95% CI 97–100) and 99% (97–100). *Accuracy for tuberculosis detection was estimated in study participants in the case detection group. Patients with unknown HIV-infection status are excluded from analyses stratified by HIV status but included in all other analyses. †Accuracy for detection of rifampicin resistance was estimated in all study participants with available drug susceptibility test results and valid rifampicin resistance results for both Xpert and Xpert Ultra. ‡Data on HIV-infection status were not available for 188 culture-positive and 336 culture-negative study participants. Sensitivity of Xpert and Xpert Ultra in study participants with missing HIV status was 81% and 85%, respectively. Note that the estimate for pooled sensitivity of Xpert Ultra irrespective of HIV status does not fall between the estimates for HIV-infected and HIV-uninfected individuals. §Accuracy estimates are based on the reference standard as defined in the Methods section (using four cultures to define tuberculosis); using a less stringent reference standard with only one liquid and one solid culture (both from sputum sample 2), which is similar to the reference standard used in 21 of 22 studies included in the most recent Cochrane systematic review of the Xpert assay,4 resulted in Xpert sensitivity for smear-negative tuberculosis of 73% (Cochrane review pooled estimate 67%) and Xpert Ultra sensitivity of 84% (appendix p 5).
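The single-assay confidence intervals in this table are described in the Methods as Clopper-Pearson (exact binomial) intervals. The sketch below shows one way such an interval can be computed with the Python standard library only; the function names and the bisection approach are illustrative assumptions, not the authors' Stata or R code, and the paired-difference intervals (Tango's score method) are not covered.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) interval for the proportion k/n.

    Illustrative helper (hypothetical name). Suitable for sample sizes of
    a few hundred; very large n would overflow the float conversion.
    """

    def solve(cdf_at, target: float) -> float:
        # cdf_at(p) decreases as p increases, so bisect on p in (0, 1).
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if cdf_at(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper limit: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


# Xpert Ultra sensitivity in smear-negative tuberculosis: 86/137 (table 2).
low, high = clopper_pearson(86, 137)
print(f"{86 / 137:.0%} ({low:.0%} to {high:.0%})")
```

Applied to 86/137, the limits should land close to the 54 to 71 reported in the table; small differences in the last digit can arise from rounding conventions.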

Table 2: Comparative accuracy for detection of tuberculosis and rifampicin resistance
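The non-inferiority rule stated in the Methods (non-inferiority is achieved if the lower limit of the CI of the difference is no lower than the predefined margin) can be checked directly against the three comparisons in table 2 that have a margin. This is a minimal sketch; the class and function names are hypothetical, and the numbers are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    label: str
    ci_lower: float  # lower 95% CI limit of (Xpert Ultra minus Xpert), percentage points
    margin: float    # predefined non-inferiority margin, percentage points


def is_non_inferior(c: Comparison) -> bool:
    """Non-inferiority holds when the lower CI limit is no lower than the margin."""
    return c.ci_lower >= c.margin


comparisons = [
    Comparison("Sensitivity, smear-negative tuberculosis", 10.0, -7.0),
    Comparison("Sensitivity, rifampicin resistance", -3.2, -3.0),
    Comparison("Specificity, rifampicin resistance", -0.7, -3.0),
]

for c in comparisons:
    verdict = "non-inferior" if is_non_inferior(c) else "non-inferiority not shown"
    print(f"{c.label}: lower limit {c.ci_lower} vs margin {c.margin} -> {verdict}")
```

The rifampicin-resistance sensitivity comparison fails the rule only narrowly (a lower limit of –3·2 against a margin of –3), which is consistent with the authors' remark in the Discussion that the confidence interval was wide and the criteria were not met.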

Columns, in order (each cell is % (95% CI; n/N)): Sensitivity: all culture-positive | smear-negative, culture-positive. Specificity: all culture-negative | no history of tuberculosis | any history of tuberculosis

Xpert: 83% (79–86; 383/462) | 46% (37–55; 63/137) | 98% (97–99; 960/977) | 98% (97–99; 715/727) | 98% (95–99; 244/249)
Xpert Ultra: 88% (85–91; 408/462) | 63% (54–71; 86/137) | 96% (94–97; 934/977) | 96% (95–98; 701/727) | 93% (89–96; 232/249)
Xpert Ultra, no trace*: 86% (82–89; 395/462) | 54% (45–63; 74/137) | 98% (96–98; 953/977) | 98% (96–99; 709/727) | 98% (95–99; 243/249)
Xpert Ultra, conditional trace†: 88% (85–91; 406/462) | 61% (53–70; 84/137) | 97% (95–98; 945/977) | 96% (95–98; 701/727) | 98% (95–99; 243/249)
Xpert Ultra, trace-repeat‡: 87% (84–90; 404/462) | 61% (52–69; 83/137) | 97% (95–98; 944/977) | 97% (96–98; 707/727) | 95% (91–97; 236/249)

Sensitivity varied little by history of tuberculosis and did not vary systematically. Data on tuberculosis history were not available for one patient. *Study participants testing tuberculosis-positive based on a trace-positive Xpert Ultra result (n=32) were reclassified as tuberculosis-negative. †Study participants testing tuberculosis-positive based on a trace-positive Xpert Ultra result were reclassified as tuberculosis-negative only if they had a history of tuberculosis (n=13). ‡Study participants testing tuberculosis-positive based on a trace-positive Xpert Ultra result had Xpert Ultra testing on a subsequent sputum specimen: if the subsequent sputum Xpert Ultra result was negative for M tuberculosis then the participant was reclassified as tuberculosis-negative; if the subsequent Xpert Ultra result was positive for M tuberculosis (any semiquantitative threshold), then the participant was not reclassified and remained tuberculosis-positive (14 out of 32 participants tested tuberculosis-negative on sample 2 and were reclassified; 14 tested tuberculosis-positive on sample 2 and were not reclassified; and four were non-determinate by Xpert Ultra on sample 2 and were not reclassified).
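The three reclassification approaches described in the footnotes above amount to a small decision rule. The following sketch encodes that rule as read from the footnotes; the function name, the string labels, and the handling of a missing repeat result are assumptions made for illustration and do not come from the study's analysis code.

```python
from typing import Optional

APPROACHES = ("standard", "no_trace", "conditional_trace", "trace_repeat")


def classify_ultra(
    first_result: str,                    # "negative", "trace", or "positive" (above trace)
    tb_history: bool,                     # any history of tuberculosis
    repeat_result: Optional[str] = None,  # Xpert Ultra result on a later specimen, if any
    approach: str = "standard",
) -> str:
    """Return "positive" or "negative" under the chosen trace-handling approach."""
    if approach not in APPROACHES:
        raise ValueError(f"unknown approach: {approach}")
    # Only trace-positive first results are ever reclassified.
    if first_result != "trace":
        return "positive" if first_result == "positive" else "negative"
    if approach == "no_trace":
        return "negative"
    if approach == "conditional_trace":
        return "negative" if tb_history else "positive"
    if approach == "trace_repeat":
        # Reclassified only when the repeat is negative; a positive or
        # non-determinate repeat leaves the result positive. A missing
        # repeat (None) is treated the same way, which is an assumption.
        return "negative" if repeat_result == "negative" else "positive"
    return "positive"  # standard reading: trace counts as tuberculosis-positive


print(classify_ultra("trace", tb_history=True, approach="conditional_trace"))
print(classify_ultra("trace", tb_history=False, repeat_result="negative", approach="trace_repeat"))
```

Writing the rule out this way makes the trade-off in table 3 easier to follow: each approach turns some trace-positive results into negatives, which can only raise specificity and lower sensitivity relative to the standard reading.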

Table 3: Test sensitivity and specificity depending on tuberculosis history and different approaches to the interpretation of semiquantitative trace-positive results for Mycobacterium tuberculosis detection by Xpert MTB/RIF Ultra (Xpert Ultra)

testing gave specificity estimates of more than 99% for Xpert Ultra and Xpert, which were largely attributable to detection of mutations CTG533CCG, CAC526AAC, and CTG511CCG by Xpert Ultra and Xpert (appendix p 15).

Results of a predefined subanalysis to compare Xpert Ultra and Xpert specificities in participants in the case detection group with a history of tuberculosis treatment versus no history of tuberculosis treatment are shown in table 3 (appendix p 7). In participants with a history of prior tuberculosis treatment, the reduction in Xpert Ultra specificity was greatest for those who had recently completed their tuberculosis treatment (figure 2; appendix p 8) and only approached the specificity of those without a history of tuberculosis if the previous tuberculosis treatment was at least 7 years before enrolment.

19 (44%) of 43 participants with a positive Xpert Ultra test but no positive culture had an Xpert Ultra semiquantitative readout of trace. 15 (35%) participants with apparent false-positive Xpert Ultra results were also positive by Xpert (appendix p 9). Two (5%) participants with apparent false-positive Xpert Ultra results had M tuberculosis identified on a follow-up culture, and two (5%) participants were treated for tuberculosis on the

basis of clinical suspicion. Of the 24 (56%) participants who did not have culture-positive or Xpert-positive sputum and who had not started therapy, 18 participants gave 2-month follow-up information on symptoms. Symptoms had resolved in nine participants, improved in eight participants, and had not changed in one participant. Sequencing of the amplicons obtained from 14 cartridges (14 participants) with apparent false-positive results showed M tuberculosis DNA in 12 participants (appendix pp 10–12).

Figure 2: Specificity estimates of Xpert MTB/RIF and Xpert MTB/RIF Ultra (Xpert Ultra) for tuberculosis case detection in patients with a tuberculosis treatment history and for different approaches to handling an initial Xpert Ultra trace-positive result
[Plot: specificity (%), 0 to 100, against years since treatment completion, 0 to 10; curves for Xpert, Xpert Ultra, Xpert Ultra without trace, and Xpert Ultra with repeat-trace.]
The curves show specificity in participants with tuberculosis history as a function of the time since completion of treatment for the previous tuberculosis episode within 10 years of enrolment (50 cases with treatment more than 10 years earlier were omitted; recoding these 50 cases to be at 10 years did not lead to any noticeable changes in the findings). The results of the Xpert Ultra conditional trace results approach are not shown but would have been directly below the curve for the Ultra without trace. Curves were created using running-line least squares (mean) smoothers18 with a bandwidth of 0·8.

In a post-hoc analysis, we explored the effect of reclassifying Xpert Ultra trace-positive results as tuberculosis-negative on sensitivity and specificity for case detection (appendix p 7). Eliminating the trace-positive category and reclassifying all trace-positive results as tuberculosis-negative improved Xpert Ultra specificity but reduced its sensitivity (table 3). A conditional-trace approach (Xpert Ultra trace-positive results were reclassified as tuberculosis-negative only in participants with a history of tuberculosis) and a trace-repeat approach (participants with a trace-positive Xpert Ultra result for the first specimen were classified either as tuberculosis-negative if an Xpert Ultra test result of another sputum specimen was negative, or as tuberculosis-positive if an Xpert Ultra test result of another sputum specimen was positive) also improved Xpert Ultra specificity estimates (table 3). The conditional trace and trace-repeat approaches retained most of Xpert Ultra’s sensitivity in the smear-negative group. In a post-hoc analysis stratified by country-specific tuberculosis incidence, the specificity of Xpert Ultra was almost identical to that of Xpert in countries with low incidence (100 cases per 100 000 population or less), whereas the difference between Xpert Ultra and Xpert was greatest (–8% [95% CI –14 to –5] in favour of Xpert) in participants with a medical history of tuberculosis who were enrolled in countries with high tuberculosis incidence (more than 100 cases per 100 000 population; appendix p 13).

On initial testing of 2001 specimens, non-determinate readouts (invalid, error, no result) were obtained for 39 (2%) specimens with Xpert and for 79 (4%) specimens with Xpert Ultra. After excluding instrument-related errors, non-determinate readouts were obtained for 28 (1%) specimens with Xpert and for 64 (3%) specimens with Xpert Ultra. A single repeat test done on the same specimen that initially was non-determinate reduced the number of non-determinate results to four (<1%) specimens with Xpert and to ten (<1%) specimens for Xpert Ultra.

Discussion

Results of this multicentre diagnostic accuracy study show that the sensitivity of Xpert Ultra was superior to that of the standard Xpert for tuberculosis case detection in participants with sputum smear-negative pulmonary tuberculosis. Xpert Ultra also had superior sensitivity for tuberculosis case detection in HIV-infected participants and in all study participants. In clinical practice, the high sensitivity of Xpert Ultra could facilitate diagnosis of tuberculosis at earlier stages of disease and diagnosis of tuberculosis in patients with HIV and sputum smear-negative tuberculosis, a population with high mortality. Similarly, sensitivity gains could also be relevant for diagnosis of tuberculosis in children and for diagnosis of extrapulmonary forms of tuberculosis such as meningitis. These groups were assessed in separate studies.19,20

The increased sensitivity of Xpert Ultra came at the expense of a loss of specificity. For Xpert Ultra, we observed a difference in specificity between patients with and without a medical history of tuberculosis treatment. Xpert Ultra specificity increased with increasing time since completion of treatment since the preceding tuberculosis episode up to 7 years. Xpert specificity differed by tuberculosis treatment history only if the preceding treatment had been completed within the past 2 years. These results are in line with findings by Theron and colleagues21 that show that Xpert-positive, culture-negative results were more common in individuals with a history of tuberculosis. Extraneous M tuberculosis from other specimens or the laboratory environment, or false-negative cultures from over-decontamination are possible explanations for a positive nucleic-acid amplification test result in participants with sputum cultures that are negative for M tuberculosis. However, in our study, over-decontamination is not sufficient to explain all of the specificity decrement for Xpert Ultra, and environmental contamination is an unlikely explanation because we implemented rigorous laboratory quality assurance and quality monitoring throughout the study. We speculate that in our study, most instances of Xpert Ultra-positive,


culture-negative results were caused by the presence of M tuberculosis DNA or intact M tuberculosis bacilli (either living or dead, originating from the participant’s lower respiratory system), or both in sputum. M tuberculosis mRNA has also been detected in sputum along with persisting PET thoracic lesion activity in some patients with tuberculosis 1 year after standard 6-month tuberculosis treatment.22 It remains to be seen whether apparent reductions in test specificity in patients with a history of tuberculosis will also be observed for other molecular tests for tuberculosis that aim to improve sensitivity through the detection of multicopy targets.23,24 Additional studies with longer follow-up that investigate the natural history of patients with Xpert Ultra-positive and culture-negative results are needed to understand the clinical relevance of these test results.

More than half of Xpert Ultra false-positive results in patients with a history of tuberculosis were trace-positive (the semiquantitative result corresponding to the lowest bacillary burden), so reclassification of these results as tuberculosis-negative could be considered for all patients, for patients with a tuberculosis history only, or on the basis of Xpert Ultra test results from another sputum specimen. These approaches mitigate some loss of Xpert Ultra specificity while maintaining some sensitivity gains over Xpert. The population-level effect of the sensitivity and specificity trade-off on patient-important outcomes would be expected to vary by setting. For Xpert Ultra, country-level tuberculosis incidence levels seem to affect test specificity. For example, in our study, Xpert Ultra specificity was 99% in participants without a history of tuberculosis treatment in study sites in countries where the tuberculosis incidence is 100 cases per 100 000 population or less, and 95% in patients without tuberculosis treatment history in countries where the tuberculosis incidence is more than 100 cases per 100 000 population.1 Modelling studies are underway and will allow more in-depth exploration of the trade-offs between increased numbers of patients correctly and falsely diagnosed under different epidemiologic scenarios.

For detection of rifampicin resistance, Xpert Ultra specificity was non-inferior to that of Xpert. The sensitivity point estimate for Xpert Ultra was slightly less than that of Xpert and the confidence interval was wide, such that non-inferiority criteria were not met. Additional studies including larger numbers of rifampicin-resistant specimens are needed to more precisely characterise Xpert Ultra accuracy for detection of rifampicin resistance. Patients belonging to the multidrug-resistance risk group were recruited mainly from four sites and, accordingly, most rifampicin-resistant cases come from these sites. Mutations such as Ile491Phe, which is not detected by Xpert or Xpert Ultra, might be more common in countries not included in this study, and inclusion of such sites could potentially have reduced sensitivity estimates. However, we found no evidence of bias given that our reported accuracy for rifampicin-resistance detection is equivalent to that reported in a Cochrane review4 of a broad group of studies and sites worldwide. Our estimates are also in line with WHO surveillance data on the frequency of rifampicin-resistance-conferring mutations obtained from resistance surveys.

In summary, Xpert Ultra holds promise as a rapid and highly sensitive test for tuberculosis case detection and simultaneous detection of rifampicin resistance. Its sensitivity gain compared with Xpert is most apparent in individuals with low sputum bacillary burdens. Implementation approaches will need to consider the effect of possible false-positive Xpert Ultra results.

Contributors

CMD, SED, DA, SGS, SC, CCB, JJE, and PN designed the study, and PN, DTA, BK, SGS, SLH, CMD, SED, and DA oversaw the trial conduct. NT, NB, WS, LS, CR, MIK, MJ, LN, MPN, YG, IA, WM, RD, RLP, AS, VA, KKC, MH, XL, and XY coordinated individual trial sites. SGS, SED, CMD, DA, and DMC analysed data and developed the first manuscript draft. All authors contributed to data collection, interpretation of data, and revision of the Article and approved the final version of the Article before submission.

Study team members

Yukari C Manabe (Johns Hopkins University School of Medicine, Baltimore, MD, USA); David Hom (Boston Medical Center and Boston University School of Medicine, Boston, MA, USA); Rusudan Aspindzelashvili (National Center for Tuberculosis and Lung Diseases, Tbilisi, Georgia); Anura David (National Health Laboratory Service, Johannesburg, South Africa); Utkarsha Surve (PD Hinduja Hospital and Medical Research Centre, Mumbai, India); Louis H Kamulegeya, Sheila Nabweyambo (Infectious Disease Institute and Makerere University, Kampala, Uganda); Shireen Surtie, Nchimunya Hapeela (Division of Medical Microbiology and Institute for Infectious Diseases and Molecular Medicine, University of Cape Town, National Health Laboratory Service, Groote Schuur Hospital, Cape Town, South Africa); Kevin P Cain, Janet Agaya, Kimberly D McCarthy (US Centers for Disease Control and Prevention, and Division of Global HIV and Tuberculosis, Kisumu, Kenya); Patricia Marques Rodrigues, Luiz Guilherme Schmidt Castellani, Pedro Sousa de Almeida Jr, Paola Poloni Lobo de Aguiar (Universidade Federal do Espirito Santo, Vitoria, Brazil); Varvara Solodovnikova (National Reference Laboratory, Republican Scientific and Practical Centre for Pulmonology and Tuberculosis, Minsk, Belarus); Xianglin Ruan, Lili Liang, Guolong Zhang, Hong Zhu, Yingda Xie (Henan Provincial Chest Hospital, Zhengzhou, Henan Province, China).

Declaration of interests

SGS, PN, CMD, and CCB are employed by FIND. FIND is a not-for-profit foundation that supports the evaluation of publicly prioritised tuberculosis assays and the implementation of WHO-approved (guidance and prequalification) assays using donor grants. FIND has product evaluation agreements with several private sector companies that design diagnostics and related products for treatment of tuberculosis and other diseases. These agreements strictly define FIND’s independence and neutrality vis-a-vis the companies whose products get evaluated and describe roles and responsibilities. DA reports grants from Johns Hopkins University School of Medicine and Cepheid during the conduct of the study; grants from National Institutes of Health (NIH), the Foundation for Innovative New Diagnostics, and the Henry Jackson Foundation outside the submitted work; and US and European patents within the Rutgers University Molecular Beacon Patent Pool for non-competitive co-amplification methods, assays for short sequence variants, wavelength-shifting probes and primers, nucleic acid detection probes having non-fret fluorescence quenching and kits and assays including such probes, homogeneous multiplex screening assays and kits, and PCR primers and probes for M tuberculosis. WS is a board member of the Contract Laboratory Service, the African Society for Laboratory Medicine, the Antimicrobial Drug Resistance Group, the World Bank/Presidential Project (for mines), the Task Team for Correctional Services, and the

National HIV and Tuberculosis Drug Resistance Working Group. WS declares consultancy paid to institution from the Bill & Melinda Gates Foundation Grand Challenges Canada, consultancy from WHO (EID, CD4, drug resistance, POC, Xpert TB), consultancy paid to institution from Clinton Foundation (Point-of-Care), personal fees from National Health Laboratory Service, joint staff with University of the Witwatersrand, expert testimony for grant funders (NIH, US Centers for Disease Control and Prevention [CDC], Global Fund), grants from NIH, CDC, the Global Fund, Clinton Foundation, Bill & Melinda Gates Foundation, PATH, Grand Challenges Canada, London School of Hygiene & Tropical Medicine, South African Medical Research Council, and the UK Medical Research Council, expert training, teaching development, and speaker fees paid to institution from FIND, and conference speaker fees paid to institution from Cepheid, outside the submitted work. LS reports having a patent USP 8709712 issued. SC now works for Cepheid and has a patent pending for some of the primers and probes used in the assay kit. All other authors declare no competing interests. The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the CDC.

Acknowledgments

We thank study participants, without whom this work would not have been possible, and the clinical and laboratory teams at the participating trial sites. We thank Kate Shearer for review of the analysis plan and statistical code. Since FIND was involved in the development of the Xpert and Xpert Ultra assays, key sections of statistical code were completely re-written and re-run by Shearer, as an independent statistician, to ensure an independent analysis. We thank Martin Jones, Marie Simmons, Bob Kwiatkowski, Pam Johnson, and David Persing from Cepheid for their support during assay development and with logistics for the study. We thank Nora Champouillon from FIND for facilitating the project. This work was supported by grants from the Government of Netherlands (DSO/GA-523/10) for assay development; and grants from the Government of Australia (70957), the Bill & Melinda Gates Foundation (OPP1105925), the UK aid from the UK Government (204074-101); and by the National Institute of Allergy and Infectious Diseases, NIH, Department of Health and Human Services, USA (contract #HHSN2722000900050C and K24AI104830 (SED)) for clinical evaluation. Wendy Stevens and Lesley Scott were supported by funding received from the South African Medical Research Council with funds received from the South African National Department of Health, and the UK Medical Research Council, with funds received from the UK Government’s Newton Fund under the UK/South Africa Newton Fund #015NEWTON TB. Investigational cartridges, Xpert MTB/RIF cartridges, and GeneXpert instruments were generously donated by Cepheid. Cepheid personnel had no role in study design, implementation, analysis, manuscript writing, or decision to submit the study findings for publication.

References

1 WHO. Global tuberculosis report 2016. Geneva: World Health Organization, 2016.
2 Uplekar M, Weil D, Lonnroth K, et al. WHO’s new end TB strategy. Lancet 2015; 385: 1799–801.
3 Boehme CC, Nabeta P, Hillemann D, et al. Rapid molecular detection of tuberculosis and rifampicin resistance. N Engl J Med 2010; 363: 1005–15.
4 Steingart KR, Schiller I, Horne DJ, Pai MP, Boehme CC, Dendukuri N. Xpert MTB/RIF assay for pulmonary tuberculosis and rifampicin resistance in adults. Cochrane Database Syst Rev 2014; 1: CD009593.
5 WHO. Xpert MTB/RIF assay for the diagnosis of pulmonary and extrapulmonary TB in adults and children: policy update. Geneva: World Health Organization, 2013.
6 WHO. WHO monitoring of Xpert MTB/RIF roll-out. Geneva: World Health Organization, 2014.
7 Nicol MP, Workman L, Isaacs W, et al. Accuracy of the Xpert MTB/RIF test for the diagnosis of pulmonary tuberculosis in children admitted to hospital in Cape Town, South Africa: a descriptive study. Lancet Infect Dis 2011; 11: 819–24.
8 Theron G, Peter J, van Zyl-Smit R, et al. Evaluation of the Xpert MTB/RIF assay for the diagnosis of pulmonary tuberculosis in a high HIV prevalence setting. Am J Respir Crit Care Med 2011; 184: 132–40.
9 Sohn H, Aero AD, Menzies D, et al. Xpert MTB/RIF testing in a low tuberculosis incidence, high-resource setting: limitations in accuracy and clinical impact. Clin Infect Dis 2014; 58: 970–76.
10 Theron G, Peter J, Dowdy D, Langley I, Squire SB, Dheda K. Do high rates of empirical treatment undermine the potential effect of new diagnostic tests for tuberculosis in high-burden settings? Lancet Infect Dis 2014; 14: 527–32.
11 Theron G, Zijenah L, Chanda D, et al. Feasibility, accuracy, and clinical effect of point-of-care Xpert MTB/RIF testing for tuberculosis in primary-care settings in Africa: a multicentre, randomised, controlled trial. Lancet 2014; 383: 424–35.
12 Mathys V, van de Vyvere M, de Droogh E, Soetaert K, Groenen G. False-positive rifampicin resistance on Xpert MTB/RIF caused by a silent mutation in the rpoB gene. Int J Tuberc Lung Dis 2014; 18: 1255–57.
13 Ocheretina O, Byrt E, Mabou MM, et al. False-positive rifampin resistant results with Xpert MTB/RIF version 4 assay in clinical samples with a low bacterial load. Diagn Microbiol Infect Dis 2016; 85: 53–55.
14 Chakravorty S, Simmons AM, Rowneki M, et al. The new Xpert MTB/RIF Ultra: improving detection of Mycobacterium tuberculosis and resistance to rifampin in an assay suitable for point-of-care testing. MBio 2017; 8: e00812–17.
15 Global Laboratory Initiative. Mycobacteriology laboratory manual. Geneva: Global Laboratory Initiative, 2014.
16 Rothmann MD, Wiens BL, Chan ISF. Design and analysis of non-inferiority trials. Boca Raton, FL: Chapman & Hall/CRC, 2014.
17 FDA. Guidance for industry. Non-inferiority clinical trials. Silver Spring, MD: Food and Drug Administration, 2010.
18 Cleveland WS. Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc 1979; 74: 829–36.
19 Zar HJ, Workman L, Nicol MP. Diagnosis of pulmonary tuberculosis in HIV-infected and uninfected children using Xpert MTB/RIF Ultra. International Conference of the American Thoracic Society; Washington, DC, USA; May 19–24, 2017. Abstr A7610.
20 Bahr NC, Nuwagira E, Evans EE, et al. Diagnostic accuracy of Xpert MTB/RIF Ultra for tuberculous meningitis in HIV-infected adults: a prospective cohort study. Lancet Infect Dis 2017; published online Sept 14. http://dx.doi.org/10.1016/S1473-3099(17)30474-7.
21 Theron G, Venter R, Calligaro G, et al. Xpert MTB/RIF results in patients with previous tuberculosis: can we distinguish true from false positive results? Clin Infect Dis 2016; 62: 995–1001.
22 Malherbe ST, Shenai S, Ronacher K, et al. Persisting positron emission tomography lesion activity and Mycobacterium tuberculosis mRNA after tuberculosis cure. Nat Med 2016; 22: 1094–100.
23 Hofmann-Thiel S, Hoffmann H. Evaluation of the Fluortype MTB for detection of Mycobacterium tuberculosis complex DNA in clinical specimens in a low-incidence country. BMC Infect Dis 2014; 14: 59.
24 Hofmann-Thiel S, Molodtsov N, Antonenka U, Hoffmann H. Evaluation of the Abbott RealTime MTB and RealTime MTB INH/RIF assays for direct detection of Mycobacterium tuberculosis complex and resistance markers in respiratory and extrapulmonary specimens. J Clin Microbiol 2016; 54: 3022–27.

Werutsky et al. BMC Cancer (2019) 19:5 https://doi.org/10.1186/s12885-018-5233-5

RESEARCH ARTICLE Open Access

PET-CT has low specificity for mediastinal staging of non-small-cell lung cancer in an endemic area for tuberculosis: a diagnostic test study (LACOG 0114)

Gustavo Werutsky1*, Bruno Hochhegger2, José Antônio Lopes de Figueiredo Pinto2, Jeovany Martínez-Mesa3, Mara Lise Zanini4, Eduardo Herz Berdichevski4, Eduardo Vilas4, Vinícius Duval da Silva2, Maria Teresa Ruiz Tsukazan5, Arthur Vieira5, Leandro Genehr Fritscher2, Louise Hartmann4, Marcos Alba4, Guilherme Sartori1, Cristina Matushita4, Vanessa Bortolotto1, Rayssa Ruszkowski do Amaral1, Luís Carlos Anflor Junior5, Facundo Zaffaroni1, Carlos H. Barrios1, Márcio Debiasi1 and Carlos Cezar Frietscher2

Abstract

Background: The present study aims to assess the performance of 18F-FDG PET-CT in mediastinal staging of non-small cell lung cancer (NSCLC) in a location with endemic granulomatous infectious disease.

Methods: Diagnostic test study including patients aged 18 years or older with operable stage I-III NSCLC and indication for a mediastinal lymph node biopsy. All patients underwent an 18F-FDG PET scan before invasive mediastinal staging, either through mediastinoscopy or thoracotomy, which was considered the gold standard. Surgeons and pathologists were blinded to scan results. The primary endpoint was the sensitivity, specificity, and positive and negative predictive values of PET-CT with images acquired in the 1st hour of the exam protocol, using predefined cutoffs of maximal SUV, on a per-patient basis.

Results: Overall, 85 patients with operable NSCLC underwent PET-CT scan followed by invasive mediastinal staging. Mean age was 65 years, 49 patients were male and 68 were white. One patient presented with active tuberculosis and none had HIV infection. Using any SUV_max > 0 as the qualitative criterion for positivity, sensitivity and specificity were 0.87 and 0.45, respectively. Even when the highest SUV cut-off was used (SUV_max ≥5), specificity remained low (0.79), with an estimated positive predictive value of 54%.

Conclusions: Our findings are in line with the most recent publications and guidelines, which recommend that PET-CT not be used as the sole tool for mediastinal staging, even in a region with high burden of tuberculosis.

Trial registration: The LACOG 0114 study was registered at ClinicalTrials.gov, before study initiation, under identifier NCT02664792.

Keywords: Non-small cell lung cancer, PET-CT, Mediastinal staging, Granulomatous infectious diseases
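The appraisal questions at the start of this handout ask what likelihood ratios are associated with the test results. The abstract reports only sensitivity and specificity, but likelihood ratios follow from them directly. The sketch below uses the values reported for the SUV_max > 0 criterion (sensitivity 0.87, specificity 0.45); the function names are illustrative, and the pre-test probability of 0.30 is a hypothetical value chosen only to show the calculation, not a figure from the study.

```python
def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (LR+, LR-) for a dichotomous test."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


def post_test_probability(pre_test_probability: float, likelihood_ratio: float) -> float:
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Values from the abstract for any SUV_max > 0 as the positivity criterion.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.87, specificity=0.45)

# Hypothetical pre-test probability of mediastinal nodal disease (assumption).
pre_test = 0.30
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
print(f"Post-test probability after a positive scan: {post_test_probability(pre_test, lr_pos):.2f}")
print(f"Post-test probability after a negative scan: {post_test_probability(pre_test, lr_neg):.2f}")
```

A positive likelihood ratio this close to 1 means a positive scan shifts the probability of nodal disease only modestly, which is the quantitative form of the authors' conclusion that PET-CT should not be the sole staging tool.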

* Correspondence: [email protected] 1Latin American Cooperative Oncology Group (LACOG), Ipiranga Avenue 6681, 99A, Room 806, Porto Alegre, Brazil Full list of author information is available at the end of the article

© The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Werutsky et al. BMC Cancer (2019) 19:5 Page 2 of 8

Background

Lung cancer is the leading cause of cancer-related death in the world. It is responsible for 1,350,000 new cases and 1,180,000 deaths annually worldwide [1]. In Brazil, the incidence of lung cancer is also rising, accounting for approximately 28,000 new cases and 24,500 deaths yearly, according to the most recent report from INCA, the Brazilian National Institute of Cancer [2].

Despite recent advances in early diagnosis achieved with low-dose Computed Tomography (CT) screening, most cases of lung cancer are still diagnosed at late clinical stages (CS), IIIb or IV. In Brazil, approximately 70% of patients present with locally advanced or metastatic disease [3]. Accurate staging of patients with non-small-cell lung cancer (NSCLC) is critical for defining the best treatment modality and predicting prognosis [4]. In the absence of distant metastasis, the status of mediastinal lymph nodes plays a critical role in treatment decisions. In this clinical scenario, the identification of positive mediastinal nodes changes the treatment option from surgery to a multimodality treatment approach [5].

Since 2003, PET scan (Positron Emission Tomography) with 18F-fluorodeoxyglucose (18F-FDG) has been recommended for NSCLC staging due to its high sensitivity to detect cancer [6, 7]. Currently, PET-CT is the gold-standard procedure for the non-invasive staging of NSCLC patients because it can identify distant metastasis that would pass unnoticed in CT, preventing over 30% of unnecessary thoracotomies [8].

Invasive staging of the mediastinal nodes with mediastinoscopy is still the standard of care, and the value of PET-CT for this indication is debatable. Previous studies showed sensitivity ranging from 77 to 90% and specificity of 86% for PET-CT in detecting the spread of NSCLC to the mediastinal lymph nodes [9]. One of the most important problems with the use of PET-CT in this situation is false-positive findings. These occur because 18F-FDG is not a tumor-specific agent, and other conditions such as granulomatous diseases might present with a high 18F-FDG uptake [10–12].

The aim of this study is to validate PET-CT performance on mediastinal staging of patients with NSCLC living in an endemic area of tuberculosis.

Methods

Trial design

The present study is a diagnostic test study designed to evaluate PET-CT performance on the diagnosis of metastatic mediastinal lymph nodes compared with the gold-standard invasive staging with biopsy in patients with non-small cell lung carcinoma.

Patients

Patients were recruited from the department of Thoracic Surgery at Hospital São Lucas, a tertiary hospital supported by the Public Health System in southern Brazil. Patients aged 18 years or older were eligible if they had newly diagnosed or highly suspected NSCLC and indication for mediastinal lymph node biopsy based on current practice guidelines for staging. All patients were considered to have operable stage I-III disease after initial evaluation (medical history, physical examination and contrast-enhanced CT scan of the chest and upper abdomen). Exclusion criteria were any prior treatment for NSCLC (surgery, chemotherapy or radiotherapy), confirmed distant metastases, pregnancy (women of childbearing age had to agree to take contraceptive measures and present a negative pregnancy test), and altered hematologic and biochemical function.

Procedures

Patients included in the study first underwent a PET-CT scan, which was obtained using an integrated PET–CT system (GE Discovery 600) as follows: after a 6-h fast, 18F-FDG was given intravenously with activity of 10 to 15 mCi. Images were acquired 1 and 2 h after the administration of 18F-FDG. Patients were scanned from the head to the upper thigh. A diagnostic CT scan, obtained with the use of a standard protocol (80 to 100 mA, 120 kV, a tube-rotation time of 0.5 s per rotation, a pitch of 6, and a slice thickness of 5 mm, with 70 ml of intravenous contrast medium containing 300 mg of iodine per milliliter [Ultravist, Bayer Schering], administered at a rate of 2.5 ml per second), preceded the PET scan (a 5-min emission scan per table position and 25 min total). The PET scan was reconstructed by filtered back-projection and ordered-subset expectation-maximization (OS-EM), with data from the CT scan used for attenuation correction.

Results were evaluated by a radiologist and a nuclear medicine specialist. The maximum standardized uptake value (SUV_max) of the primary tumor was measured and calculated by the software according to standard formulas. Mediastinal lymph node stations were considered positive for metastatic spread if they exhibited focally increased FDG uptake higher than the normal background activity (activity > background), as determined by qualitative analysis. After the PET-CT scan, invasive mediastinal staging was performed. Mediastinoscopy and thoracotomy were considered valid invasive mediastinal staging procedures and were chosen at the surgeon’s discretion. The surgical team was blinded to the PET-CT results.

Statistical analysis

The finding of mediastinal lymph nodes with increased 18F-FDG uptake on PET/CT was compared with pathological examination of lymph nodes obtained in the invasive staging procedure. The primary endpoint was PET-CT performance with images acquired in the 1st hour of the exam protocol on a per-patient basis. The secondary endpoints were PET-CT performance in the 2nd hour of the exam protocol, and PET-CT performance in the 1st hour of the exam protocol evaluated on a per-nodal-station basis (2R, 2L, 4R, 4L and 7). Categorical variables were described as their count and percentage. Numerical variables were described as their median, minimum and maximum. Sensitivities, specificities and predictive values were calculated using predefined cutoffs of maximal SUV. The 95% confidence intervals were calculated for sensitivity, specificity and predictive values. A receiver-operating-characteristic (ROC) analysis was performed on the PET-CT per-patient results.

In order to obtain a minimal sensitivity and specificity of 90%, while expecting a 40% rate of positive lymph nodes, we estimated that 89 patients would be needed, accepting a two-sided type I error of 5%. Assuming a 10% ineligibility rate, the total sample size was 100 patients. Statistical analysis was performed with the use of SPSS software, version 18, and SAS software, version 9.4.

Results

Baseline characteristics

From August 2014 to August 2016, 108 patients were enrolled in the study. Eight patients did not undergo the PET-CT scan. Of the remaining 100 patients, 85 underwent mediastinal sampling biopsy (by mediastinoscopy or surgery) after PET-CT and were considered for the primary analysis. The STARD flow diagram is shown in Fig. 1.

Table 1 shows baseline characteristics of eligible patients who underwent mediastinal sampling. Median age was 65 years, 57.6% of patients were male and 80.0% were white. Current or former smokers accounted for 94.1% of the sample, with a high tobacco exposure (median of 45 pack-years). Despite the relatively high incidence of tuberculosis in Brazil, only one patient had known TB infection and no patient had an HIV infection.
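The Statistical analysis section reports 95% confidence intervals but does not name the interval method. As a hedged illustration only, the sketch below computes an exact (Clopper-Pearson) binomial interval by bisection with the Python standard library. The choice of the exact method is an assumption, not something the article states, and the names `binom_cdf`, `_solve` and `clopper_pearson` are illustrative. The example uses the 20 of 23 node-positive patients that the Results report as correctly identified by PET-CT.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def _solve(f, target, increasing, tol=1e-10):
    """Bisection on [0, 1] for f(p) = target, where f is monotone in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid   # root lies to the right of mid
        else:
            hi = mid   # root lies to the left of mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a proportion k/n.

    Assumption: the article does not say which method it used.
    """
    # Lower bound: P(X >= k | p) = alpha/2, which increases with p.
    lower = 0.0 if k == 0 else _solve(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2, True)
    # Upper bound: P(X <= k | p) = alpha/2, which decreases with p.
    upper = 1.0 if k == n else _solve(
        lambda p: binom_cdf(k, n, p), alpha / 2, False)
    return lower, upper

low, high = clopper_pearson(20, 23)
print(f"sensitivity 20/23 = {20 / 23:.2f}, 95% CI {low:.2f} to {high:.2f}")
```

Under this assumption the interval comes out close to the 0.66 to 0.97 that Table 3 reports for per-patient sensitivity at SUV_max > 0, which is a useful reminder of how wide the uncertainty is with only 23 node-positive patients.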

Fig. 1 STARD flow diagram for the evaluation of 18F-FDG PET-CT on mediastinal staging of non-small cell lung cancer

Table 1 Patient’s characteristics at baseline (n = 85)

Characteristic | N (%) or median (min-max)
Age (years) | 65.0 (47.0–80.0)
Sex: Male | 49 (57.6%)
Sex: Female | 36 (42.4%)
Race: White | 68 (80.0%)
Race: Black | 7 (8.2%)
Race: Other | 10 (11.8%)
Smoking status: Current | 35 (41.2%)
Smoking status: Former | 45 (52.9%)
Smoking status: Never | 5 (5.9%)
Tobacco exposure (pack-years)a | 45.0 (8.1–120.0)
Comorbidities: Hypertension | 40 (47.1%)
Comorbidities: Diabetes | 11 (12.9%)
Comorbidities: COPD | 30 (35.3%)
Comorbidities: Asthma | 9 (10.6%)
Comorbidities: Active tuberculosis | 1 (1.2%)
Comorbidities: HIV positive | 0 (0.0%)

Data is presented here as mean (minimum-maximum) or absolute (relative) frequencies.
a This analysis takes into account only the 79 patients that were smokers or former smokers

Comparison of PET-CT and mediastinal invasive staging

Of 85 patients, only 23 (27.1%) had pathological mediastinal involvement. Of these, PET-CT correctly identified 20 patients (86.9%), who showed an increased uptake of 18F-FDG. Conversely, PET-CT showed increased uptake in the mediastinum of 34 patients who were later not confirmed to have pathological mediastinal involvement (false-positive rate of 54.8%). Of the 62 patients who did not have lymph node involvement on histological analysis, 28 did not have increased uptake on PET-CT (true-negative rate of 45.2%). When a higher cut-off was used (SUV_max ≥ 5), the false-positive rate fell to 21.0% and the true-negative rate increased to 79.0%. For the same SUV cut-offs, we found an increased FDG uptake in hour 2, although this difference may not be clinically relevant.

When the 212 available nodal stations were evaluated, 38 of them had pathological mediastinal involvement (17.9%). Of those 38, PET-CT correctly identified 25 (65.8%). Of the 174 lymph node stations without involvement on histological analysis, 128 did not have increased uptake on PET-CT (true-negative rate of 73.6%). Considering SUV_max ≥ 5, the true-positive rate decreased to 47.4% and the true-negative rate increased to 91.4% (Tables 2 and 3).

As shown in Table 3, the highest sensitivity (87%) was observed for the SUV_max > 0 cut-off. The negative predictive value (NPV) using this cut-off was 90%, which was unchanged for the images acquired in hour 2. By contrast, in the scenario with the highest specificity, when only uptake with SUV ≥ 5 was considered positive, we found a positive predictive value of only 54%. When using the liver FDG uptake as cut-off for SUV positivity, the sensitivity and specificity were not improved.

The image acquisition in hour 2 of the protocol did not change the accuracy of the test using the ROC (Receiver Operator Characteristic) curve (Fig. 2).

Among 31 patients with no uptake in the mediastinum (SUV = 0), 3 (9.6%) had metastatic lymph node involvement after mediastinal invasive staging (Fig. 3).

Table 2 PET-CT findings and pathological evaluation of mediastinal lymph nodes after surgical staging (per-patient and per-nodal-station)

Columns: PP = per-patient (n = 85); PNS = per-nodal-station (n = 212); H1 = hour 1; H2 = hour 2; "+" and "−" = PET-CT positive and negative.

PET-CT SUV cut-off | Pathological evaluation | PP H1 + | PP H1 − | PP H2 + | PP H2 − | PNS H1 + | PNS H1 −
SUV_Maxa > 0 | Positive | 20 | 3 | 20 | 3 | 25 | 13
SUV_Maxa > 0 | Negative | 34 | 28 | 35 | 27 | 46 | 128
SUV_Maxa ≥ 2.5 | Positive | 18 | 5 | 19 | 4 | 23 | 15
SUV_Maxa ≥ 2.5 | Negative | 30 | 32 | 30 | 32 | 40 | 134
SUV_Maxa ≥ 3 | Positive | 16 | 7 | 18 | 5 | 21 | 17
SUV_Maxa ≥ 3 | Negative | 24 | 38 | 26 | 36 | 32 | 142
SUV_Maxa ≥ 5 | Positive | 15 | 8 | 16 | 7 | 18 | 20
SUV_Maxa ≥ 5 | Negative | 13 | 49 | 18 | 44 | 15 | 159
SUV_Maxa ≥ SUV Liver | Positive | 18 | 5 | 18 | 4 | – | –
SUV_Maxa ≥ SUV Liver | Negative | 23 | 37 | 25 | 34 | – | –

a SUV_Max: Maximum value of SUV uptake between 2R, 2L, 4R, 4L, 7 and aortopulmonary when evaluating per-patient, and 2R, 2L, 4R, 4L and 7 when evaluating per-nodal-station

Table 3 Sensitivity and specificity of PET-CT using different maximum SUV cutoffs for the staging of the mediastinal lymph nodes (per-patient and per-nodal-station)

Values are estimates with 95% confidence intervals.

Cut-off | Measure | Hour 1 (Per-Patient) | Hour 2 (Per-Patient) | Hour 1 (Per-Nodal-Station)
SUV_Maxa > 0 | Sensitivity | 0.87 (0.66–0.97) | 0.87 (0.66–0.97) | 0.66 (0.49–0.80)
SUV_Maxa > 0 | Specificity | 0.45 (0.33–0.58) | 0.44 (0.31–0.57) | 0.74 (0.66–0.80)
SUV_Maxa > 0 | Positive Predictive Value | 0.37 (0.31–0.44) | 0.36 (0.30–0.43) | 0.35 (0.28–0.43)
SUV_Maxa > 0 | Negative Predictive Value | 0.90 (0.76–0.97) | 0.90 (0.75–0.96) | 0.91 (0.86–0.94)
SUV_Maxa ≥ 2.5 | Sensitivity | 0.78 (0.56–0.93) | 0.83 (0.61–0.95) | 0.61 (0.43–0.76)
SUV_Maxa ≥ 2.5 | Specificity | 0.52 (0.39–0.65) | 0.52 (0.39–0.65) | 0.77 (0.70–0.83)
SUV_Maxa ≥ 2.5 | Positive Predictive Value | 0.38 (0.30–0.46) | 0.39 (0.32–0.47) | 0.37 (0.28–0.46)
SUV_Maxa ≥ 2.5 | Negative Predictive Value | 0.87 (0.74–0.94) | 0.89 (0.76–0.95) | 0.90 (0.86–0.93)
SUV_Maxa ≥ 3 | Sensitivity | 0.70 (0.47–0.87) | 0.78 (0.56–0.93) | 0.55 (0.38–0.71)
SUV_Maxa ≥ 3 | Specificity | 0.61 (0.48–0.73) | 0.58 (0.45–0.71) | 0.82 (0.75–0.87)
SUV_Maxa ≥ 3 | Positive Predictive Value | 0.40 (0.31–0.50) | 0.41 (0.33–0.50) | 0.40 (0.30–0.50)
SUV_Maxa ≥ 3 | Negative Predictive Value | 0.84 (0.74–0.91) | 0.88 (0.76–0.94) | 0.89 (0.85–0.92)
SUV_Maxa ≥ 5 | Sensitivity | 0.65 (0.43–0.84) | 0.70 (0.47–0.87) | 0.47 (0.31–0.64)
SUV_Maxa ≥ 5 | Specificity | 0.79 (0.67–0.88) | 0.71 (0.58–0.82) | 0.91 (0.86–0.95)
SUV_Maxa ≥ 5 | Positive Predictive Value | 0.54 (0.40–0.67) | 0.47 (0.36–0.59) | 0.55 (0.40–0.68)
SUV_Maxa ≥ 5 | Negative Predictive Value | 0.86 (0.78–0.92) | 0.86 (0.77–0.92) | 0.89 (0.85–0.92)
SUV_Maxa ≥ SUV Liver | Sensitivity | 0.78 (0.56–0.93) | 0.82 (0.60–0.95) | –
SUV_Maxa ≥ SUV Liver | Specificity | 0.62 (0.48–0.74) | 0.58 (0.44–0.70) | –
SUV_Maxa ≥ SUV Liver | Positive Predictive Value | 0.44 (0.35–0.54) | 0.42 (0.34–0.51) | –
SUV_Maxa ≥ SUV Liver | Negative Predictive Value | 0.88 (0.77–0.94) | 0.90 (0.77–0.96) | –

a SUV_Max: Maximum value of SUV uptake between 2R, 2L, 4R, 4L, 7 and aortopulmonary when evaluating per-patient, and 2R, 2L, 4R, 4L and 7 when evaluating per-nodal-station

Fig. 2 ROC curve comparing PET-CT performance for images acquired in hour 1 and 2

Discussion

Therapeutic options for patients with non-metastatic potentially resectable NSCLC are mainly determined by the presence or absence of mediastinal lymph node metastases (N2). While patients with resectable disease and no evidence of mediastinal lymph node involvement have surgery as the primary treatment, patients with N2 disease usually undergo a multimodality approach in order to maximize treatment outcomes.

Compared with cervical mediastinoscopy, PET-CT has the advantage of being a non-invasive staging method that is becoming increasingly available and has solid data regarding its accuracy. Nevertheless, the pivotal studies were undertaken in areas without endemic cases of tuberculosis and other infectious granulomatous disease [13, 14].

The role of PET-CT in mediastinal staging has been reviewed in a Cochrane meta-analysis [15], which included 18 studies that used 18F-FDG uptake higher than the background activity as qualitative criteria for PET-CT positivity. Sensitivity and specificity estimates were 77.4% (95% CI 65.3 to 86.1) and 90.1% (95% CI 85.3 to 93.5), respectively.

However, some clinicopathological factors have been associated with incorrect PET/CT staging. On multivariate analysis, Al Sarraf [16] showed that rheumatoid arthritis, non-insulin dependent diabetes, history of tuberculosis, presence of atypical adenomatous hyperplasia and pneumonia were independent factors causing inaccurate staging of mediastinal lymph nodes.

An important factor in countries with a high burden of granulomatous infectious disease is the reduction of PET-CT reliability in this scenario [17, 18]. According to the World Health Organization (WHO), Brazil ranks as one of the top 20 countries in terms of tuberculosis incidence [19]. In particular, Porto Alegre, the city where this study took place, has an incidence rate of 99.3 cases per 100,000 population [20].

Our report shows that no major impact on sensitivity is seen in an area endemic for tuberculosis, and the sensitivity is similar to that in the literature [15]. On the other hand, specificity is clearly affected, even when a higher SUV_max cut-off was used. On the per-patient analysis, for SUV_max > 0 we estimated specificity and positive predictive value equal to 0.45 and 0.37, respectively. When a higher cut-off was used (SUV_max ≥ 5), specificity and positive predictive value increased to 0.79 and 0.54, respectively. We also found that for the same SUV cut-off (≥ 5), the per-nodal-station specificity was slightly higher than the per-patient evaluation (0.91 vs 0.79, respectively). This finding is consistent with previous reports showing a decrease in PET-CT specificity when considering only 18F-FDG uptake as qualitative criteria for a positive exam [21–23].
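To make the per-patient figures concrete, the sketch below recomputes sensitivity, specificity and predictive values from the hour 1 counts in Table 2, and adds likelihood ratios. The likelihood ratios are not reported in the article; they are derived here with the standard formulas LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity. The function name `diagnostic_metrics` and the dictionary keys are illustrative choices.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Accuracy measures from a 2x2 table (test result vs reference standard).

    Assumes specificity is neither 0 nor 1, otherwise a likelihood ratio
    would divide by zero.
    """
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_pos": sens / (1 - spec),   # derived here, not reported in the article
        "lr_neg": (1 - sens) / spec,   # derived here, not reported in the article
    }

# Per-patient, hour 1 counts from Table 2, ordered as (TP, FN, FP, TN).
cutoffs = {
    "SUV_max > 0": (20, 3, 34, 28),
    "SUV_max >= 5": (15, 8, 13, 49),
}

results = {name: diagnostic_metrics(*counts) for name, counts in cutoffs.items()}
for name, metrics in results.items():
    print(name, {key: round(value, 2) for key, value in metrics.items()})
```

The recomputed sensitivity, specificity and predictive values agree with Table 3. The derived positive likelihood ratio is about 1.6 at SUV_max > 0 and about 3.1 at SUV_max ≥ 5, so under these counts a positive scan shifts the pre-test probability only modestly at either cut-off.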

Fig. 3 Correlation between maximum SUV and anatomopathological finding for the mediastinal lymph nodes

Kim [23] and Lee [21] have reported two cohorts from South Korea, an endemic country for tuberculosis, with specificity of 0.84 and 0.73, respectively. Nonetheless, both studies performed a secondary analysis that only considered positive mediastinal lymph nodes with 18F-FDG uptake without associated calcification or high attenuation. This secondary analysis showed improved specificity of 0.96 and 0.89, respectively.

Additionally, dual time point PET-CT scanning for mediastinal node staging in NSCLC is still controversial. Although some studies have reported that it may be helpful in differentiating malignancy from benign processes, most studies have demonstrated significant overlap of FDG uptake patterns between benign and malignant lesions on delayed time point images [24–26]. We found a higher PET-CT positivity for the same SUV cut-offs, although not clinically relevant, which is consistent with previous reports [27].

Our study has some limitations. First, PET-CT has been compared against invasive staging and not the final pathologic report after surgery for all patients. Since surgery was not performed in some of the patients diagnosed with N2 disease, we did not have the pathological specimen after surgery for all patients. Second, the sensitivity found in this report is higher than that reported in the Cochrane meta-analysis; this could be explained by the number of patients included. Schmidt-Hansen [15] found a significantly higher sensitivity in studies with < 100 participants compared with studies with 100 to 199 participants.

Conclusions

In conclusion, our findings are in line with the most recent publications and guidelines, which recommend that PET-CT must not be solely used as a tool to mediastinal staging, even in a region with high burden of tuberculosis.

Abbreviations

18F-FDG PET-CT: Positron emission tomography with 2-deoxy-2-[fluorine-18]fluoro-D-glucose integrated with computed tomography; CNPq: National Council for Scientific and Technological Development; CS: Clinical stage; CT: Computed tomography; HIV: Human immunodeficiency virus; INCA: Brazilian National Institute of Cancer; kV: Kilovolt; LACOG: Latin American Cooperative Oncology Group; mA: Milliampere; mCi: Millicurie; mg: Milligram; ml: Milliliter; NSCLC: Non-small cell lung cancer; OS-EM: Ordered-subset expectation-maximization; ROC: Receiver operator characteristic; STARD: Essential items for reporting diagnostic accuracy studies; SUV: Standardized uptake value; SUV_max: Maximum standardized uptake value; TB: Tuberculosis; WHO: World Health Organization

Acknowledgements

We would like to acknowledge SAS Institute Inc. for providing support to our study by providing access to SAS® statistical products.

Funding

This study was funded by the National Council for Scientific and Technological Development (CNPq), a public agency related to the Brazilian Science and Technology Ministry. Its role was exclusively financial support, without any participation in the design and conduct of the research, collection, analysis and interpretation of data and preparation of the manuscript.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Authors’ contributions

GW, BH, JMM, LGF, CHB, MD and CCZ were responsible for designing and conducting the study as well as interpreting study data; JALFP, MTRZ and AV performed thoracic procedures; MLZ, EHB, EV, LH, MA, CM and LCAJ were responsible for performing and interpreting PET-CT scans; VDS performed the histological examination of mediastinal lymph nodes and lung tumors; VB, RRA, GS and FZ performed data monitoring and statistical analysis. All authors read and approved the final manuscript.

Ethics approval and consent to participate

The study was conducted in compliance with all national and international ethical standards for research with humans and for research using radiopharmaceuticals. All study procedures were approved by the Pontifícia Universidade Católica do Rio Grande do Sul Institutional Ethics Committee (approval number 641.287) and patients gave written informed consent before being enrolled.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author details

1Latin American Cooperative Oncology Group (LACOG), Ipiranga Avenue 6681, 99A, Room 806, Porto Alegre, Brazil. 2Medical School, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre, Brazil. 3IMED, School of Medicine, Passo Fundo, Brazil. 4Brain Institute of Rio Grande do Sul, Porto Alegre, Brazil. 5Hospital São Lucas, Pontifical Catholic University of Rio Grande do Sul, Porto Alegre, Brazil.

Received: 5 May 2018 Accepted: 19 December 2018

References

1. Global Burden of Disease Cancer Collaboration, Fitzmaurice C, Allen C, Barber RM, Barregard L, Bhutta ZA, et al. Global, regional, and National Cancer Incidence, mortality, years of life lost, years lived with disability, and disability-adjusted life-years for 32 Cancer groups, 1990 to 2015: a systematic analysis for the global burden of disease study. JAMA Oncol. 2017;3:524.
2. Instituto Nacional de Câncer José Alencar Gomes. Estimativa. Incidência de Câncer no Brasil. Rio de Janeiro: INCA; 2018. http://www1.inca.gov.br/estimativa/2018/estimativa-2018.pdf.
3. Araujo LH, Baldotto C, de CJG, Katz A, Ferreira CG, Mathias C, et al. Lung cancer in Brazil. J Bras Pneumol. 2018;44:55–64.
4. Goldstraw P, Chansky K, Crowley J, Rami-Porta R, Asamura H, Eberhardt WEE, et al. The IASLC lung Cancer staging project: proposals for revision of the TNM stage groupings in the forthcoming (eighth) edition of the TNM classification for lung Cancer. J Thorac Oncol Off Publ Int Assoc Study Lung Cancer. 2016;11:39–51.
5. Stamatis G. Staging of lung cancer: the role of noninvasive, minimally invasive and invasive techniques. Eur Respir J. 2015;46:521–31.
6. De Leyn P, Dooms C, Kuzdzal J, Lardinois D, Passlick B, Rami-Porta R, et al. Revised ESTS guidelines for preoperative mediastinal lymph node staging for non-small-cell lung cancer. Eur J Cardiothorac Surg. 2014;45(5):787–98.
7. National Comprehensive Cancer Network, editor. Non-small cell lung cancer: version 3.2018. 2018. https://www.nccn.org/professionals/physician_gls/pdf/nscl.pdf. Accessed 26 Mar 2018.
8. Vallabhajosula S. Molecular Imaging: Radiopharmaceuticals for PET and SPECT. Berlin Heidelberg: Springer-Verlag; 2009. www.springer.com/la/book/9783540767350. Accessed 14 Mar 2018.
9. Silvestri GA, Gonzalez AV, Jantz MA, Margolis ML, Gould MK, Tanoue LT, et al. Methods for staging non-small cell lung cancer. Chest. 2013;143:e211S–50S.
10. Cohade C, Osman M, Pannu HK, Wahl RL. Uptake in supraclavicular area fat (“USA-fat”): description on 18F-FDG PET/CT. J Nucl Med Off Publ Soc Nucl Med. 2003;44:170–6.
11. Hany TF, Gharehpapagh E, Kamel EM, Buck A, Himms-Hagen J, von Schulthess GK. Brown adipose tissue: a factor to consider in symmetrical tracer uptake in the neck and upper chest region. Eur J Nucl Med Mol Imaging. 2002;29:1393–1398.
12. Abouzied MM, Crawford ES, Nabi HA. 18F-FDG imaging: pitfalls and artifacts. J Nucl Med Technol. 2005;33:145–55 quiz 162–3.
13. Terán MD, Brock MV. Staging lymph node metastases from lung cancer in the mediastinum. J Thorac Dis. 2014;6:230–6.
14. Birim Ö, Kappetein AP, Stijnen T, Bogers AJJC. Meta-analysis of positron emission tomographic and computed tomographic imaging in detecting mediastinal lymph node metastases in nonsmall cell lung cancer. Ann Thorac Surg. 2005;79:375–82.
15. Schmidt-Hansen M, Baldwin DR, Hasler E, Zamora J, Abraira V, Roqué i Figuls M. PET-CT for assessing mediastinal lymph node involvement in patients with suspected resectable non-small cell lung cancer. Cochrane Database Syst Rev. 2014. https://doi.org/10.1002/14651858.CD009519.pub2.
16. Al-Sarraf N, Aziz R, Doddakula K, Gately K, Wilson L, McGovern E, et al. Factors causing inaccurate staging of mediastinal nodal involvement in non-small cell lung cancer patients staged by positron emission tomography. Interact Cardiovasc Thorac Surg. 2007;6:350–3.
17. Chang JM, Lee HJ, Goo JM, Lee H-Y, Lee JJ, Chung J-K, et al. False positive and false negative FDG-PET scans in various thoracic diseases. Korean J Radiol. 2006;7:57–69.
18. Harkirat S, Anana S, Indrajit L, Dash A. Pictorial essay: PET/CT in tuberculosis. Indian J Radiol Imaging. 2008;18:141–7.
19. World Health Organization. Global tuberculosis report 2016. 2016. http://apps.who.int/iris/bitstream/10665/250441/1/9789241565394-eng.pdf. Accessed 14 Mar 2018.
20. Secretaria de Vigilância em Saúde. Boletim Epidemiológico. 9th edition. Brasília - DF: Ministério da Saúde; 2015. http://portalarquivos.saude.gov.br/images/pdf/2015/marco/25/Boletim-tuberculose-2015.pdf
21. Lee JW, Kim BS, Lee DS, Chung J-K, Lee MC, Kim S, et al. 18F-FDG PET/CT in mediastinal lymph node staging of non-small-cell lung cancer in a tuberculosis-endemic country: consideration of lymph node calcification and distribution pattern to improve specificity. Eur J Nucl Med Mol Imaging. 2009;36:1794–802.
22. Liao C-Y, Chen J-H, Liang J-A, Yeh J-J, Kao C-H. Meta-analysis study of lymph node staging by 18 F-FDG PET/CT scan in non-small cell lung cancer: comparison of TB and non-TB endemic regions. Eur J Radiol. 2012;81:3518–23.
23. Kim YK, Lee KS, Kim B-T, Choi JY, Kim H, Kwon OJ, et al. Mediastinal nodal staging of nonsmall cell lung cancer using integrated 18F-FDG PET/CT in a tuberculosis-endemic country: diagnostic efficacy in 674 patients. Cancer. 2007;109:1068–77.
24. Shinozaki T, Utano K, Fujii H, Utano Y, Sasaki T, Kijima S, et al. Routine use of dual time 18F-FDG PET for staging of preoperative lung cancer: does it affect clinical management? Jpn J Radiol. 2014;32:476–81.
25. Cheng G, Torigian DA, Zhuang H, Alavi A. When should we recommend use of dual time-point and delayed time-point imaging techniques in FDG PET? Eur J Nucl Med Mol Imaging. 2013;40:779–87.
26. Zhao M, Ma Y, Yang B, Wang Y. A meta-analysis to evaluate the diagnostic value of dual-time-point F-fluorodeoxyglucose positron emission tomography/computed tomography for diagnosis of pulmonary nodules. J Cancer Res Ther. 2016;12:304.
27. Li X, Zhang H, Xing L, Ma H, Xie P, Zhang L, et al. Mediastinal lymph nodes staging by 18F-FDG PET/CT for early stage non-small cell lung cancer: a multicenter study. Radiother Oncol. 2012;102:246–50.

Original Investigation | Oncology

Assessment of the Diagnostic Utility of Serum MicroRNA Classification in Patients With Diffuse Glioma

Makoto Ohno, MD, PhD; Juntaro Matsuzaki, MD, PhD; Junpei Kawauchi, BS; Yoshiaki Aoki, BS; Junichiro Miura, BE; Satoko Takizawa, PhD; Ken Kato, MD, PhD; Hiromi Sakamoto, PhD; Yuko Matsushita, BS; Masamichi Takahashi, MD, PhD; Yasuji Miyakita, MD; Koichi Ichimura, MD, PhD; Yoshitaka Narita, MD, PhD; Takahiro Ochiya, PhD

Key Points

Question Can serum microRNAs be used to detect diffuse glioma and to differentiate glioblastoma, primary central nervous system lymphoma, and metastatic brain tumors?

Findings In this case-control diagnostic study of 266 patients with brain or spinal tumors and 314 control patients without cancer, the Glioma Index, constructed using 3 microRNAs, distinguished patients with diffuse glioma from controls with high sensitivity (0.95) and specificity (0.97). The 3-Tumor Index, constructed using 48 microRNAs, positively detected 16 of 17 glioblastomas (94.1%), 4 of 5 metastatic brain tumors (80.0%), and 4 of 8 primary central nervous system lymphomas (50.0%).

Meaning This study appears to have identified promising microRNA combinations for detecting diffuse glioma and for distinguishing histologic findings in brain tumors.

Author affiliations and article information are listed at the end of this article.

Abstract

IMPORTANCE A blood-based screening tool for detecting diffuse glioma is necessary to improve clinical outcomes.

OBJECTIVES To establish models using serum microRNAs to distinguish patients with diffuse glioma from control individuals without cancer (the Glioma Index) and to differentiate glioblastoma (GBM), primary central nervous system lymphoma (PCNSL), and metastatic brain tumors (the 3-Tumor Index).

DESIGN, SETTING, AND PARTICIPANTS This retrospective, case-control diagnostic study included 157 patients with diffuse glioma and 109 patients with central nervous system (CNS) diseases other than diffuse glioma diagnosed from August 1, 2008, through May 1, 2016, and 314 sex- and age-matched controls without cancer. Samples of patients with diffuse glioma and controls were randomly divided into training and validation set 1, and those of patients with CNS diseases other than diffuse glioma were allocated to an exploratory set. Samples of patients with GBM, PCNSL, and metastatic brain tumors were randomly divided into training and validation set 2. Data were analyzed from April 1, 2018, to March 31, 2019.

MAIN OUTCOMES AND MEASURES The expression of 2565 microRNAs was assessed, and the diagnostic performance was evaluated by calculating the area under the receiver operating characteristics curve (AUC), sensitivity, specificity, and accuracy.

RESULTS A total of 580 patients were included in the analysis (309 [53.3%] male; median age, 57 years [range, 10-87 years]). In training set 1, 100 patients with diffuse glioma (median age, 56 years [range, 14-87 years]; 55 male [55.0%]) were compared with 200 control patients (median age, 56 years [range, 14-87 years]; 105 male [52.5%]), and the Glioma Index was constructed using 3 microRNAs (miR-4763-3p, miR-1915-3p, and miR-3679-5p). In validation set 1, the AUC was 0.99 (95% CI, 0.99-1.00); sensitivity, 0.95 (95% CI, 0.89-1.00); and specificity, 0.97 (95% CI, 0.93-1.00). The Glioma Index classified 39 of 42 PCNSL samples (92.9%) and 25 of 28 metastatic brain tumor samples (89.3%) as positive and 2 of 2 spinal tumors (100%) as negative in the exploratory set. In training set 2, 68 patients with GBM, 34 with PCNSL, and 23 with metastatic brain tumor were compared, and the 3-Tumor Index was constructed using 48 microRNAs. The 3-Tumor Index had an accuracy of 0.80, positively detecting 16 of 17 GBM samples (94.1%), 4 of 5 metastatic brain tumor samples (80.0%), and 4 of 8 PCNSL samples (50.0%) in validation set 2.

CONCLUSIONS AND RELEVANCE This study appears to have identified promising serum microRNA combinations for detecting diffuse glioma and for assessing histologic features of brain tumors.
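As a quick arithmetic check while appraising the abstract, the sketch below recomputes the overall accuracy of the 3-Tumor Index from the three per-class counts quoted for validation set 2. The variable names are illustrative, and the calculation assumes that "accuracy" means correctly classified samples divided by all samples, which the abstract does not define explicitly.

```python
# Validation set 2 counts quoted in the abstract: (correctly detected, total).
validation_set_2 = {
    "GBM": (16, 17),
    "metastatic brain tumor": (4, 5),
    "PCNSL": (4, 8),
}

# Per-class detection rate and pooled accuracy (assumed definition, see lead-in).
per_class = {name: hit / total for name, (hit, total) in validation_set_2.items()}
correct = sum(hit for hit, _ in validation_set_2.values())
total = sum(n for _, n in validation_set_2.values())
accuracy = correct / total

for name, rate in per_class.items():
    print(f"{name}: {rate:.1%}")
print(f"overall accuracy: {correct}/{total} = {accuracy:.2f}")
```

The pooled figure of 24 out of 30 reproduces the reported accuracy of 0.80, and the breakdown shows that the overall number rests on very small class sizes, with only 5 metastatic and 8 PCNSL samples.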

JAMA Network Open. 2019;2(12):e1916953. doi:10.1001/jamanetworkopen.2019.16953 (Reprinted) December 6, 2019

Open Access. This is an open access article distributed under the terms of the CC-BY License.

JAMA Network Open | Oncology: Diagnostic Utility of Serum MicroRNA Classification in Patients With Diffuse Glioma

Introduction

Diffuse gliomas are the most common primary malignant brain tumors, with an incidence of 1.32 to 5.73 cases per 100 000 adults.1 They are diagnosed histologically based on the World Health Organization 2016 brain tumor classification, which integrates histopathologic diagnosis with molecular features.2 Glioblastoma (GBM) is the most malignant diffuse glioma, with a 5-year survival rate of approximately 15%.3 Standard methods for detecting glioma include neuroradiological examinations such as computed tomography and magnetic resonance imaging. Although these methods are highly sensitive and reliable, their routine use is limited by their cost and inconvenience. Early detection of cancer using screening tools, such as mammography for breast cancer, can improve patient outcomes4; however, there are no screening methods for diffuse glioma. Therefore, a minimally invasive screening test for diffuse glioma appears to be urgently needed.

Several recent studies investigated the role of circulating microRNAs (miRNAs) as diagnostic biomarkers. MicroRNAs are noncoding RNAs consisting of 19 to 24 nucleotides, and they serve as hubs in gene regulatory networks by controlling numerous targets via RNA silencing and posttranscriptional regulation of gene expression.5 MicroRNAs are involved in many biological activities, including cancer development. Circulating miRNAs are stable,6 and prolonged storage at room temperature, freezing, and thawing have minimal effects on miRNA expression levels.7 Circulating miRNA tests are less invasive than other methods and are therefore good candidates for cancer screening tests. We used serum samples from 580 patients to identify promising miRNAs for the detection of diffuse gliomas. We also explored the potential of miRNA profiles to discriminate among GBM, primary central nervous system lymphoma (PCNSL), and metastatic brain tumor.

Methods

Study Population The study was approved by the institutional review board of the National Cancer Center Hospital (NCCH) and the Research Committee of Medical Corporation Shintokai Yokohama Minoru Clinic. Written informed consent was obtained from all participants. The study followed the Standards for Reporting of Diagnostic Accuracy (STARD) reporting guideline. Serum samples were obtained from patients who underwent surgery for suspected brain or spinal tumors at the Department of Neurosurgery and Neuro-oncology of the NCCH (n = 215) or who were referred to NCCH after undergoing surgery elsewhere (n = 51) from August 1, 2008, through May 1, 2016. Serum samples were registered and stored at −20 °C in the National Cancer Center Biobank. Patients who were referred to NCCH after undergoing surgery elsewhere were included only if they had residual tumor at the time of referral to NCCH. Tumor diagnoses were based on the World Health Organization 2016 classification2 (eMethods 1 in the Supplement). The 266 patients with central nervous system (CNS) disease included 157 with diffuse gliomas, 13 with glial tumors other than diffuse glioma (7 ependymomas, 3 pilocytic astrocytomas, 2 anaplastic gangliogliomas, and 1 unclassified glioma), 42 with PCNSL, 28 with metastatic brain tumor, 22 with benign brain tumors (19 meningiomas, 2 hemangiopericytomas, and 1 intracranial schwannoma), 2 with spinal schwannomas, 1 with trauma, and 1 with infarction. The 314 control patients without cancer included 157 patients with benign diseases and no cancer treated at the NCCH from 2008 through 2016 (noncancer sample 1); serum samples from these patients were stored at −20 °C and matched for sex and age with those of 157 patients with diffuse glioma. Another 157 healthy individuals older than 35 years were seen for medical checkup. Serum samples from these patients were collected at the Yokohama Minoru Clinic in 2014 (noncancer sample 2). 
These serum samples were stored at −80 °C and matched by sex and age with those of the 157 patients with diffuse glioma. The use of control samples collected and stored under different conditions was intended to minimize the effects of differences in collection and storage conditions.

MiRNA Extraction and Expression Analysis Total RNA was extracted from 300-μL serum samples using RNA extraction reagent (3D-Gene System; Toray Industries, Inc) and concentrated. Fluorescent labeling of RNA was performed using a miRNA labeling kit (3D-Gene System). RNA was hybridized to a human miRNA oligo chip (3D-Gene System) designed to detect 2565 miRNA sequences in miRBase, release 21 (http://www.mirbase.org/), and the chip was scanned (3D-Gene System). MicroRNAs with signals higher than the background signal were selected in advance (positive call), and background signals were subtracted from each positive-call miRNA signal. Only preprocessed positive-call miRNAs were used for subsequent analyses. To normalize signals across microarrays, miRNA signals were divided by the mean signals of internal control miRNAs (miR-149-3p, miR-2861, and miR-4463) that were stably detected in more than 500 serum samples.8 To identify robust miRNAs, miRNAs with normalized signal values of greater than 64 intensity units in more than 50% of samples in each group were selected. Microarray data were obtained in accordance with the Minimum Information About a Microarray Experiment (MIAME) guidelines. Data sets were submitted to the National Center for Biotechnology Information Gene Expression Omnibus database under accession number GSE 139031.

Statistical Analysis Data were analyzed from April 1, 2018, to March 31, 2019. To develop models for discrimination between patients with diffuse gliomas and control patients without cancer, samples from the patient and control groups were randomly divided (100:57) into training set 1 and validation set 1. Training set 1 was used to construct discrimination models, and validation set 1 was used to validate the discrimination models. Other brain tumors (ependymomas, pilocytic astrocytomas, gangliogliomas, PCNSL, metastatic brain tumor, meningiomas, and schwannomas), spinal tumors, and trauma and infarction cases were allocated to the exploratory set to investigate the ability of the model to identify gliomas among these cases (Figure 1A). Two-group discrimination models were constructed using Fisher linear discriminant analysis and leave-1-out cross-validation in training set 1 (a flowchart is shown in eMethods 2 in the Supplement). Cutoff values for discrimination models were set at 0 based on the Youden index. The best discrimination model (ie, the model showing maximum accuracy using the minimum number of miRNAs) was selected in training set 1. In validation set 1, diagnostic sensitivity, specificity, accuracy, and area under the receiver operating characteristics curve (AUC) were calculated. To develop a 3-group discrimination model to distinguish among GBM, PCNSL, and metastatic brain tumor, samples were randomly divided (4:1) into training set 2 and validation set 2. Three-group discrimination models were constructed by combining the 2-group discrimination models. First, 3 types of 2-group discrimination models were constructed as follows: (1) GBM vs others (PCNSL and metastatic brain tumor); (2) PCNSL vs others (GBM and metastatic brain tumor); and (3) metastatic brain tumor vs others (GBM and PCNSL). 
In this case, the discrimination models were constructed using logistic least absolute shrinkage and selection operator (LASSO) regression analysis with 10-fold cross-validation. The random division was repeated 50 times, and fifty 2-group discrimination models were constructed. Next, fifty 3-group discrimination models were constructed by combining models 1 and 2, 2 and 3, or 1 and 3. Finally, 1 representative 3-group discrimination model was selected, and the diagnostic accuracy was tested in validation set 2 (a flowchart is shown in eMethods 3 in the Supplement). Fisher linear discriminant analysis and logistic LASSO regression analysis were performed using the following packages in R, version 3.1.2 (R Project for Statistical Computing): compute.es, version 0.2-4; glmnet, version 2.0-3; hash, version 2.2.6; MASS, version 7.3-45; mutoss, version 0.1-10; and pROC, version 1.8. Unsupervised clustering and heatmap generation were performed using Pearson correlation in the Ward method for linkage analysis, and principal component analysis was performed using Genomics Suite, version 6.6 (Partek). Differences in characteristics between 2 or 3 groups were evaluated using the unpaired t test (continuous variables) and the Pearson χ2 test (categorical variables) in SPSS, version 22 (IBM Japan). The limit of statistical significance for all analyses was defined as a 2-sided P < .05.
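For readers appraising how a cutoff chosen "based on the Youden index" works, the following is a minimal sketch in Python (the study itself used R). It scans candidate cutoffs on a discriminant score and keeps the one that maximizes J = sensitivity + specificity − 1. The function names and the toy scores are hypothetical illustrations, not study data, and treating a score at or above the cutoff as positive is an assumption.

```python
from typing import List, Tuple


def sensitivity_specificity(scores: List[float], labels: List[int],
                            cutoff: float) -> Tuple[float, float]:
    """Return (sensitivity, specificity), calling score >= cutoff positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def youden_cutoff(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (cutoff, J) for the candidate cutoff with the largest Youden J."""
    best_cutoff, best_j = scores[0], -1.0
    for candidate in sorted(set(scores)):
        sens, spec = sensitivity_specificity(scores, labels, candidate)
        j = sens + spec - 1.0
        if j > best_j:  # keep the first cutoff that reaches the maximum
            best_cutoff, best_j = candidate, j
    return best_cutoff, best_j


# Made-up discriminant scores (NOT study data): 1 = case, 0 = control.
toy_scores = [-3.1, -2.0, -0.4, 0.3, 1.2, 2.5, -1.5, 0.9]
toy_labels = [0, 0, 0, 1, 1, 1, 0, 1]
cutoff, j = youden_cutoff(toy_scores, toy_labels)
print(cutoff, j)
```

A cutoff reported as 0, as in this study, is consistent with the constant term of the discriminant function absorbing the chosen threshold, although the excerpt does not spell out that step.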

Results

Characteristics of Patients With CNS Diseases and Controls The characteristics of the 580 participants (309 [53.3%] male; 271 [46.7%] female; median age, 57 years [range, 10-87 years]) in training set 1, validation set 1, and the exploratory set are shown in Table 1. Training set 1 (n = 300) consisted of 100 patients with diffuse glioma (median age, 56 years [range, 14-87 years]; 55 male [55.0%] and 45 female [45.0%]) and 200 control patients without cancer (median age, 56 years [range, 14-87 years]; 105 male [52.5%] and 95 female [47.5%]). Validation set 1 (n = 171) consisted of 57 patients with diffuse glioma (median age, 54 years [range, 17-84 years]; 34 male [59.6%] and 23 female [40.4%]) and 114 control patients (median age, 56 years [range, 21-85 years]; 58 male [50.9%] and 56 female [49.1%]). The exploratory set (n = 109) consisted of patients with CNS diseases other than diffuse glioma (Table 1). No statistically significant differences were found in age and sex between patients with diffuse glioma and control patients in training set 1 or validation set 1 (eTable 1 in the Supplement).

Figure 1. Flowchart of the Development of the Glioma Index and the 3-Tumor Index

Panel A, Glioma Index. Sample sources: National Cancer Center Biobank (n = 423), comprising 266 recruited patients with CNS diseases and 157 patients with benign diseases (noncancer 1); general health checkup in a clinic (n = 157; noncancer 2). Together these gave 266 patients with CNS diseases and 314 noncancer controls. Training set 1: 100 diffuse glioma and 200 noncancer controls. Validation set 1: 57 diffuse glioma and 114 noncancer controls. Exploratory set: 109 CNS diseases other than diffuse glioma.

Panel B, 3-Tumor Index. 85 glioblastoma, 42 PCNSL, and 28 metastatic tumor samples underwent 4:1 repeated random division (50 times). Training set 2: 68 glioblastoma, 34 PCNSL, and 23 metastatic tumor. Validation set 2: 17 glioblastoma, 8 PCNSL, and 5 metastatic tumor.

A, For the Glioma Index, samples were divided into training set 1, validation set 1, and the exploratory set. Noncancer control samples were collected from the National Cancer Center Biobank (noncancer 1 controls) and the general population undergoing routine health checkup at a clinic in Yokohama, Japan (noncancer 2 controls). B, For the 3-Tumor Index, samples were divided randomly into 2 groups (4:1, training set 2 and validation set 2) to develop models for discriminating between glioblastoma (GBM), primary central nervous system (CNS) lymphoma (PCNSL), and metastatic brain tumors.

Table 1. Patient Characteristics (a)

Characteristic | Training Set 1 (n = 300) | Validation Set 1 (n = 171) | Exploratory Set (n = 109) | P Value

Patients With Diffuse Glioma
Total | 100 (33.3) | 57 (33.3) | NA | NA
Age, median (range), y | 56 (14-87) | 54 (17-84) | NA | .46
Sex
  Male | 55 (55.0) | 34 (59.6) | NA | .62
  Female | 45 (45.0) | 23 (40.4) | NA |
Histologic finding
  Diffuse astrocytoma | 14 (14.0) | 11 (19.3) | NA | NA
    IDH-Mut | 8 (8.0) | 8 (14.0) | NA |
    IDH-WT | 5 (5.0) | 0 | NA | .06
    NOS | 1 (1.0) | 3 (5.3) | NA |
  Oligodendroglioma | 3 (3.0) | 3 (5.3) | NA |
    IDH-Mut, 1p/19q codeletion | 3 (3.0) | 2 (3.5) | NA | .27
    NOS | 0 | 1 (1.8) | NA |
  Anaplastic astrocytoma | 21 (21.0) | 10 (17.5) | NA |
    IDH-Mut | 10 (10.0) | 2 (3.5) | NA |
    IDH-WT | 4 (4.0) | 5 (8.8) | NA | .17
    NOS | 7 (7.0) | 3 (5.3) | NA |
  Anaplastic oligodendroglioma | 6 (6.0) | 4 (7.0) | NA | NA
    IDH-Mut, 1p/19q codeletion | 6 (6.0) | 4 (7.0) | NA | NA
  Glioblastoma | 56 (56.0) | 29 (50.9) | NA | NA
    IDH-Mut | 3 (3.0) | 1 (1.8) | NA |
    IDH-WT | 43 (43.0) | 25 (43.9) | NA | .59
    NOS | 10 (10.0) | 3 (5.3) | NA |

Patients With CNS Disease Other Than Diffuse Glioma
Total | NA | NA | 109 (100) | NA
Age, median (range), y | NA | NA | 65 (10-85) | NA
Sex
  Male | NA | NA | 57 (52.3) | NA
  Female | NA | NA | 52 (47.7) | NA
Histologic finding
  Glial tumors other than diffuse glioma (ependymoma, pilocytic astrocytoma, etc) | NA | NA | 13 (11.9) | NA
  Primary CNS lymphoma | NA | NA | 42 (38.5) | NA
  Metastatic brain tumor | NA | NA | 28 (25.7) | NA
  Benign brain tumor (meningioma, schwannoma, etc) | NA | NA | 22 (20.2) | NA
  Spinal schwannomas | NA | NA | 2 (1.8) | NA
  Trauma | NA | NA | 1 (0.9) | NA
  Infarction | NA | NA | 1 (0.9) | NA

Controls Without Cancer
Total | 200 (100) | 114 (100) | NA | NA
Age, median (range), y | 56 (14-85) | 56 (21-85) | NA | .94
Sex
  Male | 105 (52.5) | 58 (50.9) | NA | .78
  Female | 95 (47.5) | 56 (49.1) | NA |
Sample collection
  National Cancer Center Biobank | 100 (50.0) | 57 (50.0) | NA | NA
  General health checkup in a clinic | 100 (50.0) | 57 (50.0) | NA |

Abbreviations: CNS, central nervous system; IDH-Mut, IDH1/2 mutation; IDH-WT, IDH1/2 wild-type; NA, not applicable; NOS, not otherwise specified.
(a) Unless otherwise specified, data are presented as number (percentage).

Development of the Glioma Index Among 2565 miRNAs, 365 had normalized signal values greater than 64 intensity units in more than 50% of samples in each group. Six miRNAs were removed based on miRBase, release 22; ultimately, 359 miRNAs were analyzed in this study. Table 2 shows the best combination models of 1 to 5 miRNAs in training set 1. For the Glioma Index, we selected a model based on 3 miRNAs (miR-4763-3p, miR-1915-3p, and miR-3679-5p) that achieved high accuracy (0.97; 95% CI, 0.95-0.99); the AUC indicated no significant improvement in models with more miRNAs (P = .15 for the 4-miRNA vs 3-miRNA model; Table 2). The Glioma Index had a sensitivity of 0.96 (95% CI, 0.92-1.00), a specificity of 0.97 (95% CI, 0.95-0.97), and an AUC of 0.99 (95% CI, 0.99-1.00) (Figure 2A). The Glioma Index was calculated as follows: (2.09406 × miR-4763-3p) + (1.35369 × miR-1915-3p) + (−0.378659 × miR-3679-5p) − 32.11268. The AUC of a single miRNA to distinguish diffuse glioma from noncancer controls ranged from 0.63 to 0.92 (eFigure 1 in the Supplement). The diagnostic performance of the Glioma Index was tested in validation set 1, yielding a sensitivity of 0.95 (95% CI, 0.89-1.00), specificity of 0.97 (95% CI, 0.93-1.00), and AUC of 0.99 (95% CI, 0.99-1.00) (Figure 2B). The unsupervised hierarchical clustering analysis (depicted by a heat map in Figure 2C) and principal component analysis (eFigure 2A in the Supplement) demonstrated that the Glioma Index effectively differentiated diffuse glioma from noncancer controls. The discrimination performance of the Glioma Index did not vary among diffuse astrocytoma, oligodendroglioma, anaplastic astrocytoma, anaplastic oligodendroglioma, and GBM (eFigure 2B in the Supplement). Next, we tested the diagnostic performance of the Glioma Index using the exploratory set. 
As shown in Figure 2D, this model classified 13 of 13 glial tumors other than diffuse glioma (100%), 39 of 42 PCNSL samples (92.9%), 25 of 28 metastatic brain tumor samples (89.3%), and 20 of 22 benign brain tumors (90.9%) as positive. Nonneoplastic cases, including trauma and infarction, were also classified as positive. Two spinal schwannoma cases were classified as negative.
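The Glioma Index formula and its cutoff of 0 can be written out as a short Python sketch. The coefficients and intercept are copied from the text; everything else is an assumption made for illustration: the function names are hypothetical, the inputs are taken to be on the same normalized signal scale the authors used (which this excerpt does not fully specify), and a score at or above 0 is treated as a positive result.

```python
# Coefficients and intercept as reported for the 3-miRNA Glioma Index.
GLIOMA_INDEX_COEFFICIENTS = {
    "miR-4763-3p": 2.09406,
    "miR-1915-3p": 1.35369,
    "miR-3679-5p": -0.378659,
}
GLIOMA_INDEX_INTERCEPT = -32.11268
CUTOFF = 0.0  # reported cutoff for the discrimination models


def glioma_index(signals: dict) -> float:
    """Weighted sum of the 3 normalized miRNA signals plus the intercept."""
    weighted = sum(coef * signals[name]
                   for name, coef in GLIOMA_INDEX_COEFFICIENTS.items())
    return weighted + GLIOMA_INDEX_INTERCEPT


def is_positive(signals: dict, cutoff: float = CUTOFF) -> bool:
    """Assumption: scores at or above the cutoff are called positive."""
    return glioma_index(signals) >= cutoff


# Hypothetical signal values chosen only to show both outcomes (NOT study data).
sample_a = {"miR-4763-3p": 10.0, "miR-1915-3p": 10.0, "miR-3679-5p": 8.0}
sample_b = {"miR-4763-3p": 11.0, "miR-1915-3p": 10.0, "miR-3679-5p": 8.0}
print(glioma_index(sample_a), is_positive(sample_a))
print(glioma_index(sample_b), is_positive(sample_b))
```

The signs of the coefficients match the Discussion: higher miR-4763-3p and miR-1915-3p push the score up, whereas higher miR-3679-5p pulls it down.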

Table 2. Best Combination Models of MicroRNAs to Detect Diffuse Glioma in Training Set 1

No. of miRNAs | Sensitivity (95% CI) | Specificity (95% CI) | Accuracy (95% CI) | AUC (95% CI) | P Value
1 | 0.93 (0.88-0.98) | 0.83 (0.77-0.88) | 0.86 (0.82-0.90) | 0.95 (0.93-0.97) | NA
2 | 0.90 (0.84-0.96) | 0.97 (0.95-0.99) | 0.95 (0.92-0.97) | 0.97 (0.96-0.99) | .006 (vs 1-miRNA model)
3 | 0.96 (0.92-1.00) | 0.97 (0.95-0.97) | 0.97 (0.95-0.99) | 0.99 (0.99-1.00) | .01 (vs 2-miRNA model)
4 | 0.96 (0.92-1.00) | 0.98 (0.96-1.00) | 0.97 (0.95-0.99) | 1.00 (0.99-1.00) | .15 (vs 3-miRNA model)
5 | 0.96 (0.92-1.00) | 1.00 (1.00-1.00) | 0.99 (0.97-1.00) | 0.99 (0.99-1.00) | .83 (vs 4-miRNA model)

Model candidates:
1 miRNA: (1.42387 × miR-1225-3p) − 10.2225
2 miRNAs: (1.21875 × miR-1225-3p) + (1.62781 × miR-1227-5p) − 25.2598
3 miRNAs: (2.09406 × miR-4763-3p) + (1.35369 × miR-1915-3p) + (−0.378659 × miR-3679-5p) − 32.11268
4 miRNAs: (1.78027 × miR-4763-3p) + (1.24861 × miR-1915-3p) + (−0.352535 × miR-3679-5p) + (0.295729 × miR-6729-3p) − 30.12238
5 miRNAs: (2.03114 × miR-4763-3p) + (1.32272 × miR-1915-3p) + (−0.427725 × miR-3679-5p) + (0.494549 × miR-4741) + (−0.126557 × miR-204-3p) − 34.1707

Abbreviations: AUC, area under the receiver operating characteristic curve; miRNAs, microRNAs; NA, not applicable.

Figure 2. Development and Validation of the Glioma Index

Panels: A, ROC analysis of training set 1 (sensitivity vs 1 − specificity); B, ROC analysis of validation set 1 (sensitivity vs 1 − specificity); C, Heatmap of z scores for miR-4763-3p, miR-1915-3p, and miR-3679-5p in noncancer and glioma samples of validation set 1; D, Dot plot of the Glioma Index score by diagnostic group in the exploratory set.

A, In training set 1 of the Glioma Index, sensitivity was 0.96 (95% CI, 0.92-1.00); specificity, 0.97 (95% CI, 0.95-0.97); and area under the receiver operating characteristic (ROC) curve, 0.99 (95% CI, 0.99-1.00). B, In validation set 1 of the Glioma Index, sensitivity was 0.95 (95% CI, 0.89-1.00); specificity, 0.97 (95% CI, 0.93-1.00); and area under the ROC curve, 0.99 (95% CI, 0.99-1.00). C, The diffuse glioma and noncancer control samples in validation set 1 were plotted using unsupervised hierarchical clustering analysis with a heat map for the Glioma Index. z Scores indicate how many SDs a point is away from the mean of its data set. D, Dot plot of the Glioma Index in the exploratory set. The sensitivity for glial tumors other than diffuse glioma (n = 13) was 1.00; primary central nervous system lymphoma (PCNSL) (n = 42), 0.93; metastatic brain tumor (n = 28), 0.89; benign brain tumor (n = 22), 0.91; nonneoplastic disease (trauma and infarction) (n = 2), 1.00; and spinal schwannoma (n = 2), 0. The horizontal bar indicates the mean.

Development of the 3-Tumor Index Next, we examined the ability of serum miRNA profiles to distinguish histologic features of tumors. We focused on GBM, PCNSL, and metastatic brain tumor because, although these tumors appear as contrast-enhanced lesions on computed tomography and magnetic resonance imaging, these modalities cannot always discriminate among the 3 types of tumors. The study cohort consisted of 85 patients with GBM, 42 with PCNSL, and 28 with metastatic brain tumor, with no differences in age and sex between patients with these 3 tumors (eTable 2 in the Supplement). To develop a 3-group discrimination model, we randomly divided GBM, PCNSL, and metastatic brain tumor cases into training set 2, consisting of 68 patients with GBM, 34 with PCNSL,
and 23 with metastatic brain tumor; and validation set 2, consisting of 17 with GBM, 8 with PCNSL, and 5 with metastatic brain tumor (Figure 1B). Fifty 2-group discrimination models using 50 patterns of random division were initially generated in GBM vs others (PCNSL and metastatic brain tumor), PCNSL vs others (GBM and metastatic brain tumor), and metastatic brain tumor vs others (GBM and PCNSL). In the GBM vs others discrimination model, the mean number of miRNAs was 22.6 (95% CI, 20.2-25.0), and the mean AUC was 0.91 (95% CI, 0.89-0.92) (eFigure 3A in the Supplement). Similarly, in discrimination models of PCNSL vs others, the mean number of miRNAs was 16.0 (95% CI, 13.1-19.0) and the mean AUC was 0.86 (95% CI, 0.84-0.88); for metastatic brain tumor vs others, the mean number of miRNAs was 21.1 (95% CI, 18.2-24.0) and the mean AUC was 0.97 (95% CI, 0.96-0.98) (eFigure 3B and C in the Supplement). To construct the 3-group discrimination model, 2 of the 2-group discrimination models were combined, and their accuracy was compared. In the combination of GBM vs others and PCNSL vs others, the mean accuracy was 0.76 (95% CI, 0.64-0.89) (eFigure 3D in the Supplement). In the combination of PCNSL vs others and metastatic brain tumor vs others, the mean accuracy was 0.80 (95% CI, 0.69-0.95) (eFigure 3E in the Supplement). In the combination of GBM vs others and metastatic brain tumor vs others, the mean accuracy was 0.79 (95% CI, 0.70-0.91) (eFigure 3F in the Supplement). Finally, we selected a representative model (3-Tumor Index) to discriminate GBM, PCNSL, and metastatic brain tumor in training set 2 using 48 miRNAs (eTable 3 in the Supplement). This 3-Tumor Index had an accuracy of 0.80 in training set 2 (Table 3) and classified 16 of 17 GBM samples (94.1%) and 4 of 5 metastatic brain tumor samples (80.0%) as positive in validation set 2. However, 4 of 8 PCNSL samples (50.0%) were misdiagnosed as GBM (Table 3).
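The overall accuracy and the per-tumor sensitivities quoted above follow directly from the counts reported in Table 3. The Python sketch below recomputes them as a check a reader can repeat during appraisal. The counts are copied from Table 3; the dictionary layout and function names are hypothetical choices made for this illustration.

```python
CLASSES = ["GBM", "PCNSL", "Metastatic"]

# Counts from Table 3. Outer key = test result, inner key = true diagnosis.
TRAINING_SET_2 = {
    "GBM": {"GBM": 62, "PCNSL": 17, "Metastatic": 1},
    "PCNSL": {"GBM": 6, "PCNSL": 17, "Metastatic": 0},
    "Metastatic": {"GBM": 0, "PCNSL": 0, "Metastatic": 21},
    "Not determined": {"GBM": 0, "PCNSL": 0, "Metastatic": 1},
}
VALIDATION_SET_2 = {
    "GBM": {"GBM": 16, "PCNSL": 4, "Metastatic": 1},
    "PCNSL": {"GBM": 1, "PCNSL": 4, "Metastatic": 0},
    "Metastatic": {"GBM": 0, "PCNSL": 0, "Metastatic": 4},
}


def accuracy(table: dict) -> float:
    """Share of all samples whose test result equals the true diagnosis."""
    correct = sum(table[c][c] for c in CLASSES)
    total = sum(sum(row.values()) for row in table.values())
    return correct / total


def sensitivity_by_class(table: dict) -> dict:
    """For each tumor type, the share of true cases given the matching label."""
    result = {}
    for c in CLASSES:
        n_true = sum(row[c] for row in table.values())
        result[c] = table[c][c] / n_true
    return result


print(accuracy(TRAINING_SET_2), accuracy(VALIDATION_SET_2))
print(sensitivity_by_class(VALIDATION_SET_2))
```

In validation set 2 this gives 24 of 30 correct overall, and 4 of 8 for PCNSL. With only 8 PCNSL and 5 metastatic samples, each per-class estimate rests on very few cases.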

Table 3. Cross Tabulation of the 3-Tumor Index in Training and Validation Set 2

True Diagnosis, No./Total No. (%)
Test Result | GBM | PCNSL | Metastatic Brain Tumor

Training set
GBM | 62/68 (91.2) | 17/34 (50.0) | 1/23 (4.3)
PCNSL | 6/68 (8.8) | 17/34 (50.0) | 0/23
Metastatic brain tumor | 0/68 | 0/34 | 21/23 (91.3)
Not determined | 0/68 | 0/34 | 1/23 (4.3)

Validation set
GBM | 16/17 (94.1) | 4/8 (50.0) | 1/5 (20.0)
PCNSL | 1/17 (5.9) | 4/8 (50.0) | 0/5
Metastatic brain tumor | 0/17 | 0/8 | 4/5 (80.0)

Abbreviations: GBM, glioblastoma; PCNSL, primary central nervous system lymphoma.

Discussion

In this case-control diagnostic study, we comprehensively analyzed 580 serum samples and 2565 miRNAs, including samples from 157 patients with diffuse glioma, using a highly sensitive method of miRNA microarray analysis. We then developed the Glioma Index, which discriminates diffuse glioma from noncancer controls, and the 3-Tumor Index, which discriminates among GBM, PCNSL, and metastatic brain tumor. Previous studies9-13 reported different miRNA signatures that distinguish patients with glioma from healthy individuals with varying diagnostic performance. Zhi et al14 evaluated 739 miRNAs in serum samples from 90 patients with astrocytoma and identified 9 miRNAs (miR-15-5p, miR-16-5p, miR-19a-3p, miR-19b-3p, miR-20a-5p, miR-106a-5p, miR-130a-3p, miR-181b-5p, and miR-208a-3p) as potential biomarkers capable of distinguishing astrocytomas from controls (AUC, 0.9722). Zhou et al15 reviewed 28 articles investigating miRNA-based glioma diagnosis and reported overall sensitivity of 85%, specificity of 90%, and AUC of 0.93 for these models. Although these studies
suggest that miRNAs could be useful for distinguishing patients with glioma from healthy individuals, a complete consensus has not been reached to date. This inconsistency may be attributed to heterogeneity in study design (patients with gliomas or astrocytomas vs healthy controls), sample type (ie, serum vs plasma, whole blood vs exosomes), miRNA coverage, and analytical techniques. The present method using the microarray excluded the 9 miRNAs (miR-15-5p, miR-16-5p, miR-19a-3p, miR-19b-3p, miR-20a-5p, miR-106a-5p, miR-130a-3p, miR-181b-5p, and miR-208a-3p) detected by polymerase chain reaction by Zhi et al14 because of low expression levels. This discrepancy may be attributed to the different method of detecting miRNAs. The Glioma Index developed in our study was based on 3 miRNAs (miR-4763-3p, miR-1915-3p, and miR-3679-5p) and had a sensitivity of 0.95 (95% CI, 0.89-1.00), specificity of 0.97 (95% CI, 0.93-1.00), and AUC of 0.99 (95% CI, 0.99-1.00) in validation set 1. We believe these statistics represent excellent diagnostic power and provide definitive and reliable discrimination between diffuse gliomas and noncancer controls. We also investigated whether the Glioma Index could discriminate diffuse gliomas from other intracranial diseases in the exploratory set; however, we found it difficult to discriminate between diffuse gliomas and other brain tumors. We speculate that the differences between diffuse gliomas and noncancer controls detected by the Glioma Index were not the same as those between diffuse gliomas and other brain tumors. However, from a clinical point of view, the ability of the Glioma Index to distinguish brain tumors from noncancer controls suggests that it may be useful as a screening tool for detecting intracranial diseases during medical visits. 
There are currently no population-based screening tests for brain tumors, and the Glioma Index appears to be a promising candidate screening tool that could be used before referring patients for computed tomographic or magnetic resonance imaging examination. Furthermore, the Glioma Index showed high sensitivity not only in patients with GBM but also in those with lower-grade glioma (100% for grade 2 diffuse astrocytoma and 100% for grade 2 oligodendroglioma) (eFigure 2B in the Supplement), suggesting the possibility of detecting diffuse glioma at an early stage. In the context of the clinical treatment of diffuse gliomas, early detection would enable tumor resection before the tumor has undergone malignant transformation or infiltrated adjacent brain tissues. Therefore, 2-step screening that combines a blood-based screening test with neuroradiological examination could potentially improve patient outcomes. The Glioma Index included 3 miRNAs: miR-4763-3p, miR-1915-3p, and miR-3679-5p. The serum levels of miR-4763-3p and miR-1915-3p were higher in patients with diffuse glioma than in controls, suggesting that these miRNAs may be oncogenic. By contrast, the level of miR-3679-5p was lower in patients with diffuse glioma than in controls, suggesting that this miRNA may function as a tumor suppressor. Although the roles of these miRNAs in glioma remain unclear, their cancer-related functions were suggested in previous studies.16-18 Expression of miR-4763-3p is induced by oxidative stress in hepatocellular carcinoma cells, and its putative target genes are associated with cell growth arrest and apoptosis.16 Guo et al17 showed that miR-1915-3p is upregulated in breast cancer and promotes the migration and proliferation of breast cancer cells. 
miR-3679-5p is significantly downregulated in malignant pancreatic tumors relative to healthy controls or benign pancreatic tumors, although its function remains unknown, to our knowledge.18 We also showed that serum miRNA levels may help distinguish among GBM (sensitivity, 0.94), PCNSL (sensitivity, 0.50), and metastatic brain tumor (sensitivity, 0.80). The Glioma Index consisted of 3 miRNAs and was highly sensitive and specific, whereas the 3-Tumor Index consisted of 48 miRNAs and had low sensitivity for PCNSL. This finding suggests that distinguishing between different histologic features of intracranial tumors is more complex than differentiating brain tumors from noncancer controls. Clinically, discriminating among these 3 tumor types using serum miRNAs is important because the treatment strategies for each type are different. Developing blood-based diagnostic methods that can distinguish among GBM, PCNSL, and metastatic brain tumor without a biopsy may enable timely initiation of treatment and avoid neurosurgical risks. Although the discriminatory performance for PCNSL was not satisfactory, the
fact that miRNAs could identify GBM and metastatic brain tumor is encouraging. Thus, miRNA-based diagnosis could complement neuroradiology to support clinical decision-making.

Limitations

This study had several limitations. First, it was performed using retrospectively collected samples. Consequently, sample handling conditions before microarray analysis, such as the interval between centrifugation and storage and the storage temperature, were not strictly regulated and sometimes differed between samples. Although miRNAs are more stable than messenger RNA, various processes can affect their levels in serum.19,20 Second, we lacked an external validation cohort to test the performance of the Glioma Index and the 3-Tumor Index. We have initiated a prospective, multicenter study by collecting serum samples using standard operating procedures and will assess the generalizability of our data prospectively using this independent cohort. Third, this study was a case-control study, which does not allow calculation of positive and negative predictive values. However, because of the rarity of brain tumors, conducting prospective cohort studies is difficult in terms of cost and time. We believe the case-control study design is adequate for collecting large volumes of brain tumor samples in a short time and for exploring the proof-of-concept. Fourth, the noncancer control samples were not the optimal control population, despite being sex and age matched with the diffuse glioma group. Specimens in noncancer sample 1 were collected from patients with benign diseases and stored at −20 °C in the National Cancer Center Biobank. Specimens in noncancer sample 2 were collected from healthy individuals undergoing medical checkup and stored at −80 °C at a different institution. The use of these samples as noncancer controls may have affected the results because of the differences in patient background or storage conditions, potentially leading to the overestimation of specificity. To minimize these biases, we used samples from 2 cohorts with different backgrounds and storage conditions. Fifth, the present cohort was limited to Japanese patients. Because genetic background can affect miRNA expression, further studies are needed to confirm the performance of the present indexes in different racial/ethnic cohorts.
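
The third limitation can be made concrete with a short calculation. Sensitivity and specificity are estimated within cases and within controls, so they survive a case-control design, but predictive values also depend on disease prevalence, which the investigators fix by choosing how many cases and controls to sample. The sketch below, in Python, applies Bayes' theorem to show how the same test yields very different predictive values at different prevalences. The function name `predictive_values` and all numeric inputs are hypothetical, chosen only for illustration; they are not results from this study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence.

    All arguments are proportions between 0 and 1.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)

    ppv = true_pos / (true_pos + false_pos)  # P(disease | positive test)
    npv = true_neg / (true_neg + false_neg)  # P(no disease | negative test)
    return ppv, npv


if __name__ == "__main__":
    # Hypothetical test characteristics (NOT taken from the study).
    sens, spec = 0.95, 0.95

    # 0.5 mimics a 1:1 case-control sample; the smaller values are
    # illustrative stand-ins for a rare disease in a screened population.
    for prevalence in (0.5, 0.01, 0.0001):
        ppv, npv = predictive_values(sens, spec, prevalence)
        print(f"prevalence={prevalence:<7} PPV={ppv:.4f} NPV={npv:.4f}")
```

With these illustrative inputs the positive predictive value falls sharply as prevalence falls while the negative predictive value rises, which is why predictive values cannot be read directly from a study whose case-to-control ratio was set by design.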

Conclusions

In summary, we identified promising serum miRNA combinations for detecting diffuse glioma with high sensitivity and specificity and for distinguishing histologic features of brain tumors. The present data suggest that evaluation of circulating miRNAs is suitable for primary screening of brain tumors. The development of blood-based diagnostic strategies for the detection of brain tumors could contribute to improved patient outcomes. Further studies are needed to confirm the present observations.

ARTICLE INFORMATION

Accepted for Publication: October 9, 2019.

Published: December 6, 2019. doi:10.1001/jamanetworkopen.2019.16953

Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2019 Ohno M et al. JAMA Network Open.

Corresponding Author: Takahiro Ochiya, PhD, Division of Molecular and Cellular Medicine, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan ([email protected]).

Author Affiliations: Department of Neurosurgery and Neuro-oncology, National Cancer Center Hospital, Tokyo, Japan (Ohno, Matsushita, Takahashi, Miyakita, Narita); Division of Molecular and Cellular Medicine, National Cancer Center Research Institute, Tokyo, Japan (Matsuzaki, Kawauchi, Takizawa, Ochiya); Toray Industries, Inc, Tebiro, Kamakura, Japan (Kawauchi, Takizawa); Dynacom Co, Ltd, Chiba, Japan (Aoki, Miura); Department of Gastrointestinal Medical Oncology, National Cancer Center Hospital, Tokyo, Japan (Kato); Department of Biobank and Tissue Resources, National Cancer Center Research Institute, Tokyo, Japan (Sakamoto); Division of Brain Tumor Translational Research, National Cancer Center Research Institute, Tokyo, Japan (Ichimura); Department of Molecular and Cellular Medicine, Tokyo Medical University, Tokyo, Japan (Ochiya).

JAMA Network Open. 2019;2(12):e1916953. doi:10.1001/jamanetworkopen.2019.16953 (Reprinted) December 6, 2019 10/12

Author Contributions: Drs Ohno and Matsuzaki contributed equally to this work. Dr Ochiya had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Concept and design: Ohno, Matsuzaki, Kawauchi, Sakamoto, Matsushita, Narita, Ochiya.

Acquisition, analysis, or interpretation of data: Ohno, Matsuzaki, Kawauchi, Aoki, Miura, Takizawa, Kato, Sakamoto, Matsushita, Takahashi, Miyakita, Ichimura, Narita.

Drafting of the manuscript: Ohno, Matsuzaki, Aoki, Miura, Kato, Sakamoto, Matsushita, Narita, Ochiya.

Critical revision of the manuscript for important intellectual content: Matsuzaki, Kawauchi, Takizawa, Takahashi, Miyakita, Ichimura, Narita, Ochiya.

Statistical analysis: Matsuzaki, Kawauchi, Aoki, Miura.

Obtained funding: Ochiya.

Administrative, technical, or material support: Kawauchi, Takizawa, Kato, Sakamoto, Miyakita, Ichimura, Narita.

Supervision: Kato, Narita, Ochiya.

Conflict of Interest Disclosures: Dr Kawauchi reported receiving personal fees from Toray Industries, Inc, during the conduct of the study. Mr Aoki reported receiving personal fees from Dynacom Co, Ltd, outside the submitted work and being an employee of Dynacom Co, Ltd, provider of the statistical script used to select the best microRNA combination. Mr Miura reported receiving personal fees from Dynacom Co, Ltd, outside the submitted work and being an employee of Dynacom Co, Ltd, provider of the statistical script used to select the best microRNA combination. Dr Takizawa reported personal fees from Toray Industries, Inc, during the conduct of the study; personal fees from Toray Industries, Inc, outside the submitted work; and having a patent to Toray Industries, Inc, pending. Dr Kato reported receiving grants from the Japan Agency for Medical Research and Development during the conduct of the study; receiving grants from Ono Pharmaceutical Co, Ltd, Merck Sharp & Dohme, and Shionogi & Company, Ltd, outside the submitted work; and receiving personal fees from Taiho Pharmaceutical Co, Ltd, and Eli Lilly & Co outside the submitted work. Dr Ichimura reported receiving grants from Chugai Pharmaceutical Co, Ltd, Eisai Co, Ltd, Daiichi-Sankyo Company, Ltd, and EPS Corporation, unrelated to the submitted work. Dr Narita reported receiving personal fees from Chugai Pharmaceutical Co, Ltd, and grants and personal fees from Ono Pharmaceutical Co, Ltd, AbbVie, Inc, Sumitomo Dainippon Pharma Co, Ltd, Daiichi-Sankyo Company, Ltd, Eisai Co, Ltd, Stella Pharma Corp, Ohtuka Pharmaceutical Co, Ltd, Meiji Seika Kaisha, Ltd, and SBI Pharmaceuticals Co, Ltd, outside the submitted work. Dr Ochiya reported receiving grants from Japan Agency for Medical Research and Development, Kewpie Corporation, and Ono Pharmaceutical Co, Ltd, during the conduct of the study. No other disclosures were reported.

Funding/Support: This study was supported by grant 15ae0101013h0001 from the Japan Agency for Medical Research and Development and the New Energy and Industrial Technology Development Organization (Dr Ochiya). The National Cancer Center Biobank is supported by fund 29-A-1 from the National Cancer Center Research and Development.

Role of the Funder/Sponsor: The sponsors had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Additional Contributions: Tomomi Fukuda, BS, Takumi Sonoda, BS, Hiroko Tadokoro, PhD, Tatsuya Suzuki, BS, Satoshi Kondou, BS, and Makiko Ichikawa, BS (Division of Molecular and Cellular Medicine, National Cancer Center Research Institute), and Kamakura Techno-Science, Inc, performed the microarray assays. Noriko Abe (Department of Pathology and Clinical Laboratories, National Cancer Center Hospital), Michiko Ohori (National Cancer Center Hospital), and Takumi Sonoda, BS (Division of Molecular and Cellular Medicine, National Cancer Center Research Institute), collected samples from the freezing room. No compensation was received.

REFERENCES

1. Ma C, Nguyen HPT, Luwor RB, et al. A comprehensive meta-analysis of circulation miRNAs in glioma as potential diagnostic biomarker. PLoS One. 2018;13(2):e0189452. doi:10.1371/journal.pone.0189452
2. Louis DN, Ohgaki H, Wiestler OD, eds. WHO Classification of Tumours of the Central Nervous System. Revised 4th ed. Lyon, France: IARC Press; 2016.
3. Committee of Brain Tumor Registry of Japan. Report of Brain Tumor Registry of Japan (2005-2008) 14th edition. Neurol Med Chir (Tokyo). 2017;57(suppl 1):9-102. doi:10.2176/nmc.sup.2017-0001
4. Nyström L, Andersson I, Bjurstam N, Frisell J, Nordenskjöld B, Rutqvist LE. Long-term effects of mammography screening: updated overview of the Swedish randomised trials. Lancet. 2002;359(9310):909-919. doi:10.1016/S0140-6736(02)08020-0
5. Bartel DP. MicroRNAs: genomics, biogenesis, mechanism, and function. Cell. 2004;116(2):281-297. doi:10.1016/S0092-8674(04)00045-5
6. Schwarzenbach H, Nishida N, Calin GA, Pantel K. Clinical relevance of circulating cell-free microRNAs in cancer. Nat Rev Clin Oncol. 2014;11(3):145-156. doi:10.1038/nrclinonc.2014.5
7. Mitchell PS, Parkin RK, Kroh EM, et al. Circulating microRNAs as stable blood-based markers for cancer detection. Proc Natl Acad Sci U S A. 2008;105(30):10513-10518. doi:10.1073/pnas.0804549105
8. Shimomura A, Shiino S, Kawauchi J, et al. Novel combination of serum microRNA for detecting breast cancer in the early stage. Cancer Sci. 2016;107(3):326-334. doi:10.1111/cas.12880
9. Lai NS, Wu DG, Fang XG, et al. Serum microRNA-210 as a potential noninvasive biomarker for the diagnosis and prognosis of glioma. Br J Cancer. 2015;112(7):1241-1246. doi:10.1038/bjc.2015.91
10. Regazzo G, Terrenato I, Spagnuolo M, et al. A restricted signature of serum miRNAs distinguishes glioblastoma from lower grade gliomas. J Exp Clin Cancer Res. 2016;35(1):124. doi:10.1186/s13046-016-0393-0
11. Santangelo A, Imbrucè P, Gardenghi B, et al. A microRNA signature from serum exosomes of patients with glioma as complementary diagnostic biomarker. J Neurooncol. 2018;136(1):51-62. doi:10.1007/s11060-017-2639-x
12. Tang H, Liu Q, Liu X, et al. Plasma miR-185 as a predictive biomarker for prognosis of malignant glioma. J Cancer Res Ther. 2015;11(3):630-634. doi:10.4103/0973-1482.146121
13. Yang C, Wang C, Chen X, et al. Identification of seven serum microRNAs from a genome-wide serum microRNA expression profile as potential noninvasive biomarkers for malignant astrocytomas. Int J Cancer. 2013;132(1):116-127. doi:10.1002/ijc.27657
14. Zhi F, Shao N, Wang R, et al. Identification of 9 serum microRNAs as potential noninvasive biomarkers of human astrocytoma. Neuro Oncol. 2015;17(3):383-391. doi:10.1093/neuonc/nou169
15. Zhou Q, Liu J, Quan J, Liu W, Tan H, Li W. MicroRNAs as potential biomarkers for the diagnosis of glioma: a systematic review and meta-analysis. Cancer Sci. 2018;109(9):2651-2659. doi:10.1111/cas.13714
16. Luo Y, Wen X, Wang L, et al. Identification of MicroRNAs involved in growth arrest and apoptosis in hydrogen peroxide-treated human hepatocellular carcinoma cell line HepG2. Oxid Med Cell Longev. 2016;2016:7530853. doi:10.1155/2016/7530853
17. Guo J, Liu C, Wang W, et al. Identification of serum miR-1915-3p and miR-455-3p as biomarkers for breast cancer. PLoS One. 2018;13(7):e0200716. doi:10.1371/journal.pone.0200716
18. Xie Z, Yin X, Gong B, et al. Salivary microRNAs show potential as a noninvasive biomarker for detecting resectable pancreatic cancer. Cancer Prev Res (Phila). 2015;8(2):165-173. doi:10.1158/1940-6207.CAPR-14-0192
19. Moreau MP, Bruse SE, David-Rus R, Buyske S, Brzustowicz LM. Altered microRNA expression profiles in postmortem brain samples from individuals with schizophrenia and bipolar disorder. Biol Psychiatry. 2011;69(2):188-193. doi:10.1016/j.biopsych.2010.09.039
20. Sourvinou IS, Markou A, Lianidou ES. Quantification of circulating miRNAs in plasma: effect of preanalytical and analytical parameters on their isolation and stability. J Mol Diagn. 2013;15(6):827-834. doi:10.1016/j.jmoldx.2013.07.005

SUPPLEMENT.

eMethods 1. Molecular Diagnosis of Diffuse Glioma
eMethods 2. Algorithm: Procedure for Constructing 2-Group Discrimination Models
eMethods 3. Algorithm: Procedure for Constructing 3-Group Discrimination Models
eTable 1. Differences in Age and Sex Distribution Between Diffuse Glioma and Noncancer Controls
eTable 2. Differences in Age and Sex Distribution Between Glioblastoma, Primary Central Nervous System Lymphoma, and Metastatic Brain Tumors
eTable 3. The 48 miRNAs of the 3-Tumor Index for Discriminating Among Glioblastoma, Primary Central Nervous System Lymphoma, and Metastatic Brain Tumors
eFigure 1. Diagnostic Utility of a Single miRNA to Distinguish Glioma From Noncancer
eFigure 2. Validation of the Glioma Index
eFigure 3. Development of the 3-Tumor Index
eReferences.
