<<

Health Technology Assessment Health Technology Assessment 2007; Vol. 11: No. 50 07 o.1:N.5 Evaluation of diagnostic tests when there is no Vol. 11: No. 50 2007;

Evaluation of diagnostic tests when there is no gold standard. A review of methods

AWS Rutjes, JB Reitsma, A Coomarasamy, KS Khan and PMM Bossuyt

Feedback The HTA Programme and the authors would like to know your views about this report. The Correspondence Page on the HTA website (http://www.hta.ac.uk) is a convenient way to publish your comments. If you prefer, you can send your comments to the address below, telling us whether you would like us to transfer them to the website. We look forward to hearing from you.

December 2007

The National Coordinating Centre for Health Technology Assessment, Mailpoint 728, Boldrewood, Health Technology Assessment University of Southampton, NHS R&D HTA Programme Southampton, SO16 7PX, UK. HTA Fax: +44 (0) 23 8059 5639 Email: [email protected] www.hta.ac.uk http://www.hta.ac.uk ISSN 1366-5278 HTA

How to obtain copies of this and other HTA Programme reports. An electronic version of this publication, in Adobe Acrobat format, is available for downloading free of charge for personal use from the HTA website (http://www.hta.ac.uk). A fully searchable CD-ROM is also available (see below). Printed copies of HTA monographs cost £20 each (post and packing free in the UK) to both public and private sector purchasers from our Despatch Agents. Non-UK purchasers will have to pay a small fee for post and packing. For European countries the cost is £2 per monograph and for the rest of the world £3 per monograph. You can order HTA monographs from our Despatch Agents: – fax (with credit card or official purchase order) – post (with credit card or official purchase order or cheque) – phone during office hours (credit card only). Additionally the HTA website allows you either to pay securely by credit card or to print out your order and then post or fax it.

Contact details are as follows: HTA Despatch Email: [email protected] c/o Direct Mail Works Ltd Tel: 02392 492 000 4 Oakwood Business Centre Fax: 02392 478 555 Downley, HAVANT PO9 2NP, UK Fax from outside the UK: +44 2392 478 555 NHS libraries can subscribe free of charge. Public libraries can subscribe at a very reduced cost of £100 for each volume (normally comprising 30–40 titles). The commercial subscription rate is £300 per volume. Please see our website for details. Subscriptions can only be purchased for the current or forthcoming volume.

Payment methods Paying by cheque If you pay by cheque, the cheque must be in pounds sterling, made payable to Direct Mail Works Ltd and drawn on a bank with a UK address. Paying by credit card The following cards are accepted by phone, fax, post or via the website ordering pages: Delta, Eurocard, Mastercard, Solo, Switch and Visa. We advise against sending credit card details in a plain email. Paying by official purchase order You can post or fax these, but they must be from public bodies (i.e. NHS or universities) within the UK. We cannot at present accept purchase orders from commercial companies or from outside the UK.

How do I get a copy of HTA on CD? Please use the form on the HTA website (www.hta.ac.uk/htacd.htm). Or contact Direct Mail Works (see contact details above) by email, post, fax or phone. HTA on CD is currently free of charge worldwide.

The website also provides information about the HTA Programme and lists the membership of the various committees. Evaluation of diagnostic tests when there is no gold standard. A review of methods

AWS Rutjes,1 JB Reitsma,1 A Coomarasamy,2 KS Khan2* and PMM Bossuyt1

1 Department of Clinical , and , Academic Medical Center, University of Amsterdam, The Netherlands 2 Division of Reproductive and Child Health, University of Birmingham, UK

* Corresponding author

Declared competing interests of authors: none

Published December 2007

This report should be referenced as follows:

Rutjes AWS, Reitsma JB, Coomarasamy A, Khan KS, Bossuyt PMM. Evaluation of diagnostic tests when there is no gold standard. A review of methods. Health Technol Assess 2007;11(50).

Health Technology Assessment is indexed and abstracted in Index Medicus/MEDLINE, Excerpta Medica/EMBASE and Science Citation Index Expanded (SciSearch®) and Current Contents®/Clinical . NIHR Health Technology Assessment Programme

he Health Technology Assessment (HTA) programme, now part of the National Institute for Health TResearch (NIHR), was set up in 1993. It produces high-quality research information on the costs, effectiveness and broader impact of health technologies for those who use, manage and provide care in the NHS. ‘Health technologies’ are broadly defined to include all interventions used to promote health, prevent and treat , and improve rehabilitation and long-term care, rather than settings of care. The research findings from the HTA Programme directly influence decision-making bodies such as the National Institute for Health and Clinical Excellence (NICE) and the National Screening Committee (NSC). HTA findings also help to improve the quality of clinical practice in the NHS indirectly in that they form a key component of the ‘National Knowledge Service’. The HTA Programme is needs-led in that it fills gaps in the evidence needed by the NHS. There are three routes to the start of projects. First is the commissioned route. Suggestions for research are actively sought from people working in the NHS, the public and consumer groups and professional bodies such as royal colleges and NHS trusts. These suggestions are carefully prioritised by panels of independent experts (including NHS service users). The HTA Programme then commissions the research by competitive tender. Secondly, the HTA Programme provides grants for clinical trials for researchers who identify research questions. These are assessed for importance to patients and the NHS, and scientific rigour. Thirdly, through its Technology Assessment Report (TAR) call-off contract, the HTA Programme commissions bespoke reports, principally for NICE, but also for other policy-makers. TARs bring together evidence on the value of specific technologies. Some HTA research projects, including TARs, may take only months, others need several years. They can cost from as little as £40,000 to over £1 million, and may involve synthesising existing evidence, undertaking a trial, or other research collecting new to answer a research problem. The final reports from HTA projects are peer-reviewed by a number of independent expert referees before publication in the widely read monograph series Health Technology Assessment. Criteria for inclusion in the HTA monograph series Reports are published in the HTA monograph series if (1) they have resulted from work for the HTA Programme, and (2) they are of a sufficiently high scientific quality as assessed by the referees and editors. Reviews in Health Technology Assessment are termed ‘systematic’ when the account of the search, appraisal and synthesis methods (to minimise biases and random errors) would, in theory, permit the of the review by others. The research reported in this monograph was commissioned by the National Coordinating Centre for Research Methodology (NCCRM), and was formerly transferred to the HTA Programme in April 2007 under the newly established NIHR Methodology Panel. The HTA Programme project number is 06/90/23. The contractual start date was in October 2005. The draft report began editorial review in March 2007 and was accepted for publication in April 2007. The commissioning brief was devised by the NCCRM who specified the research question and study design. The authors have been wholly responsible for all , analysis and interpretation, and for writing up their work. The HTA editors and publisher have tried to ensure the accuracy of the authors’ report and would like to thank the referees for their constructive comments on the draft document. However, they do not accept liability for damages or losses arising from material published in this report. The views expressed in this publication are those of the authors and not necessarily those of the HTA Programme or the Department of Health. Editor-in-Chief: Professor Tom Walley Series Editors: Dr Aileen Clarke, Dr Peter Davidson, Dr Chris Hyde, Dr John Powell, Dr Rob Riemsma and Professor Ken Stein Programme Managers: Sarah Llewellyn Lloyd, Stephen Lemon, Kate Rodger, Stephanie Russell and Pauline Swinburne ISSN 1366-5278 © Queen’s Printer and Controller of HMSO 2007 This monograph may be freely reproduced for the purposes of private research and study and may be included in professional journals provided that suitable acknowledgement is made and the reproduction is not associated with any form of advertising. Applications for commercial reproduction should be addressed to: NCCHTA, Mailpoint 728, Boldrewood, University of Southampton, Southampton, SO16 7PX, UK. Published by Gray Publishing, Tunbridge Wells, Kent, on behalf of NCCHTA. Printed on acid-free paper in the UK by St Edmundsbury Press Ltd, Bury St Edmunds, Suffolk. MR Health Technology Assessment 2007; Vol. 11: No. 50

Abstract

Evaluation of diagnostic tests when there is no gold standard. A review of methods AWS Rutjes,1 JB Reitsma,1 A Coomarasamy,2 KS Khan2* and PMM Bossuyt1

1 Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Academic Medical Center, University of Amsterdam, The Netherlands 2 Division of Reproductive and Child Health, University of Birmingham, UK * Corresponding author

Objective: To generate a classification of methods and statistical modelling (latent class analysis). In the final to evaluate medical tests when there is no gold group, validate index test results, the diagnostic test standard. accuracy paradigm is abandoned and research examines, Methods: Multiple search strategies were employed to using a number of different methods, whether the obtain an overview of the different methods described results of an index test are meaningful in practice, for in the literature, including searches of electronic example by relating index test results to relevant other databases, contacting experts for papers in personal clinical characteristics and future clinical events. archives, exploring databases from previous Conclusions: The majority of methods try to impute, methodological projects and cross-checking of adjust or construct a reference standard in an effort to reference lists of useful papers already identified. obtain the familiar diagnostic accuracy , such as Results: All methods available were classified into four sensitivity and specificity. In situations that deviate only main groups. The first method group, impute or adjust marginally from the classical diagnostic accuracy for on reference standard, needs careful paradigm, these are valuable methods. However, in attention to the pattern and fraction of missing values. situations where an acceptable reference standard does The second group, correct imperfect reference not exist, applying the concept of clinical test validation standard, can be useful if there is reliable information can provide a significant methodological advance. about the degree of imperfection of the reference All methods summarised in this report need further standard and about the correlation of the errors development. Some methods, such as the construction between the index test and the reference standard. The of a reference standard using panel consensus methods third group of methods, construct reference standard, and validation of tests outwith the accuracy paradigm, have in common that they combine multiple test results are particularly promising but are lacking in to construct a reference standard outcome including methodological research. These methods deserve deterministic predefined rules, consensus procedures particular attention in future research.

iii

© Queen’s Printer and Controller of HMSO 2007. All rights reserved.

Health Technology Assessment 2007; Vol. 11: No. 50

Contents

List of abbreviations ...... vii Classification of methods ...... 11 Impute or adjust for missing data on Executive summary ...... ix reference standard ...... 13 Correct imperfect reference standard ...... 16 1 Background ...... 1 Construct reference standard ...... 18 Introduction and scope of the report ...... 1 Validate index test results ...... 29 Key concepts in diagnostic accuracy studies ...... 2 5 Guidance and discussion ...... 35 Problems in verification ...... 3 Guidance for researchers ...... 35 Limits to accuracy: towards a validation 2 Aims of the project ...... 7 paradigm ...... 37 Aims ...... 7 Recommendations for further research ..... 39 Outline of the report ...... 7 Acknowledgements ...... 41 3 Methods ...... 9 Literature search and inclusion of References ...... 43 papers ...... 9 Assessment of individual studies ...... 9 Appendix 1 Search terms in databases ..... 49 Overall classification of methods ...... 9 Structured summary of individual Appendix 2 Experts in peer review methods ...... 10 process ...... 51 Expert review on individual methods ...... 10 Development of research guidance ...... 10 Health Technology Assessment reports Peer review of report ...... 10 published to date ...... 53

4 Results ...... 11 Health Technology Assessment Search results and selection of studies ...... 11 Programme ...... 69

v

Health Technology Assessment 2007; Vol. 11: No. 50

List of abbreviations

BCG Bacillus Calmette–Guérin MRI magnetic resonance imaging

CI NMAR not missing at random

COPD chronic obstructive pulmonary NPV negative predictive value disease NT-proBNP N-terminal pro-brain natriuretic CT computed tomography peptide

ED emergency department PCR polymerase chain reaction

EIA enzyme immunoassay PE

DOR diagnostic PET positron emission tomography

FN false negative result PPV positive predictive value

FP false positive result RCT randomised controlled trial

LR likelihood ratio ROC receiver operating characteristic

LTBI latent tuberculosis infection TN true negative result

MAR missing at random TP true positive result

MCAR missing completely at random TST tuberculin skin test

All abbreviations that have been used in this report are listed here unless the abbreviation is well known (e.g. NHS), or it has been used only once, or it is a non-standard abbreviation used only in figures/tables/appendices in which case the abbreviation is defined in the figure legend or at the end of the table.

vii

© Queen’s Printer and Controller of HMSO 2007. All rights reserved.

Health Technology Assessment 2007; Vol. 11: No. 50

Executive summary

Background Methods The classical diagnostic accuracy paradigm is We employed multiple search strategies to obtain based on studies that compare the results of the an overview of the different methods described in test under evaluation (index test) with the results the literature, including searches of electronic of the reference standard, the best available databases, contacting experts for papers in method to determine the presence or absence of personal archives, exploring databases from the condition or disease of interest. Accuracy previous methodological projects (STARD and measures express how the results of the test under QUADAS) and cross-checking of reference lists of evaluation agree with the outcome of the reference useful papers already identified. standard. Determining accuracy is a key step in the health technology assessment of medical tests. We developed a classification for the methods identified through our review taking into account Researchers evaluating the diagnostic accuracy of the degree to which they represented a departure a test often encounter situations where the away from the classical diagnostic accuracy reference standard is not available in all patients, paradigm. where the reference standard is imperfect or where there is no accepted reference standard. We use For each method in our overview, we prepared a the term ‘no gold standard situations’ to refer to structured summary based on all or the most all those situations. Several solutions have been informative papers describing its rationale, its proposed in these circumstances. Most articles strengths and weaknesses, its field of application, dealing with imperfect or absent reference available software and illustrative examples of the standards focus on one type of solution and method. discuss the strengths and limitations of that approach. Few authors have compared different Based on the findings of our review, discussions approaches or provided guidelines on how to about the pros and cons of different methods in proceed when faced with an imperfect reference various situations within the research team and standard. input from expert peer reviewers, we constructed a flowchart providing general guidance to researchers faced with evaluation of tests without a Objectives gold standard. We systematically searched the literature for methods that have been proposed and/or applied Results in situations without a ‘gold’ standard, that is, a reference standard that is without error. Our From 2200 references initially checked for their project had the following aims: usefulness, we ultimately included 189 relevant articles that were subsequently used to classify and 1. To generate an overview and classification of summarise all methods into four main groups, as methods that have been proposed to evaluate follows. medical tests when there is no gold standard. 2. To describe the main methods discussing Impute or adjust for missing data on rationale, assumptions, strengths and reference standard weaknesses. In this group of methods, there is an acceptable 3. To describe and explain examples from the reference standard, but for various reasons the literature that applied one or more of the outcome of the reference standard is not obtained methods in our overview. in all patients. Methods in this group either 4. To provide general guidance to researchers impute or adjust for this missing information in facing research situations where there is no the subset of patients without reference standard gold standard. outcome. Researchers should be careful with these ix

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Executive summary

methods if (1) the pattern of missing values is important category is relating index test results not determined by the study design, but is with future clinical events, such as the number of influenced by the choice of patients and events in those tested negative for the index test physicians, or (2) the fraction of patients verified results. Test results can also be used in a with the reference standard is small within results randomised study to see whether the test can of the index tests. predict who will benefit more from one intervention than the other. Because the classical Correct imperfect reference standard accuracy paradigm is not employed, measures In this group, there is a preferred reference other than accuracy measures are calculated, standard, but this standard is known to be including event rates, relative risks and other imperfect. Solutions from this group either adjust correlation statistics. estimates of accuracy or perform sensitivity analysis to examine the impact of this imperfect reference standard. The adjustment is based on Conclusions external data (previous research) about the degree of imperfection. Correction methods can be useful The majority of methods try to impute, adjust or if there is reliable information about the degree of construct a reference standard in an effort to imperfection of the reference standard and about obtain the familiar diagnostic accuracy statistics the correlation of the errors between the index such as pairs of sensitivity and specificity or test and the reference standard. likelihood ratios. In situations that deviate only marginally from the classical diagnostic accuracy Construct reference standard paradigm, for example where there are few These methods have in common that they missing values on an otherwise acceptable combine multiple test results to construct a reference standard or where the magnitude and reference standard outcome. Groups of patients type of imperfection in a reference standard is well receive either different tests (differential documented, these are valuable methods. verification and discrepant analysis) or the same However, in situations where an acceptable set of tests, after which these results are combined reference standard does not exist, holding on to by: (1) deterministic predefined rule (composite the accuracy paradigm is less fruitful. In these reference standard); (2) consensus procedure situations, applying the concept of clinical test among experts (panel diagnosis); (3) a statistical validation can provide a significant model based on actual data (latent class analysis). methodological advance. Validating a test The prespecified rule for target condition makes that scientists and practitioners examine, using a the composite reference standard method number of different methods, whether the results transparent and easy to use, but misclassification of an index test are meaningful in practice. of patients is likely to remain. Discrepant analysis Validation will always be a gradual process. It will should not be considered in general, as the involve the scientific and clinical community method is likely to produce biased results. The defining a threshold, a point in the validation drawback of latent class models is that the target process, whereby the information gathered would condition is not defined in a clinical way, so there be considered sufficient to allow clinical use of the can be lack of clarity about what the results stand test with confidence. for in practice. Panel diagnosis also combines multiple pieces of information, but experts may Recommendations for further research combine these items in a manner that more All methods summarised in this report need closely reflects their own personal concept of the further development. Some methods, such as the target condition. construction of a reference standard using panel consensus methods and validation of tests outwith Validate index test results the accuracy paradigm, are particularly promising The diagnostic test accuracy paradigm is but are lacking in methodological research. These abandoned in this group and index test results are methods deserve particular attention in future related to relevant other clinical characteristics. An research.

x Health Technology Assessment 2007; Vol. 11: No. 50

Chapter 1 Background

“Accuracy is telling the truth … Precision is telling the decision-making. Accuracy studies may not be same story over and over again.” sufficient to evaluate the clinical value of a test, Yiding Wang especially if the new test is more sensitive than the existing test(s).10 The reason is that the results from current intervention studies may not apply to Introduction and scope of the those additional cases detected. Results from report randomised studies assessing response to in these additional cases are then required. Later As with all other elements of healthcare, medical stage evaluation studies may focus on determining tests should be thoroughly evaluated in high- the societal costs and benefits of a test. quality studies. Biased results from poorly designed, conducted or analysed studies may The focus of this report is on problems related to trigger premature dissemination and diagnostic accuracy studies. The key challenge in implementation of a and mislead diagnostic accuracy is to determine in all patients physicians to incorrect decisions regarding the whether the target condition is present or absent. care for an individual patient.1–3 Avoidance of The reference standard should provide this these perils requires a proper evaluation of classification. As such, the reference standard plays medical tests.4 a crucial role in accuracy studies.

Several authors have proposed a staged model for Problems with the reference standard abound the evaluation of medical tests.5–7 Technical in diagnostic accuracy studies. evaluations dominate in the early phases, in which (http://www.fda.gov/cdrh/osb/guidance/1428.pdf). the under different conditions of The outcome of the reference standard may not be biochemical tests and the intra- and inter-observer available in all patients, it may be unreliable, it variation of tests are evaluated. A key phase in the may be inaccurate or there could be no acceptable clinical evaluation of a test is determining its reference standard at all. As in any other form of diagnostic accuracy: the ability to discriminate epidemiological research, outcome data that are between patients who have the condition of missing or misclassified pose a great threat to the interest (target condition) and those who have validity of such studies, and diagnostic accuracy not.8 The target condition can refer to a disease, studies are no exception.11–15 We will use the term syndrome or any other identifiable condition that ‘no gold standard situations’ to refer loosely to all may prompt clinical actions such as further these situations where the outcome of the diagnostic testing, or the initiation, modification reference standard is missing, the reference or termination of treatment. standard is imperfect or there is no acceptable reference standard. By themselves, accuracy studies cannot always answer the question whether a medical test is This report gives an overview of solutions that useful or not. More informative accuracy studies have been proposed to overcome no gold standard can be designed by taking into account the likely situations in diagnostic accuracy research. Because future role of the test under evaluation. Three the scope of this report is limited to problems possible roles are replacement, addition or triage.9 related to diagnostic accuracy studies, we will not In comparative accuracy studies, the accuracy of discuss the broader issue of determining the most the test under evaluation is compared against that appropriate type of evaluation given a specific of existing diagnostic pathways, leading to more diagnostic research question. informative and possibly more efficient diagnostic accuracy studies.9 Before explaining our methods (Chapter 3) and reporting our results (Chapter 4), in the next The clinical value of a test will ultimately depend section we first explain the key features of on whether it is able to improve patient outcome. diagnostic accuracy studies and in the subsequent In most cases this will be by guiding subsequent section discuss the various mechanisms that can 1

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Background

Target condition Patients

Present Absent

Index test

+ TP FP Index test Reference standard Result – FN TN Cross classification

(a) Accuracy measures: Sensitivity = TP/(TP + FN) Specificity = TN/(TN + FP) LR+ = [TP/(TP + FN)]/[FP/(FP + TN)] LR– = [FN/(FN + TP)]/[TN/(TN + FP)] PPV = TP/(TP + FP) NPV = TN/(TN + FN) DOR = (TP/FN)/(FP/TN) (b)

FIGURE 1 (a) Classical design of a diagnostic accuracy study and (b) results of an accuracy study in the case of a dichotomous index test result.TP, true positive result; FP, false positive result; FN, false negative result; TN, true negative result; LR, likelihood ratio; PPV, positive predictive value; NPV, negative predictive value; DOR, .

lead to no gold standard situations. In Chapter 5 accuracy can be calculated from this table, as also we provide guidance on selecting the most shown in Figure 1(b). appropriate method for a given situation by addressing some key questions. In addition, we The term accuracy has been borrowed from sketch possible research alternatives in situations measurement theory, where it is defined as the where the diagnostic accuracy paradigm is unlikely closeness of agreement between an analytical to be useful. measurement and its actual (true) value.16 The first publication mentioning accuracy and the associated statistics sensitivity and specificity to Key concepts in diagnostic express the performance of a medical test was by accuracy studies Yerushalmy,17 followed by the landmark publication of Ledley and Lusted.18 In these Diagnostic accuracy studies aim to measure the publications, test results in patients known to have amount of agreement between index test results the disease of interest were compared with test and the outcome of the reference standard. The results in subjects not having the disease. Since classical outline of an accuracy study is given in then, various other measures of accuracy have Figure 1(a). The starting point is a consecutive been introduced, including likelihood ratios, series of individuals in whom the target condition predictive values and the diagnostic odds ratio.4 is suspected. The index test is performed first in All of these accuracy measures have in common all subjects, and subsequently the presence or that they need a classification of patients in those absence of the target condition is determined by with and those without the condition of interest. the outcome of the reference standard. In the case In other words, index test results are verified by of a dichotomous index test result, the results of comparing them with the outcome of the an accuracy study can be summarised in a 2-by-2 reference standard. Based on this concept, we can 2 table, as shown in Figure 1(b). Several measures of formulate the properties of the ideal reference Health Technology Assessment 2007; Vol. 11: No. 50

standard and verification procedure. The ideal examination of the removed appendix is used as verification would fulfil the following reference standard. This requires defining a criteria: threshold for the type and amount of inflammatory changes above which we will classify 1. The reference standard provides error-free a patient as having appendicitis. Different classification. reference standards may apply a different 2. All index test results are verified by the same threshold before classifying patients as having the reference standard. target condition. In the example of appendicitis, 3. The index test and reference standard are clinical follow-up and observing whether the performed at the same time, or within an symptoms of the patients improve or deteriorate interval that is short enough to eliminate will probably a higher threshold for disease, changes in target condition status. resulting in some patients being classified differently between these two reference standards; Empirical studies have shown that estimates of for example, some patients with inflammatory diagnostic accuracy are directly influenced by the changes at histological examination would have quality of the verification procedure.1,3,19 recovered in a natural way without intervention. This requires proper thinking by the researcher of what is the right definition of the target condition Problems in verification and then choosing the most appropriate reference standard in the light of this target condition. In practice, researchers may encounter situations where the ideal verification procedure cannot be Additional sources of misclassification are failures achieved. Based on the first two criteria for an in the reference standard protocol and ideal verification procedure, we will briefly discuss interpretation errors by observers. These errors in the key mechanisms why verification procedures misclassification could be preventable by stricter can produce errors. adherence to protocol or better training of observers. Examples include failure of detecting Classification errors by the reference cancer cells after fine-needle aspiration because standard the biopsy was performed outside the tumour Within the accuracy concept, the reference mass or overlooking a small pulmonary embolism standard is judged by its performance in in spiral computed tomography (CT) images. producing error-free classification with respect to the presence or absence of the target condition. Furthermore, for several target conditions there is Even if a reference standard has no analytical no reference standard based on histological or error but the target condition does not produce biochemical changes. In some of those cases, the the biochemical changes of interest, we consider it condition is defined by a combination of a ‘failure’ of the reference standard. Other symptoms and signs. Migraine is an example. The examples of imperfections of the reference presence or absence of this and similar conditions standard are tumours that are missed because they has been based on criteria developed by individual are below the level of detection in case of a researchers or on criteria established during a radiological reference standard, or the presence of consensus meeting. Such classifications can vary an alternative condition that is misclassified as the over time or across countries, and cannot be error- target condition because it produces similar free. changes in a biomarker as that of the target condition. Given the many potential factors that can lead to errors in the classification of the target condition One inherent difficulty in accuracy studies is that by a reference standard, a perfect reference we use the dichotomy of target condition present standard (e.g. providing error-free classification) is or absent, whereas in reality the target condition unlikely to exist in practice. The shift in varies from very early and minor changes to severe terminology from ‘gold standard’, suggesting a and advanced stages of the disease, thus covering standard without error, to the more neutral term a wide spectrum of disease. For some conditions, ‘reference standard’, indicating the best available defining the lower end of disease is difficult. This method, has been initiated by these observations. can be illustrated with the example of It means that researchers should always discuss the appendicitis. In the majority of accuracy studies quality of the reference standard and the potential evaluating non-invasive imaging techniques for consequences of misclassification by the reference the diagnosis of appendicitis, histological standard. 3

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Background

Patients Patients Patients

Index test Index test Index test

Reference standard … Reference standard … Ref. A Ref. B

Cross classification Cross classification Cross classification (a) (b) (c)

FIGURE 2 (a) Diagnostic accuracy study with complete verification by the same reference standard (classic design), (b) study with partial verification and (c) study with differential verification.

Whatever the cause of errors in classification of complications it is now considered unethical to target condition status by the reference standard, perform this reference standard in low-risk it will directly lead to changes in the 2-by-2 patients, for instance in patients with low clinical classification table (Figure 1b). Within the classic probability and negative D-dimer result. Other accuracy concept, any disagreement between the reasons for missing data on the outcome of the reference standard and the index test will be reference standard are situations where the labelled as a ‘false’ result for the index test. The reference standard is temporarily unavailable or net effect of the misclassification by the reference when patients and doctors decide to refrain from standard can either be an upward or downward verification. bias in estimates of diagnostic accuracy. The direction depends on whether errors by the index One solution is to leave the unverified patients out test and imperfect reference standard are of the 2-by-2 table, which is referred to as partial correlated. If errors are positively correlated, it verification (Figure 2b). Omitting unverified will erroneously increase agreement in the 2-by-2 patients from the 2-by-2 table may generate a bias. tables and estimates of accuracy will be The direction and magnitude will depend on inflated.11–15 The magnitude of the biasing effect (1) the fraction of patients that are unverified; depends on the of errors by the (2) the ratio between the number of patients with imperfect reference standard and the degree of positive and negative index test results that remain correlation in errors between index test and unverified; and (3) whether the reason for not reference standard. verifying is related to the presence or absence of the target condition.20 Partial and differential verification Even if a near perfect reference standard exists, it Because omitting patients from the 2-by-2 may be impossible, unethical or too costly to apply classification table can lead to bias, researchers this standard in all patients. In the case of target have strived to obtain complete verification by conditions that can produce multiple lesions that applying an alternative reference standard in the need histological verification, it is often impossible initially unverified patients. The use of different to verify (the countless) negative index test results. reference standards between patients is known as An example is 18-fluorodeoxyglucose-positron differential verification (Figure 2c). An example of emission tomography (PET) scanning to detect differential verification that is fairly common is possible distant metastases before planning major where follow-up is used as the alternative reference curative surgery in patients with carcinoma of the standard in patients not verified by the preferred oesophagus: only PET hot spots can be verified reference standard. In these studies, clinical histologically. follow-up is used as a proxy to obtain the information of true status at the of the Ethical reasons play a role in the choice of index test; the term delayed-type cross-sectional reference standard in pulmonary embolism. accuracy study has been introduced for this Angiography is still considered the best available design.8 For all studies with differential verification method for the detection of pulmonary where the alternative reference standard provides 4 embolisms, but because of the frequency of serious imperfect classification, it may affect the 2-by-2 Health Technology Assessment 2007; Vol. 11: No. 50

table and generate bias along the same lines as different solutions have been suggested to remedy discussed earlier in the previous section on errors the effects of verification procedures that are not in classification by the reference standard. based on using a single gold standard in all Empirical studies have shown that studies with patients. We have systematically searched the differential verification produce higher estimates literature to identify and summarise the solutions of diagnostic accuracy than studies with complete that have been proposed. We have developed a verification by the preferred reference standard.1,3 classification of these methods based on their figurative distance, that is, the extent to which Given these diverging reasons for imperfections in they depart from the classical diagnostic accuracy the verification procedure, it is not surprising that paradigm described in Figure 1.

5

© Queen’s Printer and Controller of HMSO 2007. All rights reserved.

Health Technology Assessment 2007; Vol. 11: No. 50

Chapter 2 Aims of the project

everal methods have been proposed to deal 3. To describe and explain examples from the Swith situations where the reference standard is literature that applied one or more of the partially unavailable or imperfect or where there is methods in our overview. no accepted reference standard. These methods 4. To provide general guidance to researchers vary from relatively simple correction methods facing research situations where there is no based on the expected degree of imperfection of gold standard. the reference standard, through more complex statistical models that construct a pseudo-reference standard, to methods that validate index test Outline of the report results by examining their association with other relevant clinical characteristics. Most articles Chapter 1 provides background information about dealing with imperfect or absent reference the key concepts in diagnostic accuracy studies, standards focus on one type of solution and in particular the role of the reference standard discuss strengths and limitations of that and the problems that can be encountered in particular approach.21–23 Few authors have the verification of index tests results. The compared different approaches or have provided methods chapter (Chapter 3) describes the guidelines on how to proceed when faced with strategies that we employed to identify and assess an imperfect reference standard24–26 methodological papers on methods relevant for (http://www.fda.gov/cdrh/osb/guidance/1428.pdf). this project. The results chapter (Chapter 4) has the following structure: the first section presents the results of the literature search; an overview Aims and classification of the different methods that we encountered is given in the second section; and a In this project, we systematically searched the description of each of the methods discussing literature for methods to be used in situations strengths and limitations is given in the third without a ‘gold’ standard. Our project had the section. In Chapter 5 we provide general guidance following aims: to researchers when faced with research situations where there is no gold standard. In addition, we 1. To generate an overview and classification of discuss some alternative options when repairing or methods that have been proposed to evaluate hanging on to the accuracy paradigm is not medical tests when there is no gold standard. helpful. 2. To describe the main methods discussing rationale, assumptions, strengths and weaknesses.

7

© Queen’s Printer and Controller of HMSO 2007. All rights reserved.

Health Technology Assessment 2007; Vol. 11: No. 50

Chapter 3 Methods

n our , we modified the and the Cochrane Library (DARE, CENTRAL, Irecommendations set out by the Cochrane CMR, NHS). The exact terms of the search Collaboration and the NHS Centre for Reviews strategy are given in Appendix 1. and Dissemination27 to make them more ● Searching databases that have been established applicable for a review of methodological papers. in earlier methodological projects, including STARD and QUADAS. Our approach consisted of the following steps: ● Searching personal archives. ● Contacting other experts in the field of 1. literature search and inclusion of papers diagnostic research to establish whether they 2. assessment of individual studies had additional relevant papers, especially aimed 3. overall classification of methods at retrieving articles within the grey literature. 4. structured summary of individual methods ● To locate additional information on specific 5. expert review on individual methods methods, we checked reference lists, used the 6. development of guidance statements citation tracking option of SCISEARCH and 7. peer review of total report. applied the ‘related articles’ function of PubMed. Literature search and inclusion of Inclusion of papers papers Publications in English, German, Dutch, Italian, and French were included. One researcher (AWSR) Searching for methodological papers in electronic reviewed the identified studies for inclusion in the databases is difficult because of inconsistent review and this process was checked by another indexing and the absence of a specific keyword for reviewer (JBR). The only reason for exclusion was the relevant publication types. Several methods for if the article did not address a method that could conducting diagnostic research when there is no be used in a no gold standard situation. gold standard have been developed in other areas of epidemiological, biomedical or even non- medical research. Conducting a broad literature Assessment of individual studies search to capture every single potentially relevant paper would be inappropriate, especially since We extracted a limited set of standard items from precise estimation is not the objective of a review each included paper. These included the strategy of methodological papers. Once a set of that identified the paper, the type of journal, the comprehensive papers has been obtained about a type of article and the type of method proposed. methodological issue, there is no additional value Other items were the rationale and structure of in reviewing additional papers explaining the the article, a description of its usefulness and same concept. This is known as theoretical remarks about the overlap in relation to other saturation,28 a principle that guided our search papers (all free text fields). The information and selection. extracted in this way was used to organise papers within each method. The main function of this Based on these considerations, we relied on step was to categorise papers in order to facilitate multiple strategies to obtain an overview of the the writing of a structured summary of each different methods described in the literature and method. to maximise the likelihood of identifying those papers that provide the most thorough and complete description: Overall classification of methods

● Restricted electronic searches in the following We classified all methods into groups that databases: MEDLINE (OVID and PubMed), addressed diagnostic research in situations where EMBASE (OVID), MEDION, a database of there is no gold standard in an analogous way. diagnostic test reviews (www.mediondatabase.nl), The purpose of this classification was to give an 9

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Methods

overview of methods and to serve as a starting project, its objectives and the summary of a point for formulating guidance. The key element specific method. They were asked to review this in the classification was the underlying mechanism first version, paying particular attention to the leading to the absence of perfect reference following issues: standard information (see also Chapter 1). ● Is the method accurately described? ● Are the main characteristics of the method Structured summary of individual described? ● Are key strength and weaknesses mentioned? methods ● Are key references missing? For each method in our overview, we made a ● Are tables and/or figures understandable? structured summary based on all or the most ● Do you have any other suggestions how the informative papers describing that method. Each summary can be improved? summary starts with a description of the method and its rationale, avoiding technical language Based on the comments from the expert, a final where possible. In subsequent sections, the version of each method was prepared. Because of strengths and weaknesses of the methods are time constraints, the revised version was not sent described; the field of application of the method back to the expert. in terms of the circumstances in which the method should or should not be used; available software; and an illustrative example of the method. One Development of research member of the research team (AWSR or AC) wrote guidance the first draft of a summary, which was then reviewed and modified by another team member The research team held face-to-face meetings to (JBR, PMMB, AC, KSK). These two reviewers discuss the pros and cons of different methods in continued exchanging drafts, held face-to-face different situations. Prior to this meeting, we had meetings or had teleconference meetings until contacted and interviewed a few general experts they agreed that the version was ready for expert about their preferred solutions when faced with review. research situations where there is no gold standard. Based on the discussion, a first version was drafted and circulated among the research Expert review on individual members. Further comments were incorporated to methods produce a final version. We assembled a list of potential reviewers outside our research team to review each method. These Peer review of report experts were selected based on their knowledge and/or expertise in relation to a specific method. The full report was reviewed by general experts This group included epidemiologists, identified by the research team and peer reviewers and clinicians (see Appendix 2). Each expert was assigned by the NHS Research Methodology contacted by electronic mail or by telephone and programme. was asked to comment on at least one method. These experts received a brief description of the

10 Health Technology Assessment 2007; Vol. 11: No. 50

Chapter 4 Results

Search results and selection of searches. Using the ‘related articles’ approach in studies PubMed we identified 10 additional references relevant for imputation. For the panel consensus The references retrieved by the electronic searches method, an additional search in PubMed was were assembled in a single database. After performed with the search term Delphi[textword], duplicate references had been deleted, the resulting in no relevant references. The final abstracts of 2265 references were assessed for number of relevant articles included was 189. A eligibility. Of those, 134 were labelled potentially flowchart of the retrieval and inclusion of studies relevant. Twenty-three references were excluded is given in Figure 3. because they could not be retrieved or inspection of the full article revealed that the topic was not test evaluation. Contact with experts in the field Classification of methods resulted in an additional seven references to books and 50 reference to articles. Reference checking at Many methods have been described for no gold various stages of preparing the report yielded a standard situations. An exact number is difficult to total of 11 additional relevant articles. provide because many methods can be viewed as variations or extensions of a single underlying As the number of references was limited for approach. For this report, we classified the imputation methods and the panel consensus methods into four main groups (Table 1). These method, we conducted additional electronic four groups differ in their distance or extent of

Electronic search: potentially eligible publications n = 2265

Articles excluded: n = 2131

Electronic search: initially included articles n = 134

Articles excluded: Not addressing test evaluation n = 15 Not retrievable n = 8

Electronic search: final number of included articles n = 111

Personal database searches and expert contact: Articles included n = 50 Books included n = 7

Reference checking n = 11

Additional searches n = 10

Total number included n = 189

FIGURE 3 Flowchart of the selection process of articles 11

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Results

TABLE 1 Classification of methods for diagnostic research where there is no gold standard

Main classification Main characteristic Section in Subdivision Chapter 4a

A. Impute or adjust for Impute the outcome of the reference standard in those patients A missing data on reference who did not receive verification by the reference standard or adjust standard estimates of accuracy based on complete cases

B. Correct imperfect Correct estimates of accuracy or perform sensitivity analysis to B reference standard examine the impact of using imperfect reference standard based on external data about the degree of imperfection

C. Construct reference Information from different tests is combined to construct the C standard reference standard outcome. Groups of patients receive either D Differential verification different tests (differential verification and discrepant analysis) or E Discrepant analysis the same set of tests after which these results are combined by: F Composite reference standard (a) deterministic predefined rule (composite reference standard) G Panel or consensus diagnosis (b) consensus procedure among experts (panel diagnosis) H Latent class analysis (c) a based on actual data (latent class analysis)

D. Validate index test results Explore meaningful relations between index test results and other I Examine patient outcomes relevant clinical characteristics. An important way to validate is to use dedicated follow-up to capture clinical events of interest in relation to index test results, including randomised diagnostic studies

a A, ‘Impute or adjust for missing data on reference standard’ (p. 13); B, ‘Correct imperfect reference standard’ (p. 16); C, ‘Construct reference standard’ (p. 18); D, ‘Differential verification’ (p. 18); E, ‘Discrepant analysis’ (p. 19); F, ‘Composite reference standard’ (p. 21); G ‘Panel or consensus diagnosis’ (p. 24); H, ‘Latent class analysis’ (p. 26); I, Validate index test results’ (p. 29).

departure from the classical diagnostic accuracy Methods in the third group, Group C: Construct paradigm including a gold standard, as explained a reference standard, have in common that they in Chapter 1. Our classification is neither combine multiple pieces of information (test exhaustive nor mutually exclusive. Researchers can results) to construct a reference standard outcome. also use a combination of methods within a single The methods differ as to whether they use a study. predefined deterministic rule to classify patients as having the target condition (composite reference In the first group of methods, Group A: Impute standard), whether only discordant results are or adjust for missing data on reference standard, retested with a second reference standard there is a reference standard providing adequate (discrepant analysis) or whether the different tests classification, but for various reasons the outcome are combined through a statistical model (latent of the reference standard is not obtained in all class analysis). Also, a panel of experts can be used patients (see Chapter 1 for an overview of to determine the presence or absence of the target reasons). The methods in this group impute or condition in each patient. adjust for this missing information in the subset of patients without reference standard outcome. The diagnostic test accuracy paradigm is Methods within this group differ in the way in abandoned in the fourth group, Group D: which they impute or adjust for this missingness. Validate index test results. In these studies, index test results are related to other relevant clinical In the second group, Group B: Correct for characteristics. An important category is relating imperfections in reference standard, there is a index test results with future clinical events, such preferred reference standard, but this standard is as the number of events in those tested negative known to be imperfect. Solutions from this group for the index test or a randomised comparison either adjust estimates of accuracy or perform between testing and non-testing. Any relevant sensitivity analysis to examine the impact of using clinical information could be used to validate this imperfect reference standard. The adjustment index test results, including cross-sectional or is based on external data (previous research) about historical data. Because this group departs 12 the degree of imperfection. completely from the classical accuracy paradigm, Health Technology Assessment 2007; Vol. 11: No. 50

measures other than accuracy measures are basis for predicting and subsequently imputing calculated, including event rates, relative risks and missing values. correlation statistics. If the pattern of missing values is related to the target condition, but through mechanisms that we Impute or adjust for missing data have not observed, it is called NMAR. This on reference standard situation can occur when incomplete verification is not design based, but determined by the In some studies, an accepted reference standard choice of patients and physicians. In these providing adequate classification is available, but situations, the mechanisms leading to missing for a variety of reasons not all patients receive this values are difficult to determine, especially when standard, leading to missing data on whether the additional patient characteristics such as target condition is present or absent. The general symptoms, signs or previous test results have not problem of missing data is well established in been recorded. epidemiology and biostatistics, and many statistical methods have been developed to deal Imputation methods with missing data.29 Most literature focuses on Imputation methods comprise two phases, an missing values on predictors, but a fair amount imputation phase where each missing value is of literature is available on missing data of replaced, and an analysis phase, where estimates outcome variables, including papers dealing of sensitivity and specificity are computed based with diagnostic research.22,23,30–41 Missing data on the now complete dataset. Many variants of on the reference standard have been labelled imputation are possible, ranging from single partial or incomplete verification in the diagnostic imputations of missing values to multiple literature, and the bias associated with it is known imputations.42–46 Instead of filling in a single as partial verification bias or sequential ordering value for each missing value, multiple imputation bias.19,20 procedure replaces each missing value with a set of plausible values that represent the Imputation methods use a mathematical function uncertainty about the correct value to impute. to fill in each missing value, whereas correction These multiple imputed data sets are then methods use a mathematical function to correct analysed (one by one) by standard procedures the indexes of accuracy directly. Three main for complete data sets. In a next step, the results patterns of partial verification or missingness have from these analyses are combined to produce been described: missing completely at random estimates and confidence intervals (CIs) that (MCAR), missing at random (MAR) and not properly reflect the uncertainty due to missing missing at random (NMAR).29 values.

In MCAR, missing values are unrelated to the The choice of imputation method is largely based status of the target condition, index test results on the pattern of missingness. Basically, the and other patient characteristics. Hence the associations between patient characteristics and occurrence is a true random process and therefore the outcome of the reference standard are the frequency of missing values is similar among evaluated through a statistical model, like logistic patients with positive or negative index test regression, subsequently to impute missing values. results. This situation can occur due to In the NMAR setting, it is very rare to identify an unavailability of the reference standard because of appropriate model for the missingness technical failures. mechanism. Consequently, the validity of the model is uncertain. For further information on More often, missing values for the reference dealing with missingness and imputations, readers standard do not occur completely at random, but are referred to Little and Rubin.29 are related to the results of the index test (more frequent in patients with negative index test result) Correction methods for missing data on and to other patient characteristics. If the reference standard outcome mechanisms that have led to the missing values on In correction methods, no effort is made to the reference standard are known and observed, impute missing values. The notion is that then the technical term ‘missing at random’ estimates of diagnostic test accuracy are likely to (MAR) is used. This term is used because within be biased if only patients who received verification strata of the mechanisms leading to missing with the reference standard (complete cases) are values, the pattern is random. This also forms the analysed. A mathematical correction of these 13

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Results

estimates and their CIs is warranted based on the Software mechanism of missingness. Software for dealing with missing data is now embedded in the main statistical software packages. The first publications about correction for partial verification were based on the MCAR Clinical example assumption.39 If missings occur completely at Harel and Zhou used two real data examples to random and the fractions of persons missing in illustrate the results of multiple imputation and each of the cells of the two-by-two table are likely correction methods for missing data on the to be comparable, estimates (but not CIs) of outcome of the reference standard.23 We present a accuracy can be easily recalculated without any brief summary of their results. statistical model. As this pattern is not likely to occur often, mathematical correction methods The first example addresses hepatic scintigraphy, have been developed that can deal with MAR. In an imaging scan procedure, for the detection of MAR, estimates of accuracy can be corrected if liver cancer.51 Out of 650 patients, 344 were verification is random within specific patient referred to liver pathology, which was considered profiles, based on patient characteristics or (index) the reference standard (Table 2). The MCAR test results, and the number of patients not assumption does not hold here, as 39% of the verified within each stratum is known. Here, index test positives and 63% of the index test several methods to correct estimates of sensitivity negatives are not verified. Either MAR or NMAR and specificity based on this assumption have been is true, and Harel and Zhou used the MAR described.13,22,23,32,47,48 The same is true for the assumption that non-verification occurred correction of receiver operating characteristic randomly within index test positives and (ROC) curves and associated indexes.38,47,49 negatives, respectively. Table 3 presents summary estimates of sensitivity, specificity and CIs as Strengths and weaknesses computed by eight different methods. The first Both methods aim to alleviate the bias that is method concerns a ‘complete case’ analysis, where potentially introduced by analysing only cases the 306 patients without liver pathology are with complete data on the reference standard. ignored and estimates are based on the remaining The strengths of each method are determined by 344 patients only. This method gives a higher how accurately the mechanism behind missing estimate of sensitivity and lower estimate of values is known. If the mechanisms that lead to specificity in comparison with all other methods. missing values on the reference standard are The two correction methods, as proposed by Begg (partially) unknown, the correction or imputation and Greenes,22,48 produce comparable estimates, methods are prone to bias. Moreover, correction with lower sensitivities and higher specificities and imputation methods are based on statistical compared with the complete case analysis. In modelling of the data, which requires sufficiently comparison with the remaining five multiple large sample sizes.50 The potential of bias can imputation methods, however, the sensitivities are be large when correction methods are used in lower and the specificities are higher. studies with relative small sample sizes, or if the fraction of patients receiving the reference In the second example, Harel and Zhou used data standard is small. Imputation methods, especially of a study evaluating diaphanography as a test for multiple imputation methods, seem to be more detecting breast cancer.52 Of the 900 patients robust in these situations than correction enrolled, 812 were not verified. The pattern of methods.23 missingness was either MAR or NMAR, as 55% of the diaphanography positives and only 6% of the Application diaphonography negatives were verified (Table 4). These methods can be used when a preferred Again, Harel and Zhou chose to use the MAR reference standard is available, but has not been assumption that is related to index test results applied in all patients enrolled in the study. only. Table 5 shows the estimates computed by the Ideally, partial or incomplete verification should eight methods. Here the differences between the be planned by design, so that the pattern of complete case analysis, the correction and missingness is known. Researchers should be very imputation methods are more prominent, while careful with these methods if missingness is the estimates of multiple imputations are fairly uncontrolled and open to influence by patient and close to each other. In the light of the sample size practitioner choice, if the sample size is small or if and their previous simulation work, Harel and the fraction of verified patients is small within test Zhou concluded that the estimates of multiple 14 categories. imputations are more representative of the data. Health Technology Assessment 2007; Vol. 11: No. 50

TABLE 2 Hepatic scintigraphy data: outcome of reference standard in those verified and fraction of unverified test results (adapted from Harel and Zhou23)

Liver pathology Liver pathology Liver pathology positive negative not performed

Hepatic scintigraphy positive 231 32 166 Hepatic scintigraphy negative 27 54 140 Total 258 86 306

TABLE 3 Estimates with 95% CIs of sensitivity and specificity for different methods to adjust or impute missing data on reference standard outcome (based on data from Table 2)

Procedure Sensitivity Specificity

Estimate 95% CI Estimate 95% CI

Complete cases 0.895 0.858 to 0.932 0.628 0.526 to 0.730 Begg and Greenesa 0.836 0.788 to 0.884 0.738 0.662 to 0.815 Logit version Begg and Greenesa 0.836 0.835 to 0.838 0.738 0.735 to 0.741 A&Cb 0.869 0.820 to 0.918 0.672 0.571 to 0.772 Rubin (logit)b 0.872 0.817 to 0.912 0.675 0.567 to 0.797 Wilsonb 0.869 0.837 to 0.901 0.672 0.610 to 0.733 Jeffreyb 0.872 0.838 to 0.901 0.675 0.611 to 0.734 Z&Lb 0.872 0 to 1 0.675 0 to 1

A&C, Agresti-Coull; Z&L, Zhou and Li method. a Correction method. b Multiple imputation method.

TABLE 4 Diaphanography data: outcome of reference standard in those verified and fraction of unverified test results (adapted from Harel and Zhou 200623)

Breast cancer positive Breast cancer negative No verification

Diaphanography positive 26 11 30 Diaphanography negative 7 44 782 Total 33 55 812

TABLE 5 Estimates with 95% CIs of sensitivity and specificity for different methods to adjust or impute missing data on reference standard outcome (based on data from Table 4)

Procedure Sensitivity Specificity

Estimate 95% CI Estimate 95% CI

Complete cases 0.788 0.649 to 0.927 0.800 0.694 to 0.906 Begg and Greenesa 0.280 0.127 to 0.434 0.974 0.960 to 0.989 Logit version Begg and Greenesa 0.280 0.275 to 0.285 0.974 0.973 to 0.975 A&Cb 0.706 0.560 to 0.852 0.861 0.753 to 0.970 Rubin (logit)b 0.717 0.548 to 0.841 0.869 0.721 to 0.944 Wilsonb 0.706 0.601 to 0.812 0.861 0.839 to 0.884 Jeffreyb 0.718 0.603 to 0.815 0.863 0.839 to 0.885 Z&Lb 0.717 0 to 1 0.869 0 to 1

A&C, Agresti-Coull; Z&L, Zhou and Li method. a Correction method. b Multiple imputation method. 15

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Results

Complete case analysis seems to overestimate but a of plausible values are available. This sensitivity and underestimate specificity, whereas range of values can then be applied in a sensitivity Begg and Greenes’s correction methods seem analyses that will produce a range of estimates of dramatically to underestimate sensitivity and accuracy for the index test. overestimate specificity. Extensions of the basic model Harel and Zhou used the MAR assumption where In many clinical situations, the assumption of missingness is related to index test results only. In conditional independence is unlikely to be true. practice, factors other than index test results, such Often the index test and reference standard have a as symptoms, signs, co-morbidity or other test tendency to make errors in the same (difficult) results, may have driven the decision to apply or patients, particularly when index test and to withhold the reference standard. Imputation reference standard are methodologically related or models can be extended to incorporate these measure the same physiologic alteration.56 In the additional sources of information, if known and detection of cancer, for example, both the index collected, in an effort to improve the prediction of test and reference standard may have difficulties the reference standard outcome in unverified in detecting early stages of cancer, and in clinical patients. chemistry, contamination of body fluid samples may affect both the index test and reference standard. Algebraic functions have therefore been Correct imperfect reference developed that build in the conditional standard dependence, such as the correlation in errors.57 Unfortunately, the amount of correlation between Description of the method errors is rarely known, hence this parameter is If a true gold standard does not exist, a logical next often varied over a wide range of plausible values. step is to search for the best available procedure to verify index test results. This is the rationale behind Strengths and weaknesses the concept of reference standard: the best available These correction methods can easily be applied method to determine the presence or absence of using simple algebraic functions. The main the target condition, not necessarily without error. limitation is that in most situations the true However, imperfection of the reference standard sensitivity and specificity of the imperfect leads to bias known as reference standard bias.11,19,53 reference standard are not known, nor does one If knowledge about the amount and type of errors of know the amount of correlation among errors of the imperfect reference standard is available, then the index test and the reference standard. If the this information can be used to correct estimates of chosen values do not match the true values, the diagnostic accuracy. We assume that all patients resulting estimates of diagnostic test accuracy have received the imperfect reference standard, so would still be biased or could become even more there are no missing data in contrast to the biased than the unadjusted ones.58 For these correction methods described in the previous reasons, researchers have tried to generalise the section. algebraic correction functions to avoid making untenable assumptions.54 Latent class analysis Basic model subsequently evolved, which can be viewed as an The basic model assumes that the error rates of extension of this method for situations where the reference standard are known and that there is no information available on either the algebraic functions can be used to recalculate error rates of the reference standard or the true estimates of accuracy. Key publications by Hadgu (see the section ‘Latent class analysis’, and colleagues54 and Staquet and colleagues55 p. 26). In latent class analysis, the result of the provide several algebraic functions, all based on (imperfect) reference standard is incorporated as the assumption of conditional independence just one source of information about the true between the index test and reference standard disease status. results. The assumption implies that the errors of both tests are independent of the underlying true Applicability status of the target condition, hence the index test The basic correction method can be applied in and reference standard do not tend to err in the any situation if the following conditions are met: same patients. reliable information on the magnitude of the error rates of the reference standard is available and the Sometimes, the exact sensitivity and specificity of conditional independence assumption is likely to 16 the imperfect reference standard are not known, be true, or there is reliable information about the Health Technology Assessment 2007; Vol. 11: No. 50

correlation between errors in the index test and another cell because we assume that cancer cells the reference standard. In all other situations, the identified by pathological evaluation always method is likely to render accuracy estimates that indicate to cancer. However, in reality, there could are biased, despite the use of a correction method. be more observations in the true positive cell when some observations were wrongly classified as false Software positives, which can happen when not enough or The algebraic functions can be used in widely not the correct biopsies were taken from a lesion. available software programs such as Excel or any These observations would be classified as false- other statistical program. positives when in reality they belong in the true positive cell. The magnitude of this type of Clinical example misclassification is assumed to be small but The development of sensitive tests for bladder unknown. cancer is critical for its early detection. A gold standard does not exist for the evaluation of For the false negative and true negative cells, the endoscopic tests as ideally this would necessitate mechanism is analogous. If cancerous lesions are removing the bladder for detailed pathological missed by both index tests, they are classified as examination, a procedure that would not be true negatives, but in reality they are false ethically justifiable. Therefore, pathological negatives. False negatives can be observed in this evaluation is done only on biopsy or surgically study because of the paired design of this study; removed tissue. In other words, only positive for example, patients underwent both index tests findings during endoscopy (index test) are so that a true positive finding with one technique verified. Histological examination alone is can be considered a false negative finding for the therefore not a reference standard providing other technique if that technique missed that perfect classification because of the unknown lesion. The magnitude of this misclassification of likelihood of missing cancerous lesions because negative index test results is also assumed to be they were not biopsied. This is the downside of small but larger than the misclassification in the not being able to verify negative findings of the positive tests. index test. The following misclassification model was used to Assumptions are necessary to correct for biases correct the counts in the observed 2-by-2 table and inherent in this approach. Schneeweiss and recalculate sensitivity based on these corrected colleagues59 provide an example of derivation of counts. Let p(A) be the probability of an interval of sensitivity estimates that includes the misclassifying true positives as false positive results true value. They compared the ability of and p(B) be the probability of misclassifying false 5-aminolevulinic acid-induced fluorescence and negative as true negative results. A more general white light endoscopy to detect bladder cancer. equation can now be derived for estimating test This is an example of a comparative diagnostic sensitivity: accuracy study with two index tests. They hypothesised that 5-aminolevulinic acid-induced ‘true positive’ + p(A) × ‘false positive’ fluorescence endoscopy is of superior diagnostic Sensitivity = ––––––––––––––––––––––––––––––––––– value. A total of 208 patients under surveillance ‘true positive’ + p(A) × after superficial bladder cancer were included. ‘false positive’ + ‘false negative’ + Multiple evaluations over time within the same p(B) × ‘true negative’ patient were included, leading to a total of 328 endoscopic evaluations in which both procedures If it were assumed that there would be no (index tests) were performed. They used sensitivity misclassification, sensitivity could be easily as the main accuracy parameter as the calculated from the observed numbers in the consequences of false negative cases would be 2-by-2 table, the so-called naive estimate. The worse and these results should be minimised by a analysis that Schneeweiss and colleagues propose good test. comprises most optimistic, most pessimistic and some realistic assumptions to generate an interval The effect of having no gold standard is that there of sensitivity.59,60 The maximum interval of can be misclassification among the four cells in the observable sensitivity for 5-aminolevulinic acid- 2-by-2 table. The observed true positive cell induced fluorescence endoscopy ranged between consists of lesions that are positive with the index 78 and 97.5%, and the best estimate for sensitivity test and confirmed on biopsy. It is unlikely that based on realistic assumptions was 93.4% (95% CI any observation in this cell should belong in 90 to 97.3). The best sensitivity estimate for white 17

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Results

light endoscopy was 46.7% (95% CI 39.4 to 54.3, predict, as it depends on the proportion of maximum range 47.2–53%). patients verified differently, the selection process behind patients verified differently, the properties This method to determine the maximum possible of the reference standards involved and their range of sensitivity estimates in studies in which relation with the index test.20 negative findings cannot all be verified is easily applied. Depending on the assumptions, a range Strengths and weaknesses of reasonable scenarios can be constructed and the Differential verification appears to escape the bias corresponding sensitivities can be reported. of incomplete (partial) verification, but can lead to estimates of diagnostic accuracy that differ from those obtained with full verification by the Construct reference standard preferred reference standard.1,3 Moreover, estimates of sensitivity and specificity are more Differential verification difficult to interpret, as they are based on multiple Description of the method index test–reference standard combinations.20 If Instead of ignoring, imputing or correcting the complete verification by the preferred reference is non-verified results in the analysis, a second not possible and different reference standards reference standard can be used to achieve have to be used, the best approach is to complete verification. Use of different reference incorporate differential verification in the design. standards in this way is also known as differential This means prespecifying the group of patients verification (see also Figure 2, p. 4).19 The second that will receive the first reference standard and reference standard is usually a less invasive test the group that will receive the second. An example and is less costly or less burdensome to patients. could be that all patients with a positive index test In the detection of pulmonary embolism (PE), are verified by one reference standard and all for example, it is nowadays considered unethical negative patients are verified by the second to perform pulmonary angiography in patients reference standard. In that case, the appropriate with a low suspicion of PE and a negative D-dimer measures of accuracy are the positive and negative test result.61 In many of these studies, only predictive values, calculated with corresponding patients at high-risk for PE receive the best reference standards. Because these measures are available reference standard (angiography), directly influenced by changes in prevalence the whereas ‘low-risk’ patients are likely to receive a right design has to be chosen, such as cohort different reference standard, such as clinical rather than case–control. follow-up. Field of application The use of different reference standards between Differential verification can be a reasonable option patient groups is frequently encountered in the if several acceptable diagnostic tests are available literature. In a survey of 31 diagnostic reviews, that can serve as the reference standards. A differential verification was present in 99 out of prerequisite, however, is that the verification 487 (20%) primary diagnostic accuracy studies.3 In scheme is preplanned (design-based). As in any most cases, incomplete verification is neither design, authors should report the rationale for specified in the design nor completely at random. each reference standard. Moreover, results should Triggers to perform the preferred reference be reported separately for each index standard in some patients and not in others test–reference standard combination. include a positive result on the index test, positive results from other tests or the presence of risk Software factors for the condition of interest. This means Studies with differential verification produce that differential verification is selective and shows standard accuracy results. No additional a non-random pattern, being based on decisions programming is necessary. by the practitioner or patient. Clinical example Differential verification has been shown to lead to An illustrative example of differential verification higher estimates of accuracy than studies using a that was (partly) design based can be found in the single reference standard in all patients.1,3 This publication of Kline and colleagues.63 The type of bias is known as differential verification objective of the study was to evaluate the bias, or different reference standard bias, work-up diagnostic accuracy of the combination of D-dimer bias or .19,62 The effect of differential assay with an alveolar dead-space measurement 18 verification on estimates of accuracy is difficult to for rapid exclusion of pulmonary embolism. Health Technology Assessment 2007; Vol. 11: No. 50

In this multicentre study, V/Q scans, helical are then used to update the final 2-by-2 table. computed tomography, ultrasonography, clinical Based on this final 2-by-2 table, estimates of follow-up, angiography, death, events of deep accuracy for the index test are calculated. venous thrombosis or new events of pulmonary embolism and combinations of these tests were Strengths and weaknesses used as diagnostic criteria to determine the The method is straightforward and easy to carry presence or absence of pulmonary embolism. The out without statistical expertise. At first sight, this authors describe a verification scheme to establish design seems to provide an efficient alternative to the final diagnosis that at first seems to be design reduce the number of patients who have to be based, but later on they state that the decision to tested with the best available reference standard order further imaging in patients with non- when this standard is either invasive or costly to diagnostic V/Q scans was at the discretion of the apply. Yet a fundamental problem of discrepant attending physician. analysis is that the verification pattern is dependent on the index test results.64,66 Although The authors provide a table describing the discrepant analysis provides the status of the target different criteria for the presence of PE and the condition for those who are retested, it does not number of patients. Unfortunately, the provide that information for those not retested, corresponding results of the combined which is usually the majority. Discrepant analysis D-dimer–alveolar dead-space measurement were therefore has the potential to lead to serious not stated in this table. In the results section, bias.64,65,67–71 sensitivity, specificity, negative and positive likelihood ratios and the corresponding CIs were The potential for bias has been demonstrated calculated in the usual way. algebraically and numerically in the estimation of sensitivity and specificity in a situation where the In the discussion, the authors do address the resolver test is a perfect gold standard.66,68,69 In problem of differential verification. They state that this situation, discrepant analysis will lead to computed tomography, for example, performs estimates of sensitivity and specificity for the index differently from angiography, and may have test that are biased upwards, so that the index test wrongly diagnosed some patients with PE who appears to be more accurate than it really is. The may have been free from PE. magnitude of this bias depends on the absolute number of errors of the index tests (false positive The results of this study may not be very helpful and false negative index test results) and the to clinicians in practice, as it is unclear to which amount of correlation between the errors of the patients the estimates of diagnostic accuracy apply. index test with the results of the first reference standard. These are correlated errors that will Discrepant analysis erroneously appear as either true positives or true Description of the method negatives in the final 2-by-2 table as these patients Discrepant analysis, also referred to as discrepant will not enter the second stage and therefore will resolution or discordant analysis, uses a not be corrected by the perfect resolver test. combination of reference standards in a sequential manner to classify patients as having the target Green and colleagues showed that when the condition or not.22,64,65 Discrepant analysis can resolver test is not a perfect gold standard, even provide estimates of prevalence and estimates of larger biases are possible, as the second imperfect sensitivity and specificity of the index test without reference standard can lead to further statistical modelling. misclassification of the discordant results.72 If errors of the stage 2 resolver test are correlated with errors Initially, all patients are tested with the index test of either the index test or the stage 1 imperfect and one imperfect reference standard (Figure 4). reference standard, the diagnostic accuracy of the Since the imperfect reference standard is known to stage 2 reference standard will be biased in the be imperfect, the discordant or discrepant results assessment of discordant results. In other (cases where index and first reference standard circumstances, the measured values may actually be disagree) are retested (resolved) with an additional closer to the true values.66 In general, situations reference standard, frequently called the resolver where the errors of the index test and the resolver test. The resolver test is usually a more invasive, test are related are of particular concern. more costly or otherwise burdening test, with better discriminatory properties than the first A method to correct for the potential bias in the reference standard. The results of the resolver test discrepant analysis has been proposed,66,73 but this 19

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Results

method is considered to be of limited applicability. (detection of infections),64,66 it can, in theory, The method requires that a random sample of be applied in all other medical fields. The patients with concordant results of the index test general notice is to refrain from discrepant and the initial reference standard is tested by the analysis because of the fundamental problem resolver test. From this sample, the concordance of the implicit incorporation of index test rate of false results is estimated and the observed results in the definition of the true disease sensitivity and specificity are corrected accordingly. status leading to potential bias.64,65 This procedure is adequate only if the resolver test (http://www.fda.gov/cdrh/osb/guidance/1428.pdf). is (nearly) perfect, a situation which occurs Only in special situations can the choice for infrequently. discrepant analysis be defended if there is a perfect resolver test and a random sample of Field of application concordant results is also verified by the resolver Although discrepant analysis has been most test using the correction method proposed by frequently applied in the field of microbiology Begg and Greenes.22,48

Study population

Index test

Reference standard 1 (R1)

Discordant results: Concordant results: T+/R1 – = ? T+/R1+ = TP T–/R1+ = ? T–/R1– = TN

Reference standard 2 (R2) Resolver test

Discordant → Resolver results: T+/R1– → R2+ = TP → T+/R1– R2– = FP Ta rget condition T–/R1+ → R2+ = FN T–/R1+ → R2– = TP Present Absent

TP = FP = + T+/R1+ & Index test T+/R1–/R2+ T+/R1–/R2– Final 2 ϫ 2 Table: FN = TN = Result – T–/R1– & T–/R1+/R2+ T–/R1+/R2–

20 FIGURE 4 Flow of patients and final 2-by-2 table in a study using discrepant analysis Health Technology Assessment 2007; Vol. 11: No. 50

Software Combining the results from the initial stage Any statistical package can be used. The (Table 6) and the second resolver stage (Table 7) calculations can also be done with a pocket leads to the final 2-by-2 table (Table 8) on which calculator. the accuracy can be calculated.

Clinical example The estimates of accuracy after retesting the 10 To diagnose Chlamydia trachomatis infection, discordant results with PCR are as follows: culture, DNA amplification methods such as polymerase chain reaction (PCR) and antigen prevalence is 26/324 = 0.080 detection methods such as enzyme immunoassay sensitivity is (20 + 4)/(20 + 4 + 3 – 1) = (EIA) are used (the same example is also used in 24/26 = 0.923 composite reference standard and latent class specificity is (294 + 1)/(294 + 1 + 7 – 4) = analysis). Alonzo and Pepe25 discuss the problems 295/298 = 0.990. of assessing the accuracy of EIA (index test) as neither of the two other methods can be If we would have calculated the estimates of considered ‘gold standards’. Culture is believed to accuracy directly after using culture as a single have nearly perfect specificity, but also misses reference standard, they would have been as patients with Chlamydia infection (false negatives). follows: PCR is believed to be more sensitive than culture to detect Chlamydia trachomatis. Alonzo and Pepe prevalence is 23/324 = 0.071 use the method of discrepant analysis to assess the sensitivity is 20/23 = 0.870 accuracy of EIA by using culture as the initial specificity is 294/301 = 0.977. reference test and PCR to resolve the discrepant results in the second stage (resolver test).25 Changes in this case are small because the frequency of discordant results was relatively low The data are derived from the study of Wu and (3.1%). The validity of the approach depends on colleagues.74 In the first stage, all 324 specimens the error rate of the resolver test and the are tested by EIA (index test) and culture (Table 6). frequency of errors in the concordant results in the In the second stage, only the discrepant results, for initial stage. example, specimens which are positive by EIA and negative by culture (n = 7) or vice versa (n = 3), are Composite reference standard retested with the resolver, PCR. The concordant Description of the method results directly enter the final 2-by-2 table as true In the absence of a single gold standard, the positives (n = 20) and true negatives (n = 294). results of several imperfect tests can be combined to create a composite reference standard. The In the second stage, the 10 discordant results are results of the component tests of the composite tested with PCR and the outcome of the PCR test determines the classification in the final 2-by-2 table (Table 7). The EIA positive and culture TABLE 7 Results of the second stage: selective testing of negative results become either true positives if the discrepant results with PCR, the resolver test PCR is positive (n = 4) or false positives if the PCR outcome is negative (n = 3). The same rule Discrepant results Resolver test PCR applies for the EIA negative and culture positive +– results (n = 3): the outcome of the PCR determines the final classification as either true EIA+ and culture – (n = 7) 4 3 negative if PCR is negative (n = 1) or false EIA– and culture + (n = 3) 2 1 negative if PCR is positive (n = 2).

TABLE 6 Initial classification using culture as the first TABLE 8 Final classification based on discrepant analysis (imperfect) reference standard method

Culture EIA Final classification

EIA + – +–

EIA+ 20 7 EIA+ 20 + 4 3 EIA– 3 294 EIA– 2 294 + 1 21

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Results

reference standard determine the presence of the definition was used in a study evaluating target condition based on a prespecified rule. Such Helicobacter pylori, where a patient was regarded a composite reference standard is believed to have infected whenever two or more of the six available better discriminatory properties than each of the tests were positive.78 reference standards components in isolation.24,25,64,66,75 The of the composite reference standard can often be improved by avoiding redundant Basic model testing. If the classification rule is based on one of A basic model is where two tests are applied in all the two reference tests being positive, there is no patients and a prespecified (deterministic) rule is need to perform a second reference test once the used to classify patients as having the target outcome of the first reference test is positive. condition. Several definitions for the presence of Hence retesting is only necessary in patients with a the target condition can be used, depending on negative result on the first reference test. whether emphasis should be given to detecting or Performing the more accurate reference test first excluding the target condition and on the lowers the number of patients needing additional characteristics of the available reference tests. Most testing. frequently, researchers define the target condition to be present if either one of the reference tests is Another extension is the use of statistical models positive. In the evaluation of the diagnostic to combine several reference test results to define accuracy of EIA in the detection of Chlamydia disease status as in latent class analysis. This trachomatis, for example, a composite standard of approach is discussed separately, as it does not use two reference tests has been used: culture and a prespecified deterministic rule for classifying PCR.76 EIA and the two reference tests were patients having the target condition or not (see applied in all patients and infection with Chlamydia the section ‘Latent class analysis’, p. 26, for more trachomatis was diagnosed if either culture or PCR details). was positive. If both reference tests were negative, the person was labelled free of Chlamydia Strengths and weaknesses trachomatis. The method is straightforward and easy to understand. It allows one to combine several It is important to note that the composite sources of information in order to assess whether reference standard approach differs from the target condition is present. The prespecified discrepant analysis (see the section ‘Discrepant rule to define when the target condition is present analysis’, p. 19), as the results of the index test increases transparency and avoids problems of themselves play no role in the verification incorporation and work-up bias. Unlike discrepant procedure of the composite method. The analysis, the application of the second reference difference between using a composite reference standard is independent of the index test result. standard and differential verification is that in the composite method each patient receives all A major issue is whether the combination of (necessary) components of the composite reference reference test results is a meaningful way of standards whereas in differential verification defining the target condition and whether residual subgroups of patients are verified by one reference misclassification is likely to be present. Do standard and other subgroups by a different different reference standard tests look at the same reference standard (see also the section target condition, but with different error rates, or ‘Differential verification’, p. 18). do they define the target condition differently? The latter is thought to reduce the clinical Extensions of the basic model usefulness.24 The inclusion of more than two The composite reference standard approach can reference tests in the composite reference standard be extended to include more than two reference may then obscure the final definition of the tests, as for instance in the detection of myocardial disease. infarction, where chest pain, serological markers and electrocardiograms are used to define the A general problem is the determination of the presence or absence of the disease.77 Increasing ‘cut-off value’ to classify patients as having the the number of reference tests amplifies the target condition when several imperfect reference number of ways to define the presence of the tests are available. Several suggestions have been target condition. If six reference tests are made in literature. Alonzo and Pepe referred to combined, for example, one possibility is the ‘any latent class analysis as a tool for finding an 22 test positive’ definition of disease. Another optimal cut-off to define the target condition, but Health Technology Assessment 2007; Vol. 11: No. 50

TABLE 9 summarising composite reference for the Chlamydia data: EIA is the index test and the elements of the composite reference standard are culture and PCR.

EIA Culture Composite reference standard: culture + PCR

+– + –

EIA+ 20 7 20 + 4 7 – 4 EIA– 3 294 3 + 2 294 – 2

they concluded against its use, as they see the They use the either positive rule to classify model as being prone to bias due to its patients as having Chlamydia infection meaning assumptions64 (see also the ‘Latent class analysis’, that any specimen that is culture positive or PCR p. 26). Some authors have suggested to use all positive is composite reference standard positive possible ‘cut-off values’ to define the target and any specimen that is culture negative and condition rather than choosing a single one. They PCR negative is composite reference standard evaluate the index test against all these definitions negative. This approach allows one to use several and show how the accuracy of the index test varies sources of information in order to assess if an across these definitions.75,79 infection is present based on an a priori rule.

Field of application Because they use the either positive rule, there is The composite reference standard method can be no need to test all specimens with both reference considered in all situations where a single gold tests of the composite reference standard: if a standard does not exist, but several imperfect specimen is positive on the first reference test it reference tests are available. One potential benefit will be positive on the composite reference is that it allows the researcher to ‘adjust’ the standard and no further testing is therefore threshold for disease depending on the clinical required. In the Chlamydia setting, in the first stage problem at hand. In situations where missing the all specimens are tested by EIA (index test) and target condition outweighs the risk of wrongfully culture. In the second stage, only those specimens labelling persons as diseased, using a cut-off of which are culture negative at the first stage are ‘any result positive indicates disease’ can be tested with PCR. advantageous. The composite reference standard method has been used in many areas, including Considering the data from the study by Wu and infectious diseases64,75,76 and gastroenterology.79 colleagues74 a contingency table summarising the two stages of composite reference standard is Software given in Table 9. After the first stage in assessing The method is based on simple predefined the performance of EIA using the composite classification rules, and requires no additional reference standard, the 301 culture negative software. Measures of accuracy including CIs are specimens (7 + 294) were tested with PCR. Six of calculated in their traditional way. these specimens were PCR positive and therefore considered to be infected based on the composite Clinical example reference standard. The final estimates of accuracy To diagnose Chlamydia trachomatis infection, are as follows: culture, DNA amplification methods such as PCR and antigen detection methods such as EIA are prevalence is 29/324 = 0.090 used. (The same example is used in discrepant sensitivity is (20 + 4)/(20 + 4 + 3 + 2) = analysis and latent class analysis.) Alonzo and Pepe 24/29 = 0.828 discuss the problems of assessing the accuracy of specificity is (294 – 2)/(294 – 2 + 7 – 4) = EIA (index test) as neither of the two other 292/295 = 0.990. methods can be considered ‘gold standards’.25 Culture is believed to have nearly perfect Because there is an a priori rule involving only the specificity. PCR is believed to be more sensitive two reference test, the results of the index test than culture to detect Chlamydia trachotomatis. (EIA) play no role in the classification of patients. This is different from the discrepant method Alonzo and Pepe provide an example of because there the index test does play a role as assessment of EIA’s accuracy using a composite only discrepant results move on to the second reference standard based on culture and PCR.25 stage. To put it differently, in the composite 23

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Results

example the classification would not have changed In the selection of experts, the qualities of the if all specimens had been tested by both reference experts are expected to be of greater importance tests (culture and PCR). Because of the nature of than the number of experts in a panel. the a priori rule (either positive), redundant testing Judgements by specialists may come closer to the can be avoided to gain efficiency. true status of patients than a random or convenience sample of physicians. An uneven One drawback of the composite reference number of experts is sometimes recommended as standard is that it requires testing the typically it facilitates decision-making in case of majority large number of specimens negative by culture. voting.80

Panel or consensus diagnosis In general, all information relevant for the Description of the method classification of patients is presented to the experts In a panel or consensus diagnosis, a group of in a standardised way. Especially, subtle experts determines the presence or absence of the information from history taking might be difficult target condition in each patient based on multiple to get across on paper. Video footage of the sources of information. These sources can include original history taking or video films of dynamic general patient characteristics, signs and radiological examinations might be an option. symptoms from history and physical examination, and other test results. Information from clinical Special care should be given to the role of the follow-up may also be included as an additional index test result in a panel diagnosis. If the index source of information to improve the classification test result is given to the experts, its importance of patients by the experts. may be overestimated, leading to inflated measures of accuracy. This is known as Variation exists in how to synthesise the input incorporation bias.2 On the other hand, the final from different experts. Experts can discuss the classification in patients with and without the information on each patient directly in a meeting target condition may become better if the results and produce a consensus diagnosis using majority of the index tests are disclosed to the experts. This voting in case of disagreement. In other studies, would then reduce the problem of misclassification experts determined the final diagnosis of the disease status. Most authors advise independently from each other and patients were withholding the index test result from the experts. only discussed in a consensus meeting in case of An attractive alternative is to use a staged disagreement among experts. The final diagnoses approach where experts make a final diagnosis from individual experts can also be combined without the index test results first and then reveal using a statistical model. the index test result and ask whether experts would like to change their classification. Because a final diagnosis is determined in all patients, researchers can calculate estimates of As stated before, several methods of synthesising accuracy in the traditional way without the use of the input from different experts have been specialised software. applied. Standard consensus meetings have been questioned because of problems arising from Several practical decisions have to be made in powerful personalities and also peer or group setting-up a panel method. These include: pressures. The Delphi procedure is a formal technique to collect and synthesise expert ● the choice of experts: number and background opinions anonymously to overcome these types of ● the number of items to be considered in problems.81 reaching a diagnosis and the way to present them Offering training and piloting sessions to experts ● whether or not to include the results of the can be used to increase agreement among experts. index test Piloting of the disease classification process may ● combining the input from experts unearth problems that can then be addressed ● training and piloting. prior to the actual consensus process. Whether or not fixed decision rules need to be established Surprisingly little has been written in the medical during this piloting process is a matter of debate. literature about these practical decisions and how they can affect the final outcome. The discussion Strengths and weaknesses that follows is therefore largely based on Consensus diagnosis is an attractive alternative if a 24 theoretical considerations. generally accepted reference standard does not Health Technology Assessment 2007; Vol. 11: No. 50

exist and multiple sources of information have to procedure and multiple sources of information be interpreted in a judicious way to reach a have to be interpreted in a judicious way to reach diagnosis. Examples include target conditions that a diagnosis. In particular, the panel method is can lead to a wide range of symptoms and that suited for target conditions that cannot be cannot be diagnosed histologically. In those cases, unequivocally defined. Examples in which panel the target condition can be defined by the diagnoses have been used include , clinicians, who take into account items of history, the underlying causes in patients with syncope and clinical examination, other test results and clinical underlying conditions in patients presenting with follow-up, in order to determine clinical dyspnoea. management. In such cases, the panel method closely reflects the clinically relevant concept of Software target condition. The accuracy statistics calculated No specialised software is needed, unless advanced in a study that relies on panel method to classify procedures such as latent class analysis are used as disease are likely to have high levels of a method of combining the results from various generalisability to clinical practice. experts (for details, see the section ‘Latent class analysis’, p. 26). Other advantages of this approach are the flexibility in determining conditions that have Clinical example been poorly defined and the option of classifying Echocardiography is considered an imperfect conditions in a binary (target condition present or reference standard test for heart failure. The absent) and also in a multi-level way (as in severe, European Society of Cardiology recommends that moderate, mild, no disease, indeterminate, etc.). the diagnosis of heart failure be “based on the symptoms and clinical findings, supported by The downside of this method can be poor inter- appropriate investigations such as and even intra-rater agreement, leading to electrocardiogram, chest X-ray, biomarkers and discordance in disease classification between and Doppler-echocardiography”. Convening a within the experts. The levels of inter- and intra- consensus panel to establish the presence or rater agreements for the experts can be measured absence of the target condition is then an and should be reported in the study as part of appropriate approach to address the issue of validation of the consensus method. imperfect reference standard in this study.

Second, the subjectivity of the disease classification Rutten and colleagues evaluated which clinical process may introduce bias. Incorporation bias or variables provide diagnostic information in test review bias looms if the result of the index test recognising heart failure in primary care patients is disclosed to the experts. with stable chronic obstructive pulmonary disease (COPD), and whether easily available tests provide Third, the absence of a strict definition of disease added diagnostic information, in particular in panel-based methods may be responsible for a N-terminal pro-brain natriuretic peptide (NT- in the definition of ‘disease’ in the proBNP.82 They studied patients over the age of group of experts, who may have a different view of 65 years with stable COPD diagnosed by a general what constitutes the target condition of interest. practitioner without a previous diagnosis of heart Piloting and some form of standardisation can failure made by a cardiologist. Clinical variables help in remedying this problem. included history of ischaemic heart disease, cardiovascular medications, body mass index, Fourth, the panel method can become a laborious displacement heart and heart rate. The tests enterprise if many items of information per patient included NT-proBNP, C-reactive protein, need to be summarised and if many patients have electrocardiography and chest radiography. An to be discussed among several experts. expert panel made up of two cardiologists, a pulmonologist and a GP determined the presence Finally, domination by assertive members of the or absence of heart failure by consensus. The panel can weaken any consensus procedure, panel used all available information, including although formal techniques such as the Delphi echocardiography, but did not use NT-proBNP in method are available to reduce this problem. making the diagnosis. When there was no consensus, the majority decision was used to Application allocate the diagnostic category. Whenever a The panel diagnosis method is well suited when situation of evenly split votes arose, the majority there is no generally accepted reference standard decision amongst the two cardiologists and the GP 25

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Results

was used to reach a diagnosis, thus leaving out the the true diagnosis. These unobservable outcomes vote of the pulmonologist. The situation of split are named latent variables. These latent variables vote only occurred in a minority of the cases can only be measured indirectly by eliciting (5/405, 1%). The authors re-presented a random responses that are related to the construct of sample of patients to the expert panel, blinded to interest. These measurable responses are called the original decision, and found excellent level of indicators or manifest variables. Latent variable reproducibility (Cohen’s k = 0.92). models are a group of methods that use the information from the manifest variables to identify A limited number of items from history could help subtypes of cases defined by the latent variable. GPs identify those with heart failure. NT-proBNP and electrocardiography were the two tests that The problem of evaluating an index test in the were found to be useful in improving the accuracy absence of a gold standard can be viewed as a of diagnosis. As fully appreciated by the study latent variable problem. In our case we have authors, the study suffers from the possibility of dichotomous latent variable, namely whether or incorporation bias, except in the case of the index not patients have the target condition. When the test of NT-proBNP, as results of this test were not latent variable is categorical (dichotomous being a made available to the consensus panel. The special case within this group), the latent models authors postulated that the magnitude of the are referred to as latent class models, whereas incorporation bias is likely to have been small as when the latent variable is continuous they are most diagnostic determinants were not crucial in named latent trait models. Results of the index test the panel diagnosis process. This postulate is and other imperfect tests are then the observable supported by the use of echocardiography, which (manifest) variables that can be used to estimate is central in the diagnosis of heart failure, only as the parameters that are linked to true part of the reference standard, and not as one of status, like sensitivity, specificity and prevalence. the index tests. Maximum likelihood methods can be used to estimate these parameters of the latent model. No specific algorithms or guidelines were given in the article on how the panel weighted and The latent class approach has similarities with the combined the various items of history, panel or consensus diagnosis methods, which also examination and investigations findings, although uses multiple pieces of information to construct a references were made to previous studies which reference standard. In the panel method, experts may have given some guidance on this. determine whether each patient has the target Standardisation of the disease categorisation condition given a set of test results, whereas a process and threshold for disease positivity are formal statistical model is used to obtain the recognised to be problems with heart failure. It is statistics of interest in the latent class approach. not clear if domination by a specific person or persons in the consensus panel was an issue, as Basic model this is certainly plausible when the panel has The statistical framework underlying latent class generalists and specialists with varying levels of models can be illustrated using the following basic expertise in cardiology. Despite these weaknesses, example. In this example we have three different this study is an excellent example of the use of a tests being applied in all patients with each test consensus panel as a reference standard. producing a dichotomous test result (e.g the test is either positive or negative). All three tests relate to Latent class analysis the same target condition, but none of them is This group contains many variations, but all error free. For a single test, the probability of methods have in common the use of a statistical obtaining a positive test result can be written as model to combine different pieces of information the sum of finding a positive test in a patient who (test results) from each patient to construct a has the target condition (true positive result) or a reference standard. These methods acknowledge positive test result in a patient without the target that there is no gold standard and that the condition (false positive result). These probabilities available tests are all related to the unknown true (Prob) can be written as a function of the following status: target condition present or absent.64,83–89 unknown measures: prevalence (prev), sensitivity of test 1 (sens1) and specificity of test 1 (spec1). The problem that the outcome of interest cannot The probability of finding a true positive result for be measured directly occurs in many research test 1 (TP1) can be written as situations. Examples include constructs such as × 26 intelligence, personality traits or, as in our case, Prob(TP1) = prev sens1 Health Technology Assessment 2007; Vol. 11: No. 50

and the probability for obtaining a false positive has to be equal to the total number of patients in result (FP) as the study, which means that we have 8 – 1 = 7 degrees of freedom. If the degrees of freedom are × Prob(FP1) = (1 – prev) (1 – spec1) equal to or greater than the number of parameters Therefore, the probability of a positive test results to be estimated, standard maximum likelihood for test 1 is the sum of these two probabilities methods can be used to obtain a (unique) solution. More mathematical details can be found × 85,90 Prob(+) = Prob(TP1) + Prob(FP1) = prev elsewhere. × sens1 + (1 – prev) (1 – spec1) (1) Extensions of the basic latent class model In the same way, we can write the probability for Several extensions to this basic model obtaining a true negative (TN) result as (dichotomous latent variable, three dichotomous × tests assuming uncorrelated errors) have been Prob(TN1) = (1 – prev) spec1 formulated. Here we discuss the main extensions and for a false negative (FN) test result as relevant for diagnostic research. × Prob(FN1) = prev (1 – sens1) The number and type of tests and the probability of a negative test result as Latent class models are flexible and can incorporate dichotomous results, but also ordinal 90 Prob(–) = Prob(TN1) + Prob(FN1) = test results or continuous test results. The model × × (1 – prev) spec1 + prev (1 – sens1) (2) can easily be extended to incorporate the results of more than three tests. Including additional tests Of course, the probabilities of the other tests are is beneficial from a modelling point of view as it defined in the same way, but then using the increases the available degrees of freedom (more sensitivity and specificity of that specific test. This tests lead to more test results combinations). More means that there are seven unknown parameters degrees of freedom mean that more parameters in this example: one prevalence parameter and can be estimated, for instance a correlation the sensitivity and specificity for each of the three parameter to acknowledge that errors between tests (1 + 6 = 7). tests might be correlated (see also the section below on conditional dependence). The extra With three different dichotomous tests, there are degrees of freedom can also be used for additional eight possible combinations of test results: all tests checks of the fit of the model. being positive, three variations where two tests are positive and one negative, three situations where Much attention has been given to estimating the one test is positive and two negative, and the accuracy of a test when only one other additional situation where all tests are negative. By using the test (e.g. imperfect reference standard) is probabilities for a positive [equation (1)] or available.91,92 In this case, the number of negative test result [equation (2)] for each test, we parameters to estimate (1 prevalence + 2 can write down the likelihood of observing each sensitivities + 2 specificities = 5 unknown pattern of test results. Under the assumption of parameters) is larger than the available degrees of statistical independence, the likelihood of freedom (4 test results combinations: ++, +–, –+, observing a specific pattern can be written as the – – minus 1 = 3 degrees of freedom). This means probability of observing that pattern in patients that no optimal maximum likelihood solution can who have the target condition plus the probability be identified: different combinations of values of of observing the same pattern in patients without prevalence, sensitivities and specificities fit the the target condition. data equally well. Only through restrictions can we estimate the parameters of the model, for instance For instance, the probability of observing the assuming that the sensitivities and specificities of pattern ++– is the two tests are equal. Another option is to incorporate prior information about the × × × Prob(++–) = sens1 sens2 (1 – sens3) parameters into the model by using a Bayesian × × prev + (1 – spec1) (1 – spec2) spec3) approach to estimate their values (see the section × (1 – prev) ‘Bayesian framework’, p. 28).

When we carry out such a study, we would observe Conditional dependence the number of patients for each of the eight The basic model assumed that the results of the patterns of test results. The sum of these numbers three available tests were independent conditional 27

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Results

on the true disease status, also known as local with credibility intervals, which can be loosely independence. Independence in this context interpreted as Bayesian CIs. means that the errors of the test are not correlated, e.g. if a diseased patient is misclassified Prior distributions typically are uninformative or by one test, it does not increase the likelihood that determined from the published literature or in this patient will be misclassified by another test. In consultation with experts. The Bayesian approach other words, there is no group of ‘difficult’ is sensitive to the chosen prior distribution used: patients in whom several tests perform less than using different priors can lead to differences in expected. This assumption might hold if tests estimates of diagnostic accuracy. measure different manifestations of the target condition and/or use different clinical methods. In The Bayesian approach can be particular helpful the detection of Chlamydia, for example, when in ill-defined situations, such as situations where antigen detection with EIA, cell culture and DNA the number of parameters to be estimated is large amplification with PCR is used, it is less likely that relative to the available degrees of freedom. The these tests make the same type of errors than if use of prior information in combination with two of the three tests were DNA amplication tests. simulations means that the Bayesian approach can An example where the assumption of conditional obtain estimates where traditional methods fail.96 independence is likely to be violated is in the Bayesian modelling is more complicated as it detection of lumbar herniation if all three requires programming, simulations and available tests are imaging tests, such as magnetic validations of the results. resonance imaging (MRI), CT and radiography, all focusing at the detection of visible abnormalities Strengths and weaknesses of the discus. The latent class analysis methods are well documented, statistically sound and have been For many situations, the independence assumption applied in many areas of research.90 The is unlikely to be true. Ignoring the correlation of characteristics of this method have been errors between tests can seriously affect the extensively studied, including in simulations estimates of accuracy.26,64 To solve this problem, studies64,85,88 and in studies that compare latent we can incorporate the correlation of errors class estimates of accuracy with the estimates between tests into the model.26,88 This requires the derived from a classic design where a gold estimation of additional parameters and therefore standard was available.97 The model provides more degrees of freedom to estimate them. estimates of sensitivity and specificity for all tests Additional degrees of freedom can be obtained by incorporated in the model, which are the indexes including more tests, repeating the study in most commonly used in test evaluation research. another population with a different prevalence of Latent class analysis is a flexible approach that can the condition (but unchanged accuracy estimates) incorporate different types of test results or incorporating prior information using a (dichotomous, ordinal and continuous). Bayesian framework (see the next section). The drawbacks of latent class analysis are well Bayesian framework described in the literature.64,85,89 The greatest The parameters of a latent class model can also be concern is not related to statistical issues but to a estimated through a Bayesian approach instead of more basic principle. In a latent class analysis, the the more traditional maximum likelihood target condition is not defined in a clinical sense. estimation (frequentist approach). The Bayesian Because we make no clinical definition of disease framework estimates the same latent model and in latent class analysis, clinicians can feel parameters, but it explicitly incorporates prior uncomfortable about what the results information to the model.54,91,93–95 In the Bayesian represent.24,64,85 approach, the unknown parameters are all treated as random variables having a probability Latent class analysis does not fully comply with a distribution. Information available on each basic principle of test evaluation research. When parameter prior to collecting the data is evaluating the performance of an index test, it summarised in the distribution, needs to be compared with a standard that is which is then combined with information from the independent of the index test itself. In latent class observed data to obtain a analysis, however, this principle is partly violated distribution for each parameter. The posterior because the index test results are often used to distribution can be used to obtain point estimates construct the reference standard. Considerations 28 for the mean and sensitivity and specificity similar to those mentioned in the panel diagnosis Health Technology Assessment 2007; Vol. 11: No. 50

methods apply (see the section ‘Panel or consensus TABLE 10 Pattern of tests results and their frequency of diagnosis’, p. 24). This problem lies more in the occurrence panel method because experts may overestimate the capabilities of the index test, whereas in the Pattern PCR EIA Culture Observed frequency latent class setting a formal statistical model is applied, which is not sensitive like humans to the 1+++19 reputation of any test. 2++–4 3+–+3 Like any statistical model, the validity of the 4–++1 results of a latent class model depends on whether 5+––2 6–+–3 the underlying assumptions are met. Whether or 7––+0 not a latent class model assumes conditional 8 – – – 292 independence is a critical issue. Simulation studies have shown that violation of conditional independence biases the estimates of diagnostic accuracy.26,64 If tests are positively related to present the observed test results of these tests in a diseased and/or not diseased patients, latent class group of 324 persons, as reported by Alonzo and 64 analysis will yield accuracy estimates which are too Pepe. The frequency of occurrence for each high, especially for the more accurate test. A pattern of test results is given. Latent class analysis problem is that the conditional independence uses the information displayed in the table to assumption cannot be fully tested in a model with determine associations between the diagnostic three dichotomous tests. Additional or repeated tests. More detail on the analytical expressions for testing to increase the degrees of freedom might estimates can be found in the paper by Pepe and 85 not be feasible for ethical or economic reasons. Janes. Referring to the detection of Chlamydia trachomatis with cell culture as the imperfect These observations have led to caution against the reference standard, specificity is assumed to be use of latent class analysis in practice when only close to 100%, whereas a wide range of values 93,98,99 three tests are available. Some even extrapolate have been reported for its sensitivity. this view to all latent class models, as even when it Different priors can be incorporated in the is possible to incorporate more information in the Bayesian approach, for instance an optimistic model, unverifiable assumptions about prior (using a uniform distribution in the range dependence are still required.64,85 80–90%) and a pessimistic prior (in the range 55–65%) for the sensitivity of culture. If nothing is Field of application known about the index test, a prior distribution Latent class analysis can be applied in those for sensitivity and specificity allowing equal situations where multiple pieces of information are weighting in the 0–100% range can be used. available for each patient. Parameter estimates derived from latent class Software analysis are as follows: Several more advanced statistical packages have implemented latent class analysis, such as Splus Prevalence: 0.081 and R. In addition, there are programs specially Sensitivity and specificity of EIA: 0.909 and developed for latent variable models such as 0.990 Latent Gold and LEM. Free software to perform Sensitivity and specificity of culture: 0.834 and Bayesian latent class analysis is available, requiring 0.997 only a user registration (WinBugs). Sensitivity and specificity of PCR: 1.000 and 0.995 Clinical example In the detection of Chlamydia trachomatis, culture, DNA amplification methods such as PCR and Validate index test results antigen detection methods such as EIA have been modelled with latent class analysis (the same In all previous methods, the focus was on example is also used in discrepant analysis and correcting imperfect reference standards, or latent class analysis). Culture is believed to have constructing a new or better reference standard, to nearly perfect specificity. PCR and EIA are calculate the classical indexes of diagnostic test believed to be more sensitive than culture to accuracy such as sensitivity and specificity. In this detect Chlamydia trachomatis. In Table 10, we section, we discuss approaches to evaluate index 29

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Results

test results that go beyond the diagnostic accuracy patients, based on their index test results. That paradigm. Validation is an alternative process to degree of confidence is based on a network of evaluate a medical test in the absence of an associations between the test results and other unproblematic and equivocal reference standard. pieces of information in tested patients. In this context, validity refers to how well the index test measures what it is supposed to be One important way to validate an index test is to measuring.100–102 use dedicated follow-up to capture clinical events of interest in relation to index test results. If the Many instruments in the social sciences and index test will be used for predicting future events, elsewhere in science rely on a validation process to such a prognostic study can be seen as an determine whether or not the instrument can evaluation of the predictive validity of the test.101 serve its purpose. In these cases, the diagnostic The question of whether it is safe to withhold accuracy paradigm often cannot be used because further testing in patients with low probability of the ‘truth’ cannot be observed. The latter applies pulmonary embolism and a negative D-dimer to many hypothetical or conceptual constructs. We result has been studied by looking at the 3-month cannot observe a construct, or any latent variable of venous thromboembolism, and (see the section ‘Latent class analysis’, p. 26). What reported as such.61 we can do is to observe associated attributes, as, for instance, sweating, moaning and asking for One step further is the evaluation of a test within pain medication in the evaluation of pain.101,103 In randomised clinical trials of therapy. One aim of a similar way, we can evaluate associations between the test is to identify patients who are more likely the index test and these attributes. to benefit from the new, active treatment.105,106 A possible design corresponding to that study Three different forms of validity have traditionally question is displayed in Figure 5. been distinguished: content validity, criterion- related validity and construct validity.101 In this Note that all patients receive the index test, but that context the classical diagnostic accuracy design none of them receives a reference standard. After (Figure 1, p. 2) refers to criterion-related validity testing, patients are randomly allocated to either where the reference standard provides the treatment or clinical follow-up. At the end of follow- criterion against which the index test is validated. up, the data collected can be displayed in four 2-by- Several other definitions have been provided, 2 tables (Figure 5). Different study objectives can be including intrinsic validity, logical and empirical answered by these tables. In Tables A and B in validity, factorial validity and face validity. This Figure 5, a conclusion can be drawn as to whether proliferation is in part based on a misinterpretation treatment is beneficial compared with clinical of the classical paper by Cronbach and Meehl, follow-up in index test positive patients and test who pointed to the quintessential nature of negative patients, respectively. Odds ratios and CIs construct validation in all areas where criterion- can be calculated to underpin the conclusions. In oriented definitions cannot be used.104 Tables C and D, conclusions can be drawn concerning the prognostic value of the test within Approaches towards the validation of medical tests the context of subsequent clinical decision-making. explore meaningful relations between index test Table C refers to the ability of the index test to results and other test results or clinical identify patients who are likely to benefit from characteristics, none of which can be uplifted to therapy. The test properties can be either expressed the status of a reference standard, either isolated as event rates (of, for example, poor events) or odds or in combination. Relevant items can come from ratios. Table D refers to the ability to discriminate the patients’ history, clinical examination, between different risk categories for a specific imaging, laboratory or function tests, severity event. Several modifications of the basic scores and prognostic information. randomised controlled trial (RCT) model to evaluate tests have been described by Lijmer and Construct validity is not determined by a single Bossuyt.105 , but by a body of research that demonstrates the relationship between the test Basic design and the target condition that it is intended to No single basic design exists for construct identify or characterise. Validating a test can then validation studies. There are many possible be understood as a gradual process whereby we designs to evaluate the validity of an index, as determine the degree of confidence we can place there are many different predictions possible 30 on inferences about the target condition in tested based on our theory or construct. Health Technology Assessment 2007; Vol. 11: No. 50

RCT design Follow-up Treatment Outcome

Index Patients R test

Follow-up No treatment Outcome

RCT results

A. In patients with T+: B. In patients with T–: C. Treated patients: D. Untreated patients: Poor Favourable Poor Favourable Poor Favourable Poor Favourable outcome outcome outcome outcome outcome outcome outcome outcome

Treat Treat T+ T+

No No treat treat T– T–

FIGURE 5 Randomised controlled trial (RCT) using index test results as a prognostic marker (T = test)

As an example, we look at the evaluation of a new construct validity and the inferences we can draw to evaluate psychological stress. from measuring that construct will grow. Many hypotheses may be generated, including that level of stress increases heart frequency, and Several ways exist to express associations between also blood values of cortisol. An association study the measured attribute and other attributes, events could be conducted to evaluate the correlation or subgroups, including correlation indexes, odds between the score, heart frequency, blood levels of and risk ratios and absolute and relative cortisol and past exposure to psychological stress, differences. such as surviving a Tsunami. An additional approach would be to perform a , to Strengths and weaknesses evaluate whether the elements of the In situations where no reference standard is questionnaire belong to one or more dimensions available, association studies may be the only type (factors), and to match if the identified factors of study that can be performed. The validation of correspond with the theory of psychological stress. the index test will depend on the underlying Still another theory could point out that stress is theories about the target condition, as the latter artificially induced by an intervention, such as provides us with hypotheses about the kind of showing pictures of tortured persons. If so, associations between index test and attributes that participants should have higher scores on the have to be evaluated. If the theory is wrong, totally questionnaire after such an intervention. Stated or in part, any quantitative expression of the alternatively: if patients with high scores are given strength of the association may be misleading. a tranquilliser, then their scores should drop. Here an intervention study would be conducted. If Whenever the index test results fail to show the scores drop after tranquilliser intake in those with hypothesised network of associations between the initially high scores, whereas those with low scores target condition and other observations, more remain at approximately the same level, then the than one conclusion is possible: the index test has new questionnaire can be considered to have low validity, the theory about the target condition construct validity. Additional studies would have to is not correct or both the index test and the theory be conducted. Gradually, our confidence in the are inadequate. 31

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Results

Follow-up

No doppler Care as usual Outcomes

Nulliparous women R

Uterine artery Positive test result >> Outcomes doppler Treatment with aspirin

Negative test result >> Outcomes Care as usual

FIGURE 6 Design of the uterine artery Doppler trial

Specific problems related to clinical follow-up have test, is absent. In the discussion (Chapter 5), we been identified. The nature of the target condition will argue that validation is a more universal way may change during clinical follow-up. Even when of appraising the value of medical tests. present, the target condition is not guaranteed to produce detectable events or complaints. The Clinical example 1: randomised controlled trial length of follow-up has to be chosen judiciously We will use a study by Subtil and colleagues titled for each specific target condition. ‘Randomised comparison of uterine artery Doppler and aspirin (100 mg) with placebo in An additional problem arises when an intervention nulliparous women’ to explain some of the design is used. If the intervention is chosen badly, and and efficiency issues that play a role when using does not achieve what it was intended to achieve, an RCT to evaluate a test.108 the construct validity of the index test will be masked. If the outcome is favourable, any chance The objective of this study was to assess the cannot be attributed exclusively to the effectiveness of a strategy of pre-eclampsia discriminatory characteristics of the index test. prevention based on routine uterine artery Hence test evaluation and management Doppler examination during the second trimester consequences cannot be disentangled in this type of pregnancy, followed by a prescription of 100 mg of test evaluation. of aspirin in those with abnormal Doppler, versus no Doppler strategy. The population were A major drawback of construct validity is that nulliparous women (no previous delivery before there is no single type of study that can be ജ22 weeks) between 14 and 20 weeks’ gestation, prescribed, as in diagnostic accuracy studies. with no history of hypertension. Randomisation Several studies have to be performed, before was used to determine whether women would enough confidence in the validity of the index test receive uterine artery Doppler testing between 22 can be achieved.107 and 24 weeks or not. Women in the no testing arm of the trial would receive usual care, as would the Field of application women in the testing arm with a negative test We hypothesise that this type of test evaluation can result (for the design, see Figure 6). The outcomes be done in all areas of medicine where an of the trial were the development of pre-eclampsia, accepted reference standard, or any other having a baby small for gestational age and 32 criterion to determine the diagnostic accuracy of a perinatal deaths. The trial did not find a difference Health Technology Assessment 2007; Vol. 11: No. 50

in outcomes between the tested (Doppler group) Clinical example 2 and 3: validation studies other and the non-tested group (no Doppler). than trials An example of a validation approach can be found This design offers an assessment of the Doppler test in a series of studies that evaluated the use of and aspirin therapy combination, as women are troponin to identify acute coronary syndrome in randomised to Doppler or no Doppler groups, with chest pain patients. Initial studies have evaluated treatment of the Doppler positive women in the cardiac troponin in an accuracy framework, using Doppler arm. It addresses the question: does testing the original WHO definition of acute myocardial with Doppler (plus aspirin for test positive cases) infarction. The latter invites the use of a prevent pre-eclampsia? In this design, if testing and composite reference standard, looking at treatment were shown not to be effective, it could be characteristic ECG changes – either ST segment because the test is inaccurate or because aspirin is elevation or the development of new Q waves – ineffective. This design, therefore, addresses two confirmed by significant changes in serial cardiac questions in one, but unfortunately, it is not often enzymes.110 Other studies have looked at 30-day possible to segregate the accuracy of the test from outcomes in patients admitted without ST the effectiveness of the treatment. segment elevation on the initial ECG. They found that cardiac events increased significantly with This design has various weaknesses. First, the increasing cardiac troponin values, suggesting that design is not efficient and requires large samples the degree of troponin elevation, and also clinical to achieve satisfactory power. This is because the variables such as previous myocardial infarction only women contributing to the expected and an ischaemic ECG, should be considered difference in outcome in the two trial arms are when deciding treatment.111 These and other those who belong to a small subgroup of those studies have led to the recommendation that with abnormal Doppler result in the Doppler arm. cardiac troponins should be the preferred markers In this study, 2491 women were randomised in a for the diagnosis of myocardial injury.112 More 2:1 ratio to Doppler and control arms, and of recent investigations have indicated that increases these 2491 women, an abnormal Doppler result in biomarkers upstream from markers of necrosis, was found in 239 (10%) women and these received such as inflammatory cytokines, cellular adhesion aspirin. Even with a 2:1 ratio of randomisation to molecules, acute-phase reactants, plaque improve the numbers of women who had destabilisation and rupture biomarkers, abnormal Doppler result (and, therefore, received biomarkers of ischaemia and biomarkers of the active treatment, aspirin), the total number of myocardial stretch may provide an even earlier women who contributed to the difference between assessment of overall patient risk. The studies to the Doppler and control arms was just 239, support this claim are all closer to the validation suggesting an under-powered study. framework than to the accuracy paradigm.113

Even if this trial recruited many more thousands Another example of association studies for test of women, and became adequately powered, and a validation includes the evaluation of new tests for benefit is subsequently shown for the Doppler latent tuberculosis infection (LTBI), a condition arm, this may not be a vindication of benefit for for which no gold standard exists. The tuberculin the combination of Doppler test and aspirin skin test (TST) has been the standard test to detect therapy. This is because more women in the LTBI, although it is known to have false results, Doppler arm would have received aspirin particularly in people with previous Bacillus compared with the no Doppler arm, and as Calmette–Guérin (BCG) vaccination. Interferon- aspirin has been shown to be generally effective in gamma assays (e.g. Quantiferon and Elispot) have reducing pre-eclampsia,109 then it is possible that recently been developed for use as new tests for the Doppler arm would have shown benefit LTBI. Tuberculosis infection evokes a strong regardless of whether the Doppler test was a good T-helper 1 (Th1) type cell-mediated immune predictor of pre-eclampsia or not or, indeed, response with release of interferon-gamma. whether the Doppler test was done or not. This is assessment of interferon-gamma production in because about 10% of women would have received response to mycobacterial antigens can be used to aspirin in the Doppler group compared with none detect latent infection with Mycobacterium or few in the control group. tuberculosis. Blood infected with M. tuberculosis contains a specific clone of T-lymphocytes This design is, therefore, of limited use in stimulated by exposure to the ESAT-6 or CFP-10 assessing the role of uterine Doppler testing for antigen. The Elispot assay is based on the aspirin therapy. detection of interferon-gamma, which these 33

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Results

ESAT-6 and CFP-10 specific T lymphocytes outbreak investigation, four exposure groups secrete. As there is no gold standard or a suitable based on proximity and shared activities with an reference standard for LTBI against which the infected case were established following detailed comparative accuracy of the TST and Elispot interviews undertaken by a school nurse. The could be assessed, a validation study described association of index test with exposure status was below provides a good alternative to the classical examined using the gradient of dose–response diagnostic accuracy paradigm. The risk of relating test results to degree of tuberculosis infection is greatest among those contacts who exposure. Comparison of these gradients showed share a room with the index case for the greatest that Elispot test was statistically significantly better length of time. This means that airborne than TST. Elispot correlated significantly more transmission increases with length of exposure and closely with M. tuberculosis exposure than did TST proximity to an infectious case of tuberculosis. on the basis of measures of proximity (p = 0.03) Hence results of tests for LTBI should correlate and duration of exposure (p = 0.007) to the with level of exposure, and the test with the infected case. TST was significantly more likely to strongest association with exposure would be likely be positive in BCG-vaccinated than in non- to be the most accurate. This issue can be vaccinated students (p = 0.002), whereas Elispot evaluated in observational studies that ascertain results were not associated with BCG vaccination exposure to tuberculosis in a relevant population (p = 0.44). The authors concluded that Elispot and setting (e.g. outbreak investigation), perform offers a more accurate approach than TST for various index tests of interest in all eligible identification of individuals who have LTBI. subjects and compare test results with exposure Further validation may come from observational status.114 TST and Elispot response to ESAT-6 and studies evaluating whether new test results are CFP-10 were evaluated in this manner in 535 predictive of development of active tuberculosis in subjects during a school outbreak.114 In this the future.

34 Health Technology Assessment 2007; Vol. 11: No. 50

Chapter 5 Guidance and discussion

n this report, we have summarised a number of methodological advance (see also the next section Imethods that can be used to evaluate tests in of this chapter). the absence of a gold standard, an error-free method to establish with certainty the presence or A different class of test evaluation methods focuses absence of the target condition in tested patients. on changes in patient outcome from using tests. In this chapter we present our conclusions as This class includes randomised clinical trials of guidance to researchers and emphasise the need tests: comparisons of testing versus not testing, or to enrich or enlarge the accuracy paradigm in the of one index test compared with another. As the evaluation of medical tests. number of testing strategies usually exceeds the number that can be adequately dealt with in an RCT, many researchers turn to modelling to Guidance for researchers explore the health consequences of testing. In decision analysis, data on test characteristics are The guidance we have distilled from the existing combined with estimates of the effectiveness, risks evidence and the recommendations from and side-effects of treatment. A test’s accuracy is researchers can be summarised as follows not carved in stone, but will depend on the target (Figure 7). Readers should be aware that this condition that is to be detected. In talking about a flowchart can only provide general guidance test’s accuracy, we will assume that the because many factors have to be balanced when corresponding target condition has been defined. choosing one method over another. If knowing a test’s accuracy is important, the next The first question to be answered is whether the question to ask is whether there is a reference uncertainty about the test can be satisfactorily standard providing adequate classification. This answered by knowing its accuracy. The diagnostic would be a reference standard about which accuracy of a test is an expression of how well the there is consensus that it is the best available test is able to identify tested persons with the method for establishing the presence or absence target condition. The limits of this report do not of the target condition. In addition, researchers allow us to discuss at length the various other ways and readers should have confidence in the in which a test can be evaluated, including classification provided by reference standard, randomised trials of test–treatment combinations although misclassification should be considered that can document whether the use of a test will in every accuracy study. lead to improved patient outcomes.106 We also omitted from consideration many of the lower- If the outcome of the reference standard cannot or level evaluations that usually precede evaluations has not been obtained in all study participants, of diagnostic accuracy, including studies statistical methods can be used to correct for the examining the reproducibility and intra- and incomplete verification. The STARD statement inter-observer variability of tests. encourages researchers to be explicit about missing information.115 With small rates of missing We found that in situations that deviate only information on reference standard outcome, these marginally from the classical diagnostic accuracy methods can be effectively used. For large volumes paradigm, for example, where there are few of missing data on the reference standard, the missing values on an otherwise acceptable soundness of the approach depends on whether reference standard or where the magnitude and the mechanisms that led to the missing data are type of imperfection in a reference standard is well known. A general problem for these methods is documented, the methods we summarised may be that the general level of acceptance for correction valuable. However, in situations where an techniques and for imputation methods is still acceptable reference standard does not exist, fairly low. holding on to the accuracy paradigm may be fruitless. In these situations, applying the concept If the pattern of missingness is not random, of clinical test validation can provide a significant researchers could consider using a second 35

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Guidance and discussion

g f alysis Consider other types of evaluation diagnosis or Consider using panel consensus latent class an No No e Yes standard adequate Yes See ‘Panel or consensus diagnosis’ (p. 24) and ‘Latent See ‘Panel classification? classification? Consider using results provide Is there a priori f provide adequate Can multiple tests consensus on what combination of test composite reference No No See ‘Impute or adjust for missing data on reference standard’ (p. 13). d b Yes Yes inadequate for imperfect Consider using classification? but providing Is there reliable See Chapter 1. imperfection of the reference standard correction methods reference standard? about the degree of external information reference standard a Is there a preferred See ‘Composite reference standard’ (p. 21). e c applied? Yes subgroups this reference standard in these Consider using an Are there specific standard cannot be subgroups in which alternative reference No No No b Yes estimates at random? outcomes are See ‘Correct imperfect reference standard’ (p. 16). Consider imputing reference standard reference standard No or correct accuracy missing outcome on missing (completely) d Is it plausible that the See Chapter 5. g

a Classic study Yes Yes Yes in all study Guidance for researchers when faced with no gold standard diagnostic research situations. participants? the reference classification? Is there a single reference standard providing adequate diagnostic accuracy Will the outcome of standard be available Is accuracy sufficient for answering the question? See ‘Differential verification’ (p.18). See ‘Differential verification’ data analysis’ (p. 26). 36 FIGURE 7 c Health Technology Assessment 2007; Vol. 11: No. 50

reference standard in patients in whom the result scales. An additional step in these subjective of the first, intended reference standard is not approaches is to look at changes in management, available. The use of that second reference either explicitly or in intended decisions. standard should preferably be stipulated in the protocol and reported as such. With differential Several authors have claimed that the accuracy verification, as in the use of one reference standard paradigm is insufficient for evaluating the clinical in test positive patients and a second one in test usefulness of tests. The methods just mentioned negative patients, the results should be reported can all be used, although most require an for each reference standard to prevent bias. appropriate reference standard (diagnostic gain) or a clear link with management consequences If there is a reference standard but one that is (RCTs, decision analysis). We feel that there is known to be prone to error, statistical methods both a need and a place for additional methods could be used to obtain estimates that are corrected for exploring the likely clinical usefulness of tests. for these reference standard errors. When there is In the final section, below, we introduce the an identified target condition but no available concept of test validation, a progressive reference standard, the researcher may aim to exploration and testing of the tests results with construct one, based on two or more other tests. If other features, as a more general and more these are used in a rule-based way to obtain productive way of evaluating tests. information about the target condition, a composite reference standard has been constructed. If such a rule cannot be made explicit, it is possible to use a Limits to accuracy: towards a panel method, by inviting experts. An alternative validation paradigm method is the use of statistical techniques to obtain the most likely classification. By selecting one of In clinical medicine, tests are not only ordered for these methods, the researchers – or the panel knowing the results for their own sake. A physician members – use their knowledge of the clinical at the emergency department (ED) not only wants target condition to identify variables or features that to know the concentration of cardiac troponin of allow the identification of patients with that the patient with chest pain, but also wants to know particular target condition. the likelihood of a diagnosis of acute coronary syndrome. In addition, he or she wants to know With these techniques, there is an implicit where to locate the patient in the risk spectrum of assumption that they can be used to classify acute coronary syndrome.77,117 The ED physician patients in previously unknown categories: the will probably rely on the troponin results and also latent classes. This means that in latent class on other findings in trying to find out what to do analysis the target condition is not defined in a with the patient. Can this patient be safely sent clinical sense, but is a mathematically defined home? Does the patient have to be admitted to entity. The classification that comes out of the the coronary care unit? Does the patient qualify analysis may not fully coincide with pre-existing for primary angioplasty? An umbrella question for knowledge of the patients and the conditions that these issues is: are patients with chest pain at the are amenable to treatment. ED better off if they are routinely tested for troponin? In the answers to these questions, the If none of these methods seem appropriate, the exact value of the troponin concentration is an researchers have to turn to alternative methods for essential but insufficient element. It is not the evaluation. Several proposals for a staged attribute that is measured (troponin), but the level evaluation of tests have been made in the of confidence by which that attribute can be used literature. A number of these focus on the added to classify the patient in the risk spectrum of the value or diagnostic gain of tests.116 This includes acute coronary syndrome. studies that determine the added discriminatory power of tests, relative to pre-existing data from For decades, the diagnostic accuracy paradigm has history, physical or previous tests. Multivariable been used to obtain answers to this type of modelling is often used for this question. It is difficult to pinpoint its origins purpose, with the change in area under the curve exactly, but they go back well to – and beyond – as one of the statistics. The subjective side of the Ledley and Lusted’s seminal 1959 paper.18 In its ‘diagnostic gain’ invites clinicians to express the most classical sense, as can be found in all reduction in experience they feel after having textbooks, the accuracy paradigm requires a gold received the index test result. These reductions are standard, a technique used to identify with elicited using visual analogue or quantitative certainty patients with the disease of interest. 37

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Guidance and discussion

Pathophysiology and histology present the syndrome’. New management options and prototypical form of a gold standard: a method to advances in testing can change the concept of determine unequivocally the presence or absence disease, as in ectopic pregnancy. This diagnosis, of disease in the human body. Sensitivity and once attached to a life-threatening condition, now specificity are the two best known statistics to covers subclinical conditions, such as ‘throphoblast express the results of the comparison of the test in regression’.118 Advances in imaging allow the results with those of the gold standard. detection of even smaller, subsegmental pulmonary emboli or micro-metastases in patients The accuracy paradigm has proven to be a very with cancer. The current ‘diagnoses’ do not valuable one. Its ubiquity has provided a focal coincide with the older categories. point for the clinical evaluation of tests. The need for knowing the diagnostic accuracy of a test has To some extent, this problem can be remedied by prompted many useful evaluations. Reports of the replacing the term ‘disease’ with ‘target condition’, poor sensitivity or specificity of a test are likely to as has been done in this report and elsewhere.5 inhibit the premature dissemination of tests in The target condition is a much broader concept, modern medicine. covering any particular disease, a disease stage or just any other identifiable condition that may The limited time frame of this report did not allow prompt clinical action, such as further diagnostic us to explore, analyse and speculate why the gold testing or the initiation, modification or standard and the diagnostic accuracy paradigm termination of treatment. have become so dominant in the evaluation of medical tests. We can only hint at the prevailing Another – and more widely documented – problem power of the classical view of clinical medicine, and is the absence of a true gold standard its classical triad aetiognosis–diagnosis–prognosis. to classify with certainty and without errors patients One can also point to the appealing simplicity of as having the target condition or not. In more the basic accuracy design, which allows study cases than many imagine, such a gold standard results to be summarised in a very simple two-by- simply does not exist. It definitely does not exist two table, or a slanted curve in two-dimensional for many of the most prevalent chronic conditions, ROC space. However, things should only be made including diabetes, migraine and cardiovascular as simple as possible and no simpler. Beyond any disease. For these problems, the term ‘reference doubt, the prevailing accuracy paradigm is far standard’ has been introduced. This term from sufficient to cover all issues in the evaluation acknowledges the absence of a ‘gold standard’ and of the clinical evaluation of tests. We will refers to the best available method for classifying summarise only a few of these issues here. patients as having the target condition.

To start, many tests are not used for making a The term reference standard may seem like a diagnosis at all. They are used for a variety of solution, but in accordance with the Law of other purposes, such as guiding treatment Frankenstein – ‘the monster you create is yours, decisions, monitoring treatment in chronic forever’ – it also introduces ineradicable subjectivity. patients, informing patients and documenting What should the reference standard be for changes in their condition, or for clinical or basic appendicitis, or for deep venous thrombosis? Is it research. Many problems in present-day an image, or is it follow-up? The two methods will healthcare do not rely on issues of diagnosis, not yield identical results, and whereas one may be definitely not in chronic conditions, where patients closer to pathophysiology, the other may be closer are known to have diabetes, cardiovascular disease to patient outcome, the penultimate criterion for or COPD. The diagnostic accuracy paradigm decisions in healthcare. A number of authors have cannot be simply applied in situations where tests suggested that these problems can be circumvented are used for purposes other than diagnosis. by jumping to the cornerstone of evaluations of interventions: the RCT. As summarised earlier in The second issue is the problematic relation the report, RCTs cannot act as a panacea.105,106 To between pathophysiology and diagnosis, and that allow meaningful interpretations of their results, between pathology and subsequent actions. such trials presume that the links between test Furthermore, the definition or the concept of a results and subsequent clinical actions have been disease can change over time as new insights well established. That may not always be the case. become available. The classical diagnosis of ‘acute myocardial infarction’ is now embedded in the In our view, a move towards a test validation 38 spectrum of conditions known as ‘acute coronary paradigm is justified. This means that scientists and Health Technology Assessment 2007; Vol. 11: No. 50

practitioners examine, using a number of different healthcare, by improving outcome, reducing costs, methods, whether the results of an index test are or both. meaningful in practice. Validation will always be a gradual process. It will involve the scientific and Validation offers a way out of the dilemma clinical community defining a threshold, a point in introduced by the multiple fixes that are needed the validation process, where the information in order to cling to the diagnostic accuracy gathered would be considered sufficient to allow paradigm in the absence of an accepted gold clinical use of the test with confidence. standard. Replacing the gold standard by the reference standard, and the pathophysiological Validating a test is a process through which detection of disease by the identification of the scientists and practitioners can find out whether target condition is not enough. Like all humans, the results of a test are meaningful. The process of we value simplicity, and the alluring attraction of validation will come after initial evaluations of the two simple statistics makes it hard to give up the basic properties of the test: its reliability, notion of fixed test characteristics. In some cases consistency, trueness. Once these hurdles have these fixes are feasible and reasonable, as has been been overcome, we can try to find out how and to summarised in this report. In others, we may well what extent the test results fit in our abandon the accuracy paradigm and replace it by understanding of a patient’s condition, its likely a new one: the concept of the validation of causes and course. medical tests.

To validate a test, we will have to build a conceptual framework, defining how the test Recommendations for further results relate to other features. These features may research be derived from history, from physical examination, from other test results, from follow- Diagnostic research with partial verification is up or from response to treatment. common, this includes research based on routinely collected clinical data. There is a need to The concept of test validation encompasses the increase awareness that naïve estimates of accuracy traditional notion of diagnostic accuracy. If there by leaving out unverified patients from the is an accepted pathophysiological gold standard, calculation can lead to serious bias. one that can be used to detect the presence of absence of disease with certainty, we can validate a There is a need to perform empirical and test by comparing its results with the findings of simulation studies that will determine the the gold standard (criterion validity). Test potential and limitations of multiple imputation validation also allows the option for evaluating test methods to reduce the bias caused by missing data results of tests that are not used, or not used on the reference standard. exclusively, in making a diagnosis. In setting up consensus-based diagnosis, there Test validation is probably the way to go in case are many practical choices to be made. These there is a test that is proclaimed to be ‘better than include the number of experts, the way patient the existing reference standard’. In these cases information is presented, and how to obtain a also, we have to show that the differences between final classification. There is a need to perform that test and the reference standard are more methodological studies to provide guidance ‘meaningful’, by demonstrating reliable in this area. associations with other findings and test results. There is a need to design studies that compare the The results of a validation process cannot be results based on a consensus procedure with the captured in simple statistics. The associations with results of latent class models when multiple pieces other variables can and must be expressed in a of information will be combined to construct a quantitative sense, but they will never be reduced reference standard. Also there is a need to examine to a simple pair of numbers, as a test’s sensitivity approaches that combine both procedures. and specificity. There is a need to develop more elaborate Discovering or demonstrating that test results are schemes outwith the accuracy paradigm for the meaningful does not justify the use of that test in validation of tests targeted at diseases or daily practice. In the end, the use of tests has to be conditions for which there is no acceptable justified by demonstrating that it leads to better reference standard. 39

© Queen’s Printer and Controller of HMSO 2007. All rights reserved.

Health Technology Assessment 2007; Vol. 11: No. 50

Acknowledgements

e gratefully acknowledge the many useful methods and results. Johannes Reitsma (Senior Wcomments and interesting discussions we had Clinical Epidemiologist) was a co-applicant on the with the many experts we involved in this project. project grant application and carried out work on For a list of experts, see Appendix 2. Additionally, the development of the protocol, project we gained input from discussions with Professor management, supervision of review work, as well Paul Glasziou, Dr Tracy Roberts, Dr Chris Hyde as drafting and editing of the final report. [CH is a member of the Editorial board for Health Arri Coomarasamy (Honorary Lecturer in Technology Assessment but was not involved in the Epidemiology) was a co-applicant on the project editorial process for this report] and Dr Priscilla grant application and worked on the development Harries. We thank them for their time and the of the protocol and on drafting the results. efforts that they put into this project. This report Khalid Khan (Professor of Obstetrics-Gynaecology has been shaped and improved by the comments and Clinical Epidemiology) was a main applicant of the experts, but the views expressed in this on the project grant application and also carried report are those of the authors and do not out development of the protocol, project necessarily reflect the opinions of the experts. management, drafting of results and editing of the final report. Patrick Bossuyt (Professor and Chair Contribution of authors of Department of Clinical Epidemiology) was a Anna Rutjes (Research Fellow) was a co-applicant main applicant on the project grant application on the project grant application and worked on and worked on the development of the protocol, the development of the protocol, acted as a on drafting the results and contributed to reviewer, and was involved in the drafting of discussion, as well as to editing the final report.

41

© Queen’s Printer and Controller of HMSO 2007. All rights reserved.

Health Technology Assessment 2007; Vol. 11: No. 50

References

1. Lijmer JG, Mol BW, Heisterkamp S, et al. 13. Pepe MS. Incomplete data and imperfect reference Empirical evidence of design-related bias in tests. In Pepe MS, editor. The statistical evaluation of studies of diagnostic tests. JAMA 1999;282:1061–6. medical tests for classification and prediction. Oxford: Oxford University Press; 2004. pp. 168–213. 2. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and 14. Vacek PM. The effect of conditional dependence bias in studies of diagnostic accuracy: a systematic on the evaluation of diagnostic tests. Biometrics review. Ann Intern Med 2004;140:189–202. 1985;41:959–68. 3. Rutjes AW, Reitsma JB, Di NM, Smidt N, 15. Thibodeau L. Evaluating diagnostic tests. van Rijn JC, Bossuyt PM. Evidence of bias and Biometrics 1981;37:801–4. variation in diagnostic accuracy studies. CMAJ 16. International Organization for Standardization. 2006;174:469–76. International vocabulary of basic and general terms in 4. Knottnerus JA, van Weel C. General introduction: metrology. Geneva: ISO; 1993. evaluation of diagnostic procedures. In 17. Yerushalmy J. Statistical problems in assessing Knottnerus JA, editor. The evidence base of clinical methods of , with special diagnosis. London: BMJ Books; 2002. pp. 1–18. reference to X-ray techniques. Rep 5. Sackett DL, Haynes RB. The architecture of 1947;62:1432–49. diagnostic research. In Knottnerus JA, editor. The 18. Ledley RS, Lusted LB. Reasoning foundations of evidence base of clinical diagnosis. London: BMJ medical diagnosis; symbolic logic, probability, and Books; 2002. pp. 19–38. value theory aid our understanding of how physicians reason. Science 1959;130:9–21. 6. Lijmer JG, Mol BWJ, Bonsel GJ, Prins MH, Bossuyt PM. Strategies for the evaluation of 19. Whiting P, Rutjes AW, Reitsma JB, Glas AS, diagnostic technologies. Evaluation of diagnostic Bossuyt PM, Kleijnen J. Sources of variation and tests: from accuracy to outcome (Thesis). bias in studies of diagnostic accuracy: a systematic Amsterdam; 2001. pp. 15–28. review. Ann Intern Med 2004;140:189–202. 7. Fryback DG, Thornbury JR. The efficacy of 20. Rutjes AW, Reitsma JB, Irwig L, Bossuyt PM. diagnostic imaging. Med Decis Making 1991;11: Partial and differential bias in diagnostic accuracy 88–94. studies. Sources of bias and variation in diagnostic accuracy studies (Thesis). Amsterdam; 2005. 8. Knottnerus JA, Muris JW. Assessment of the pp. 31–44. accuracy of diagnostic tests: the cross-sectional study. In Knottnerus JA, editor. The evidence base of 21. Gustafson P. The utility of prior information and clinical diagnosis. London: BMJ Books; 2002. stratification for parameter estimation with two pp. 39–59. screening tests but no gold standard. Stat Med 2005;24:1203–17. 9. Bossuyt PM, Irwig L, Craig J, Glasziou P. Comparative accuracy: assessing new tests against 22. Begg CB, Greenes RA. Assessment of diagnostic existing diagnostic pathways. BMJ 2006;332: tests when disease verification is subject to 1089–92. selection bias. Biometrics 1983;39:207–15. 10. Lord SJ, Irwig L, Simes RJ. When is measuring 23. Harel O, Zhou XH. Multiple imputation for sensitivity and specificity sufficient to evaluate a correcting verification bias. Stat Med 2006;25: diagnostic test, and when do we need randomized 3769–86. trials? Ann Intern Med 2006;144:850–5. 24. Pepe MS. The statistical evaluation of medical tests for 11. Boyko EJ, Alderman BW, Baron AE. Reference test classification and prediction. Oxford: Oxford errors bias the evaluation of diagnostic tests for University Press; 2003. ischemic heart disease. J Gen Intern Med 1988; 25. Alonzo TA, Pepe MS. Using a combination of 3:476–81. reference tests to assess the accuracy of a new diagnostic test. Stat Med 1999;18:2987–3003. 12. Deneef P. Evaluating rapid tests for streptococcal pharyngitis: the apparent accuracy of a diagnostic 26. Hui SL, Zhou XH. Evaluation of diagnostic tests test when there are errors in the standard of without gold standards. Stat Methods Med Res 1998; comparison. Med Decis Making 1987;7:92–6. 7:354–70. 43

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. References

27. Cochrane Handbook for Systematic Reviews of 42. Molenberghs G, Goetghebeur EJ, Lipsitz SR, Interventions 4.2.5 [updated May 2005]. Chichester Kenward MG, Lesaffre E, Michiels B. Missing data Wiley; 2006. perspectives of the fluvoxamine data set: a review. Stat Med 1999;18:2449–64. 28. Lilford RJ, Richardson A, Stevens A, et al. Issues in methodological research: perspectives from 43. Engels JM, Diehr P. Imputation of missing researchers and commissioners. Health Technol longitudinal data: a comparison of methods. J Clin Assess 2001;5(8). Epidemiol 2003;56:968–76. 29. Little R, Rubin D. Statistical analysis with missing 44. Liu M, Taylor JM, Belin TR. Multiple imputation data. 2nd ed. New York: Wiley; 2002. and posterior simulation for multivariate missing data in longitudinal studies. Biometrics 2000;56: 30. Paliwal P, Gelfand AE. Estimating measures of 1157–63. diagnostic accuracy when some covariate information is missing. Stat Med 2006;25:2981–93. 45. Rubin DB. Multiple imputation after 18+ years. J Am Stat Assoc 1996;91:473–89. 31. Choi BC. Sensitivity and specificity of a single diagnostic test in the presence of work-up bias. 46. Schafer JL. Multiple imputation: a primer. Stat J Clin Epidemiol 1992;45:581–6. Methods Med Res 1999;8:3–15. 32. Diamond GA. Off Bayes: effect of verification bias 47. Zhou X-H, Obuchowski NA, McClish DK. Statistical on posterior probabilities calculated using Bayes’ methods in diagnostic medicine. New York: Wiley; 2002. theorem. Med Decis Making 1992;12:22–31. 48. Greenes RA, Begg CB. Assessment of diagnostic 33. Cecil MP, Kosinski AS, Jones MT, et al. The technologies. Methodology for unbiased importance of work-up (verification) bias estimation from samples of selectively verified correction in assessing the accuracy of SPECT patients. Invest Radiol 1985;20:751–6. thallium-201 testing for the diagnosis of coronary artery disease. J Clin Epidemiol 1996;49:735–42. 49. Pisano ED, Gatsonis C, Hendrick E, et al. Diagnostic performance of digital versus film 34. Danias PG, Parker JA. Novel Internet-based tool mammography for breast-cancer screening. N Engl for correcting apparent sensitivity and specificity J Med 2005;353:1773–83. of diagnostic tests to adjust for referral (verification) bias. Radiographics 2002;22(2):e4. 50. Harrell FE, Jr., Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, 35. Held L, Ranyimbo AO. A Bayesian approach to evaluating assumptions and adequacy, and estimate and validate the false negative fraction in measuring and reducing errors. Stat Med 1996; a two-stage multiple screening test. Methods Inf 15:361–87. Med 2004;43:461–4. 51. Drum D, Christacopoulos J. Hepatic scintigraphy 36. Kosinski AS, Barnhart HX. Accounting for in clinical decision making. J Nucl Med 1969; nonignorable verification bias in assessment of 13:908–15. diagnostic tests. Biometrics 2003;59:163–71. 52. Marshall V, Williams DC, Smith KD. 37. Kosinski AS, Barnhart HX. A global sensitivity Diaphanography as a means of detecting breast analysis of performance of a medical diagnostic cancer. Radiology 1981;150:339–43. test when verification bias is present. Stat Med 2003;22:2711–21. 53. Zhou X-H, Obuchowski NA, McClish DK. Methods for correcting imperfect standard bias. Statistical methods 38. Toledano AY, Gatsonis C. Generalized for ordinal categorical data: arbitrary in diagnostic medicine. New York: Wiley; 2002. patterns of missing responses and missingness in a pp. 359–95. key covariate. Biometrics 1999;55:488–96. 54. Hadgu A, Dendukuri N, Hilden J. Evaluation of 39. Zhou XH. Correcting for verification bias in nucleic acid amplification tests in the absence of a studies of a diagnostic test’s accuracy. Stat Methods perfect gold-standard test: a review of the Med Res 1998;7:337–53. statistical and epidemiologic issues. Epidemiology 2005;16:604–12. 40. Zhou XH, Higgs RE. COMPROC and CHECKNORM: computer programs for 55. Staquet M, Rozencweig M, Lee YJ, et al. comparing accuracies of diagnostic tests using Methodology for the assessment of new ROC curves in the presence of verification bias. dichotomous diagnostic tests. J Chronic Dis Comput Methods Programs Biomed 1998;57:179–86. 1981;34:599–610. 41. Zhou XH, Higgs RE. Assessing the relative 56. Valenstein PN. Evaluating diagnostic tests with accuracies of two screening tests in the presence of imperfect standards. Am J Clin Pathol 1990;93: 44 verification bias. Stat Med 2000;19:1697–705. 252–8. Health Technology Assessment 2007; Vol. 11: No. 50

57. Brenner H. Correcting for exposure estimates computed by discrepant analysis. J Clin misclassification using an alloyed gold standard. Microbiol 1998;36:375–81. Epidemiology 1996;7:406–10. 73. Diamond GA. Affirmative actions: can the 58. Wacholder S, Armstrong B, Hartge P. Validation discriminant accuracy of a test be determined in studies using an alloyed gold standard. Am J the face of selection bias? Med Decis Making Epidemiol 1993;137:1251–8. 1991;11:48–56. 59. Schneeweiss S, Kriegmair M, Stepp H. Is 74. Wu CH, Lee MF, Yin SC, Yang DM, Cheng SF. everything all right if nothing seems wrong? A Comparison of polymerase chain reaction, simple method of assessing the diagnostic value of monoclonal antibody based enzyme immunoassay, endoscopic procedures when a gold standard is and cell culture for detection of Chlamydia absent. J Urol 1999;161:1116–19. trachomatis in genital specimens. Sex Transm Dis 1992;19:193–7. 60. Schneeweiss S. Sensitivity analysis of the diagnostic value of endoscopies in cross-sectional studies in 75. Martin DH, Nsuami M, Schachter J, et al. Use of the absence of a gold standard. Int J Technol Assess multiple nucleic acid amplification tests to define Health Care 2000;16:834–41. the infected-patient “gold standard” in clinical trials of new diagnostic tests for Chlamydia 61. van Belle A., Buller HR, Huisman MV, et al. trachomatis infections. J Clin Microbiol 2004; Effectiveness of managing suspected pulmonary 42:4749–58. embolism using an algorithm combining clinical probability, D-dimer testing, and computed 76. Jang D, Sellors JW, Mahony JB, Pickard L, tomography. JAMA 2006;295:172–9. Chernesky MA. Effects of broadening the gold standard on the performance of a 62. Ransohoff DF, Feinstein AR. Problems of spectrum chemiluminometric immunoassay to detect and bias in evaluating the efficacy of diagnostic Chlamydia trachomatis antigens in centrifuged first tests. N Engl J Med 1978;299:926–30. void urine and urethral swab samples from men. 63. Kline JA, Israel EG, Michelson EA, O’Neil BJ, Plewa Sex Transm Dis 1992;19:315–19. MC, Portelli DC. Diagnostic accuracy of a bedside D- 77. Das R, Kilcullen N, Morrell C, Robinson MB, dimer assay and alveolar dead-space measurement Barth JH, Hall AS. The British Cardiac Society for rapid exclusion of pulmonary embolism: Working Group definition of myocardial a multicenter study. JAMA 2001;285:761–8. infarction: implications for practice. Heart 64. Alonzo TA, Pepe MS. Assessing the accuracy of a 2006;92:21–6. new diagnostic test when a gold standard does not 78. Thijs JC, Van Zwet AA, Thijs WJ, et al. Diagnostic exist. UW Biostatistics Working Paper Series 1998; tests for Helicobacter pylori: a prospective evaluation paper 156. of their accuracy, without selecting a single test as 65. Hadgu A. The discrepancy in discrepant analysis. the gold standard. Am J Gastroenterol 1996; Lancet 1996;348:592–3. 91:2125–9. 66. Miller WC. Can we do better than discrepant 79. Madan K, Ahuja V, Gupta SD, Bal C, Kapoor A, analysis for new diagnostic test evaluation? Clin Sharma MP. Impact of 24-h esophageal pH Infect Dis 1998;27:1186–93. monitoring on the diagnosis of gastroesophageal reflux disease: defining the gold standard. 67. Miller WC. Bias in discrepant analysis: when two J Gastroenterol Hepatol 2005;20:30–7. wrongs don’t make a right. J Clin Epidemiol 1998; 51:219–31. 80. Gagnon R, Charlin B, Coletti M, Sauve E, Van D, V. Assessment in the context of uncertainty: how 68. Hadgu A. Bias in the evaluation of DNA- many members are needed on the panel of amplification tests for detecting Chlamydia reference of a script concordance test? Med Educ trachomatis. Stat Med 1997;16:1391–9. 2005;39:284–91. 69. Lipman H, Astles J. Quantifying the bias 81. Jones J, Hunter D. Consensus methods for associated with use of discrepant analysis. Clin medical and health services research. BMJ Chem 1998;44:108–15. 1995;311:376–80. 70. Hadgu A. Discrepant analysis: a biased and an 82. Rutten FH, Moons KG, Cramer MJ, et al. unscientific method for estimating test sensitivity Recognising heart failure in elderly patients with and specificity. J Clin Epidemiol 1999;52:1231–7. stable chronic obstructive pulmonary disease in 71. Hadgu A. Discrepant analysis is an inappropriate primary care: cross sectional diagnostic study. BMJ and unscientific method. J Clin Microbiol 2005;331:1379. 2000;38:4301–2. 83. Hui SL, Zhou XH. Evaluation of diagnostic tests 72. Green TA, Black CM, Johnson RE. Evaluation of without gold standards. Stat Methods Med Res bias in diagnostic-test sensitivity and specificity 1998;7:354–70. 45

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. References

84. Walter SD, Irwig LM. Estimation of test error 98. Black CM. Current methods of laboratory rates, disease prevalence and from diagnosis of Chlamydia trachomatis infections. Clin misclassified data: a review. J Clin Epidemiol Microbiol Rev 1997;10:160–84. 1988;41:923–37. 99. Barnes RC. Laboratory diagnosis of human 85. Pepe MS, Janes H. Insights into latent class chlamydial infections. Clin Microbiol Rev analysis. UW Biostatistics Working Paper Series 2005; 1989;2:119–36. paper 236. 100. Winter G. A comparative discussion of the notion of 86. Formann AK. Measurement errors in caries ‘validity’ in qualitative and quantitative research. The diagnosis: some further latent class models. qualitative report. 2000; 4. URL: http://www.nova.edu/ Biometrics 1994;50:865–71. ssss/QR/OR4-3/winter.html. Accessed 21 September 2007. 87. Committee on Quality Management, Japan Society of Clinical Chemistry. Proposed standard for 101. Streiner DL, Norman GR. Validity. In Streiner DL, matrix reference materials for quantitative Norman GR, editors. Health measurement scales: a laboratory methods. JPN J Clin Chem 2003; practical guide to their development and use. Oxford: 32:180–5. Oxford University Press; 1995. pp. 144–62. 102. Bland JM, Altman DG. Statistics notes: validating 88. Qu Y, Tan M, Kutner MH. Random effects models scales and indexes. BMJ 2002;324:606–7. in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics 1996;52:797–810. 103. Fletcher RH, Fletcher SW, Wagner EH. Abnormality. In Satterfield TS, editor. Clinical 89. Bertrand P, Benichou J, Grenier P, Chastang C. epidemiology: the essentials. Baltimore, MD: Williams Hui and Walter’s latent-class reference-free and Wilkins; 1996. pp. 22–24. approach may be more useful in assessing agreement than diagnostic performance. J Clin 104. Cronbach LJ, Meehl PE. Construct validity in Epidemiol 2005;58:688–700. psychological tests. Psychol Bull 1955;52:281–302. 90. Vermunt JK, Magidson J. Latent class analysis. In 105. Lijmer JG, Bossuyt PM. Diagnostic testing and Lewis-Beck M, Bryman A, Liao TF, editors. The prognosis: the randomised controlled trial in Sage Encyclopedia of Social Sciences Research Methods. diagnostic research. In Knottnerus JA, editor. The Thousand Oaks, CA Sage Publications; 2004. evidence base of clinical diagnosis. London: BMJ pp. 549–53. Books; 2002. pp. 61–80. 91. Joseph L, Gyorkos TW, Coupal L. Bayesian 106. Bossuyt PM, Lijmer JG, Mol BW. Randomised estimation of disease prevalence and the comparisons of medical tests: sometimes invalid, parameters of diagnostic tests in the absence of a not always efficient. Lancet 2000;356:1844–7. gold standard. Am J Epidemiol 1995;141:263–72. 107. Fryback DG, Thornbury JR. The efficacy of diagnostic imaging. Med Decis Making 1991;11: 92. Black MA, Craig BA. Estimating disease 88–94. prevalence in the absence of a gold standard. Stat Med 2002;21:2653–69. 108. Subtil D, Goeusse P, Houfflin-Debarge V, et al. Randomised comparison of uterine artery Doppler 93. Hadgu A, Qu Y. A biomedical application of latent and aspirin (100 mg) with placebo in nulliparous class models with random effects. Appl Stat women: the Essai Regional Aspirine Mere-Enfant 1998;47:603–16. Study (Part 2). BJOG 2003;110:485–91. 94. Gustafson P. Measurement error and misclassification 109. Duley L, Henderson-Smart D, Knight M, King J. in statistics and epidemiology: impacts and Bayesian Antiplatelet drugs for prevention of pre-eclampsia adjustments. Boca Raton, FL: Chapman and and its consequences: systematic review. BMJ Hall/CRC; 2003. 2001;322:329–33. 95. Dendukuri N, Joseph L. Bayesian approaches to 110. Collinson PO, Stubbs PJ, Kessler AC. Multicentre modelling the conditional dependence between evaluation of the diagnostic value of cardiac multiple diagnostic tests. Biometrics 2001; troponin T, CK-MB mass, and myoglobin for 57:158–67. assessing patients with suspected acute coronary 96. Georgiadis MP, Johnson WO, Gardner IA, Singh syndromes in routine clinical practice. Heart R. Correlation-adjusted estimation of sensitivity 2003;89:280–6. and specificity of two diagnostic tests. Appl Stat 111. Kontos MC, Shah R, Fritz LM, et al. Implication of 2003;52:63–76. different cardiac troponin I levels for clinical outcomes and prognosis of acute chest pain 97. Bertrand P, Benichou J, Grenier P, Chastang C. patients. J Am Coll Cardiol 2004;43:958–65. Hui and Walter’s latent-class reference-free approach may be more useful in assessing 112. Jaffe AS, Ravkilde J, Roberts R, et al. It’s time for a agreement than diagnostic performance. J Clin change to a troponin standard. Circulation 46 Epidemiol 2005;58:688–700. 2000;102:1216–20. Health Technology Assessment 2007; Vol. 11: No. 50

113. Apple FS, Wu AH, Mair J, et al. Future biomarkers 116. Moons KG, Biesheuvel CJ, Grobbee DE. Test for detection of ischemia and risk stratification in research versus diagnostic research. Clin Chem acute coronary syndrome. Clin Chem 2004;50:473–6. 2005;51:810–24. 117. Lee TH, Goldman L. Evaluation of the patient 114. Ewer K, Deeks J, Alvarez L, et al. Comparison of T- with acute chest pain. N Engl J Med 2000; cell-based assay with tuberculin skin test for 342:1187–95. diagnosis of Mycobacterium tuberculosis infection 118. Ankum WM, Van der Veen F, Hamerlynck JV, in a school tuberculosis outbreak. Lancet Lammes FB. Suspected ectopic pregnancy. What to 2003;361:1168–73. do when human chorionic gonadotropin levels are 115. Bossuyt PM, Reitsma JB, Bruns DE, et al. The below the discriminatory zone. J Reprod Med STARD statement for reporting studies of 1995;40:525–8. diagnostic accuracy: explanation and elaboration. Clin Chem 2003;49:7–18.

47

© Queen’s Printer and Controller of HMSO 2007. All rights reserved.

Health Technology Assessment 2007; Vol. 11: No. 50

Appendix 1 Search terms in databases

MEDLINE EMBASE Searched 6 October 2005 via OVID Searched 6 October 2005 via OVID Date coverage: 1966–September 2005 Date coverage: 1980–September 2005 1. reference.ti. 1. reference.ti. 2. gold.ti. 2. gold.ti. 3. golden.ti. 3. golden.ti. 4. test.ti. 4. test.ti. 5. standard.ti. 5. standard.ti. 6. 1 or 2 or 3 6. 1 or 2 or 3 7. 4 or 5 7. 4 or 5 8. 6 and 7 8. 6 and 7 9. absen$.ti. 9. absen$.ti. 10. reference test.ti. 10. reference test.ti. 11. 9 and 10 11. 9 and 11 12. 9 and 1 12. 9 and 1 13. 9 and 2 13. 9 and 2 14. 9 and 3 14. 9 and 3 15. 10 and 5 15. 10 and 5 16. 8 or 11 or 12 or 13 or 14 or 15 16. 8 or 11 or 12 or 13 or 14 or 15

Cochrane databases Pubmed (DARE, CENTRAL, CMR, NHS) Searched 6 October 2005 via www.pubmed.gov Searched 13 October 2005 Date coverage: 1966–October 2005 1. reference.ti. (((“reference”[ti] OR “gold”[ti] OR “golden”[ti]) 2. gold.ti. AND(“test”[ti] OR “standard”[ti])) OR ((absen*[ti] AND 3. golden.ti. gold*[ti]) OR (“no gold*”[ti]) OR (absen*[ti] AND 4. test.ti. referen*[ti]) OR (absen*[ti] AND standard[ti]) OR 5. standard.ti. (absen*[ti] AND “reference test”[ti]))) 6. 1 or 2 or 3 7. 4 or 5 8. 6 and 7 9. absen*.ti. 10. reference test.ti. 11. 9 and 10 12. 10 and 1 13. 9 and 2 14 9 and 3 15 9 and 5 16 8 or 11 or 12 or 13 or 14 or 15

Medion Searched 24 November 2005 via www.mediondatabase.nl Filter used: ‘Methodological Studies on Systematic Reviews of Diagnostic Tests’ 1. MKD (quality assessment) 2. MBD (bias in meta-analysis) 3. MH (heterogeneity) 4. MAD (several/methods)

49

© Queen’s Printer and Controller of HMSO 2007. All rights reserved.

Health Technology Assessment 2007; Vol. 11: No. 50

Appendix 2 Experts in peer review process

e assembled a list of topic-specific experts ‘Validate index test results’ Woutside our research team to review each Professor Les Irwig method. In addition, several general experts Professor of Epidemiology, Department of Public reviewed the whole report. Health and Community Medicine, University of Sydney, Australia Topic-specific experts ‘Impute missing data on reference General experts reviewing the full standard’ and ‘Correct imperfect report reference standard’ Professor Les Irwig Professor Aeilko H Zwinderman Professor of Epidemiology, Department of Public Professor in Biostatistics, Department of Clinical Health and Community Medicine, University of Epidemiology and Biostatistics, Academic Medical Sydney, Australia Center, University of Amsterdam, The Netherlands Professor Karel GM Moons Professor of Clinical Epidemiology, Julius Center Dr Francisca Galindo Garre for Health Sciences and Primary Care, University , Psychometrician, Department of Medical Center, Utrecht, The Netherlands Clinical Epidemiology and Biostatistics, Academic Medical Center, University of Amsterdam, Professor Aeilko H. Zwinderman The Netherlands Professor of Biostatistics, Department of Clinical Epidemiology and Biostatistics, Academic Medical ‘Construct reference standard’ Center, University of Amsterdam, Professor Les Irwig The Netherlands Professor of Epidemiology, Department of Public Health and Community Medicine, University of Professor Constantine Gatsonis Sydney, Australia Professor of Biostatistics, Community Health (Biostatistics) and Applied Mathematics Professor Karel GM Moons Brown University, Providence, RI, USA Professor of Clinical Epidemiology, Julius Center for Health Sciences and Primary Care, University Medical Center, Utrecht, The Netherlands

Dr Alexandra AH van Abswoude Methodologist, Department of Clinical Epidemiology and Biostatistics, Academic Medical Center, University of Amsterdam, The Netherlands

51

© Queen’s Printer and Controller of HMSO 2007. All rights reserved.

Health Technology Assessment 2007; Vol. 11: No. 50

Health Technology Assessment reports published to date

Volume 1, 1997 No. 11 No. 6 Newborn screening for inborn errors of Effectiveness of hip prostheses in No. 1 metabolism: a systematic review. primary total hip replacement: a critical Home parenteral nutrition: a systematic By Seymour CA, Thomason MJ, review of evidence and an economic review. Chalmers RA, Addison GM, Bain MD, model. By Richards DM, Deeks JJ, Sheldon Cockburn F, et al. By Faulkner A, Kennedy LG, TA, Shaffer JL. Baxter K, Donovan J, Wilkinson M, No. 2 No. 12 Bevan G. Diagnosis, management and screening Routine preoperative testing: No. 7 of early localised prostate cancer. a systematic review of the evidence. Antimicrobial prophylaxis in colorectal A review by Selley S, Donovan J, By Munro J, Booth A, Nicholl J. surgery: a systematic review of Faulkner A, Coast J, Gillatt D. No. 13 randomised controlled trials. No. 3 Systematic review of the effectiveness By Song F, Glenny AM. The diagnosis, management, treatment of laxatives in the elderly. No. 8 and costs of prostate cancer in England By Petticrew M, Watt I, Sheldon T. Bone marrow and peripheral blood and Wales. stem cell transplantation for malignancy. A review by Chamberlain J, Melia J, No. 14 A review by Johnson PWM, Simnett SJ, Moss S, Brown J. When and how to assess fast-changing technologies: a comparative study of Sweetenham JW, Morgan GJ, Stewart LA. No. 4 medical applications of four generic No. 9 Screening for fragile X syndrome. technologies. Screening for speech and language A review by Murray J, Cuckle H, A review by Mowatt G, Bower DJ, delay: a systematic review of the Taylor G, Hewison J. Brebner JA, Cairns JA, Grant AM, literature. McKee L. No. 5 By Law J, Boyle J, Harris F, Harkness A review of near patient testing in A, Nye C. primary care. Volume 2, 1998 No. 10 By Hobbs FDR, Delaney BC, Resource allocation for chronic stable Fitzmaurice DA, Wilson S, Hyde CJ, No. 1 angina: a systematic review of Thorpe GH, et al. Antenatal screening for Down’s syndrome. effectiveness, costs and No. 6 A review by Wald NJ, Kennard A, cost-effectiveness of alternative Systematic review of outpatient services Hackshaw A, McGuire A. interventions. for chronic pain control. By Sculpher MJ, Petticrew M, By McQuay HJ, Moore RA, Eccleston No. 2 Kelland JL, Elliott RA, Holdright DR, C, Morley S, de C Williams AC. Screening for ovarian cancer: Buxton MJ. a systematic review. No. 7 No. 11 By Bell R, Petticrew M, Luengo S, Neonatal screening for inborn errors of Detection, adherence and control of Sheldon TA. metabolism: cost, yield and outcome. hypertension for the prevention of A review by Pollitt RJ, Green A, No. 3 stroke: a systematic review. McCabe CJ, Booth A, Cooper NJ, Consensus development methods, and By Ebrahim S. Leonard JV, et al. their use in clinical guideline No. 12 No. 8 development. Postoperative analgesia and vomiting, Preschool vision screening. A review by Murphy MK, Black NA, with special reference to day-case A review by Snowdon SK, Lamping DL, McKee CM, Sanderson surgery: a systematic review. Stewart-Brown SL. CFB, Askham J, et al. By McQuay HJ, Moore RA. No. 9 No. 4 No. 13 Implications of socio-cultural contexts A cost–utility analysis of interferon Choosing between randomised and for the ethics of clinical trials. beta for multiple sclerosis. nonrandomised studies: a systematic A review by Ashcroft RE, Chadwick By Parkin D, McNamee P, Jacoby A, review. DW, Clark SRL, Edwards RHT, Frith L, Miller P, Thomas S, Bates D. By Britton A, McKee M, Black N, Hutton JL. McPherson K, Sanderson C, Bain C. No. 5 No. 10 Effectiveness and efficiency of methods No. 14 A critical review of the role of neonatal of dialysis therapy for end-stage renal Evaluating patient-based outcome hearing screening in the detection of disease: systematic reviews. measures for use in clinical congenital hearing impairment. By MacLeod A, Grant A, trials. By Davis A, Bamford J, Wilson I, Donaldson C, Khan I, Campbell M, A review by Fitzpatrick R, Davey C, Ramkalawan T, Forshaw M, Wright S. Daly C, et al. Buxton MJ, Jones DR. 53

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Health Technology Assessment reports published to date

No. 15 No. 5 No. 17 (Pt 1) Ethical issues in the design and conduct Methods for evaluating area-wide and The debridement of chronic wounds: of randomised controlled trials. organisation-based interventions in a systematic review. A review by Edwards SJL, Lilford RJ, health and health care: a systematic By Bradley M, Cullum N, Sheldon T. Braunholtz DA, Jackson JC, Hewison J, review. Thornton J. By Ukoumunne OC, Gulliford MC, No. 17 (Pt 2) Chinn S, Sterne JAC, Burney PGJ. Systematic reviews of wound care No. 16 management: (2) Dressings and topical Qualitative research methods in health No. 6 agents used in the healing of chronic technology assessment: a review of the Assessing the costs of healthcare wounds. literature. technologies in clinical trials. By Bradley M, Cullum N, Nelson EA, By Murphy E, Dingwall R, Greatbatch A review by Johnston K, Buxton MJ, Petticrew M, Sheldon T, Torgerson D. D, Parker S, Watson P. Jones DR, Fitzpatrick R. No. 18 No. 17 No. 7 A systematic literature review of spiral The costs and benefits of paramedic Cooperatives and their primary care and electron beam computed skills in pre-hospital trauma care. emergency centres: organisation and tomography: with particular reference to By Nicholl J, Hughes S, Dixon S, impact. clinical applications in hepatic lesions, Turner J, Yates D. By Hallam L, Henthorne K. pulmonary embolus and coronary artery No. 18 No. 8 disease. Systematic review of endoscopic Screening for cystic fibrosis. By Berry E, Kelly S, Hutton J, ultrasound in gastro-oesophageal A review by Murray J, Cuckle H, Harris KM, Roderick P, Boyce JC, et al. cancer. Taylor G, Littlewood J, Hewison J. No. 19 By Harris KM, Kelly S, Berry E, What role for statins? A review and Hutton J, Roderick P, Cullingworth J, et al. No. 9 A review of the use of health status economic model. No. 19 measures in economic evaluation. By Ebrahim S, Davey Smith G, Systematic reviews of trials and other By Brazier J, Deverill M, Green C, McCabe C, Payne N, Pickin M, Sheldon studies. Harper R, Booth A. TA, et al. By Sutton AJ, Abrams KR, Jones DR, Sheldon TA, Song F. No. 10 No. 20 Methods for the analysis of quality-of- Factors that limit the quality, number No. 20 life and survival data in health and progress of randomised controlled Primary total hip replacement surgery: technology assessment. trials. a systematic review of outcomes and A review by Billingham LJ, Abrams A review by Prescott RJ, Counsell CE, modelling of cost-effectiveness KR, Jones DR. Gillespie WJ, Grant AM, Russell IT, associated with different prostheses. Kiauka S, et al. A review by Fitzpatrick R, Shortall E, No. 11 Sculpher M, Murray D, Morris R, Lodge Antenatal and neonatal No. 21 M, et al. haemoglobinopathy screening in the Antimicrobial prophylaxis in total hip UK: review and economic analysis. replacement: a systematic review. Volume 3, 1999 By Zeuner D, Ades AE, Karnon J, By Glenny AM, Song F. Brown J, Dezateux C, Anionwu EN. No. 1 No. 22 Informed decision making: an annotated No. 12 Health promoting schools and health Assessing the quality of reports of bibliography and systematic review. promotion in schools: two systematic randomised trials: implications for the By Bekker H, Thornton JG, reviews. conduct of meta-analyses. By Lister-Sharp D, Chapman S, Airey CM, Connelly JB, Hewison J, A review by Moher D, Cook DJ, Jadad Stewart-Brown S, Sowden A. Robinson MB, et al. AR, Tugwell P, Moher M, Jones A, et al. No. 23 No. 2 No. 13 Economic evaluation of a primary care- Handling uncertainty when performing ‘Early warning systems’ for identifying based education programme for patients economic evaluation of healthcare new healthcare technologies. with osteoarthritis of the knee. interventions. By Robert G, Stevens A, Gabbay J. A review by Briggs AH, Gray AM. A review by Lord J, Victor C, No. 14 Littlejohns P, Ross FM, Axford JS. No. 3 A systematic review of the role of human The role of expectancies in the placebo papillomavirus testing within a cervical Volume 4, 2000 effect and their use in the delivery of screening programme. health care: a systematic review. No. 1 By Cuzick J, Sasieni P, Davies P, The estimation of marginal time By Crow R, Gage H, Hampson S, Adams J, Normand C, Frater A, et al. Hart J, Kimber A, Thomas H. preference in a UK-wide sample No. 15 (TEMPUS) project. No. 4 Near patient testing in diabetes clinics: A review by Cairns JA, van der Pol A randomised controlled trial of appraising the costs and outcomes. MM. different approaches to universal By Grieve R, Beech R, Vincent J, No. 2 antenatal HIV testing: uptake and Mazurkiewicz J. acceptability. Annex: Antenatal HIV Geriatric rehabilitation following testing – assessment of a routine No. 16 fractures in older people: a systematic voluntary approach. Positron emission tomography: review. By Simpson WM, Johnstone FD, Boyd establishing priorities for health By Cameron I, Crotty M, Currie C, FM, Goldberg DJ, Hart GJ, Gormley technology assessment. Finnegan T, Gillespie L, Gillespie W, 54 SM, et al. A review by Robert G, Milne R. et al. Health Technology Assessment 2007; Vol. 11: No. 50

No. 3 No. 14 No. 24 Screening for sickle cell disease and The determinants of screening uptake Outcome measures for adult critical thalassaemia: a systematic review with and interventions for increasing uptake: care: a systematic review. supplementary research. a systematic review. By Hayes JA, Black NA, By Davies SC, Cronin E, Gill M, By Jepson R, Clegg A, Forbes C, Jenkinson C, Young JD, Rowan KM, Greengross P, Hickman M, Normand C. Lewis R, Sowden A, Kleijnen J. Daly K, et al. No. 4 No. 15 No. 25 Community provision of hearing aids The effectiveness and cost-effectiveness A systematic review to evaluate the and related audiology services. of prophylactic removal of wisdom effectiveness of interventions to A review by Reeves DJ, Alborz A, teeth. promote the initiation of Hickson FS, Bamford JM. A rapid review by Song F, O’Meara S, breastfeeding. By Fairbank L, O’Meara S, Renfrew No. 5 Wilson P, Golder S, Kleijnen J. MJ, Woolridge M, Sowden AJ, False-negative results in screening No. 16 Lister-Sharp D. programmes: systematic review of Ultrasound screening in pregnancy: a impact and implications. systematic review of the clinical No. 26 By Petticrew MP, Sowden AJ, effectiveness, cost-effectiveness and Implantable cardioverter defibrillators: Lister-Sharp D, Wright K. women’s views. arrhythmias. A rapid and systematic No. 6 By Bricker L, Garcia J, Henderson J, review. Costs and benefits of community Mugford M, Neilson J, Roberts T, et al. By Parkes J, Bryant J, Milne R. postnatal support workers: a No. 17 No. 27 randomised controlled trial. A rapid and systematic review of the Treatments for fatigue in multiple By Morrell CJ, Spiby H, Stewart P, effectiveness and cost-effectiveness of sclerosis: a rapid and systematic Walters S, Morgan A. the taxanes used in the treatment of review. No. 7 advanced breast and ovarian cancer. By Brañas P, Jordan R, Fry-Smith A, Implantable contraceptives (subdermal By Lister-Sharp D, McDonagh MS, Burls A, Hyde C. Khan KS, Kleijnen J. implants and hormonally impregnated No. 28 intrauterine systems) versus other forms No. 18 Early asthma prophylaxis, natural of reversible contraceptives: two Liquid-based cytology in cervical history, skeletal development and systematic reviews to assess relative screening: a rapid and systematic review. economy (EASE): a pilot randomised effectiveness, acceptability, tolerability By Payne N, Chilcott J, McGoogan E. controlled trial. and cost-effectiveness. By Baxter-Jones ADG, Helms PJ, No. 19 By French RS, Cowan FM, Mansour Russell G, Grant A, Ross S, Cairns JA, Randomised controlled trial of non- DJA, Morris S, Procter T, Hughes D, et al. et al. directive counselling, cognitive–behaviour therapy and usual No. 29 No. 8 general practitioner care in the Screening for hypercholesterolaemia An introduction to statistical methods management of depression as well as versus case finding for familial for health technology assessment. mixed anxiety and depression in hypercholesterolaemia: a systematic A review by White SJ, Ashby D, primary care. review and cost-effectiveness analysis. Brown PJ. By King M, Sibbald B, Ward E, Bower By Marks D, Wonderling D, No. 9 P, Lloyd M, Gabbay M, et al. Thorogood M, Lambert H, Humphries SE, Neil HAW. Disease-modifying drugs for multiple No. 20 sclerosis: a rapid and systematic Routine referral for radiography of No. 30 review. patients presenting with low back pain: A rapid and systematic review of the By Clegg A, Bryant J, Milne R. is patients’ outcome influenced by GPs’ clinical effectiveness and cost- No. 10 referral for plain radiography? effectiveness of glycoprotein IIb/IIIa Publication and related biases. By Kerry S, Hilton S, Patel S, Dundas antagonists in the medical management A review by Song F, Eastwood AJ, D, Rink E, Lord J. of unstable angina. Gilbody S, Duley L, Sutton AJ. By McDonagh MS, Bachmann LM, No. 21 Golder S, Kleijnen J, ter Riet G. No. 11 Systematic reviews of wound care Cost and outcome implications of the management: (3) antimicrobial agents No. 31 organisation of vascular services. for chronic wounds; (4) diabetic foot A randomised controlled trial of By Michaels J, Brazier J, Palfreyman ulceration. prehospital intravenous fluid S, Shackley P, Slack R. By O’Meara S, Cullum N, Majid M, replacement therapy in serious trauma. Sheldon T. By Turner J, Nicholl J, Webber L, No. 12 Cox H, Dixon S, Yates D. Monitoring blood glucose control in No. 22 diabetes mellitus: a systematic review. Using routine data to complement and No. 32 By Coster S, Gulliford MC, Seed PT, enhance the results of randomised Intrathecal pumps for giving opioids in Powrie JK, Swaminathan R. controlled trials. chronic pain: a systematic review. By Lewsey JD, Leyland AH, By Williams JE, Louw G, Towlerton G. No. 13 Murray GD, Boddy FA. The effectiveness of domiciliary health No. 33 visiting: a systematic review of No. 23 Combination therapy (interferon alfa international studies Coronary artery stents in the treatment and ribavirin) in the treatment of and a selective review of the British of ischaemic heart disease: a rapid and chronic hepatitis C: a rapid and literature. systematic review. systematic review. By Elkan R, Kendrick D, Hewitt M, By Meads C, Cummins C, Jolly K, By Shepherd J, Waugh N, Robinson JJA, Tolley K, Blair M, et al. Stevens A, Burls A, Hyde C. Hewitson P. 55

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Health Technology Assessment reports published to date

No. 34 No. 5 No. 15 A systematic review of comparisons of Eliciting public preferences for Home treatment for mental health effect sizes derived from randomised healthcare: a systematic review of problems: a systematic review. and non-randomised studies. techniques. By Burns T, Knapp M, By MacLehose RR, Reeves BC, By Ryan M, Scott DA, Reeves C, Bate Catty J, Healey A, Henderson J, Harvey IM, Sheldon TA, Russell IT, A, van Teijlingen ER, Russell EM, et al. Watt H, et al. Black AMS. No. 6 No. 16 No. 35 General health status measures for How to develop cost-conscious Intravascular ultrasound-guided people with cognitive impairment: guidelines. interventions in coronary artery disease: learning disability and acquired brain By Eccles M, Mason J. a systematic literature review, with injury. No. 17 decision-analytic modelling, of outcomes By Riemsma RP, Forbes CA, The role of specialist nurses in multiple and cost-effectiveness. Glanville JM, Eastwood AJ, Kleijnen J. sclerosis: a rapid and systematic review. By Berry E, Kelly S, Hutton J, No. 7 By De Broe S, Christopher F, Lindsay HSJ, Blaxill JM, Evans JA, et al. An assessment of screening strategies for Waugh N. No. 36 fragile X syndrome in the UK. A randomised controlled trial to By Pembrey ME, Barnicoat AJ, No. 18 A rapid and systematic review of the evaluate the effectiveness and cost- Carmichael B, Bobrow M, Turner G. clinical effectiveness and cost- effectiveness of counselling patients with No. 8 effectiveness of orlistat in the chronic depression. Issues in methodological research: management of obesity. By Simpson S, Corney R, Fitzgerald P, perspectives from researchers and By O’Meara S, Riemsma R, Beecham J. commissioners. Shirran L, Mather L, ter Riet G. No. 37 By Lilford RJ, Richardson A, Stevens Systematic review of treatments for A, Fitzpatrick R, Edwards S, Rock F, et al. No. 19 The clinical effectiveness and cost- atopic eczema. No. 9 effectiveness of pioglitazone for type 2 By Hoare C, Li Wan Po A, Williams H. Systematic reviews of wound care diabetes mellitus: a rapid and systematic No. 38 management: (5) beds; (6) compression; review. Bayesian methods in health technology (7) laser therapy, therapeutic By Chilcott J, Wight J, Lloyd Jones assessment: a review. ultrasound, electrotherapy and M, Tappenden P. By Spiegelhalter DJ, Myles JP, electromagnetic therapy. Jones DR, Abrams KR. By Cullum N, Nelson EA, Flemming No. 20 K, Sheldon T. Extended scope of nursing practice: a No. 39 multicentre randomised controlled trial The management of dyspepsia: a No. 10 of appropriately trained nurses and systematic review. Effects of educational and psychosocial preregistration house officers in pre- By Delaney B, Moayyedi P, Deeks J, interventions for adolescents with operative assessment in elective general Innes M, Soo S, Barton P, et al. diabetes mellitus: a systematic review. surgery. By Hampson SE, Skinner TC, Hart J, By Kinley H, Czoski-Murray C, No. 40 Storey L, Gage H, Foxcroft D, et al. George S, McCabe C, Primrose J, A systematic review of treatments for Reilly C, et al. severe psoriasis. No. 11 By Griffiths CEM, Clark CM, Chalmers Effectiveness of autologous chondrocyte No. 21 RJG, Li Wan Po A, Williams HC. transplantation for hyaline cartilage Systematic reviews of the effectiveness of defects in knees: a rapid and systematic day care for people with severe mental Volume 5, 2001 review. disorders: (1) Acute day hospital versus By Jobanputra P, Parry D, Fry-Smith admission; (2) Vocational rehabilitation; No. 1 A, Burls A. (3) Day hospital versus outpatient Clinical and cost-effectiveness of care. No. 12 donepezil, rivastigmine and By Marshall M, Crowther R, Almaraz- Statistical assessment of the learning galantamine for Alzheimer’s disease: a Serrano A, Creed F, Sledge W, curves of health technologies. rapid and systematic review. Kluiter H, et al. By Ramsay CR, Grant AM, By Clegg A, Bryant J, Nicholson T, Wallace SA, Garthwaite PH, Monk AF, No. 22 McIntyre L, De Broe S, Gerard K, et al. Russell IT. The measurement and monitoring of No. 2 surgical adverse events. The clinical effectiveness and cost- No. 13 By Bruce J, Russell EM, Mollison J, The effectiveness and cost-effectiveness effectiveness of riluzole for motor Krukowski ZH. of temozolomide for the treatment of neurone disease: a rapid and systematic recurrent malignant glioma: a rapid and No. 23 review. systematic review. Action research: a systematic review and By Stewart A, Sandercock J, Bryan S, By Dinnes J, Cave C, Huang S, guidance for assessment. Hyde C, Barton PM, Fry-Smith A, et al. Major K, Milne R. By Waterman H, Tillen D, Dickson R, No. 3 de Koning K. Equity and the economic evaluation of No. 14 A rapid and systematic review of the No. 24 healthcare. clinical effectiveness and cost- A rapid and systematic review of the By Sassi F, Archard L, Le Grand J. effectiveness of debriding agents in clinical effectiveness and cost- No. 4 treating surgical wounds healing by effectiveness of gemcitabine for the Quality-of-life measures in chronic secondary intention. treatment of pancreatic cancer. diseases of childhood. By Lewis R, Whiting P, ter Riet G, By Ward S, Morris E, Bansback N, 56 By Eiser C, Morse R. O’Meara S, Glanville J. Calvert N, Crellin A, Forman D, et al. Health Technology Assessment 2007; Vol. 11: No. 50

No. 25 No. 35 No. 9 A rapid and systematic review of the A systematic review of controlled trials Zanamivir for the treatment of influenza evidence for the clinical effectiveness of the effectiveness and cost- in adults: a systematic review and and cost-effectiveness of irinotecan, effectiveness of brief psychological economic evaluation. oxaliplatin and raltitrexed for the treatments for depression. By Burls A, Clark W, Stewart T, treatment of advanced colorectal cancer. By Churchill R, Hunot V, Corney R, Preston C, Bryan S, Jefferson T, et al. By Lloyd Jones M, Hummel S, Knapp M, McGuire H, Tylee A, et al. No. 10 Bansback N, Orr B, Seymour M. No. 36 A review of the natural history and No. 26 Cost analysis of child health epidemiology of multiple sclerosis: Comparison of the effectiveness of surveillance. implications for resource allocation and inhaler devices in asthma and chronic By Sanderson D, Wright D, Acton C, health economic models. obstructive airways disease: a systematic Duree D. By Richards RG, Sampson FC, review of the literature. Beard SM, Tappenden P. By Brocklebank D, Ram F, Wright J, Volume 6, 2002 No. 11 Barry P, Cates C, Davies L, et al. No. 1 Screening for gestational diabetes: a No. 27 A study of the methods used to select systematic review and economic The cost-effectiveness of magnetic review criteria for clinical audit. evaluation. resonance imaging for investigation of By Hearnshaw H, Harker R, Cheater By Scott DA, Loveman E, McIntyre L, the knee joint. F, Baker R, Grimshaw G. Waugh N. By Bryan S, Weatherburn G, Bungay No. 12 H, Hatrick C, Salas C, Parry D, et al. No. 2 The clinical effectiveness and cost- Fludarabine as second-line therapy for effectiveness of surgery for people with No. 28 B cell chronic lymphocytic leukaemia: A rapid and systematic review of the morbid obesity: a systematic review and a technology assessment. clinical effectiveness and cost- economic evaluation. By Hyde C, Wake B, Bryan S, Barton effectiveness of topotecan for ovarian By Clegg AJ, Colquitt J, Sidhu MK, P, Fry-Smith A, Davenport C, et al. cancer. Royle P, Loveman E, Walker A. By Forbes C, Shirran L, Bagnall A-M, No. 3 No. 13 Duffy S, ter Riet G. Rituximab as third-line treatment for The clinical effectiveness of trastuzumab refractory or recurrent Stage III or IV for breast cancer: a systematic review. No. 29 follicular non-Hodgkin’s lymphoma: a By Lewis R, Bagnall A-M, Forbes C, Superseded by a report published in a systematic review and economic Shirran E, Duffy S, Kleijnen J, et al. later volume. evaluation. No. 14 No. 30 By Wake B, Hyde C, Bryan S, Barton P, Song F, Fry-Smith A, et al. The clinical effectiveness and cost- The role of radiography in primary care effectiveness of vinorelbine for breast patients with low back pain of at least 6 No. 4 cancer: a systematic review and weeks duration: a randomised A systematic review of discharge economic evaluation. (unblinded) controlled trial. arrangements for older people. By Lewis R, Bagnall A-M, King S, By Kendrick D, Fielding K, Bentley E, By Parker SG, Peet SM, McPherson A, Woolacott N, Forbes C, Shirran L, et al. Miller P, Kerslake R, Pringle M. Cannaby AM, Baker R, Wilson A, et al. No. 15 No. 31 No. 5 A systematic review of the effectiveness Design and use of : a The clinical effectiveness and cost- and cost-effectiveness of metal-on-metal review of best practice applicable to effectiveness of inhaler devices used in hip resurfacing arthroplasty for surveys of health service staff and the routine management of chronic treatment of hip disease. patients. asthma in older children: a systematic By Vale L, Wyness L, McCormack K, By McColl E, Jacoby A, Thomas L, review and economic evaluation. McKenzie L, Brazzelli M, Stearns SC. Soutter J, Bamford C, Steen N, et al. By Peters J, Stevenson M, Beverley C, Lim J, Smith S. No. 16 No. 32 The clinical effectiveness and cost- A rapid and systematic review of the No. 6 effectiveness of bupropion and nicotine clinical effectiveness and cost- The clinical effectiveness and cost- replacement therapy for smoking effectiveness of paclitaxel, docetaxel, effectiveness of sibutramine in the cessation: a systematic review and gemcitabine and vinorelbine in non- management of obesity: a technology economic evaluation. small-cell lung cancer. assessment. By Woolacott NF, Jones L, Forbes CA, By Clegg A, Scott DA, Sidhu M, By O’Meara S, Riemsma R, Shirran Mather LC, Sowden AJ, Song FJ, et al. Hewitson P, Waugh N. L, Mather L, ter Riet G. No. 17 No. 33 No. 7 A systematic review of effectiveness and Subgroup analyses in randomised The cost-effectiveness of magnetic economic evaluation of new drug controlled trials: quantifying the risks of resonance angiography for carotid treatments for juvenile idiopathic false-positives and false-negatives. artery stenosis and peripheral vascular arthritis: etanercept. By Brookes ST, Whitley E, disease: a systematic review. By Cummins C, Connock M, Peters TJ, Mulheran PA, Egger M, By Berry E, Kelly S, Westwood ME, Fry-Smith A, Burls A. Davey Smith G. Davies LM, Gough MJ, Bamford JM, et al. No. 18 No. 34 Clinical effectiveness and cost- Depot antipsychotic medication in the No. 8 effectiveness of growth hormone in treatment of patients with schizophrenia: Promoting physical activity in South children: a systematic review and (1) Meta-review; (2) Patient and nurse Asian Muslim women through ‘exercise economic evaluation. attitudes. on prescription’. By Bryant J, Cave C, Mihaylova B, By David AS, Adams C. By Carroll B, Ali N, Azam N. Chase D, McIntyre L, Gerard K, et al. 57

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Health Technology Assessment reports published to date

No. 19 No. 28 No. 2 Clinical effectiveness and cost- Clinical effectiveness and cost – Systematic review of the effectiveness effectiveness of growth hormone in consequences of selective serotonin and cost-effectiveness, and economic adults in relation to impact on quality of reuptake inhibitors in the treatment of evaluation, of home versus hospital or life: a systematic review and economic sex offenders. satellite unit haemodialysis for people evaluation. By Adi Y, Ashcroft D, Browne K, with end-stage renal failure. By Bryant J, Loveman E, Chase D, Beech A, Fry-Smith A, Hyde C. By Mowatt G, Vale L, Perez J, Wyness Mihaylova B, Cave C, Gerard K, et al. L, Fraser C, MacLeod A, et al. No. 29 No. 20 Treatment of established osteoporosis: No. 3 Clinical medication review by a a systematic review and cost–utility Systematic review and economic pharmacist of patients on repeat analysis. evaluation of the effectiveness of prescriptions in general practice: a By Kanis JA, Brazier JE, Stevenson M, infliximab for the treatment of Crohn’s randomised controlled trial. Calvert NW, Lloyd Jones M. disease. By Zermansky AG, Petty DR, Raynor By Clark W, Raftery J, Barton P, DK, Lowe CJ, Freementle N, Vail A. No. 30 Song F, Fry-Smith A, Burls A. Which anaesthetic agents are cost- No. 4 No. 21 effective in day surgery? Literature A review of the clinical effectiveness and The effectiveness of infliximab and review, national survey of practice and cost-effectiveness of routine anti-D etanercept for the treatment of randomised controlled trial. prophylaxis for pregnant women who rheumatoid arthritis: a systematic review By Elliott RA Payne K, Moore JK, are rhesus negative. and economic evaluation. Davies LM, Harper NJN, St Leger AS, By Chilcott J, Lloyd Jones M, Wight By Jobanputra P, Barton P, Bryan S, et al. J, Forman K, Wray J, Beverley C, et al. Burls A. No. 31 No. 5 No. 22 Screening for hepatitis C among Systematic review and evaluation of the A systematic review and economic injecting drug users and in use of tumour markers in paediatric evaluation of computerised cognitive genitourinary medicine clinics: oncology: Ewing’s sarcoma and behaviour therapy for depression and systematic reviews of effectiveness, neuroblastoma. anxiety. modelling study and national survey of By Riley RD, Burchill SA, Abrams KR, By Kaltenthaler E, Shackley P, Stevens current practice. Heney D, Lambert PC, Jones DR, et al. K, Beverley C, Parry G, Chilcott J. By Stein K, Dalziel K, Walker A, McIntyre L, Jenkins B, Horne J, et al. No. 6 No. 23 The cost-effectiveness of screening for A systematic review and economic No. 32 Helicobacter pylori to reduce mortality evaluation of pegylated liposomal The measurement of satisfaction with and morbidity from gastric cancer and doxorubicin hydrochloride for ovarian healthcare: implications for practice peptic ulcer disease: a discrete-event cancer. from a systematic review of the simulation model. By Forbes C, Wilby J, Richardson G, literature. By Roderick P, Davies R, Raftery J, Sculpher M, Mather L, Reimsma R. By Crow R, Gage H, Hampson S, Crabbe D, Pearce R, Bhandari P, et al. Hart J, Kimber A, Storey L, et al. No. 24 No. 7 A systematic review of the effectiveness No. 33 The clinical effectiveness and cost- of interventions based on a stages-of- The effectiveness and cost-effectiveness effectiveness of routine dental checks: a change approach to promote individual of imatinib in chronic myeloid systematic review and economic behaviour change. leukaemia: a systematic review. evaluation. By Riemsma RP, Pattenden J, Bridle C, By Garside R, Round A, Dalziel K, By Davenport C, Elley K, Salas C, Sowden AJ, Mather L, Watt IS, et al. Stein K, Royle R. Taylor-Weetman CL, Fry-Smith A, Bryan S, et al. No. 25 No. 34 No. 8 A systematic review update of the A comparative study of hypertonic A multicentre randomised controlled clinical effectiveness and cost- saline, daily and alternate-day trial assessing the costs and benefits of effectiveness of glycoprotein IIb/IIIa rhDNase in children with cystic using structured information and antagonists. fibrosis. analysis of women’s preferences in the By Robinson M, Ginnelly L, Sculpher By Suri R, Wallis C, Bush A, management of menorrhagia. M, Jones L, Riemsma R, Palmer S, et al. Thompson S, Normand C, Flather M, By Kennedy ADM, Sculpher MJ, et al. No. 26 Coulter A, Dwyer N, Rees M, A systematic review of the effectiveness, No. 35 Horsley S, et al. cost-effectiveness and barriers to A systematic review of the costs and No. 9 implementation of thrombolytic and effectiveness of different models of Clinical effectiveness and cost–utility of neuroprotective therapy for acute paediatric home care. photodynamic therapy for wet age-related ischaemic stroke in the NHS. By Parker G, Bhakta P, Lovett CA, macular degeneration: a systematic review By Sandercock P, Berge E, Dennis M, Paisley S, Olsen R, Turner D, et al. and economic evaluation. Forbes J, Hand P, Kwan J, et al. By Meads C, Salas C, Roberts T, Volume 7, 2003 No. 27 Moore D, Fry-Smith A, Hyde C. A randomised controlled crossover No. 1 No. 10 trial of nurse practitioner versus How important are comprehensive Evaluation of molecular tests for doctor-led outpatient care in a literature searches and the assessment of prenatal diagnosis of chromosome bronchiectasis clinic. trial quality in systematic reviews? abnormalities. By Caine N, Sharples LD, Empirical study. By Grimshaw GM, Szczepura A, Hollingworth W, French J, Keogan M, By Egger M, Jüni P, Bartlett C, Hultén M, MacDonald F, Nevin NC, 58 Exley A, et al. Holenstein F, Sterne J. Sutton F, et al. Health Technology Assessment 2007; Vol. 11: No. 50

No. 11 No. 21 No. 30 First and second trimester antenatal Systematic review of the clinical The value of digital imaging in diabetic screening for Down’s syndrome: the effectiveness and cost-effectiveness retinopathy. results of the Serum, Urine and of tension-free vaginal tape for By Sharp PF, Olson J, Strachan F, Ultrasound Screening Study treatment of urinary stress Hipwell J, Ludbrook A, O’Donnell M, (SURUSS). incontinence. et al. By Wald NJ, Rodeck C, By Cody J, Wyness L, Wallace S, Hackshaw AK, Walters J, Chitty L, Glazener C, Kilonzo M, Stearns S, No. 31 Mackinson AM. et al. Lowering blood pressure to prevent myocardial infarction and stroke: No. 12 No. 22 a new preventive strategy. The effectiveness and cost-effectiveness The clinical and cost-effectiveness of By Law M, Wald N, Morris J. of ultrasound locating devices for patient education models for diabetes: central venous access: a systematic a systematic review and economic No. 32 review and economic evaluation. evaluation. Clinical and cost-effectiveness of By Calvert N, Hind D, McWilliams By Loveman E, Cave C, Green C, capecitabine and tegafur with uracil for RG, Thomas SM, Beverley C, Royle P, Dunn N, Waugh N. the treatment of metastatic colorectal Davidson A. cancer: systematic review and economic No. 23 evaluation. No. 13 The role of modelling in prioritising By Ward S, Kaltenthaler E, Cowan J, A systematic review of atypical and planning clinical trials. Brewer N. antipsychotics in schizophrenia. By Chilcott J, Brennan A, Booth A, By Bagnall A-M, Jones L, Lewis R, Karnon J, Tappenden P. No. 33 Ginnelly L, Glanville J, Torgerson D, Clinical and cost-effectiveness of new et al. No. 24 and emerging technologies for early Cost–benefit evaluation of routine localised prostate cancer: a systematic No. 14 Prostate Testing for Cancer and influenza immunisation in people 65–74 review. Treatment (ProtecT) feasibility years of age. By Hummel S, Paisley S, Morgan A, study. By Allsup S, Gosney M, Haycox A, Currie E, Brewer N. Regan M. By Donovan J, Hamdy F, Neal D, No. 34 Peters T, Oliver S, Brindle L, et al. No. 25 Literature searching for clinical and No. 15 The clinical and cost-effectiveness of cost-effectiveness studies used in health Early thrombolysis for the treatment pulsatile machine perfusion versus cold technology assessment reports carried of acute myocardial infarction: a storage of kidneys for transplantation out for the National Institute for systematic review and economic retrieved from heart-beating and non- Clinical Excellence appraisal evaluation. heart-beating donors. system. By Boland A, Dundar Y, Bagust A, By Wight J, Chilcott J, Holmes M, By Royle P, Waugh N. Haycox A, Hill R, Mujica Mota R, Brewer N. No. 35 et al. No. 26 Systematic review and economic No. 16 Can randomised trials rely on existing decision modelling for the prevention Screening for fragile X syndrome: electronic data? A feasibility study to and treatment of influenza a literature review and modelling. explore the value of routine data in A and B. By Song FJ, Barton P, Sleightholme V, health technology assessment. By Turner D, Wailoo A, Nicholson K, Yao GL, Fry-Smith A. By Williams JG, Cheung WY, Cooper N, Sutton A, Abrams K. Cohen DR, Hutchings HA, Longo MF, No. 17 Russell IT. No. 36 Systematic review of endoscopic sinus A randomised controlled trial to surgery for nasal polyps. No. 27 evaluate the clinical and cost- By Dalziel K, Stein K, Round A, Evaluating non-randomised intervention effectiveness of Hickman line Garside R, Royle P. studies. insertions in adult cancer patients by By Deeks JJ, Dinnes J, D’Amico R, nurses. No. 18 Sowden AJ, Sakarovitch C, Song F, et al. By Boland A, Haycox A, Bagust A, Towards efficient guidelines: how to Fitzsimmons L. monitor guideline use in primary No. 28 care. A randomised controlled trial to assess No. 37 By Hutchinson A, McIntosh A, Cox S, the impact of a package comprising a Redesigning postnatal care: a Gilbert C. patient-orientated, evidence-based self- randomised controlled trial of No. 19 help guidebook and patient-centred protocol-based midwifery-led care Effectiveness and cost-effectiveness of consultations on disease management focused on individual women’s acute hospital-based spinal cord injuries and satisfaction in inflammatory bowel physical and psychological health services: systematic review. disease. needs. By Bagnall A-M, Jones L, By Kennedy A, Nelson E, Reeves D, By MacArthur C, Winter HR, Richardson G, Duffy S, Richardson G, Roberts C, Robinson A, Bick DE, Lilford RJ, Lancashire RJ, Riemsma R. et al. Knowles H, et al. No. 20 No. 29 No. 38 Prioritisation of health technology The effectiveness of diagnostic tests for Estimating implied rates of discount in assessment. The PATHS model: the assessment of shoulder pain due to healthcare decision-making. methods and case studies. soft tissue disorders: a systematic review. By West RR, McNabb R, By Townsend J, Buxton M, By Dinnes J, Loveman E, McIntyre L, Thompson AGH, Sheldon TA, Harper G. Waugh N. Grimley Evans J. 59

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Health Technology Assessment reports published to date

No. 39 No. 7 No. 15 Systematic review of isolation policies in Clinical effectiveness and costs of the Involving consumers in research and the hospital management of methicillin- Sugarbaker procedure for the treatment development agenda setting for the resistant Staphylococcus aureus: a review of of pseudomyxoma peritonei. NHS: developing an evidence-based the literature with epidemiological and By Bryant J, Clegg AJ, Sidhu MK, approach. economic modelling. Brodin H, Royle P, Davidson P. By Oliver S, Clarke-Jones L, By Cooper BS, Stone SP, Kibbler CC, Rees R, Milne R, Buchanan P, No. 8 Cookson BD, Roberts JA, Medley GF, Gabbay J, et al. et al. Psychological treatment for insomnia in the regulation of long-term hypnotic No. 16 No. 40 drug use. Treatments for spasticity and pain in A multi-centre randomised controlled By Morgan K, Dixon S, Mathers N, multiple sclerosis: a systematic review. trial of minimally invasive direct Thompson J, Tomeny M. By Beard S, Hunn A, Wight J. coronary bypass grafting versus percutaneous transluminal coronary No. 9 No. 41 angioplasty with stenting for proximal The inclusion of reports of randomised Improving the evaluation of therapeutic interventions in multiple sclerosis: stenosis of the left anterior descending trials published in languages other than coronary artery. English in systematic reviews. development of a patient-based measure By Reeves BC, Angelini GD, Bryan By Moher D, Pham B, Lawson ML, of outcome. AJ, Taylor FC, Cripps T, Spyt TJ, et al. Klassen TP. By Hobart JC, Riazi A, Lamping DL, Fitzpatrick R, Thompson AJ. No. 42 No. 17 The impact of screening on future No. 10 Does early magnetic resonance imaging health-promoting behaviours and health A systematic review and economic influence management or improve beliefs: a systematic review. evaluation of magnetic resonance outcome in patients referred to By Bankhead CR, Brett J, Bukach C, cholangiopancreatography compared secondary care with low back pain? Webster P, Stewart-Brown S, Munafo M, with diagnostic endoscopic retrograde A pragmatic randomised controlled et al. cholangiopancreatography. trial. By Kaltenthaler E, Bravo Vergel Y, By Gilbert FJ, Grant AM, Gillan Volume 8, 2004 Chilcott J, Thomas S, Blakeborough T, MGC, Vale L, Scott NW, Campbell MK, et al. No. 1 Walters SJ, et al. What is the best imaging strategy for No. 11 No. 18 acute stroke? The use of modelling to evaluate new The clinical and cost-effectiveness of By Wardlaw JM, Keir SL, Seymour J, drugs for patients with a chronic anakinra for the treatment of Lewis S, Sandercock PAG, Dennis MS, condition: the case of antibodies against rheumatoid arthritis in adults: a et al. tumour necrosis factor in rheumatoid systematic review and economic analysis. No. 2 arthritis. By Clark W, Jobanputra P, Barton P, Systematic review and modelling of the By Barton P, Jobanputra P, Wilson J, Burls A. investigation of acute and chronic chest Bryan S, Burls A. pain presenting in primary care. No. 19 By Mant J, McManus RJ, Oakes RAL, No. 12 A rapid and systematic review and Delaney BC, Barton PM, Deeks JJ, et al. Clinical effectiveness and cost- economic evaluation of the clinical and effectiveness of neonatal screening for cost-effectiveness of newer drugs for No. 3 inborn errors of metabolism using treatment of mania associated with The effectiveness and cost-effectiveness tandem mass spectrometry: a systematic bipolar affective disorder. of microwave and thermal balloon review. By Bridle C, Palmer S, Bagnall A-M, endometrial ablation for heavy By Pandor A, Eastham J, Beverley C, Darba J, Duffy S, Sculpher M, et al. menstrual bleeding: a systematic review Chilcott J, Paisley S. and economic modelling. No. 20 By Garside R, Stein K, Wyatt K, No. 13 Liquid-based cytology in cervical Round A, Price A. Clinical effectiveness and cost- screening: an updated rapid and No. 4 effectiveness of pioglitazone and systematic review and economic A systematic review of the role of rosiglitazone in the treatment of analysis. bisphosphonates in metastatic disease. type 2 diabetes: a systematic By Karnon J, Peters J, Platt J, By Ross JR, Saunders Y, Edmonds PM, review and economic Chilcott J, McGoogan E, Brewer N. Patel S, Wonderling D, Normand C, et al. evaluation. By Czoski-Murray C, Warren E, No. 21 No. 5 Chilcott J, Beverley C, Psyllaki MA, Systematic review of the long-term Systematic review of the clinical Cowan J. effects and economic consequences of effectiveness and cost-effectiveness of ® treatments for obesity and implications capecitabine (Xeloda ) for locally No. 14 for health improvement. advanced and/or metastatic breast cancer. Routine examination of the newborn: By Avenell A, Broom J, Brown TJ, By Jones L, Hawkins N, Westwood M, the EMREN study. Evaluation of an Poobalan A, Aucott L, Stearns SC, et al. Wright K, Richardson G, Riemsma R. extension of the midwife role No. 6 including a randomised controlled No. 22 Effectiveness and efficiency of guideline trial of appropriately trained midwives Autoantibody testing in children with dissemination and implementation and paediatric senior house newly diagnosed type 1 diabetes strategies. officers. mellitus. By Grimshaw JM, Thomas RE, By Townsend J, Wolke D, Hayes J, By Dretzke J, Cummins C, MacLennan G, Fraser C, Ramsay CR, Davé S, Rogers C, Bloomfield L, Sandercock J, Fry-Smith A, Barrett T, 60 Vale L, et al. et al. Burls A. Health Technology Assessment 2007; Vol. 11: No. 50

No. 23 No. 32 No. 41 Clinical effectiveness and cost- The Social Support and Family Health Provision, uptake and cost of cardiac effectiveness of prehospital intravenous Study: a randomised controlled trial and rehabilitation programmes: improving fluids in trauma patients. economic evaluation of two alternative services to under-represented groups. By Dretzke J, Sandercock J, Bayliss S, forms of postnatal support for mothers By Beswick AD, Rees K, Griebsch I, Burls A. living in disadvantaged inner-city areas. Taylor FC, Burke M, West RR, et al. By Wiggins M, Oakley A, Roberts I, No. 24 No. 42 Newer hypnotic drugs for the short- Turner H, Rajan L, Austerberry H, et al. Involving South Asian patients in term management of insomnia: a No. 33 clinical trials. systematic review and economic Psychosocial aspects of genetic screening By Hussain-Gambles M, Leese B, evaluation. of pregnant women and newborns: Atkin K, Brown J, Mason S, Tovey P. By Dündar Y, Boland A, Strobl J, a systematic review. No. 43 Dodd S, Haycox A, Bagust A, By Green JM, Hewison J, Bekker HL, Clinical and cost-effectiveness of et al. Bryant, Cuckle HS. continuous subcutaneous insulin No. 25 infusion for diabetes. No. 34 Development and validation of methods By Colquitt JL, Green C, Sidhu MK, Evaluation of abnormal uterine for assessing the quality of diagnostic Hartwell D, Waugh N. bleeding: comparison of three accuracy studies. outpatient procedures within cohorts No. 44 By Whiting P, Rutjes AWS, Dinnes J, defined by age and menopausal status. Identification and assessment of Reitsma JB, Bossuyt PMM, Kleijnen J. By Critchley HOD, Warner P, ongoing trials in health technology No. 26 Lee AJ, Brechin S, Guise J, Graham B. assessment reviews. EVALUATE hysterectomy trial: a By Song FJ, Fry-Smith A, Davenport multicentre randomised trial comparing No. 35 C, Bayliss S, Adi Y, Wilson JS, et al. abdominal, vaginal and laparoscopic Coronary artery stents: a rapid systematic review and economic evaluation. No. 45 methods of hysterectomy. Systematic review and economic By Hill R, Bagust A, Bakhai A, By Garry R, Fountain J, Brown J, evaluation of a long-acting insulin Dickson R, Dündar Y, Haycox A, et al. Manca A, Mason S, Sculpher M, et al. analogue, insulin glargine No. 27 No. 36 By Warren E, Weatherley-Jones E, Methods for expected value of Review of guidelines for good practice Chilcott J, Beverley C. information analysis in complex health in decision-analytic modelling in health No. 46 economic models: developments on the technology assessment. Supplementation of a home-based ␤ health economics of interferon- and By Philips Z, Ginnelly L, Sculpher M, exercise programme with a class-based glatiramer acetate for multiple sclerosis. Claxton K, Golder S, Riemsma R, et al. programme for people with osteoarthritis By Tappenden P, Chilcott JB, of the knees: a randomised controlled Eggington S, Oakley J, McCabe C. No. 37 Rituximab (MabThera®) for aggressive trial and health economic analysis. By McCarthy CJ, Mills PM, No. 28 non-Hodgkin’s lymphoma: systematic Pullen R, Richardson G, Hawkins N, Effectiveness and cost-effectiveness of review and economic evaluation. Roberts CR, et al. imatinib for first-line treatment of By Knight C, Hind D, Brewer N, chronic myeloid leukaemia in chronic Abbott V. No. 47 phase: a systematic review and economic Clinical and cost-effectiveness of once- analysis. No. 38 daily versus more frequent use of same By Dalziel K, Round A, Stein K, Clinical effectiveness and cost- potency topical corticosteroids for atopic Garside R, Price A. effectiveness of clopidogrel and eczema: a systematic review and modified-release dipyridamole in the economic evaluation. No. 29 secondary prevention of occlusive VenUS I: a randomised controlled trial By Green C, Colquitt JL, Kirby J, vascular events: a systematic review and Davidson P, Payne E. of two types of bandage for treating economic evaluation. venous leg ulcers. By Jones L, Griffin S, Palmer S, Main No. 48 By Iglesias C, Nelson EA, Cullum NA, C, Orton V, Sculpher M, et al. Acupuncture of chronic headache Torgerson DJ on behalf of the VenUS disorders in primary care: randomised Team. No. 39 controlled trial and economic analysis. Pegylated interferon ␣-2a and -2b in By Vickers AJ, Rees RW, Zollman CE, No. 30 combination with ribavirin in the McCarney R, Smith CM, Ellis N, et al. Systematic review of the effectiveness treatment of chronic hepatitis C: a and cost-effectiveness, and economic No. 49 systematic review and economic evaluation, of myocardial perfusion Generalisability in economic evaluation scintigraphy for the diagnosis and evaluation. studies in healthcare: a review and case management of angina and myocardial By Shepherd J, Brodin H, Cave C, studies. infarction. Waugh N, Price A, Gabbay J. By Sculpher MJ, Pang FS, Manca A, By Mowatt G, Vale L, Brazzelli M, No. 40 Drummond MF, Golder S, Urdahl H, Hernandez R, Murray A, Scott N, et al. Clopidogrel used in combination with et al. No. 31 aspirin compared with aspirin alone in No. 50 A pilot study on the use of decision the treatment of non-ST-segment- Virtual outreach: a randomised theory and value of information analysis elevation acute coronary syndromes: a controlled trial and economic evaluation as part of the NHS Health Technology systematic review and economic of joint teleconferenced medical Assessment programme. evaluation. consultations. By Claxton K,K Ginnelly L, Sculpher By Main C, Palmer S, Griffin S, By Wallace P, Barber J, Clayton W, M, Philips Z, Palmer S. Jones L, Orton V, Sculpher M, et al. Currell R, Fleming K, Garner P, et al. 61

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Health Technology Assessment reports published to date

Volume 9, 2005 No. 10 No. 18 Measurement of health-related quality A randomised controlled comparison of No. 1 of life for people with dementia: alternative strategies in stroke care. Randomised controlled multiple development of a new instrument By Kalra L, Evans A, Perez I, treatment comparison to provide a (DEMQOL) and an evaluation of Knapp M, Swift C, Donaldson N. cost-effectiveness rationale for the current methodology. selection of antimicrobial therapy in No. 19 By Smith SC, Lamping DL, acne. The investigation and analysis of critical Banerjee S, Harwood R, Foley B, By Ozolins M, Eady EA, Avery A, incidents and adverse events in Smith P, et al. Cunliffe WJ, O’Neill C, Simpson NB, healthcare. et al. By Woloshynowych M, Rogers S, No. 11 Taylor-Adams S, Vincent C. No. 2 Clinical effectiveness and cost- Do the findings of studies effectiveness of drotrecogin alfa No. 20 ® vary significantly according to (activated) (Xigris ) for the treatment of Potential use of routine databases in methodological characteristics? severe sepsis in adults: a systematic health technology assessment. By Dalziel K, Round A, Stein K, review and economic evaluation. By Raftery J, Roderick P, Stevens A. By Green C, Dinnes J, Takeda A, Garside R, Castelnuovo E, Payne L. No. 21 Shepherd J, Hartwell D, Cave C, Clinical and cost-effectiveness of newer No. 3 et al. immunosuppressive regimens in renal Improving the referral process for transplantation: a systematic review and familial breast cancer genetic No. 12 modelling study. counselling: findings of three A methodological review of how By Woodroffe R, Yao GL, Meads C, randomised controlled trials of two heterogeneity has been examined in Bayliss S, Ready A, Raftery J, et al. interventions. systematic reviews of diagnostic test By Wilson BJ, Torrance N, Mollison J, accuracy. No. 22 Wordsworth S, Gray JR, Haites NE, et al. By Dinnes J, Deeks J, Kirby J, A systematic review and economic Roderick P. evaluation of alendronate, etidronate, No. 4 risedronate, raloxifene and teriparatide Randomised evaluation of alternative No. 13 for the prevention and treatment of electrosurgical modalities to treat Cervical screening programmes: can postmenopausal osteoporosis. bladder outflow obstruction in men with automation help? Evidence from By Stevenson M, Lloyd Jones M, benign prostatic hyperplasia. systematic reviews, an economic analysis De Nigris E, Brewer N, Davis S, By Fowler C, McAllister W, Plail R, and a simulation modelling exercise Oakley J. Karim O, Yang Q. applied to the UK. No. 23 No. 5 By Willis BH, Barton P, Pearmain P, A systematic review to examine the A pragmatic randomised controlled trial Bryan S, Hyde C. impact of psycho-educational of the cost-effectiveness of palliative interventions on health outcomes and for patients with inoperable No. 14 costs in adults and children with difficult oesophageal cancer. Laparoscopic surgery for inguinal asthma. By Shenfine J, McNamee P, Steen N, hernia repair: systematic review of By Smith JR, Mugford M, Holland R, Bond J, Griffin SM. effectiveness and economic evaluation. Candy B, Noble MJ, Harrison BDW, et al. By McCormack K, Wake B, Perez J, No. 6 Fraser C, Cook J, McIntosh E, et al. No. 24 Impact of computer-aided detection An evaluation of the costs, effectiveness prompts on the sensitivity and specificity No. 15 and quality of renal replacement of screening mammography. Clinical effectiveness, tolerability and therapy provision in renal satellite units By Taylor P, Champness J, Given- cost-effectiveness of newer drugs for in England and Wales. Wilson R, Johnston K, Potts H. epilepsy in adults: a systematic review By Roderick P, Nicholson T, Armitage and economic evaluation. A, Mehta R, Mullee M, Gerard K, et al. No. 7 By Wilby J, Kainth A, Hawkins N, No. 25 Issues in data monitoring and interim Epstein D, McIntosh H, McDaid C, et al. Imatinib for the treatment of patients analysis of trials. with unresectable and/or metastatic By Grant AM, Altman DG, Babiker No. 16 gastrointestinal stromal tumours: AB, Campbell MK, Clemens FJ, A randomised controlled trial to systematic review and economic Darbyshire JH, et al. compare the cost-effectiveness of evaluation. No. 8 tricyclic antidepressants, selective By Wilson J, Connock M, Song F, Yao Lay public’s understanding of equipoise serotonin reuptake inhibitors and G, Fry-Smith A, Raftery J, et al. and randomisation in randomised lofepramine. By Peveler R, Kendrick T, Buxton M, No. 26 controlled trials. Indirect comparisons of competing By Robinson EJ, Kerr CEP, Stevens Longworth L, Baldwin D, Moore M, et al. interventions. AJ, Lilford RJ, Braunholtz DA, Edwards By Glenny AM, Altman DG, Song F, SJ, et al. Sakarovitch C, Deeks JJ, D’Amico R, et al. No. 17 No. 9 Clinical effectiveness and cost- No. 27 Clinical and cost-effectiveness of effectiveness of immediate angioplasty Cost-effectiveness of alternative electroconvulsive therapy for depressive for acute myocardial infarction: strategies for the initial medical illness, schizophrenia, catatonia and systematic review and economic management of non-ST elevation acute mania: systematic reviews and economic evaluation. coronary syndrome: systematic review modelling studies. By Hartwell D, Colquitt J, Loveman and decision-analytical modelling. By Greenhalgh J, Knight C, Hind D, E, Clegg AJ, Brodin H, Waugh N, By Robinson M, Palmer S, Sculpher 62 Beverley C, Walters S. et al. M, Philips Z, Ginnelly L, Bowens A, et al. Health Technology Assessment 2007; Vol. 11: No. 50

No. 28 No. 38 No. 46 Outcomes of electrically stimulated The causes and effects of socio- The effectiveness of the Heidelberg gracilis neosphincter surgery. demographic exclusions from clinical Retina Tomograph and laser diagnostic By Tillin T, Chambers M, Feldman R. trials. glaucoma scanning system (GDx) in By Bartlett C, Doyal L, Ebrahim S, detecting and monitoring glaucoma. No. 29 Davey P, Bachmann M, Egger M, By Kwartz AJ, Henson DB, The effectiveness and cost-effectiveness et al. Harper RA, Spencer AF, of pimecrolimus and tacrolimus for McLeod D. atopic eczema: a systematic review and No. 39 economic evaluation. Is hydrotherapy cost-effective? A No. 47 By Garside R, Stein K, Castelnuovo E, randomised controlled trial of combined Clinical and cost-effectiveness of Pitt M, Ashcroft D, Dimmock P, et al. hydrotherapy programmes compared autologous chondrocyte implantation with physiotherapy land techniques in for cartilage defects in knee joints: No. 30 children with juvenile idiopathic systematic review and economic Systematic review on urine albumin arthritis. evaluation. testing for early detection of diabetic By Epps H, Ginnelly L, Utley M, By Clar C, Cummins E, McIntyre L, complications. Southwood T, Gallivan S, Sculpher M, Thomas S, Lamb J, Bain L, et al. By Newman DJ, Mattock MB, Dawnay et al. ABS, Kerry S, McGuire A, Yaqoob M, et al. No. 48 No. 40 Systematic review of effectiveness of No. 31 A randomised controlled trial and cost- different treatments for childhood Randomised controlled trial of the cost- effectiveness study of systematic retinoblastoma. effectiveness of water-based therapy for screening (targeted and total population By McDaid C, Hartley S, lower limb osteoarthritis. screening) versus routine practice for Bagnall A-M, Ritchie G, Light K, By Cochrane T, Davey RC, the detection of atrial fibrillation in Riemsma R. Matthes Edwards SM. people aged 65 and over. The SAFE No. 49 No. 32 study. Towards evidence-based guidelines Longer term clinical and economic By Hobbs FDR, Fitzmaurice DA, for the prevention of venous benefits of offering acupuncture care to Mant J, Murray E, Jowett S, Bryan S, thromboembolism: systematic patients with chronic low back pain. et al. reviews of mechanical methods, oral By Thomas KJ, MacPherson H, No. 41 anticoagulation, dextran and regional Ratcliffe J, Thorpe L, Brazier J, Displaced intracapsular hip fractures anaesthesia as thromboprophylaxis. Campbell M, et al. in fit, older people: a randomised By Roderick P, Ferris G, Wilson K, No. 33 comparison of reduction and fixation, Halls H, Jackson D, Collins R, Cost-effectiveness and safety of epidural bipolar hemiarthroplasty and total hip et al. steroids in the management of sciatica. arthroplasty. No. 50 By Price C, Arden N, Coglan L, By Keating JF, Grant A, Masson M, The effectiveness and cost-effectiveness Rogers P. Scott NW, Forbes JF. of parent training/education No. 34 No. 42 programmes for the treatment of The British Rheumatoid Outcome Study Long-term outcome of cognitive conduct disorder, including oppositional Group (BROSG) randomised controlled behaviour therapy clinical trials in defiant disorder, in children. trial to compare the effectiveness and central Scotland. By Dretzke J, Frew E, Davenport C, cost-effectiveness of aggressive versus By Durham RC, Chambers JA, Barlow J, Stewart-Brown S, symptomatic therapy in established Power KG, Sharp DM, Macdonald RR, Sandercock J, et al. rheumatoid arthritis. Major KA, et al. By Symmons D, Tricker K, Roberts C, Volume 10, 2006 No. 43 Davies L, Dawes P, Scott DL. The effectiveness and cost-effectiveness No. 1 No. 35 of dual-chamber pacemakers compared The clinical and cost-effectiveness of Conceptual framework and systematic with single-chamber pacemakers for donepezil, rivastigmine, galantamine review of the effects of participants’ and bradycardia due to atrioventricular and memantine for Alzheimer’s professionals’ preferences in randomised block or sick sinus syndrome: disease. controlled trials. systematic review and economic By Loveman E, Green C, Kirby J, By King M, Nazareth I, Lampe F, evaluation. Takeda A, Picot J, Payne E, Bower P, Chandler M, Morou M, et al. By Castelnuovo E, Stein K, Pitt M, et al. Garside R, Payne E. No. 36 No. 2 The clinical and cost-effectiveness of No. 44 FOOD: a multicentre randomised trial implantable cardioverter defibrillators: Newborn screening for congenital heart evaluating feeding policies in patients a systematic review. defects: a systematic review and admitted to hospital with a recent By Bryant J, Brodin H, Loveman E, cost-effectiveness analysis. stroke. Payne E, Clegg A. By Knowles R, Griebsch I, Dezateux By Dennis M, Lewis S, Cranswick G, C, Brown J, Bull C, Wren C. Forbes J. No. 37 A trial of problem-solving by community No. 45 No. 3 mental health nurses for anxiety, The clinical and cost-effectiveness of left The clinical effectiveness and cost- depression and life difficulties among ventricular assist devices for end-stage effectiveness of computed tomography general practice patients. The CPN-GP heart failure: a systematic review and screening for lung cancer: systematic study. economic evaluation. reviews. By Kendrick T, Simons L, By Clegg AJ, Scott DA, Loveman E, By Black C, Bagust A, Boland A, Mynors-Wallis L, Gray A, Lathlean J, Colquitt J, Hutchinson J, Royle P, Walker S, McLeod C, De Verteuil R, Pickering R, et al. et al. et al. 63

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Health Technology Assessment reports published to date

No. 4 No. 12 No. 20 A systematic review of the effectiveness A series of systematic reviews to inform A systematic review of the clinical and cost-effectiveness of neuroimaging a decision analysis for and effectiveness and cost-effectiveness of assessments used to visualise the treating infected diabetic foot enzyme replacement therapies for seizure focus in people with refractory ulcers. Fabry’s disease and epilepsy being considered for By Nelson EA, O’Meara S, Craig D, mucopolysaccharidosis type 1. surgery. Iglesias C, Golder S, Dalton J, By Connock M, Juarez-Garcia A, Frew By Whiting P, Gupta R, Burch J, et al. E, Mans A, Dretzke J, Fry-Smith A, et al. Mujica Mota RE, Wright K, Marson A, et al. No. 13 No. 21 Randomised , observational Health benefits of antiviral therapy for No. 5 study and assessment of cost- mild chronic hepatitis C: randomised Comparison of conference abstracts effectiveness of the treatment of varicose controlled trial and economic and presentations with full-text veins (REACTIV trial). evaluation. articles in the health technology By Michaels JA, Campbell WB, By Wright M, Grieve R, Roberts J, assessments of rapidly evolving Brazier JE, MacIntyre JB, Palfreyman Main J, Thomas HC on behalf of the technologies. SJ, Ratcliffe J, et al. UK Mild Hepatitis C Trial Investigators. By Dundar Y, Dodd S, Dickson R, No. 22 Walley T, Haycox A, Williamson PR. No. 14 Pressure relieving support surfaces: a The cost-effectiveness of screening for No. 6 randomised evaluation. oral cancer in primary care. Systematic review and evaluation of By Nixon J, Nelson EA, Cranny G, By Speight PM, Palmer S, Moles DR, methods of assessing urinary Iglesias CP, Hawkins K, Cullum NA, et al. incontinence. Downer MC, Smith DH, Henriksson M By Martin JL, Williams KS, Abrams et al. No. 23 KR, Turner DA, Sutton AJ, Chapple C, A systematic review and economic No. 15 et al. model of the effectiveness and cost- Measurement of the clinical and cost- effectiveness of methylphenidate, No. 7 effectiveness of non-invasive diagnostic dexamfetamine and atomoxetine for the The clinical effectiveness and cost- testing strategies for deep vein treatment of attention deficit effectiveness of newer drugs for children thrombosis. hyperactivity disorder in children and with epilepsy. A systematic review. By Goodacre S, Sampson F, Stevenson adolescents. By Connock M, Frew E, Evans B-W, M, Wailoo A, Sutton A, Thomas S, et al. By King S, Griffin S, Hodges Z, Bryan S, Cummins C, Fry-Smith A, Weatherly H, Asseburg C, Richardson G, et al. No. 16 et al. Systematic review of the effectiveness No. 8 and cost-effectiveness of HealOzone® No. 24 Surveillance of Barrett’s oesophagus: for the treatment of occlusal pit/fissure The clinical effectiveness and cost- exploring the uncertainty through caries and root caries. effectiveness of enzyme replacement systematic review, expert workshop and By Brazzelli M, McKenzie L, Fielding therapy for Gaucher’s disease: economic modelling. S, Fraser C, Clarkson J, Kilonzo M, a systematic review. By Garside R, Pitt M, Somerville M, et al. By Connock M, Burls A, Frew E, Stein K, Price A, Gilbert N. Fry-Smith A, Juarez-Garcia A, No. 17 McCabe C, et al. No. 9 Randomised controlled trials of Topotecan, pegylated liposomal conventional antipsychotic versus new No. 25 Effectiveness and cost-effectiveness of doxorubicin hydrochloride and atypical drugs, and new atypical drugs salicylic acid and cryotherapy for paclitaxel for second-line or subsequent versus clozapine, in people with cutaneous warts. An economic decision treatment of advanced ovarian cancer: a schizophrenia responding poorly model. systematic review and economic to, or intolerant of, current drug By Thomas KS, Keogh-Brown MR, evaluation. treatment. By Main C, Bojke L, Griffin S, Chalmers JR, Fordham RJ, Holland RC, By Lewis SW, Davies L, Jones PB, Norman G, Barbieri M, Mather L, Armstrong SJ, et al. Barnes TRE, Murray RM, Kerwin R, et al. et al. No. 26 No. 10 A systematic literature review of the Evaluation of molecular techniques No. 18 effectiveness of non-pharmacological in prediction and diagnosis of Diagnostic tests and algorithms used in interventions to prevent wandering in cytomegalovirus disease in the investigation of haematuria: dementia and evaluation of the ethical immunocompromised patients. systematic reviews and economic implications and acceptability of their By Szczepura A, Westmoreland D, evaluation. use. Vinogradova Y, Fox J, Clark M. By Rodgers M, Nixon J, Hempel S, By Robinson L, Hutchings D, Corner Aho T, Kelly J, Neal D, et al. L, Beyer F, Dickinson H, Vanoli A, et al. No. 11 Screening for thrombophilia in high-risk No. 19 No. 27 situations: systematic review and cost- Cognitive behavioural therapy in A review of the evidence on the effects effectiveness analysis. The Thrombosis: addition to antispasmodic therapy for and costs of implantable cardioverter Risk and Economic Assessment of irritable bowel syndrome in primary defibrillator therapy in different patient Thrombophilia Screening (TREATS) care: randomised controlled groups, and modelling of cost- study. trial. effectiveness and cost–utility for these By Wu O, Robertson L, Twaddle S, By Kennedy TM, Chalder T, groups in a UK context. Lowe GDO, Clark P, Greaves M, McCrone P, Darnley S, Knapp M, By Buxton M, Caine N, Chase D, 64 et al. Jones RH, et al. Connelly D, Grace A, Jackson C, et al. Health Technology Assessment 2007; Vol. 11: No. 50

No. 28 No. 37 No. 46 Adefovir dipivoxil and pegylated Cognitive behavioural therapy in Etanercept and efalizumab for the interferon alfa-2a for the treatment of chronic fatigue syndrome: a randomised treatment of psoriasis: a systematic chronic hepatitis B: a systematic review controlled trial of an outpatient group review. and economic evaluation. programme. By Woolacott N, Hawkins N, By Shepherd J, Jones J, Takeda A, By O’Dowd H, Gladwell P, Rogers CA, Mason A, Kainth A, Khadjesari Z, Davidson P, Price A. Hollinghurst S, Gregory A. Bravo Vergel Y, et al.

No. 29 No. 38 No. 47 An evaluation of the clinical and cost- A comparison of the cost-effectiveness of Systematic reviews of clinical decision effectiveness of pulmonary artery five strategies for the prevention of non- tools for acute abdominal pain. catheters in patient management in steroidal anti-inflammatory drug-induced By Liu JLY, Wyatt JC, Deeks JJ, intensive care: a systematic review and a gastrointestinal toxicity: a systematic Clamp S, Keen J, Verde P, et al. randomised controlled trial. review with economic modelling. By Harvey S, Stevens K, Harrison D, By Brown TJ, Hooper L, Elliott RA, No. 48 Young D, Brampton W, McCabe C, et al. Payne K, Webb R, Roberts C, et al. Evaluation of the ventricular assist device programme in the UK. No. 39 No. 30 By Sharples L, Buxton M, Caine N, The effectiveness and cost-effectiveness Accurate, practical and cost-effective Cafferty F, Demiris N, Dyer M, of computed tomography screening for assessment of carotid stenosis in the UK. et al. By Wardlaw JM, Chappell FM, coronary artery disease: systematic Stevenson M, De Nigris E, Thomas S, review. No. 49 Gillard J, et al. By Waugh N, Black C, Walker S, A systematic review and economic McIntyre L, Cummins E, Hillis G. model of the clinical and cost- No. 31 effectiveness of immunosuppressive Etanercept and infliximab for the No. 40 What are the clinical outcome and cost- therapy for renal transplantation in treatment of psoriatic arthritis: children. a systematic review and economic effectiveness of endoscopy undertaken by nurses when compared with doctors? By Yao G, Albon E, Adi Y, Milford D, evaluation. Bayliss S, Ready A, et al. By Woolacott N, Bravo Vergel Y, A Multi-Institution Nurse Endoscopy Trial (MINuET). Hawkins N, Kainth A, Khadjesari Z, No. 50 By Williams J, Russell I, Durai D, Misso K, et al. Amniocentesis results: investigation of Cheung W-Y, Farrin A, Bloor K, et al. anxiety. The ARIA trial. No. 32 No. 41 By Hewison J, Nixon J, Fountain J, The cost-effectiveness of testing for The clinical and cost-effectiveness of Cocks K, Jones C, Mason G, et al. hepatitis C in former injecting drug oxaliplatin and capecitabine for the users. adjuvant treatment of colon cancer: By Castelnuovo E, Thompson-Coon J, systematic review and economic Pitt M, Cramp M, Siebert U, Price A, Volume 11, 2007 evaluation. et al. By Pandor A, Eggington S, Paisley S, No. 1 No. 33 Tappenden P, Sutcliffe P. Pemetrexed disodium for the treatment of malignant pleural mesothelioma: Computerised cognitive behaviour No. 42 therapy for depression and anxiety a systematic review and economic A systematic review of the effectiveness evaluation. update: a systematic review and of adalimumab, etanercept and economic evaluation. By Dundar Y, Bagust A, Dickson R, infliximab for the treatment of Dodd S, Green J, Haycox A, et al. By Kaltenthaler E, Brazier J, rheumatoid arthritis in adults and an De Nigris E, Tumur I, Ferriter M, economic evaluation of their cost- No. 2 Beverley C, et al. effectiveness. A systematic review and economic No. 34 By Chen Y-F, Jobanputra P, Barton P, model of the clinical effectiveness and Cost-effectiveness of using prognostic Jowett S, Bryan S, Clark W, et al. cost-effectiveness of docetaxel in combination with prednisone or information to select women with breast No. 43 prednisolone for the treatment of cancer for adjuvant systemic therapy. Telemedicine in dermatology: hormone-refractory metastatic prostate By Williams C, Brunskill S, Altman D, a randomised controlled trial. cancer. Briggs A, Campbell H, Clarke M, By Bowns IR, Collins K, Walters SJ, By Collins R, Fenwick E, Trowman R, et al. McDonagh AJG. Perard R, Norman G, Light K, No. 35 No. 44 et al. Psychological therapies including Cost-effectiveness of cell salvage and dialectical behaviour therapy for alternative methods of minimising No. 3 borderline personality disorder: perioperative allogeneic blood A systematic review of rapid diagnostic a systematic review and preliminary transfusion: a systematic review and tests for the detection of tuberculosis economic evaluation. economic model. infection. By Brazier J, Tumur I, Holmes M, By Davies L, Brown TJ, Haynes S, By Dinnes J, Deeks J, Kunst H, Ferriter M, Parry G, Dent-Brown K, et al. Payne K, Elliott RA, McCollum C. Gibson A, Cummins E, Waugh N, et al. No. 36 No. 45 No. 4 Clinical effectiveness and cost- Clinical effectiveness and cost- The clinical effectiveness and cost- effectiveness of tests for the diagnosis effectiveness of laparoscopic surgery for effectiveness of strontium ranelate for and investigation of urinary tract colorectal cancer: systematic reviews and the prevention of osteoporotic infection in children: a systematic review economic evaluation. fragility fractures in postmenopausal and economic model. By Murray A, Lourenco T, women. By Whiting P, Westwood M, Bojke L, de Verteuil R, Hernandez R, Fraser C, By Stevenson M, Davis S, Palmer S, Richardson G, Cooper J, et al. McKinley A, et al. Lloyd-Jones M, Beverley C. 65

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Health Technology Assessment reports published to date

No. 5 No. 13 No. 21 A systematic review of quantitative and A systematic review and economic The clinical effectiveness and cost- qualitative research on the role and evaluation of epoetin alfa, epoetin beta effectiveness of treatments for children effectiveness of written information and darbepoetin alfa in anaemia with idiopathic steroid-resistant available to patients about individual associated with cancer, especially that nephrotic syndrome: a systematic review. . attributable to cancer treatment. By Colquitt JL, Kirby J, Green C, By Raynor DK, Blenkinsopp A, By Wilson J, Yao GL, Raftery J, Cooper K, Trompeter RS. Knapp P, Grime J, Nicolson DJ, Bohlius J, Brunskill S, Sandercock J, Pollock K, et al. et al. No. 22 A systematic review of the routine No. 6 No. 14 monitoring of growth in children of Oral naltrexone as a treatment for A systematic review and economic primary school age to identify relapse prevention in formerly evaluation of statins for the prevention growth-related conditions. opioid-dependent drug users: of coronary events. By Fayter D, Nixon J, Hartley S, a systematic review and economic By Ward S, Lloyd Jones M, Pandor A, Rithalia A, Butler G, Rudolf M, et al. evaluation. Holmes M, Ara R, Ryan A, et al. By Adi Y, Juarez-Garcia A, Wang D, No. 23 Jowett S, Frew E, Day E, et al. No. 15 Systematic review of the effectiveness of A systematic review of the effectiveness preventing and treating Staphylococcus No. 7 and cost-effectiveness of different aureus carriage in reducing peritoneal Glucocorticoid-induced osteoporosis: models of community-based respite catheter-related infections. a systematic review and cost–utility care for frail older people and their By McCormack K, Rabindranath K, analysis. carers. Kilonzo M, Vale L, Fraser C, McIntyre L, By Kanis JA, Stevenson M, By Mason A, Weatherly H, Spilsbury et al. McCloskey EV, Davis S, Lloyd-Jones M. K, Arksey H, Golder S, Adamson J, et al. No. 24 No. 8 The clinical effectiveness and cost of Epidemiological, social, diagnostic and No. 16 repetitive transcranial magnetic economic evaluation of population Additional therapy for young children stimulation versus electroconvulsive screening for genital chlamydial with spastic cerebral palsy: therapy in severe depression: infection. a randomised controlled trial. a multicentre pragmatic randomised By Low N, McCarthy A, Macleod J, By Weindling AM, Cunningham CC, controlled trial and economic Salisbury C, Campbell R, Roberts TE, Glenn SM, Edwards RT, Reeves DJ. analysis. et al. By McLoughlin DM, Mogg A, No. 17 No. 9 Eranti S, Pluck G, Purvis R, Edwards D, Screening for type 2 diabetes: literature Methadone and buprenorphine for the et al. review and economic modelling. management of opioid dependence: By Waugh N, Scotland G, No. 25 a systematic review and economic McNamee P, Gillett M, Brennan A, A randomised controlled trial and evaluation. Goyder E, et al. economic evaluation of direct versus By Connock M, Juarez-Garcia A, indirect and individual versus group Jowett S, Frew E, Liu Z, Taylor RJ, No. 18 modes of speech and language therapy et al. The effectiveness and cost-effectiveness for children with primary language No. 10 of cinacalcet for secondary impairment. Exercise Evaluation Randomised Trial hyperparathyroidism in end-stage By Boyle J, McCartney E, Forbes J, (EXERT): a randomised trial comparing renal disease patients on dialysis: a O’Hare A. GP referral for leisure centre-based systematic review and economic exercise, community-based walking and evaluation. No. 26 advice only. By Garside R, Pitt M, Anderson R, Hormonal therapies for early breast By Isaacs AJ, Critchley JA, Mealing S, Roome C, Snaith A, cancer: systematic review and economic See Tai S, Buckingham K, Westley D, et al. evaluation. Harridge SDR, et al. By Hind D, Ward S, De Nigris E, No. 19 Simpson E, Carroll C, Wyld L. No. 11 The clinical effectiveness and Interferon alfa (pegylated and cost-effectiveness of gemcitabine for No. 27 non-pegylated) and ribavirin for the metastatic breast cancer: a systematic Cardioprotection against the toxic treatment of mild chronic hepatitis C: review and economic evaluation. effects of anthracyclines given to a systematic review and economic By Takeda AL, Jones J, Loveman E, children with cancer: a systematic evaluation. Tan SC, Clegg AJ. review. By Shepherd J, Jones J, Hartwell D, By Bryant J, Picot J, Levitt G, Davidson P, Price A, Waugh N. No. 20 Sullivan I, Baxter L, Clegg A. A systematic review of duplex No. 12 ultrasound, magnetic resonance No. 28 Systematic review and economic angiography and computed tomography Adalimumab, etanercept and infliximab evaluation of bevacizumab and angiography for the diagnosis for the treatment of ankylosing cetuximab for the treatment of and assessment of symptomatic, lower spondylitis: a systematic review and metastatic colorectal cancer. limb peripheral arterial disease. economic evaluation. By Tappenden P, Jones R, Paisley S, By Collins R, Cranny G, Burch J, By McLeod C, Bagust A, Boland A, 66 Carroll C. Aguiar-Ibáñez R, Craig D, Wright K, et al. Dagenais P, Dickson R, Dundar Y, et al. Health Technology Assessment 2007; Vol. 11: No. 50

No. 29 No. 37 No. 45 Prenatal screening and treatment A randomised controlled trial The effectiveness and cost-effectiveness strategies to prevent group B streptococcal examining the longer-term outcomes of of carmustine implants and and other bacterial infections in early standard versus new antiepileptic drugs. temozolomide for the treatment of infancy: cost-effectiveness and expected The SANAD trial. newly diagnosed high-grade glioma: value of information analyses. By Marson AG, Appleton R, Baker a systematic review and economic By Colbourn T, Asseburg C, Bojke L, GA, Chadwick DW, Doughty J, Eaton B, evaluation. Philips Z, Claxton K, Ades AE, et al. et al. By Garside R, Pitt M, Anderson R, Rogers G, Dyer M, Mealing S, No. 30 No. 38 et al. Clinical effectiveness and cost- Clinical effectiveness and cost- effectiveness of bone morphogenetic effectiveness of different models No. 46 proteins in the non-healing of fractures of managing long-term oral Drug-eluting stents: a systematic review and spinal fusion: a systematic review. anticoagulation therapy: a systematic and economic evaluation. By Garrison KR, Donell S, Ryder J, review and economic modelling. By Hill RA, Boland A, Dickson R, Shemilt I, Mugford M, Harvey I, et al. By Connock M, Stevens C, Dündar Y, Haycox A, McLeod C, No. 31 Fry-Smith A, Jowett S, Fitzmaurice D, et al. Moore D, et al. A randomised controlled trial of No. 47 postoperative radiotherapy following No. 39 The clinical effectiveness and cost- breast-conserving surgery in a A systematic review and economic effectiveness of cardiac resynchronisation minimum-risk older population. model of the clinical effectiveness and (biventricular pacing) for heart failure: The PRIME trial. cost-effectiveness of interventions for systematic review and economic By Prescott RJ, Kunkler IH, Williams preventing relapse in people with model. LJ, King CC, Jack W, van der Pol M, et al. bipolar disorder. By Fox M, Mealing S, Anderson R, No. 32 By Soares-Weiser K, Bravo Vergel Y, Dean J, Stein K, Price A, et al. Current practice, accuracy, effectiveness Beynon S, Dunn G, Barbieri M, Duffy S, No. 48 and cost-effectiveness of the school et al. Recruitment to randomised trials: entry hearing screen. No. 40 strategies for trial enrolment and By Bamford J, Fortnum H, Bristow K, Taxanes for the adjuvant treatment of participation study. The STEPS study. Smith J, Vamvakas G, Davies L, et al. early breast cancer: systematic review By Campbell MK, Snowdon C, No. 33 and economic evaluation. Francis D, Elbourne D, McDonald AM, The clinical effectiveness and By Ward S, Simpson E, Davis S, Knight R, et al. cost-effectiveness of inhaled insulin Hind D, Rees A, Wilkinson A. in diabetes mellitus: a systematic No. 49 review and economic evaluation. No. 41 Cost-effectiveness of functional By Black C, Cummins E, Royle P, The clinical effectiveness and cost- cardiac testing in the diagnosis and Philip S, Waugh N. effectiveness of screening for open angle management of coronary artery disease: glaucoma: a systematic review and a randomised controlled trial. No. 34 economic evaluation. The CECaT trial. Surveillance of cirrhosis for By Burr JM, Mowatt G, Hernández R, By Sharples L, Hughes V, Crean A, hepatocellular carcinoma: systematic Siddiqui MAR, Cook J, Lourenco T, Dyer M, Buxton M, Goldsmith K, review and economic analysis. et al. et al. By Thompson Coon J, Rogers G, Hewson P, Wright D, Anderson R, No. 42 No. 50 Cramp M, et al. Acceptability, benefit and costs of early Evaluation of diagnostic tests when screening for hearing disability: a study there is no gold standard. A review of No. 35 of potential screening tests and methods. The Birmingham Rehabilitation Uptake models. By Rutjes AWS, Reitsma JB, Maximisation Study (BRUM). Home- By Davis A, Smith P, Ferguson M, Coomarasamy A, Khan KS, based compared with hospital-based Stephens D, Gianopoulos I. Bossuyt PMM. cardiac rehabilitation in a multi-ethnic population: cost-effectiveness and No. 43 patient adherence. Contamination in trials of educational By Jolly K, Taylor R, Lip GYH, interventions. Greenfield S, Raftery J, Mant J, et al. By Keogh-Brown MR, Bachmann MO, Shepstone L, Hewitt C, Howe A, No. 36 Ramsay CR, et al. A systematic review of the clinical, public health and cost-effectiveness of No. 44 rapid diagnostic tests for the detection Overview of the clinical effectiveness of and identification of bacterial intestinal positron emission tomography imaging pathogens in faeces and food. in selected cancers. By Abubakar I, Irvine L, Aldus CF, By Facey K, Bradbury I, Laking G, Wyatt GM, Fordham R, Schelenz S, et al. Payne E.

67

© Queen’s Printer and Controller of HMSO 2007. All rights reserved.

Health Technology Assessment 2007; Vol. 11: No. 50

Health Technology Assessment Programme

Director, Deputy Director, Professor Tom Walley, Professor Jon Nicholl, Director, NHS HTA Programme, Director, Medical Care Research Department of Pharmacology & Unit, University of Sheffield, Therapeutics, School of Health and Related University of Liverpool Research

Prioritisation Strategy Group Members

Chair, Professor Bruce Campbell, Dr Edmund Jessop, Medical Dr Ron Zimmern, Director, Professor Tom Walley, Consultant Vascular & General Adviser, National Specialist, Public Health Genetics Unit, Director, NHS HTA Programme, Surgeon, Royal Devon & Exeter Commissioning Advisory Group Strangeways Research Department of Pharmacology & Hospital (NSCAG), Department of Laboratories, Cambridge Therapeutics, Health, London Professor Robin E Ferner, University of Liverpool Consultant Physician and Professor Jon Nicholl, Director, Director, West Midlands Centre Medical Care Research Unit, for Adverse Drug Reactions, University of Sheffield, City Hospital NHS Trust, School of Health and Birmingham Related Research

HTA Commissioning Board Members Programme Director, Professor Deborah Ashby, Professor Jenny Donovan, Professor Ian Roberts, Professor Tom Walley, Professor of , Professor of Social Medicine, Professor of Epidemiology & Director, NHS HTA Programme, Department of Environmental Department of Social Medicine, Public Health, Intervention Department of Pharmacology & and Preventative Medicine, University of Bristol Research Unit, London School Therapeutics, Queen Mary University of of Hygiene and Tropical University of Liverpool London Professor Freddie Hamdy, Medicine Professor of Urology, Chair, Professor Ann Bowling, University of Sheffield Professor Jon Nicholl, Professor Mark Sculpher, Professor of Health Services Director, Medical Care Research Professor Allan House, Professor of Health Economics, Research, Primary Care and Unit, University of Sheffield, Professor of Liaison Psychiatry, Centre for Health Economics, Population Studies, School of Health and Related University of Leeds Institute for Research in the University College London Research Social Services, Professor Sallie Lamb, Director, University of York Deputy Chair, Professor John Cairns, Warwick Clinical Trials Unit, Dr Andrew Farmer, Professor of Health Economics, University of Warwick Professor Kate Thomas, University Lecturer in General Public Health Policy, Professor of Complementary Practice, Department of London School of Hygiene Professor Stuart Logan, and , Primary Health Care, and Tropical Medicine, Director of Health & Social University of Leeds University of Oxford London Care Research, The Peninsula Medical School, Universities of Professor Nicky Cullum, Exeter & Plymouth Professor David John Torgerson, Director of York Trial Unit, Director of Centre for Evidence Professor Miranda Mugford, Based Nursing, Department of Department of Health Sciences, Professor of Health Economics, University of York Health Sciences, University of University of East Anglia Dr Jeffrey Aronson, York Reader in Clinical Dr Linda Patterson, Professor Hywel Williams, Pharmacology, Department of Professor Jon Deeks, Consultant Physician, Professor of Clinical Pharmacology, Professor of Health Statistics, Department of Medicine, Dermato-Epidemiology, Radcliffe Infirmary, Oxford University of Birmingham Burnley General Hospital University of Nottingham

69 Current and past membership details of all HTA ‘committees’ are available from the HTA website (www.hta.ac.uk)

© Queen’s Printer and Controller of HMSO 2007. All rights reserved. Health Technology Assessment Programme

Diagnostic Technologies & Screening Panel Members

Chair, Dr Paul Cockcroft, Consultant Dr Jennifer J Kurinczuk, Dr Margaret Somerville, Dr Ron Zimmern, Director of Medical Microbiologist and Consultant Clinical Director of Public Health the Public Health Genetics Unit, Clinical Director of Pathology, Epidemiologist, National Learning, Peninsula Medical Strangeways Research Department of Clinical Perinatal Epidemiology Unit, School, University of Plymouth Laboratories, Cambridge Microbiology, St Mary's Oxford Hospital, Portsmouth Dr Graham Taylor, Scientific Dr Susanne M Ludgate, Director & Senior Lecturer, Professor Adrian K Dixon, Clinical Director, Medicines & Regional DNA Laboratory, The Professor of Radiology, Healthcare Products Regulatory Leeds Teaching Hospitals University Department of Ms Norma Armston, Agency, London Radiology, University of Freelance Consumer Advocate, Professor Lindsay Wilson Cambridge Clinical School Bolton Mr Stephen Pilling, Director, Turnbull, Scientific Director, Centre for Outcomes, Centre for MR Investigations & Dr David Elliman, Consultant in Research & Effectiveness, YCR Professor of Radiology, Professor Max Bachmann, Community Child Health, Joint Director, National University of Hull Professor of Health Care Islington PCT & Great Ormond Collaborating Centre for Mental Interfaces, Department of Street Hospital, London Health Policy and Practice, Health, University College Professor Martin J Whittle, University of East Anglia Professor Glyn Elwyn, London Clinical Co-director, National Research Chair, Centre for Co-ordinating Centre for Professor Rudy Bilous Health Sciences Research, Mrs Una Rennard, Women’s and Childhealth Professor of Clinical Medicine & Cardiff University, Department Service User Representative, Consultant Physician, of General Practice, Cardiff Oxford Dr Dennis Wright, The Academic Centre, Consultant Biochemist & South Tees Hospitals NHS Trust Professor Paul Glasziou, Dr Phil Shackley, Senior Clinical Director, Director, Centre for Lecturer in Health Economics, The North West London Ms Dea Birkett, Service User Evidence-Based Practice, Academic Vascular Unit, Hospitals NHS Trust, Representative, London University of Oxford University of Sheffield Middlesex

Pharmaceuticals Panel Members

Chair, Professor Imti Choonara, Dr Jonathan Karnon, Senior Dr Martin Shelly, Professor Robin Ferner, Professor in Child Health, Research Fellow, Health General Practitioner, Consultant Physician and Academic Division of Child Economics and Decision Leeds Director, West Midlands Centre Health, University of Science, University of Sheffield for Adverse Drug Reactions, Nottingham Mrs Katrina Simister, Assistant City Hospital NHS Trust, Dr Yoon Loke, Senior Lecturer Director New Medicines, Birmingham Professor John Geddes, in Clinical Pharmacology, National Prescribing Centre, Professor of Epidemiological University of East Anglia Liverpool Psychiatry, University of Oxford Ms Barbara Meredith, Dr Richard Tiner, Medical Lay Member, Epsom Director, Medical Department, Mrs Barbara Greggains, Association of the British Non-Executive Director, Dr Andrew Prentice, Senior , Greggains Management Ltd Lecturer and Consultant London Obstetrician & Gynaecologist, Dr Bill Gutteridge, Medical Department of Obstetrics & Adviser, National Specialist Gynaecology, University of Commissioning Advisory Group Cambridge Ms Anne Baileff, Consultant (NSCAG), London Nurse in First Contact Care, Dr Frances Rotblat, CPMP Southampton City Primary Care Mrs Sharon Hart, Delegate, Medicines & Trust, University of Consultant Pharmaceutical Healthcare Products Regulatory Southampton Adviser, Reading Agency, London

70 Current and past membership details of all HTA ‘committees’ are available from the HTA website (www.hta.ac.uk) Health Technology Assessment 2007; Vol. 11: No. 50

Therapeutic Procedures Panel Members Chair, Professor Matthew Cooke, Dr Simon de Lusignan, Dr John C Pounsford, Professor Bruce Campbell, Professor of Emergency Senior Lecturer, Primary Care Consultant Physician, Consultant Vascular and Medicine, Warwick Emergency Informatics, Department of Directorate of Medical Services, General Surgeon, Department Care and Rehabilitation, Community Health Sciences, North Bristol NHS Trust of Surgery, Royal Devon & University of Warwick St George’s Hospital Medical Exeter Hospital School, London Dr Karen Roberts, Nurse Mr Mark Emberton, Senior Consultant, Queen Elizabeth Lecturer in Oncological Dr Peter Martin, Consultant Hospital, Gateshead Urology, Institute of Urology, Neurologist, Addenbrooke’s University College Hospital Hospital, Cambridge Dr Vimal Sharma, Consultant Professor Paul Gregg, Professor Neil McIntosh, Psychiatrist/Hon. Senior Professor of Orthopaedic Edward Clark Professor of Child Lecturer, Mental Health Resource Centre, Cheshire and Dr Mahmood Adil, Deputy Surgical Science, Department of Life & Health, Department of Wirral Partnership NHS Trust, Regional Director of Public General Practice and Primary Child Life & Health, University Wallasey Health, Department of Health, Care, South Tees Hospital NHS of Edinburgh Manchester Trust, Middlesbrough Professor Jim Neilson, Professor Scott Weich, Dr Aileen Clarke, Ms Maryann L Hardy, Professor of Obstetrics and Professor of Psychiatry, Consultant in Public Health, Lecturer, Division of Gynaecology, Department of Division of Health in the Public Health Resource Unit, Radiography, University of Obstetrics and Gynaecology, Community, University of Oxford Bradford University of Liverpool Warwick

Disease Prevention Panel Members Chair, Dr Elizabeth Fellow-Smith, Professor Yi Mien Koh, Dr Carol Tannahill, Director, Dr Edmund Jessop, Medical Medical Director, Director of Public Health and Glasgow Centre for Population Adviser, National Specialist West London Mental Health Medical Director, London Health, Glasgow Commissioning Advisory Group Trust, Middlesex NHS (North West London (NSCAG), London Strategic Health Authority), Professor Margaret Thorogood, Mr Ian Flack, Director PPI London Professor of Epidemiology, Forum Support, Council of University of Warwick, Ethnic Minority Voluntary Ms Jeanett Martin, Coventry Sector Organisations, Director of Clinical Leadership Stratford & Quality, Lewisham PCT, Dr Ewan Wilkinson, London Consultant in Public Health, Dr John Jackson, Royal Liverpool University General Practitioner, Dr Chris McCall, General Hospital, Liverpool Newcastle upon Tyne Practitioner, Dorset Mrs Veronica James, Chief Dr David Pencheon, Director, Mrs Sheila Clark, Chief Officer, Horsham District Age Eastern Region Public Health Executive, St James’s Hospital, Concern, Horsham Observatory, Cambridge Portsmouth Professor Mike Kelly, Dr Ken Stein, Senior Clinical Mr Richard Copeland, Director, Centre for Public Lecturer in Public Health, Lead Pharmacist: Clinical Health Excellence, Director, Peninsula Technology Economy/Interface, National Institute for Health Assessment Group, Wansbeck General Hospital, and Clinical Excellence, University of Exeter, Northumberland London Exeter

71 Current and past membership details of all HTA ‘committees’ are available from the HTA website (www.hta.ac.uk) Health Technology Assessment Programme

Expert Advisory Network Members

Professor Douglas Altman, Professor Carol Dezateux, Professor Stan Kaye, Cancer Professor Chris Price, Professor of Statistics in Professor of Paediatric Research UK Professor of Visiting Professor in Clinical Medicine, Centre for Statistics Epidemiology, London Medical Oncology, Section of Biochemistry, University of in Medicine, University of Medicine, Royal Marsden Oxford Oxford Dr Keith Dodd, Consultant Hospital & Institute of Cancer Paediatrician, Derby Research, Surrey Professor William Rosenberg, Professor John Bond, Professor of Hepatology and Director, Centre for Health Mr John Dunning, Dr Duncan Keeley, Consultant Physician, University Services Research, University of Consultant Cardiothoracic General Practitioner (Dr Burch of Southampton, Southampton Newcastle upon Tyne, School of Surgeon, Cardiothoracic & Ptnrs), The Health Centre, Population & Health Sciences, Surgical Unit, Papworth Thame Professor Peter Sandercock, Newcastle upon Tyne Hospital NHS Trust, Cambridge Professor of Medical Neurology, Dr Donna Lamping, Department of Clinical Mr Jonothan Earnshaw, Professor Andrew Bradbury, Research Degrees Programme Neurosciences, University of Consultant Vascular Surgeon, Professor of Vascular Surgery, Director & Reader in Psychology, Edinburgh Gloucestershire Royal Hospital, Solihull Hospital, Birmingham Health Services Research Unit, Gloucester Dr Susan Schonfield, Consultant London School of Hygiene and in Public Health, Hillingdon Mr Shaun Brogan, Tropical Medicine, London Chief Executive, Ridgeway Professor Martin Eccles, PCT, Middlesex Primary Care Group, Aylesbury Professor of Clinical Effectiveness, Centre for Health Mr George Levvy, Dr Eamonn Sheridan, Chief Executive, Motor Mrs Stella Burnside OBE, Services Research, University of Consultant in Clinical Genetics, Newcastle upon Tyne Neurone Disease Association, Genetics Department, Chief Executive, Northampton Regulation and Improvement St James’s University Hospital, Professor Pam Enderby, Authority, Belfast Leeds Professor of Community Professor James Lindesay, Professor of Psychiatry for the Ms Tracy Bury, Rehabilitation, Institute of Professor Sarah Stewart-Brown, Elderly, University of Leicester, Project Manager, World General Practice and Primary Professor of Public Health, Leicester General Hospital Confederation for Physical Care, University of Sheffield University of Warwick, Therapy, London Division of Health in the Professor Gene Feder, Professor Professor Julian Little, Community Warwick Medical Professor of Human Genome Professor Iain T Cameron, of Primary Care Research & School, LWMS, Coventry Epidemiology, Department of Professor of Obstetrics and Development, Centre for Health Epidemiology & Community Gynaecology and Head of the Sciences, Barts & The London Professor Ala Szczepura, Medicine, University of Ottawa School of Medicine, Queen Mary’s School of Professor of Health Service University of Southampton Medicine & Dentistry, London Research, Centre for Health Professor Rajan Madhok, Services Studies, University of Mr Leonard R Fenwick, Dr Christine Clark, Consultant in Public Health, Warwick Chief Executive, Newcastle Medical Writer & Consultant South Manchester Primary upon Tyne Hospitals NHS Trust Dr Ross Taylor, Pharmacist, Rossendale Care Trust, Manchester Senior Lecturer, Department of Mrs Gillian Fletcher, Professor Collette Clifford, Professor Alexander Markham, General Practice and Primary Antenatal Teacher & Tutor and Professor of Nursing & Head of Director, Molecular Medicine Care, University of Aberdeen President, National Childbirth Research, School of Health Unit, St James’s University Trust, Henfield Mrs Joan Webster, Sciences, University of Hospital, Leeds Consumer member, HTA – Birmingham, Edgbaston, Professor Jayne Franklyn, Professor Alistaire McGuire, Expert Advisory Network Birmingham Professor of Medicine, Professor of Health Economics, Department of Medicine, London School of Economics Professor Barry Cookson, University of Birmingham, Director, Laboratory of Queen Elizabeth Hospital, Dr Peter Moore, Healthcare Associated Infection, Edgbaston, Birmingham Health Protection Agency, Freelance Science Writer, Ashtead London Dr Neville Goodman, Dr Andrew Mortimore, Public Consultant Anaesthetist, Health Director, Southampton Dr Carl Counsell, Clinical Southmead Hospital, Bristol Senior Lecturer in Neurology, City Primary Care Trust, Department of Medicine & Professor Robert E Hawkins, Southampton Therapeutics, University of CRC Professor and Director of Aberdeen Medical Oncology, Christie CRC Dr Sue Moss, Associate Director, Research Centre, Christie Cancer Screening Evaluation Professor Howard Cuckle, Hospital NHS Trust, Manchester Unit, Institute of Cancer Professor of Reproductive Research, Sutton Epidemiology, Department of Professor Allen Hutchinson, Paediatrics, Obstetrics & Director of Public Health & Mrs Julietta Patnick, Gynaecology, University of Deputy Dean of ScHARR, Director, NHS Cancer Screening Leeds Department of Public Health, Programmes, Sheffield University of Sheffield Dr Katherine Darton, Professor Robert Peveler, Information Unit, MIND – Professor Peter Jones, Professor Professor of Liaison Psychiatry, The Mental Health Charity, of Psychiatry, University of Royal South Hants Hospital, London Cambridge, Cambridge Southampton

72 Current and past membership details of all HTA ‘committees’ are available from the HTA website (www.hta.ac.uk) HTA

How to obtain copies of this and other HTA Programme reports. An electronic version of this publication, in Adobe Acrobat format, is available for downloading free of charge for personal use from the HTA website (http://www.hta.ac.uk). A fully searchable CD-ROM is also available (see below). Printed copies of HTA monographs cost £20 each (post and packing free in the UK) to both public and private sector purchasers from our Despatch Agents. Non-UK purchasers will have to pay a small fee for post and packing. For European countries the cost is £2 per monograph and for the rest of the world £3 per monograph. You can order HTA monographs from our Despatch Agents: – fax (with credit card or official purchase order) – post (with credit card or official purchase order or cheque) – phone during office hours (credit card only). Additionally the HTA website allows you either to pay securely by credit card or to print out your order and then post or fax it.

Contact details are as follows: HTA Despatch Email: [email protected] c/o Direct Mail Works Ltd Tel: 02392 492 000 4 Oakwood Business Centre Fax: 02392 478 555 Downley, HAVANT PO9 2NP, UK Fax from outside the UK: +44 2392 478 555 NHS libraries can subscribe free of charge. Public libraries can subscribe at a very reduced cost of £100 for each volume (normally comprising 30–40 titles). The commercial subscription rate is £300 per volume. Please see our website for details. Subscriptions can only be purchased for the current or forthcoming volume.

Payment methods Paying by cheque If you pay by cheque, the cheque must be in pounds sterling, made payable to Direct Mail Works Ltd and drawn on a bank with a UK address. Paying by credit card The following cards are accepted by phone, fax, post or via the website ordering pages: Delta, Eurocard, Mastercard, Solo, Switch and Visa. We advise against sending credit card details in a plain email. Paying by official purchase order You can post or fax these, but they must be from public bodies (i.e. NHS or universities) within the UK. We cannot at present accept purchase orders from commercial companies or from outside the UK.

How do I get a copy of HTA on CD? Please use the form on the HTA website (www.hta.ac.uk/htacd.htm). Or contact Direct Mail Works (see contact details above) by email, post, fax or phone. HTA on CD is currently free of charge worldwide.

The website also provides information about the HTA Programme and lists the membership of the various committees. Health Technology Assessment Health Technology Assessment 2007; Vol. 11: No. 50 07 o.1:N.5 Evaluation of diagnostic tests when there is no gold standard Vol. 11: No. 50 2007;

Evaluation of diagnostic tests when there is no gold standard. A review of methods

AWS Rutjes, JB Reitsma, A Coomarasamy, KS Khan and PMM Bossuyt

Feedback The HTA Programme and the authors would like to know your views about this report. The Correspondence Page on the HTA website (http://www.hta.ac.uk) is a convenient way to publish your comments. If you prefer, you can send your comments to the address below, telling us whether you would like us to transfer them to the website. We look forward to hearing from you.

December 2007

The National Coordinating Centre for Health Technology Assessment, Mailpoint 728, Boldrewood, Health Technology Assessment University of Southampton, NHS R&D HTA Programme Southampton, SO16 7PX, UK. HTA Fax: +44 (0) 23 8059 5639 Email: [email protected] www.hta.ac.uk http://www.hta.ac.uk ISSN 1366-5278