View metadata, citation and similar papers at core.ac.uk brought to you by CORE

provided by Elsevier - Publisher Connector

VALUE IN HEALTH 14 (2011) 953–960

Available online at www.sciencedirect.com

journal homepage: www.elsevier.com/locate/jval

Consistency between Direct and Indirect Trial Evidence: Is Direct Evidence Always More Reliable? Jason Madan, MA, MSc, PhD*, Matt D. Stevenson, BSc, PhD, Katy L. Cooper, BSc, PhD, A. E. Ades, PhD, Sophie Whyte, BSc, MMath, PhD, Ron Akehurst, BSc (Econ), Hon MFPH University of Bristol, Bristol, UK

ABSTRACT

Objectives: To present a case study involving the reduction in inci- rect evidence (P value for consistency hypothesis 0.027). The additional dence of febrile neutropenia (FN) after with granulocyte trials were consistent with the earlier indirect, rather than the direct, colony–stimulating factors (G-CSFs), illustrating difficulties that may evidence, and there was no inconsistency between direct and indirect arise when following the common preference for direct evidence over estimates in the updated evidence. The earlier inconsistency was due indirect evidence. Methods: Evidence of the efficacy of treatments was to one trial comparing primary pegfilgrastim with no primary G-CSF. identified from two previous systematic reviews. We used Bayesian Predictive cross-validation showed that this study was inconsistent evidence synthesis to estimate relative treatment effects based on di- with the evidence as a whole and with other trials making this rect evidence, indirect evidence, and both pooled together. We checked comparison. Conclusions: Both the Cochrane Handbook and the NICE for inconsistency between direct and indirect evidence and explored Methods Guide express a preference for direct evidence. A more robust the role of one specific trial using cross-validation. A subsequent re- strategy, which is in line with the accepted principles of evidence syn- view identified further studies not available at the time of the original thesis, would be to combine all relevant and appropriate information, analysis. We repeated the analyses on the enlarged evidence base. whether direct or indirect. Results: We found substantial inconsistency in the original evidence Keywords: Bayesian methods, febrile neutropenia, granulocyte colony– base. The median odds ratio of FN for primary pegfilgrastim versus no stimulating factors, methodology, mixed treatment comparison primary G-CSF was 0.06 (95% credible interval: 0.02–0.19) based on di- Copyright © 2011, International Society for Pharmacoeconomics and rect evidence, but 0.27 (95% credible interval: 0.13–0.53) based on indi- Outcomes Research (ISPOR). Published by Elsevier Inc.

states that “the Institute has a preference for data from head-to- Introduction head RCTs” [6]. Also, section 16.6 of the Cochrane Collaboration There is often a need to synthesize evidence from multiple studies states that direct comparisons should take precedence over indi- when evaluating medical interventions from a clinical or health rect and advises that MTC should be seen as a supplement to, policy perspective. Meta-analysis is an established statistical rather than a replacement of, analysis of the direct evidence [7]. method for this synthesis when the evidence consists of several What is known about the reliability of indirect evidence and head-to-head comparisons of two treatments. The estimation of the wisdom of combining it with direct evidence? First, at an em- the relative effectiveness of two treatments for which no head-to- pirical level, studies comparing direct and indirect evidence have head evidence is available is known as an indirect comparison (IC) found no systematic differences [8]. Where differences have been [1]. Mixed treatment comparison (MTC) meta-analysis is a gener- found, they appear to have been in situations in which the doses or alization of standard pairwise meta-analysis that derives esti- treatment combinations in the direct and indirect evidence were mates of treatment effects from a synthesis of direct and indirect not comparable [8–11]. Second, at a practical level, when decisions evidence. Statistical methods appear to have been discovered in- must be made between more than two treatments, this must be dependently by several authors [1–3], although recently Bayesian based on a coherent and internally consistent set of estimates in methods based on the previous work [2] have become popular which the estimated difference between treatments A and C is the [4,5]. The advantage of this approach is that it “borrows strength” sum of the estimated treatment differences between A and B and from indirect evidence where direct evidence is lacking or sparse. B and C. Estimates with the required coherence properties are Skepticism has been expressed, however, regarding the validity of precisely what an MTC analysis delivers. Third, at a theoretical MTC and the appropriateness of including indirect evidence level, the assumptions required by an MTC analysis of AB, AC, and where direct evidence is available. This is reflected in the current BC studies are exactly the same as the assumptions that would be Guide to the Methods of Technology Appraisal issued by the Na- required by pairwise meta-analyses of the AB and AC effects in all tional Institute for Health and Clinical Excellence (NICE), which these trials (assuming that all trials had all three treatments).

* Address correspondence to: Jason Madan, Academic Unit of Primary Health Care, University of Bristol, Cotham House, Cotham Hill, Bristol, Avon, BS6 6JL UK. E-mail: [email protected]. 1098-3015/$36.00 – see front matter Copyright © 2011, International Society for Pharmacoeconomics and Outcomes Research (ISPOR). Published by Elsevier Inc. doi:10.1016/j.jval.2011.05.042 954 VALUE IN HEALTH 14 (2011) 953–960

More formally, the consistency assumptions made by MTC follow comparing either a) two different G-CSFs, both given as primary automatically from the standard exchangeability assumptions ap- prophylaxis, or b) a G-CSF given as primary prophylaxis versus no plied over the full set of trials [12]. primary prophylaxis. No primary prophylaxis could include no Every health technology assessment must begin with a defini- G-CSF provision, placebo, or secondary G-CSF prophylaxis. tion of the target population and the candidate treatments. Before Two systematic reviews and meta-analyses of G-CSFs in rela- any statistical evidence syntheses, a careful review of clinical tion to reducing FN events were identified: Kuderer et al. [17] and heterogeneity among the included trials is necessary and, in Pinto et al. [18]. These meta-analyses included 20 trials, with head- particular, whether this could be associated with effect modifi- to-head comparisons of each G-CSF (given as primary prophylaxis) cation. The presence of recognized effect modifiers in the evi- versus no primary G-CSF, as well as trials of primary pegfilgrastim dence base suggests that different decisions should be made for versus primary filgrastim. The direct evidence comparing primary the different patient groups. The presence of unrecognized ef- pegfilgrastim with no primary G-CSF consisted of a single head- fect modifiers may cause treatment effect estimates in any set to-head trial [19]. We refer to these 20 trials as the time 1 evidence of comparisons to be heterogeneous and to deviate from the base throughout our analysis. true treatment effect in the target population. This can be an Subsequently, an updated systematic review of G-CSF trials issue in any synthesis of evidence for decision making, includ- was performed [20] that identified five additional relevant ran- ing pairwise meta-analysis. It may lead to inconsistencies domized, controlled trials (RCTs). The additional studies included within a network of evidence if the distribution of effect modi- four further RCTs of primary pegfilgrastim compared to no pri- fiers across that network is unbalanced. mary G-CSF (reported in three references [21–23]), and one further In this article, we present a case study in which the direct evidence RCT of primary filgrastim compared to no primary G-CSF [24]. appeared, initially to cast doubt on the indirect evidence. Additional These additional studies would not have been available to support head-to-head data subsequently became available that were, how- a decision made at the time that the previous systematic reviews ever, consistent with the original indirect evidence. The case study were performed. We refer to the updated evidence base of 25 trials is based on a synthesis of evidence of three granulocyte colony– as the time 2 evidence base throughout our analysis. Figure 1 stimulating factors (G-CSFs) for the reduction of incidence of fe- shows the evidence network at time 1 and time 2, and the full set brile neutropenia (FN) during chemotherapy carried out in the of trial data is shown in Table 1. A table giving further clinical context of a cost-effectiveness analysis. Both indirect and direct background on these studies is included in the Appendix. Al- comparisons showed primary pegfilgrastim to be significantly su- though all met the inclusion criteria, there are differences be- perior to no primary G-CSFs: the single head-to-head trial in the tween the trials in terms of the cancer type and stage, patient age, original data set, however, showed a markedly greater effect size and chemotherapy regimen. There was no reason, however, to than that estimated from the indirect evidence. assume a priori that these differences would be significant effect Our approach is as follows: we begin with the original data set modifiers, and both previous systematic reviews had pooled stud- and show how the consistency of the direct and indirect evidence ies despite these differences. There was no indication that any of can be assessed using a “node-splitting” approach [13]. We con- the four sets of trials was different in terms of clinical heteroge- sider whether strength of evidence should lead us to prefer direct, neity from the others. indirect, or pooled evidence for decision making. We then repeat our analyses for the updated data set. We also use predictive Statistical methods cross-validation [14,15] to consider whether the original single pri- mary pegfilgrastim versus no primary G-CSF trial could be consid- Odds ratios (ORs) for each study were calculated and conventional ered a statistical outlier with respect to the time 1 and time 2 data pairwise meta-analyses were performed on the time 1 and time 2 sets. data using the STATA command metan, which implements the The case study highlights the issues that need to be considered DerSimonian and Laird method for random-effects meta-analysis when deciding whether to combine direct and indirect evidence [25]. Statistical analyses of the evidence network were carried out and also raises the question of detecting inconsistency in standard in a Bayesian Markov chain Monte Carlo (MCMC) framework using pairwise meta-analysis. We can also contrast Bayesian model the WinBUGS software [16]. We applied Bayesian random-effects checking techniques facilitated by flexible packages such as the MTC synthesis to both the time 1 and time 2 data sets [4]. Under freely available WinBUGS software [16], with commonly used fre- random effects, it is assumed that the true treatment effects in quentist methods. The case study raises questions about how to trial j comparing treatments A and B are sampled from a random- ␦ ϳ ␶ 2 interpret NICE Methods Guidance and advice given in the Co- effects distribution jAB N(dAB, ) . We began our analysis of each chrane Handbook in cases in which a decision must be made be- data set by fitting the standard MTC synthesis, which assumes ϭ tween more than two treatments. consistency between direct and indirect evidence (so that dBC Ϫ dAC dAB) and a single between-trials variance for each compari- son. We used the posterior mean deviance corrected for the satu- Methods rated model (mean residual deviance) as a measure of goodness of fit; a model that fits the data well should have a residual deviance close to the number of data points (26), which is this case is the Clinical context and sources of evidence for synthesis number of study arms. G-CSFs may be administered to patients receiving chemotherapy To assess whether direct evidence was consistent with indi- as primary prophylaxis (in every chemotherapy cycle from cycle 1) rect, we then fitted a model relaxing the consistency assumption. or as secondary prophylaxis (in all remaining cycles after a neu- In this model, the treatment effects of primary pegfilgrastim ver- tropenic event such as FN and prolonged severe neutropenia). The sus no primary G-CSF (dPeg,Null), primary pegfilgrastim versus pri- motivation for this evidence synthesis exercise was to inform an mary filgrastim (dPeg,Fil) and primary filgrastim versus no primary analysis of the cost-effectiveness of primary prophylaxis with G- G-CSF (dFil,Null) are assumed to be completely unrelated (compared ϭ ϩ CSFs in adults with solid tumors or lymphoma. The decision prob- to the standard MTC in which dPegNull dPeg,Fil dFil,Null). This lem guided the search strategy used to identify relevant evidence. model is therefore equivalent to separate meta-analyses for each Studies were excluded if they involved children or patients with relative treatment effect where head-to-head trials exist, except conditions outside the decision population (i.e., those with leuke- that a common between-trials variance term is assumed. The mia, myeloid malignancies, or myelodysplastic syndromes). In treatment effects given by this model are based solely on direct line with previous systematic reviews [17,18], we included trials evidence. However, we can use these results to calculate estimates VALUE IN HEALTH 14 (2011) 953–960 955

Fig. 1 – Network of trials for mixed treatment comparison. of treatment effect based solely on indirect evidence (e.g., the es- dence. At time 2, we could generate a prediction for the Vogel et al. timate for the treatment effect of primary pegfilgrastim vs. no trial based on the consistency model (using the totality of the re- primary G-CSF based on indirect evidence alone is given in the maining evidence, both direct and indirect). Alternatively, we can ϩ inconsistency model by dPeg,Fil dFil,Null) The model can therefore assess whether the Vogel et al. trial is consistent with the other be used to compare the direct and indirect estimates. A Bayes- primary pegfilgrastim versus no primary G-CSF trials by forming ian two-sided P value can be readily constructed in Bayesian the prediction only from the remaining direct evidence in the in- MCMC by determining the proportion of iterations in which the consistency model. Predictive cross-validation estimates the pre- direct estimate exceeds the indirect. This procedure is equiva- dicted treatment effect based on the remaining data, the between- lent to node-splitting methods for checking the consistency of trials variation, the uncertainty in both these quantities, and the MTC [13]. size of the Vogel trial. It incorporates the additional uncertainty in The time 1 analyses suggested that one particular trial by Vogel drawing a new sample from the random effects distribution and is et al. [19], which was the only trial comparing primary pegfilgras- thus a more conservative procedure than node splitting, which is tim with no primary G-CSF in the time 1 data set was giving results concerned with the equivalence of mean treatment effects. Bayes- inconsistent with the rest of the evidence. To explore the influence ian two-sided cross-validation P values were compared to their of this specific study, we carried out Bayesian predictive cross- expected order statistics, 1/(n ϩ 1) (n being the number of trials) as validation analyses [14,15]. Here the data were reanalyzed under a they reflect the extreme of n P values that could be calculated. random-effects model excluding the Vogel et al. trial. We then gen- WinBUGS code and data sets are given in Appendix A1, which erated a prediction for the relative treatment effect that would be can be found at doi:10.1016/j.val.2011.05.008. For all analyses, con- observed in a “new trial” of primary pegfilgrastim versus no pri- vergence was checked using the Gelman Rubin statistics available mary G-CSF of exactly the same size as the Vogel et al. trial. In the in WinBUGS. Posterior summaries are based on 50,000 samples, time 1 analysis, this prediction is based solely on the indirect evi- following an initial 50,000 burn-in. 956 VALUE IN HEALTH 14 (2011) 953–960

Table 1 – Incidence of febrile neutropenia (FN) in all studies of prophylactic G-CSFs. Study Total no. FN rate % Total no. FN rate % Odds ratio, mean (arm 1) (arm 2) (95% CI)

Primary pegfilgrastim No primary G-CSF *Hecht et al., 2009 [23] 123 3/123 2 118 10/118 8 0.27 (0.07–1.01) Vogel et al., 2005 [19] 463 6/463 1.3 465 78/465 17 0.07 (0.03–0.15) *Balducci et al., 2007 [21]: solid 343 14/343 4 343 34/343 10 0.39 (0.20–0.74) tumor *Balducci et al., 2007: NHL [21] 73 11/73 15 73 27/73 37 0.30 (0.14–0.67) *Romieu et al., 2007 [22]: cycle 1 30 4/30 13 29 5/29 17 0.74 (0.18–3.08) only Pooled result: mean (95% CI) OR Time 1: 0.07 (0.03–0.15) I2 not calculated; Time 2: 0.26 (0.12–0.58), I2 ϭ 73% Primary filgrastim No primary G-CSF *del Giglio et al., 2008 [24] 276 34/276 12 72 26/72 36 0.25 (0.14–0.45) Doorduijn et al., 2003 [36] 197 72/197 37 192 86/192 45 0.71 (0.48–1.07) Osby et al., 2003 [37]: CHOP 101 34/101 34 104 52/104 50 0.51 (0.29–0.89) Osby et al., 2003 [37]: CNOP 125 40/125 32 125 62/125 50 0.48 (0.19–0.80) Zinzani et al., 1997 [38] 77 4/77 5 72 15/72 21 0.21 (0.07–0.66) Pettengell et al., 1992 [39] 41 9/41 22 39 17/39 44 0.36 (0.14–0.96) Timmer-Bonte et al., 2005 [40] 90 16/90 18 85 27/85 32 0.46 (0.23–0.94) Trillet-Lenoir et al., 1993 [41] 65 17/65 26 64 34/64 53 0.31 (0.15–0.66) Crawford et al., 1991 [42] 95 38/95 40 104 80/104 77 0.20 (0.11–0.37) Fossa et al., 1998 [43] 129 25/129 19 130 38/130 29 0.58 (0.33–1.04) Pooled result: mean (95% CI) OR Time 1: 0.43 (0.32–0.58), I2 ϭ 47.8%; Time 2: 0.40 (0.30–0.54), I2 ϭ 51% Primary No primary G-CSF Chevallier et al., 1995 [44] 61 36/61 59 59 42/59 71 0.58 (0.27–1.25) Gisselbrecht et al., 1997 [45] 82 52/82 63 80 62/80 78 0.50 (0.25–1.00) Bui et al., 1995 [46]: cycle 1 22 5/22 23 26 15/26 58 0.22 (0.06–0.76) Gebbia et al., 1994 [47] 23 5/23 22 28 18/28 64 0.15 (0.04–0.54) Gebbia et al., 1993 [48] 43 5/43 12 43 14/43 33 0.27 (0.09–0.84) Pooled result: mean (95% CI) OR 0.37 (0.23–0.60), I2 ϭ 19% Primary pegfilgrastim Primary filgrastim Green et al., 2003 [49] 77 10/77 13 75 15/75 20 0.60 (0.25–1.43) Holmes et al., 2002 [50]: phase III 149 14/149 9 147 27/147 18 0.46 (0.23–0.92) Holmes et al., 2002 [51]: phase II 46 5/46 11 25 2/25 8 1.40 (0.25–7.81) Grigg et al., 2003 [53] 14 0/14 0 13 1/13 8 0.29 (0.01–7.70) Vose et al., 2003 [52]: cycles 1 29 6/29 21 31 6/31 19 1.09 (0.31–3.85) and 2

CHOP, cyclophosphamide, doxorubicin, vincristine, and prednisone; CI, confidence interval; CNOP, cyclophosphamide, vincristine, mitoxan- trone, prednisone; G-CSF, granulocyte colony–stimulating factor; NHL, non-Hodgkin’s lymphoma; OR, odds ratio. * Studies added at time 2 as a result of updated search.

pools direct and indirect evidence, lie between the direct and in- Results direct estimates. The study-specific odds ratios (ORs) and results from conventional At time 2, considerably more data are available on the primary random-effects pairwise meta-analyses and I2 values are shown in pegfilgrastim versus no primary G-CSF contrast. Here there is no Table 1. The degree of heterogeneity in all pairwise syntheses is evidence to suggest overall inconsistency: direct and indirect es- moderate to large, except that in the comparison of primary peg- timates are very close and the P value is now 0.87. Indeed the filgrastrim with primary filgrastrim, the I2 value was zero. Inspec- estimates for the primary pegfilgrastim versus no primary G-CSF tion of the study ORs suggests, however, that this reflects the small effect at time 2 are very similar to the indirect evidence at time 1. sample size of most of these studies because the crude ORs are in This suggests that the inconsistency is between the Vogel et al. fact highly heterogeneous. trial, in which risk reduction was particularly strong (from 17% to Table 2 presents the estimates generated for the relative effi- 1%), and the rest of the evidence. For completeness, Table 2 in- cacy of G-CSFs at both time 1 and time 2. Median ORs, with 95% cludes further statistics. The between-trial SD suggests distinct credible intervals, are presented for the four contrasts for which heterogeneity in the data, particularly at time 2, where the lower head-to-head trials are present. At each time point, three sets of limit is far from zero. Where there is inconsistency in the data, a results are set out: the direct and indirect estimates based on the model that imposes a consistency assumption will report more inconsistency model and the combined MTC estimates assuming heterogeneity than one that relaxes this assumption because dis- consistency. At time 1, direct estimates of each contrast differ agreement between direct and indirect does not need to be “ex- quite markedly from indirect estimates with Bayesian predictive P plained” by heterogeneity in the latter case. This can be seen in the values all at 0.027, although there is a slight overlap in confidence heterogeneity estimated by each model at time 1. At time 2, how- intervals. The same P value is obtained with every contrast be- ever, there is no difference in heterogeneity between consistency cause there can only be a single “inconsistency” in a triangular and inconsistency models, further indicating the agreement be- network, no matter which edge is compared to the other two. As tween all evidence sources. The residual deviance statistics are expected, the MTC results from the consistency model, which close to the number of data points (study arms) in all analyses and VALUE IN HEALTH 14 (2011) 953–960 957

Table 2 – Posterior odds ratios for febrile neutropenia (median and 95% credible interval) at time 1 and time 2 under models with and without the assumption of consistency. Treatment Time 1 (20 trials, 40 arms) Time 2 (25 trials, 50 arms) contrast Consistency Consistency Consistency not assumed assumed Consistency not assumed assumed

Direct OR Indirect OR P Combined (MTC) Direct OR Indirect OR P Combined (MTC) (95% CrI) (95% CrI) value OR (95% CrI) (95% CrI) (95% CrI) value OR (95% CrI)

Primary filgrastim 0.43 (0.30–0.58) 0.10 (0.02–0.33) 0.027 0.38 (0.26–0.53) 0.40 (0.27–0.56) 0.36 (0.14–0.89) 0.87 0.39 (0.28–0.54) vs.no primary G- CSF Primary 0.06 (0.02–0.19) 0.27 (0.13–0.53) 0.17 (0.09–0.33) 0.23 (0.12–0.43) 0.25 (0.12–0.55) 0.24 (0.15–0.39) pegfilgrastim vs. no primary G-CSF Primary 0.62 (0.35–1.17) 0.15 (0.04–0.46) 0.45 (0.25–0.85) 0.64 (0.33–1.30) 0.59 (0.30–1.22) 0.62 (0.38–1.04) pegfilgrastim vs. primary filgrastim Primary 0.35 (0.20–0.59) NA NA 0.34 (0.19–0.61) 0.34 (0.18–0.61) NA NA 0.34 (0.19–0.60) lenograstim vs. no primary G-CSF Heterogeneity 0.32 (0.04–0.66) 0.42 (0.16–0.78) 0.44 (0.21–0.76) 0.42 (0.20–0.73) (between-trial SD log OR, mean (95% CI) Mean residual 39.7 40.1 49.8 49.5 difference

For each comparison of treatments A and B, the odds ratios reported are posterior medians of the quantity exp(dAB), where dAB is the mean log odds ratio from a random-effects distribution. CI, confidence interval; CrI, credible interval; G-CSF, granulocyte colony–stimulating factor; MTC, mixed treatment comparison; N/A, not available; OR, odds ratio. do not distinguish the models. This is because random-effects cluded in the comparison are the other trials of primary pegfilgras- models tend to fit the data quite well, regardless of the degree of tim versus no primary G-CSF available at time 2 (P ϭ 0.010). heterogeneity. These P values do, however, need to be considered in the light The cross-validation analyses (Table 3) focus on whether the of the implied number of P values that could be generated. The results as extreme as those reported in the Vogel et al. study of Vogel et al. trial can be seen as the most extreme of all the trials in primary pegfilgrastim versus no primary G-CSF would be ex- the synthesis. At time 1, given that there are 20 trials, the P value of pected, based on a) the time 1 indirect evidence on this contrast, b) 0.047 should be compared to the most extreme expected order the totality of the remaining evidence (direct and indirect) at time statistic for 20 samples from a uniform distribution, P ϭ 0.048 (1/ 2, and c) the additional primary pegfilgrastim versus no primary 21). This suggests that it is not unlikely to find one trial as extreme G-CSF information available at time 2. At time 1, the P value is as the Vogel et al. trial given the heterogeneity in the other time 1 0.047, on the margin of conventional significance testing. By time trials. However, at time 2, the Bayesian P values, when compared 2, the P value is 0.013, providing strong evidence that the true to their expected order statistics, confirm that the Vogel et al. trial treatment effect in the trial is not consistent with expectations is not compatible with the totality of the other time 2 evidence, nor based on the other trials. This is true even if the only trials in- even with the four other trials of primary pegfilgrastim versus no primary G-CSF.

Table 3 – Cross-validation of the Vogel et al. [19] trial. Discussion Source of estimate Febrile P Expected Where the available evidence forms a connected network, as for neutropenia, value P value example in Figure 1, it is possible to estimate the relative efficacy % (95% CI) of any two treatments either directly from the head-to-head trials Observed, Vogel et al. [19] 1.3 (0.6–2.8) or indirectly from their performance against one or more common Predicted, indirect evidence, 5.7 (1.9–13.1) 0.047 0.048 comparators. This gives us three possible estimates for each treat- time 1 ment effect: direct, indirect, and pooled (MTC). We used this case Predicted, totality of remaining 6.1 (3.0–10.8) 0.013 0.038 study to illustrate a range of methods for checking the consistency evidence, time 2 of direct and indirect estimates. Consider the situation at time 1. Predicted, remaining direct 7.3 (2.8–15.4) 0.01 0.17 The Bayesian P value comparing the direct and indirect evidence evidence, time 2 would strongly suggest that the two sources are inconsistent. The cross-validation procedure is considerably more conservative and Observed and predicted percentage of febrile neutropenia in the does not reject the possibility of consistency so emphatically. This Vogel et al. [19] trial, with two-sided Bayesian P values and expected is because it is based on the predictive distribution of “new” trials, P value. Based on most extreme order statistic for N ϩ 1 samples whereas the node-splitting comparison of direct and indirect evi- from a uniform distribution, where N is the number of trials gener- dence is based on the posterior mean effects. Nevertheless, esti- ating the prediction. mates of treatment effect at time 1 vary substantially depending 958 VALUE IN HEALTH 14 (2011) 953–960

on whether they draw on direct, indirect, or pooled evidence. This fied between the Vogel et al. trial and other trials of primary peg- raises the question: which estimate should be used to represent filgrastim versus no primary G-CSF, such as the use of docetaxel treatment efficacy in economic models? chemotherapy and a lower mean age of participants. One might The Cochrane collaboration states that direct evidence should point to such differences to explain the strong risk reduction effect be accorded a higher rating in the evidence hierarchy than, and observed in the Vogel et al. trial. However, these are post hoc at- preferred to, indirect [7]. The NICE Guide to the Methods of Tech- tempts to justify the observed difference between the Vogel et al. nology Appraisal [27] allows for an MTC combining direct and in- trial and other head-to-head trials making the same comparison. direct evidence to be presented as a subsidiary analysis. It re- There are also similarities between the Vogel et al. trial and the quires, however, the reference case analysis to be based on direct other trials making the same comparison—the Romieu et al. [27] evidence. Consider the situation in which the decision problem is trial in particular (the results of which disagree most strongly with that of whether to adopt primary prophylaxis with pegfilgrastim Vogel et al.). Furthermore, these explanations do not identify the or use no primary G-CSF prophylaxis. At time 1, an investigator Vogel et al. trial as distinct from all the other trials across the charged with this decision problem and following the guidance network. Both the del Giglio et al. [24] and Holmes et al. phase III given by Cochrane and NICE would have based their analysis ex- [51] trials appear to have study characteristics very similar to clusively on the Vogel et al. trial, this being the only direct evi- those of the Vogel et al. trial (see Appendix A3 found at doi:10.1016/ dence. Possibly they would have felt quietly confident in excluding j.jval. the indirect evidence, in view of the apparent discordance be- 2011.05.008). tween the mean treatment effects obtained from direct and indi- Guidance is required from decision makers as to whether such rect sources. A key point of interest in this illustration, however, is criteria can be used to identify trials that can be excluded from the that it offers us the benefit of hindsight. At time 2, we have addi- analysis as they are too dissimilar from the decision population to tional evidence available that provides an independent (direct) be relevant. One concern is that this can lead to accusations of estimate of the “difficult” primary pegfilgrastim and no primary “cherry-picking” studies that favor a particular treatment. To G-CSF comparison. With the additional evidence, the direct and avoid this accusation, exclusion criteria should be clearly defined indirect evidence appears consistent; moreover, the direct esti- before identifying the evidence, and it may be advisable to include mate at time 2 is in line with the indirect estimate at time 1 rather all available studies unless they are clearly not relevant. In our than the direct estimate at that time. example, it might seem that this is not a major concern because Following accepted wisdom would have produced what seems both direct and indirect estimates suggest that pegfilgrastim is to be, in the light of later evidence, the “wrong” answer in a time 1 more efficacious than no primary G-CSF. However, differences in analysis of the efficacy of primary pegfilgrastim versus no primary estimates of treatment effect can have a marked impact on cost- G-CSF. The actual decision problem in this example, however, in- effectiveness, even when they agree over the statistically superi- volved a simultaneous comparison of all G-CSFs. In this situation, ority of one treatment over another. None of this changes the basic the problem with guidance favoring direct evidence is more fun- requirement that when a choice is being made among three or damental—it is impossible to follow. This is because it is not pos- more treatments, estimates of relative treatment effect must form sible to construct a coherent cost-effectiveness analysis (CEA) for a consistent set. multiple treatments based on incremental net benefit unless the The Bayesian synthesis, node-splitting, and cross-validation statistical model for the treatment effects guarantees consistency approaches are not the only methods that could have been used. A (as the MTC does). It follows that a trial that is “direct” evidence for simple frequentist approach such as that presented by Bucher et one comparison is “indirect” evidence for another. This is why al. [28], could also have been used. This would be implemented by indirect comparisons and MTC analyses have become, in practice, carrying out a standard meta-analysis of the trials on each “edge” routine whenever three or more treatments must be compared in of the triangle in the Figure 1 network, forming an indirect esti- the same CEA. mate of the primary pegfilgrastim versus no primary G-CSF effect Once it is accepted that the estimates used in the analysis must and finally comparing this to the direct evidence. In this particular be internally consistent, it is more meaningful to consider which case study, the Bucher et al. method has a potential drawback. trials to include in the analysis rather than whether to pool direct Random-effects estimates can be obtained for two of the con- and indirect evidence. In our illustrative example, the inconsis- trasts, but because there is only one pegfilgrastim versus no pri- tency is between one trial (Vogel et al.) and the others. At time 1, mary G-CSF trial at time 1, one must use a fixed-effects estimate the statistical evidence for the presence of inconsistency is at best for this contrast. marginal. At time 2, the issue is clearer. Direct and indirect evi- The Bayesian MCMC method in the WinBUGS package has al- dence are broadly consistent; one could again “prefer” direct evi- lowed us to fit a random-effects model with a shared variance to dence if the CEA concerned only primary pegfilgrastim versus no all four contrasts at the same time. This is a more reasonable primary G-CSF, although we would argue there are good reasons model and is considerably more conservative. After all, if the evi- for preferring a more robust estimate from the broader evidence dence on the other contrasts is heterogeneous, we would expect base. However, this does not solve the problem of inconsistency, evidence on primary pegfilgrastim versus no primary G-CSF to also which exists between the Vogel et al. trial and other trials making be heterogeneous. In the Bucher et al., approach, a statistical test the same comparison rather than between direct and indirect ev- for consistency of indirect and direct evidence (at time 1) would idence. This shows that inconsistency, given by some as a reason give a P value of 0.01 (see Appendix A2 found at doi:10.1016/ for avoiding the synthesis of direct and indirect evidence in an j.val.2011.05.008). In our view, this exaggerates the inconsistency MTC, can also be a problem for pairwise meta-analyses. in the evidence because it fails even to recognize that the primary The decision as to which trials to include in the analysis should pegfilgrastim versus no primary G-CSF efficacy estimate should be be guided by their relevance to the target population of the deci- seen as the mean of a random-effects distribution, let alone the sion being informed. If the patient population in the Vogel et al. need to compare the specific trial result to a suitable predictive trial matches the decision population more closely than other distribution. The Bucher et al. approach would, if used to inform studies, and it is reasonable to suppose that these differences an economic model, considerably understate uncertainty in the might modify relative treatment effects, then it may be appropri- efficacy of pegfilgrastim versus placebo compared to the more ate to base the decision on its results. From a description of the conservative shared-variance assumption (this can be seen by clinical characteristics of the trials (see Appendix, which can be comparing the 95% credible interval of 0.02–0.19 for this parame- found at doi:10.1016/j.val.2011.05.008) differences can be identi- ter given in Table 1 with the 95% confidence interval of 0.03–0.15 VALUE IN HEALTH 14 (2011) 953–960 959

given in Appendix A2 which can be found at at doi:10.1016/ or if hard copy of article, at www.valueinhealthjournal.com/issues j.val.2011.05.008). (select volume, issue, and article). Another class of models that could be applied are those that allow for different levels of between-trial variation in each of the four contrasts [12]. Table 1 presents no strong evidence against the REFERENCES homogeneous variance model that we have assumed, although a model with lower heterogeneity in the head-to-head trials could also be considered. It is unclear, however, whether this would [1] Gleser L, Olkin I. Meta-analysis for 2X2 tables with multiple treatment change conclusions in this case. A persistent difficulty in all meta- groups. In: Stangl DK, Berry DA, eds., Meta-analysis in Medicine and analyses is the paucity of information on between-trials variation. Health Policy. New York: Marcel Dekker, 2000. Heterogeneous variance methods would require informative pri- [2] Higgins JPT, Whitehead J. Borrowing strength from external trials in a ors on aspects of variation and covariation, and this is currently an meta-analysis. Stat Med 1996;15:2733–49. [3] Hasselblad V. Meta-analysis of multi-treatment studies. Med Decis active area of research. Making 1998;18:37–43. What lessons can be learned from this example? Clearly, there [4] Caldwell DM, Ades AE, Higgins JP. Simultaneous comparison of is a debate to be had about the interpretation of random-effects multiple treatments: combining direct and indirect evidence. BMJ models in health technology assessment [29,30]. But, setting that 2005;331:89700. [5] Lu G, Ades A. Assessing evidence consistency in mixed treatment aside, it suggests that, rather than prefer direct evidence, a better comparisons. JASA 2006;101:447–59. strategy may be to use all available evidence, both direct and in- [6] National Institute for Health and Clinical Excellence (NICE). Guide to direct, subject to a rigorous, prespecified set of trial inclusion cri- the methods of technology appraisal. Available from: http://www. teria, and to careful post hoc checks for consistency, although nice.org.uk/media/B52/A7/TAMethodsGuideUpdatedJune2008.pdf; 2008. [Accessed May 23, 2011]. these will seldom have enough power to detect potential prob- [7] Higgins JPT, Green S. Cochrane Handbook for Systematic Reviews of lems. Empirical studies conducted to date in which direct and in- Interventions Version 5.0.0 [updated February 2008]. Chichester: The direct estimates have been compared would support this view [8]. Cochrane Collaboration, Wiley, 2008. In cases in which the two have been discordant, there are clear [8] Song F, Altman D, Glenny M-A, Deeks J. Validity of indirect comparison for estimating efficacy of competing interventions: indications that direct and indirect estimates were based on dif- evidence from published meta-analyses. BMJ 2003;326:472–6. ferent protocols or doses [8,11,31]. Song et al. [32] have even sug- [9] Chou R, Rongwei F, Hoyt Huffman L, Korthuis PT. Initial highly-active gested that indirect evidence may be less biased in some circum- antiretroviral therapy with a protease inhibitor versus non-nucleoside stances. Another approach is the use of MTC methods to both reverse transcriptase inhibitor: discrepancies between direct and indirect meta-analyses. Lancet 2006;368:1503–15. estimate “novelty” bias [33] or bias associated with markers of [10] Caldwell DM, Gibb DM, Ades AE. Validity of indirect comparisons in lower trial quality [34] and simultaneously adjust for these biases. meta-analysis. Lancet 2007;369:270. These approaches probably require further evaluation before they [11] Chou R, Fu RW. Validity of indirect comparisons in meta-analysis: could be adopted for routine use, but they are among a growing set authors’ reply. Lancet 2007;369:271. [12] Lu G, Ades AE. Modelling between-trial variance structure in mixed of techniques designed to adjust for potential bias, without requir- treatment comparisons. Biostatistics 2009;10:792–805. ing a demonstration of a significant bias [35]. [13] Dias S, Welton NJ, Caldwell DM, Ades AE. Checking consistency in mixed treatment comparison meta-analysis. Stat Med 2010;29:932–44. [14] DuMouchel W. Hierarchical Bayes Linear Models for Meta-analysis (Report No. 27). Research Triangle Park, NC: National Institute of Conclusions Statistical Sciences, 1994. [15] Marshall EC, Spiegelhalter DJ. Approximate cross-validatory predictive We present an example in which the indirect evidence gave mark- checks in disease mapping models. Stat Med 2003;22:1649–60. [16] Spiegelhalter DJ, Thomas A, Best N, Lunn D. WinBUGS User Manual: edly better predictions of future trial results than the direct evi- Version 1.4. Cambridge, UK: MRC Biostatistics Unit, 2001. dence. Doubtless, other examples could be found in which the [17] Kuderer N, Dale D, Crawford J, Lyman G. Impact of primary opposite is true. However, if the AB, AC, and BC evidence is con- prophylaxis with granulocyte colony-stimulating factor on febrile sidered equally representative of the target population and neutropenia and mortality in adult cancer patients receiving chemotherapy: a systematic review. J Clin Oncol 2007;25:3158–67. equally vulnerable to effect modification, then the general rule, in [18] Pinto L, Liu Z, Doan Q. Comparison of pegfilgrastim with filgrastim on line with the overall aim of meta-analysis, should be to combine febrile neutropenia, grade IV neutropenia and bone pain: a meta- all sources of relevant information to obtain the most robust co- analysis of randomized controlled trials. Curr Med Res Opin 2007;23: herent estimates. Bayesian methods offer a flexible approach for 2283–95. [19] Vogel C, Wojtukiewicz M, Carroll R, et al. First and subsequent cycle the synthesis of a network of evidence and the identification of use of pegfilgrastim prevents febrile neutropenia in patients with subsets of the evidence that are mutually inconsistent. breast cancer: a multicenter, double-blind, placebo-controlled phase III study. J Clin Oncol 2005;23:1178–84. [20] Cooper KL, Madan J, Whyte S, et al. Granulocyte colony-stimulating factors for prevention of febrile neutropenia following chemotherapy: Acknowledgments systematic review and meta-analysis. University of Sheffield HEDS Discusssion Paper Series 2009;(09/07). [21] Balducci L, Al-Halawani H, Charu V, et al. Elderly cancer patients We thank Sofia Dias and Nicky Welton for their comments on receiving chemotherapy benefit from first-cycle pegfilgrastim. earlier versions of this work. Oncologist 2007;12:1416–24. Source of financial support: Funding for this study was partially [22] Romieu G, Clemens M, Mahlberg R. Pegfilgrastim supports delivery of FEC-100 chemotherapy in elderly patients with high risk breast supported by an unrestricted grant from Amgen Ltd. Comments cancer: a randomized phase 2 trial. Crit Rev Oncol Hematol 2007;64: on the manuscript were received from Amgen staff, but the study 64–72. concept, study design, final manuscript content, and decision to [23] Hecht JR, Pillai M, Gollard R, et al. Pegfilgrastim in colorectal cancer publish remained with the authors. (CRC) patients (pts) receiving every-two-week (Q2W) chemotherapy (CT): long-term results from a phase II, randomized, controlled study. J Clin Oncol 2009;27(Suppl.):Abstract 4072. [24] del Giglio A, Eniu A, Ganea-Motan D, et al. XM02 is superior to placebo and equivalent to Neupogen in reducing the duration of severe Supplemental Material neutropenia and the incidence of febrile neutropenia in cycle 1 in breast cancer patients receiving docetaxel/doxorubicin chemotherapy. BMC Cancer 2008;8:332. Supplemental material accompanying this article can be found in [25] DerSimonian R, Laird N. Meta-analysis of clinical trials. Control Clin the online version as a hyperlink at doi:10.1016/j.jval.2011.05.042, Trial 1996;7:177–88. 960 VALUE IN HEALTH 14 (2011) 953–960

[26] Spiegelhalter DJ, Best NG, Carlin BP, van der Linde A. Bayesian [42] Crawford J, Ozer H, Stoller R, et al. Reduction by granulocyte colony- measures of model complexity and fit. J R Stat Soc B 2002;64:583–616. stimulating factor of fever and neutropenia induced by chemotherapy [27] National Institute for health and Clinical Excellence. Guide to the in patients with small-cell lung cancer. N Engl J Med 1991;325:164–70. methods of technology appraisal. London: NICE, 2008. [43] Fossa SD, Kaye SB, Mead GM, et al. during combination [28] Bucher HC, Guyatt GH, Griffith LE, Walter SD. The Results of direct and chemotherapy of patients with poor-prognosis metastatic germ cell indirect treatment comparisons in meta-analysis of randomized malignancy. European Organization for Research and Treatment of controlled trials. J Clin Epidemiol 1997;50:683–91. Cancer, Genito-Urinary Group, and the Medical Research Council [29] Ades AE, Lu G, Higgins JPT. The interpretation of random effects meta- Testicular Cancer Working Party, Cambridge, United Kingdom. J Clin analysis in decision models. Med Decis Making 2005;25:646–54. Oncol 1998;16:716–24. [30] Higgins JPT, Thompson SG, Spiegelhalter DJ. A re-evaluation of [44] Chevallier B, Chollet P, Merrouche Y, et al. Lenograstim prevents morbidity from intensive induction chemotherapy in the treatment of random-effects meta-analysis. J R Stat Soc A 2009;172:137–59. inflammatory breast cancer. J Clin Oncol 1995;13:1564–71. [31] Caldwell DM, Gibb DM, Ades AE. Validity of indirect comparisons in [45] Gisselbrecht C, Haioun C, Lepage E, et al. Placebo-controlled phase III meta-analysis. Lancet 2007;369:270. study of lenograstim (glycosylated recombinant human granulocyte [32] Song F, Harvey I, Lilford R. Adjusted indirect comparison may be less colony-stimulating factor) in aggressive non-Hodgkin’s lymphoma: biased than direct comparison for evaluating new pharmaceutical factors influencing chemotherapy administration. Groupe d’Etude des interventions. J Clin Epidemiol 2008;61:455–63. Lymphomes de l’Adulte. Leuk Lymphoma 1997;25:289–300. [33] Salanti G, Dias S, Welton NJ, et al. Evaluating novel agent effects in [46] Bui BN, Chevallier B, Chevreau C, et al. Efficacy of lenograstim on multiple treatments meta-regression. Stat Med 2010;29:2369–83. hematologic tolerance to MAID chemotherapy in patients with [34] Dias S, Welton NJ, Marinho VCC, et al. Estimation and adjustment of advanced soft tissue sarcoma and consequences on treatment dose- bias in randomised evidence by using mixed treatment comparison intensity. J Clin Oncol 1995;13:2629–36. meta-analysis. J R Stat Soc A 2010;173:613–29. [47] Gebbia V, Valenza R, Testa A, et al. A prospective randomized trial of [35] Moreno SG, Sutton AJ, Turner EH, et al. Novel methods to deal with thymopentin versus granulocyte--colony stimulating factor with or publication biases: secondary analysis of antidepressant trials in the without thymopentin in the prevention of febrile episodes in cancer FDA trial registry database and related journal publications. BMJ 2009; patients undergoing highly cytotoxic chemotherapy. Anticancer Res 339:b2981. 1994;14:731–4. [36] Doorduijn JK, van der HB, van Imhoff GW, et al. CHOP compared with [48] Gebbia V, Testa A, Valenza R, et al. A prospective evaluation of the CHOP plus granulocyte colony-stimulating factor in elderly patients activity of human granulocyte-colony stimulating factor on the with aggressive non-Hodgkin’s lymphoma. J Clin Oncol 2003;21: prevention of chemotherapy-related neutropenia in patients with 3041–50. advanced carcinoma. J Chemother 1993;5:186–90. [37] Osby E, Hagberg H, Kvaloy S, et al. CHOP is superior to CNOP in elderly [49] Green M, Koelbl H, Baselga J, et al. A randomized double-blind patients with aggressive lymphoma while outcome is unaffected by multicenter phase III study of fixed-dose single-administration filgrastim treatment: results of a Nordic Lymphoma Group pegfilgrastim versus daily filgrastim in patients receiving randomized trial. Blood 2003;101:3840–8. myelosuppressive chemotherapy. Ann Oncol 2003;14:29–35. [38] Zinzani PL, Pavone E, Storti S, et al. Randomized trial with or without [50] Holmes F, O’Shaughnessy J, Vukelja S, et al. Blinded, randomized, multicenter study to evaluate single administration pegfilgrastim once granulocyte colony-stimulating factor as adjunct to induction VNCOP- per cycle versus daily filgrastim as an adjunct to chemotherapy in B treatment of elderly high-grade non-Hodgkin’s lymphoma. Blood patients with high-risk stage II or stage III/IV breast cancer. J Clin 1997;89:3974–9. Oncol 2002;20:727–31. [39] Pettengell R, Gurney H, Radford J, Deakin D. Granulocyte colony- [51] Holmes F, Jones S, O’Shaughnessy J, et al. Comparable efficacy and stimulating factor to prevent dose-limiting neutropenia in non- safety profiles of once-per-cycle pegfilgrastim and daily injection Hodgkin’s lymphoma: a randomized controlled trial. Blood 1992;80: filgrastim in chemotherapy-induced neutropenia: a multicenter dose- 1430–6. finding study in women with breast cancer. Ann Oncol 2002;13:903–9. [40] Timmer-Bonte J, de Boo T, Smit H, et al. Prevention of chemotherapy- [52] Vose J, Crump M, Lazarus H, et al. Randomized, multicenter, open- induced febrile neutropenia by prophylactic antibiotics plus or minus label study of pegfilgrastim compared with daily filgrastim after granulocyte colony-stimulating factor in small-cell lung cancer: a chemotherapy for lymphoma. J Clin Oncol 2003;21:514–9. Dutch randomized phase III study. J Clin Oncol 2005;23:7974–84. [53] Grigg A, Solal-Celigny P, Hoskin P, et al. Open-label, randomized study [41] Trillet-Lenoir V, Green J, Manegold C, et al. Recombinant granulocyte of pegfilgrastim vs. daily filgrastim as an adjunct to chemotherapy in colony stimulating factor reduces the infectious complications of elderly patients with non-Hodgkin’s lymphoma. Leuk Lymphoma cytotoxic chemotherapy. Eur J Cancer 1993;29:319–24. 2003;44:1503–8.