METHODS FOR ADJUSTING FOR PUBLICATION BIAS
AND OTHER SMALL‐STUDY EFFECTS
IN EVIDENCE SYNTHESIS
Thesis submitted for the degree of
Doctor of Philosophy
at the University of Leicester
by
Santiago Gutierrez Moreno BSc
Department of Health Sciences
University of Leicester
September 2009
Abstract
Meta-analyses usually combine published studies, omitting those that have not been published for whatever reason. When the reasons for non-publication are related to the findings rather than to chance, the problem of publication bias arises. Research into publication bias suggests that it is the ‘interest level’, or statistical significance, of the findings, not study rigour or quality, that determines which research gets published and thus becomes publicly available.
When the results of the scientific literature as a whole are considered, such publication practices distort the true picture, which may exaggerate clinical effects resulting in potentially erroneous clinical decision-making. Therefore, meta-analyses (as well as other more complex evidence synthesis models) based on the published literature should be seen as ‘at risk’ of publication bias, which has the potential to bias conclusions and thus adversely affect decision-making. Many methods exist for detecting publication bias, but this alone is not sufficient if results from meta-analyses are going to be used within a decision-making framework. What is required in the view of this thesis is a reliable way to adjust pooled estimates for publication bias.
This thesis explores novel and existing approaches to publication bias adjustment, including frequentist and Bayesian approaches, with the aim of identifying those with the most desirable statistical properties. Special attention is given to regression-based methods commonly used to test for the presence of publication bias (and other ‘small-study effects’). The regression-based approach produces very encouraging results in a case study for which gold standard data exist. The incorporation of external information about the direction and strength of the bias is also explored in the hope of improving the methods’ performance. Ultimately, routine estimation of the bias-adjusted effect is recommended, as it improves the overall results compared to standard meta-analysis.
Acknowledgements
I would like to express my gratitude to Prof. Tony Ades for sharing his novel
Bayesian semi-parametric model considered in chapter 10 in addition to providing crucial input and inspiring thinking throughout the thesis. I thank Prof. Keith Abrams for providing constructive criticism in the methodological issues and devoting time to me despite his frantic agenda. I am also incredibly grateful to Dr. Nicola Cooper for being so supportive and encouraging while helping to improve the overall quality of this work.
I would like to thank Prof. John Thompson, Dr. Tom Palmer, Dr. Nicola Novielli and Dr.
Jaime Peters for constructive discussions regarding the work contained in this thesis.
Dr. Jaime Peters also assisted me with the English grammar for which I am grateful.
Thanks also to all my other colleagues and friends at the department for having made my stay here so wonderful.
Needless to say, this thesis would never have been possible without the guidance of my outstanding supervisor Prof. Alex Sutton. What is more, I am greatly beholden to him for putting so much faith in me from the very first day. I would like to acknowledge the Medical Research Council (HSRC) for sponsoring me during this four-year doctoral programme; in particular, Prof. Paul Dieppe and his team for continuously injecting much-needed enthusiasm into the PhD students. I am deeply indebted to my wife, mother, sisters and friends for their unconditional support and infinite faith in me. Ultimately, this thesis is dedicated to the memory of my father, who sadly did not live to celebrate its completion.
TABLE OF CONTENTS
Abstract
Chapter 1 – Introduction
1.1. Background ……………………………………………………………… 1
1.2. Aims of the thesis ……………………………………………………… 7
1.3. Thesis outline …………………………………………………………… 9
Chapter 2 – Literature review on meta-analysis
2.1. Introduction to meta-analysis ………………………………………… 11
2.2. Fixed-effect meta-analysis model …………………………………… 12
2.3. Between-study variability (heterogeneity) …………………………… 16
2.4. Random-effects meta-analysis model ………………………………… 21
2.5. Introduction to meta-regression ……………………………………… 26
2.6. Aggregation bias ………………………………………………………… 29
2.7. Multiple meta-regression ……………………………………………… 30
2.8. The regression model …………………………………………………… 31
2.9. Summary ………………………………………………………………… 33
Chapter 3 – Literature review on publication bias and small-study effects
3.1. Biases in meta-analytical data………………………………………..…… 34
3.2. Reporting biases………………………………………………………….… 35
3.3. Introduction to small-study effects………………………………………… 37
3.4. Sources of small-study effects………………………………..…………… 38
• 3.4.1. Reporting biases…………………………………………….…… 40
• 3.4.2. True/genuine heterogeneity……………………..……………… 42
• 3.4.3. Data irregularities………………………………………………… 44
• 3.4.4. Artifactual heterogeneity ………………………………………… 48
3.5. Summary ………………………………………………………………… 52
Chapter 4 – Evidence for publication bias (and other small-study effects) and how to address it in meta-analysis
4.1. Evidence for publication bias (and other small-study effects) ……… 53
• 4.1.1. Meta-epidemiological studies…………….………………..….… 53
• 4.1.2. Funnel plot …………………………………………………..…… 54
• 4.1.3. Contour-enhanced funnel plot……………………………..…….58
• 4.1.4. Tests for publication bias/small-study effects………….……… 61
4.2. Methods for addressing publication bias…………………………….…… 66
• 4.2.1. Prevention …………………………………………………...…… 66
• 4.2.2. Best evidence synthesis approach…………………...…...…… 68
• 4.2.3. Grey literature…………………………………………..………… 69
• 4.2.4. File-drawer number……………………………………..………...70
• 4.2.5. Trim & Fill adjustment method…………………………..……… 71
• 4.2.6. Selection models………………………………………………… 74
• 4.2.7. Multiple imputation ………………………………………………… 77
4.3. Summary ………………………………………………………………… 78
Chapter 5 – Underlying theory for publication bias adjustment through regression
5.1. Proposed method for adjusting for publication bias (and other small-study effects) ………………………………………………………… 79
5.2. The original Egger model ……………………………………………… 84
5.3. Biases affecting the Egger model …………………………………… 86
• 5.3.1. Structural correlation……………………………………..……… 86
• 5.3.2. Regression assumptions………………………………..….…… 90
• 5.3.3. Measurement error and attenuation bias……….…………...… 91
• 5.3.4. Heteroscedasticity ………………………………………………… 99
5.4. Weighted regression and variants of the Egger model …………… 100
5.5. Discussion ……………………………………………………………… 104
5.6. Summary ………………………………………………………………… 108
Chapter 6 – Assessment of existing & novel methods for adjusting for publication bias (and other small-study effects) through simulation
6.1. Introduction ……………………………………………………………… 109
6.2. Published simulation studies evaluating methods for publication bias … 111
6.3. Statistical methods to be evaluated ………………………………… 114
• 6.3.1. Non-parametric adjustment Trim and Fill method………...… 118
• 6.3.2. Parametric adjustment methods (regression-based) ….…… 118
• 6.3.3. Conditional methods ……………………………………………… 122
6.4. Simulation procedures ………………………………………………… 122
• 6.4.1. Level of dependence between simulated datasets……….… 123
• 6.4.2. Software to perform simulations………………………….…… 123
• 6.4.3. Number of simulations to be performed ………………………… 123
6.5. Methods for generating the datasets ………………………………… 124
• 6.5.1. Underlying effect size………………………………………...… 125
• 6.5.2. Number of primary studies in a meta-analysis……….……… 126
• 6.5.3. Event rate……………………………………………………..… 126
• 6.5.4. Number of events………………………………………...…..… 128
• 6.5.5. Number of subjects…………………………………………..… 128
• 6.5.6. Ratio of subjects………………………………………….…..… 130
• 6.5.7. Inducing heterogeneity…………………………..……….….… 130
• 6.5.8. Inducing publication bias……………………………...…..…… 134
o 6.5.8.1. Inducing publication bias by p-value ………………………… 137
o 6.5.8.2. Inducing publication bias by effect size ……………………… 142
• 6.5.9. Impact of publication bias on between-study variance ………… 145
6.6. Simulation scenarios to be investigated ……………………………… 146
6.7. Criteria to assess the methods’ performance ……………………… 148
• 6.7.1. Assessment of bias (model accuracy) ……………………..…149
• 6.7.2. Combined assessment of bias (model accuracy) and variability (model precision) ………………………………………………………..149
• 6.7.3. Assessment of coverage……………………………………...…152
• 6.7.4. Assessment of variability……………………………………..….153
6.8. Results of the simulation study ……………………………………… 155
6.9. Discussion ……………………………………………………………… 168
6.10. Summary ……………………………………………………………… 178
Chapter 7 – Adjustment method implemented on a case study where a gold standard exists
7.1. Antidepressants case study …………………………………………… 179
7.2. Data collection ………………………………………………………… 180
7.3. Analysis ………………………………………………………………… 181
7.4. Results …………………………………………………………………… 182
7.5. Discussion ……………………………………………………………… 190
7.6. Summary ………………………………………………………………… 198
Chapter 8 – The case for the routine implementation of the adjustment method
8.1. Introduction ……………………………………………………………… 199
8.2. Weighting properties of the regression approach …………………… 199
8.3. Pre-eclampsia case study ……………………………………………… 201
8.4. ‘Set shifting’ ability case study ………………………………………… 206
8.5. Discussion ……………………………………………………………… 210
8.6. Summary ………………………………………………………………… 213
Chapter 9 – Simplified Rubin’s surface estimation
9.1. Introduction……………………………………………………….….………214
9.2. The ‘effect size surface’ approach…………………………….…..………215
9.3. Rubin’s surface function in terms of small-study effects…….….………217
9.4. Discussion……………………………………………………….…..………223
9.5. Summary………………………………………………………….…………224
Chapter 10 – Novel Bayesian semi-parametric regression model
10.1. Introduction to Bayesian statistics in meta-analysis ……………… 225
10.2. Parametric versus non-parametric modelling ……………………… 233
10.3. WinBUGS – Bayesian model fitting software ……………………… 234
10.4. Description of the semi-parametric regression model …………… 235
10.5. Implementation upon the magnesium case study ………………… 239
10.6. Additive regression and sensitivity analysis of model parameters … 240
10.7. Simulation study & customized models to be evaluated ………… 247
10.8. Results of the simulation study ……………………………………… 248
10.9. Simulation procedures & model checking ………………………… 249
10.10. Discussion & summary ……………………………………………… 250
Chapter 11 – A fully Bayesian approach to regression adjustment by using prior information
11.1. Use of external data to inform the small-study effects trend ……… 251
11.2. Application to the antidepressants case study ……………………… 252
11.3. Deriving the prior distribution ………………………………………… 256
11.4. Discussion on the use of prior information ………………………… 264
11.5. Introduction to network meta-analysis ……………………………… 268
11.6. Adapting the adjustment method to network meta-analysis ……… 271
11.7. Summary ………………………………………………………………… 272
Chapter 12 – Discussion & conclusions
12.1. Thesis summary………………………………………………….……..…273
12.2. Discussion……………………………………………………………….…276
12.3. Further work…………………………………………………………..……283
12.4. Conclusions……………………………………………………………...…290
APPENDIXES
1. Logistic regression formulation …………………………………………… 292
2. Relationship between p-value and effect size …………………………… 293
3. Publication bias intensity levels established by Duval and Tweedie …… 295
4. Additional plots summarising the results from the remaining scenarios … 295
5. Derivation of equations presented in chapter 8 ………………………… 315
BIBLIOGRAPHY…………………………………………………………………….……320
ABBREVIATIONS
CI      Confidence Interval
DIC     Deviance Information Criterion
FDA     Food and Drug Administration (USA)
FE      Fixed-Effect
IPD     Individual Patient Data
lnOR    Natural logarithm of the Odds Ratio
MA      Meta-Analysis
MCMC    Markov Chain Monte Carlo
MR      Meta-Regression
MSE     Mean sum of Squared Errors
OLS     Ordinary Least Squares
OR      Odds Ratio
PB      Publication Bias
RCT     Randomised Controlled Trial
RE      Random-Effects
SMD     Standardised Mean Difference
TF      Trim & Fill
VWLS    Variance-Weighted Least Squares
WOLS    Weighted Ordinary Least Squares
Publications, poster and oral presentations associated with this thesis
1. Article published in the BMJ relating to chapter 7 (Moreno et al 2009b)
2. Letter published in the Lancet about the impact of publication bias on network meta-analysis (Turner et al 2009a)
3. Article published in BMC Medical Research Methodology regarding the simulation study described in chapter 6 (Moreno et al 2009a)
4. Article accepted for publication in PLoS Medicine evaluating the ‘deceptive’ efficacy of C reactive protein as a prognostic marker among patients with stable coronary artery disease (Hemingway et al 2009)
5. Article published in the Stata Journal about the use of WinBUGS from within Stata (Thompson et al 2006)
6. Article accepted for publication in the Stata Journal about further Stata commands for Bayesian analysis (Thompson et al 2009)
7. Article published in the Stata Journal about the Stata command for contour-enhanced funnel plots (Palmer et al 2008), later compiled in a book (Sterne et al 2009a)
8. Article in press at the Journal of the Royal Statistical Society (Series A) assessing publication bias in meta-analysis in the presence of between-study heterogeneity (Peters et al 2009)
9. Oral presentation at the 16th German Cochrane Colloquium summarising chapters 5–7 (Moreno et al 2008)
10. Oral presentation at the MiM (Methods in Meta-analysis) meeting (June 2007) describing the simulation study design and the interpretation of preliminary results
11. Poster presentation at the Royal Statistical Society 2007 conference in York, which received the best poster award; the poster described the design of the simulation study and the interpretation of its preliminary results
12. Oral presentation at the conference of the Society for Medical Decision Making in Birmingham (June 2006), where preliminary results from the Bayesian semi-parametric model described in chapter 10 were discussed
13. Other publications by the author of the thesis that are not related to this research project (Mar et al 2006, Comas et al 2008, Roman et al 2008, Oliva et al 2009, Parra-Blanco et al 2010)
14. A press release about some of the results of the thesis has been posted on AlphaGalileo, the online news centre for European research, for distribution to journalists all over the world (www.alphagalileo.org/ViewItem.aspx?ItemId=60234&CultureCode=en)
Introduction
1.1. Background
A collection of unbiased data is vital in order to make reliable inferences (Copas &
Li 1997). It is well known that selecting evidence favouring one particular hypothesis leads to biased inferences (Melander et al 2003). As an illustrative example, news coverage generally focuses on what editors and journalists perceive to be of ‘interest’ to the public (Goldacre 2008). Therefore, if the general public considers media coverage the only valid source of information, it is not surprising that general opinion is systematically biased. For instance, railway safety is widely perceived to have worsened since privatisation; undoubtedly, the extensive and continuous media coverage of tragic rail accidents has fostered this belief.
What this thesis proposes is that instead of taking for granted that any compilation of evidence is fairly representative of the state of affairs, evidence should be first scrutinised to examine whether it is indeed the case. In truth, there is not enough robust evidence from the accident data 1967-2005 to conclude that privatisation compromised safety (Evans 2007). Similar to news coverage, scientific literature can also provide a distorted picture of the truth with consequences that cannot be ignored
(Dickersin 2008, Landefeld & Steinman 2009).
In health-related research, systematic reviews are used to compile, assess and summarise the scientific evidence around a particular research question. The overall aim of these reviews is to evaluate comprehensively the available evidence keeping potential biases to a minimum. This is achieved by following a systematic protocol so that the review is undertaken in a structured and explicit way (see the Cochrane guideline on literature reviews as the most prominent one (Jørgensen et al 2006,
Higgins & Green 2008, Jørgensen et al 2008, Anne et al 2009)). This systematic
approach to evidence synthesis also allows for reproducibility and for ease in updating the review (Sutton et al 2000a, Egger et al 2001, Sutton et al 2009b).
Meta-analysis (MA) is considered to be the quantitative feature of the evidence synthesis process. MAs following systematic reviews of randomised controlled trials
(RCTs) are regarded as the highest level of evidence in medicine for evaluating interventions (Harbour & Miller 2001) and are the main source of knowledge for physicians and other health professionals (Stinson & Mueller 1980). In essence, the estimate from a MA can be thought of as a weighted average of the results of the primary studies, where the weighting depends on some measure of study precision (so that smaller studies are given less weight). However, MA results are only as valid as the available evidence and so depend on the appropriateness of the compiled studies (Stangl & Berry 2000, Melander et al 2003).
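The weighted average just described can be written compactly. As a sketch in the standard inverse-variance notation (chapter 2 develops the models formally), with $y_i$ the effect estimate of study $i$ and $\hat{\sigma}_i^2$ its estimated variance, one common choice of weight gives

```latex
\hat{\theta} \;=\; \frac{\sum_{i=1}^{k} w_i \, y_i}{\sum_{i=1}^{k} w_i},
\qquad
w_i \;=\; \frac{1}{\hat{\sigma}_i^{2}},
\qquad
\widehat{\mathrm{Var}}\bigl(\hat{\theta}\bigr) \;=\; \frac{1}{\sum_{i=1}^{k} w_i},
```

so a small study (large $\hat{\sigma}_i^2$) receives a small weight, exactly as described above.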
This thesis focuses on the most common meta-analytical setting: summary efficacy data from RCTs. Indeed, RCTs are the most popular study design for investigating the efficacy of interventions, particularly health technologies in medical science. Their key advantage is that the random allocation of interventions (treatment or control) to patients prevents confounding bias (Kunz et al 2008). Interestingly, a MA of RCTs should itself be considered an observational study (Egger et al 1997b, Good & Hardin 2003) because all the available studies are combined without any randomisation between them. Yet inferences from MAs are usually made under the assumption of randomisation. Thus, MAs are susceptible to unforeseen selection mechanisms that could induce sampling bias, with the end result of “evidence b(i)ased medicine” (Melander et al 2003).
Predominantly, MA is used to estimate the pooled treatment effect based on all the evidence collected. Unfortunately, MAs may provide a biased estimate of efficacy due to publication bias (PB), which may threaten the validity of findings and consequently mislead decision makers into incorrect funding and service provision (Rising et al 2008,
Turner et al 2008a). Conventionally, PB is defined as the tendency to publish a study based on its results, rather than based on its theoretical or methodological quality
(Berlin et al 1989). This implies that it is the ‘interest level’, or statistical significance of findings (Sterling 1959, Nieminen et al 2007), not study rigour or quality, that determines which research is published and subsequently publicly available. Since PB threatens the internal validity of meta-analytic results, PB is one of the most widely researched topics in MA (Stangl & Berry 2000). Indeed, PB is considered the most important bias within the selection biases category, which refers to the lack of accuracy of the sampling frame (Delgado-Rodriguez & Llorca 2004). That is, the selection process generates a sample that is not representative of the population of existing studies because the missing studies are not missing at random. Therefore, the inferences made from this biased sample may be erroneous.
In other words, PB occurs when the published studies in a MA are not representative of the totality of existing research (Preston et al 2004), the absent studies being missing in a way that depends on the perceived value of their findings. To sum up, research that achieves ‘interesting’ or encouraging results is more likely to be published, potentially biasing the literature towards ‘interesting’ conclusions that may not reflect the underlying truth (Rothstein et al 2005). Hence, whenever decisions must be made based on the available evidence, caution is needed in interpreting the findings because they may be subject to this phenomenon, publication bias (PB), as it has come to be known (Egger & Smith
1995, Sterne et al 2001b).
The problem of PB was initially raised as early as 1605, with a clear reference to medical science in 1909 (Dickersin et al 1993, Petticrew 1998). Evidence about PB was first compiled by Sterling in 1959 (Sterling 1959, Sterling et al 1995) when he realised that the vast majority of published papers reported significant results, which entailed that studies with non-significant findings were somehow more likely to be missing/unpublished. Since then, there has been much scientific discussion of the topic
(Chalmers et al 1990, Dickersin 1990, Song et al 2000, Liesegang et al 2008) revealing some shocking episodes (Egger & Smith 1995). Statisticians have also developed methods to either detect or adjust for PB although their application is not routine in the literature (Thornton & Lee 2000, Pharm et al 2001, Rothstein et al 2005).
The reason PB is such a crucial topic in the health sciences is that it applies to the reporting of study findings on health technologies. The consequences of a biased medical literature range from the waste of public resources to harm to patients.
For these reasons, PB can be considered scientific misconduct (Dickersin 2008,
Liesegang et al 2008). All areas of empirical research are susceptible to PB (Stanley
2005), but the consequences of PB are arguably more severe in health care, even when a pooled estimate is not intended, such as in qualitative research (Petticrew et al 2008). Altogether, trials of new drug therapies that achieve ‘interesting’ or encouraging results are more likely to be published than those with ‘uninteresting’ results (Liesegang et al 2008). Consequently, synthesis of the biased evidence will tend to exaggerate the benefits of the novel therapy even when that therapy is no more effective than the comparator, ineffective, or even harmful (Smith 1980). Regardless of the motives for its existence, whether unintentional or deliberate (Rennie 1999, Calnan et al 2006, Young et al 2008), PB has severe implications. The results of the evidence synthesis exercise are compromised and the integrity of medical research is questioned (Rennie 1999,
Jørgensen et al 2006, Jørgensen et al 2008); but more importantly, exaggerated clinical effects lead those caring for patients to make potentially inappropriate treatment decisions (Rothstein et al 2005). The unpredictable consequences of this affect
patients, and hence all of us at some point during our lives. What is more, ethical implications may derive from breaking the agreement between investigators and trial participants by not publishing all the results of human studies (Chalmers 1990,
Krzyzanowska et al 2003, Curt & Chabner 2008, Doroshow 2008, Dubben 2009). Thus,
PB is of such concern in the field of medicine that ignoring it is definitely not an option
(Baker & Jackson 2006).
A simple and effective measure to attenuate substantially the global problem of PB has been proposed (Horton & Smith 1999, Abbasi & Godlee 2005, Liesegang et al
2008). Prior registration of drug trials must become a prerequisite if their results are to be used, ensuring that trials with ‘uninteresting’ or discouraging results cannot simply vanish. To this end, the World Health Organization is making a significant contribution to the prevention of PB by proposing world standards for the prospective registration of all human medical research (WHO 2009b).
Although the best solution to PB is to prevent it (Rothstein et al 2005), the truth is that the underlying selection mechanism that leads to PB has continued unchanged over a period of at least thirty years (Dickersin et al 1993, Sterling et al 1995). The use of gold standard data sources, such as the US Food and Drug Administration (FDA) trial registry database, is one way of achieving a less biased data collection (Lee et al
2008, Rising et al 2008, Turner et al 2008a). Nevertheless, this is a lengthy and not always feasible remedy, and so there is a need to rely on analytic methods to deal with the problem.
The typical procedure followed when PB is suspected in a MA is simply to test for its presence. However, the question remains as to what to do when the test result is positive. Should all the trials in the MA be disregarded because there is evidence of significant PB in the data (Vandenbroucke 1998, Sterne et al 2001b)? Or, alternatively, should one merely act with caution when interpreting the results? Neither approach is sufficient if the MA result is intended to inform policymaking. Equally, it is often inappropriate to assume the absence of PB when a test fails to detect it, since the probability of a type II error is large in such tests, particularly in heterogeneous MAs with few studies (Peters et al 2005, Ioannidis 2008b, Peters et al 2009). What is required, then, is a reliable way to adjust pooled estimates for PB so as to allow more reliable decision-making.
Altogether, this thesis highlights the serious consequences for science in general and clinical decisions in particular when based on selective and thus biased information; specifically, the problem of PB. The view of this thesis is that conclusions from the MA cannot be just a warning message about the potential danger of apparent
PB, but a corrected effect estimate that can be assumed unbiased, so that it can be incorporated into decision-making to allow more consistent judgements. It is crucial, however, to be realistic about the potential achievements of this research project in developing statistical methods to adjust for PB. The main limitation is that the underlying criteria for publication are unknown and may differ between the stakeholders involved in the publication process. Indeed, the publication selection mechanism “may depend on many factors for which the available data can only act as a proxy” (Copas & Malley 2008). Until such time as the uncertainty surrounding the process of PB is understood (if ever), no statistical approach will entirely correct for PB, but ignoring it is an unwise option (Copas 2005, Baker & Jackson 2006). Hence, the ultimate objective of this thesis is to develop a statistical method that estimates a PB-adjusted pooled effect as accurately as is feasible for the purpose of decision-making in health policy. The next section sets out the aims of the thesis more explicitly.
1.2. Aims of the thesis
The issues that justify the need for this project have been covered above. With this in mind, the overall aim of this work is to contribute to the development of valid approaches to addressing the problem of PB in evidence synthesis, with special attention to MA. The project consists of the following elements to facilitate the accomplishment of this core aim:
1. Investigation of the problem of PB (and other biases) affecting meta-analytic data,
together with a review and critical appraisal of the approaches proposed so far to
tackle PB; a discussion of the advantages and disadvantages of each method
accompanies its presentation.
2. Development of alternative techniques aimed at overcoming some of the limitations
of currently applied methods, either by extending and adapting existing methods or
by developing new ones. To facilitate the evaluation and comparison between
competing adjustment methods, a simulation study is undertaken. Moreover,
several case studies are used to demonstrate the implementation of the adjustment
methods proposed, to compare them with currently used methods, and to illustrate
the potential impact of inappropriate analyses on the estimation of the bias-
adjusted pooled effect.
3. Provision of recommendations advising meta-analysts on how best to address PB
(and other small-study effects) in MA and other more complex evidence synthesis
models. This implies proposing a preferred adjustment method, justified by its
strengths and limitations in comparison to the alternatives. Besides
proposing an adjustment method, exploration of the potential benefits and
methodological challenges in incorporating external information (within a Bayesian
framework) is carried out with the intention of improving the accuracy of the bias-
corrected pooled effect.
4. In addition to presenting the results, the simulation study designed here is
proposed as a consensus simulation framework in which future testing and
adjustment methods can be evaluated. This should alleviate the previous problems
of the methods being evaluated under different (and arguably favourable)
simulation conditions.
1.3. Thesis outline
Subsequent to the introductory chapter, chapter 2 comprises a literature review of essential issues concerning MA. It covers fixed and random-effects approaches to MA.
Methodological approaches to fundamental aspects of meta-analytic data, such as heterogeneity, are also dealt with in chapter 2, with special focus on meta-regression.
Chapter 3 defines the different types of biases affecting meta-analytic data, highlighting how PB, as well as other biases, induces a traceable trend known as small-study effects that is often used to detect the presence of PB in MAs. Chapter 4 reviews and critically appraises the approaches that have so far been proposed to address PB.
Chapter 5 develops an alternative approach to PB adjustment by proposing a regression-based method as the most coherent way of tackling PB (and other small-study effects).
Chapter 6 presents a simulation study designed to compare novel and existing methods to adjust for PB (and other small-study effects) from a frequentist perspective.
In chapter 7, the preferred adjustment method is illustrated in a case study as a first step to check the external validity of the method’s results. In order to facilitate a better understanding of the properties of the proposed method relative to the standard MA, chapter 8 derives algebraically the weighting scheme of the adjustment method of choice.
Chapter 9 investigates the links between Rubin’s ‘effect size surface estimation’ approach and the adjustment method proposed here. The following chapter 10 addresses some shortcomings in the frequentist-based adjustment methods by embarking on a Bayesian approach to PB adjustment. Chapter 11 goes a step further by investigating the way in which external information can further assist in the goal of adjusting for PB more accurately. This chapter also examines the benefits and methodological challenges in adapting the adjustment method of choice to the more complex evidence synthesis framework of network MA (also known as mixed treatment comparison models (Sutton et al 2008)).
Chapter 12 concludes by identifying important issues for routine practice. For that, chapter 12 summarizes the comparative benefits and limitations of the proposed adjustment (discussed across the thesis) used to justify replacing the present naïve MA approach with the routine adjustment of small-study effects. Future research to address unanswered questions is also suggested. Some additional material such as published research articles and supplementary plots can be found in the appendixes.
Literature review on meta-analysis
2.1. Introduction to meta-analysis
MA allows the quantitative estimation of the mean effect across several primary studies. Its major benefit is that, thanks to the statistical power gained, it can provide a more precise answer to the research question of interest than any single trial. There are numerous meta-analytic techniques, which have been extensively discussed in the literature (Stangl & Berry 2000, Sutton et al 2000a, Egger et al 2001).
Techniques relevant to the aims of this thesis are reviewed in this and subsequent chapters. As noted earlier, the most commonly reported pooled estimate is calculated as a weighted average of the results of the primary studies, where the weight corresponds to the precision of the effect estimate, so that studies with greater precision are given more weight in the MA. The simplest approach is the fixed-effect (FE) MA, which defines study precision as the inverse of the study variance and is therefore known as the inverse-variance weighted MA model (section 2.2). Whenever heterogeneous effects need combining, a random-effects (RE) MA model is used (section 2.4). This is also based on the concept of inverse-variance weighting, although a between-study variance parameter is included in the weighting to allow for heterogeneity (section 2.3). Both FE and RE MA models are used throughout this thesis by default.
The inverse-variance approach can be applied in many MA scenarios to any outcome measure with an associated standard error. Other methods for estimating the standard FE pooled effect include the Mantel-Haenszel (Mantel & Haenszel 1959), Peto (Peto et al 1985) and maximum likelihood based methods (Emerson 1994). Although these are also available, they are likely to give very similar results to the inverse-variance approach in the ordinary RCT setting. The Bayesian approach to MA (Spiegelhalter et al 2003) is also applied in chapter 10 and, provided vague prior information is used, tends to provide similar results to the frequentist inverse-variance approach (Good & Hardin 2003).
This thesis focuses on the most common meta-analytical setting (Morton et al 2004) for RCTs, with special attention to dichotomous outcomes (binary data). Only the log odds ratio, obtained via the logistic link function, is considered as the summary statistic for binary data. Note that any log transformation refers exclusively to the natural logarithm (base e). Nevertheless, some case studies also consider continuous data in the form of standardised mean differences, which are also common in the context of MA.
Fixed and random-effects meta-analytic models are the two main models used to combine results from individual studies (Song et al 2001); and the choice of one over the other is usually made on the basis of the variability between the study effect estimates (Viechtbauer 2007). Between-study variability (section 2.3), also known as heterogeneity (Sutton & Higgins 2008), is an important feature of any MA that must be considered and explored (Thompson 1994). Moreover, it is extensively advocated that the estimation of heterogeneity and exploration of its sources is as important as the estimation of the pooled effect itself (Stangl & Berry 2000, Morton et al 2004) (section 2.5).
2.2. Fixed-effect meta-analysis model
The fixed-effect (FE) MA model assumes that the summary estimates from the individual studies all estimate the same underlying effect; i.e. the observed study effect sizes are all sampled from a common underlying distribution and are therefore homogeneous. Any differences between study effect estimates are assumed to be due to sampling error only (i.e. within-study variance), so that patients are assumed comparable between studies in relation to treatment efficacy. That is, patients’ and study characteristics (such as the way the therapies are applied) are assumed not to interact with the effect size. If some of these characteristics did interact, then all studies and patients should share them (in equal measure); otherwise heterogeneous effects could arise. Figure 2.1 exhibits the effect size distribution of four hypothetical homogeneous studies that share a common underlying effect size θ. However, the meta-analyst only observes the reported effect sizes on the right-hand side (◊), which differ only through sampling error (i.e. within-study variation).
Figure 2.1 Illustration of a meta-analysis under the fixed-effect model assumptions
(with permission from Wolfgang Viechtbauer; presentation at Reading University 2008)
The parameterisation below of a FE MA model suitably accounts for continuous outcome data (Sutton et al 2000a):

y_i ~ N(θ, σ_i²)    [Equation 2.1]

where y_i is the effect size estimate in the i-th study, θ is the true common effect size, and σ_i² is the within-study variance of the i-th study. Hence, the pooled estimate of the effect θ is given by:
θ̂ = Σ w_i y_i / Σ w_i

The weights that minimise the variance of θ̂ are the inverses of the estimated study variances, w_i = 1/v_i. The variance of θ̂ is then estimated by the reciprocal of the sum of the weights:

var(θ̂) = 1 / Σ w_i
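As a concrete sketch, the inverse-variance weighted FE pooled estimate and its variance defined above can be computed as follows. The study effect sizes and variances used here are illustrative numbers only, not data from this thesis:

```python
import math

def fixed_effect_ma(effects, variances):
    """Inverse-variance weighted fixed-effect meta-analysis.

    effects   : study effect sizes y_i (e.g. log odds ratios)
    variances : within-study variances v_i
    Returns (pooled estimate, variance of the pooled estimate).
    """
    weights = [1.0 / v for v in variances]           # w_i = 1/v_i
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    pooled_var = 1.0 / sum(weights)                  # var = 1 / sum(w_i)
    return pooled, pooled_var

# Hypothetical example: three studies reporting log odds ratios
y = [-0.4, -0.2, -0.3]
v = [0.04, 0.09, 0.01]
theta, var_theta = fixed_effect_ma(y, v)
se = math.sqrt(var_theta)
ci = (theta - 1.96 * se, theta + 1.96 * se)          # 95% confidence interval
```

Note how the third (most precise) study dominates the pooled result, illustrating why the weighting scheme matters when small, imprecise studies are the biased ones.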
Although many RCTs collect binary data, often summarised and reported as odds ratios (OR), such data can be suitably combined with the above model. Note that the OR is defined as a measure of association between exposure to a risk factor or intervention and the clinical event. More specifically, the OR is the ratio of two odds, one for the active treatment group and the other for the placebo or no-treatment group. Although the OR is conceptually challenging to interpret, it can be interpreted as a risk ratio (a ratio of two probabilities) when the event probability p_i is small, which is easier to understand conceptually. However, this equivalence in interpretation is erroneous as soon as p_i is no longer small (Norton et al 2004).

The OR of the i-th study can be calculated by

OR_i = (a_i d_i) / (b_i c_i)

where, for each study i, a_i and b_i represent the observed numbers who experience the outcome of interest in the treated and control groups, respectively, and c_i and d_i are the corresponding numbers not developing the outcome in the treated and control groups. Thus, the sample size of the i-th study corresponds to the sum of a_i, b_i, c_i and d_i.
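The calculation of the log OR from a 2×2 table, together with its standard error, can be sketched as below. The standard error here uses the usual large-sample (Woolf) formula, sqrt(1/a + 1/b + 1/c + 1/d), which is standard practice but is not spelt out in the passage above; the cell counts are invented for illustration:

```python
import math

def log_odds_ratio(a, b, c, d):
    """Log odds ratio and its standard error from a 2x2 table.

    a, b : events in the treated and control groups
    c, d : non-events in the treated and control groups
    OR_i = (a*d) / (b*c); SE via Woolf's large-sample formula.
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se

# Hypothetical study: 10/100 events on treatment, 15/100 on control
lor, se = log_odds_ratio(10, 15, 90, 85)
```

Each study's (log OR, SE²) pair obtained this way can then be fed directly into the inverse-variance FE model, since the log OR is approximately normally distributed in large samples.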