
Research in Nursing & Health, 2003, 26, 322–333
Focus on Research Methods
Research Methods: Managing Primary Study Quality in Meta-Analyses

Vicki S. Conn,* Marilyn J. Rantz*

School of Nursing, University of Missouri–Columbia, Columbia, MO 65211 Received 3 March 2003; accepted 8 May 2003

Abstract: Meta-analyses synthesize multiple primary studies and identify patterns of relationships. Differences in primary study methodological quality must be addressed for meta-analysis to produce meaningful results. No single standard exists for addressing these quality variations. Quality measurement scales are fraught with development and application problems. Several strategies have been proposed to address quality. Researchers can set minimum levels for inclusion or require that certain quality attributes be present. An inclusive method is to weight effect sizes by quality scores. This allows the inclusion of diverse studies but relies on questionable quality measures. By considering quality an empirical question, meta-analysts can examine associations between quality and effect sizes and thus preserve the purpose of meta-analysis to systematically examine existing evidence. Researchers increasingly are combining strategies to overcome the limitations of using a single approach. Future work to develop valid measures of primary study quality dimensions will improve the ability of meta-analysis to inform research and nursing practice. © 2003 Wiley Periodicals, Inc. Res Nurs Health 26:322–333, 2003

Keywords: meta-analysis; research methods

An information explosion has occurred over the last 50 years as the volume of scientific literature has grown exponentially. Researchers strive to design new studies based on existing knowledge but face the daunting task of summarizing what is known from extant research. Problems with narrative reviews have stimulated interest in quantitative integration of existing research, both as a foundation for future research and as a basis for nursing practice. As the need to systematically synthesize research grows more critical, use of the powerful tool known as meta-analysis is becoming increasingly common (Cooper, 1998). Well-conducted meta-analyses can guide future research and inform practice (Conn & Armer, 1996). The results of meta-analyses are determined both by the studies included and their management in the meta-analysis process. The scientific rigor of potential primary studies varies dramatically, and several strategies have been proposed to address quality (Moher et al., 1999; Saunders, Soomro, Buckingham, Jamtvedt, & Raina, 2003). One solution is to exclude all but the most rigorous studies. Another approach is to include studies of varied quality and then address quality through weighting or moderator analysis procedures (Cooper, 1998). No single standard exists for managing this complex issue. In this article we examine strategies for managing the varied quality of primary studies in meta-analysis.

Contract grant sponsor: NIH NINR (to Vicki Conn, principal investigator); Contract grant number: RO1NR07870. Correspondence to Vicki S. Conn, S317 School of Nursing–MU, Columbia, MO 65211. *Professor. Published online in Wiley InterScience (www.interscience.wiley.com) DOI: 10.1002/nur.10092

BRIEF OVERVIEW OF META-ANALYSIS

In meta-analysis research, the pooled results of several primary studies are analyzed to provide a quantitative review of existing empirical evidence. Meta-analysis follows a systematic process: (a) formulate the research problem, (b) search for eligible studies, (c) evaluate available data, (d) pool results, (e) quantitatively analyze, and (f) interpret findings taking into account the strengths and limitations of the existing studies (Cooper, 1998). Meta-analysts calculate an overall estimate of the magnitude of association between the variables they study.

Although the overall effect estimate is very important, it is sometimes of equal interest to investigate differences in effect size associated with variations between studies by conducting a moderator analysis. Moderator analysis estimates effect sizes separately for different values of the moderator variable under study. Intervention attributes and characteristics of samples are typical variables for moderator analyses. For example, in a recent meta-analysis Conn, Valentine, and Cooper (2002) reported the overall effect size of interventions to increase physical activity among aging adults. However, the researchers calculated significantly larger effect sizes for particular intervention components (e.g., self-monitoring) and for studies with particular subject characteristics (e.g., patients with specific chronic illnesses). This moderator analysis is especially useful for nursing intervention research, in which a basic intervention often varies somewhat between studies.

The discipline of nursing will realize the immense potential benefits of quantitative synthesis only when meta-analytic methods are applied appropriately to primary studies. Meta-analysts must address the critically important issue of primary study quality during study selection or data management or both. Generally, some of the research reports that are retrieved and assessed for inclusion in meta-analyses will be strong, and others will possess weaknesses. The challenge is to generate the most useful information possible from the existing empirical evidence. In this article we discuss (a) ways that researchers conceive of quality and assess it, (b) associations between study quality and outcomes, and finally, (c) strategies to manage study quality in quantitative syntheses. Although meta-analysis is useful in both intervention and descriptive research, we address the intervention category because studies in this category are often used to direct nursing practice or to suggest further intervention research. Readers interested in further information about meta-analysis may refer to Cooper (1998) or to frequently updated Web sites (e.g., that of the University of Maryland, http://ericae.net/meta/).

METHODOLOGICAL QUALITY

Both consumers of research and the researchers themselves consistently express concerns about methodological quality. The emphasis on quality is consistent with the goals of science to produce valid knowledge (Petersen & White, 1989). This discussion focuses on internal validity aspects of quality because external validity cannot be present without internal validity and because external validity is not an inherent attribute of individual studies (Juni, Altman, & Egger, 2001).

Explicit definitions of quality generally focus on the extent to which studies generate reproducible information (Moher et al., 1995). "Quality gives us an estimate of the likelihood that the results are a valid estimate of the truth," according to Moher et al. (1995, p. 62). Quality is determined by the extent to which study design, conduct, and analysis systematically avoid or minimize potential sources of bias (Moher et al., 1995). Systematic bias can contribute to error, which could favor either the experimental or the control/comparison treatment. A loss of precision may contribute to Type II error, in which potentially efficacious treatments are abandoned as ineffective. Studies with methodological problems can contribute added variability that reduces precision and hampers scientific progress (Detsky, Naylor, O'Rourke, McGeer, & L'Abbe, 1992; Lohr & Carey, 1999; West et al., 2002).

Unfortunately, no gold standard exists for determining the "true" scientific rigor of primary studies (Detsky et al., 1992). Most quality dimensions have to do with preventing bias in selection, performance, detection, or attrition (Juni et al., 2001).
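The overall estimate and moderator comparison described in the overview can be sketched with a few lines of arithmetic. This is a minimal illustration, not the cited authors' code; the effect sizes, variances, and moderator coding below are invented and are not data from any study cited in this article.

```python
import math

# Hypothetical primary-study data for an intervention meta-analysis:
# a standardized mean difference (d), its variance, and one coded
# moderator (whether the intervention included self-monitoring).
# All numbers are invented for illustration.
studies = [
    {"d": 0.45, "var": 0.02, "self_monitoring": True},
    {"d": 0.60, "var": 0.05, "self_monitoring": True},
    {"d": 0.20, "var": 0.03, "self_monitoring": False},
    {"d": 0.10, "var": 0.04, "self_monitoring": False},
]

def pooled_effect(subset):
    """Fixed-effect pooled estimate: inverse-variance weighted mean."""
    weights = [1.0 / s["var"] for s in subset]
    d_bar = sum(w * s["d"] for w, s in zip(weights, subset)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return d_bar, se

overall, se = pooled_effect(studies)
with_sm, _ = pooled_effect([s for s in studies if s["self_monitoring"]])
without_sm, _ = pooled_effect([s for s in studies if not s["self_monitoring"]])
print(f"overall d = {overall:.2f} (SE = {se:.2f})")
print(f"moderator: with self-monitoring d = {with_sm:.2f}, without d = {without_sm:.2f}")
```

The fixed-effect logic weights each study by the inverse of its variance, so larger, more precise studies pull the pooled estimate harder; the same subgrouping generalizes to any coded study attribute used as a moderator.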

Table 1 summarizes commonly noted components of intervention research quality (Balk et al., 2002; Chalmers, Celano, Sacks, & Smith, 1983; Juni et al., 2001; Kunz & Oxman, 1998; Moher et al., 1998; Schulz, Chalmers, & Altman, 2002; Schulz, Chalmers, Hayes, & Altman, 1995; Sindhu, Carpenter, & Seers, 1997; West et al., 2002).

Table 1. Components of Primary Intervention Research Methodological Quality for Meta-Analysis

Sample selection: sample attributes appropriate for study purpose; intervention tested with important subgroups.
Recruitment: recruitment strategy prevents bias; description of potential subjects who declined participation.
Sample size adequacy: size adequate to provide a sufficiently precise estimate of effect size.
Random assignment: central system generates an unpredictable assignment sequence; allocation concealment/blinding.
Comparison group: nature of the comparison group appropriate for the area of science; management of preintervention differences between comparison groups.
Blinding/masking: participants; care providers; assessors measuring outcomes; data analysts.
Interventions: intervention reproducible by others; intervention consistent with theory; treatment integrity; prevention of treatment contamination.
Attrition management: attrition prevented and reported; intention-to-treat analysis.
Outcome measures: objective measures when possible; construct validity of instruments ascertainable; adequate reliability to provide sufficiently precise estimate of effect size; appropriate follow-up period to measure outcomes; avoid mono-operation bias, if appropriate.
Statistical analysis: assumptions of analysis consistent with data; significance level appropriate given number of tests conducted on data; potential confounders not controlled in design addressed in analysis; effect-size values and p levels presented.
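Components such as these are usually captured on a structured coding sheet when studies are assessed for a meta-analysis. A minimal sketch follows, assuming hypothetical domain names and a 0–2 scoring rule (no published instrument is implied); it keeps a per-domain profile rather than collapsing quality into a single total.

```python
from dataclasses import dataclass, field

# Illustrative coding domains loosely mirroring Table 1. The names and
# the 0-2 scoring convention are assumptions for this sketch only.
DOMAINS = [
    "sample_selection", "recruitment", "sample_size", "random_assignment",
    "comparison_group", "blinding", "intervention", "attrition",
    "outcome_measures", "statistical_analysis",
]

@dataclass
class QualityProfile:
    study_id: str
    scores: dict = field(default_factory=dict)  # domain -> 0, 1, or 2

    def profile(self):
        # Full per-domain profile; unscored domains default to 0.
        return {d: self.scores.get(d, 0) for d in DOMAINS}

    def total(self):
        # A single summed score is possible, but it hides which
        # specific domains are weak.
        return sum(self.profile().values())

coding = QualityProfile("Study A", {"blinding": 2, "attrition": 0, "sample_size": 1})
print(coding.profile())
print("total:", coding.total())
```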

The notion of research quality is complex. Although research methods experts list many similar components of quality, their lists are rarely identical, and definitions vary substantially. For example, "single blinding" may refer either to participants being unaware of their assignment or to the masking of the people conducting the outcome assessments. Studies may report single, double, triple, or quadruple blinding (Schulz et al., 2002). Some experts have suggested assessing all components of quality (Table 1), but others have argued that only selected aspects of quality are critical (Juni, Witschi, Bloch, & Egger, 1999). Most investigators agree that they must assess both the design and execution of the study for quality. Concerns about quality have contributed to interest in strategies to measure methodological rigor.

Instruments to Measure Quality

Because meta-analysts have sometimes been unsure of how or whether to include primary studies of varying quality in their work, they have created a number of scales for assessing study quality (Balk et al., 2002). Meta-analysts began to develop primary study quality scales in the 1980s (Chalmers et al., 1981). More than 100 of these scales exist for measuring the quality of primary studies (West et al., 2002), and they vary dramatically in size, composition, complexity, and extent of development (Moher, Jadad, & Tugwell, 1996). For instance, the relative weight between categories differs greatly among scales. Juni et al. (1999) found that the relative weight assigned to three common measures (blinding, randomization, and management of dropouts) varied from 0% to 100%. Scoring of items also varied, with some scales using gradation of scores within each item and others scoring only presence or absence.

Few quality measures have been developed using established scale development techniques (Jadad et al., 1996; Moher et al., 1996). The West et al. (2002) review of more than 100 scales found only two instruments developed using standard procedures (Downs & Black, 1998; Sindhu et al., 1997). When developing scales, researchers typically include estimations of reliability and assessments of content and criterion validity.

Establishing the validity of these instruments is challenging work. Unfortunately, no gold standard exists, so criterion validity cannot be established. Psychometric properties of instruments, including interrater reliability, rarely have been documented (Moher et al., 1996).

Meta-analysts should base their selection of a quality scale on instrument attributes and complexity, consistency between items, and an understanding of which characteristics of primary studies are key for the current projects in that area of science. For example, breast cancer treatment researchers often encounter poor compliance and high dropout because of both the acute and the chronic effects of treatment (Liberati, Himel, & Chalmers, 1986). In contrast, typical limitations of pain research include statistically underpowered trials and insufficient follow-up given the common use of wait-list control groups (Morley, Eccleston, & Williams, 1999). In other areas of science, the use of outcome measures with questionable psychometric properties is common (Conn et al., 2002). Meta-analysts working in different areas of science may choose different measures of quality based on their particular concerns. Readers interested in more details on quality measures can review published copies of quality scales (de Vet et al., 1997; Downs & Black, 1998). West et al. (2002) provide an excellent review of existing scales. Saunders et al. (2003) provide an overview of instruments applicable to nonrandomized intervention studies. The proliferation of scales has been accompanied by considerable debate over their usefulness.

Problems with Scale Measures of Overall Quality

As scale developers have failed to use appropriate processes, problems have developed. Some variations among instruments reflect differing conceptions of quality that are seldom explicitly defined (Juni et al., 1999). Only moderate agreement exists about which domains of quality should be included in scales. Most scale items are based on expert opinion because there has been only scant empirical work documenting the link between quality dimensions and study outcomes (Lohr & Carey, 1999). Even within fairly narrow areas of research, investigators may agree only partially on which attributes are the most important. For example, Cooper (1986) asked experts in one substantive field to evaluate the importance of design attributes. An overall average correlation of r = .47 was found, reflecting the difficulty in reaching a consensus on how to generate valid knowledge, even within a well-defined area of science.

Because varied ideas of quality give rise to varied scales, it is not surprising that these scales generate discrepant results when applied to studies (Juni et al., 1999). Moher et al. (1996) compared six primary-study quality scales when they were applied to 12 individual studies. Scoring was completed by several raters, who then resolved score differences through consensus and arbitration. They found that scores differed dramatically across scales, with scores ranging from 23% to 74% of the maximum possible score for individual studies. It remains unclear how these very different quality indices should be interpreted.

Some of the scales' scoring procedures are problematic. Assessments of report quality have often been confounded with design quality (Detsky et al., 1992; West et al., 2002). For instance, some scales require that a research report explicitly address a particular quality dimension before allocating points, but other scales automatically provide points unless a report explicitly states that the quality feature was absent from the study (Sindhu et al., 1997). Missing information is common, as research reports often include insufficient design details. Problems with missing data may be managed by using mean rather than summed scores. Sometimes, instruments have been constructed such that weaker research designs that are fully revealed and acknowledged as a limitation may score better in that category than a study with an apparently better design but one about which little detail is offered and whose limitations might be unknown (Sindhu et al., 1997).

Expert opinion determines the weighting of individual items on scales because empirical evidence is lacking (de Vet et al., 1997). It remains unclear how to interpret the highest quality-scale score for a given set of studies. If the highest score is 60 points out of a possible 100, should 60 be considered the highest score for this area of science? The scoring systems of some scales adjust for criteria that are not applicable to a given study, resulting in relative rather than absolute scores (Chalmers et al., 1981). It is unclear whether study quality is a continuous characteristic or whether a threshold effect exists for quality (Detsky et al., 1992). The interpretation of scores is ambiguous.
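Score comparisons like the 23%–74% finding above rest on expressing each study's score as a percent of the scale maximum. A small sketch with invented scores shows how two hypothetical scales, normalized this way, can disagree even about the rank order of the same studies.

```python
# Hypothetical raw totals for the same three studies on two quality
# scales with different maxima (invented numbers, not Moher et al.'s
# data). Expressing each as a percent of the scale maximum shows how
# little such scales may agree: they need not even rank studies alike.
scale_a_raw = {"study1": 24, "study2": 12, "study3": 18}   # scale maximum 30
scale_b_raw = {"study1": 40, "study2": 70, "study3": 55}   # scale maximum 100

def percent_of_max(raw, maximum):
    """Normalize raw scale scores to percent of the maximum possible."""
    return {study: 100 * score / maximum for study, score in raw.items()}

pct_a = percent_of_max(scale_a_raw, 30)
pct_b = percent_of_max(scale_b_raw, 100)
rank_a = sorted(pct_a, key=pct_a.get, reverse=True)
rank_b = sorted(pct_b, key=pct_b.get, reverse=True)
print("scale A:", pct_a, "ranking:", rank_a)
print("scale B:", pct_b, "ranking:", rank_b)
```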

Most scales result in a single, overall score for quality. Only a few scales contain subscales that profile strengths and weaknesses (Downs & Black, 1998). This is a major limitation. For example, a small study with inadequate power might be an important source of information even though findings lacked statistical significance, whereas a large study with selection bias might be a less valid source. In such cases, relying on a single global quality score may obscure the underlying structure of the multidimensional concept of methodological quality (Lohr & Carey, 1999).

Beyond problems with the scales themselves, the instruments have proven difficult to apply consistently, even in randomized controlled trials. For example, Balk et al. (2002) reported low interrater reliability for the presence of intention-to-treat analysis, randomization location, outcome assessor blinding, inclusion of a power analysis, and accounting for confounding variables. Also, some scales are reliable and accurate in some areas of science but not in others (Moher & Olkin, 1995). The potential for bias in the rating process is another concern, prompting some to suggest masking research reports (concealing authorship and institutional affiliation and reviewing methods or results sequentially). Little empirical evidence has accumulated to support these procedures (Jadad et al., 1996; McNutt, Evans, Fletcher, & Fletcher, 1990; Moher et al., 1998).

RELATIONSHIPS BETWEEN QUALITY AND STUDY OUTCOMES

Associations between quality dimensions and effects are observational findings. Confounding between quality dimensions and other important aspects of the studies (such as treatments tested) could exist (Juni et al., 1999). Given those limitations, several researchers have examined the link between study quality and outcomes. Findings often have been contradictory.

Scale-Measured Quality and Study Outcomes

When comparing low-quality studies with high-quality studies, some researchers found that low-quality studies underestimated effect sizes compared to high-quality studies (Ortiz et al., 1998). In contrast, other researchers have documented effect sizes 30–50% larger among the low-quality studies (Juni et al., 1999; Kjaergard, Villumsen, & Gluud, 2001; Moher et al., 1995, 1998, 1999). For example, Moher et al. (1998) reported an overall treatment benefit of 39% for all trials, 52% for low-quality trials, and 29% for high-quality trials. The overall effects were reduced to 35% when quality scores weighted estimates. Other researchers found no or limited association between overall quality scores and effect sizes (Balk et al., 2002; Emerson, Burdick, Hoaglin, Mosteller, & Chalmers, 1990; Fergusson, Laupacis, Salmi, McAlister, & Huet, 2000; Sharpe, 1997; Sterne et al., 2002).

The quality of primary studies included in meta-analyses can influence results in unpredictable directions, including masking or even reversing the effect direction (Sterne et al., 2002). For example, Khan, Daya, and Jadad (1996) found that a fertility effect disappeared when they excluded studies of low quality. In other cases, treatment effects were only manifested in high-quality studies. For example, Brown (1992) found a positive effect for educating people with diabetes, but only when low-quality studies were excluded.

Primary study quality may be confounded with other attributes of studies. For instance, Sterne et al. (2002) found that differences in effect-size estimates between unpublished and published trials increased after controlling for trial quality. In contrast, they found a decrease in effect-size-estimate differences between trials in English versus other languages after study quality was controlled for (Sterne et al., 2002). Bias related to study quality is a bigger problem when overall effect sizes are small in randomized controlled trials (Kunz, Neumayer, & Khan, 2002).

Different scales generate diverse assessments of study quality, which may cause inconsistency in efforts to relate study quality to outcome. Juni et al. (1999) compared the results from 25 quality scales that were applied to studies comparing low-molecular-weight heparin with standard heparin. For six quality scales the relative risks were nearly identical for both treatments in high-quality trials, whereas better effects for low-molecular-weight heparin were documented in low-quality trials. Seven scales documented an opposite trend: No intervention differences for low-molecular-weight heparin were found in low-quality trials, but high-quality trials showed evidence of improved outcomes. For the remaining 12 scales, no differences by study quality were documented. The authors noted that these discrepant results were not surprising given the heterogeneous nature of the quality scales.

Inconsistencies may also be related to differences between areas of science. Balk et al. (2002) found that overall measures of study quality were not associated with effect-size differences across four medical areas. When meta-analyses within each area were considered separately, quality components were related to effect estimates, but the direction of the association was not consistent across the four specialties. These findings suggest associations between quality and effect-size estimates may vary by area of science.

These inconsistent findings regarding the importance of study quality may point to the questionable appropriateness of using overall measures or to problems of grouping studies across variables. Overall quality scores may obscure associations between quality and outcomes. Existing associations between selected dimensions of quality and outcomes may not be apparent because the most important items may have little overall effect on the total quality scores (Juni et al., 2001). The lack of consistent association between overall quality scores and effect-size values may reflect the possibility that different aspects of lower quality may operate to either increase or decrease effect-size estimates. It is possible that associations between individual items cancel each other out when overall scores are used. These concerns have contributed to attempts to link individual quality dimensions with outcomes.

Randomization and Effect Sizes

Most researchers have examined differences among randomized controlled trials, with only a few studies examining the impact of randomization itself. In a meta-analysis of single-intervention trials including both randomized and nonrandomized primary studies, Kunz and Oxman (1998) reported both under- and overestimation of effect sizes in nonrandomized trials. The magnitude of the differences was sizable, ranging from 76% underestimation of effects to 160% overestimation.

Blinding/Masking/Concealment and Study Outcomes

Adequate randomization requires both adequate generation of the allocation sequence and allocation concealment. Failure to mask randomization has been associated with notably larger (e.g., 41% greater) effect sizes (Chalmers et al., 1983; Colditz, Miller, & Mosteller, 1989; Kunz & Oxman, 1998; Schulz et al., 1995; West et al., 2002). In contrast, others have found no association between allocation concealment and effect sizes (Balk et al., 2002; Juni et al., 1999; Linde et al., 1999). Allocation concealment may be most important when investigators possess strong beliefs about the superiority of one treatment or when treatments are compared to control conditions instead of alternative treatments. It is also possible that inadequate concealment is a surrogate measure for other quality aspects of the study (Schulz et al., 1995).

Although experts often suggest masking group assignments from health care providers, few researchers have examined this issue. Van der Heijden, van der Windt, Kleijnen, Koes, and Bouter (1996) found that poor blinding of providers was the most prevalent weakness in studies of shoulder steroid injections. This type of masking may be most important in studies where providers offered compensatory treatments or otherwise confounded the assignment.

A common strategy is to mask those conducting outcome assessments. Researchers have documented increased effect sizes (up to 35%) among studies without masked data collectors (Chalmers et al., 1983; Juni et al., 1999; Kunz & Oxman, 1998; Schulz et al., 1995; West et al., 2002). Blinding may be especially important when the outcome assessment requires at least some subjective judgment.

Management of Dropouts/Withdrawals and Effect Sizes

Although intention-to-treat analysis prevents selective attrition from biasing results, it is particularly difficult to assess (Balk et al., 2002). Neither Schulz et al. (1995) nor Kjaergard et al. (2001) detected a consistent pattern in effect sizes related to exclusions after randomization. The authors noted that this issue is especially poorly reported in primary studies, thus rendering their results unclear.

Conclusions About the Relationship Between Quality and Outcomes

Findings linking quality with outcomes are inconclusive. Problems with scales may contribute to the inconsistency. Further, differences by area of science make generalizations risky. These findings do not suggest that quality is unimportant, but rather that different aspects of quality may be important in different areas of science and that further development of valid scales is essential.
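Associations like those reviewed in this section are often probed by pooling low- and high-quality studies separately and then testing the homogeneity of the two subgroup estimates. A sketch with invented effect sizes and variances follows; the Q-between statistic is compared to a chi-square distribution with 1 degree of freedom (the .05 critical value is about 3.84).

```python
# Illustrative effect sizes (d) and variances with a dichotomous
# quality code; every number here is invented for the sketch.
low_q  = [(0.55, 0.04), (0.70, 0.06), (0.48, 0.05)]
high_q = [(0.30, 0.03), (0.25, 0.04), (0.35, 0.05)]

def pool(studies):
    """Inverse-variance pooled estimate and total weight for a subgroup."""
    w = [1 / v for _, v in studies]
    d = sum(wi * di for wi, (di, _) in zip(w, studies)) / sum(w)
    return d, sum(w)

d_low, w_low = pool(low_q)
d_high, w_high = pool(high_q)

# Q-between: weighted squared deviations of the subgroup estimates
# around the grand weighted mean (chi-square, 1 df for two subgroups).
d_all = (w_low * d_low + w_high * d_high) / (w_low + w_high)
q_between = w_low * (d_low - d_all) ** 2 + w_high * (d_high - d_all) ** 2
print(f"low-quality d = {d_low:.2f}, high-quality d = {d_high:.2f}")
print(f"Q-between = {q_between:.2f}")
```

A large Q-between would suggest quality is moderating the effect, arguing against simply pooling across quality levels; a small one supports retaining the diverse studies in a single synthesis.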

STRATEGIES TO MANAGE QUALITY

To ensure the quality of meta-analysis results, it is important that explicit systems are in place to handle the variable quality of primary studies (Assendelft, Koes, Knipschild, & Bouter, 1995; Moher et al., 1999). Three basic strategies address quality in primary studies. Setting quality thresholds for inclusion in meta-analysis is based on the idea that only studies with certain quality features can contribute valid answers to the research question. Weighting estimates by quality scores allows studies with stronger research methods to make larger contributions to effect-size estimates. Considering quality an empirical question allows examination of differences in effect sizes in relationship to either overall quality or specific quality attributes (e.g., intention-to-treat analysis).

Setting a Quality Threshold

A priori decisions to use explicit selection criteria for whether to include or exclude a study based on its methods are common in meta-analyses. Decisions to include or exclude studies are critical determinants of the validity and generalizability of findings. When quality thresholds are not explicitly stated, they are often implied in exclusion criteria. For example, in Sharpe's (1997) review of 32 published clinical meta-analyses, quality exclusions most commonly occurred because of the absence of control groups, confounding of treatment conditions, a lack of random assignment, and invalid measures. Determining which studies to exclude is a challenge. Decisions about exclusion on the basis of quality are often at least somewhat arbitrary. This allows for a potential inclusion bias because study procedures are complex, and research reports frequently lack pertinent details (Fergusson et al., 2000; Sharpe, 1997).

The threshold approach can involve inclusion criteria that are particularly important for an area of science. Another approach is to select a priori particular quality-scale cutoff scores. These criterion-referenced approaches are common. Alternatively, a norm-referenced strategy could be used, in which the quality threshold score is derived from the particular set of studies being considered, such as the median score. This strategy would yield the highest-quality studies from a set of primary reports.

It is inadequate to include only published primary studies in meta-analyses. That an article is published in a peer-reviewed journal is an unsatisfactory proxy measure of its quality because the most consistent difference between published and unpublished reports is the statistical significance of the findings, not the methodological quality of the study (Conn, Valentine, Cooper, & Rantz, in press). In some areas of science, dissertation research is of poor quality (Vickers & Smith, 2000), whereas in other areas dissertation studies are similar in quality to published research (Conn et al., 2002). Setting explicit criteria will ensure that studies are fairly evaluated for inclusion.

The threshold approach sometimes, to use a colloquial term, "throws the baby out with the bathwater." That is, among the poorer-quality studies excluded are likely to be some studies that vary in ways beyond quality (Moher et al., 1996). For example, excluding studies with very small samples may omit projects with highly innovative interventions or studies with difficult-to-recruit subjects. This exclusion could limit the usefulness of moderator analysis to determine whether variations in interventions or samples are associated with effect-size differences.

The threshold approach has a major limitation: Excluding research that may be of lower rigor goes against the scientific habit of examining data, of letting the data speak. A strength of meta-analysis is its ability to examine the association between design attributes and effect-size estimates as an empirical question. If strong studies produce systematically different results from weak studies, the results of the strong ones can be believed (Cooper, 1998). But if no differences are detected, studies of varied strength can be included in the analysis because they likely provide other variations that may contribute to the value of the findings; these could include different approaches to measuring ambiguous outcomes or varied samples (Cooper, 1998). The practices of including all studies and empirically examining methodology-related differences are consistent with the scientific discovery process (Cooper, 1998).

When arguments are made to exclude studies of inferior quality, it begs the question of what constitutes research of high quality and how to validly assess quality (Sharpe, 1997). Decisions to exclude studies may be too subjective to be reliable. Studies chosen with the threshold approach often contain quality variations not addressed in the meta-analysis (Tritchler, 1999). In practice, meta-analysts often combine the threshold approach with other strategies, which are discussed next.
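The criterion-referenced and norm-referenced cutoffs described above can be sketched as simple filters. The scores below are invented, and the median rule is only one possible norm-referenced choice.

```python
from statistics import median

# Illustrative quality scores (0-100) for candidate studies; both
# cutoff rules below are sketches of the approaches described above.
scores = {"s1": 72, "s2": 41, "s3": 65, "s4": 58, "s5": 30}

# Criterion-referenced: an a priori cutoff chosen before scoring.
CUTOFF = 60
criterion_included = {s for s, q in scores.items() if q >= CUTOFF}

# Norm-referenced: the cutoff is derived from the set of studies under
# consideration, here by retaining studies at or above the median.
m = median(scores.values())
norm_included = {s for s, q in scores.items() if q >= m}

print("criterion-referenced keeps:", sorted(criterion_included))
print(f"norm-referenced (median = {m}) keeps:", sorted(norm_included))
```

Note how the two rules can retain different study sets from identical scores, which is one reason threshold decisions should be reported explicitly.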

Weighting by Study Quality Scores

The second major strategy for managing the quality of primary studies is the common strategy of weighting effect-size estimates by study attributes. Most syntheses weight effect sizes by some indicator of sample size, such as the inverse of the variance, which gives larger studies more weight in effect-size estimates. Similarly, individual effect estimates can be weighted by quality scores. This would yield a larger impact from higher-quality studies on overall pooled results (Detsky et al., 1992).

Weighting by quality-scale scores is the most common approach in medical meta-analyses (Moher et al., 1999). This practice follows the assumption that studies with deficiencies are less informative and should have less influence on overall outcomes (Tritchler, 1999), and it offers several potential advantages. Weighting by quality scores allows all studies to be included in the synthesis, even very diverse studies. This prevents potential bias in the selection of primary studies. Weighting places more emphasis on studies of greater rigor so that these studies affect the findings more heavily. Use of quality-scale scores as weights may also produce less statistical heterogeneity, because better-quality trials likely result in a higher signal-to-noise ratio as random variation decreases (Moher et al., 1998).

Decisions about weighting may be difficult. Statistical and empirical justification is lacking when it comes to incorporating quality scores as weights (Detsky et al., 1992; Juni et al., 2001). Using quality scores as weights assumes there is a linear relation between estimates of quality and the weights assigned to response options on the scale (Moher et al., 1998). The scaling relation may not be linear but rather may require more complex scoring and weighting systems (Moher et al., 1998). Further, weighting strategies appropriate in one area of science may not generalize to other fields (Balk et al., 2002).

Weighting by quality scores is also impeded by the problems of the scales themselves, as previously described. The lack of interrater reliability among quality scales suggests that considerable subjective assessment is required to complete the scales. Using quality scores that are produced with considerable observer inference muddles objective measures with arbitrary judgments (Greenland, 1994). Another problem with the weighting strategy is that an overall score for quality may mask the effect of potentially important individual components of quality (Fergusson et al., 2000). For example, Fergusson found that overall study quality was not related to effect size, but the exclusion of studies based on control group management was associated with differences in effect-size estimates. Weighting by overall quality may obscure important sources of heterogeneity among study results (Greenland, 1994).

Considering Quality an Empirical Question

The third major approach to addressing the quality of primary studies is to examine the association between quality and effect sizes. In this approach the relationship between outcomes and overall quality scores, and/or between effect-size outcomes and specific components of quality, can be examined. Considering study quality as a potential moderator of sources of heterogeneity is consistent with meta-analysis as a study of studies, rather than as just a statistical system for combining primary study outcomes into a single effect-size estimate (Greenland, 1994).

Examining the impact of research methods on effect sizes makes quality an empirical question. This is consistent with the rigorous approach to research synthesis that moves beyond narrative reviews to examine effect-size moderators quantitatively (Tritchler, 1999). If the results of methodologically sound studies are different from studies with flaws, the results of the high-quality studies can be believed (Cooper, 1998). If methodological differences are not associated with effect-size variations, then the studies of lower quality may be included, because they will likely vary from the higher-quality studies in other ways that may be important for the meta-analysis. For example, studies with small samples could contain pilot tests of novel interventions or might include hard-to-recruit participants. These studies would represent a valuable source of information for the overall synthesis.

Retaining all studies that meet the inclusion criteria allows readers to understand the full range of evidence in the area of science and to decide how much importance to give the evidence. Analyzing the effects of methodological components increases confidence in findings when studies of diverse designs are included. Excluding studies other than randomized controlled trials can be problematic in some areas of science, especially in behavioral research that includes patient populations, because withholding treatment may pose ethical problems. Meta-analysts have more confidence in robust findings across diverse designs that are subject to varied threats to validity. Including all possible studies that address the research question allows maximal use of existing data.

One approach to looking at quality as an empirical question is to focus on the quality scores that scales generate. Researchers can graphically plot the association between scores and effect-size estimates. An alternative is to initiate the meta-analysis with high-quality studies and sequentially add studies of progressively lower quality to graphically portray the relationship between quality and effect-size estimates (Moher et al., 1996). As a form of sensitivity analysis, this strategy tests how robust the results are relative to quality attributes (Oxman, 1994). This analysis is often conducted with overall quality scores but could be completed with quality components.

The component approach identifies specific research methodology dimensions that may be coded reliably from research reports and then subjected to moderator analysis (Juni et al., 2001; Moher et al., 1996). Generally, this approach requires less inferential judgment than formulating overall study quality ratings, and so greater reliability is possible (Cooper, 1998). The component approach allows the synthesist to address topics of relevance to that area of science without having to cope with missing data on scales designed for other areas of science. The component approach may also avoid a problem associated with overall quality scores: different weaknesses can affect results in diverse directions. It is very important to select components relevant to the area of science for coding and analysis. Wortman (1994) provides an excellent overview of the threats-to-validity approach to component moderator analysis, based on the classic Cook and Campbell (1979) text.

Generally, the meta-analyst begins moderator analysis by examining associations between quality measures and outcomes, because methodological differences may be correlated with substantive differences. Researchers must interpret these findings with care so that they do not attribute a confounded situation to the substantive difference. When methodological attributes are found to be important, they can be controlled in the subsequent moderator analysis. For example, Conn et al. (2002) controlled for the interval between intervention completion and outcome measurement in a synthesis of exercise behavior change interventions. Subsequent moderator analyses may address substantive differences in interventions or differences related to the samples studied.

Increasingly, meta-analysts are using random-effects models for analysis. Researchers using the random-effects model assume that a study-level variance component is present as an additional source of random influence. The random-effects model allows broader generalization of findings than does the fixed-effects model (Cooper, 1998), which might make the former appropriate when studies of diverse quality are included.

A major strength of this approach is its potential to allow researchers to examine the association between outcomes and specific quality components. As researchers learn more about which components of quality are important, this knowledge will enhance the interpretation of existing research and inform future research methods. The decision to examine the components of quality acknowledges that concerns may vary across areas of science. This strategy hews to the goals of science in that it allows the scientific community to evaluate findings within the context of maximum information about the association between methodological decisions and findings. Cochrane Collaboration reviews predominantly use the component approach to examine quality and effect sizes (Moher et al., 1999).

Limitations exist for this strategy. Problems with scale measures of overall quality (lack of interrater reliability, validity questions) may limit confidence in findings related to overall quality measures. Overall measures may mask interesting effects of individual components of quality on effect-size estimates. The component approach does not suffer from these limitations.

Combination Strategies

Sometimes meta-analysts use a combination of strategies, or mixed approach. For example, researchers may judge some studies as totally inadequate for one area of science and exclude them from the analysis while including studies with varied strengths and weaknesses. This allows investigators to exclude studies so poorly designed or executed that it is difficult to come to any conclusions based on their findings. Studies that are retained may be weighted by quality scores. Alternatively, researchers may analyze quality components for their association with effect size (Petersen & White, 1989). For example, in a recent meta-analysis of interventions to increase physical activity among aging adults, studies in which the measure of the dependent variable was confounded with the independent variable were excluded (Conn et al., 2002). This occurred when researchers provided frequent center-based supervised exercise sessions as an intervention to increase activity behavior and then measured participation in those sessions as the indicator of physical activity. However, the meta-analysts did include studies with single-group pre–post designs, because several small pilot studies or demonstration projects added interesting intervention variations to the synthesis. Conn et al. (2002) controlled for this design feature in the analysis. Moher et al. (1998) provided an example of combining several strategies to deal with study quality. These combination approaches allow meta-analysts to retain studies that have potential to address the research question while examining consistencies and accounting for the variability related to methodological features in a collection of primary studies.

General Considerations in Assessing Methodological Quality

Regardless of which system is used to assess the quality of primary studies, extensive training and pretesting are essential. Reviewers must put their predispositions aside and evaluate primary studies objectively (Cooper, 1998). Multiple raters should assess quality, and researchers should decide a priori how they will adjudicate differences in reviewers' opinions. As the complexity of rating systems increases, so do discrepancies (Lohr & Carey, 1999). Researchers should manage rating scales as they do other research instruments: select an instrument with sound psychometric properties and evidence of applicability to the topic, extensively train for and supervise instrument application, and fully disclose instrument-related problems in research reports.

The inadequacy of research reporting is an ongoing challenge (Beck, 1999; Moher, Schulz, & Altman, 2001). The quality of research reports invariably affects assessments of methodological features. Recent efforts to develop standards for research reporting in scientific journals may improve this situation (Beck; Moher et al.). In this article we have discussed strategies that researchers can use to address the quality of primary studies considered for inclusion in meta-analyses. Strategies that assess quality allow conclusions about the cumulative strength of information in a particular area of science. This information can be useful for planning future research, as well as for practice and policy implications.

Future Methodological Research

Findings linking quality attributes to effect-size estimates are contradictory and inconclusive. Future research within areas of science may provide more reliable evidence regarding these associations. The previously described limitations of existing instruments to measure overall quality provide an excellent background for future work to develop better measures of quality. Measures with subscales to address distinct dimensions of study quality are urgently needed. For example, the What Works Clearinghouse, funded by the U.S. Department of Education, is currently developing a set of standards for evaluating the validity of causal claims. Their instrument to assess quality will possess subscales such as intervention construct validity, comparability of treatment groups, contamination, outcome measure construct validity, and statistical validity (Valentine & Cooper, 2003). Their instrument will yield information consistent with Cook and Campbell's (1979) well-known sets of threats to validity: construct validity, internal validity, external validity, and statistical validity. The presence of explicit definitions of terms is another strength of this approach. This approach will allow meta-analysts to consider design quality components individually as an empirical question while providing tested coding items for assessing quality. Ideal instruments to assess quality will require low-inference judgments by focusing on concrete assessments rather than abstract judgments. Psychometric evaluation of such instruments is essential. It is crucial that future work produce valid measures of primary-study quality dimensions if meta-analysis is to better inform research and nursing practice.

CONCLUSIONS

Questions about the quality of studies included in meta-analyses have existed since Glass coined the term meta-analysis in 1976. Emphasis on primary study quality is vital to ensure that future research and nursing practice are based on valid syntheses of existing studies. It is essential that researchers lay out explicit processes for addressing the quality of primary studies so that others may assess how well the process protected against bias or errors (Oxman, 1994). Using the watchwords caution and inclusion as a rule of thumb, meta-analysts should have a goal of including as much data as possible in their work. At times this may mean including studies that have methodological weaknesses but that nonetheless may offer valuable information. By including studies of varied quality, meta-analysts can delineate the impact of these variations on outcomes (Detsky et al., 1992). Assessing the quality of primary studies adds a crucial layer of complexity to the process of conducting meta-analysis (Moher et al., 1998). As researchers carefully attend to issues of quality, their work furthers not only substantive science but also our understanding of how methodological decisions influence outcomes. Thus, the science goals of examining all available evidence and building cumulative knowledge may be realized.

REFERENCES

Assendelft, W., Koes, B., Knipschild, P., & Bouter, L. (1995). The relationship between methodological quality and conclusions in reviews of spinal manipulation. JAMA, 274, 1942–1948.
Balk, E., Bonis, P., Moskowitz, H., Schmid, C., Ioannidis, J., Wang, C., et al. (2002). Correlation of quality measures with estimates of treatment effect in meta-analyses of randomized controlled trials. JAMA, 287, 2973–2982.
Beck, C. (1999). Facilitating the work of a meta-analyst. Research in Nursing & Health, 22, 523–530.
Brown, S. (1992). Meta-analysis of diabetes patient education research: Variations in intervention effects across studies. Research in Nursing & Health, 15, 409–419.
Chalmers, T., Celano, P., Sacks, H., & Smith, H. (1983). Bias in treatment assignment in controlled clinical trials. New England Journal of Medicine, 309, 1358–1361.
Chalmers, T., Smith, H., Blackburn, B., Silverman, B., Schroeder, B., Reitman, D., et al. (1981). A method for assessing the quality of a randomized control trial. Controlled Clinical Trials, 2, 31–49.
Colditz, G., Miller, J., & Mosteller, F. (1989). How study design affects outcomes in comparison of therapy. Statistics in Medicine, 8, 441–454.
Conn, V., & Armer, J. (1996). Meta-analysis and public policy: Opportunity for nursing impact. Nursing Outlook, 44, 267–271.
Conn, V., Valentine, J., & Cooper, H. (2002). Interventions to increase physical activity among aging adults: A meta-analysis. Annals of Behavioral Medicine, 24, 190–200.
Conn, V., Valentine, J., Cooper, H., & Rantz, M. (in press). Should grey literature be included in meta-analysis? Nursing Research.
Cook, T., & Campbell, D. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston: Houghton Mifflin.
Cooper, H. (1986). On the social psychology of using research reviews: The case of desegregation and black achievement. In R. Feldman (Ed.), The social psychology of education (pp. 341–363). Cambridge, UK: Cambridge University Press.
Cooper, H. (1998). Synthesizing research (3rd ed.). Thousand Oaks, CA: Sage.
Detsky, A., Naylor, C., O'Rourke, K., McGeer, A., & L'Abbe, K. (1992). Incorporating variations in the quality of individual randomized trials into meta-analysis. Journal of Clinical Epidemiology, 45, 255–265.
de Vet, H., de Bie, R., van der Heijden, G., Verhagen, A., Sijpkes, P., & Knipschild, P. (1997). Systematic reviews on the basis of methodological criteria. Physiotherapy, 83, 284–289.
Downs, S., & Black, N. (1998). The feasibility of creating a checklist for the assessment of the methodological quality both of randomised and non-randomised studies of health care interventions. Journal of Epidemiology & Community Health, 52, 377–384.
Emerson, J., Burdick, E., Hoaglin, D., Mosteller, F., & Chalmers, T. (1990). An empirical study of the possible relation of treatment differences to quality scores in controlled randomized clinical trials. Controlled Clinical Trials, 11, 339–352.
Fergusson, D., Laupacis, A., Salmi, L., McAlister, F., & Huet, C. (2000). What should be included in meta-analyses? An exploration of methodological issues using the ISPOT meta-analyses. International Journal of Technology Assessment in Health Care, 16, 1109–1119.
Greenland, S. (1994). Invited commentary: A critical look at some popular meta-analytic methods. American Journal of Epidemiology, 140, 290–296.
Jadad, A., Moore, R., Carroll, D., Jenkinson, C., Reynolds, D., Gavaghan, D., et al. (1996). Assessing the quality of reports of randomized clinical trials: Is blinding necessary? Controlled Clinical Trials, 17, 1–12.
Juni, P., Altman, D., & Egger, M. (2001). Systematic reviews in health care: Assessing the quality of controlled clinical trials. British Medical Journal, 323, 42–46.
Juni, P., Witschi, A., Bloch, R., & Egger, M. (1999). The hazards of scoring the quality of clinical trials for meta-analysis. JAMA, 282, 1054–1060.
Khan, K., Daya, S., & Jadad, A. (1996). The importance of quality of primary studies in producing unbiased systematic reviews. Archives of Internal Medicine, 156, 661–666.
Kjaergard, L., Villumsen, J., & Gluud, C. (2001). Reported methodologic quality and discrepancies between large and small randomized trials in meta-analyses. Annals of Internal Medicine, 135, 982–989.
Kunz, R., Neumayer, H., & Khan, K. (2002). When small degrees of bias in randomized trials can mislead clinical decisions: An example of individualizing preventive treatment of upper gastrointestinal bleeding. Critical Care Medicine, 30, 1503–1507.
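As an illustrative sketch (not the procedure of any cited synthesis), the two weighting schemes described above can be written out as follows. The effect sizes, sampling variances, and 0–1 quality scores are invented, and the multiplicative rule (quality score times inverse variance) is only one plausible formulation of quality weighting.

```python
# Sketch of fixed-effect pooling with inverse-variance weights and a
# quality-weighted variant. Effect sizes, sampling variances, and 0-1
# quality scores are invented for illustration.

def pooled_effect(effects, variances, quality=None):
    """Weighted mean of effect sizes.

    Base weight is the inverse variance, so larger (more precise) studies
    count more. If quality scores are supplied, each weight is multiplied
    by the study's quality score, so rigorous studies pull harder on the
    pooled estimate.
    """
    weights = [1.0 / v for v in variances]
    if quality is not None:
        weights = [w * q for w, q in zip(weights, quality)]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)

effects = [0.30, 0.50, 0.10]    # hypothetical standardized mean differences
variances = [0.04, 0.08, 0.02]  # hypothetical sampling variances
quality = [1.0, 0.9, 0.4]       # hypothetical quality scores in [0, 1]

plain = pooled_effect(effects, variances)
weighted = pooled_effect(effects, variances, quality)

print(round(plain, 3))     # 0.214
print(round(weighted, 3))  # 0.269 (the low-quality third study counts less)
```

Shrinking the third study's weight moves the pooled estimate toward the higher-quality studies, which is exactly the behavior (and the interpretive hazard) the surrounding text describes.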
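The sequential strategy just described (pool the highest-quality studies first, then re-pool as progressively lower-quality studies are added) can be sketched as follows. All study values are invented, and simple fixed-effect inverse-variance pooling is assumed.

```python
# Sketch of a quality-based sensitivity analysis: pool the highest-quality
# studies first, then re-pool after each addition of a lower-quality study
# and watch how the estimate drifts. All values are invented.

def inverse_variance_pool(studies):
    """Fixed-effect pooled estimate with inverse-variance weights."""
    weights = [1.0 / s["var"] for s in studies]
    return sum(w * s["effect"] for w, s in zip(weights, studies)) / sum(weights)

studies = [
    {"name": "A", "effect": 0.40, "var": 0.05, "quality": 5},
    {"name": "B", "effect": 0.35, "var": 0.04, "quality": 4},
    {"name": "C", "effect": 0.60, "var": 0.10, "quality": 2},
    {"name": "D", "effect": 0.90, "var": 0.20, "quality": 1},
]

by_quality = sorted(studies, key=lambda s: s["quality"], reverse=True)
trajectory = []
for k in range(1, len(by_quality) + 1):
    estimate = inverse_variance_pool(by_quality[:k])
    trajectory.append((by_quality[k - 1]["name"], round(estimate, 3)))

# A flat trajectory suggests robustness to quality; a drift upward as the
# low-quality studies enter, as in these invented data, suggests fragility.
for name, estimate in trajectory:
    print(name, estimate)
```

Plotting this trajectory against the cumulative quality cutoff gives the graphical portrayal the text attributes to Moher et al. (1996).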
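A component moderator analysis of the kind described above can be sketched minimally as follows, with invented data and random assignment as the hypothetical quality component. A real synthesis would also test the between-group difference formally (e.g., with a between-groups heterogeneity statistic) rather than only inspecting the gap.

```python
# Minimal sketch of a component moderator analysis: compare pooled effects
# between studies with and without a quality feature (here, the invented
# feature "randomized"). All numbers are fabricated for illustration.

def pool(studies):
    """Fixed-effect pooled estimate with inverse-variance weights."""
    weights = [1.0 / s["var"] for s in studies]
    return sum(w * s["effect"] for w, s in zip(weights, studies)) / sum(weights)

studies = [
    {"effect": 0.25, "var": 0.05, "randomized": True},
    {"effect": 0.30, "var": 0.04, "randomized": True},
    {"effect": 0.55, "var": 0.10, "randomized": False},
    {"effect": 0.65, "var": 0.08, "randomized": False},
]

randomized = pool([s for s in studies if s["randomized"]])
nonrandomized = pool([s for s in studies if not s["randomized"]])
gap = nonrandomized - randomized

# A sizable gap flags the design feature as a moderator to control in
# subsequent analyses; a negligible gap supports retaining all designs.
print(round(randomized, 3), round(nonrandomized, 3), round(gap, 3))
```

Because coding a binary design feature requires little inference, this style of analysis illustrates why the component approach can achieve higher reliability than global quality ratings.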

Kunz, R., & Oxman, A. (1998). The unpredictability paradox: Review of empirical comparisons of randomised and non-randomised clinical trials. British Medical Journal, 317, 1185–1190.
Liberati, A., Himel, H., & Chalmers, T. (1986). A quality assessment of randomized control trials of primary treatment of breast cancer. Journal of Clinical Oncology, 4, 942–951.
Linde, K., Scholz, M., Ramirez, G., Clausius, N., Melchart, D., & Jonas, W.B. (1999). Impact of study quality on outcome in placebo-controlled trials of homeopathy. Journal of Clinical Epidemiology, 52, 631–636.
Lohr, K., & Carey, T. (1999). Assessing "best evidence": Issues in grading the quality of studies for systematic reviews. Joint Commission Journal on Quality Improvement, 25, 470–479.
McNutt, R., Evans, A., Fletcher, R., & Fletcher, S. (1990). The effects of blinding on the quality of peer review: A randomized trial. JAMA, 263, 1371–1376.
Moher, D., Cook, D., Jadad, A., Tugwell, P., Moher, M., Jones, A., et al. (1999). Assessing the quality of reports of randomised trials: Implications for the conduct of meta-analyses. Health Technology Assessment, 3(12), 1–98.
Moher, D., Jadad, A., Nichol, G., Penman, M., Tugwell, P., & Walsh, S. (1995). Assessing the quality of randomized controlled trials: An annotated bibliography of scales and checklists. Controlled Clinical Trials, 16, 62–73.
Moher, D., Jadad, A., & Tugwell, P. (1996). Assessing the quality of randomized controlled trials: Current issues and future directions. International Journal of Technology Assessment in Health Care, 12, 195–208.
Moher, D., & Olkin, I. (1995). Meta-analysis of randomized controlled trials: A concern for standards. JAMA, 274, 1962–1964.
Moher, D., Pham, B., Jones, A., Cook, D., Jadad, A., Moher, M., et al. (1998). Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet, 352, 609–613.
Moher, D., Schulz, K., & Altman, D. (2001). The CONSORT statement: Revised recommendations for improving the quality of reports of parallel-group randomized trials. JAMA, 285, 1987–1991.
Morley, S., Eccleston, C., & Williams, A. (1999). Systematic review and meta-analysis of randomized controlled trials of cognitive behaviour therapy and behaviour therapy for chronic pain in adults, excluding headache. Pain, 80, 1–13.
Ortiz, Z., Shea, B., Suarez-Almazor, M., Moher, D., Wells, G., & Tugwell, P. (1998). The efficacy of folic acid and folinic acid in reducing methotrexate gastrointestinal toxicity in rheumatoid arthritis: A meta-analysis of randomized controlled trials. Journal of Rheumatology, 25, 36–43.
Oxman, A. (1994). Checklists for review articles. British Medical Journal, 309, 648–651.
Petersen, M., & White, D. (1989). An information synthesis approach to reviewing literature. In M. Petersen & D. White (Eds.), Health care of the elderly: An information sourcebook (pp. 26–36). Newbury Park, CA: Sage.
Saunders, L., Soomro, G., Buckingham, J., Jamtvedt, G., & Raina, P. (2003). Assessing the methodological quality of nonrandomized intervention studies. Western Journal of Nursing Research, 25, 223–237.
Schulz, K., Chalmers, I., & Altman, D. (2002). The landscape and lexicon of blinding in randomized trials. Annals of Internal Medicine, 136, 254–259.
Schulz, K., Chalmers, I., Hayes, R., & Altman, D. (1995). Empirical evidence of bias: Dimensions of methodological quality associated with estimates of treatment effects. JAMA, 273, 408–412.
Sharpe, D. (1997). Of apples and oranges, file drawers and garbage: Why validity issues in meta-analysis will not go away. Clinical Psychology Review, 17, 881–901.
Sindhu, F., Carpenter, L., & Seers, K. (1997). Development of a tool to rate the quality assessment of randomized controlled trials using a Delphi technique. Journal of Advanced Nursing, 25, 1262–1268.
Sterne, J., Juni, P., Schulz, K., Altman, D., Bartlett, C., & Egger, M. (2002). Statistical methods for assessing the influence of study characteristics on treatment effects in "meta-epidemiological" research. Statistics in Medicine, 21, 1513–1524.
Tritchler, D. (1999). Modelling study quality in meta-analysis. Statistics in Medicine, 18, 2135–2145.
University of Maryland. (2003). Meta-analysis of research studies. Retrieved May 13, 2003, from http://ericae.net/meta.
Valentine, J., & Cooper, H. (2003). What Works Clearinghouse Study Design and Implementation Assessment Device (Version 0.6). Washington, DC: U.S. Department of Education. Retrieved from http://www.w-w-c.org.
van der Heijden, G., van der Windt, D., Kleijnen, J., Koes, B., & Bouter, L. (1996). Steroid injections for shoulder disorders: A systematic review of RCTs. British Journal of General Practice, 46, 309–316.
Vickers, A., & Smith, C. (2000). Incorporating data from dissertations in systematic reviews. International Journal of Technology Assessment in Health Care, 16, 711–713.
West, S., King, V., Carey, T., Lohr, K., McKoy, N., Sutton, S., et al. (2002). Systems to rate the strength of scientific evidence (Evidence Report/Technology Assessment No. 47; AHRQ Publication No. 02-E016). [Prepared by Research Triangle Institute–University of North Carolina Evidence-Based Practice Center under Contract No. 290-97-0011.] Rockville, MD: Agency for Healthcare Research and Quality.
Wortman, P.M. (1994). Judging research quality. In H. Cooper & L. Hedges (Eds.), The handbook of research synthesis (pp. 97–109). New York: Russell Sage Foundation.