Psychological Methods
1998, Vol. 3, No. 3, 339-353
Copyright 1998 by the American Psychological Association, Inc.
1082-989X/98/$3.00

Using Ratios as Effect Sizes for Meta-Analysis of Dichotomous Data: A Primer on Methods and Issues

C. Keith Haddock
University of Missouri-Kansas City

David Rindskopf
Graduate Center of the City University of New York

William R. Shadish
University of Memphis

Many meta-analysts incorrectly use correlations or standardized mean differences to compute effect sizes on dichotomous data. Odds ratios and their logarithms should almost always be preferred for such data. This article reviews the issues and shows how to use odds ratios in meta-analytic data, both alone and in combination with other estimators. Examples illustrate procedures for estimating the weighted average of such effect sizes and methods for computing estimates, confidence intervals, and homogeneity tests. Descriptions of fixed- and random-effects models help determine whether effect sizes are functions of study characteristics, and a random-effects regression model, previously unused for odds ratio data, is described. Although all but the latter of these procedures are already widely known in areas such as medicine and epidemiology, the absence of their use in psychology suggests a need for this description.

Psychological researchers often use dichotomous outcome variables in which participants are classified into two distinct groups, such as pregnant and not pregnant, recidivist and not recidivist, improved and not improved, or passed and failed. When a dichotomous outcome variable is a function of a comparison between two treatment conditions, a common way of presenting such data is by means of a fourfold table. In a fourfold table, two dichotomous variables are crossed with one another to form four possible categories. For example, a psychologist could study the relative effectiveness of two programs for lowering the teenage pregnancy rate in a community. Participants might be randomly assigned to one of the two programs; then the proportion of each group that became pregnant after treatment would be determined. Here, both the antecedent factor (treatment condition) and the outcome variable (pregnant vs. not pregnant) are dichotomous. Jacobson, Follette, and Revenstorf (1984) and Blanchard and Schwarz (1988) suggested methods of statistically defining a clinically significant change in psychotherapy treatment that potentially yields a dichotomous outcome (clinically significant change or not). Of course, the two variables need not be treatment and outcome; for example, one might be interested in the relationship between sex (male vs. female) and hiring decisions (hired vs. not hired). The issues we outline in the present article apply equally well to all fourfold designs, even though we focus on treatment outcome studies.

In these cases, effect sizes are often inappropriately computed with methods developed for continuous data, such as standardized mean difference statistics or phi correlation coefficients. The present article reviews better effect size estimates for these situations and appropriate statistical models for meta-analysis of such effect size data. The effect size measures presented in this article are widely used in epidemiology, chemistry, genetics, and medicine (Feinberg, 1980; Sandercock, 1989). Indeed, in meta-analysis, these issues and the proper analyses are well known by statistical experts (e.g., Fleiss, 1994), and some of the material we present has been described previously (e.g., Shadish & Haddock, 1994). However, these methods almost never have been used in psychology and education. Rather, most researchers in psychology and education continue to use incorrect effect size estimates such as the standardized mean difference or the phi correlation coefficient. Hence, a primer is needed to sensitize the psychological and educational communities to the issues and appropriate methods for such meta-analytic work.¹ In addition, we present a hierarchical linear model for the fixed- and random-effects analysis of such data in meta-analysis. Although this extension of standard hierarchical models to such data is straightforward, such an extension has not been presented in the literature previously for odds ratio data.

C. Keith Haddock, Department of Psychology, University of Missouri-Kansas City; David Rindskopf, Department of Psychology, Graduate Center of the City University of New York; William R. Shadish, Department of Psychology, University of Memphis.
The authors are listed in alphabetical order.
Correspondence concerning this article should be addressed to C. Keith Haddock, Department of Psychology, University of Missouri, 5319 Holmes, Kansas City, Missouri 64110-2499.

¹ Although this article focuses on only fourfold tables as the exemplar case in meta-analysis, the arguments extend in principle to polytomous categorical variables as well. However, the issues become more complicated given the difficulty of identifying a single correct summary estimate for more than two categories of outcome.

Analyzing Fourfold Tables

The structure and notation of a fourfold table are presented in Table 1. The $p_{ij}$s indicate the proportion of participants or values falling in the cell in row $i$ and column $j$, and the $n_{ij}$s represent the raw frequency of participants falling into the respective cells. The + symbols in the row and column totals represent summation across rows or columns. Thus, for example, $p_{1+}$ represents the proportion of individuals in row 1, summed across columns. Many readers will be more familiar with the sample sizes (i.e., the $n_{ij}$s) of the cells of fourfold tables labeled as A through D; therefore, this notation is also included in Table 1.

Table 1
Fourfold Table

                           Characteristic B
Characteristic A     Present       Absent        Row total
Present              n11 = A       n12 = B       n1+
                     p11           p12           p1+
Absent               n21 = C       n22 = D       n2+
                     p21           p22           p2+
Column total         n+1           n+2           n++
                     p+1           p+2           1.0

Note. pij = proportion of participants or values in row i, column j; nij = raw frequency of participants in row i, column j; + = summation across row or column.

In order to determine whether a statistically significant relationship exists between the variables represented in a fourfold table, the familiar $\chi^2$ test can be used. However, $\chi^2$ is not a measure of the magnitude of the effect; rather, $\chi^2$ is a function of both the magnitude of effect and the number of participants in the study. Theoretically, the number of participants studied should not affect a measure of the magnitude of the association between two variables (Fisher, 1954; Fleiss, 1981).

Table 2 presents data from 24 studies examining the effectiveness of psychosocial treatments for individuals who were addicted to illegal drugs, to alcohol, or to smoking cigarettes. In all these studies, the outcome was categorized as successful (e.g., drug free) or not successful. Therefore, the principal outcome measure in these studies is a dichotomous variable. We use these data to illustrate various computations with fourfold tables throughout this article.

Odds Ratio Measures of Effect Size

There is widespread consensus among statisticians that, with a few exceptions that we discuss shortly, the most appropriate measure of effect size from a fourfold table is the odds ratio (Agresti, 1990; Fleiss, 1981; Kleinbaum, Kupper, & Morgenstern, 1982; Mantel & Hankey, 1975; Sandercock, 1989; Somes & O'Brien, 1985) or a transformation such as the logarithm of the odds ratio. These previous works have presented the formulas and various examples of how to apply them. We summarize such work here, but the interested reader is referred to these works for more detail. We can express $\Omega_E$, the odds of improving given that one is treated with a psychosocial treatment ($E$), as the following ratio of conditional probabilities:

$$\Omega_E = P(I \mid E) / P(I' \mid E),$$

where $P(I \mid E)$ is the probability of improving given psychosocial treatment and $P(I' \mid E) = 1 - P(I \mid E)$ is the probability of not improving given the same treatment. Similarly, $\Omega_{E'}$ can be defined for those patients not receiving the psychosocial treatment ($E'$ denotes receiving the alternative or control treatment) with the probability of improving given the comparison treatment, $P(I \mid E')$, and the probability of not improving given the comparison treatment, $P(I' \mid E')$. The population parameter $\Omega_E$ is estimated by the sample value:

$$O_E = \frac{p_{11}/p_{1+}}{p_{12}/p_{1+}} = \frac{p_{11}}{p_{12}}.$$

An overall measure of effect size for a study is the ratio of the two odds:

$$\omega_i = \Omega_E / \Omega_{E'},$$

estimated by the sample odds ratio:

$$o_i = \frac{p_{11}/p_{12}}{p_{21}/p_{22}} = \frac{p_{11}\,p_{22}}{p_{12}\,p_{21}}. \qquad (1)$$

An equivalent formula to Equation 1 is

$$o_i = \frac{n_{11}\,n_{22}}{n_{12}\,n_{21}}. \qquad (2)$$

Notice that Equation 2 is equivalent to multiplying the main diagonal entries (raw frequencies) of the cells of the fourfold table and then dividing by the product of the off-diagonal entries. For this reason, the odds ratio is sometimes referred to as the cross-product ratio.

Odds ratios can equal any nonnegative number. Independence of the two dichotomous variables in a fourfold table is equivalent to $\omega_i = 1.0$. When $\omega_i > 1$, individuals in row 1 of the fourfold table (usually the group with the antecedent factor or the treatment group) are more likely to be in column 1 (usually the outcome of interest, such as being cured of depression) than participants in row 2 (typically the control or alternative treatment group); that is, the treatment is more effective than the control. Similarly, when $0 < \omega_i < 1.0$, participants in row 1 are less likely to be in column 1 than participants in row 2; that is, the treatment is less effective than the control. As general rules of thumb, odds ratios close to 1.0 represent a weak relationship between variables, whereas odds ratios over 3.0 for positive associations (less than one third for negative associations) indicate strong relationships. Odds ratios for all the studies in Table 2 are presented in Table 3.

Table 2
Data From 24 Studies of Treatment for Addiction

                                     Treatment            Comparison
Study                             Success  Failure     Success  Failure
Alcohol studies
  Bowers & Al-Redha (1990)            4        3           1        4
  Drummond et al. (1990)              5       14           6       12
  Ferrell & Galassi (1981)            5        3           3        6
  Madill et al. (1966)                3        9           3       12
  Olson et al. (1981)                16       10          19        3
  Orford & Edwards (1977)            18       28          17       29
  Penick et al. (1969)                9       12          15        8
  Primo et al. (1972)                 5        2           1        3
  Sannibale (1988)                   15       15           4        4
  Shaffer et al. (1964)              21        7          20        8
  Stein et al. (1975)                11       17           8       20
  Walker et al. (1983)               42       52          49       63
Substance abuse studies
  Henggeler et al. (1991)            89        3          71       13
  Joanning et al. (1992)             19       16           4       21
  Lewis et al. (1990)                24       20          15       25
  McClellan et al. (1993)            21       10           3        7
  Ziegler-Driscoll (1977)            18       19           3        9
Smoking cessation studies
  Areechon & Punnotok (1988)         56       22          37       27
  Hall et al. (1985)                 71       49          36       81
  Harackiewicz et al. (1988)         16       74           8       30
  Hjalmarson (1984)                  38       65          19       77
  Jamrozik et al. (1984)             10       90           8       89
  Puska et al. (1979)                29       55          21       55
  RCBTS (1983)                       68      212          76      189

Note. RCBTS = Research Committee of the British Thoracic Society.

The odds ratio is invariant to row or column multiplication by a nonzero constant, such as with doubling the number of individuals included in one row of the fourfold table by doubling the number of individuals given the treatment. The odds ratio is also invariant when the fourfold table is changed so that the rows of the fourfold table become columns and the columns become rows, which is important in nonexperimental studies where ambiguity sometimes exists as to which variable should be considered the antecedent and which should be the outcome. When the order of the rows or columns is reversed, the new value of $\omega_i$ is simply the inverse of the original value; that is, $\omega_i$ becomes $1/\omega_i$.

Statisticians frequently transform odds and odds ratios by taking the natural logarithm (log base $e$, typically symbolized as $\ln x$ and meaning to take the natural logarithm of $x$) of these quantities, $\ln \omega_i$, which is estimated by the sample log odds ratio $\ln o_i$. The logarithm of an odds is called the logit, and the logarithm of the odds ratio is sometimes called the logistic difference (because the logarithm of a ratio is the difference between the logarithms of the numerator and denominator). This transformation remedies an interpretation problem with the odds ratio: unlike a correlation coefficient or a standardized mean difference statistic, an odds ratio of zero does not indicate a lack of association between variables.
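The cross-product formula in Equation 2 and the invariance properties just described can be checked numerically. The following Python sketch is illustrative only and is not part of the original analyses; the function names `odds_ratio` and `log_odds_ratio` are hypothetical, and the cell counts are those of the Bowers & Al-Redha (1990) study in Table 2.

```python
import math


def odds_ratio(n11, n12, n21, n22):
    """Cross-product ratio (Equation 2) for a fourfold table.

    n11, n12 = successes and failures in row 1 (treatment);
    n21, n22 = successes and failures in row 2 (comparison).
    """
    return (n11 * n22) / (n12 * n21)


def log_odds_ratio(n11, n12, n21, n22):
    """Natural logarithm of the sample odds ratio."""
    return math.log(odds_ratio(n11, n12, n21, n22))


# Bowers & Al-Redha (1990): treatment 4/3, comparison 1/4 (Table 2).
o = odds_ratio(4, 3, 1, 4)

# Doubling every count in the treatment row leaves the odds ratio unchanged.
o_row_doubled = odds_ratio(8, 6, 1, 4)

# Interchanging rows and columns (transposing) also leaves it unchanged.
o_transposed = odds_ratio(4, 1, 3, 4)

# Reversing the order of the rows yields the reciprocal,
# so the log odds ratio only changes sign.
o_reversed = odds_ratio(1, 4, 4, 3)

print(f"odds ratio            = {o:.2f}")
print(f"row doubled           = {o_row_doubled:.2f}")
print(f"transposed            = {o_transposed:.2f}")
print(f"rows reversed         = {o_reversed:.4f}")
print(f"log odds ratio        = {log_odds_ratio(4, 3, 1, 4):.2f}")
print(f"log OR, rows reversed = {log_odds_ratio(1, 4, 4, 3):.2f}")
```

Working with raw frequencies rather than proportions avoids rounding error, because the row totals cancel in Equation 1.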
However, because a value of unity suggests no association in an odds ratio, taking the natural logarithm of the odds ratio ($\ln o_i$) will make zero represent no effect, because $\ln(1) = 0$. The left and right limits of $\ln o_i$ are negative and positive infinity (the function can change infinitely in either direction away from zero), whereas the left limit of the odds ratio is zero. The log odds ratio is symmetric about zero, and a reversal of rows or columns merely changes the sign (Agresti, 1990). Two values of $\ln o_i$ that are equal except for sign represent the identical degree of association, because $\ln o_i = -\ln(1/o_i)$. When using the log odds ratio, one may also present the underlying odds ratio, or $e^{\ln o_i}$.

An approximate 95% confidence interval of the log odds ratio can easily be calculated. First, calculate the log odds ratio. Next, find the variance of the log odds ratio, which is the sum of the reciprocals of the raw frequencies of the fourfold table:

$$v_i = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}. \qquad (3)$$

The approximate standard error of the log odds ratio is the square root of Equation 3. Next, calculate lower and upper bounds for a 95% confidence interval:

$$\lambda_L = \ln o_i - [1.96 \times SE(\ln o_i)]$$
$$\lambda_U = \ln o_i + [1.96 \times SE(\ln o_i)], \qquad (4)$$

where $\lambda_L$ is the lower limit of the 95% confidence interval of the log odds ratio and $\lambda_U$ is the upper limit of the same interval. A more complicated yet generally more accurate method of computing the 95% confidence interval for odds ratios is provided by Fleiss (1981, pp. 71-75), and exact upper and lower limits have been calculated for a limited number of tables by D. G. Thomas and Gart (1977). However, the confidence interval defined by Equation 4 is reasonably accurate and can be generally used (Sandercock, 1989). Log odds ratios and variances for all studies in Table 2 are presented in Table 3.

Table 3
Derived Estimates From Addiction Studies

                                   Odds      Log odds
Study                              ratio     ratio       Variance
Alcohol studies
  Bowers & Al-Redha (1990)         5.33       1.67        1.83
  Drummond et al. (1990)           0.71      -0.34        0.52
  Ferrell & Galassi (1981)         3.33       1.20        1.03
  Madill et al. (1966)             1.33       0.29        0.86
  Olson et al. (1981)              0.25      -1.37        0.55
  Orford & Edwards (1977)          1.10       0.09        0.19
  Penick et al. (1969)             0.40      -0.92        0.39
  Primo et al. (1972)              7.50       2.01        2.03
  Sannibale (1988)                 1.00       0.00        0.63
  Shaffer et al. (1964)            1.20       0.18        0.37
  Stein et al. (1975)              1.62       0.48        0.33
  Walker et al. (1983)             1.04       0.04        0.08
Substance abuse studies
  Henggeler et al. (1991)          5.43       1.69        0.44
  Joanning et al. (1992)           6.23       1.83        0.41
  Lewis et al. (1990)              2.00       0.69        0.20
  McClellan et al. (1993)          4.90       1.59        0.62
  Ziegler-Driscoll (1977)          2.84       1.05        0.55
Smoking cessation studies
  Areechon & Punnotok (1988)       1.86       0.62        0.13
  Hall et al. (1985)               3.26       1.18        0.08
  Harackiewicz et al. (1988)       0.81      -0.21        0.23
  Hjalmarson (1984)                2.37       0.86        0.17
  Jamrozik et al. (1984)           1.24       0.21        0.25
  Puska et al. (1979)              1.38       0.32        0.12
  RCBTS (1983)                     0.80      -0.23        0.04

Note. RCBTS = Research Committee of the British Thoracic Society.

One potential difficulty in computing odds ratios is that if any of the proportions in the fourfold table equals zero, the odds ratio is undefined. A suggested remedy for this situation (Fleiss, 1981) is to add 0.5 to each observed frequency:

$$o'_i = \frac{(n_{11} + 0.5)(n_{22} + 0.5)}{(n_{12} + 0.5)(n_{21} + 0.5)}. \qquad (5)$$

The standard error of $o'_i$ is

$$SE(o'_i) = o'_i \sqrt{\frac{1}{n_{11} + 0.5} + \frac{1}{n_{12} + 0.5} + \frac{1}{n_{21} + 0.5} + \frac{1}{n_{22} + 0.5}}. \qquad (6)$$

However, if other cell values are small, this remedy may not work well.

Other Measures of Effect Size

In general, the odds ratio (or log odds ratio) will be the effect size of choice for fourfold tables (Agresti, 1990; Bishop, Fienberg, & Holland, 1975; Fleiss, 1981; Hamilton, 1979). The key benefits of the odds ratio are its invariance across study designs, the fact that it is not affected by unequal sample sizes, its invariance when either rows or columns of the fourfold table are multiplied by a constant, and its compatibility with logistic regression or loglinear models that already use the odds ratio as the measure of association. Finally, the odds ratio is not affected by differences in the marginal distributions of the fourfold table. No other effect size measure (e.g., raw or standardized difference between proportions successful in each group) has all these benefits.

Researchers in psychology and education usually incorrectly analyze dichotomous data with effect size measures for continuous data such as the standardized mean difference statistic ($d$) or the product-moment correlation coefficient ($\phi$). These measures will tend to underestimate the size of the effect, unless the marginal distributions are similar to each other (Fleiss, 1981, p. 60). To illustrate, Table 4 presents standardized mean difference measures ($d$) from the substance abuse studies in Table 2. The first $d$ ($d_1$) was estimated by using the method recommended by Hasselblad and Hedges (1995), where the product of the log odds ratio and $\sqrt{3}/\pi$ yields a standardized mean difference statistic. The second $d$ ($d_2$) was obtained by computing means and standard deviations directly from the dichotomous data. Means and standard deviations are computed by assigning a value of 0.0 to all participants at one level of the outcome variable (usually those who did not experience the outcome, such as with nonimprovement in treatment) and a value of 1.0 to all participants at the remaining level of the outcome variable (usually those who experienced the outcome, such as with improvement in treatment). Then, means and standard deviations are computed for both groups on this two-point scale of measurement. Treating the data as continuous (i.e., $d_2$) yields a consistently smaller effect size than if one appropriately estimates the standardized mean difference from the odds ratio (i.e., $d_1$). Notice that the magnitude of these underestimations is variable but quite large in some cases. For $d$ to be optimal, the ratio of the group sizes should be about the same as the ratio of successes to failures. Consistent with this rule, the Henggeler et al. (1991) study has the worst marginal match and the worst estimate of $d$ with the continuous measure estimate. Conversely, the Lewis, Piercy, Sprenkle, and Trepper (1990) study has the best marginal match and the least discrepancy between $d_1$ and $d_2$. The others fall in between.

Table 4
Comparison of Two Methods for Computing the Standardized Mean Difference

Study                          d1       d2       φ      p1+/p2+    p+1/p+2
Henggeler et al. (1991)      0.933    0.432    .21     52/48%     91/9%
Joanning et al. (1992)       1.009    0.840    .39     58/42%     38/62%
Lewis et al. (1990)          0.382    0.343    .17     52/48%     46/54%
McClellan et al. (1993)      0.876    0.791    .33     76/24%     59/41%
Ziegler-Driscoll (1977)      0.576    0.478    .21     76/24%     43/57%

Note. d1 = standardized mean difference estimated with Hasselblad and Hedges (1995) method; d2 = standardized mean difference computed from means and standard deviations; φ = phi coefficient; p1+/p2+ = proportion of treatment versus controls; p+1/p+2 = proportion of successes versus failures.

Likewise, when the marginal distributions of the fourfold table are not equivalent, $\phi$ will not provide a satisfactory measure of effect size. Notice in Table 4 that data from the Henggeler et al. (1991) and Ziegler-Driscoll (1977) studies produce equivalent correlations; however, the corresponding effect sizes based on the Hasselblad and Hedges (1995) formula are quite different. Using $\phi$ as a measure of effect size would provide misleading evidence for the conclusion that these two studies produce similar treatment effects. Similarly, Table 5 illustrates that a series of studies may demonstrate an identical relationship between factors, whereas, if $\phi$ was used as the measure of effect size, the reviewer would incorrectly conclude that the relationship varies widely. In this series of five hypothetical studies examining the relationship between exposure to a toxin and disease occurrence, the proportion of exposed individuals is allowed to vary while the relationship between exposure and disease occurrence is held constant. Notice that whereas the odds ratio is constant across the studies, $\phi$ varies considerably. It is smallest when the proportion exposed is low (or high) and largest when the proportion exposed is .50.

Another limitation of $\phi$ is its inability to achieve unity when the marginal distributions are unequal. For instance, Nunnally (1979, Figure 4-3) demonstrated that if one is comparing two treatment groups with equal sample sizes and 20% of all participants improve, the largest possible $\phi$ (even if all improvers are in one of the groups) is .45. Finally, Fleiss (1981) has demonstrated that $\phi$ lacks the property of invariance across study designs. That is, two researchers investigating the same phenomenon with different research designs will obtain different estimates if $\phi$ is used as the measure of effect size. This occurs because $\phi$ is a function of $\chi^2$ (i.e., $\phi = (\chi^2/N)^{1/2}$), which varies in statistical power across various study designs. For instance, a prospective study with equal sample sizes among groups results in a more powerful $\chi^2$ test than that from a cross-sectional analysis with an equivalent overall sample size (Fleiss, 1981). There are additional problems with adopting $\phi$ as a measure of effect size, and the reader is encouraged to consult standard references for a detailed examination of these deficiencies (cf. Carroll, 1961; Fleiss, 1981; Goodman & Kruskal, 1954; Kleinbaum et al., 1982).

The only correlational measure that might prove useful is the tetrachoric correlation. It has the advantage that it is completely insensitive to differences in marginal distributions if the assumption of underlying bivariate normality is met. In practice, even though the assumption is probably not met for the type of data encountered in meta-analysis, tetrachoric correlations probably behave as nicely as odds ratios. However, because tetrachorics are more difficult to calculate and odds ratios are not based on underlying distributional assumptions, odds ratios and log odds ratios are preferable measures of effect size.

Primary analysts and meta-analysts might consider measures other than the odds ratio under limited circumstances. When reviewing prospective treatment outcome studies with equal sample sizes, and if proportions observed in all studies are neither too large nor too small, the rate difference (raw difference between proportions) will generally give results similar to an analysis of the log odds ratios.
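The confidence interval of Equations 3 and 4 and the contrast among the log odds ratio, the two standardized mean differences, and the phi coefficient can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' own computation: all function names are hypothetical, and the pooled standard deviation for the 0/1 scoring is assumed to use n - 1 denominators, a choice that appears to reproduce the Henggeler et al. (1991) entry in Table 4.

```python
import math


def log_or_with_ci(n11, n12, n21, n22, z=1.96):
    """Log odds ratio, its standard error (Eq. 3), and 95% limits (Eq. 4)."""
    ln_o = math.log((n11 * n22) / (n12 * n21))
    se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    return ln_o, se, (ln_o - z * se, ln_o + z * se)


def corrected_odds_ratio(n11, n12, n21, n22):
    """Odds ratio after adding 0.5 to every cell (Eq. 5)."""
    return ((n11 + 0.5) * (n22 + 0.5)) / ((n12 + 0.5) * (n21 + 0.5))


def d_from_log_odds_ratio(n11, n12, n21, n22):
    """Standardized mean difference as ln(o) * sqrt(3) / pi."""
    return math.log((n11 * n22) / (n12 * n21)) * math.sqrt(3) / math.pi


def d_from_binary_scores(n11, n12, n21, n22):
    """Standardized mean difference with outcomes scored 0/1.

    Assumption: group variances use n - 1 denominators and are pooled.
    """
    n1, n2 = n11 + n12, n21 + n22
    p1, p2 = n11 / n1, n21 / n2
    ss1 = n1 * p1 * (1 - p1)  # sum of squared deviations, group 1
    ss2 = n2 * p2 * (1 - p2)  # sum of squared deviations, group 2
    pooled_sd = math.sqrt((ss1 + ss2) / (n1 + n2 - 2))
    return (p1 - p2) / pooled_sd


def phi_coefficient(n11, n12, n21, n22):
    """Phi (product-moment) correlation for a fourfold table."""
    num = n11 * n22 - n12 * n21
    den = math.sqrt((n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22))
    return num / den


# Bowers & Al-Redha (1990) from Table 2: interval for the log odds ratio.
ln_o, se, (lower, upper) = log_or_with_ci(4, 3, 1, 4)
print(f"ln(o) = {ln_o:.2f}, variance = {se ** 2:.2f}, "
      f"95% CI = [{lower:.2f}, {upper:.2f}]")

# Henggeler et al. (1991) from Table 2: compare the effect size measures.
cells = (89, 3, 71, 13)
print(f"d1  = {d_from_log_odds_ratio(*cells):.3f}")
print(f"d2  = {d_from_binary_scores(*cells):.3f}")
print(f"phi = {phi_coefficient(*cells):.2f}")

# First and last rows of Table 5: same odds ratio, different phi.
for row in [(40, 10, 190, 760), (400, 100, 100, 400)]:
    o = (row[0] * row[3]) / (row[1] * row[2])
    print(f"odds ratio = {o:.1f}, phi = {phi_coefficient(*row):.2f}")
```

The interval for the small Bowers & Al-Redha study spans zero despite its large odds ratio, which shows why the variance in Equation 3 matters when studies are later weighted and combined.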
Sometimes, that if one is comparing two treatment groups with one variable involved in a fourfold table may origi- equal sample sizes and 20% of all participants im- nally have been continuous but may arbitrarily be prove, the largest possible (even if all improvers are dichotomized. In this case, the research could use in one of the groups) is .45. Finally, Fleiss (1981) has techniques designed to recover the continuous metric. demonstrated that 4> lacks the property of invariance If only one variable was dichotomized, a biserial cor- across study designs. That is, two researchers inves- relation is appropriate if bivariate normality of the tigating the same phenomenon with different research underlying distribution is assumed. If both variables designs will obtain different estimates if <|> is used as were dichotomized, the tetrachoric correlation is ap-

Table 5 Comparison of Phi and Odds Ratio in Estimating the Relationship Between Exposure and Disease Exposed group Unexposed group Proportion exposed Diseased Healthy Diseased Healthy 4> Odds ratio 5% 40 10 190 760 .31 16.0 10% 80 20 180 720 .41 16.0 15% 120 30 170 680 .47 16.0 25% 200 50 150 600 .54 16.0 50% 400 100 100 400 .60 16.0 META-ANALYSIS OF DICHOTOMOUS DATA 345 propriate if underlying bivariate normality is as- to the fixed-effects model when fixed-effects assump- sumed. However, the latter assumption frequently is tions are met and partly because certain features of questionable in treatment outcome studies, because random-effects models may better reflect uncertainty one dichotomy (treatment-comparison) might not be about the data. The analysis of fixed-effects models is considered an artificial dichotomization of a con- somewhat simpler computationally, however, and tinuum in most cases, and in any case, the computa- may be appropriate under some circumstances, espe- tional demands of the tetrachoric correlation are con- cially when homogeneity tests are nonsignificant. siderable. In all these cases, moreover, the odds ratio Table 6 reports results from both fixed- and random- always does at least as well as any other effect size effects models on the log odds ratio data in Table 3 measure. with methods outlined by Shadish and Haddock (1994); because the latter source is readily accessible, Univariate Meta-Analyses of Odds Ratios we do not present the pertinent formulas here. A test of homogeneity provides an initial look into A number of authors have described fixed- and whether a fixed- or random-effects model might be random-effects univariate methods for combining appropriate. 
The significant homogeneity test over all odds ratios gathered from multiple studies in meta- analysis (e.g., Berlin, Laird, Sacks, & Chalmers, 24 studies in Table 6 shows that the odds ratios over 1989; DerSimonian & Laird, 1986; Fleiss, 1981; those studies are significantly more variable than Shadish & Haddock, 1994). Under a fixed-effects would be expected by chance, which makes a fixed- model, the effect size statistics, d,, from the k studies effects model questionable. The variance component under examination estimate a common population pa- provides an index of how much additional variability rameter Sj that applies to only those particular studies is present; it should be zero if a fixed-effects model fits. Hence, the interpretation of the average fixed- and is fixed at a particular value (i.e., 8L = . .. = 8t effects log odds ratio is ambiguous. It does describe a = 8). The effect size estimate dt from any given study differs from the population parameter 8 only because weighted average of study effects, but that average of sampling error. That is, a given study used only a does not estimate a population parameter for these 24 sample of participants from the population, so the studies. Given heterogeneity, it is possible to compute effect size estimate computed on that sample differed a random-effects average log odds ratio that better somewhat from the population effect size. reflects the uncertainty in the data, and it also is re- In a random-effects model, the population param- ported in Table 6. eter is not fixed but is itself random and has its own A significant test of heterogeneity could be attrib- distribution. 
Thus, the observed variability in sample utable to (a) random variation in population param- estimates of effect size vf reflects both the conditional eters or (b) an between treatment outcome variation v, of that effect size around each population and study characteristics (i.e., the fixed-effects model S, and the random variation, a\, of the individual 8, is not well specified, and other variables that affect around the mean population effect size, or vf = outcome should be added to the analysis). In addition, oi + v, when the within-study sample sizes in each study are In the view of many statisticians (e.g., Raudenbush, large, the homogeneity test may be rejected even 1994), the random-effects model is often the appro- when the individual effect size estimates differ only priate one for meta-analysis, partly because it reduces slightly. Hence, if Q is rejected, the researcher may

Table 6 Average Fixed- and Random-Effect Sizes for Addiction Studies

Fixed effects Random effects Average Homogeneity Variance Average Study Indj SE test component Ino, Overall 0.349* 0.090 51.274* 0.256 0.430 Alcohol 0.024 0.169 11.714 0.025 0.024 Substance abuse 1.233* 0.276 3.085 0.000 1.150 Smoking cessation 0.345* 0.116 22.471* 0.282 0.413 *p< .05. 346 HADDOCK, RINDSKOPF, AND SHADISH

wish to (a) disaggregate study effect sizes into appro- The between-studies equation is priate categories until Q is not rejected in those cat- egories, (b) use fixed-effects regression techniques to s, = 7o + iiWij + 72 W2, + . . . account for further variance, (c) use a random-effects where each 7 represents a regression coefficient, and regression model (described below), or (d) use clini- each W represents a study characteristic used as a cal consistency rather than statistical consistency as predictor of the effect sizes. In some contexts, these the criterion for aggregation (Boissel, Blanchard, are called the fixed effects in the model. For each Panak, Peyrieux, & Sacks, 1989) and interpret the study, Uj represents the part of the true effect size that fixed-effects means and variances anyway. The last is not predictable from study characteristics. This rep- suggestion is in the tradition of determining whether resents the random effect in the between-studies differences are practically important rather than just model. statistically different. Sometimes the two equations are presented as one Following on Suggestion a above, Table 6 reports by the substitution of the second into the first: additional univariate analyses disaggregated by type of addiction (alcohol, substance abuse, smoking). Note that the alcohol and substance abuse studies pass The above model is very general; many special cases the homogeneity test, and, consistent with that, the are useful in meta-analysis. The model as written is variance components for these two groups of studies called a random-effects (or sometimes mixed-effects) are very close to zero. However, homogeneity is re- model, because it includes both fixed and random jected for the smoking cessation studies, and its vari- effects. If the Uj terms are dropped from the model ance component is relatively large. (assumed to be zero), then it is called a fixed-effects model, because only fixed effects remain hi the be-

Random-Effects Regression Models for tween-study equation. The variance of uj: which is Meta-Analysis of 2 x 2 Tables denoted as T in the Bryk and Raudenbush (1992) model, is zero for a fixed-effects model, and greater Continuing to disaggregate effect sizes by more than zero for a random-effects model; it is what we predictors rapidly becomes impractical, especially as referred to as the variance component in Table 6. In the number of predictors increases. Alternatively, the our opinion, one should always fit the random-effects researcher could begin with a single random-effects model first; if a model is found in which T is estimated regression model. Such a model has not been de- to be near zero or does not differ significantly from scribed in the psychology or educational literature be- zero, then the model reduces to a fixed-effects model fore. However, a number of approaches to statistical anyway. If the estimate of T is zero, then the estimates modeling in meta-analysis can easily be adapted to the obtained for the 7 parameters are the same as for the analysis of data from fourfold tables. We illustrate by fixed-effects model. using the approach of Bryk and Raudenbush (1992). If the estimate of T is nonzero, an extension of the Their model is described by two equations: a within- usual homogeneity test can be used to test whether T studies equation and a between-studies equation. The differs significantly from zero. Note that the usual within-studies equation is homogeneity test is merely a test of whether the (un- conditional) variation in effect sizes is greater than dj = 8, + ef would be expected because of sampling error alone. where dj is the observed effect size for study/ fy is the The modified test is whether the conditional variance

true effect size for study/ and e} is the sampling error (the residual after accounting for the predictors W) is for study j. The effect size could be a standardized greater than would be expected attributable to sam- mean difference statistic, a correlation coefficient, or pling error alone. Computationally, the test can be a log odds ratio, as in the present case. This equation done by applying the usual calculations (e.g., Equa- merely represents that the observed effect size is a tion 18-6 in Shadish & Haddock, 1994) to the residu- true effect size plus error. In the analysis of fourfold als: tables, the most natural approach is to use the log odds d, - To - 7iWiy - J2W2J - ... - ysWSJ, ratio as the effect size, and its squared standard error (see Equation 3) as the estimate of the sampling vari- where 7 denotes an estimate of the parameter. ance of Cj for each study. Another specialized version of the model is the case META-ANALYSIS OF DICHOTOMOUS DATA 347

in which there are no between-studies variables (WSJ). a test of whether -y0 differs significantly from zero. To

In this case, the model contains only one fixed effect, check the model, we first tested whether T = Var(u}) a constant term that represents the average effect size. differs significantly from zero; this is a test of homo- If there is a single categorical predictor variable, geneity of effect sizes across the studies. The HLM which would result in a series of dummy variables as estimate of T is 0.22989, x2 (23, N = 24) = 51.992, Ws, then the model is equivalent to an overall test of p = .001, so there is true variability in effect sizes homogeneity within subgroups of studies based on the across the studies, and the HLM estimate of •/„ is .43, . The test statistic is the sum of the which is the average effect size (here, the log odds statistics that would be obtained by considering sepa- ratio). Note that these statistics are very close to those rately each group of studies (defined by values of the reported in Table 6 for the overall univariate analysis. categorical predictor) and with degrees of freedom Note also that the square root of T is .48. This means equal to the sum of the degrees of freedom for those that about 95% of the true effect sizes (log odds ra- tests (number of studies minus number of categories tios) should be between a lower bound of [.43 - of the predictor variable). In more complicated cases, 1.96(.48)] and an upper bound of [.43 + 1.96(.48)], or in which there are several categorical predictors, or between -.53 and 1.39. This is quite a wide of with continuous predictors, the method outlined here effect sizes: from somewhat negative to quite strongly is the only practical one, because subdividing the positive. However, Table 3 shows an even larger studies becomes either impractical or impossible. If, range of observed log odds ratios, ranging from -1.37 for example, there were three categorical predictors to 2.01; this discrepancy can be simply explained, and with four levels each, then the studies would form a 4 we do so shortly. 
Finally, note that it is possible to x 4 x 4 design with 64 cells. The only sensible ap- compute a Birge ratio as RB = Q^(df), where Qw is proach under this circumstance is to test main effects the x2 reported above of 51.992 and dfis the degrees 2 and perhaps some two-way interactions with the ran- of freedom for that x . In the present case, RB = 2.26, dom-effects approach. Testing homogeneity across all which suggests that there is 126% more variability 64 cells would be statistically problematic unless than one would expect by chance. there were hundreds of studies, and even then, all one The next question is whether some of this variabil- could say is that the studies either are homogenous or ity can be accounted for by characteristics of the stud- are not. ies. Here we use only a very simple ; Computation can either be done with packaged pro- We include dummy variables for type of addiction. grams or be written easily with any package that al- One can pick any two out of the three categories of the lows matrix computations. For the calculations in this variable type of addiction (the third is omitted, be- article, we have used both the HLM program, which cause it is linearly dependent on the other two plus the is based on the work of Bryk and Raudenbush (1992) constant). We begin by using dummy variables for and Bryk, Raudenbush, and Congdon (1996a, 1996b) alcohol and smoking studies. The statistical model is and the ML-n program (Rasbash & Woodhouse, written as: 1995), which is based on the work of Goldstein (1995) and Goldstein et al. (1998). InOj = -y0 + y1Alcohollj + •y2Smokmg-ij + Uj + ej,

Example Using Data From Addiction where Alcohol is a dummy variable that takes the Treatment Studies value one if the study involved treatment for alcohol addiction and zero otherwise, and Smoking is a To illustrate these methods, we analyze data from dummy variable that takes the value one if the study the 24 studies of treatment for addiction. The studies involved treatment for smoking addiction and zero deal with three different types of addiction (alcohol, otherwise. Part of the output from the computer pro- drugs, and smoking), and one might suspect that the gram HLM is included here as Table 7. It shows the success rate may differ across types of addiction. To estimates of the fixed parameters and tests of whether provide a baseline, the first analysis uses the model these differ significantly from zero. The term labeled

Constant is the intercept -y0; because of the coding of dj = To + «,- + e f the dummy variables, this represents the average ef-

There is one fixed term, -y0, which represents the av- fect size for substance addiction studies. Its value is erage effect size. If this model proves a satisfactory 1.285, which the associated t statistic indicates is sig- one, we are interested in the estimate of ya as well as nificantly different from zero and which closely ap- 348 HADDOCK, RINDSKOPF, AND SHAD1SH

2 test Table 7 and the Birge ratio is now RB = 1.79. The x Fixed Parameters From HLM Program for shows that the residual variation still differs signifi- Addiction Analysis cantly from zero, so there is remaining variation in Fixed effect Gamma (7) SE t effect sizes. Table 6 shows that there is no true varia- tion among effect sizes for either alcohol or substance Constant (-y0) 1.284922 0.329311 3.902 .001 Alcohol (•/,) -1.251355 0.396055 -3.160 .006 abuse treatments; the remaining variation is among

Smoking (y2) -0.870044 0.381450 -2.281 .035 smoking studies. Presumably additional coding of dummy variables might help explain this remaining variation, but at least the source has been located as proximates the univariate result in Table 6. The fixed being in the smoking studies. effect for Alcohol represents the difference between This latter point brings us back to the question of alcohol studies and substance addiction studies; the why the range of the observed log odds ratios in Table associated t statistic suggests the average effect size in 3 is so much larger than that predicted by the random- the alcohol studies is significantly smaller than the effects model. The HLM program weights each study average effect for the substance addiction studies. The odds ratio by a function of its sample size, and it also average effect for alcohol studies is 1.285 - 1.251 = uses information from other studies to predict odds 0.034, very close in value to the effect reported in ratios for each study by shrinking each observed es- Table 6. Similarly, the average effect size for Smoking timate toward the mean of its subgroup. These latter is significantly smaller than that of the substance are called Empirical Bayes estimates (sometimes abuse studies and can be calculated as 1.285 - 0.870 "shrunken" estimates), and they are presented in = 0.415, again very close to the value in Table 6. The Table 9 along with the observed log odds ratios. Fig- average effect for substance abuse studies is therefore ure 1 plots these data so that the shrinkage that occurs significantly larger than for either smoking or alcohol. is more visible. Inspection of observed effect sizes The preceding analyses do not tell whether the av- might lead one to the wrong conclusion that the true erage effect for either alcohol or smoking differ from effect sizes for the alcohol studies are the most vari- each other or from zero, however. 
To obtain such able; in fact, this is attributable to much larger sam- tests, one repeats the analysis two more times, once pling error for alcohol studies, many of which have using smoking as the baseline and once using alcohol very small sample sizes. Taking sampling error into as the baseline. However, one should be concerned account, the results of the alcohol studies are fairly with doing large numbers of hypothesis tests in this consistent, and both the alcohol and substance abuse situation and perhaps use some method to control the studies are far less variable in true effect size than are overall Type I error probability, such as a Bonferroni the smoking studies. All of this is consistent with the correction. The situation is similar to that of doing data in Tables 6, 7, and 8. Therefore, additional analy- follow-up tests in , except that ses are needed to investigate why the smoking studies here the value of zero is important and must be in- are not homogenous: What characteristics of the cluded as well as the observed effect sizes when test- smoking studies might differentiate them in terms of ing for differences. their effects'? Table 8 reports the random-effects part of the Mixing Continuous and Dichotomous Outcomes analysis and shows the estimate of the residual true in Meta-Analysis variation in effect sizes T = 0.14182. Note that the original unexplained variation (T = 0.22898) has Meta-analysts frequently need to combine out- been reduced by 38% by the addition of these three comes, some of which are dichotomous and some of predictors [(0.22898 - 0.14182)70.22898 = 0.38]; which are continuous. For the continuous outcomes, the meta-analyst often appropriately uses a correlation Table 8 coefficient, r, or a standardized mean difference sta- Random Effects From HLM Program for tistic, d. Simple formulas exist for transforming r to d Addiction Analysis or vice versa, so that one result can be reported. The same can be done with the odds ratio. 
For example, Parameter \2 using the same logic as that in comparing logistic Parameter variance df (gvv) p regression with probit regression (see, e.g.. Cox, True effect size (\ ) 0.14182 21 37.642 SH4 ; 1970), we can easily translate from the logistic metric META-ANALYSIS OF DICHOTOMOUS DATA 349

Table 9 Sample Size Issues Observed Log Odds Ratios and Empirical Bayes Estimates A number of sample size issues arise in the meta- analysis of fourfold tables. Major issues for overall Log odds Empirical Bayes sample size are the number of participants in each Study ratio estimates study and the number of studies in the meta-analysis. Alcohol studies The analysis demonstrated previously assumes that Bowers & Al-Redha the sample size in each study is sufficient to accu- (1990) 1.673900 0.151660 rately estimate the sampling variance of the log odds Drummond et al. (1990) -0.336870 -0.045135 ratio in that study. For most meta-analyses, this is Ferrell & Galassi (1981) 1.203900 0.175090 Madill et al. (1966) 0.287430 0.069865 true, because most published studies have at least Olson et al. (1981) -1,374400 -0.255000 moderately large sample sizes. When study sample Orford & Edwards (1977) 0.092579 0.059631 sizes are small, this assumption becomes more prob- Penick et al. (1969) -0.916290 -0.220950 lematic, and one could consider analyses that are Primo et al. (1972) 2.014900 0.163050 more complicated but do not make the assumption of Sannibale (1988) 0.000000 0.027867 known sampling variance. One possibility is to model 0,182320 0.075556 Shaffer et al. (1964) the logit of probability of success within each of the Stein et al. (1975) 0.481190 0.169960 treatment and control groups in a study and have treat- Walker et al. (1983) 0.037296 0.036400 ment status nested within study. For the example we Substance abuse studies used, this would result in a Level 1 equation of the Henggeler et al. (1991) 1.692300 1.385000 form Joanning et al. (1992) 1.830000 1.424300 Lewis et al. (1990) 0,693150 1.038500 McClellan et al. 
(1993) 1.589200 1.341300 Ziegler-Driscoll (1977) 1,044500 1.236000 where Group is coded zero for control and one for treatment group membership, and Level 2 equations Smoking cessation studies of the form Areechon & Punnotok (1988) 0.619500 0.522740 Po; = 800 + + g^Alcohol + Hoj Hall et al. (1985) 1.181700 0.917280 Smakin Harackiewicz et al. (1988) -0.209490 0.179780 Pi; = £10 + gn 8 + Hjalmarson (1984) 0.862470 0.669660 The second Level 2 equation has the coefficients of Jamrozik et al. (1984) 0.211880 0.341080 primary interest in the meta-analysis: glo represents Puska et al. (1979) 0,322810 0.364860 the average effect in substance abuse studies, g is RCBTS (1983) -0.225650 -0.090403 n the difference between the average effects in sub- Note. RCBTS = Research Committee of the British Thoracic stance abuse studies and smoking studies, and g12 is Society. the difference between the average effects in sub- stance abuse studies and alcohol studies. This formu- into a standard normal metric typically used for other lation of the model has the added advantage of gen- effect size estimates. One merely divides the log odds eralizing more easily to cases in which there are more 2 ratios by 1.65 (Cox, 1970; other authors use 1.7 or than two groups. This can be a considerable advan- similar numbers; however, any number around 1.7 tage when there is reason to question these assump- will make the normal curve and the logistic curve tions about sampling variance. We are encouraged, differ by no more than approximately 1%) to put the however, by the fact that, at least in the example used effect sizes on the scale of a standard normal variable. in this article, our meta-analytic model gives very Alternatively, Hasselblad and Hedges (1995) sug- similar results to the analysis Raudenbush suggested, gested that multiplying the log odds ratio by V3/ir estimates a standardized mean difference statistic. 
This is equivalent to dividing the log odds ratio by 2 We thank Stephen Raudenbush (personal communica- 1.81, approximately the same as the 1.65 suggestion tion, July 14, 1997) for pointing out to us that two of the by Cox. The resulting estimates of d can then be com- standard programs for multilevel modeling, HML and ML-n, bined with d statistics derived from continuous vari- can analyze data with this structure without the need to ables so that one result can be reported. assume known sampling variances. 350 HADDOCK, RINDSKOPF, AND SHADISH

-1.5 1

Observed Log Empirical Bayes Odds Ratio Estimates

Figure 1. Plot of observed log odds ratios and Empirical Bayes estimates from Tables 7 and 8. Alcohol studies are represented with solid lines, substance abuse studies with dashed lines, and smoking studies with dotted lines. even though some of our studies have relatively few certain how much. A general rule that we believe participants. Moreover, the approach that Raudenbush (though on the basis of very little evidence) to be suggested probably is less useful to the extent that useful is to consider the "standardized" parameter substantive diversity among dichotomous dependent estimate to have a t distribution, with degrees of free- variables increases, and it is of no use in combining dom equal to the number of studies minus the number studies that include both dichotomous and continuous of parameters estimated (including the constant term). dependent variables. It is in these latter situations that This corresponds to the rule for t tests printed by the meta-analytic techniques we discuss probably ex- HLM and discussed by Bryk and Raudenbush (1992, cel. p. 49ff). A more complete solution to most of the The second major problem with sample size is the problems attributable to inadequate sample size is to number of studies in the meta-analysis. Whereas most use a fully Bayesian model, such as that in the BUGS meta-analyses include at least a few dozen studies, program (A. Thomas, Spiegelhalter, & Gilks, 1992), some may cover a limited area or an area in which although specifying the model, checking for conver- little research has been done. Because the fixed-effect gence, and interpreting the results requires much more estimates for the methods we recommend (as well as sophistication than the methods proposed here. 
Smith, most other methods) assume a large sample of studies, Spiegelhalter, and Thomas (1995) described the use the standardized parameter estimates (parameter esti- of this model for the meta-analysis of binary outcome mates divided by the standard error) will not have a data. standard . Thus, we cannot use the Other sample size issues may arise in meta- usual standard of 1.96 as a critical value; the actual analysis, including that of fourfold tables. One might critical value is most likely larger, though it is not hypothesize that one type of treatment is more or less META-ANALYSIS OF DrCHOTOMOUS DATA 351 effective than another type of treatment and include a Agresti, A. (1990). Categorical data analysis. New York: variable that specifies which type of treatment was Wiley. given in a particular study. If a relatively large num- *Areechon, W., & Punnotok, J. (1988). Smoking cessation ber of studies were done investigating each type of through the use of nicotine chewing gum: A double-blind treatment, the estimate of the difference in effective- trial in Thailand. Clinical Therapeutics, 10, 183-186. ness is not a problem. However, if one treatment is Berlin, J. A., Laird, N. M., Sacks, H. S., & Chalmers, T. C. seldom used, then the estimate of the difference in (1989). A comparison of statistical methods for combin- effectiveness will have a large standard error, which ing event rates from clinical trials. Statistics in Medicine, makes the power to detect any difference very low. 8, 141-151. Unlike what happens with a designed , the Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. meta-analyst must take what comes in terms of the (1975). Discrete mu.ltivarw.te analysis: Theory and prac- distribution of variables. (For those who like viewing tice. Cambridge, MA: MIT Press. these problems in a general context, an extremely un- Blanchard, E. B., & Schwarz, S.P. (1988). 
Clinically sig- balanced dummy variable becomes nearly collinear nificant changes in behavioral medicine. Behavioral As- with the constant term; this can therefore be viewed as sessment, 10, 171-188. a problem of multicollinearity. The result is a large Boissel, J. P., Blanchard, I, Panak, E., Peyrieux, J. C., & standard error and low power.) Sacks, H. (1989). Considerations for the meta-analysis of randomized clinical trials: Summary of a panel discus- Conclusions sion. Controlled Clinical Trials, 10, 254-281. *Bowers, T. G., & Al-Redha, M. R. (1990). A comparison We have outlined odds ratio methods to compute of outcome with group/marital and standard/individual and then aggregate measures of effect size from four- therapies with alcoholics. Journal of Studies on Alcohol, fold tables. These techniques are always superior to 51, 301-309. techniques such as combining phi coefficients, pro- Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical lin- duct-moment correlation coefficients, standardized ear models: Applications and data analysis methods. mean differences, and standardized rate differences. Newbury Park, CA: Sage. These techniques frequently are superior to alterna- Bryk, A. S., Raudenbush, S. W., & Congdon, R. T. (1996a). tives such as rate differences, and they are computa- HIM: Hierarchical linear modeling with the HLM/2L tionally more convenient than otherwise appropriate and HLM/3L programs. Chicago: Scientific Software In- techniques such as the tetrachoric correlation. They ternational. are never inferior to any of these alternative methods. Bryk, A. S., Raudenbush, S. W., & Congdon, R. T. (1996b). Aggregation methods based on odds ratios are also HLM (Version 4) [Computer software]. Chicago: Scien- available for data types not covered in this article. For tific Software International. instance, several variants of the Mantel-Haenszel Carroll, J. B. (1961). 
The nature of the data, or how to technique that deal with combining data from other choose a correlation coefficient. Psychometrika, 26, 347- than fourfold tables are available, including estimates 372. for larger tables (Greenland, 1989; Somes, 1986); Cox, D. R. (1970). Analysis of binary data. New York: variants are also available for case control studies Chapman & Hall. with varying numbers of controls matched to each DerSimonian, R., & Laird, N. (1986). Meta-analysis in case (Fleiss, 1984). With the explication of a random clinical trials. Controlled Clinical Trials, 7, 177-188. regression model for odds ratios in the present article, *Drummond, D. C., Thorn, B., Brown, C., Edwards, G., & moreover, all the usual analyses applied to meta- Mullan, M. J. (1990). Specialist versus general practitio- analytic data also can be applied to odds ratios. For all ner treatment of problem drinkers. Lancet, 336, 915-918. these reasons, then, odds ratios are routinely the most *Ferrell, W. L., & Galassi, I. P. (1981). Assertion training reliable choice among the various effect size measures and human relations training in the treatment of chronic for fourfold tables. alcoholics. The International Journal of the Addictions, 16, 959-968. References Fienberg, S. E. (1980). The analysis of cross-classified cat- egorical data. Cambridge, MA: MIT Press. References marked with an asterisk indicate studies in- Fisher, R. A. (1954). Statistical methods for research work- cluded in the meta-analysis. ers (12th ed.). Edinburgh, Scotland: Oliver & Boyd. 352 HADDOCK, RINDSKOPF, AND SHADISH

Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: Wiley.
Fleiss, J. L. (1984). The Mantel-Haenszel estimator in case-control studies with varying numbers of controls matched to each case. American Journal of Epidemiology, 120, 1-3.
Fleiss, J. L. (1994). Measures of effect size for categorical data. In H. M. Cooper & L. V. Hedges (Eds.), Handbook of research synthesis (pp. 245-260). New York: Russell Sage Foundation.
Goldstein, H. (1995). Multilevel statistical models. New York: Halsted Press.
Goldstein, H., Rasbash, J., Plewis, I., Draper, D., Yang, M., Woodhouse, G., & Healy, M. (1998). A user's guide to MLwiN. London: Institute of Education.
Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classification. Journal of the American Statistical Association, 49, 732-764.
Greenland, S. (1989). Generalized Mantel-Haenszel estimators for K 2 x J tables. Biometrics, 45, 183-191.
*Hall, S. M., Tunstall, C., Rugg, D., Jones, R. T., & Benowitz, N. (1985). Nicotine gum and behavioral treatment in smoking cessation. Journal of Consulting and Clinical Psychology, 53, 256-258.
Hamilton, M. A. (1979). Choosing the parameter for a 2 x 2 table or a 2 x 2 x 2 table analysis. American Journal of Epidemiology, 109, 362-375.
*Harackiewicz, J. M., Blair, L. W., Sansone, C., Epstein, J. A., & Stuchell, R. N. (1988). Nicotine gum and self-help manuals in smoking cessation: An evaluation in a medical context. Addictive Behaviors, 13, 319-330.
Hasselblad, V., & Hedges, L. V. (1995). Meta-analysis of screening and diagnostic tests. Psychological Bulletin, 117, 167-178.
*Henggeler, S. W., Borduin, C. M., Melton, G. B., Mann, B. J., Smith, L. A., Hall, J. A., Cone, L., & Fucci, B. R. (1991). Effects of multisystemic therapy on drug use and abuse in serious juvenile offenders: A progress report from two outcome studies. Family Dynamics of Addiction Quarterly, 1, 40-51.
*Hjalmarson, A. I. M. (1984). Effect of nicotine chewing gum in smoking cessation. Journal of the American Medical Association, 252, 2835-2838.
Jacobson, N. S., Follette, W. C., & Revenstorf, D. (1984). Psychotherapy outcome research: Methods for reporting variability and evaluating clinical significance. Behavior Therapy, 15, 336-352.
*Jamrozik, K., Fowler, G., Vessey, M., & Wald, N. (1984). Placebo controlled trial of nicotine chewing gum in general practice. British Medical Journal, 289, 794-797.
*Joanning, H., Quinn, W., Thomas, F., & Mullen, R. (1992). Treating adolescent drug abuse: A comparison of family systems therapy, group therapy, and family drug education. Journal of Marriage and Family Therapy, 18, 345-356.
Kleinbaum, D. G., Kupper, L. L., & Morgenstern, H. (1982). Epidemiologic research: Principles and quantitative methods. New York: Van Nostrand Reinhold.
*Lewis, R. A., Piercy, F. P., Sprenkle, D. H., & Trepper, T. S. (1990). Family-based interventions for helping drug-abusing adolescents. Journal of Adolescent Research, 5, 82-95.
*Madill, M., Campbell, D., Laverty, S. G., Sanderson, R. E., & Vandewater, S. L. (1966). Aversion treatment of alcoholics by succinylcholine-induced apneic paralysis. Quarterly Journal of Studies on Alcohol, 27, 483-509.
Mantel, N., & Hankey, B. F. (1975). The odds ratios of a 2 x 2 contingency table. American Statistician, 29, 143-145.
*McClellan, A. T., Arndt, I. O., Metzger, D. W., Woody, G. E., & O'Brien, C. P. (1993). The effects of psychosocial services in substance abuse treatment. Journal of the American Medical Association, 269, 1953-1959.
Nunnally, J. C. (1979). Psychometric theory (2nd ed.). New York: McGraw-Hill.
*Olson, R. P., Ganley, R., Devine, V. T., & Dorsey, G. C. (1981). Long-term effects of behavioral versus insight-oriented therapy with inpatient alcoholics. Journal of Consulting and Clinical Psychology, 49, 866-877.
*Orford, J., & Edwards, G. (1977). Alcoholism: A comparison of treatment and advice, with a study of the influence of marriage. New York: Oxford University Press.
*Penick, S. B., Carrier, R. N., & Sheldon, J. B. (1969). Metronidazole in the treatment of alcoholism. American Journal of Psychiatry, 125, 99-102.
*Primo, R. V., Terrell, F., & Wener, A. (1972). An aversion-desensitization treatment for alcoholism. Journal of Consulting and Clinical Psychology, 38, 394-398.
*Puska, P., Bjorkqvist, S., & Koskela, K. (1979). Nicotine-containing chewing gum in smoking cessation: A double blind trial with half year follow-up. Addictive Behaviors, 4, 141-146.
Rasbash, J., & Woodhouse, G. (1995). ML-n [Computer software]. London: MLM Project, Institute of Education.
Raudenbush, S. W. (1994). Random effects models. In H. Cooper & L. V. Hedges (Eds.), Handbook of research synthesis (pp. 301-321). New York: Russell Sage Foundation.
*Research Committee of the British Thoracic Society. (1983). Comparison of four methods of smoking withdrawal in patients with smoking related diseases. British Medical Journal, 286, 595-597.
Sandercock, P. (1989). The odds ratio: A useful tool in neurosciences. Journal of Neurology, Neurosurgery, and Psychiatry, 52, 817-820.
*Sannibale, C. (1988). The differential effect of a set of brief interventions on the functioning of a group of "early-stage" problem drinkers. Australian Drug and Alcohol Review, 7, 147-155.
Shadish, W. R., & Haddock, C. K. (1994). Methods for combining effect size estimates. In H. M. Cooper & L. V. Hedges (Eds.), Handbook of research synthesis (pp. 261-281). New York: Russell Sage Foundation.
*Shaffer, J. W., Freinek, W. R., Wolf, S., Foxwell, N. H., & Kurland, A. A. (1964). [...] of a study of nialamide in the treatment of convalescing alcoholics with emphasis on prediction of response. Current Therapeutic Research, 6, 521-531.
Smith, T. C., Spiegelhalter, D. J., & Thomas, A. (1995). Bayesian approaches to random-effects meta-analysis: A comparative study. Statistics in Medicine, 14, 2685-2699.
Somes, G. W. (1986). The generalized Mantel-Haenszel statistic. The American Statistician, 40, 106-108.
Somes, G. W., & O'Brien, K. F. (1985). Mantel-Haenszel statistic. In S. Kotz, N. L. Johnson, & C. B. Read (Eds.), Encyclopedia of statistical sciences (Vol. 5, pp. 407-410). New York: Wiley.
*Stein, L. I., Newton, J. R., & Bowman, R. S. (1975). Duration of hospitalization for alcoholism. Archives of General Psychiatry, 32, 247-252.
Thomas, A., Spiegelhalter, D. J., & Gilks, W. R. (1992). BUGS: A program to perform Bayesian inference using Gibbs sampling. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 4. Oxford, England: Clarendon Press.
Thomas, D. G., & Gart, J. J. (1977). A table of exact confidence limits for differences and ratios of two proportions and their odds ratios. Journal of the American Statistical Association, 72, 73-76.
*Walker, R. D., Donovan, D. M., Kivlahan, D. R., & O'Leary, M. R. (1983). Length of stay, neuropsychological performance, and aftercare: Influences on alcohol treatment outcome. Journal of Consulting and Clinical Psychology, 51, 900-901.
*Ziegler-Driscoll, G. (1977). Family research study at Eagleville Hospital and Rehabilitation Center. Family Process, 16, 175-190.

Received March 7, 1997
Revision received October 23, 1997
Accepted November 10, 1997
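
The arithmetic reported for the addiction example can be retraced from the printed HLM output. The Python sketch below is illustrative only and is not part of the original analysis: the function names are hypothetical, the inputs are the rounded values printed in the text and in Tables 7 and 8, and the 95% range assumes normally distributed true effects. With these rounded inputs the range comes out at roughly -.51 to 1.37, slightly narrower than the -.53 to 1.39 printed in the text, which would correspond to a multiplier of about 2 rather than 1.96.

```python
import math


def true_effect_range(mean, tau, z=1.96):
    """Range expected to hold about 95% of true effects: mean +/- z * sqrt(tau)."""
    sd = math.sqrt(tau)
    return mean - z * sd, mean + z * sd


def birge_ratio(q, df):
    """Birge ratio R_B = Q / df; values above 1 indicate excess variability."""
    return q / df


def subgroup_means(constant, alcohol, smoking):
    """Average log odds ratio per addiction type under the dummy coding of Table 7."""
    return {
        "substance abuse": constant,       # reference category (both dummies zero)
        "alcohol": constant + alcohol,     # constant plus the Alcohol coefficient
        "smoking": constant + smoking,     # constant plus the Smoking coefficient
    }


def proportion_explained(tau_before, tau_after):
    """Share of the between-study variance removed by adding predictors."""
    return (tau_before - tau_after) / tau_before


if __name__ == "__main__":
    # Constant-only model: gamma_0 = .43, tau = 0.22989, chi-square(23) = 51.992.
    low, high = true_effect_range(mean=0.43, tau=0.22989)
    print(f"range of true effects: {low:.2f} to {high:.2f}")
    print(f"Birge ratio, constant-only model: {birge_ratio(51.992, 23):.2f}")

    # Model with Alcohol and Smoking dummies (Tables 7 and 8).
    for group, mean in subgroup_means(1.284922, -1.251355, -0.870044).items():
        print(f"{group}: {mean:.3f}")
    # The text prints 0.22898 (not 0.22989) in this particular calculation.
    print(f"variance explained: {proportion_explained(0.22898, 0.14182):.2f}")
    print(f"Birge ratio, residual: {birge_ratio(37.642, 21):.2f}")
```

Keeping each quantity in its own small function makes it easy to rerun the same checks with a different baseline category, for example when smoking or alcohol is used as the reference group.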
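
The two conversions from the logistic metric to a standardized mean difference described in the section on mixing continuous and dichotomous outcomes are simple rescalings. The sketch below, with hypothetical function names, applies both to the average log odds ratio of .43 from the addiction example. It assumes that the same factor can be applied to a standard error, which holds for any linear rescaling; it is not a full treatment of the variance of d.

```python
import math

COX_DIVISOR = 1.65                          # divisor attributed to Cox (1970) in the text
LOGISTIC_FACTOR = math.sqrt(3) / math.pi    # multiplier from Hasselblad and Hedges (1995)


def log_odds_to_d_cox(log_odds_ratio):
    """Approximate d by dividing the log odds ratio by 1.65."""
    return log_odds_ratio / COX_DIVISOR


def log_odds_to_d_logistic(log_odds_ratio):
    """Approximate d by multiplying the log odds ratio by sqrt(3)/pi."""
    return log_odds_ratio * LOGISTIC_FACTOR


def rescale_standard_error(se_log_odds_ratio, factor):
    """A linear rescaling of the estimate rescales its standard error by the same factor."""
    return se_log_odds_ratio * factor


if __name__ == "__main__":
    log_or = 0.43  # average log odds ratio from the constant-only model
    print(f"implied divisor for sqrt(3)/pi: {1 / LOGISTIC_FACTOR:.2f}")
    print(f"d (Cox): {log_odds_to_d_cox(log_or):.3f}")
    print(f"d (Hasselblad and Hedges): {log_odds_to_d_logistic(log_or):.3f}")
```

The two results differ by roughly 10%, which reflects the gap between the divisors 1.65 and 1.81; whichever is chosen should be applied consistently to every dichotomous outcome in a synthesis.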
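
The shrinkage visible in Table 9 can be illustrated with the usual Empirical Bayes weighting from the hierarchical linear model literature, in which each observed effect is pulled toward its subgroup mean in proportion to its sampling variance. The sketch below assumes that form, $\delta_j^* = \lambda_j d_j + (1 - \lambda_j)\hat{d}_j$ with $\lambda_j = \tau/(\tau + v_j)$; whether HLM used exactly this computation is an assumption, and the function names are hypothetical. The sampling variances $v_j$ are not reported in this section, so the example backs out the weight implied by Table 9 for one study rather than taking a variance from the data.

```python
def shrinkage_weight(tau, v):
    """Weight on the observed effect: tau / (tau + v), between 0 and 1."""
    return tau / (tau + v)


def empirical_bayes(d, v, tau, predicted):
    """Shrink the observed effect d toward the model-predicted subgroup mean."""
    lam = shrinkage_weight(tau, v)
    return lam * d + (1 - lam) * predicted


def implied_weight(observed, eb_estimate, predicted):
    """Weight that would reproduce a reported Empirical Bayes estimate."""
    return (eb_estimate - predicted) / (observed - predicted)


def implied_variance(weight, tau):
    """Sampling variance consistent with a given weight and tau (derived, not reported)."""
    return tau * (1 - weight) / weight


if __name__ == "__main__":
    tau = 0.14182                         # residual variance from Table 8
    alcohol_mean = 1.284922 - 1.251355    # subgroup mean from Table 7
    # Bowers & Al-Redha (1990), Table 9: observed 1.673900, Empirical Bayes 0.151660.
    w = implied_weight(1.673900, 0.151660, alcohol_mean)
    v = implied_variance(w, tau)
    print(f"implied weight on the observed effect: {w:.3f}")
    print(f"implied sampling variance (an inference, not a reported value): {v:.2f}")
    print(f"reconstructed estimate: {empirical_bayes(1.673900, v, tau, alcohol_mean):.6f}")
```

A small weight, as implied here for a study with a large observed effect, is what one would expect when the sampling variance is large relative to $\tau$, which matches the observation in the text that many alcohol studies have very small samples.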