Power Analysis and Sample Size Determination: Concepts and Software Tools

Roger J. Lewis, M.D., Ph.D., Department of Emergency Medicine, Harbor-UCLA Medical Center, Torrance, California

Presented at the 2000 Annual Meeting of the Society for Academic Emergency Medicine (SAEM) in San Francisco, California.

Table of Contents:
- Introduction
- Classical Hypothesis Testing
- Type I Error, Type II Error, Power
- Determinants of Power and Their Interaction
- Features of Sample Size Formulas and "Rules of Thumb"
- Precision versus Effect Size
- Post hoc Power Analysis
- Ethical Issues in Sample Size Determination
- Unknown or Poorly-Known Variance
- Examples and Software Tools
- Summary and Conclusions
- References


Introduction

Determining the appropriate sample size for an investigation, whether it is a clinical investigation or a laboratory animal investigation, is an essential step in the statistical design of the project. An adequate sample size helps ensure that the study will yield reliable information, regardless of whether the ultimate data suggest a clinically important difference between the treatments being studied, or the study is intended to measure the accuracy of a diagnostic test or the incidence of a disease. Unfortunately, many studies published in both the emergency medicine and the general medical literature are conducted with inadequate sample sizes, making the interpretation of negative results difficult.

Conducting a study with an inadequate sample size is not only futile, it is also unethical. Exposing human subjects or laboratory animals to the risks inherent in research is only justifiable if there is a realistic possibility that the results will benefit those subjects or future subjects, or lead to substantial scientific progress.

Historically, learning the techniques of sample size determination and power analysis has been difficult, because of relatively complex mathematical considerations and numerous different formulas. Luckily, over the last several years there has been a tremendous improvement in the availability, ease of use, and capability of commercial sample-size determination software. Many of these programs now allow the determination of sample size and the resulting power for a wide variety of clinical trial designs and analysis methods.

The primary goal of this lecture is to teach the concepts underlying sample size determination and power analysis. In addition, the use of two commercially available software programs will be illustrated, as the use of such programs has largely replaced hand calculation for the determination of sample size.

Classical Hypothesis Testing

The term "classical hypothesis testing" refers to the use of P values to determine which of two possible hypotheses or conclusions to draw from available data. The first hypothesis to be considered, the null hypothesis, is the conclusion that there is no difference between the groups being compared, with respect to the variable of interest. For example, when comparing the effect of two treatments on survival, the null hypothesis would be that the survival rate in patients receiving one treatment is the same as the survival rate in the group receiving the other treatment. To be valid and useful, the null hypothesis must be defined prior to the beginning of the study and ought to be clinically or scientifically relevant. In other words, if one were able to conclusively prove that the null hypothesis is wrong, that should be an important result.

Mathematically, the null hypothesis may be represented in many different ways. For example, when considering the effect of a new treatment on survival, the null hypothesis would be that the proportion of patients surviving is the same in each group, and could be written as $P_1 = P_2$, or written in terms of the relative risk (RR) or the odds ratio (OR) being equal to one, or in terms of a coefficient within a logistic regression model being equal to zero. In each case, the null hypothesis is the idea that there is no difference between the groups being compared.

The other conclusion to be considered, the alternative hypothesis, is the idea that there is a difference between the groups being compared with respect to the variable of interest. Furthermore, the alternative hypothesis must define the size of that difference. In the context of power analysis and sample size determination, the difference is termed the "effect size." The effect size sought by the study is a major determinant of the sample size required, and choosing an appropriate effect size is the first step in sample size determination. Usually, one designs clinical studies to reliably detect an effect which is of the minimum clinically-significant magnitude; in other words, to reliably find the smallest effect size which would result in a change in clinical practice. Determining the minimum clinically-significant effect size is a medical and scientific judgment, not a statistical decision. As will be discussed below, the smaller the effect size sought by a study, the larger the sample size required. Frequently, available time and resources do not allow the conduct of a clinical trial large enough to reliably detect the smallest clinically-significant effect. In these cases, one may choose a larger effect size, with the realization that should the trial result be negative, it will not reliably exclude the possibility of a smaller but clinically-important treatment difference.


Figure 1. Rejecting the null hypothesis: the distribution of the observed result expected under the null hypothesis, centered at zero; results falling in the tails (area α/2 = 0.025 on each side) yield P < 0.05.

Figure 2. The same graphical representation using 4× the sample size: the distribution of results under the null hypothesis is narrower, so a smaller observed difference yields P < 0.05.

In order to choose between the null and the alternative hypotheses, a P value is calculated based on the available data. The P value is the probability, assuming the null hypothesis is true and there is no real difference between the two groups, that one might observe a difference between the groups equal to or larger than the difference actually observed. It is a measure of the apparent contradiction between the observed data and the null hypothesis of no difference. If the P value is less than some predetermined value, denoted α (usually 0.05), then the data are considered to be inconsistent with the null hypothesis, the null hypothesis is rejected, and the alternative hypothesis is chosen as the result of the study. If, on the other hand, the P value is greater than or equal to α, then the null hypothesis is accepted as true.

The process of rejecting the null hypothesis is shown in Figures 1 and 2. The width of the curve, which shows the distribution of results likely under the null hypothesis, depends on both the inherent variability of the outcome variable and on the sample size. For a given sample size, the more variable the outcome data from subject to subject, the wider the curve. As shown in Figure 2, however, for a given inherent variability in the data, a larger sample size decreases the width of the curve.

In the vast majority of clinical and scientific studies, one considers differences in either direction; in other words, one would consider it important if the experimental treatment were superior or inferior to the control treatment. In this setting one calculates two-tailed P values, which include the probability of a result in either direction. A one-tailed P value, in which one considers only the probability of seeing a difference in one direction, is less conservative and more likely to lead to a false positive result.
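As a minimal illustration of the arithmetic (my sketch, not from the original lecture), the following Python snippet computes one- and two-tailed P values for a standardized test statistic z under a normal approximation; the value of z is hypothetical:

```python
from scipy.stats import norm

z = 2.1  # hypothetical standardized statistic (observed difference / its standard error)

p_one_tailed = norm.sf(abs(z))      # probability of a result this extreme in one direction
p_two_tailed = 2 * norm.sf(abs(z))  # probability of a result this extreme in either direction

print(f"one-tailed P = {p_one_tailed:.4f}, two-tailed P = {p_two_tailed:.4f}")
# The two-tailed P value is twice the one-tailed value, which is why
# one-tailed tests are less conservative.
```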

Type I Error, Type II Error, Power

Any method for deriving a conclusion from experimental data carries with it some risk of drawing a false conclusion. There are two types of false conclusions to be considered, called a "type I error" and a "type II error." A type I error occurs when one concludes that a difference exists between the treatment groups when, in fact, it does not; it is a type of false positive. The risk of a type I error, assuming that there really is no difference between the groups, is equal to α. A type II error occurs when one concludes that a difference does not exist between the groups being compared when, in fact, a difference does exist and the difference is equal to or larger than the effect size defined by the alternative hypothesis. A type II error is a type of false negative. The risk of a type II error occurring, assuming there is a difference between the groups exactly equal to that defined by the alternative hypothesis, is denoted β.

When considering the risk of a type II error, one usually thinks in terms of the power of the study, rather than β itself. The power of the study is the probability of obtaining a statistically significant P value if a true difference exists that is equal to the effect size defined by the alternative hypothesis. Since the risk of a type II error, assuming the alternative hypothesis is true, is equal to β, the power of the study is equal to 1 − β. For any particular study design and planned analysis method, the power of a study is determined by the sample size, the effect size, and α.
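To make these relationships concrete, here is a small Python sketch (mine, not part of the lecture) computing power for a two-group comparison of means under a normal approximation with known, equal standard deviation; the numerical inputs are illustrative:

```python
from math import sqrt
from scipy.stats import norm

def power_two_sample(effect_size, sigma, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample z test for a difference in means,
    assuming known and equal standard deviations (normal approximation)."""
    se = sigma * sqrt(2.0 / n_per_group)        # standard error of the difference
    z_crit = norm.ppf(1.0 - alpha / 2.0)        # critical value for two-tailed alpha
    return norm.cdf(effect_size / se - z_crit)  # area beyond the critical value
                                                # under the alternative hypothesis

# Illustrative values: effect size 5, sigma 10, 63 subjects per group
print(power_two_sample(5.0, 10.0, 63))  # approximately 0.80
```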


Determinants of Power and Their Interaction

Once the basic study design and planned analysis method are defined, and given the inherent variability of the outcome to be measured, there are three parameters under our control which define the power of the study: the sample size, the effect size defined by the alternative hypothesis, and α. In fact, considering these three parameters and the power of the study together, fixing any three will allow the determination of the fourth. For example, once one defines the effect size, α, and the desired power, this determines the required sample size. Similarly, if one defines the effect size, α, and the sample size, this will define the power.

Figure 3 shows a graphical representation of power analysis. The left-hand curve shows the distribution of results expected under the null hypothesis of no difference, while the right-hand (dashed) curve shows the distribution of results expected under the alternative hypothesis. The difference in the locations of the peaks of these two curves is the effect size defined by the alternative hypothesis. The vertical line shows the observed result which corresponds to a P value of exactly α, or 0.05. Any result at that location, or to the right of it, will result in rejection of the null hypothesis and the conclusion that the alternative hypothesis is true. Since we are considering two-tailed analyses, the null hypothesis would also be rejected if the result were far to the left of the peak of the first curve, but this would not be consistent with the alternative hypothesis and would require a reconsideration of the assumptions underlying the study. In Figure 3, the power of the study is the area under the right-hand curve which lies to the right of the vertical line. That area is 80% of the area under the curve, representing a power of 0.80.

Figure 3. Graphical representation of power: the distributions of the observed result under the null hypothesis (solid) and the alternative hypothesis (dashed) are separated by the effect size; the area of the alternative curve beyond the P < 0.05 critical value gives a power of 0.80.

If one chooses a smaller value for α, setting a more stringent criterion for a statistically significant P value, this will result in a lower power. This is shown in Figure 4. Because α is smaller, the vertical line, showing the location above which we attain statistical significance, is moved to the right. Now a smaller area under the right-hand curve lies to the right of the vertical line, representing the lower power. Setting a more stringent criterion for a positive result increases the risk of a false negative or type II error.

Figure 4. Effect of a smaller α: with P < 0.01 required for significance, the critical value moves to the right and the power falls to 0.59.

Figure 5 shows the effect of increasing the effect size defined by the alternative hypothesis, while maintaining the values for α and the sample size constant. Because the effect size is larger, the right-hand curve is moved to the right, and a larger fraction of the area under the right-hand curve now lies to the right of the vertical line. Thus, a larger power is achieved with a larger effect size. As mentioned above, the effect size is usually dictated by clinical or scientific considerations, rather than statistical ones. In addition, the value of α is usually set at 0.05. Since the effect size and α are usually fixed, and since the inherent variability in the outcome data is not under our control, the only way left to increase power is to increase the sample size. Figure 6 shows the effect of increasing the sample size on the power of a study. Because the sample size is increased, the curves representing the expected results under the null and alternative hypotheses are narrower. The criterion for statistical significance is moved to the left, because of the narrower curve for outcomes under the null hypothesis.
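The interchangeability of these four quantities is easy to demonstrate in software. As a sketch (my example, not part of the original lecture), the statsmodels package can solve for whichever of the four parameters is left unspecified; here the effect size is expressed in standardized units (difference divided by the standard deviation):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Fix effect size, alpha, and power; solve for the sample size per group.
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"required n per group: {n:.1f}")    # roughly 64

# Fix effect size, alpha, and sample size; solve for power.
p = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=80)
print(f"power with 80 per group: {p:.3f}")  # roughly 0.88
```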


Figure 5. Larger effect size, larger power: with α and the sample size held constant, the alternative-hypothesis curve moves farther to the right of the null curve, and the power rises to 0.95.

Figure 6. Larger sample size, larger power: both curves become narrower, the critical value moves to the left, and the power exceeds 0.9998.

Now virtually all of the area under the right-hand curve falls to the right of the vertical line, showing that the power of the study is increased to more than 0.999.

Features of Sample Size Formulas and "Rules of Thumb"

For each method of statistical analysis and type of study design (unpaired samples, case-control studies, longitudinal studies with repeated measures, etc.) there is a different formula relating sample size to desired power. In preparing this lecture, I performed a search of MEDLINE and Current Contents using the key words "sample size" and "statistical methods" and found over one hundred primary articles containing sample size determination formulas. Luckily, most sample size formulas share common characteristics that can be used to generate rough "rules of thumb" which are useful in estimating the dependence of sample size requirements on various study parameters. For example, one reference gives the first formula in Figure 7 for determining the sample size in a study which utilizes repeated correlated measures of a continuous variable. Another author suggests the second formula in Figure 7 for determining the sample size required to detect a given reduction in mortality resulting from the institution of a disease screening program.

Figure 7. Typical sample size formulas. For repeated measures of continuous data:

$$N = \frac{2\left(\sigma^2 + \sigma_b^2/k\right)\left(Z_{\alpha/2} + Z_\beta\right)^2}{d^2}$$

For the mortality reduction achieved by a screening program:

$$N = \frac{(f+1)\left(Z_{\alpha/2} + Z_\beta\right)^2}{4f\left[\sin^{-1}\!\sqrt{R_t} - \sin^{-1}\!\sqrt{0.70\,R_t}\right]^2}$$

In each of these cases, and many others, mathematical approximations have been introduced which allow the outcome of interest to be treated as if it were normally distributed (hence the use of the Z values). Thus, in the numerator there is usually a term of the form $(Z_{\alpha/2} + Z_\beta)^2$, and in the denominator there is a term which is proportional to the square of the effect size defined by the alternative hypothesis. These features (Figure 8) can be used together with the values for Z shown in Figure 9 to estimate the dependence of the required sample size on changes in the effect size, α, or the desired power. For calculating the dependence of the required sample size on the design parameters, we will consider our "standard" study to use α = 0.05 and have a power of 0.80 (β = 0.20). Using these values, the squared term in the numerator equals 7.84.

Replacing the Z values with those corresponding to α = 0.05 ($Z_{\alpha/2} = 1.96$) and a power of 0.95 ($Z_\beta = 1.64$) results in a squared term in the numerator equal to 12.96. Thus, increasing the power from 0.80 to 0.95 will require a 65% increase in the required number of subjects (Figure 9). Similarly derived "rules" are shown in Figure 10. Because the sample size is a function of the inverse square of the effect size, designing a study to detect an effect size half as large will require four times the number of patients.
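The 65% figure, and the other multipliers shown later in Figure 10, follow directly from the $(Z_{\alpha/2} + Z_\beta)^2$ term. A small Python check (my sketch, not from the lecture):

```python
from scipy.stats import norm

def z_term(alpha, power):
    """The (Z_alpha/2 + Z_beta)^2 numerator term common to many formulas."""
    return (norm.ppf(1 - alpha / 2) + norm.ppf(power)) ** 2

base = z_term(0.05, 0.80)         # about 7.84 for the "standard" study
print(z_term(0.05, 0.95) / base)  # ~1.65: 65% more subjects
print(z_term(0.01, 0.80) / base)  # ~1.49
print(z_term(0.01, 0.95) / base)  # ~2.27
```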


Figure 8. Common features of sample size formulas:

$$N \approx \text{Constant} \times \frac{\left(Z_{\alpha/2} + Z_\beta\right)^2}{\left(\text{Effect Size}\right)^2}$$

Example: with α = 0.05, changing β from 0.20 to 0.05 changes $(1.96 + 0.84)^2 = 7.84$ to $(1.96 + 1.64)^2 = 12.96$, requiring 65% more subjects.

Figure 9. Values of Z:

| α    | $Z_{\alpha/2}$ | Power (β)   | $Z_\beta$ |
|------|----------------|-------------|-----------|
| 0.05 | 1.96           | 0.80 (0.20) | 0.84      |
| 0.01 | 2.58           | 0.90 (0.10) | 1.28      |
|      |                | 0.95 (0.05) | 1.64      |
|      |                | 0.99 (0.01) | 2.33      |

Precision versus Effect Size

Frequently, studies are designed to yield an estimate of some parameter of interest, for example the mean of a continuous variable, a proportion, or a difference in proportions. A difference in proportions can be measured as a raw difference, a relative risk (RR), an odds ratio (OR), and in other ways. In each case, however, the precision of the final estimate can be measured by the width of the 95% confidence interval (CI). The larger the sample size used to calculate the final CI, the narrower the CI will be. Therefore, the desired final CI width can be used, instead of a desired effect size, to determine the appropriate sample size for a study. Because the predicted precision or CI width depends mostly on the variance of the data and much less on the effect size, a study can be planned to yield a given precision or CI width without choosing a likely effect size. An example of such a study will be given later.

Figure 10. Sample size "rules of thumb":
- To detect half the effect size requires four times the number of subjects (inverse square rule).
- If α = 0.05 and β = 0.20 requires n patients, then: α = 0.05 and β = 0.05 requires 1.65 × n; α = 0.01 and β = 0.20 requires 1.49 × n; and α = 0.01 and β = 0.05 requires 2.27 × n.

When designing a study to yield a CI with a given width, the general technique is to choose a sample size so that the "average" resulting CI will have the given width. It is important to realize that, given such a sample size, there is a 50-50 chance that the final width will be narrower or wider than the desired width, even if the estimate used for the variance of the outcome data is accurate. Alternatively, one can choose a sample size so that there is a defined probability (e.g., 0.80 or 0.95) that the final CI width is below a given value. This calculation is more analogous to a typical power calculation, but is more complex and not generally part of commercially-available sample size software.
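As a sketch of the width-based approach (my example, with illustrative numbers, not from the lecture), the sample size giving an expected 95% CI of half-width h for a single mean with assumed standard deviation σ follows from $h = Z_{\alpha/2}\,\sigma/\sqrt{n}$:

```python
from math import ceil
from scipy.stats import norm

def n_for_ci_halfwidth(sigma, halfwidth, confidence=0.95):
    """Sample size so the *expected* CI for a mean has the given half-width.
    As noted in the text, the realized width then has roughly a 50-50 chance
    of being narrower or wider than this target."""
    z = norm.ppf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / halfwidth) ** 2)

# Illustrative: sigma = 10 mmHg, desired 95% CI of +/- 2 mmHg
print(n_for_ci_halfwidth(10.0, 2.0))  # 97 subjects
```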

Post hoc Power Analysis

As mentioned above, a CI width-based calculation can be used, before a study is initiated, to determine the appropriate sample size. In addition, CIs can be calculated after the study is completed, based on the final data, to help in the interpretation of the study results. In contrast, a power calculation can only be performed before a study is initiated, to determine the appropriate sample size. A power calculation is not valid once the study data have been obtained.

How can one interpret data from a negative study in which no power calculation was initially performed? Although tempting, performing a post hoc power analysis, in which one calculates the effect size that could have been found with the actual sample size and with a given power, is invalid and should never be done. The correct approach to analyzing such data is to calculate the 95% CI for the outcome of interest, based on the final data, and use the CI to guide interpretation of the study results. An excellent reference on this topic is reference A-4 by Goodman and Berlin.

Ethical Issues in Sample Size Determination

Because there is some risk in participation in clinical research, it is unethical to enroll patients in a trial that has an inadequate sample size and is unlikely to yield useful information. This issue warrants special consideration in the design of phase I studies and "pilot" studies. Studies involving human subjects (or laboratory animals) should be designed with a large enough sample size that it is highly likely they will yield useful information.

In most clinical trials, patients are enrolled sequentially, and outcome data from patients enrolled early in the trial have been collected before the last patients are enrolled. It is unethical to enroll a patient in a randomized trial of treatments for a significant illness when, because of the number of patients previously studied, the information exists to determine the better treatment. This is true whether or not the investigator has analyzed the previously acquired data and actually knows which treatment is better. On the other hand, when the difference in the treatments being compared is difficult to predict, it is also difficult to know how many patients will be required to demonstrate the difference. Thus, the investigator frequently does not know whether, at some point midway through the study, the information required to determine the best treatment actually exists.

In this case, a sequential or group sequential clinical trial design should be used. In a sequential or group sequential trial, the accumulating data are analyzed after every patient or group of patients, in order to determine if the trial can be stopped and a reliable conclusion drawn from the available data. Such a trial design safeguards against unnecessarily continuing a trial if the effect is larger than anticipated, or if there is an unexpected negative effect. In a sense, a sequential or group sequential trial adjusts itself based on the accumulating results. Although the design of sequential or group sequential trials is beyond the scope of this lecture, some references have been included.

Unknown or Poorly-Known Variance

It is particularly difficult to plan a study if the variability or variance of the main outcome measure is completely unknown or very poorly known. Although it is generally not under our control, the expected variance is an important factor in sample size determination, whether based on effect size and power or on the desired width of a CI. It is impossible to determine an appropriate sample size without estimating the variance in the data. There are two related approaches to solving this problem when designing a research study.

The first, which is applicable even if the variance of the data to be obtained is completely unknown, consists of conducting a pilot study whose primary purpose is to provide a data-based estimate of the variance of the outcome measure. Although the details are beyond the scope of this lecture, a number of authors have provided specific plans for designing such pilot studies and for incorporating the data from the pilot phase into the final study results. Thus, the outcome information derived during the pilot study is not lost.

In many cases, although the variance is poorly known, one can obtain a rough estimate of its likely magnitude. From a sample-size-determination point of view, this situation is analogous to the situation in which the likely effect size is difficult to predict. In this case, one can design a sequential or group sequential clinical trial, based on the maximum variance that is likely and the desired effect size. If, in fact, the variance in the real data is smaller, this will reduce the required sample size and the trial will terminate during one of the planned interim analyses.

Examples and Software Tools

In this section, a number of sample size calculations will be illustrated, including determinations for studies comparing continuous variables and studies comparing proportions. The use of two software tools will be illustrated. The first program is PASS 2000 by NCSS (http://www.ncss.com). The second program is Power and Precision 2.0 by Biostat (http://www.powerandprecision.com). Both of these

programs run on Microsoft Windows 95, 98, and Windows NT 4.0. Trial versions of these programs are available free of charge via the Internet.

Our first example will be the sample size calculation for a study whose goal is to compare two treatments for hypotension. The primary outcome will be the change in systolic blood pressure, denoted ΔSBP. We estimate, based on other studies, that the standard deviation of ΔSBP is 10 mmHg, and, based on clinical considerations, the effect size to be sought is 5 mmHg. We will assume that student's t test and the Wilcoxon rank sum test are to be used for comparing the two groups.

As you recall, student's t test is most frequently used to compare the means of two sets of numbers (continuous data). Student's t test requires two assumptions: that the data are normally distributed and that there is equal variance in the two groups. Different forms of student's t test exist for paired and unpaired data. In unpaired data, a single measurement is taken on each subject and there is no specific relationship between pairs of subjects in the groups being compared. Paired data occur when two measurements are taken on each subject and the difference between these measurements is the quantity of interest, or when there is pairwise matching of subjects in the two treatment groups.

The Wilcoxon rank sum test is a nonparametric test which compares the medians of two sets of numbers (continuous or "roughly" continuous data). The Wilcoxon rank sum test makes no assumption regarding the normality of the data. Because it does not require the normality assumption, this test should generally be used instead of student's t test when comparing two groups of numbers, unless there is very good reason to believe the data are normally distributed. The use of a nonparametric test does, in general, result in some loss of power compared to a parametric test; this will be quantified below.

Assuming a relatively large sample size and that the variance is equal in the two groups and known, there is a simple formula for the sample size requirement of student's t test for unpaired samples (Figure 11).

Figure 11. Manual calculation for the unpaired t test (known, equal standard deviations):

$$n\ (\text{per group}) = 2\left(Z_{\alpha/2} + Z_\beta\right)^2 \left[\frac{\sigma^2}{d^2}\right]$$

$$n\ (\text{per group}) = 2\,(1.96 + 0.84)^2 \left[\frac{10^2}{5^2}\right] = 62.7, \text{ or 63 patients per group}$$

Figure 12. The NCSS PASS 2000 program screen.
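The same arithmetic in Python (a sketch using the formula above; the 10 mmHg standard deviation and 5 mmHg effect size are the example's assumptions):

```python
from math import ceil
from scipy.stats import norm

def n_per_group_t_test(sigma, d, alpha=0.05, power=0.80):
    """Per-group sample size for an unpaired two-sample comparison of means,
    assuming known, equal SDs (normal approximation)."""
    z_a = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_b = norm.ppf(power)          # 0.84 for power = 0.80
    return 2 * (z_a + z_b) ** 2 * sigma ** 2 / d ** 2

n = n_per_group_t_test(sigma=10.0, d=5.0)
print(n, "->", ceil(n), "patients per group")  # about 62.8 -> 63
```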

Figures 13 and 14. PASS 2000 input screen and numeric output for the unpaired t test, with the standard deviations assumed to be known and equal:

| Power   | N1  | Alpha | Beta    |
|---------|-----|-------|---------|
| 0.35261 | 20  | 0.05  | 0.64739 |
| 0.60878 | 40  | 0.05  | 0.39122 |
| 0.78191 | 60  | 0.05  | 0.21809 |
| 0.88538 | 80  | 0.05  | 0.11462 |
| 0.94244 | 100 | 0.05  | 0.05756 |
| 0.97213 | 120 | 0.05  | 0.02787 |
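For comparison (my sketch, not PASS output), the statsmodels package can generate a similar power table. Because statsmodels uses the noncentral t distribution rather than the known-variance normal approximation, its values track the "without" column of Figure 17 below:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 5.0 / 10.0  # standardized effect: 5 mmHg / SD of 10 mmHg

for n1 in (20, 40, 60, 80, 100, 120):
    power = analysis.power(effect_size=effect_size, nobs1=n1,
                           ratio=1.0, alpha=0.05)
    print(f"N1 = {n1:3d}  power = {power:.5f}")
# e.g., N1 = 80 gives power of about 0.882
```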

Using the value of 10 mmHg for the standard deviation, 5 mmHg for the effect size, and choosing α = 0.05 and a power of 0.80 results in an estimate of 63 patients required per group, or 126 patients for the study. The same calculation can be performed using PASS 2000 (Figures 12-15). PASS 2000 gives both a table of potential sample sizes and their associated power, and a graphical representation of the relationship between sample size and power (Figure 15). PASS 2000 also allows one to remove the assumptions that the standard deviations in the two groups are known and/or equal (Figure 16). Removing these assumptions results in a slight loss of power (Figure 17). For example, with the assumption of known and equal variance, the power with 80 patients in each group is 0.885. Without these assumptions, the power with 80 patients in each group is 0.882.

Figure 15. Graphical output for the unpaired t test (PASS 2000): power versus N1, with M1 = 0.0, M2 = 5.0, S1 = 10.0, S2 = 10.0, Alpha = 0.05, N2 = N1, two-sided t test.

Figure 16. The PASS 2000 screen for the unpaired t test with the assumption of known and equal standard deviations removed.

Figure 17. Relaxing the t test assumption that the variance (SD) is known:

| N1  | Power (with) | Power (without) |
|-----|--------------|-----------------|
| 20  | 0.35261      | 0.33794         |
| 40  | 0.60878      | 0.59815         |
| 60  | 0.78191      | 0.77527         |
| 80  | 0.88538      | 0.88160         |
| 100 | 0.94244      | 0.94043         |
| 120 | 0.97213      | 0.97176         |

Figure 18. The PASS 2000 screen for the unpaired t test with the assumption of normally distributed data removed.

PASS 2000 also allows one to remove the assumption of normality of the data (Figure 18). With this assumption removed, the program calculates the power for the nonparametric Wilcoxon rank sum test. Once the normality assumption is removed, however, one must choose the true underlying distribution of the data. In order to allow a direct comparison of the power between student's t test and the Wilcoxon rank sum test, one can choose the assumption that the data are normally distributed but will be analyzed using the Wilcoxon rank sum test. With this comparison, the use of the nonparametric Wilcoxon rank sum test results in a power of 0.865 for 80 patients in each group, while student's t test resulted in a power of 0.885 and 0.882 with and without the assumption of known and equal variance, respectively (Figure 19). Thus, the loss of power associated with using the nonparametric test is minimal.
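This small power loss is consistent with a standard result not discussed in the lecture: under normality, the asymptotic relative efficiency of the Wilcoxon rank sum test relative to the t test is 3/π ≈ 0.955, so a common planning adjustment (my sketch) is to inflate the t-test sample size by π/3:

```python
from math import pi, ceil

n_t_test = 63                      # per-group n planned for the t test
are = 3.0 / pi                     # asymptotic relative efficiency, ~0.955
n_wilcoxon = ceil(n_t_test / are)  # inflate n to roughly preserve power
print(n_wilcoxon)                  # 66 per group
```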

We will now consider the sample size calculation for a study whose goal is to compare two proportions. Consider a study comparing the efficacy of two different approaches to defibrillation (DF) for the termination of ventricular fibrillation in patients with witnessed cardiac arrest. The null hypothesis would be that the proportions of patients successfully defibrillated in the two groups are equal ($P_1 = P_2$). The alternative hypothesis might be that the proportion of patients successfully defibrillated with the new approach is 0.6 and the proportion successfully defibrillated with the standard technique is 0.4 ($P_1 = 0.4$ and $P_2 = 0.6$). Assume, based on clinical and scientific criteria, that we desire α = 0.05 and a power of 0.95 (β = 0.05). Generally, the chi-square test would be used to analyze this sort of categorical data.

Figure 19. Relaxing the t test assumption that the data are normally distributed. The right-hand column is the power of a Wilcoxon rank sum test performed on data that are normally distributed:

| N1  | Power (with) | Power (without) |
|-----|--------------|-----------------|
| 20  | 0.35261      | 0.32307         |
| 40  | 0.60878      | 0.57581         |
| 80  | 0.88538      | 0.86487         |
| 120 | 0.97213      | 0.96481         |

Figure 20. The chi-square test, used to compare proportions between two or more groups:

|              | Group 1 | Group 2 |
|--------------|---------|---------|
| Good Outcome | A       | B       |
| Bad Outcome  | C       | D       |

Assumption: 5 or more observations expected in every "cell."

The chi-square test (Figure 20) can be used to compare the proportion of subjects with a given outcome in two or more treatment groups. The outcome may have more than two levels (e.g., low, moderate, high). Here, we will consider the simplest case, in which there are two treatment groups and two possible outcomes (successful defibrillation and no successful defibrillation). The formula used to calculate a P value, when using the chi-square test, makes the assumption that there are five or more subjects expected in each of the table cells, under the null hypothesis. When this assumption is not met, because of small sample sizes and/or observed proportions very different from 0.5, then Fisher's exact test should be used instead. Fisher's exact test is also used to compare proportions between two or more treatment groups, and can accommodate two or more levels of outcome. Although Fisher's exact test does not require any assumption regarding the minimum number of expected subjects in any cell, there is an assumption about the "marginal" frequencies in the table.

Figure 21 shows the user interface for determining the required sample size in this setting, using the program Power and Precision 2.0. Finding the appropriate sample size is an iterative process, in which the user adjusts the number of patients in each group and instantly sees the resulting power in the "Power" bar. In addition, this program allows one to select a variety of different sample size formulas, so one can see whether the choice of sample size formula substantially alters the sample size estimate.
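As a sketch of the decision this paragraph describes (my example; the counts are hypothetical), one can compute the expected cell counts and fall back to Fisher's exact test when any expected count is below 5:

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2x2 table: rows = outcome (good/bad), columns = treatment group
table = np.array([[24, 16],
                  [16, 24]])

chi2, p_chi2, dof, expected = chi2_contingency(table)
if (expected >= 5).all():
    print(f"chi-square test: P = {p_chi2:.3f}")
else:
    # Expected counts too small for the chi-square approximation
    odds_ratio, p_fisher = fisher_exact(table)
    print(f"Fisher's exact test: P = {p_fisher:.3f}")
```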

Figure 21. The Power and Precision 2.0 screen for comparing two proportions.

Figure 22. Verbal summaries of the calculation provided by Power and Precision.


Power and Precision 2.0 also provides verbal summaries of the sample size calculation results (Figure 22). The program PASS 2000 can also be used to perform this sample size calculation. The user interface for this program is shown in Figure 23; the parameters for the study are entered into the first screen and a "Run" button is pushed. The program then produces a second window, which contains both a text table of the sample size calculation results and graphical results. Figure 24 shows part of the text results, while Figure 25 shows the graphical output.

Figure 23. The PASS 2000 screen for comparing two proportions.

It is interesting to note that the program Power and Precision 2.0 suggested a sample size of 180 patients per group, while PASS 2000 suggests a sample size of 160 patients per group. This results from different choices of the specific sample size formula used.

Figure 24. Edited PASS 2000 output for comparing two proportions. Null hypothesis: P1 = P2. Alternative hypothesis: P1 ≠ P2. Arcsine transformation used.

| Power   | N1  | N2  | P1   | P2   | Alpha | Beta    |
|---------|-----|-----|------|------|-------|---------|
| 0.92064 | 140 | 140 | 0.40 | 0.60 | 0.05  | 0.07936 |
| 0.94971 | 160 | 160 | 0.40 | 0.60 | 0.05  | 0.05029 |
| 0.96859 | 180 | 180 | 0.40 | 0.60 | 0.05  | 0.03141 |
| 0.98064 | 200 | 200 | 0.40 | 0.60 | 0.05  | 0.01936 |
| 0.98821 | 220 | 220 | 0.40 | 0.60 | 0.05  | 0.01179 |

Figure 25. Graphical output from PASS: power versus N1, by alpha (0.01, 0.05, 0.10), with P1 = 0.40, P2 = 0.60, N2 = N1, two-sided test.
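The arcsine-based figure of roughly 160 per group can be checked in open-source software. A sketch (mine, not PASS output) using Cohen's arcsine effect size h in statsmodels:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h: arcsine-transformed difference between P1 = 0.6 and P2 = 0.4
h = proportion_effectsize(0.6, 0.4)

n = NormalIndPower().solve_power(effect_size=h, alpha=0.05,
                                 power=0.95, ratio=1.0)
print(f"h = {h:.3f}, n per group = {n:.0f}")  # roughly 160
```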

As our next example, consider a study whose goal is to demonstrate, with defined precision, that two treatments yield equivalent emergency department discharge rates for patients with acute asthma exacerbations. The goal is to demonstrate equivalent discharge rates, rather than to demonstrate a defined difference. We will assume that the proportion discharged from the ED after standard therapy is 0.75. We want to demonstrate equivalence within a 10% absolute difference. Figure 26 shows the user interface when using the program Power and Precision 2.0 for this calculation. In this case, one enters the identical proportion for each of the populations and adjusts the sample size per group while observing the power to demonstrate equivalence (see arrows).

Figure 26. The Power and Precision 2.0 screen for equivalence of two proportions.
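As a rough illustration of the underlying calculation (my sketch, not the formula used by Power and Precision; the one-sided α = 0.05 and the power target are assumptions), a normal-approximation "two one-sided tests" (TOST) analysis with truly identical proportions p = 0.75 and an equivalence margin of 0.10 gives:

```python
from math import sqrt
from scipy.stats import norm

def tost_power_two_proportions(p, margin, n_per_group, alpha=0.05):
    """Power to declare equivalence (TOST, normal approximation)
    when the two proportions are truly identical."""
    se = sqrt(2 * p * (1 - p) / n_per_group)  # SE of the observed difference
    z_crit = norm.ppf(1 - alpha)              # one-sided critical value
    return max(0.0, 2 * norm.cdf(margin / se - z_crit) - 1)

for n in (200, 300, 322, 400):
    print(n, round(tost_power_two_proportions(0.75, 0.10, n), 3))
# Under these assumptions, ~322 per group gives power of about 0.80.
```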

Summary and Conclusions Determining the appropriate sample size for a clinical or animal study is a fundamental part of sound and ethical study design. Performing a valid sample size calculation requires estimates of the variability in the data, as well as defining the effect size sought or the desired precision, the allowable

error rates (α and power), and a planned method of analysis. Although the concepts underlying power analysis and sample size determination are relatively few (at least compared to other statistical topics), the large number of different study designs and analysis methods results in a staggering number of different sample size formulas. Because of this, the use of commercial power-analysis software is extremely helpful.

References

A. General and Introductory References
1. Meinert CL. Clinical Trials: Design, Conduct, and Analysis. New York: Oxford University Press, 1986.
2. Florey CDV. Sample Size for Beginners. BMJ 1993;306:1181-1184.
3. Moher D, Dulberg CS, Wells GA. Statistical Power, Sample Size, and Their Reporting in Randomized Controlled Trials. JAMA 1994;272:122-124.
4. Goodman SN, Berlin JA. The Use of Predicted Confidence Intervals When Planning Experiments and the Misuse of Power When Interpreting Results. Annals of Internal Medicine 1994;121:200-206.
5. Lachin JM. Introduction to Sample Size Determination and Power Analysis for Clinical Trials. Controlled Clinical Trials 1981;2:93-113.
6. Dupont WD, Plummer WD Jr. Power and Sample Size Calculations: A Review and Computer Program. Controlled Clinical Trials 1990;11:116-128.
7. Donner A. Approaches to Sample Size Estimation in the Design of Clinical Trials—A Review. Statistics in Medicine 1984;3:199-214.

B. Sample Size Determination for Tests of Continuous Variables
1. Cohen J. Statistical Power Analysis for the Behavioral Sciences. New York: Academic Press, 1969.
2. Pearson ES, Hartley HO. Biometrika Tables for Statisticians. Third Edition. Cambridge: Cambridge University Press, 1970 (Volume I).
3. Day SJ, Graham DF. Sample Size Estimation for Comparing Two or More Treatment Groups in Clinical Trials. Statistics in Medicine 1991;10:33-43.

C. Sample Size Determination for Tests of Proportions
1. Fleiss JL. Statistical Methods for Rates and Proportions. Second Edition. New York: Wiley & Sons, 1981.
2. Casagrande JT, Pike MC, Smith PG. An Improved Approximate Formula for Calculating Sample Sizes for Comparing Two Binomial Distributions. Biometrics 1978;34:483-486.
3. Gordon I, Watson R. The Myth of Continuity-Corrected Sample Size Formulae. Biometrics 1996;52:71-76.
4. Feigl P. A Graphical Aid for Determining Sample Size When Comparing Two Independent Proportions. Biometrics 1978;34:111-122.
5. Haseman JK. Exact Sample Sizes for Use with the Fisher-Irwin Test for 2 × 2 Tables. Biometrics 1978;34:106-109.
6. Thomas RG, Conlon M. Sample Size Determination Based on Fisher's Exact Test for Use in 2 × 2 Comparative Trials with Low Event Rates. Controlled Clinical Trials 1992;13:134-147.
7. O'Neill RT. Sample Sizes for Estimation of the Odds Ratio in Unmatched Case-Control Studies. American Journal of Epidemiology 1984;120:145-153.
8. Lemeshow S, Hosmer DW, Klar J. Sample Size Requirements for Studies Estimating Odds Ratios or Relative Risks. Statistics in Medicine 1988;7:759-764.
9. Whitehead J. Sample Size Calculations for Ordered Categorical Data. Statistics in Medicine 1993;12:2257-2271.


10. Roebruck P, Kuhn A. Comparison of Tests and Sample Size Formulae for Proving Therapeutic Equivalence Based on the Difference of Binomial Probabilities. Statistics in Medicine 1995;14:1583-1594.
11. Lubin JH, Gail MH. On Power and Sample Size for Studying Features of the Relative Odds of Disease. American Journal of Epidemiology 1990;131:552-566.

D. Sample Size Determination for Time-to-Event (Survival) Data
1. Lakatos E, Lan KKG. A Comparison of Sample Size Methods for the Logrank Statistic. Statistics in Medicine 1992;11:179-191.
2. Schoenfeld DA, Richter JR. Nomograms for Calculating the Number of Patients Needed for a Clinical Trial with Survival as an Endpoint. Biometrics 1982;38:163-170.

E. Sample Size Determination for Receiver Operating Characteristic (ROC) Analysis
1. Hanley JA, McNeil BJ. The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve. Radiology 1982;143:29-36.
2. Obuchowski NA. Computing Sample Size for Receiver Operating Characteristic Studies. Investigative Radiology 1994;29:238-243.
3. Obuchowski NA, McClish DK. Sample Size Determination for Diagnostic Accuracy Studies Involving Binormal ROC Curve Indices. Statistics in Medicine 1997;16:1529-1542.

F. Sample Size Determination for Logistic and Poisson Regression
1. Whittemore AS. Sample Size for Logistic Regression with Small Response Probability. Journal of the American Statistical Association 1981;76:27-32.
2. Hsieh FY. Sample Size Tables for Logistic Regression. Statistics in Medicine 1989;8:795-802.
3. Flack VF, Eudey TL. Sample Size Determinations Using Logistic Regression with Pilot Data. Statistics in Medicine 1993;12:1079-1084.
4. Bull SB. Sample Size and Power Determination for a Binary Outcome and an Ordinal Exposure when Logistic Regression is Planned. American Journal of Epidemiology 1993;137:676-684.
5. Signorini DF. Sample Size for Poisson Regression. Biometrika 1991;78:446-450.

G. Sample Size Determination When Repeated Measurements are Planned
1. Lui KJ, Cumberland WG. Sample Size Requirement for Repeated Measurements in Continuous Data. Statistics in Medicine 1992;11:633-641.
2. Lipsitz SR, Fitzmaurice GM. Sample Size for Repeated Measures Studies with Binary Responses. Statistics in Medicine 1994;13:1233-1239.

H. Sample Size Determination Based on Precision
1. Greenland S. On Sample-size and Power Calculations for Studies Using Confidence Intervals. American Journal of Epidemiology 1988;128:231-237.
2. Samuels ML, Lu TFC. Sample Size Requirement for the Back-of-the-Envelope Binomial Confidence Interval. American Statistician 1992;46:228-231.
3. Buderer NMF. Statistical Methodology: I. Incorporating the Prevalence of Disease into the Sample Size Calculation for Sensitivity and Specificity. Academic Emergency Medicine 1996;3:895-900.
4. Satten GA, Kupper LL. Sample Size Requirements for Interval Estimation of the Odds Ratio. American Journal of Epidemiology 1990;131:177-184.
5. Streiner DL. Sample-Size Formulae for Parameter Estimation. Perceptual and Motor Skills 1994;78:275-284.
6. Beal SL. Sample Size Determination for Confidence Intervals on the Population Mean and on the Difference Between Two Population Means. Biometrics 1989;45:969-977.


I. Sample Size Determination for Paired Samples
1. Dupont WD. Power Calculations for Matched Case-Control Studies. Biometrics 1988;44:1157-1168.
2. Parker RA, Bregman DJ. Sample Size for Individually Matched Case-Control Studies. Biometrics 1986;42:919-926.
3. Nam JM. Sample Size Determination for Case-Control Studies and the Comparison of Stratified and Unstratified Analyses. Biometrics 1992;48:389-395.
4. Lu Y, Bean JA. On the Sample Size for One-Sided Equivalence of Sensitivities Based Upon McNemar's Test. Statistics in Medicine 1995;14:1831-1839.
5. Lachenbruch PA. On the Sample Size for Studies Based Upon McNemar's Test. Statistics in Medicine 1992;11:1521-1525.
6. Lachin JM. Power and Sample Size Evaluation for the McNemar Test with Application to Matched Case-Control Studies. Statistics in Medicine 1992;11:1239-1251.
7. Royston P. Exact Conditional and Unconditional Sample Size for Pair-Matched Studies with Binary Outcome: A Practical Guide. Statistics in Medicine 1993;12:699-712.
8. Nam JM. Establishing Equivalence of Two Treatments and Sample Size Requirements in Matched-Pairs Design. Biometrics 1997;53:1422-1430.

J. Sample Size Determination for Measurement of Agreement
1. Donner A, Eliasziw M. A Goodness-of-fit Approach to Inference Procedures for the Kappa Statistic: Confidence Interval Construction, Significance-Testing and Sample Size Estimation. Statistics in Medicine 1992;11:1511-1519.

K. Issues Surrounding Estimating Variance, Sample Size Re-estimation Based on Interim Data
1. Birkett MA, Day SJ. Internal Pilot Studies for Estimating Sample Size. Statistics in Medicine 1994;13:2455-2463.
2. Gould AL. Planning and Revising the Sample Size for a Trial. Statistics in Medicine 1995;14:1039-1051.
3. Browne RH. On the Use of a Pilot Sample for Sample Size Determination. Statistics in Medicine 1995;14:1933-1940.
4. Shih WJ, Zhao PL. Design for Sample Size Re-estimation with Interim Data for Double Blind Clinical Trials with Binary Outcomes. Statistics in Medicine 1997;16:1913-1923.

L. Studies with Planned Interim Analyses
1. Pocock SJ. Group Sequential Methods in the Design and Analysis of Clinical Trials. Biometrika 1977;64:191-199.
2. O'Brien PC, Fleming TR. A Multiple Testing Procedure for Clinical Trials. Biometrics 1979;35:549-556.
3. Kim K, DeMets DL. Sample Size Determination for Group Sequential Clinical Trials with Immediate Response. Statistics in Medicine 1992;11:1391-1399.
4. Geller NL, Pocock SJ. Interim Analyses in Randomized Clinical Trials: Ramifications and Guidelines for Practitioners. Biometrics 1987;43:213-223.
5. Lewis RJ. An Introduction to the Use of Interim Data Analyses in Clinical Trials. Annals of Emergency Medicine 1993;22:1463-1469.
6. Whitehead J. The Design and Analysis of Sequential Clinical Trials. Second Edition. Chichester: Ellis Horwood, 1992.

M. Ethical Issues
1. Lantos JD. Sample Size: Profound Implications of Mundane Calculations. Pediatrics 1993;91:155-157.
