The size of the study should be
considered early in the planning phase. In some instances, no
formal sample size is ever calculated. Instead, the number of
participants available to the investigators during some period of
time determines the size of the study. Many clinical trials that do
not carefully consider the sample size requirements turn out to
lack the statistical power or ability to detect intervention
effects of a magnitude that has clinical importance. In 1978,
Freiman and colleagues [1] reviewed
the power of 71 published randomized controlled clinical trials
which failed to find significant differences between groups.
“Sixty-seven of the trials had a greater than 10% risk of missing a
true 25% therapeutic improvement, and with the same risk, 50 of the
trials could have missed a 50% improvement.” The situation was not
much improved in 1994, when a similar survey found only 16% of
negative trials had 80% power for a 25% effect, and only 36% for a
50% effect [2]. In other instances,
the sample size estimation may assume an unrealistically large
intervention effect. Thus, the power for more realistic
intervention effects will be low or less than desired. The danger
in studies with low statistical power is that interventions that
could be beneficial are discarded without adequate testing and may
never be considered again. Certainly, many studies do contain
appropriate sample size estimates, but in spite of many years of
critical review many are still too small [3, 4].
This chapter presents an overview of
sample size estimation with some details. Several general
discussions of sample size can be found elsewhere [5–21]. For
example, Lachin [11] and Donner
[9] have each written a more
technical discussion of this topic. For most of the chapter, the
focus is on sample size where the study is randomizing individuals.
In some sections, the concept of sample size for randomizing
clusters of individuals or organs within individuals is
presented.
Fundamental Point
Clinical trials should have sufficient
statistical power to detect differences between groups considered
to be of clinical importance. Therefore, calculation of sample size with provision for
adequate levels of significance and power is an essential part of
planning.
Before a discussion of sample size and
power calculations, it must be emphasized that, for several
reasons, a sample size calculation provides only an estimate of the
needed size of a trial [6]. First,
parameters used in the calculation are estimates, and as such, have
an element of uncertainty. Often these estimates are based on small
studies. Second, the estimate of the relative effectiveness of the
intervention over the control and other estimates may be based on a
population different from that intended to be studied. Third, the
effectiveness is often overestimated since published pilot studies
may be highly selected and researchers are often too optimistic.
Fourth, during the final planning stage of a trial, revisions of
inclusion and exclusion criteria may influence the types of
participants entering the trial and thus alter earlier assumptions
used in the sample size calculation. Assessing the impact of such
changes in criteria and the screening effect is usually quite
difficult. Fifth, trial experience indicates that participants
enrolled into control groups usually do better than the population
from which the participants were drawn. The reasons are not
entirely clear. One factor could be that participants with the
highest risk of developing the outcome of interest are excluded in
the screening process. In trials involving chronic diseases,
because of the research protocol, participants might receive more
care and attention than they would normally be given, or change
their behavior because they are part of a study, thus improving
their prognosis, a phenomenon sometimes called the Hawthorne or
trial effect [22]. Also, secular
trends toward improved care may result in risk estimates from past
studies being higher than what will be found in current patient
populations [23]. Participants
assigned to the control group may, therefore, be better off than if
they had not been in the trial at all. Finally, sample size
calculations are based on mathematical models that may only
approximate the true, but unknown, distribution of the response
variables.
Due to the approximate nature of sample
size calculations, the investigator should be as conservative as
can be justified while still being realistic in estimating the
parameters used in the calculation. If a sample size is drastically
overestimated, the trial may be judged as unfeasible. If the sample
size is underestimated, there is a good chance the trial will fall
short of demonstrating any differences between study groups or be
faced with the need to justify an increase in sample size or an
extension of follow-up [24–26]. In
general, as long as the calculated sample size is realistically
obtainable, it is better to overestimate the size and possibly
terminate the trial early (Chap. 16) than to modify the design of an
ongoing trial, or worse, to arrive at incorrect conclusions.
Statistical Concepts
An understanding of the basic
statistical concepts of hypothesis testing, significance level, and
power is essential for a discussion of sample size estimation. A
brief review of these concepts follows. Further discussion can be
found in many basic medical statistics textbooks [27–37] as well
as textbooks on sample size [17–21]. Those
with no prior exposure to these basic statistical concepts might
find these resources helpful.
Except where indicated, trials of one
intervention group and one control group will be discussed. With
some adjustments, sample size calculations can be made for studies
with more than two groups [8]. For
example, in the Coronary Drug Project (CDP), five active
intervention arms were each compared against one control arm
[38]. The Antihypertensive and
Lipid-Lowering Treatment to Prevent Heart Attack trial (ALLHAT)
compared four active intervention arms: three newer drugs against an older one as first-line therapy for hypertension [39]. Both trials used the method of Dunnett
[40], where the number of
participants in the control group is equal to the number assigned
to each of the active intervention groups times the square root of
the number of active groups. The optimal size of the control arm in
the CDP was determined to be 2.24 times the size of each individual
active intervention arm [38]. In
fact, the CDP used a factor of 2.5 in order to minimize variance.
Other approaches are to use the Bonferroni adjustment to the alpha
level [41]; that is, divide the
overall alpha level by the number of comparisons, and use that
revised alpha level in the sample size calculation.
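As a rough illustration of these two devices, the sketch below codes the square-root allocation rule and the Bonferroni division of α; the function names and example numbers are ours, not taken from either trial's design documents.

```python
import math

def dunnett_control_allocation(n_per_active_arm, n_active_arms):
    """Square-root rule: control-arm size = per-arm size times sqrt(number of active arms)."""
    return n_per_active_arm * math.sqrt(n_active_arms)

def bonferroni_alpha(overall_alpha, n_comparisons):
    """Divide the overall significance level by the number of comparisons."""
    return overall_alpha / n_comparisons

# Five active arms versus one control, as in the CDP example above:
print(round(dunnett_control_allocation(1.0, 5), 2))   # 2.24 controls per participant in each active arm
print(bonferroni_alpha(0.05, 5))                      # 0.01 per-comparison significance level
```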
Before computing sample size, the
primary response variable used to judge the effectiveness of
intervention must be identified (see Chap. 3). This chapter will consider sample
size estimation for three basic kinds of outcomes: (1) dichotomous
response variables, such as success and failure, (2) continuous
response variables, such as blood pressure level or a change in
blood pressure, and (3) time to failure (or occurrence of a
clinical event).
For the dichotomous response
variables, the event rates in the intervention group (p I) and the control group
(p C) are
compared. For continuous response variables, the true, but unknown,
mean level in the intervention group (μI) is compared
with the mean level in the control group (μC). For
survival data, a hazard rate, λ, is often compared for the two
study groups or at least is used for sample size estimation. Sample
size estimates for response variables which do not exactly fall
into any of the three categories can usually be approximated by one
of them.
In terms of the primary response
variable, p
I will be
compared with p
C or
μI will be compared with μC. This discussion
will use only the event rates, p I , and p C , although the same concepts
will hold if response levels μI and μC are
substituted appropriately. Of course, the investigator does not
know the true values of the event rates. The clinical trial will
give him only estimates of the event rates, $\hat{p}_I$ and $\hat{p}_C$.
Typically, an investigator tests whether or not a true difference
exists between the event rates of participants in the two groups.
The traditional way of indicating this is in terms of a null
hypothesis, denoted H
0, which states that no difference between the true
event rates exists (H
0: p
C − p
I = 0). The goal is to test H 0 and decide whether or
not to reject it. That is, the null hypothesis is assumed to be
true until proven otherwise.

Because only estimates of the true
event rates are obtained, it is possible that, even if the null
hypothesis is true (p
C − p
I = 0), the observed event rates might by chance be
different. If the observed differences in event rates are large
enough by chance alone, the investigator might reject the null
hypothesis incorrectly. This false positive finding, or
Type I error, should be
made as few times as possible. The probability of this Type I error
is called the significance level and is denoted by α. The
probability of observing differences as large as, or larger than
the difference actually observed given that H 0 is true is called the
“p-value,” denoted as
p. The decision will be to
reject H 0 if
p ≤ α. While the chosen
level of α is somewhat arbitrary, the ones used and accepted
traditionally are 0.01, 0.025 or, most commonly, 0.05. As will be
shown later, as α is set smaller, the required sample size estimate
increases.
If the null hypothesis is not in fact
true, then another hypothesis, called the alternative hypothesis,
denoted by H A,
must be true. That is, the true difference between the event rates
p C and
p I is some
value δ where δ ≠ 0. The observed difference
can be quite small by chance alone even if the alternative
hypothesis is true. Therefore, the investigator could, on the basis
of small observed differences, fail to reject H 0 even when it is not
true. This is called a Type II
error, or a false negative result. The probability of a Type
II error is denoted by β. The value of β depends on the specific
value of δ, the true but unknown difference in event rates between
the two groups, as well as on the sample size and α. The
probability of correctly rejecting H 0 is denoted 1 − β and is
called the power of the study. Power quantifies the potential of
the study to find true differences of various values δ. Since β is
a function of α, the sample size and δ, 1 − β is also a function of
these parameters. The plot of 1 − β versus δ for a given sample
size is called the power curve and is depicted in
Fig. 8.1.
On the horizontal axis, values of δ are plotted from 0 to an upper
value, δA (0.25 in this figure). On the vertical axis,
the probability or power of detecting a true difference δ is shown
for a given significance level and sample size. In constructing
this specific power curve, a sample size of 100 in each group, a
one-sided significance level of 0.05 and a control group event rate
of 0.5 (50%) were assumed. Note that as δ increases, the power to
detect δ also increases. For example, if δ = 0.10 the power is
approximately 0.40. When δ = 0.20 the power increases to about
0.90. Typically, investigators like to have a power (1 − β) of at
least 0.80, but often around 0.90 or 0.95 when planning a study;
that is to have an 80%, 90% or 95% chance of finding a
statistically significant difference between the event rates, given
that a difference, δ, actually exists.


Fig.
8.1
A power curve for increasing differences
(δ) between the control group rate of 0.5 and the intervention
group rate with a one-sided significance level of 0.05 and a total
sample size (2N) of
200
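The points on this curve can be reproduced with the usual normal approximation for comparing two proportions. The short sketch below is ours; it assumes the same design values as the figure (one-sided α = 0.05, 100 participants per group, control group event rate 0.5) and hard-codes $Z_\alpha$ from Table 8.1.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_two_proportions(p_c, delta, n_per_group, z_alpha):
    """Approximate power to detect p_I = p_C - delta with N participants per group."""
    p_i = p_c - delta
    p_bar = (p_c + p_i) / 2.0
    se = math.sqrt(p_bar * (1.0 - p_bar) * (2.0 / n_per_group))
    return normal_cdf(delta / se - z_alpha)

# Points on the curve of Fig. 8.1: one-sided alpha = 0.05 (Z_alpha = 1.645),
# N = 100 per group, control group event rate 0.5.
for delta in (0.05, 0.10, 0.15, 0.20, 0.25):
    print(delta, round(power_two_proportions(0.5, delta, 100, 1.645), 2))
# delta = 0.10 gives power near 0.40; delta = 0.20 gives power near 0.90.
```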
Since the significance level α should
be small, say 0.05 or 0.01, and the power (1 − β) should be large,
say 0.90 or 0.95, the only quantities which are left to vary are δ,
the size of the difference being tested for, and the total sample
size. In planning a clinical trial, the investigator hopes to
detect a difference of specified magnitude δ or larger. One factor
that enters into the selection of δ is the minimum difference
between groups judged to be clinically important. In addition,
previous research may provide estimates of δ. This is part of the
question being tested as discussed in Chap. 3. The exact nature of the
calculation of the sample size, given α, 1 − β and δ is considered
here. It can be assumed that the randomization strategy will
allocate an equal number (N) of participants to each group, since
the variability in the responses for the two groups is
approximately the same; equal allocation provides a slightly more
powerful design than unequal allocation. For unequal allocation to
yield an appreciable increase in power, the variability needs to be
substantially different in the groups [42]. Since equal allocation is usually easier to
implement, it is the more frequently used strategy and will be
assumed here for simplicity.
Before a sample size can be
calculated, classical statistical theory says that the investigator
must decide whether he is interested in differences in one
direction only (one-sided test)—say improvements in intervention
over control—or in differences in either direction (two-sided
test). This latter case would represent testing the hypothesis that
the new intervention is either better or worse than the control. In
general, two-sided tests should be used unless there is a very
strong justification for expecting a difference in only one
direction. An investigator should always keep in mind that any new
intervention could be harmful as well as helpful. However, as
discussed in Chap. 16, some investigators may not be
willing to prove the intervention harmful and would terminate a
study if the results are suggestive of harm. A classic example of
this issue was provided by the Cardiac Arrhythmia Suppression Trial
or CAST [43]. This trial was
initially designed as a one-sided, 0.025 significance level
hypothesis test that anti-arrhythmic drug therapy would reduce the
incidence of sudden cardiac death. Since the drugs were already
marketed, harmful effects were not expected. Despite the one-sided
hypothesis in the design, the monitoring process used a two-sided,
0.05 significance level approach. In this respect, the level of
evidence for benefit was the same for either the one-sided 0.025 or
two-sided 0.05 significance level design. As it turned out, the
trial was terminated early due to increased mortality in the
intervention group (see Chaps. 16 and 17).
If a one-sided test of hypothesis is
chosen, in most circumstances the significance level ought to be
half what the investigator would use for a two-sided test. For
example, if 0.05 is the two-sided significance level typically
used, 0.025 would be used for the one-sided test. As done in the
CAST trial, this requires the same degree of evidence or scientific
documentation to declare a treatment effective, regardless of the
one-sided vs. two-sided question. In this circumstance, a test for
negative or harmful effects might also be done at the 0.025 level.
This in effect, provides two one-sided 0.025 hypothesis tests for
an overall 0.05 significance level.
As mentioned above, the total sample
size 2N (N per arm) is a function of the
significance level (α), the power (1 − β) and the size of the
difference in response (δ) which is to be detected. Changing either
α, 1 − β or δ will result in a change in 2N. As the magnitude of the difference δ
decreases, the larger the sample size must be to guarantee a high
probability of finding that difference. If the calculated sample
size is larger than can be realistically obtained, then one or more
of the parameters in the design may need to be reconsidered. Since
the significance level is usually fixed at 0.05, 0.025, or 0.01,
the investigator should generally reconsider the value selected for
δ and increase it, or keep δ the same and settle for a less
powerful study. If neither of these alternatives is satisfactory,
serious consideration should be given to abandoning the
trial.
Rothman [44] argued that journals should encourage using
confidence intervals to report clinical trial results instead of
significance levels. Several researchers [44–46] discuss
sample size formulas from this approach. Confidence intervals are
constructed by computing the observed difference in event rates and
then adding and subtracting a constant times the standard error of
the difference. This provides an interval surrounding the observed
estimated difference obtained from the trial. The constant is
determined so as to give the confidence interval the correct
probability of including the true, but unknown difference. This
constant is related directly to the critical value used to evaluate
test statistics. Trials often use a two-sided α level test (e.g.,
α = 0.05) and a corresponding (1 − α) confidence interval (e.g.,
95%). If the 1 − α confidence interval excludes zero or no
difference, we would conclude that the intervention has an effect.
If the interval contains zero difference, no intervention effect
would be claimed. However, differences of importance could exist,
but might not be detected or not be statistically significant
because the sample size was too small. For testing the null
hypothesis of no treatment effect, hypothesis testing and
confidence intervals give the same conclusions. However, confidence
intervals provide more information on the range of the likely
difference that might exist. For sample size calculations, the
desired confidence interval width must be specified. This may be
determined, for example, by the smallest difference between two
event rates that would be clinically meaningful and important.
Under the null hypothesis of no treatment effect, half the desired
interval width is equal to the difference specified in the
alternative hypothesis. The sample size calculation methods
presented here do not preclude the presentation of results as
confidence intervals and, in fact, investigators ought to do so.
However, unless there is an awareness of the relationship between
the two approaches, as McHugh and Le [46] have pointed out, the confidence interval
method might yield a power of only 50% to detect a specified
difference. This can be seen later, when sample size calculations
for comparing proportions are presented. Thus, some care needs to
be taken in using this method.
So far, it has been assumed that the
data will be analyzed only once at the end of the trial. However,
as discussed in Chaps. 16 and 17, the response variable data may be
reviewed periodically during the course of a study. Thus, the
probability of finding significant differences by chance alone is
increased [47]. This means that
the significance level α may need to be adjusted to compensate for
the increase in the probability of a Type I error. For purposes of
this discussion, we assume that α carries the usual values of 0.05,
0.025 or 0.01. The sample size calculation should also employ the
statistic which will be used in data analysis. Thus, there are many
sample size formulations. Methods that have proven useful will be
discussed in the rest of this chapter.
Dichotomous Response Variables
We shall consider two cases for
response variables which are dichotomous, that is, yes or no,
success or failure, presence or absence. The first case assumes two
independent groups or samples [48–59]. The
second case is for dichotomous responses within an individual, or
paired responses [60–64].
Two Independent Samples
Suppose the primary response variable
is the occurrence of an event over some fixed period of time. The
sample size calculation should be based on the specific test
statistic that will be employed to compare the outcomes. The null
hypothesis $H_0$ ($p_C - p_I = 0$) is compared to an alternative hypothesis $H_A$ ($p_C - p_I \ne 0$). The estimates of $p_I$ and $p_C$ are $\hat{p}_I = r_I/N_I$ and $\hat{p}_C = r_C/N_C$, with $r_I$ and $r_C$ being the number of events in the intervention and control groups and $N_I$ and $N_C$ being the number of participants in each group. The usual test statistic for comparing such dichotomous or binomial responses is
$$ Z = \frac{\hat{p}_C - \hat{p}_I}{\sqrt{\bar{p}\left(1-\bar{p}\right)\left(1/N_C + 1/N_I\right)}} $$
where $\bar{p} = \left(r_I + r_C\right)/\left(N_I + N_C\right)$.
The square of the Z
statistic is algebraically equivalent to the chi-square statistic,
which is often employed as well. For large values of N I and N C, the statistic
Z has approximately a
normal distribution with mean 0 and variance 1. If the test
statistic Z is larger in
absolute value than a constant Z α, the investigator will
reject H 0 in
the two-sided test.
The constant Z α is often referred to as
the critical value. The probability of a standard normal random
variable being larger in absolute value than Z α is α. For a one-sided
hypothesis, the constant Z
α is chosen such that the probability that Z is greater (or less) than
Z α is α. For a
given α, Z α is
larger for a two-sided test than for a one-sided test
(Table 8.1). Z
α for a two-sided test with α = 0.10 has the same value
as Z α for a
one-sided test with α = 0.05. While a smaller sample size can be
achieved with a one-sided test compared to a two-sided test at the
same α level, we in general do not recommend this approach as
discussed earlier.
Table 8.1 $Z_\alpha$ for sample size formulas for various values of α

| α | $Z_\alpha$ (one-sided test) | $Z_\alpha$ (two-sided test) |
|---|---|---|
| 0.10 | 1.282 | 1.645 |
| 0.05 | 1.645 | 1.960 |
| 0.025 | 1.960 | 2.240 |
| 0.01 | 2.326 | 2.576 |
The sample size required for the
design to have a significance level α and a power of 1 − β to
detect true differences of at least δ between the event rates
p I and
p C can be
expressed by the formula [11]:
$$ 2N = \frac{2\left[Z_\alpha\sqrt{2\bar{p}\left(1-\bar{p}\right)} + Z_\beta\sqrt{p_C\left(1-p_C\right)+p_I\left(1-p_I\right)}\right]^2}{\left(p_C - p_I\right)^2} $$
where 2N = total sample size (N participants/group) with $\bar{p} = \left(p_C + p_I\right)/2$;
Z α is the
critical value which corresponds to the significance level α; and
Z β is the value
of the standard normal value not exceeded with probability β.
Z β corresponds
to the power 1 − β (e.g., if 1 − β = 0.90, Z β = 1.282). Values of
Z α and
Z β are given in
Tables 8.1 and 8.2 for several values of α and 1 − β. More
complete tables may be found in most introductory textbooks
[27–29, 31,
33–37, 51], sample
size texts [17–21, 65], or by
using software packages and online resources [66–73]. Note
that the definition of $\bar{p}$ given earlier is equivalent to the definition of $\bar{p}$ given here when $N_I = N_C$; that is, when the two study groups are of equal size. An alternative to the above formula is given by
$$ 2N = \frac{4\left(Z_\alpha + Z_\beta\right)^2\,\bar{p}\left(1-\bar{p}\right)}{\left(p_C - p_I\right)^2} $$
These two formulas give approximately the same answer and either may be used for the typical clinical trial.
Table 8.2 $Z_\beta$ for sample size formulas for various values of power (1 − β)

| 1 − β | $Z_\beta$ |
|---|---|
| 0.50 | 0.00 |
| 0.60 | 0.25 |
| 0.70 | 0.53 |
| 0.80 | 0.84 |
| 0.85 | 1.036 |
| 0.90 | 1.282 |
| 0.95 | 1.645 |
| 0.975 | 1.960 |
| 0.99 | 2.326 |

Example: Suppose the annual event rate
in the control group is anticipated to be 20%. The investigator
hopes that the intervention will reduce the rate to 15%. The study
is planned so that each participant will be followed for 2 years.
Therefore, if the assumptions are accurate, approximately 40% of
the participants in the control group and 30% of the participants
in the intervention group will develop an event. Thus, the
investigator sets $p_C = 0.40$, $p_I = 0.30$, and, therefore, $\bar{p} = 0.35$. The study is designed as two-sided with a 5% significance level and 90% power. From Tables 8.1 and 8.2, the two-sided 0.05 critical value is 1.96 for $Z_\alpha$ and 1.282 for $Z_\beta$. Substituting these values into the right-hand side of the first sample size formula yields 2N to be
$$ 2N = \frac{2\left[1.96\sqrt{2(0.35)(0.65)} + 1.282\sqrt{(0.4)(0.6)+(0.3)(0.7)}\right]^2}{(0.4 - 0.3)^2} $$
Evaluating this expression, 2N equals 952.3. Using the second formula, 2N is $4(1.96 + 1.282)^2(0.35)(0.65)/(0.4 - 0.3)^2$ or 2N = 956. Therefore, after rounding up to the nearest ten, the calculated total sample size by either formula is 960, or 480 in each group.
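A minimal sketch of this calculation, using both formulas with the Z values from Tables 8.1 and 8.2, follows; the function name and output rounding are ours.

```python
import math

def sample_size_two_proportions(p_c, p_i, z_alpha, z_beta):
    """Total sample size 2N by the detailed formula and by the simpler approximation."""
    p_bar = (p_c + p_i) / 2.0
    term = (z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
            + z_beta * math.sqrt(p_c * (1.0 - p_c) + p_i * (1.0 - p_i)))
    detailed = 2.0 * term ** 2 / (p_c - p_i) ** 2
    simple = 4.0 * (z_alpha + z_beta) ** 2 * p_bar * (1.0 - p_bar) / (p_c - p_i) ** 2
    return detailed, simple

# Example above: p_C = 0.40, p_I = 0.30, two-sided alpha = 0.05, 90% power.
detailed, simple = sample_size_two_proportions(0.40, 0.30, 1.96, 1.282)
print(round(detailed, 1), round(simple, 1))   # 952.3 and 956.5, i.e., about 960 after rounding up
```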


Sample size estimates using the first
formula are given in Table 8.3 for a variety of values of p I and p C, for two-sided tests,
and for α = 0.01, 0.025 and 0.05 and 1 − β = 0.80 or 0.90. For the
example just considered with α = 0.05 (two-sided), 1 − β = 0.90,
p C = 0.4 and
p I = 0.3, the
total sample size using Table 8.3 is 960. This table shows that, as the
difference in rates between groups increases, the sample size
decreases.
Table 8.3 Sample size

Entries are total sample size 2N; column headings give the two-sided significance level (2α) and the power (1 − β).

| $p_C$ | $p_I$ | 2α = 0.01, 0.90 | 2α = 0.01, 0.80 | 2α = 0.025, 0.90 | 2α = 0.025, 0.80 | 2α = 0.05, 0.90 | 2α = 0.05, 0.80 |
|---|---|---|---|---|---|---|---|
| 0.6 | 0.5 | 1470 | 1160 | 1230 | 940 | 1040 | 780 |
| | 0.4 | 370 | 290 | 310 | 240 | 260 | 200 |
| | 0.3 | 160 | 130 | 140 | 110 | 120 | 90 |
| | 0.20 | 90 | 70 | 80 | 60 | 60 | 50 |
| 0.5 | 0.40 | 1470 | 1160 | 1230 | 940 | 1040 | 780 |
| | 0.30 | 360 | 280 | 300 | 230 | 250 | 190 |
| | 0.25 | 220 | 180 | 190 | 140 | 160 | 120 |
| | 0.20 | 150 | 120 | 130 | 100 | 110 | 80 |
| 0.4 | 0.30 | 1360 | 1060 | 1130 | 870 | 960 | 720 |
| | 0.25 | 580 | 460 | 490 | 370 | 410 | 310 |
| | 0.20 | 310 | 250 | 260 | 200 | 220 | 170 |
| 0.3 | 0.20 | 1120 | 880 | 930 | 710 | 790 | 590 |
| | 0.15 | 460 | 360 | 390 | 300 | 330 | 250 |
| | 0.10 | 240 | 190 | 200 | 150 | 170 | 130 |
| 0.2 | 0.15 | 3440 | 2700 | 2870 | 2200 | 2430 | 1810 |
| | 0.10 | 760 | 600 | 630 | 490 | 540 | 400 |
| | 0.05 | 290 | 230 | 240 | 190 | 200 | 150 |
| 0.1 | 0.05 | 1650 | 1300 | 1380 | 1060 | 1170 | 870 |
The event rate in the intervention
group can be written as p
I = (1 − k)
p C where
k represents the proportion
that the control group event rate is expected to be reduced by the
intervention. Figure 8.2 shows the total sample size 2N versus k for several values of p C using a two-sided test
with α = 0.05 and 1 − β = 0.90. In the example where p C = 0.4 and p I = 0.3, the intervention
is expected to reduce the control rate by 25% or k = 0.25. In
Fig. 8.2,
locate k = 0.25 on the
horizontal axis and move up vertically until the curve labeled
p C = 0.4 is
located. The point on this curve corresponds to a 2N of approximately 960. Notice that as
the control group event rate p C decreases, the sample
size required to detect the same proportional reduction increases.
Trials with small event rates (e.g., p C = 0.1) require large
sample sizes unless the interventions have a dramatic effect.

Fig.
8.2
Relationship between total sample size
(2N) and reduction in event
rate (k) for several
control group event rates (p C), with a two-sided
significance level of 0.05 and power of 0.90
In order to make use of the sample
size formula or table, it is necessary to know something about
p C and
k. The estimate for
p C is usually
obtained from previous studies of similar people. In addition, the
investigator must choose k
based on preliminary evidence of the potential effectiveness of the
intervention or be willing to specify some minimum difference or
reduction that he wants to detect. Obtaining this information is
difficult in many cases. Frequently, estimates may be based on a
small amount of data. In such cases, several sample size
calculations based on a range of estimates help to assess how
sensitive the sample size is to the uncertain estimates of
p C,
k, or both. The
investigator may want to be conservative and take the largest, or
nearly largest, estimate of sample size to be sure his study has
sufficient power. The power (1 − β) for various values of δ can be
compared for a given sample size 2N, significance level α, and control
rate p C. By
examining a power curve such as in Fig. 8.1, it can be seen what
power the trial has for detecting various differences in rates, δ.
If the power is high, say 0.80 or larger, for the range of values δ
that are of interest, the sample size is probably adequate. The
power curve can be especially helpful if the number of available
participants is relatively fixed and the investigator wants to
assess the probability that the trial can detect any of a variety
of reductions in event rates.
Investigators often overestimate the
number of eligible participants who can be enrolled in a trial. The
actual number enrolled may fall short of the goal. To examine the
effects of smaller sample sizes on the power of the trial, the
investigator may find it useful to graph power as a function of
various sample sizes. If the power falls far below 0.8 for a sample
size that is very likely to be obtained, he can expand the
recruitment effort, hope for a larger intervention effect than was
originally assumed, accept the reduced power and its consequences
or abandon the trial.
To determine the power, the first sample size equation in this section is solved for $Z_\beta$:
$$ Z_\beta = \frac{\sqrt{N}\left(p_C - p_I\right) - Z_\alpha\sqrt{2\bar{p}\left(1-\bar{p}\right)}}{\sqrt{p_C\left(1-p_C\right)+p_I\left(1-p_I\right)}} $$
where $\bar{p}$ as before is $\left(p_C + p_I\right)/2$. The term $Z_\beta$ can be
translated into a power of 1 − β by use of Table 8.2. For example, let
p C = 0.4 and
p I = 0.3. For a
significance level of 0.05 in a two-sided test of hypothesis,
Z α is 1.96. In
a previous example, it was shown that a total sample of
approximately 960 participants or 480 per group is necessary to
achieve a power of 0.90. Substituting Z α = 1.96, N = 480, p C = 0.4 and p I = 0.3, a value for
Z β = 1.295 is
obtained. The closest value of Z β in
Table 8.2
is 1.282 which corresponds to a power of 0.90. (If the exact value
of N = 476 were used, the
value of Z β
would be 1.282.) Suppose an investigator thought he could get only
350 participants per group instead of the estimated 480. Then
Z β = 0.818
which means that the power 1 − β is somewhat less than 0.80. If the
value of Z β is
negative, the power is less than 0.50. For more details of power
calculations, a standard text in biostatistics [27–29,
31, 33–37,
51] or sample size [17–21,
65] should be consulted.
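The sketch below codes this power calculation, solving the first sample size formula for $Z_\beta$ and converting the result to approximate power with the normal distribution; the function names are ours.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def z_beta_two_proportions(p_c, p_i, n_per_group, z_alpha):
    """Solve the detailed sample size formula for Z_beta, given N participants per group."""
    p_bar = (p_c + p_i) / 2.0
    return (math.sqrt(n_per_group) * (p_c - p_i)
            - z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))) \
        / math.sqrt(p_c * (1.0 - p_c) + p_i * (1.0 - p_i))

# Example above: p_C = 0.4, p_I = 0.3, two-sided alpha = 0.05 (Z_alpha = 1.96).
for n in (480, 476, 350):
    z_b = z_beta_two_proportions(0.4, 0.3, n, 1.96)
    print(n, round(z_b, 3), round(normal_cdf(z_b), 2))
# N = 480 gives Z_beta = 1.295 (power about 0.90); N = 350 gives Z_beta = 0.818 (power about 0.79).
```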


For a given 2N, α, 1 − β, and p C the reduction in event
rate that can be detected can also be calculated. This function is
nonlinear and, therefore, the details will not be presented here.
Approximate results can be obtained by scanning Table 8.3, by using the
calculations for several p
I until the sample size approaches the planned number,
or by using a figure where sample sizes have been plotted. In
Fig. 8.2,
α is 0.05 and 1 − β is 0.90. If the sample size is selected as
1000, with p
C = 0.4, k is
determined to be about 0.25. This means that the expected
p I would be
0.3. As can be seen in Table 8.3, the actual sample size for these
assumptions is 960.
The above approach yields an estimate
which is more accurate as the sample size increases. Modifications
[49, 51–55,
58, 59, 74] have
been developed which give some improvement in accuracy to the
approximate formula presented for small studies. However, the
availability of computer software to perform exact computations
[66–73] has reduced the need for good small sample
approximations. Also, given that sample size estimation is somewhat
imprecise due to assumptions of intervention effects and event
rates, the formulation presented is probably adequate for most
clinical trials.
Designing a trial comparing
proportions using the confidence interval approach, we would need
to make a series of assumptions as well [6, 42,
52]. A 100(1 − α)% confidence
interval for a treatment comparison θ would be of the general form $\hat{\theta} \pm Z_\alpha\,\mathrm{SE}(\hat{\theta})$, where $\hat{\theta}$ is the estimate for θ and $\mathrm{SE}(\hat{\theta})$ is the standard error of $\hat{\theta}$. In this case, the specific form would be:
$$ \left(\hat{p}_C - \hat{p}_I\right) \pm Z_\alpha\sqrt{\bar{p}\left(1-\bar{p}\right)\left(1/N_C + 1/N_I\right)} $$
If we want the width of the confidence interval (CI) not to exceed $W_{CI}$, where $W_{CI}$ is the difference between the upper confidence limit and the lower confidence limit, then if $N = N_I = N_C$, the width $W_{CI}$ can be expressed simply as:
$$ W_{CI} = 2Z_\alpha\sqrt{2\bar{p}\left(1-\bar{p}\right)/N} $$
or after solving this equation for N,
$$ N = 8Z_\alpha^2\,\bar{p}\left(1-\bar{p}\right)/W_{CI}^2 $$
Thus, if α is 0.05 for a 95% confidence interval, $p_C = 0.4$ and $p_I = 0.3$, or $\bar{p} = 0.35$, $N = 8(1.96)^2(0.35)(0.65)/(W_{CI})^2$. If we
desire the upper limit of the confidence interval to be not more
than 0.10 from the estimate or the width to be twice that, then
W CI = 0.20 and
N = 175 or 2N = 350. Notice that even though we are
essentially looking for differences in p C − p I to be the same as our
previous calculation, the sample size is smaller. If we let
$p_C - p_I = W_{CI}/2$ and substitute this into the previous sample size formula, we obtain
$$ N = 8\left(Z_\alpha + Z_\beta\right)^2\,\bar{p}\left(1-\bar{p}\right)/W_{CI}^2 $$
This formula is very close to the confidence interval formula for
two proportions. If we select 50% power, β is 0.50 and Z β is 0 which would yield
the confidence interval formula. Thus, a confidence interval
approach gives 50% power to detect differences of W CI/2. This may not be
adequate, depending on the situation. In general, we prefer to
specify greater power (e.g., 80–90%) and use the previous
approach.
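A short sketch of the confidence-interval-width calculation follows; it assumes the pooled form of the standard error used above, and the optional $Z_\beta$ argument illustrates the relationship to the power-based formula just described. The function name is ours.

```python
def n_per_group_for_ci_width(p_c, p_i, width, z_alpha, z_beta=0.0):
    """Sample size per group so the confidence interval has the desired total width W_CI.
    With z_beta = 0 this is the pure confidence interval formula (50% power at W_CI/2)."""
    p_bar = (p_c + p_i) / 2.0
    return 8.0 * (z_alpha + z_beta) ** 2 * p_bar * (1.0 - p_bar) / width ** 2

# Example above: 95% CI (Z_alpha = 1.96), p_C = 0.4, p_I = 0.3, total width W_CI = 0.20.
print(round(n_per_group_for_ci_width(0.4, 0.3, 0.20, 1.96)))          # about 175 per group (2N = 350)
print(round(n_per_group_for_ci_width(0.4, 0.3, 0.20, 1.96, 1.282)))   # about 478 per group with 90% power
```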
Analogous sample size estimation using
the confidence interval approach may be used for comparing means,
hazard rates, or regression slopes. We do not present details of
these since we prefer to use designs which yield power greater than
that obtained from a confidence interval approach.
Paired Dichotomous Response
For designing a trial where the paired
outcomes are binary, the sample size estimate is based on McNemar’s
test [60–64]. We want to compare the frequency of success
within an individual on intervention with the frequency of success
on control (i.e., $p_I - p_C$). McNemar's test assesses this difference using the discordant responses, that is, participants whose outcome on the intervention differs from their outcome on the control.
In this case, the number of paired observations, $N_p$, may be estimated by:
$$ N_p = \left(Z_\alpha + Z_\beta\right)^2 f/d^2 $$
where d = difference in the
proportion of successes (d = p I − p C) and f is the proportion of participants
whose response is discordant. An alternative approximate formula
for N p is
$$ N_{\mathrm{p}} = \left[Z_\alpha\sqrt{f} + Z_\beta\sqrt{f - d^2}\right]^2/d^2 $$
Example: Consider an eye study where
one eye is treated for loss in visual acuity by a new laser
procedure and the other eye is treated by standard therapy. The
failure rate on the control, p C is estimated to be 0.40
and the new procedure is projected to reduce the failure rate to
0.20. The discordant rate f
is assumed to be 0.50. Using the first sample size formula for a
two-sided 5% significance level and 90% power, the number of pairs
N p is estimated
as 132. If the discordant rate is 0.8, then 210 pairs of eyes will
be needed.
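Both paired formulas are easy to code; the sketch below reproduces the example, and the function names are ours. The 132 and 210 pairs quoted above correspond to the first (simpler) formula; the alternative approximation gives slightly smaller estimates.

```python
import math

def pairs_mcnemar_simple(f, d, z_alpha, z_beta):
    """Number of pairs from the simpler formula: N_p = (Z_alpha + Z_beta)^2 f / d^2."""
    return (z_alpha + z_beta) ** 2 * f / d ** 2

def pairs_mcnemar_alternative(f, d, z_alpha, z_beta):
    """Number of pairs from the alternative approximation."""
    return (z_alpha * math.sqrt(f) + z_beta * math.sqrt(f - d ** 2)) ** 2 / d ** 2

# Eye-study example: p_C = 0.40, p_I = 0.20, so d = -0.20 (only d^2 enters the formulas);
# two-sided alpha = 0.05 (Z_alpha = 1.96) and 90% power (Z_beta = 1.282).
for f in (0.5, 0.8):
    print(f,
          round(pairs_mcnemar_simple(f, 0.20, 1.96, 1.282), 1),       # 131.4 and 210.2 pairs
          round(pairs_mcnemar_alternative(f, 0.20, 1.96, 1.282), 1))  # 127.2 and 206.0 pairs
```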
Adjusting Sample Size to Compensate for Nonadherence
During the course of a clinical trial,
participants will not always adhere to their prescribed
intervention schedule. The reason is often that the participant
cannot tolerate the dosage of the drug or the degree of
intervention prescribed in the protocol. The investigator or the
participant may then decide to follow the protocol with less
intensity. At all times during the conduct of a trial, the
participant’s welfare must come first and meeting those needs may
not allow some aspects of the protocol to be followed. Planners of
clinical trials must recognize this possibility and attempt to
account for it in their design. Examples of adjusting for
nonadherence with dichotomous outcomes can be found in several
clinical trials [75–82].
In the intervention group a
participant who does not adhere to the intervention schedule is
often referred to as a “drop-out.” Participants who stop the
intervention regimen lose whatever potential benefit the
intervention might offer. Similarly, a participant on the control
regimen may at some time begin to use the intervention that is
being evaluated. This participant is referred to as a “drop-in.” In
the case of a drop-in, a physician may decide, for example, that
surgery is required for a participant assigned to medical treatment
in a clinical trial of surgery versus medical care [77]. Drop-in participants from the control group
who start the intervention regimen will receive whatever potential
benefit or harm that the intervention might offer. Therefore, both
the drop-out and drop-in participants must be acknowledged because
they tend to dilute any difference between the two groups which
might be produced by the intervention. This simple model does not
take into account the situation in which one level of an
intervention is compared to another level of the intervention. More
complicated models for nonadherence adjustment can be developed.
Regardless of the model, it must be emphasized that the assumed
event rates in the control and intervention groups are modified by
participants who do not adhere to the study protocol.
People who do not adhere should remain
in the assigned study groups and be included in the analysis. The
rationale for this is discussed in Chap. 18. The basic point to be made here
is that eliminating participants from analysis or transferring
participants to the other group could easily bias the results of
the study. However, the observed δ is likely to be less than
projected because of nonadherence and thus have an impact on the
power of the clinical trial. A reduced δ, of course, means that
either the sample size must be increased or the study will have
smaller power than intended. Lachin [11] has proposed a simple formula to adjust
crudely the sample size for a drop-out rate of proportion
R O. This can be
generalized to adjust for drop-in rates, R I, as well. The unadjusted
sample size N should be multiplied by the factor
{1/(1 − R
O − R
I)}2 to get the adjusted sample size per arm,
N*. Thus, if R O = 0.20 and R I = 0.05, the originally
calculated sample should be multiplied by 1/(0.75)2, or 16/9, and
increased by 78%. This formula gives some quantitative idea of the
effect of drop-out on the sample size:
$$ N^* = N/\left(1 - R_O - R_I\right)^2 $$
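As a small illustration, the inflation factor can be coded directly; the 480-per-arm input below simply reuses the earlier two-proportion example and is our choice, not part of Lachin's formula.

```python
def adjust_for_nonadherence(n_unadjusted, dropout_rate, dropin_rate):
    """Inflate the per-arm sample size by 1 / (1 - R_O - R_I)^2."""
    return n_unadjusted / (1.0 - dropout_rate - dropin_rate) ** 2

# R_O = 0.20 and R_I = 0.05 inflate the sample size by 1/(0.75)^2 = 16/9, about 78%.
# Applied to the 480 per arm of the earlier two-proportion example (our illustration only):
print(round(adjust_for_nonadherence(480, 0.20, 0.05)))   # about 853 per arm
```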
However, more refined models to adjust
sample sizes for drop-outs from the intervention to the control
[83–89] and for drop-ins from the control to the
intervention regimen [83] have
been developed. They adjust for the resulting changes in
p I and
p C, the
adjusted rates being denoted p I* and p C*. These models also
allow for another important factor, which is the time required for
the intervention to achieve maximum effectiveness. For example, an
anti-platelet drug may have an immediate effect; conversely, even
though a cholesterol-lowering drug reduces serum levels quickly, it
may require years to produce a maximum effect on coronary
mortality.
Example: A drug trial [76] in post myocardial infarction participants
illustrates the effect of drop-outs and drop-ins on sample size. In
this trial, total mortality over a 3-year follow-up period was the
primary response variable. The mortality rate in the control group
was estimated to be 18% ($p_C = 0.18$) and the intervention was believed to have the
potential for reducing p
C by 28% (k = 0.28) yielding p I = 0.1296. These
estimates of p C
and k were derived from
previous studies. Those studies also indicated that the drop-out
rate might be as high as 26% over the 3 years; 12% in the first
year, an additional 8% in the second year, and an additional 6% in
the third year. For the control group, the drop-in rate was
estimated to be 7% each year for a total drop-in rate of 21%.
Using these models for adjustment,
p C* = 0.1746
and p
I* = 0.1375. Therefore, instead of δ being 0.0504
(0.18 − 0.1296), the adjusted δ* is 0.0371 (0.1746 − 0.1375). For a
two-sided test with α = 0.05 and 1 − β = 0.90, the adjusted sample
size was 4020 participants compared to an unadjusted sample size of
2160 participants. The adjusted sample size almost doubled in this
example due to the expected drop-out and drop-in experiences and
the recommended policy of keeping participants in the originally
assigned study groups. The remarkable increases in sample size
because of drop-outs and drop-ins strongly argue for major efforts
to keep nonadherence to a minimum during trials.
Sample Size Calculations for Continuous Response Variables
Similar to dichotomous outcomes, we
consider two sample size cases for response variables which are
continuous [9, 11, 90]. The
first case is for two independent samples. The other case is for
paired data.
Two Independent Samples
For a clinical trial with continuous
response variables, the previous discussion is conceptually
relevant, but not directly applicable to actual calculations.
“Continuous” variables such as length of hospitalization, blood
pressure, spirometric measures, neuropsychological scores and level
of a serum component may be evaluated. Distributions of such
measurements frequently can be approximated by a normal
distribution. When this is not the case, a transformation of
values, such as taking their logarithm, can often make the
normality assumption approximately correct.
Suppose the primary response variable,
denoted as x, is continuous
with N I and
N C participants
randomized to the intervention and control groups respectively.
Assume that the variable x
has a normal distribution with mean μ and variance σ2.
The true levels of μI and μC for the
intervention and control groups are not known, but it is assumed
that σ2 is known. (In practice, σ2 is not
known and must be estimated from some data. If the data set used is
reasonably large, the estimate of σ2 can be used in
place of the true σ2. If the estimate for σ2
is based on a small set of data, it is necessary to be cautious in
the interpretation of the sample size calculations.)
The null hypothesis is H0:
δ = μC − μI = 0 and the two-sided alternative
hypothesis is HA: δ = μC − μI ≠ 0.
If the variance is known, the test statistic is:
$$ Z = \frac{\bar{x}_C - \bar{x}_I}{\sigma\sqrt{1/N_C + 1/N_I}} $$
where $\bar{x}_I$ and $\bar{x}_C$ represent the mean levels observed in the intervention and control groups
respectively. For adequate sample size (e.g. 50 participants per
arm) this statistic has approximately a standard normal
distribution. The hypothesis-testing concepts previously discussed
apply to the above statistic. If Z > Z α, then an investigator
would reject H 0
at the α level of significance. By use of the above test statistic
it can be determined how large a total sample 2N would be needed to detect a true
difference δ between μI and μC with power
(1 − β) and significance level α by the formula:
$$ 2N = 4\left(Z_\alpha + Z_\beta\right)^2\sigma^2/\delta^2 $$
Example: Suppose an investigator wishes
to estimate the sample size necessary to detect a 10 mg/dL
difference in cholesterol level in a diet intervention group
compared to the control group. The variance from other data is
estimated to be (50 mg/dL)2. For a two-sided 5%
significance level, Z
α = 1.96 and for 90% power, Z β = 1.282. Substituting
these values into the above formula, $2N = 4(1.96 + 1.282)^2(50)^2/10^2$
or approximately 1,050 participants. As δ decreases, the value of
2N increases, and as
σ2 increases the value of 2N increases. This means that the
smaller the difference in intervention effect an investigator is
interested in detecting and the larger the variance, the larger the
study must be. As with the dichotomous case, setting a smaller α
and larger 1 − β also increases the sample size.
Figure 8.3 shows total sample size 2 N as a function of δ/σ. As in the
example, if δ = 10 and σ = 50, then δ/σ = 0.2 and the sample size
2N for 1 − β = 0.9 is
approximately 1,050.
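A minimal sketch of this calculation follows; the function name is ours and the inputs are those of the cholesterol example.

```python
def sample_size_continuous(delta, sigma, z_alpha, z_beta):
    """Total sample size 2N = 4 (Z_alpha + Z_beta)^2 sigma^2 / delta^2 for two independent means."""
    return 4.0 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2

# Cholesterol example above: delta = 10 mg/dL, sigma = 50 mg/dL,
# two-sided alpha = 0.05, 90% power.
print(round(sample_size_continuous(10.0, 50.0, 1.96, 1.282)))   # about 1,051, i.e., roughly 1,050 in total
```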

Fig.
8.3
Total sample size (2N) required to detect the difference
(δ) between control group mean and intervention group mean as a
function of the standardized difference (δ/σ) where σ is the common
standard deviation, with two-sided significance level of 0.05 and
power (1 − β) of 0.80 and 0.90
Paired Data
In some clinical trials, paired
outcome data may increase power for detecting differences because
individual or within participant variation is reduced. Trial
participants may be assessed at baseline and at the end of
follow-up. For example, instead of looking at the difference
between mean levels in the groups, an investigator interested in
mean levels of change might want to test whether diet intervention
lowers serum cholesterol from baseline levels when compared to a
control. This is essentially the same question as asked before in
the two independent sample case, but each participant’s initial
cholesterol level is taken into account. Because of the likelihood
of reduced variability, this type of design can lead to a smaller
sample size if the question is correctly posed. Assume that
ΔC and ΔI represent the true, but unknown
levels of change from baseline to some later point in the trial for
the control and intervention groups, respectively. Estimates of
ΔC and ΔI would be $\hat{\Delta}_C$ and $\hat{\Delta}_I$.
These represent the differences in mean levels of the response
variable at two points for each group. The investigator tests
H 0:
ΔC − ΔI = 0 versus H A:
ΔC − ΔI = δ ≠ 0. The variance σ2
in this case reflects the variability of the change, from baseline
to follow-up, and is assumed here to be the same in the control and
intervention arms. This variance is likely to be smaller than the
variability at a single measurement. This is the case if the
correlation between the first and second measurements is greater
than 0.5. Using δ and $\sigma_\Delta^2$, as defined in this manner, the previous sample size formula for two independent samples and graph are applicable. That is, the total sample size 2N can be estimated as
$$ 2N = 4\left(Z_\alpha + Z_\beta\right)^2\sigma_\Delta^2/\delta^2 $$
Another way to represent this is
$$ 2N = 8\left(Z_\alpha + Z_\beta\right)^2\sigma^2\left(1-\rho\right)/\delta^2 $$
where $\sigma_\Delta^2 = 2\sigma^2\left(1-\rho\right)$ and $\sigma^2$ is the variance of a measurement at a single
point in time, the variability is assumed to be the same at both
time points (i.e. at baseline and at follow-up), and ρ is the
correlation coefficient between the first and second measurement.
As indicated, if the correlation coefficient is greater than 0.5,
comparing the paired differences will result in a smaller sample
size than just comparing the mean values at the time of
follow-up.
Example: Assume that an investigator is
still interested in detecting a 10 mg/dL difference in
cholesterol between the two groups, but that the variance of the
change is now (20 mg/dL)2. The question being asked
in terms of δ is approximately the same, because randomization
should produce baseline mean levels in each group which are almost
equal. The comparison of differences in change is essentially a
comparison of the difference in mean levels of cholesterol at the
second measurement. Using Fig. 8.3, where
δ/σΔ = 10/20 = 0.5, the sample size is 170. This
impressive reduction in sample size from 1,050 is due to a
reduction in the variance from (50 mg/dL)2 to
(20 mg/dL)2.
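The sketch below reproduces this example and also codes the relation $\sigma_\Delta^2 = 2\sigma^2(1-\rho)$; the correlation of 0.92 is our back-calculated illustration of a value consistent with a change-score standard deviation of 20 mg/dL when the single-visit standard deviation is 50 mg/dL.

```python
def sample_size_paired_change(delta, sigma_change, z_alpha, z_beta):
    """Total sample size 2N when comparing mean changes from baseline between two groups."""
    return 4.0 * (z_alpha + z_beta) ** 2 * sigma_change ** 2 / delta ** 2

def sd_of_change(sigma_single, rho):
    """Standard deviation of the change, sqrt(2 sigma^2 (1 - rho)), for correlation rho."""
    return (2.0 * sigma_single ** 2 * (1.0 - rho)) ** 0.5

# Example above: delta = 10 mg/dL, standard deviation of the change = 20 mg/dL.
print(round(sample_size_paired_change(10.0, 20.0, 1.96, 1.282)))   # about 168; read as 170 from Fig. 8.3

# Illustration (our back-calculation): a single-visit SD of 50 mg/dL with correlation 0.92
# gives a change-score SD of about 20 mg/dL, consistent with the example above.
print(round(sd_of_change(50.0, 0.92), 1))                          # 20.0
```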
Another type of pairing occurs in
diseases that affect paired organs such as lungs, kidneys, and
eyes. In ophthalmology, for example, trials have been conducted
where one eye is randomized to receive treatment and the other to
receive control therapy [61–64]. Both
the analysis and the sample size estimation need to take account of
this special kind of stratification. For continuous outcomes, a
mean difference in outcome between a treated eye and untreated eye
would measure the treatment effect and could be compared using a
paired t-test [9, 11], $t = \bar{d}/\left(S_d/\sqrt{N_d}\right)$, where $\bar{d}$ is the average difference in response and $S_d$ is the standard deviation of the differences. The mean difference μd is equal to the mean response of the treated or intervention eye, for example, minus the mean response of the control eye; that is, μd = μI − μC. Under the null hypothesis, μd equals zero; under the alternative hypothesis, μd equals δd. An estimate of δd, $\hat{\delta}_d$, can be obtained by taking an estimate of the average differences or by calculating $\bar{x}_I - \bar{x}_C$.
The variance of the paired differences $\sigma_d^2$ is estimated by $S_d^2$. Thus, the formula for paired continuous outcomes within an individual is a slight modification of the formula for comparison of means in two independent samples. To compute the sample size, $N_d$, the number of pairs, we compute:
$$ N_d = \left(Z_\alpha + Z_\beta\right)^2\sigma_d^2/\delta_d^2 $$
As discussed previously, participants
in clinical trials do not always fully adhere to the intervention
being tested. Some fraction (R O) of participants on
intervention drop-out of the intervention and some other fraction
(R I) drop-in
and start following the intervention. If we assume that these
participants who drop-out respond as if they had been on control
and those who drop-in respond as if they had been on intervention,
then the sample size adjustment is the same as for the case of
proportions. That is, the adjusted sample size N* is a function of the drop-out rate,
the drop-in rate, and the sample size N for a study with fully compliant
participants:
$$ N^* = N/\left(1 - R_O - R_I\right)^2 $$
Therefore, if the drop-out rate were 0.20 and the drop-in 0.05,
then the original sample size N must be increased by 16/9 or 1.78;
that is, a 78% increase in sample size.

Sample Size for Repeated Measures
The previous section briefly presented
the sample size calculation for trials where only two points, say a
baseline and a final visit, are used to determine the effect of
intervention and these two points are the same for all study
participants. Often, a continuous response variable is measured at
each follow-up visit. Considering only the first and last values
would give one estimate of change but would not take advantage of
all the available data. Many models exist for the analysis of
repeated measurements and formulae [13, 91–97] as well
as computer software [66,
67, 69–73] for
sample size calculation are available for most. In some cases, the
response variable may be categorical. We present one of the simpler
models for continuous repeated measurements. While other models are
beyond the scope of this book, the basic concepts presented are
still useful in thinking about how many participants, how many
measurements per individual, and when they should be taken, are
needed. In such a case, one possible approach is to assume that the
change in response variable is approximately a linear function of
time, so that the rate of change can be summarized by a slope. This
model is fit to each participant’s data by the standard least
squares method and the estimated slope is used to summarize the
participant’s experience. In planning such a study, the
investigator must be concerned about the frequency of the
measurement and the duration of the observation period. As
discussed by Fitzmaurice and co-authors [98], the observed measurement x can be expressed
as x = a + bt + error, where a = intercept, b = slope, t = time, and error represents the
deviation of the observed measurement from a regression line. This
error may be due to measurement variability, biological variability
or the nonlinearity of the true underlying relationship. On the
average, this error is expected to be equally distributed around 0
and have a variability denoted as σ(error)2. Though it is not
necessary, it simplifies the calculation to assume that
σ(error)2 is
approximately the same for each participant.
The investigator evaluates
intervention effectiveness by comparing the average slope in one
group with the average slope in another group. Obviously,
participants in a group will not have the same slope, but the
slopes will vary around some average value which reflects the
effectiveness of the intervention or control. The amount of
variability of slopes over participants is denoted as σ
b 2. If
D represents the total time
duration for each participant and P represents the number of equally
spaced measurements, $\sigma_b^2$ can be expressed as:
$$ \sigma_b^2 = \sigma_B^2 + \frac{12\left(P-1\right)\sigma_{(\mathrm{error})}^2}{D^2P\left(P+1\right)} $$
where $\sigma_B^2$ is the
component of variance attributable to differences in participants’
slope as opposed to measurement error and lack of a linear fit. The
sample size required to detect difference δ between the average
rates of change in the two groups is given by:

$$ 2N = \left[4\left(Z_\alpha + Z_\beta\right)^2/\delta^2\right]\left[\sigma_B^2 + \frac{12\left(P-1\right)\sigma_{(\mathrm{error})}^2}{D^2P\left(P+1\right)}\right] $$
As in the previous formulas, when δ
decreases, 2N increases.
The factor on the right-hand side relates D and P with the variance components σ
B 2 and
σ(error)2.
Obviously as σ B 2 and
σ(error)2
increase, the total sample size increases. By increasing
P and D, however, the investigator can
decrease the contribution made by σ(error)2. The exact choices of
P and D will depend on how long the
investigator can feasibly follow participants, how many times he
can afford to have participants visit a clinic and other factors.
By manipulating P and
D, an investigator can
design a study which will be the most cost effective for his
specific situation.
Example: In planning for a trial, it
may be assumed that a response variable declines at the rate of 80
units/year in the control group. Suppose a 25% reduction is
anticipated in the intervention group. That is, the rate of change
in the intervention group would be 60 units/year. Other studies
provided an estimate for σ(error) of 150 units. Also,
suppose data from a study of people followed every 3 months for 1
year (D = 1 and
P = 5) gave a value for the
standard deviation of the slopes, σb = 200. The
calculated value of σB is then 63 units. Thus, for a 5%
significance level and 90% power (Zα = 1.96 and
Zβ = 1.282), the total sample size would be
approximately 630 for a 3-year study with four visits per year
(D = 3, P = 13). Increasing the follow-up time
to 4 years, again with four measurements per year, would decrease
the variability with a resulting sample size calculation of
approximately 510. This reduction in sample size could be used to
decide whether or not to plan a 4-year or a 3-year study.
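A sketch of this calculation is given below; it back-calculates $\sigma_B^2$ from the pilot values ($\sigma_b = 200$ with D = 1 and P = 5) exactly as described above, and the function name and rounding are ours.

```python
def sample_size_slopes(delta, sigma_b_sq, sigma_error, duration, n_visits, z_alpha, z_beta):
    """Total sample size 2N for comparing average slopes with P equally spaced visits over D years."""
    within = 12.0 * (n_visits - 1) * sigma_error ** 2 / (duration ** 2 * n_visits * (n_visits + 1))
    return 4.0 * (z_alpha + z_beta) ** 2 / delta ** 2 * (sigma_b_sq + within)

# Example above: slopes of 80 vs. 60 units/year (delta = 20), sigma_error = 150.
# Back-calculate sigma_B^2 from the pilot data (sigma_b = 200 with D = 1, P = 5):
sigma_b_sq = 200.0 ** 2 - 12.0 * 4 * 150.0 ** 2 / (1.0 ** 2 * 5 * 6)   # 4,000, so sigma_B is about 63

print(round(sample_size_slopes(20.0, sigma_b_sq, 150.0, 3.0, 13, 1.96, 1.282)))   # ~628 (text: about 630)
print(round(sample_size_slopes(20.0, sigma_b_sq, 150.0, 4.0, 17, 1.96, 1.282)))   # ~513 (text: about 510)
```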
Sample Size Calculations for “Time to Failure”
For many clinical trials, the primary
response variable is the occurrence of an event and thus the
proportion of events in each group may be compared. In these cases,
the sample size methods described earlier will be appropriate. In
other trials, the time to the event may be of special interest. For
example, if the time to death or a nonfatal event can be increased,
the intervention may be useful even though at some point the
proportion of events in each group are similar. Methods for
analysis of this type of outcome are generally referred to as life
table or survival analysis methods (see Chap. 15). In this situation, other sample
size approaches are more appropriate than that described for
dichotomous outcomes [99–118]. At
the end of this section, we also discuss estimating the number of
events required to achieve a desired power.
The basic approach is to compare the
survival curves for the groups. A survival curve may be thought of
as a graph of the probability of surviving, or not having an event,
up to any given point in time. The methods of analysis now widely
used are non-parametric; that is, no mathematical model about the
shape of the survival curve is assumed. However, for the purpose of
estimating sample size, some assumptions are often useful. A common
model assumes that the survival curve, S(t), follows an exponential
distribution, S(t) = e−λt
= exp(−λt) where
λ is called the hazard rate
or force of mortality. Using this model, survival curves are
totally characterized by λ.
Thus, the survival curves from a control and an intervention group
can be compared by testing H 0: λ C = λ I. An estimate of λ is
obtained as the inverse of the mean survival time. If the median
survival time, T
M, is known, the hazard rate λ may also be estimated
by − ln(0.5)/T
M. Sample size formulations have been considered by
several investigators [103,
112, 119]. One simple formula is given by
$$ 2N = 4\left(Z_\alpha + Z_\beta\right)^2/\left[\ln\left(\lambda_C/\lambda_I\right)\right]^2 $$
where N is the size of the
sample in each group and Z
α and Z
β are defined as before. As an example, suppose one
assumes that the force of mortality is 0.30 in the control group
and expects it to be 0.20 for the intervention being tested; that
is, λ
C/λ
I = 1.5. If α = .05 (two-sided) and 1 − β = 0.90, then
N = 128 or 2N = 256. The corresponding mortality
rates for 5 years of follow-up are 0.7769 and 0.6321 respectively.
Using the comparison of two proportions, the total sample size
would be 412. Thus, the time to failure method may give a more
efficient design, requiring a smaller number of participants.
The method just described assumes that
all participants will be followed to the event. With few
exceptions, clinical trials with a survival outcome are terminated
at time T before all participants have had an event. For those
still event-free, the time to event is said to be censored at time
T. For this situation,
Lachin [11] gives the approximate
formula:
$$ 2N = 2\left(Z_\alpha + Z_\beta\right)^2\left[\varphi\left(\lambda_C\right)+\varphi\left(\lambda_I\right)\right]/\left(\lambda_I - \lambda_C\right)^2 $$
where $\varphi(\lambda) = \lambda^2/\left(1 - e^{-\lambda T}\right)$ and where $\varphi(\lambda_C)$ or
φ(λI) are defined by replacing λ with λC or
λI, respectively. If a 5 year study were being planned
(T = 5) with the same
design specifications as above, then the sample size, 2N is equal to 376. Thus, the loss of
information due to censoring must be compensated for by increasing
the sample size. If the participants are to be recruited
continually during the 5 years of the trial, the formula given by
Lachin is identical but with $\varphi(\lambda) = \lambda^3 T/\left(\lambda T - 1 + e^{-\lambda T}\right)$. Using the same design
assumptions, we obtain 2N = 620, showing that not having all
the participants at the start requires an additional increase in
sample size.
More typically, participants are recruited uniformly over a period of
time, T0, with the trial continuing for a total of T years
(T > T0). In this situation, the sample size can be estimated as
before using

$$ \varphi \left(\lambda \right)={\lambda}^2\big/\left[1-\left({\mathrm{e}}^{-\lambda \left(T-{T}_0\right)}-{\mathrm{e}}^{-\lambda T}\right)\big/\left(\lambda {T}_0\right)\right] $$

Here, the sample size (2N) of 466 is between the previous two
examples, suggesting that it is preferable to recruit participants as
rapidly as possible to gain more follow-up or exposure time.
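The effect of censoring and of staggered entry can be seen numerically with the same design specifications. The sketch below (ours, not from the text) evaluates the approximation for the three recruitment and follow-up patterns just described; the recruitment period T0 = 3 years is an assumption chosen because it reproduces the 2N of 466 quoted above, which the text does not state explicitly.

```python
from math import exp

z_alpha, z_beta = 1.96, 1.282
lam_c, lam_i = 0.30, 0.20
T, T0 = 5.0, 3.0   # total duration; T0 = 3 years is assumed, not stated in the text

def two_n(phi):
    # 2N = 2 (Z_a + Z_b)^2 [phi(lam_C) + phi(lam_I)] / (lam_I - lam_C)^2
    return 2 * (z_alpha + z_beta) ** 2 * (phi(lam_c) + phi(lam_i)) / (lam_i - lam_c) ** 2

phi_censored = lambda lam: lam ** 2 / (1 - exp(-lam * T))                 # all entered at time 0
phi_continual = lambda lam: lam ** 3 * T / (lam * T - 1 + exp(-lam * T))  # recruitment over (0, T)
phi_uniform = lambda lam: lam ** 2 / (
    1 - (exp(-lam * (T - T0)) - exp(-lam * T)) / (lam * T0))              # recruitment over (0, T0)

for phi in (phi_censored, phi_continual, phi_uniform):
    # prints about 377, 621, and 466; the chapter quotes 376, 620, and 466
    # (the small differences come from rounding of the critical values)
    print(round(two_n(phi)))
```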
One of the methods used for comparing
survival curves is the proportional hazards model or the Cox
regression model which is discussed briefly in Chap. 15. For this method, sample size
estimates have been published [101, 115]. As
it turns out, the formula by Schoenfeld for the Cox model
[115] is identical to that given
above for the simple exponential case, although developed from a
different point of view. Further models are given by Lachin
[11].
All of the above methods assume that
the hazard rate remains constant during the course of the trial.
This may not be the case. The Beta-Blocker Heart Attack Trial
[76] compared 3-year survival in
two groups of participants with intervention starting one to three
weeks after an acute myocardial infarction. The risk of death was
high initially, decreased steadily, and then became relatively
constant.
For cases where the event rate is
relatively small and the clinical trial will have considerable
censoring, most of the statistical information will be in the
number of events. Thus, the sample size estimates using simple
proportions will be quite adequate. In the Beta-Blocker Heart
Attack Trial, the 3 year control group event rate was assumed to be
0.18. For the intervention group, the event rate was assumed to be
approximately 0.13. In the situation of φ(λ) = λ²/(1 − e^(−λT)), a
sample size of 2N = 2,208 is obtained, before adjustment for
estimated nonadherence. In contrast, the unadjusted
sample size using simple proportions is 2,160. Again, it should be
emphasized that all of these methods are only approximations and
the estimates should be viewed as such.
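The 2,208 quoted for the Beta-Blocker Heart Attack Trial can be reproduced by first converting the assumed 3-year event rates into exponential hazard rates. This is a small check (ours), using the censored-at-T form of φ(λ) above and assuming a two-sided α = 0.05 with 90% power, which reproduce the quoted figure.

```python
from math import exp, log

z_alpha, z_beta = 1.96, 1.282       # two-sided alpha = 0.05, power = 0.90 (assumed)
T = 3.0                             # years of follow-up
p_c, p_i = 0.18, 0.13               # assumed 3-year event rates

lam_c = -log(1 - p_c) / T           # exponential hazards implied by the 3-year rates
lam_i = -log(1 - p_i) / T

phi = lambda lam: lam ** 2 / (1 - exp(-lam * T))
two_n = 2 * (z_alpha + z_beta) ** 2 * (phi(lam_c) + phi(lam_i)) / (lam_c - lam_i) ** 2
print(round(two_n))                 # about 2,208, before the nonadherence adjustment
```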
As the previous example indicates, the power of a survival analysis
is still a function of the number of events. The expected number of
events E(D) is a function of sample size, hazard rate, recruitment
rate, and censoring distribution [11, 106]. Specifically, the
expected number of events in the control group can be estimated as

$$ E(D)=N{\lambda}_{\mathrm{C}}^2\big/\varphi \left({\lambda}_{\mathrm{C}}\right) $$

where φ(λC) is defined as before, depending on the recruitment and
follow-up strategy. If we assume uniform recruitment over the
interval (0, T0) and follow-up over the interval (0, T), then
E(D) can be written using the most general form for φ(λC):

$$ E(D)=N\left[1-\left({\mathrm{e}}^{-\lambda \left(T-{T}_0\right)}-{\mathrm{e}}^{-\lambda T}\right)\big/\left(\lambda {T}_0\right)\right] $$
This estimate of the number of events
can be used to predict the number of events at various time points
during the trial including the end of follow-up. This prediction
can be compared to the observed number of events in the control
group to determine if an adjustment needs to be made to the design.
That is, if the number of events early in the trial is larger than
expected, the trial may be more powerful than designed or may be
stopped earlier than the planned T years of follow-up (see Chap.
16). However, more worrisome is when
the observed number of events is smaller than what is expected and
needed to maintain adequate power. Based on this early information,
the design may be modified to attain the necessary number of events
by increasing the sample size or expanding recruitment effort
within the same period of time, increasing follow-up, or a
combination of both.
This method can be illustrated based
on a placebo-controlled trial of congestive heart failure
[82]. Severe or advanced
congestive heart failure has an expected 1 year event rate of 40%,
where the events are all-cause mortality and nonfatal myocardial
infarction. A new drug was to be tested to reduce the event rate by
25%, using a two-sided 5% significance level and 90% power. If
participants are recruited over 1.5 years (T0 = 1.5) during a 2-year
study (T = 2) and a
constant hazard rate is assumed, the total sample size
(2N) is estimated to be 820
participants with congestive heart failure. The formula
E(D) can be used to calculate that
approximately 190 events (deaths plus nonfatal myocardial
infarctions) must be observed in the control group to attain 90%
power. If the first year event rate turns out to be less than 40%,
fewer events will be observed by 2 years than the required 190.
Table 8.4
shows the expected number of control group events at 6 months and 1
year into the trial for annual event rates of 40, 35, 30, and 25%.
Two years is also shown to illustrate the projected number of
events at the completion of the study. These numbers are obtained
by calculating the number of participants enrolled by 6 months (33%
of 400) and 1 year (66% of 400) and multiplying by the bracketed
term on the right hand side of the equation for E(D). If the assumed annual event rate of
40% is correct, 60 control group events should be observed at 1
year. However, if at 1 year only 44 events are observed, the annual
event rate might be closer to 30% (i.e., λ = 0.357) and some design
modification should be considered to assure achieving the desired
190 control group events. One year would be a sensible time to make
this decision, based only on control group events, since
recruitment efforts are still underway. For example, if recruitment
efforts could be expanded to 1220 participants in 1.5 years, then
by 2 years of follow-up the 190 events in the placebo group would
be observed and the 90% power maintained. If recruitment efforts
were to continue for another 6 months at a uniform rate
(T0 = 2 years), another 135 participants would be enrolled. In this
case, E(D) is 545 × 0.285 = 155 events, which would not be sufficient
without some additional follow-up. If recruitment and follow-up
continued for 27 months (i.e., T0 = T = 2.25),
then 605 control group participants would be recruited and
E(D) would be 187, yielding the desired
power.
Table 8.4 Number of expected events in the control group at each interim analysis, by calendar time into the study, given different yearly event rates in the control group

| Yearly event rate in control group | 6 Months (N = 138/group) | 1 Year (N = 275/group) | 1.5 Years (N = 412/group) | 2 Years (N = 412/group) |
|---|---|---|---|---|
| 40% | 16 | 60 | 124 | 189 |
| 35% | 14 | 51 | 108 | 167 |
| 30% | 12 | 44 | 94 | 146 |
| 25% | 10 | 36 | 78 | 123 |

Assumptions: (1) time to event exponentially distributed; (2) uniform entry into the study over 1.5 years; (3) total duration of 2 years.
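The entries in Table 8.4 follow from the expression for E(D). The sketch below (ours) assumes, as in the text, uniform entry over 1.5 years, a 2-year total duration, and the enrolled control group numbers shown in the column headings; annual event rates are converted to hazard rates by λ = −ln(1 − rate). It reproduces the 40% row exactly and the remaining entries to within one event of the published table, the differences being rounding.

```python
from math import exp, log

def expected_events(n_enrolled, lam, t, t0):
    # E(D) = N [1 - (exp(-lam (t - t0)) - exp(-lam t)) / (lam t0)]
    return n_enrolled * (1 - (exp(-lam * (t - t0)) - exp(-lam * t)) / (lam * t0))

# (calendar time into study, control participants enrolled so far, recruitment period so far)
looks = [(0.5, 138, 0.5), (1.0, 275, 1.0), (1.5, 412, 1.5), (2.0, 412, 1.5)]

for rate in (0.40, 0.35, 0.30, 0.25):
    lam = -log(1 - rate)                         # yearly hazard from the yearly event rate
    row = [round(expected_events(n, lam, t, t0)) for t, n, t0 in looks]
    print(f"{rate:.0%}: {row}")                  # e.g. 40%: [16, 60, 124, 189]
```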
Sample Size for Testing “Equivalency” or Noninferiority of Interventions
In some instances, an effective
intervention has already been established and is considered the
standard. New interventions under consideration may be preferred
because they are less expensive, have fewer side effects, or have
less adverse impact on an individual’s general quality of life.
This issue is common in the pharmaceutical industry where a product
developed by one company may be tested against an established
intervention manufactured by another company. Studies of this type
are sometimes referred to as trials with positive controls or as
noninferiority designs (see Chaps. 3 and 5).
Given that several trials have shown
that certain beta-blockers are effective in reducing mortality in
post-myocardial infarction participants [76, 120,
121], it is likely that any new
beta-blockers developed will be tested against proven agents. The
Nocturnal Oxygen Therapy Trial [122] tested whether the daily amount of oxygen
administered to chronic obstructive pulmonary disease participants
could be reduced from 24 to 12 h without impairing
oxygenation. The Intermittent Positive Pressure Breathing
[80] trial considered whether a
simple and less expensive method for delivering a bronchodilator
into the lungs would be as effective as a more expensive device. A
breast cancer trial compared the tumor regression rates between
subjects receiving the standard, diethylstilbestrol, or the newer
agent, tamoxifen [123].
The problem in designing
noninferiority trials is that there is no statistical method to
demonstrate complete equivalence. That is, it is not possible to
show δ = 0. Failure to reject the null hypothesis is not a
sufficient reason to claim two interventions to be equal but merely
that the evidence is inadequate to say they are different
[124]. Assuming no difference
when using the previously described formulas results in an infinite
sample size.
While demonstrating perfect
equivalence is an impossible task, one possible approach has been
discussed for noninferiority designs [125–128]. The
strategy is to specify some value, δ, such that interventions with
differences which are less than this might be considered “equally
effective” or “noninferior” (see Chap. 5 for discussion of noninferiority
designs). Specification of δ may be difficult but it is a necessary
element of the design. The null hypothesis states that
pC > pI + δ, while the alternative specifies
pC < pI + δ. The methods developed require that if the two
interventions really are equally effective or at least noninferior,
the upper 100(1 − α)% confidence interval for the intervention
difference will not exceed δ with the probability of 1 − β. One can
alternatively approach this from a hypothesis testing point of
view, stating the null hypothesis that the two interventions differ
by less than δ.
For studies with a dichotomous response, one might assume the event
rate for the two interventions to be equal to p (i.e.,
p = pC = pI). This simplifies the previously shown sample size
formula to

$$ 2N=4{\left({Z}_{\alpha}+{Z}_{\beta}\right)}^2p\left(1-p\right)\big/{\delta}^2 $$

where N, Zα, and Zβ are defined as before. Makuch and Simon
[127] recommend for this situation that α = 0.10 and β = 0.20.
However, for many situations, β or the Type II error needs to be 0.10
or smaller in order to be sure a new therapy is correctly determined
to be equivalent to an older standard. We prefer an α = 0.05, but
this is a matter of judgment and will depend on the situation. (This
formula differs slightly from its analogue presented earlier due to
the different way the hypothesis is stated.) The formula for
continuous variables,

$$ 2N=4{\left({Z}_{\alpha}+{Z}_{\beta}\right)}^2{\sigma}^2\big/{\delta}^2 $$

is identical to the formula for determining sample size discussed
earlier. Blackwelder and Chang [126] give graphical methods for
computing sample size estimates for studies of equivalency.
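As a purely illustrative sketch of the dichotomous calculation above (the event rate, margin, and critical values here are our assumptions, not values from the text), consider a common event rate of 0.30 and a noninferiority margin δ of 0.10 with a one-sided α = 0.05 and 90% power:

```python
# Illustrative "equivalence"/noninferiority sample size for a dichotomous outcome.
# p, delta, and the z-values are assumptions for illustration, not values from the chapter.
z_alpha = 1.645    # one-sided alpha = 0.05; the appropriate Z depends on how the hypothesis is framed
z_beta = 1.282     # power = 0.90
p = 0.30           # assumed common event rate
delta = 0.10       # largest difference still regarded as "noninferior"

two_n = 4 * (z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2
print(round(two_n))   # about 720 in total, or roughly 360 per group
```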
As mentioned above and in Chap.
5, specifying δ is a key part of the
design and sample size calculations of all equivalency and
noninferiority trials. Trials should be sufficiently large, with
enough power, to address properly the questions about equivalence
or noninferiority that are posed.
Sample Size for Cluster Randomization
So far, sample size estimates have
been presented for trials where individuals are randomized. For
some prevention trials or health care studies, it may not be
possible to randomize individuals. For example, a trial of smoking
prevention strategy for teenagers may be implemented most easily by
randomizing schools, some schools to be exposed to the new
prevention strategy while other schools remain with a standard
approach. Individual students are grouped or clustered within each
school. As Donner et al. [129]
point out, “Since one cannot regard the individuals within such
groups as statistically independent, standard sample size formulas
underestimate the total number of subjects required for the trial.”
Several authors [129–133] have
suggested incorporating a single inflation factor in the usual
sample size calculation to account for the cluster randomization.
That is, the sample size per intervention arm N computed by the
previous formulas will be adjusted to N* to account for the
randomization of Nm clusters, each with m individuals.
A continuous response is measured for each individual within a
cluster. Differences among individuals within a cluster and
differences among individuals between clusters both contribute to the
overall variability of the response. We can separate the
between-cluster variance σb² and the within-cluster variance σw².
Their estimates, denoted Sb² and Sw², respectively, can be obtained
from a standard analysis of variance. One measure of the relationship
of these components is the intra-class correlation coefficient ρ,

$$ \rho ={\sigma}_{\mathrm{b}}^2\big/\left({\sigma}_{\mathrm{b}}^2+{\sigma}_{\mathrm{w}}^2\right) $$

where 0 ≤ ρ ≤ 1. If ρ = 0, all clusters respond identically, so all
of the variability is within a cluster. If ρ = 1, all individuals in
a cluster respond alike, so there is no variability within a cluster.
An estimate of ρ is obtained by substituting Sb² and Sw² for σb² and
σw² in this expression.
Intra-class correlation may range from 0.1 to 0.4 in typical
clinical studies. If we computed the sample size assuming no
clustering, the result would be N participants per treatment arm.
Now, instead of randomizing N individuals, we want to randomize Nm
clusters with m individuals each, for a total of N* = Nm × m
participants per treatment arm. The inflation factor [133] is
[1 + (m − 1)ρ], so that

$$ {N}^{*}={N}_m\times m=N\left[1+\left(m-1\right)\rho \right] $$

Note that the inflation factor is a function of both the cluster size
m and the intra-class correlation. If the intra-class correlation is
zero (ρ = 0), then each individual in one cluster responds like any
individual in another cluster, and the inflation factor is unity
(N* = N); that is, no penalty is paid for the convenience of cluster
randomization. At the other extreme, if all individuals in a cluster
respond the same (ρ = 1), there is no added information within each
cluster, so only one individual per cluster is needed, and the
inflation factor is m; that is, the adjusted sample size is
N* = N × m, and we pay a severe price for this type of cluster
randomization. However, it is unlikely that ρ is either 0 or 1; as
indicated, it is more likely to be in the range of 0.1–0.4 in
clinical studies.
Example: Donner et al. [129] provide an example for a trial
randomizing households to a sodium-reducing diet in order to lower
blood pressure. Previous studies estimated the intra-class
correlation coefficient to be 0.2 (that is, ρ = 0.2). The average
household size was estimated at 3.5 (m = 3.5). The sample size per
arm N must therefore be adjusted by
1 + (m − 1)ρ = 1 + (3.5 − 1)(0.2) = 1.5; that is, the usual sample
size must be inflated by 50% to account for this randomization. If
ρ = 0.1, indicating a small between-cluster variability, the factor
is 1 + (3.5 − 1)(0.1) = 1.25. If ρ = 0.4, indicating a larger
between-cluster component of variability, the inflation factor is
2.0, a doubling of the sample size.
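The inflation factors quoted in this example can be checked in a few lines; the uninflated per-arm sample size of 100 below is an arbitrary illustration, not a number from the text.

```python
def inflation_factor(m, rho):
    # Sample size inflation for randomizing clusters of size m with intra-class correlation rho
    return 1 + (m - 1) * rho

m = 3.5                                       # average household size
for rho in (0.1, 0.2, 0.4):
    print(rho, inflation_factor(m, rho))      # 1.25, 1.5, and 2.0

N = 100                                       # illustrative per-arm sample size before adjustment
print(round(N * inflation_factor(m, 0.2)))    # N* = 150
```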
For binomial responses, a similar
expression for adjusting the standard sample size can be developed.
In this setting, a measure of the degree of within cluster
dependency or concordancy rate in participant responses is used in
place of the intra-class correlation. The commonly used measure is
the kappa coefficient, denoted κ, which may be thought of as an
intra-class correlation coefficient for binomial responses, analogous
to ρ for continuous responses. A concordant cluster (κ = 1) is one in
which all responses within the cluster are identical, all successes
or all failures, so that the cluster contributes no more information
than a single individual. A simple estimate for κ is provided [129]:

$$ \kappa =\left[{p}^{*}-\left({p}_{\mathrm{C}}^m+{\left(1-{p}_{\mathrm{C}}\right)}^m\right)\right]\big/\left[1-\left({p}_{\mathrm{C}}^m+{\left(1-{p}_{\mathrm{C}}\right)}^m\right)\right] $$

Here p* is the proportion of concordant clusters in the control
group, and pC is the underlying success rate in the control group.
The authors then show that the inflation factor is [1 + (m − 1)κ];
that is, the regular sample size per treatment arm N must be
multiplied by this factor to attain the adjusted sample size N*:

$$ {N}^{*}=N\left[1+\left(m-1\right)\kappa \right] $$
Example: Donner et al. [129] continue the sodium diet example, where
couples (m = 2) are randomized to either a low sodium or a normal
diet. The outcome is the hypertension rate. Other data suggest the
concordancy of hypertension status among married couples is 0.85
(p* = 0.85), and the control group hypertension rate is 0.15
(pC = 0.15). In this case, κ = 0.41, so the inflation factor is
1 + (2 − 1)(0.41) = 1.41; that is, the regular sample size must be
inflated by 41% to adjust for couples being the randomization unit.
If there is perfect control group concordance, p* = 1 and κ = 1, in
which case N* = 2N.
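The κ of 0.41 and the 41% inflation in the couples example follow directly from the estimate given above; a short check (the function name is ours).

```python
def kappa_estimate(p_star, p_c, m):
    # kappa = [p* - p0] / [1 - p0], where p0 = p_C^m + (1 - p_C)^m
    p0 = p_c ** m + (1 - p_c) ** m
    return (p_star - p0) / (1 - p0)

k = kappa_estimate(p_star=0.85, p_c=0.15, m=2)
print(round(k, 2))                  # about 0.41
print(round(1 + (2 - 1) * k, 2))    # inflation factor of about 1.41
```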
Cornfield proposed another adjustment procedure [130]. Consider a
trial where C clusters will be randomized, each cluster of size mi
(i = 1, 2, …, C) and each having a different success rate pi
(i = 1, 2, …, C). Define the average cluster size m̄ = Σmi/C and the
overall success rate weighted by cluster size, p̄ = Σmi pi/Σmi. The
variance of this overall success rate, σp², reflects how much the
cluster-specific rates pi vary about p̄.
In this setting, the efficiency of simple randomization relative to
cluster randomization, and hence the inflation factor (IF) for this
design, depends on how large this between-cluster variation is
relative to the binomial variance p̄(1 − p̄). Note that if the
response rate varies across clusters, the normal sample size must be
increased.
While cluster randomization may be
logistically required, the process of making the cluster the
randomization unit has serious sample size implications. It would
be unwise to ignore this consequence in the design phase. As shown,
the sample size adjustments can easily be factors of 1.5 or higher.
For clusters which are schools or cities, the intra-class
correlation is likely to be quite small. However, the cluster size
is multiplied by the intra-class correlation so the impact might
still be nontrivial. Not making this adjustment would substantially
reduce the study power if the analyses were done properly, taking
into account the cluster effect. Ignoring the cluster effect in the
analysis would be viewed critically in most cases and is not
recommended.
Multiple Response Variables
We have stressed the advantages of
having a single primary question and a single primary response
variable, but clinical trials occasionally have more than one of
each. More than one question may be asked because investigators
cannot agree about which outcome is most important. As an example,
one clinical trial involving two schedules of oxygen administration
to participants with chronic obstructive pulmonary disease had
three major questions in addition to comparing the mortality rate
[122]. Measures of pulmonary
function, neuro-psychological status, and quality of life were
evaluated. For the participants, all three were important.
Sometimes more than one primary
response variable is used to assess a single primary question. This
may reflect uncertainty as to how the investigator can answer the
question. A clinical trial involving participants with pulmonary
embolism [134] employed three
methods of determining a drug’s ability to resolve emboli. They
were: lung scanning, arteriography, and hemodynamic studies.
Another trial involved the use of drugs to limit myocardial infarct
size [135]. Precordial
electrocardiogram mapping, radionuclide studies, and enzyme levels
were all used to evaluate the effectiveness of the drugs. Several
approaches to the design and analysis of trials with multiple
endpoints have been described [136–139].
Computing a sample size for such
clinical trials is not easy. One could attempt to define a single
model for the multidimensional response and use one of the
previously discussed formulas. Such a method would require several
assumptions about the model and its parameters and might require
information about correlations between different measurements. Such
information is rarely available. A more reasonable procedure would
be to compute sample sizes for each individual response variable.
If the results give about the same sample size for all variables,
then the issue is resolved. However, more commonly, a range of
sample sizes will be obtained. The most conservative strategy would
be to use the largest sample size computed. The other response
variables would then have even greater power to detect the
hoped-for reductions or differences (since they required smaller
sample sizes). Unfortunately, this approach is the most expensive
and difficult to undertake. Of course, one could also choose the
smallest sample size of those computed. That would probably not be
desirable, because the other response variables would have less
power than usually required, or only larger differences than
expected would be detectable. It is possible to select a middle
range sample size, but there is no assurance that this will be
appropriate. An alternative approach is to look at the difference
between the largest and smallest sample sizes. If this difference
is very large, the assumptions that went into the calculations
should be re-examined and an effort should be made to resolve the
difference.
As is discussed in Chap. 18, when multiple comparisons are
made, the chance of finding a significant difference in one of the
comparisons (when, in fact, no real differences exist between the
groups) is greater than the stated significance level. In order to
maintain an appropriate significance level α for the entire study,
the significance level required for each test to reject H0 should be
adjusted [41]. The significance
level required for rejection (α′) in a single test can be
approximated by α/k where
k is the number of multiple
response variables. For several response variables this can make α′
fairly small (e.g., k = 5
implies α′ = 0.01 for each of k response variables with an overall
α = 0.05). If the correlation between response variables is known,
then the adjustment can be made more precisely [140, 141]. In
all cases, the sample size would be much larger than if the use of
multiple response variables were ignored, so that most studies have
not strictly adhered to this solution of modifying the significance
level. Some investigators, however, have attempted to be
conservative in the analysis of results [142]. There is a reasonable limit as to how
much α′ can be decreased in order to give protection against false
rejection of the null hypothesis. Some investigators have chosen
α′ = 0.01 regardless of the number of tests. In the end, there are
no easy solutions. A somewhat conservative value of α′ needs to be
set and the investigators need to be aware of the multiple testing
problem during the analysis.
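To see what such an adjustment costs in sample size terms, note that for a fixed detectable difference the required sample size is proportional to (Zα + Zβ)². The short sketch below (ours, with k = 5 and 90% power as illustrative assumptions) shows that moving from a two-sided α = 0.05 to α′ = 0.01 for each test increases the sample size needed for any one comparison by roughly 40%.

```python
from scipy.stats import norm

alpha, k, power = 0.05, 5, 0.90
z_beta = norm.ppf(power)

z_unadjusted = norm.ppf(1 - alpha / 2)         # 1.96 for a two-sided test at alpha = 0.05
z_bonferroni = norm.ppf(1 - alpha / (2 * k))   # 2.58 for alpha' = alpha / k = 0.01

# Sample size is proportional to (Z_alpha + Z_beta)^2, so the relative increase is:
ratio = ((z_bonferroni + z_beta) / (z_unadjusted + z_beta)) ** 2
print(round(ratio, 2))   # about 1.42, i.e., roughly a 42% larger sample size per comparison
```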
Estimating Sample Size Parameters
As shown in the methods presented,
sample size estimation is quite dependent upon assumptions made
about variability of the response, level of response in the control
group, and the difference anticipated or judged to be clinically
relevant [16, 143–148].
Obtaining reliable estimates of variability or levels of response
can be challenging since the information is often based on very
small studies or studies not exactly relevant to the trial being
designed. Applying Bayesian methods to explicitly incorporate the
uncertainty in these estimated parameters has been attempted
[149]. Sometimes, pilot or
feasibility studies may be conducted to obtain these data. In such
cases, the term external pilot has been used [148].
In some cases, the information may not
exist prior to starting the trial, as was the case for early trials
in AIDS; that is, no incidence rates were available in an evolving
epidemic. Even in cases where data are available, other factors
affect the variability or level of response observed in a trial.
Typically, the variability observed in the planned trial is larger
than expected or the level of response is lower than assumed.
Numerous examples of this experience exist [143]. One is provided by the Physicians’ Health
Study [150]. In this trial,
22,000 U.S. male physicians were randomized into a 2 × 2 factorial
design. One factor was aspirin versus placebo in reducing
cardiovascular mortality. The other factor was beta-carotene versus
placebo for reducing cancer incidence. The aspirin portion of the
trial was terminated early in part due to a substantially lower
mortality rate than expected. In the design, the cardiovascular
mortality rate was assumed to be approximately 50% of the U.S.
age-adjusted rate in men. However, after 5 years of follow-up, the
rate was approximately 10% of the U.S. rate in men. This
substantial difference reduced the power of the trial dramatically.
In order to compensate for the extremely low event rate, the trial
would have had to be extended another 10 years to get the necessary
number of events [150]. One can
only speculate about reasons for low event rates, but screening of
potential participants prior to the entry almost certainly played a
part. That is, screenees had to complete a run-in period and be
able to tolerate aspirin. Those at risk for other competing events
were also excluded. This type of effect is referred to as a
screening effect. Physicians who began to develop cardiovascular
signs may have obtained care earlier than non-physicians. In
general, volunteers for trials tend to be healthier than the
general population, a phenomenon often referred to as the healthy
volunteer effect.
Another approach to obtaining
estimates for ultimate sample size determination is to design
so-called internal pilot studies [148]. In this approach, a small study is
initiated based on the best available information. A general sample
size target for the full study may be proposed, but the goal of the
pilot is to refine that estimate based on screening and
healthy volunteer effects. The pilot study uses a protocol very
close if not identical to the protocol for the full study, and thus
parameter estimates will reflect those effects. If the protocol for
the pilot and the main study are essentially identical, then the
small pilot can become an internal pilot. That is, the data from
the internal pilot become part of the data for the overall study.
This approach was used successfully in the Diabetes Control and
Complications Trial [151]. If
data from the internal pilot are used only to refine estimates of
variability or control group response rates, and not to revise the
assumed treatment effect, then the impact of this two-step approach on the
significance level is negligible. However, the benefit is that this
design will more likely have the desired power than if data from
external pilots and other sources are relied on exclusively
[147]. It must be emphasized that
pilot studies, either external or internal, should not be viewed as
providing reliable estimates of the intervention effect
[152]. Because power is too small
in pilot studies to be sure that no effect exists, small or no
differences may erroneously be viewed as reason not to pursue the
question. A positive trend may also be viewed as evidence that a
large study is not necessary, or that clinical equipoise no longer
exists.
Our experience indicates that both
external and internal pilot studies are quite helpful. Internal
pilot studies should be used if at all possible in prevention
trials, when screening and healthy volunteer effects seem to cause
major design problems. Design modifications based on an internal
pilot are more prudent than allowing an inadequate sample size to
yield misleading results.
One approach is to specify the number of events needed to achieve the
desired power. Obtaining the specified number of events requires
following a number of participants for a period of time. The number
of participants and the length of follow-up can be adjusted during
the early part of the trial, or during an internal pilot study, but
the target number of events does not change. This is also discussed
in more detail in Chaps. 16 and 17.
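Under the exponential or proportional hazards approximations described earlier in this chapter, the event target itself can be written down directly from the hazard ratio, independent of how many participants or how much follow-up will be needed to accumulate those events. A sketch (ours), using the earlier example of hazard rates 0.30 versus 0.20:

```python
from math import log

z_alpha, z_beta = 1.96, 1.282    # two-sided alpha = 0.05, power = 0.90
hazard_ratio = 0.20 / 0.30       # intervention versus control, as in the earlier example

# Total number of events needed with 1:1 allocation (Schoenfeld-type approximation)
events = 4 * (z_alpha + z_beta) ** 2 / log(hazard_ratio) ** 2
print(round(events))             # about 256 events
```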
Another approach is to use adaptive
designs which modify the sample size based on an emerging trend,
referred to as trend adaptive designs (see Chaps. 5 and 17). Here the sample size may be
adjusted for an updated estimate of the treatment effect, δ, using
the methods described in this chapter. However, an adjustment must
then be made at the analysis stage which may require a
substantially larger critical value than the standard one in order
to maintain a prespecified α level.
References
1.
Freiman JA, Chalmers TC,
Smith H, Jr., Kuebler RR. The importance of beta, the type II error
and sample size in the design and interpretation of the randomized
control trial. Survey of 71 “negative” trials. N Engl J Med 1978;299:690–694.
2.
Moher D, Dulberg CS, Wells
GA. Statistical power, sample size, and their reporting in
randomized controlled trials. JAMA 1994;272:122–124.
3.
Chan AW, Altman DG.
Epidemiology and reporting of randomised trials published in PubMed
journals. Lancet
2005;365:1159–1162.
4.
Halpern SD, Karlawish JH,
Berlin JA. The continuing unethical conduct of underpowered
clinical trials. JAMA
2002;288:358–362.
5.
Altman DG. Statistics and
ethics in medical research: III How large a sample? Br Med J 1980;281:1336–1338.
6.
Brown BW. Statistical
Controversies in the Design of Clinical-Trials - Some Personal
Views. Control Clin Trials
1980;1:13–27.
7.
Campbell MJ, Julious SA,
Altman DG. Estimating sample sizes for binary, ordered categorical,
and continuous outcomes in two group comparisons. BMJ 1995;311:1145–1148.
8.
Day SJ, Graham DF. Sample
size estimation for comparing two or more treatment groups in
clinical trials. Stat Med
1991;10:33–43.
9.
Donner A. Approaches to
sample size estimation in the design of clinical trials—a review.
Stat Med 1984;3:199–214.
10.
Gore SM. Statistics in
question. Assessing clinical trials—trial size. Br Med J (Clin Res Ed)
1981;282:1687–1689.
11.
Lachin JM. Introduction to
sample size determination and power analysis for clinical trials.
Control Clin Trials
1981;2:93–113.
12.
Phillips AN, Pocock SJ.
Sample size requirements for prospective studies, with examples for
coronary heart disease. J Clin
Epidemiol 1989;42:639–648.
13.
Schlesselman JJ. Planning a
longitudinal study. I. Sample size determination. J Chronic Dis 1973;26:553–560.
14.
Schouten HJ. Planning group
sizes in clinical trials with a continuous outcome and repeated
measures. Stat Med
1999;18:255–264.
15.
Streiner DL. Sample size and
power in psychiatric research. Can
J Psychiatry 1990;35:616–620.
16.
Whitehead J. Sample sizes
for phase II and phase III clinical trials: an integrated approach.
Stat Med
1986;5:459–464.
17.
Chow SC, Shao J, Wang H.
Sample size calculations in clinical research, ed 2nd. Boca
Raton, Taylor & Francis, 2008.
18.
Desu MM, Raghavarao D.
Sample Size Methodology. Boston, Academic Press, 1990.
19.
Julious SA. Sample Sizes for
Clinical Trials. Chapman and Hall, 2009.
20.
Machin D, Campbell MJ, Tan
S-B, Tan S-H. Sample Size Tables for Clinical Studies, ed 3rd.
Wiley-Blackwell, 2008.
21.
Odeh RE, Fox M. Sample Size
Choice: Charts for Experiments with Linear Models, ed 2nd. New
York, Marcel Dekker, 1991.
22.
Braunholtz DA, Edwards SJ,
Lilford RJ. Are randomized clinical trials good for us (in the
short term)? Evidence for a “trial effect”. J Clin Epidemiol 2001;54:217–224.
23.
Paynter NP, Sharrett AR,
Louis TA, et al. Paired comparison of observed and expected
coronary heart disease rates over 12 years from the Atherosclerosis
Risk in Communities Study. Ann
Epidemiol 2010;20:683–690.
24.
Brancati FL, Evans M,
Furberg CD, et al. Midcourse correction to a clinical trial when
the event rate is underestimated: the Look AHEAD (Action for Health
in Diabetes) Study. Clin
Trials 2012;9:113–124.
25.
McClure LA, Szychowski JM,
Benavente O, Coffey CS. Sample size re-estimation in an on-going
NIH-sponsored clinical trial: the secondary prevention of small
subcortical strokes experience. Contemp Clin Trials
2012;33:1088–1093.
26.
Wittes J. On changing a
long-term clinical trial midstream. Statist Med
10-15-2002;21:2789–2795.
27.
Armitage P, Berry G, Mathews
J. Statistical Methods in Medical Research, ed 4th. Malden MA,
Blackwell Publishing, 2002.
28.
Brown BW, Hollander M.
Statistics - A Biomedical Introduction. New York, John Wiley and
Sons, 1977.
29.
Dixon WJ, Massey FJ Jr.
Introduction to Statistical Analysis, ed 3rd. New York,
McGraw-Hill, 1969.
30.
Fisher L, Van Belle G.
Biostatistics - A Methodology for the Health Sciences. New York,
John Wiley and Sons, 1993.
31.
Fisher L, Van Belle G,
Heagerty PL, Lumley TS. Biostatistics - A Methodology for the
Health Sciences. New York, John Wiley and Sons, 2004.
32.
Fleiss JL, Levin B, Paik MC.
Statistical Methods for Rates and Proportions, ed 3rd. John Wiley
& Sons, Inc., 2003.
33.
Remington RD, Schork MA.
Statistics With Applications to the Biological and Health Sciences.
Englewood Cliffs, Prentice-Hall, 1970.
34.
Rosner B. Fundamentals of
Biostatistics, ed 3rd. Boston, PWS-Kent, 1990.
35.
Schork MA, Remington RD.
Statistics With Applications to the Biological and Health Sciences,
ed 3rd. Englewood Cliffs, Prentice-Hall, 2000.
36.
Snedecor GW, Cochran WG.
Statistical Methods, ed 8th. Ames, Iowa State University Press,
1989.
37.
Woolson RF, Clarke WR.
Statistical Methods for the Analysis of Biomedical Data. Wiley,
2011.
38.
Canner PL, Klimt CR. The
Coronary Drug Project. Experimental design features. Control Clin Trials
1983;4:313–332.
39.
Davis BR, Cutler JA, Gordon
DJ, et al. Rationale and design for the Antihypertensive and Lipid
Lowering Treatment to Prevent Heart Attack Trial (ALLHAT). ALLHAT
Research Group. Am J
Hypertens 1996;9:342–360.
40.
Dunnett CW. A multiple
comparison procedure for comparing several treatments with a
control. J Am Statist Assoc
1955;50:1096–1121.
41.
Costigan T. Bonferroni
inequalities and intervals; in Armitage P, Colton T (eds):
Encyclopedia of Biostatistics. John Wiley and Sons, 2007.
42.
Brittain E, Schlesselman JJ.
Optimal Allocation for the Comparison of Proportions. Biometrics 1982;38:1003–1009.
43.
The Cardiac Arrhythmia
Suppression Trial (CAST) Investigators. Preliminary report: effect
of encainide and flecainide on mortality in a randomized trial of
arrhythmia suppression after myocardial infarction. N Engl J Med 1989;321:406–412.
44.
Rothman KJ. A show of
confidence. N Engl J Med
1978;299:1362–1363.
45.
Brown BW, Hollander M.
Statistics: A Biomedical Introduction. Wiley, 2009.
46.
McHugh RB, Le CT. Confidence
estimation and the size of a clinical trial. Control Clin Trials
1984;5:157–163.
47.
Armitage P, McPherson CK,
Rowe BC. Repeated Significance Tests on Accumulating Data.
J R Stat Soc Ser A
1969;132:235–244.
48.
Bristol DR. Sample sizes for
constructing confidence intervals and testing hypotheses.
Statist Med
1989;8:803–811.
49.
Casagrande JT, Pike MC,
Smith PG. An Improved Approximate Formula for Calculating Sample
Sizes for Comparing Two Binomial Distributions. Biometrics 1978;34:483–486.
50.
Day SJ. Optimal placebo
response rates for comparing two binomial proportions. Statist Med 1988;7:1187–1194.
51.
Fleiss JL, Tytun A, Ury HK.
A Simple Approximation for Calculating Sample Sizes for Comparing
Independent Proportions. Biometrics 1980;36:343–346.
52.
Fu YX, Arnold J. A Table of
Exact Sample Sizes for Use with Fisher’s Exact Test for 2 × 2
Tables. Biometrics
1992;48:1103–1112.
53.
Gail MH, Gart JJ. The
Determination of Sample Sizes for Use with the Exact Conditional
Test in 2 x 2 Comparative Trials. Biometrics 1973;29:441–448.
54.
Gail MH. The determination
of sample sizes for trials involving several independent 2 × 2
tables. J Chronic Dis
1973;26:669–673.
55.
Haseman JK. Exact Sample
Sizes for Use with the Fisher-Irwin Test for 2 × 2 Tables.
Biometrics
1978;34:106–109.
56.
Lachenbruch PA. A note on
sample size computation for testing interactions. Statist Med 1988;7:467–469.
57.
McMahon RP, Proschan M,
Geller NL, et al. Sample size calculation for clinical trials in
which entry criteria and outcomes are counts of events.
Statist Med
1994;13:859–870.
58.
Ury HK, Fleiss JL. On
Approximate Sample Sizes for Comparing Two Independent Proportions
with the Use of Yates’ Correction. Biometrics 1980;36:347–351.
59.
Wacholder S, Weinberg CR.
Paired versus Two-Sample Design for a Clinical Trial of Treatments
with Dichotomous Outcome: Power Considerations. Biometrics 1982;38:801–812.
60.
Connor RJ. Sample Size for
Testing Differences in Proportions for the Paired-Sample Design.
Biometrics
1987;43:207–211.
61.
Donner A. Statistical
Methods in Ophthalmology: An Adjusted Chi-Square Approach.
Biometrics
1989;45:605–611.
62.
Gauderman W, Barlow WE.
Sample size calculations for ophthalmologic studies. Arch
Ophthalmol 1992;110:690–692.
63.
Rosner B. Statistical
Methods in Ophthalmology: An Adjustment for the Intraclass
Correlation between Eyes. Biometrics 1982;38:105–114.
64.
Rosner B, Milton RC.
Significance Testing for Correlated Binary Outcome Data.
Biometrics
1988;44:505–512.
65.
Hayes RL, Moulton LH.
Cluster Randomised Trials. Chapman and Hall, 2009.
66.
Cytel: SiZ. Cambridge, MA,
Cytel Software Corporation, 2011.
67.
Elashoff JD. nQuery Advisor
Version 7.0 User’s Guide. Cork, Ireland, Statistical Solutions,
2007.
68.
Pezzullo JC. Web Pages that
Perform Statistical Calculations. Computer Program 2014.
69.
R Core Team. R: A Language
and Environment for Statistical Computing. Vienna, Austria, R
Foundation for Statistical Computing, 2013.
70.
SAS Institute: Getting
Started with the SAS Power and Sample Size Application. Cary, NC,
SAS Institute Inc., 2004.
71.
Shiboski S. Power and Sample
Size Programs. Department of Epidemiology and Biostatistics,
University of California San Francisco. Computer Program
2006.
72.
StataCorp: Stata: Release
13. College Station, TX, StataCorp, 2013.
73.
TIBCO Software Inc.: SPLUS. TIBCO Software Inc., 2008.
74.
Feigl P. A Graphical Aid for
Determining Sample Size when Comparing Two Independent Proportions.
Biometrics
1978;34:111–122.
75.
Aspirin Myocardial
Infarction Study Research Group. A randomized, controlled trial of
aspirin in persons recovered from myocardial infarction.
JAMA
1980;243:661–669.
76.
Beta Blocker Heart Attack
Trial Research Group. A randomized trial of propranolol in patients
with acute myocardial infarction: I. mortality results.
JAMA
1982;247:1707–1714.
77.
CASS Principal Investigators
and Their Associates. Coronary artery surgery study (CASS): a
randomized trial of coronary artery bypass surgery. Survival data.
Circulation
1983;68:939–950.
78.
The Coronary Drug Project
Research Group. The Coronary Drug Project: Design, Methods, and
Baseline Results. Circulation 1973;47:I–1.
79.
Hypertension Detection and
Follow-up Program Cooperative Group. Five-year findings of the
hypertension detection and follow-up program: I. reduction in
mortality of persons with high blood pressure, including mild
hypertension. JAMA
1979;242:2562–2571.
80.
The Intermittent Positive
Pressure Breathing Trial Group. Intermittent Positive Pressure
Breathing Therapy of Chronic Obstructive Pulmonary Disease A
Clinical Trial. Ann Intern
Med 1983;99:612–620.
81.
Multiple Risk Factor
Intervention Trial Research Group. Multiple risk factor
intervention trial: Risk factor changes and mortality results.
JAMA
1982;248:1465–1477.
82.
Packer M, Carver JR,
Rodeheffer RJ, et al. Effect of Oral Milrinone on Mortality in
Severe Chronic Heart Failure. N
Engl J Med 1991;325:1468–1475.
83.
Barlow W, Azen S. The effect
of therapeutic treatment crossovers on the power of clinical
trials. Control Clin Trials
1990;11:314–326.
84.
Halperin M, Rogot E, Gurian
J, Ederer F. Sample sizes for medical trials with special reference
to long-term therapy. J Chronic
Dis 1968;21:13–24.
85.
Lakatos E. Sample size
determination in clinical trials with time-dependent rates of
losses and noncompliance. Control
Clin Trials 1986;7:189–199.
86.
Lavori P. Statistical
Issues: Sample Size and Dropout; in Benkert O, Maier W, Rickels K
(eds): Methodology of the Evaluation of Psychotropic Drugs.
Springer Berlin Heidelberg, 1990, pp 91–104.
87.
Newcombe RG. Explanatory and
pragmatic estimates of the treatment effect when deviations from
allocated treatment occur. Statist
Med 1988;7:1179–1186.
88.
Schork MA, Remington RD. The
determination of sample size in treatment-control comparisons for
chronic disease studies in which drop-out or non-adherence is a
problem. J Chronic Dis
1967;20:233–239.
89.
Wu MC, Fisher M, DeMets D.
Sample sizes for long-term medical trial with time-dependent
dropout and event rates. Control
Clin Trials 1980;1:111–124.
90.
Pentico DW. On the
Determination and Use of Optimal Sample Sizes for Estimating the
Difference in Means. Am
Stat 1981;35:40–42.
91.
Dawson JD, Lagakos SW. Size
and Power of Two-Sample Tests of Repeated Measures Data.
Biometrics
1993;49:1022–1032.
92.
Kirby AJ, Galai N, Munoz A.
Sample size estimation using repeated measurements on biomarkers as
outcomes. Control Clin
Trials 1994;15:165–172.
93.
Laird NM, Wang F. Estimating
rates of change in randomized clinical trials. Control Clin Trials
1990;11:405–419.
94.
Lipsitz SR, Fitzmaurice GM.
Sample size for repeated measures studies with binary responses.
Statist Med
1994;13:1233–1239.
95.
Nam J. A Simple
Approximation for Calculating Sample Sizes for Detecting Linear
Trend in Proportions. Biometrics 1987;43:701–705.
96.
Overall JE, Doyle SR.
Estimating sample sizes for repeated measurement designs.
Control Clin Trials
1994;15:100–123.
97.
Rochon J. Sample Size
Calculations for Two-Group Repeated-Measures Experiments.
Biometrics
1991;47:1383–1398.
98.
Fitzmaurice GM, Laird NM,
Ware JH. Applied longitudinal analysis, ed 2nd. John Wiley &
Sons, 2011.
99.
Cantor AB. Power estimation
for rank tests using censored data: Conditional and unconditional.
Control Clin Trials
1991;12:462–473.
100.
Emrich LJ. Required
Duration and Power Determinations for Historically
Controlled-Studies of Survival Times. Statist Med 1989;8:153–160.
101.
Freedman LS. Tables of the
number of patients required in clinical trials using the logrank
test. Statist Med
1982;1:121–129.
102.
Gail MH. Applicability of
sample size calculations based on a comparison of proportions for
use with the logrank test. Control
Clin Trials 1985;6:112–119.
103.
George SL, Desu MM.
Planning the size and duration of a clinical trial studying the
time to some critical event. J
Chronic Dis 1974;27:15–24.
104.
Halperin M, Johnson NJ.
Design and Sensitivity Evaluation of Follow-Up Studies for Risk
Factor Assessment. Biometrics 1981;37:805–810.
105.
Hsieh FY. Sample size
tables for logistic regression. Statist Med 1989;8:795–802.
106.
Lachin JM, Foulkes MA.
Evaluation of sample size and power for analyses of survival with
allowance for nonuniform patient entry, losses to follow-up,
noncompliance, and stratification. Biometrics 1986;42:507–519.
107.
Lachin JM. Biostatistical
Methods: The Assessment of Relative Risks, ed 2nd. John Wiley &
Sons, Inc., 2010.
108.
Lakatos E. Sample Sizes
Based on the Log-Rank Statistic in Complex Clinical Trials.
Biometrics
1988;44:229–241.
109.
Lui KJ. Sample size
determination under an exponential model in the presence of a
confounder and type I censoring. Control Clin Trials
1992;13:446–458.
110.
Morgan TM. Nonparametric
Estimation of Duration of Accrual and Total Study Length for
Clinical Trials. Biometrics
1987;43:903–912.
111.
Palta M, Amini SB.
Consideration of covariates and stratification in sample size
determination for survival time studies. J Chronic Dis 1985;38:801–809.
112.
Pasternack BS, Gilbert HS.
Planning the duration of long-term survival time studies designed
for accrual by cohorts. J Chronic
Dis 1971;24:681–700.
113.
Rubinstein LV, Gail MH,
Santner TJ. Planning the duration of a comparative clinical trial
with loss to follow-up and a period of continued observation.
J Chronic Dis
1981;34:469–479.
114.
Schoenfeld DA, Richter JR.
Nomograms for Calculating the Number of Patients Needed for a
Clinical Trial with Survival as an Endpoint. Biometrics 1982;38:163–170.
115.
Schoenfeld DA. Sample-Size
Formula for the Proportional-Hazards Regression Model. Biometrics 1983;39:499–503.
116.
Taulbee JD, Symons MJ.
Sample Size and Duration for Cohort Studies of Survival Time with
Covariables. Biometrics
1983;39:351–360.
117.
Wu MC. Sample size for
comparison of changes in the presence of right censoring caused by
death, withdrawal, and staggered entry. Control Clin Trials 1988;9:32–46.
118.
Zhen B, Murphy JR. Sample
size determination for an exponential survival model with an
unrestricted covariate. Statist
Med 1994;13:391–397.
119.
Pasternack BS. Sample sizes
for clinical trials designed for patient accrual by cohorts.
J Chronic Dis
1972;25:673–681.
120.
Hjalmarson A, Herlitz J,
Malek I, et al. Effect on Mortality of Metoprolol in Acute
Myocardial Infarction: A Double-blind Randomised Trial.
The Lancet
1981;318:823–827.
121.
The Norwegian Multicenter
Study Group. Timolol-Induced Reduction in Mortality and
Reinfarction in Patients Surviving Acute Myocardial Infarction.
N Engl J Med
1981;304:801–807.
122.
Nocturnal Oxygen Therapy
Trial Group. Continuous or Nocturnal Oxygen Therapy in Hypoxemic
Chronic Obstructive Lung Disease A Clinical Trial. Ann Intern Med 1980;93:391–398.
123.
Ingle JN, Ahmann DL, Green
SJ, et al. Randomized Clinical Trial of Diethylstilbestrol versus
Tamoxifen in Postmenopausal Women with Advanced Breast Cancer.
N Engl J Med
1981;304:16–21.
124.
Spriet A, Beiler D. When
can ‘non significantly different’ treatments be considered as
‘equivalent’? Br J Clin Pharmacol 1979;7:623–624.
125.
Blackwelder WC. “Proving
the null hypothesis” in clinical trials. Control Clin Trials
1982;3:345–353.
126.
Blackwelder WC, Chang MA.
Sample size graphs for “proving the null hypothesis”. Control Clin Trials
1984;5:97–105.
127.
Makuch R, Simon R. Sample
size requirements for evaluating a conservative therapy.
Cancer Treat Rep
1978;62:1037–1040.
128.
Rothmann MD, Wiens BL, Chan
ISF. Design and Analysis of Non-Inferiority Trials. Taylor &
Francis, 2011.
129.
Donner A, Birkett N, Buck
C. Randomization by cluster. Sample size requirements and analysis.
Am J Epidemiol
1981;114:906–914.
130.
Cornfield J. Randomization
by group: A formal analysis. Am J
Epidemiol 1978;108:100–102.
131.
Hsieh FY. Sample size
formulae for intervention studies with the cluster as unit of
randomization. Statist Med
1988;7:1195–1201.
132.
Lee EW, Dubin N. Estimation
and sample size considerations for clustered binary responses.
Statist Med
1994;13:1241–1252.
133.
Murray DM. Design and
Analysis of Group-randomized Trials. Oxford University Press,
1998.
134.
Urokinase Pulmonary
Embolism Trial Study Group: Urokinase-streptokinase embolism trial:
Phase 2 results. JAMA
1974;229:1606–1613.
135.
Roberts R, Croft C, Gold
HK, et al. Effect of Propranolol on Myocardial-Infarct Size in a
Randomized Blinded Multicenter Trial. N Engl J Med 1984;311:218–225.
136.
Follmann D. A Simple
Multivariate Test for One-Sided Alternatives. J Am Stat Assoc
1996;91:854–861.
137.
O’Brien PC. Procedures for
Comparing Samples with Multiple Endpoints. Biometrics
1984;40:1079–1087.
138.
Tang DI, Gnecco C, Geller
NL. Design of Group Sequential Clinical Trials with Multiple
Endpoints. J Am Stat Assoc
1989;84:776–779.
139.
Tang DI, Geller NL, Pocock
SJ. On The Design and Analysis of Randomized Clinical Trials with
Multiple Endpoints. Biometrics 1993;49:23–30.
140.
Hsu J. Multiple
Comparisons: Theory and Methods. Taylor & Francis, 1996.
141.
Miller RG. Simultaneous
Statistical Inference, ed 2nd. Springer New York, 2011.
142.
The Coronary Drug Project
Research Group. Clofibrate and niacin in coronary heart disease.
JAMA
1975;231:360–381.
143.
Church TR, Ederer F, Mandel
JS, et al. Estimating the Duration of Ongoing Prevention Trials.
Am J Epidemiol
1993;137:797–810.
144.
Ederer F, Church TR, Mandel
JS. Sample Sizes for Prevention Trials Have Been Too Small.
Am J Epidemiol
1993;137:787–796.
145.
Neaton JD, Bartsch GE.
Impact of measurement error and temporal variability on the
estimation of event probabilities for risk factor intervention
trials. Statist Med
1992;11:1719–1729.
146.
Patterson BH. The impact of
screening and eliminating preexisting cases on sample size
requirements for cancer prevention trials. Control Clin Trials 1987;8:87–95.
147.
Shih WJ. Sample size
reestimation in clinical trials; in Peace KE (ed):
Biopharmaceutical Sequential Statistical Applications. Taylor &
Francis, 1992, pp 285–301.
148.
Wittes J, Brittain E. The
role of internal pilot studies in increasing the efficiency of
clinical trials. Statist
Med 1990;9:65–72.
149.
Ambrosius WT, Polonsky TS,
Greenland P, et al. Design of the Value of Imaging in Enhancing the
Wellness of Your Heart (VIEW) trial and the impact of uncertainty
on power. Clin Trials
2012;9:232–246.
150.
Steering Committee of the
Physicians’ Health Study. Final Report on the Aspirin Component of
the Ongoing Physicians’ Health Study. N Engl J Med 1989;321:129–135.
151.
The Diabetes Control and
Complications Trial Research Group: The Effect of Intensive
Treatment of Diabetes on the Development and Progression of
Long-Term Complications in Insulin-Dependent Diabetes Mellitus.
N Engl J Med
1993;329:977–986.
152.
Davis BR, Wittes J, Pressel
S, et al. Statistical considerations in monitoring the Systolic
Hypertension in the Elderly Program (SHEP). Control Clin Trials
1993;14:350–361.