When (Not) to Stop a Clinical Trial for Benefit
Stuart J. Pocock, PhD
Author Affiliation: Medical Statistics Unit, London School of Hygiene and Tropi[...]

In this issue of JAMA, Montori and colleagues1 provide a valuable extensive and critical systematic review of clinical trials that were stopped early for benefit. Readers of the reports of such trials often feel a sense of excitement, especially when phrases such as "a major treatment advance," "ethical need to stop the inferior treatment," and "vital to tell the world immediately" are used. However, experience suggests that early results and enthusiasm, especially for modestly sized trials terminated early for apparent major benefit, are often moderated as subsequent evidence [...]

The skeptic should ask first whether correct and appropriate structures were in place for analyzing and reviewing, and making decisions based on, the trial's accumulating interim data. Having the members of an effective independent data monitoring committee (DMC) or data and safety monitoring board as the only individuals accessing and interpreting interim data split by treatment group is now considered an essential part of good practice for major randomized trials.3-5 Still, a substantial minority of reported major trials appear not to have a DMC in place.6

Second, with or without a formal DMC recommendation, another question is whether the decision to stop a trial early and report the results was an appropriate judgment. This decision should be aided by a predefined statistical stopping boundary for a primary outcome,7-9 but some trials have no such guideline. It is important that such a boundary is sufficiently stringent (eg, very strong evidence of a treatment difference with a very small P value) to match the ethical and public health implications of a decision to stop the trial. In a spirit of requiring proof beyond reasonable doubt that a treatment difference is sufficient to affect future clinical practice, some lenient statistical boundaries are not a sensible choice in the direction of benefit. For instance, the so-called Pocock boundary9 and the O'Brien-Fleming boundary's last interim look9 both typically require values around P=.02 for stopping, which is usually insufficient strength of evidence to stop a trial for benefit. Both boundaries can be made more appropriate if the overall type I error is set at 1% rather [...]

Many complex methods exist for statistical stopping boundaries, whereas in practice there is considerable merit in the simple Haybittle-Peto boundary,9 which requires P<.001 as evidence required to consider stopping a trial early for benefit. Even so, such a boundary should not be applied too soon, when few outcome events have been observed.

Decisions on early stopping (or not) need to be based on wise judgments interpreting the totality of available evidence, both in the current trial (considering primary and other efficacy outcomes and safety issues) and in other external evidence (especially from related trials).10 Accordingly, a statistical stopping boundary is only one useful objective component in an inevitably more challenging decision-making process. The ethical dilemma is to safeguard the interests of patients randomized in the current trial while also protecting society from overzealous premature claims of treatment benefit.11 For instance, if a trial is evaluating a treatment meant to be given long-term for conditions such as hypertension or chronic arthritis, short-term benefits, no matter how statistically significant, may not merit early stopping. If a trial is for regulatory approval, the sponsor and trialists should be encouraged not to stop early unless there is overwhelming evidence of treatment superiority, since the regulators require substantial evidence of both efficacy and
safety, often in at least 2 trials reaching their intended full [...]

Montori et al1 rightly draw attention to some reports of trials that were stopped early but that did not document the planned size and circumstances of the relevant interim analysis and stopping boundary. Such deficiencies need correcting by authors, peer reviewers, and editors in line with CONSORT recommendations.12 Indeed, journals should consider rejecting the report of any trial potentially stopped prematurely and lacking adequate documentation, and access to trial protocols by journals would help in making this decision. There is probably less need to present adjusted analyses that attempt to correct for the interim monitoring and early stopping, since stopping depends on more than a statistical boundary, and complexities of adjustment can clutter the presentation of results and make interpretation of the findings more difficult. Real insight rests more on a full understanding of the circumstances at the time of stopping. Also, between the moment of making the decision to stop and locking the final database used for analysis and publication, substantial additional and corrected data may become available for analysis. Indeed, such data cleaning may justify a pause before any definite decision to stop the trial.

From a reader's perspective, the key problem is whether to believe the treatment benefit is truly as great as the data imply. Montori et al1 appropriately emphasize that trials stopping early will tend to be on a "random high" of observed benefit, and if further data had been collected in either this or another trial, some "regression to the truth" to a more modest effect estimate would occur.2,13 These issues are more [...]

Montori et al reported a median of 66 events observed at the time trials were stopped. To achieve a difference between treatments that is significant at P<.001 requires a split by treatment group of at least 46 vs 20 events, which means that risk happens to be reduced by 57% or more. In most therapeutic areas, this is highly implausible and is often associated with relatively short patient follow-up time. Thus in many settings, trials should not stop so soon, because it is highly likely that the therapeutic claim is [...]

The data monitoring experience in the CHARM program in 7599 patients with heart failure provides a thought-provoking example.14 At the fourth interim analysis with a median 1-year follow-up, there were 260 vs 339 deaths in the candesartan and placebo groups, respectively, a 24% risk reduction that crossed the P<.001 stopping boundary. For several documented reasons,14 the DMC voted to continue until the next interim analysis. The treatment mortality difference was then attenuated in subsequent interim analyses so that at the trial's intended completion with a median of 3.1 years of follow-up, there were 886 deaths in the candesartan group vs 945 deaths in the placebo group, a 9% risk reduction (P = .055). Early stopping was resisted, and hence an exaggerated claim of survival benefit was avoided and important long-term benefits in other outcomes, such as cardiovascular death and heart failure hospitalization, were realized in each of the 3 component trials of the CHARM [...]

So when is it appropriate to stop a trial early? The ASCOT factorial trial's data monitoring experience provides useful insights.15,16 First, in 10305 patients with hypertension, the comparison of atorvastatin with placebo was halted when the difference in the primary end point, major coronary events, at interim analysis reached P<.001, the stopping boundary. With 100 vs 154 primary events in the atorvastatin and placebo groups, respectively, and a risk ratio of 0.64 (P = .0005), the published result was clear-cut.15 The appropriateness of stopping early was supported by other trials of statins in other populations and by important benefits in other outcomes, such as stroke.

A more difficult stopping decision arose in the ASCOT trial for the 19342 patients randomized to receive amlodipine-based and atenolol-based regimens. The predefined primary end point was major coronary events, whereas it is well known that the key effect of antihypertensive treatment is in reducing risk of stroke. Thus, when there emerged a highly significant reduction in stroke for amlodipine-based compared with atenolol-based treatment (P<.001), much debate ensued on whether to stop the trial, resulting in a decision to continue to the next interim analysis. Some months later, the trial was stopped early when there was also a significantly higher rate of mortality in the atenolol-based group, although still no significant difference existed for the primary end point. This example illustrates the complexities and tough decisions that can [...]

Can a trial be stopped on the basis of secondary end points? Perhaps not, but on occasion, such as with the ASCOT-BPLA study, results of secondary end points (327 strokes with amlodipine vs 422 with atenolol, a 23% risk reduction [P = .0003]) provide convincing evidence of great public health importance.16 In lay terms, "when early results proved so promising it was no longer fair to keep patients on the older drugs for comparison, without giving them the opportunity to change."18 However, the data in these 2 examples are more substantial compared with those in the majority of trials reviewed by Montori et al. The message is clear: most trials stopped early for benefit should not have been stopped at that point. Stopping for harm or futility is another matter19 that equally importantly requires future systematic review and comment. Inappropriate stopping of trials for commercial reasons raises additional serious concerns.

In summary, all major randomized trials should have an independent DMC that functions effectively and makes wise judgments aided by stringent statistical stopping boundaries for benefit. It is critical that the DMC, principal investigators, executive committees, and sponsors all recognize

2005 American Medical Association. All rights reserved.
