Progesterone to Prevent Recurrent Preterm Delivery – How Much Do We Know? (Part 1)

By Lynnepi

Spoiler alert: not as much as we should at this point.  I was going to write a more straightforward post, but a couple of days ago I was at an ob/gyn journal club where we discussed a fairly recent article (Oct 2019) reporting on a randomized controlled trial (RCT).  That trial was meant to confirm the positive results of earlier trials of a progestogen to prevent recurrent preterm delivery in women with a history of preterm delivery.  I’ve followed this subject since 2003, when one of the first influential trials reported success in preventing preterm delivery in a high-risk population.  The article reporting that 2003 trial, and two others that followed, provide some valuable lessons for reviewing the quality of evidence from RCTs.  I’ll cover the first trial in this post and the other two in subsequent posts. Scroll down to the bottom of this post for a quick view of the essentials (table).

First Trial – 2003 (da Fonseca, et al)

Prophylactic administration of progesterone by vaginal suppository to reduce the incidence of spontaneous preterm birth in women at increased risk: A randomized placebo-controlled double-blind study. (Am J Obstet Gynecol 2003;188:419-424)

At this same journal club we reviewed the first trial conducted by da Fonseca, et al, when it was originally published in 2003.  This was before the CONSORT reporting guideline for RCTs was widely used, so some of the methods are not sufficiently described.  The investigators randomized 157 women with high-risk singleton pregnancies, defined as history of prior preterm delivery (PTD), uterine malformation or prophylactic cervical cerclage.  The patients received vaginal suppositories, either progesterone or placebo. They self-administered these suppositories on a daily basis from the 24th to 34th week of pregnancy.  The prior PTD had occurred at an average gestational age of 33 weeks for women in both groups.  It took the investigators five years to enroll 157 women and they excluded 15 after randomization due to premature rupture of membranes (n=10), therapeutic preterm delivery (n=3), allergic reaction (n=1) or loss to follow-up (n=1) (nine progesterone, six placebo).  Technically the only patient who could have been excluded from the primary analysis was the one lost to follow-up, as the outcomes of the other patients would be known.  This relates to the principle of intention-to-treat.  More on this later.

The mean gestational age at randomization was 25.2 weeks for placebo and 26.5 weeks for progesterone (no standard deviation or indication of variability was given).  The women in the placebo group were a little more likely to have had a previous PTD (97% vs. 90%) and a little less likely to have uterine malformation or cervical cerclage, but the level of imbalance is not out of line in a small trial with 72 patients receiving progesterone and 70 placebo.

The study objective was stated as determining whether progesterone could “reduce the incidence of preterm birth.”  The investigators reported two PTD outcomes, at <37 weeks and “34 weeks” (<34 weeks?).  Nowadays you would need to pick which one was your primary outcome.  Their sample size calculation used the control group’s expected PTD rate at <37 weeks, taken from their institutional data.  More on that later.

The results came out as follows: for PTD <37 weeks, the placebo group had a rate of 28.6% (20/70), and the progesterone group’s rate was 13.9% (10/72).  This was reported as statistically significant (p=0.03), although Fisher’s exact test gives p=0.04 and the chi-squared test with continuity correction gives p=0.0527.  The discrepancy is fairly small, though.  A more dramatic difference was seen in the rate of PTD at 34 weeks (<34? ≤34?):  the placebo group rate vs. the progesterone group rate was 18.6% (13/70) vs. 2.8% (2/72), p=0.002.  (That p-value was correct.)  A comparison of survival curves for time to delivery, and an increased frequency of contractions in the placebo group, were consistent with these results.  The rate of protocol-defined preterm labor was not statistically significantly different, however (31.4% in placebo patients and 19.4% in progesterone patients, p=0.1238).
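For readers who want to check the <37 weeks p-values themselves, here is a minimal sketch using only the Python standard library.  The function names are mine, not from the paper, and the two-sided Fisher calculation uses the common "sum every table no more likely than the observed one" rule, which is an assumption about how the p=0.04 figure was obtained.

```python
from math import comb, erfc, sqrt

def chi2_p_2x2(a, b, c, d, yates=False):
    """Two-sided chi-squared p-value (1 df) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' continuity correction shrinks the difference by n/2.
        diff = max(0.0, diff - n / 2)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(stat / 2))

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x events in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    low, high = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum all tables that are no more likely than the observed table.
    return sum(prob(x) for x in range(low, high + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Placebo: 20 PTD, 50 not; progesterone: 10 PTD, 62 not.
table = (20, 50, 10, 62)
print("chi-squared, uncorrected:", round(chi2_p_2x2(*table), 4))
print("chi-squared, Yates:      ", round(chi2_p_2x2(*table, yates=True), 4))
print("Fisher exact:            ", round(fisher_exact_p(*table), 4))
```

The uncorrected chi-squared test comes out at about 0.03 and the continuity-corrected version at about 0.053, so the reported p=0.03 is at least consistent with an uncorrected chi-squared test.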

So let’s review a couple of issues which affect the quality of the evidence provided by this trial.

1. Excluding patients, whose outcomes should have been known, after randomization.  This reduces the validity of the results.  The principle of “intention-to-treat” means that the data of every trial participant who is randomized are included in the statistical analysis according to the group to which they were randomized.  When patients are eliminated from your originally randomized groups, you are essentially “unrandomizing” your study patients.  It’s very rare that patients are removed from a trial randomly, as if by lottery.  This can cause your study groups to become imbalanced in terms of their characteristics, which defeats the purpose of randomization. Excluding patients because of their outcome (as in this article) is especially bad.

Table I in the paper outlines the reasons patients were removed, by placebo vs. progesterone treatment:  premature rupture of membranes (6 progesterone, 4 placebo), therapeutic preterm delivery (2 progesterone, 1 placebo), allergic reaction (1 progesterone, 0 placebo), and loss to follow-up (0 progesterone, 1 placebo).   All of those patients could have been included in the analysis, with the exception of the patient lost to follow-up.  If we factor those patients into the analysis presented in Table III, the results look like this:

| | Placebo (n = 75) | Progesterone (n = 81) | p-value |
|---|---|---|---|
| PTD <37 weeks | 25 (33.3%) | 19 (23.5%) | 0.213 |
| PTD 34 weeks (Worst case scenario*) | 18 (24.0%) | 11 (13.6%) | 0.104 |
| PTD 34 weeks (Moderate case scenario*) | 16 (21.3%) | 7 (8.6%) | 0.040 |

*Worst case = all excluded cases delivered prior to 34 weeks.  Moderate case = 3 placebo and 5 progesterone excluded cases delivered prior to 34 weeks while the others delivered after 34 weeks.

Although the specific results (“point estimates”) still favor the progesterone group, the differences are smaller and the only comparison that is statistically significant (and just barely) is the “moderate case” scenario where not all of the excluded cases delivered prior to 34 weeks.

Fifteen out of 157 patients excluded is not a large percentage, but excluding them and not following the intention-to-treat principle had a large, problematic effect on how the results were reported. 
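The counts in the re-analysis table can be rebuilt from the numbers quoted above.  This is a small sketch with hypothetical variable names; it assumes, as the table's <37 weeks row does, that every restored patient delivered before 37 weeks, and it leaves out the one placebo patient lost to follow-up.

```python
# Counts as analysed in the paper (excluded patients already removed).
analysed = {
    "placebo": {"n": 70, "ptd37": 20, "ptd34": 13},
    "progesterone": {"n": 72, "ptd37": 10, "ptd34": 2},
}
# Excluded patients whose outcomes should be known (Table I, minus loss to follow-up).
restored = {"placebo": 5, "progesterone": 9}
# Assumed number of restored patients delivering before 34 weeks, per scenario.
scenarios = {
    "worst": {"placebo": 5, "progesterone": 9},
    "moderate": {"placebo": 3, "progesterone": 5},
}

def itt_rate(group, outcome, extra_events):
    """Return (events, n, percent) after adding restored patients back in."""
    n = analysed[group]["n"] + restored[group]
    events = analysed[group][outcome] + extra_events
    return events, n, round(100 * events / n, 1)

for group in analysed:
    print(group, "PTD <37:", itt_rate(group, "ptd37", restored[group]))
    for name, extra in scenarios.items():
        print(group, f"PTD 34 ({name}):", itt_rate(group, "ptd34", extra[group]))
```

Changing the numbers in `scenarios` is a quick way to see how sensitive the 34-week comparison is to the outcomes of just 14 restored patients.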

2. Lack of precision in study results due to small sample size.  In statistics, precision is roughly thought of as follows:  if you did another study just like the original one (same study design, same patient population, same execution, etc.), how close would the result from your next study be to the original one?  Of course, we prefer to have the next result be very similar to the original result – this equals high precision, and that we can “bank on” the results from the original study.  If you are familiar with confidence intervals, it means the confidence interval is narrow.  A clinician wants to be confident that if they change their practice because of a trial result, their patients will have outcomes similar to those from the trial.
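To make "narrow" versus "wide" concrete, here is a rough sketch of a 95% confidence interval for the difference in PTD <37 weeks rates reported in the paper (20/70 vs. 10/72).  It uses the simple normal-approximation (Wald) interval, which is my choice for illustration and not something the paper reports.

```python
from math import sqrt

def risk_difference_ci(events1, n1, events2, n2, z=1.96):
    """Wald confidence interval for the difference of two proportions."""
    p1, p2 = events1 / n1, events2 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se

diff, lower, upper = risk_difference_ci(20, 70, 10, 72)
print(f"difference {diff:.1%}, 95% CI {lower:.1%} to {upper:.1%}")
```

Under this approximation the interval runs from roughly 1 to 28 percentage points, so the trial is compatible with anything from an almost negligible benefit to a very large one.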

The investigators reported that if they enrolled 48 women in each group, they would have 90% power to find an expected difference in PTD between the placebo and progesterone groups of 20% vs. 12.5% (7.5% difference).  Ninety-percent power is high precision, and most properly done trials aim for at least 80% power.  When I read this in 2003, I knew a sample size that small was extremely unlikely to deliver 90% power, since I had calculated many sample sizes for various differences.

I calculated the appropriate sample size for a trial expecting to see 20% vs. 12.5% using a formula from a well-known textbook, and the answer came out 506 women in each group (greater than the article’s number by more than a factor of 10!).  A physician and I wrote a letter to the editor questioning the investigators’ assertion.  Our main concern was the lack of precision in study results.  In these situations, just a few patients experiencing a different outcome can cause a big change in the study’s conclusions, and it becomes hard to predict what might happen if the study were repeated (or progesterone given to other, similar, patients).
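The post does not say which textbook formula was used, so the sketch below is an assumption: it applies two standard normal-approximation formulas for comparing two proportions (two-sided alpha of 0.05, 90% power) and also estimates the power that 48 women per group would give.  The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.90, pooled=True):
    """Approximate sample size per group for comparing two proportions."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    var_alt = p1 * (1 - p1) + p2 * (1 - p2)
    if pooled:
        # Variant that uses the pooled proportion under the null hypothesis.
        p_bar = (p1 + p2) / 2
        numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar)) + z_b * sqrt(var_alt)) ** 2
    else:
        # Simpler variant that uses the same variance under both hypotheses.
        numerator = (z_a + z_b) ** 2 * var_alt
    return numerator / (p1 - p2) ** 2

def approx_power(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided test with n patients per group."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    return NormalDist().cdf(abs(p1 - p2) / se - z_a)

print("n per group (pooled):  ", round(n_per_group(0.20, 0.125), 1))
print("n per group (unpooled):", round(n_per_group(0.20, 0.125, pooled=False), 1))
print("power with 48 per group:", round(approx_power(0.20, 0.125, 48), 2))
```

Both variants land a little above 500 women per group, in line with the 506 figure, while 48 per group gives power of well under 25% under this approximation.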

Our letter was published and the first author replied to us, stating he had used “Altman’s nomogram” to determine his sample size and it had indicated a number of 48 women per group.  Doug Altman was a very well-known and respected statistician who developed several tools for non-statisticians to assist them in designing and analyzing their research.  I recently looked up Altman’s nomogram and followed the instructions for estimating the sample size using the article’s criteria. The answer was … a little over 500 women in each group.  The investigators either did not use Altman’s nomogram or did not use it correctly.  No ifs, ands or buts. 
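Altman’s nomogram is read off using a “standardized difference.”  As a hedged sketch, assuming the usual definition for two proportions (the difference divided by the square root of p̄(1 − p̄), where p̄ is the average of the two rates) and the normal-approximation formula that underlies the nomogram, the arithmetic looks like this.  The function names are mine.

```python
from math import sqrt
from statistics import NormalDist

def standardized_difference(p1, p2):
    """Standardized difference for two proportions, using their average rate."""
    p_bar = (p1 + p2) / 2
    return abs(p1 - p2) / sqrt(p_bar * (1 - p_bar))

def total_sample_size(std_diff, alpha=0.05, power=0.90):
    """Approximate total sample size (both groups) for a given standardized difference."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return 4 * (z_a + z_b) ** 2 / std_diff ** 2

d = standardized_difference(0.20, 0.125)
total = total_sample_size(d)
print(f"standardized difference {d:.3f}, total N about {total:.0f}, "
      f"per group about {total / 2:.0f}")
```

A standardized difference of about 0.2 with 90% power points to roughly 1,000 women in total, which matches the “a little over 500 women in each group” reading of the nomogram.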

So, these study results lack precision and did not follow the intention-to-treat principle.  The latter problem may have biased their results away from the “true” results.  I have a couple of other concerns about this trial which I won’t cover here for the sake of “brevity.” (This post is much longer than I thought it would be.)

Does this mean progesterone does not delay or prevent preterm delivery? No.  It means the evidence produced by this trial is not of high quality and is not particularly helpful in answering that question.

In my next post I’ll review the large trial from Meis which concluded that 17-alpha hydroxyprogesterone caproate (a synthetic progestogen not tested in the da Fonseca trial) is effective in preventing PTD in patients with a history of PTD.

The Essentials

| Concept or Issue | Description | Why It’s Important |
|---|---|---|
| Intention-to-treat | Applies to randomized trials.  Every effort is made to include data from all randomized patients in the statistical analysis, according to the group to which they were originally randomized. | Randomization assigns patients to groups based on chance, increasing the likelihood that the people across study groups are equivalent with respect to important characteristics which may affect the results.  Not following intention-to-treat may cause imbalances in study groups. |
| Study Precision | Related to study power, which is the probability that a statistical test will reject the null hypothesis (usually, “groups are not different”) if in fact the null hypothesis is wrong.  Indicates the range of possible results which could occur with repeated, identical studies. | Studies with high precision indicate that their results are reliable, i.e. would be repeated in similar circumstances.  This increases their usefulness in factoring them into clinical decisionmaking. Caveat:  a study can have high precision but be poorly designed and provide reliable, but wrong, evidence. |
| Nomogram | A visual graphic which assists in making a calculation.  Typically uses two y-axes on either side of a specifically-angled line.  The answer is determined by where a line drawn from one y-axis to the other y-axis intersects the specifically-angled line. From the Greek for “law-line.” | Nomograms are convenient in that they remove the need to carry out tedious calculations. The answer can often be obtained visually and quickly. |

References:

Da Fonseca EB, Bittar RE, Carvalho MHB, Zugaib M.  Prophylactic administration of progesterone by vaginal suppository to reduce the incidence of spontaneous preterm birth in women at increased risk: A randomized placebo-controlled double-blind study.  Am J Obstet Gynecol 2003;188:419-424. PMID: 12592250

O’Shaughnessy RW, Shaffer LET.  Supplemental progesterone to prevent preterm birth. Am J Obstet Gynecol 2004 Jun;190(6):1800-1. PMID: 15290821

Whitley E, Ball J.  Statistics review 4: sample size calculations.  Crit Care 2002 Aug;6(4):335-341. PMID: 12225610