On the Interpretation of "Negative" Trials

(Video Version)

Today we’re talking about negative clinical trials.  This discussion / rant is inspired by this article, appearing in the New England Journal of Medicine:

More detail, but no catchy pictures.  You be the judge.

The article, which you should read, by the way, is much more in depth than I can be in a single blog post, but I will give you the gist and my own take on the issue.

So, here’s the deal.  We often say that a clinical trial “failed”, meaning the drug that was being tested did not perform statistically better than placebo.  We also call that a “negative” trial.  A “positive trial” is a good thing – one of the few places in medicine, aside from the letter from Donald Trump’s doctor, where a “positive finding” is something you want to get.

I feel like Teddy Roosevelt was pretty robust.

Now, we label a trial as positive or negative based upon, generally, one test – a test of statistical significance of the primary outcome.  And we set a threshold for statistical significance at a p-value of 0.05. Sound good?

www.xkcd.com: if you read my blog, and don't read this comic, you may want to consider changing allegiances.

It’s not. For many reasons. For a detailed description of why, check out a prior rant here.

First, as the authors of the NEJM paper note, describing a trial as “negative” is often misleading. The literature is rife with underpowered studies that didn’t enroll enough people to rule out a clinically meaningful effect of the drug. Better to describe such studies as “uninformative”. How do you tell the difference?  Decide on the smallest effect size that would be clinically important. If the 95% confidence interval includes that number, the study was underpowered; if it doesn’t, the study is truly negative.
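
If you like code better than prose, here’s a minimal sketch of that logic. The function name, the normal approximation, and the example numbers are mine, purely for illustration, not from the paper.

```python
def classify_failed_trial(effect, std_err, mcid):
    """Illustrative sketch: label a trial whose primary outcome missed
    statistical significance as truly negative or merely uninformative.
    Assumes a normal approximation and that bigger effects are better;
    'mcid' is the minimal clinically important difference, chosen up front."""
    lower = effect - 1.96 * std_err
    upper = effect + 1.96 * std_err
    if lower > 0:
        return "statistically significant"
    if upper >= mcid:
        return "uninformative: the CI still includes a clinically important benefit"
    return "negative: the CI rules out a clinically important benefit"

print(classify_failed_trial(effect=0.5, std_err=0.6, mcid=2.0))  # negative
print(classify_failed_trial(effect=0.5, std_err=1.5, mcid=2.0))  # uninformative
```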

I don't mean to imply with this logo that The Methods Man has a negative effect, though it hasn't been rigorously tested.

We should also note the arbitrariness of the 0.05 threshold for the p-value, a point the authors correctly make as well.

A bit of Bayes’ theorem highlights this issue even further.  If you had 50% confidence in the efficacy of a drug before the study, and the p-value in the trial was 0.04, you should be 74% confident in the drug’s efficacy after the trial. If the p-value was 0.01, you’d be 89% confident. The equation to figure this out appears below, but take my word for it.
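
For the curious, one standard calibration that reproduces those numbers is the Sellke-Bayarri-Berger minimum Bayes factor, -e·p·ln(p) for p < 1/e. I won’t swear that’s the exact formula behind the figures above, but here’s a quick sketch of the arithmetic:

```python
import math

def posterior_confidence(p_value, prior_prob=0.5):
    """Turn a p-value into an approximate posterior probability that the drug
    works, using the Sellke-Bayarri-Berger minimum Bayes factor -e*p*ln(p),
    valid for p < 1/e. Illustrative; other calibrations exist."""
    bayes_factor_null = -math.e * p_value * math.log(p_value)  # evidence favoring "no effect"
    prior_odds_null = (1 - prior_prob) / prior_prob
    posterior_odds_null = prior_odds_null * bayes_factor_null
    return 1 / (1 + posterior_odds_null)  # posterior probability the drug works

for p in (0.04, 0.01):
    print(f"p = {p}: {posterior_confidence(p):.0%} confident the drug works")
# p = 0.04: 74% confident the drug works
# p = 0.01: 89% confident the drug works
```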

In other words, we should take the degree of statistical significance and allow that to inform our future confidence in the drug. Taking a binary positive/negative approach misses this point entirely.

Finally, a word on secondary outcomes. The primacy of the primary outcome is due in large part to concerns about multiple comparisons.  Without a pre-specified primary outcome, a researcher could check multiple outcomes and just report on whichever one happens to give the best results.  All-cause mortality doesn’t work?  How about cardiovascular mortality? No?  How about myocardial infarction?

We often tell our students that if you check 20 outcomes, statistically speaking, at least one is likely to be a false positive.  Here’s a little graph that shows, over 10,000 simulated studies, the number of false positive outcomes you’d get if you checked 20 total.

Assumption: All outcomes are completely independent.
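
If you want to run that simulation yourself, here’s a quick sketch, assuming (as the caption says) 20 completely independent outcomes, each tested at the usual 0.05 threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
n_studies, n_outcomes, alpha = 10_000, 20, 0.05

# Under the null hypothesis (no real effect on any outcome), each p-value
# is uniform on [0, 1], so every "significant" result is a false positive.
p_values = rng.uniform(size=(n_studies, n_outcomes))
false_positives = (p_values < alpha).sum(axis=1)

print(f"Studies with at least one false positive: {(false_positives >= 1).mean():.0%}")
# ~64%, matching 1 - 0.95**20
print(f"Average false positives per study: {false_positives.mean():.2f}")
# ~1.0, i.e. 20 * 0.05
```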

This is true, but only if the secondary outcomes are completely independent of each other.  Think of it this way: let's say we want to compare hand size in the current presidential candidates:

Note: We should not do this.

If we measured in both inches and centimeters, we wouldn’t really penalize ourselves for multiple comparisons – inches and centimeters are 100% correlated.

I like to measure hand size in cm to sound more impressive.
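
Push the inches-and-centimeters idea to the extreme: if all 20 “outcomes” were really the same measurement reported 20 ways, checking every one of them wouldn’t inflate the false-positive rate at all. A quick sketch continuing the simulation above (again, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_studies, n_outcomes, alpha = 10_000, 20, 0.05

# Extreme case: 20 perfectly correlated outcomes (the same measurement in
# 20 different units), so each study really has just one underlying p-value.
p_underlying = rng.uniform(size=n_studies)
p_values = np.repeat(p_underlying[:, None], n_outcomes, axis=1)

print(f"Studies with at least one false positive: {(p_values < alpha).any(axis=1).mean():.0%}")
# ~5%, the same as testing a single outcome, not the ~64% seen with independence
```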

This matters because secondary outcomes are often highly correlated. So, in certain circumstances, when the primary outcome is negative we should look to those secondary outcomes: if they seem to be telling a coherent, biologically plausible story, we might not want to dismiss that new drug out of hand.

Hopefully this wasn’t too depressing a glimpse into the world of clinical trial outcomes.  If it was, well, try to stay positive.