A Science Paper Gets It Wrong (IMHO)

A study published in Science suggests that an early childhood intervention program can have lasting effects on adult health. If true, this is an incredibly worthy investment. But the data may not support the conclusions.


I was once told that the most important question in epidemiology is "What is the question?" The meaning behind this maxim is that you need to know exactly what you are asking of the data before you ask it. Then you determine whether your data can answer that question.

In this paper, published in Science in 2014, researchers had a great question:  Would an intensive, early-childhood intervention focusing on providing education, medical care, and nutrition lead to better health outcomes later in life? 

The data they used to answer this question might appear promising at first, but looking under the surface, one can see that the dataset can't handle what is being asked of it. This is not a recipe for a successful study, and the researchers’ best course of action might have been to move on to a new dataset, or a new question.

What the authors of this Science paper did was to torture the poor data until it gave them an answer.

Let's take a closer look.

The researchers used data from a randomized trial called the Abecedarian study.  The study randomized 122 children to control (usual care) or a 3-pronged early childhood intervention (comprising an educational, a health care, and a nutritional component). 

In their mid-30s, some of these children visited a physician and had a constellation of measurements taken (blood pressure, weight, cholesterol, etc.).

The question: was randomization to the intensive childhood intervention associated with better health in terms of these measurements?  Per the researchers: "we find that disadvantaged children randomly assigned to treatment have significantly lower prevalence of risk factors for cardiovascular and metabolic diseases in their mid-30's."

Per the data: "Aaaaaahhhh!  Please stop!!!"

Here are the issues, in two very red flags:

Red Flag 1: The study doesn't report the analyzed sample size.

I couldn't quite believe this when I read the paper the first time. In the introduction, I read that 57 children were assigned to the intervention and 54 to control. But then I read that there was substantial attrition between enrollment and age 30 (as you might expect), and all the statistical tests are done at age 30. I had to go deep into the supplemental files to find out that, for example, they had lab data on only 12 of the 23 males in the control group (52%) and 20 of the 29 males in the treatment group (69%). That's a very large loss to follow-up. It's also a differential loss to follow-up, meaning more people were lost in one group (the controls, in this case) than in the other (treatment). If this loss is due to different reasons in the two groups (it likely is), you lose the benefit of randomizing in the first place.

The authors state that they account for this using inverse probability weighting. The idea is that you build a model predicting each participant's chance of following up in their mid-30s; men with fewer siblings were more likely to follow up, for example. Then you take all the people who did follow up and weight their data according to how unlikely it was that they would have done so. The "surprise" patients get extra weight, because they now need to represent all those people who didn't show up. This sounds good in theory, but it depends entirely on how good your model of who will follow up is. And, as you might expect, predicting who will show up for a visit 30 years after the fact is a tall order. Without a good model, inverse probability weighting doesn't help at all.
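To make the weighting idea concrete, here is a toy sketch of inverse probability weighting on a simulated cohort. The numbers, covariate, and model are all hypothetical (not the Abecedarian data, and not the paper's actual procedure); the point is only the mechanics, and the fact that the correction hinges on knowing the follow-up probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy cohort (hypothetical numbers, not the Abecedarian data).
# Outcome: systolic blood pressure; covariate: number of siblings,
# which here influences both the outcome and the chance of follow-up.
n = 1000
siblings = rng.integers(0, 5, size=n)
bp = 120 + 2.0 * siblings + rng.normal(0, 5, size=n)

# True follow-up probability: fewer siblings -> more likely to return.
p_followup = 1 / (1 + np.exp(-(1.5 - 0.6 * siblings)))
followed = rng.random(n) < p_followup

# Naive analysis of only those who showed up is biased, because the
# follow-up sample over-represents low-sibling (lower-BP) people.
naive_mean = bp[followed].mean()

# Inverse probability weighting: weight each observed person by
# 1 / P(follow-up). Here we cheat and use the true probabilities; in a
# real study you must *estimate* them from covariates, and the
# correction is only as good as that model.
weights = 1 / p_followup[followed]
ipw_mean = np.average(bp[followed], weights=weights)

print(f"full-cohort mean     {bp.mean():.2f}")
print(f"naive follow-up mean {naive_mean:.2f}")
print(f"IPW-corrected mean   {ipw_mean:.2f}")
```

In this simulation the weighting works because the code uses the true follow-up probabilities. Replace them with a poorly estimated model, or one missing the covariates that actually drive follow-up, and the "corrected" estimate inherits the bias.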

In the end, the people who showed up to this visit self-selected. The results may have been entirely different if the nearly 50% of individuals who were lost to follow-up had been included.

Red Flag 2: Multiple comparisons accounted for! (Not Really).

Referring to challenges of this type of analysis, the authors write this in their introduction:

"Numerous treatment effects are analyzed. This creates an opportunity for "cherry picking" – finding spurious treatment effects merely by chance if conventional one-hypothesis-at-a-time approaches to testing are used. We account for the multiplicity of the hypotheses being tested using recently developed stepdown procedures".

Translation: We are testing a lot of things. False positives are an issue.  We'll fix the issue using this new statistical technique.

The stepdown procedure they refer to does indeed account for multiple comparisons, but only if you use it on… well… all your comparisons. The authors don't do this. Instead, they divide their many comparisons into "blocks," most of which contain only two or three variables. Vitamin D deficiency, for example, stands all alone in its block, so its p-value gets "adjusted" from 0.021 to 0.021. In other words, no adjustment at all is made for the fact that it is one of many things being tested. Correcting your findings for multiple comparisons only works if you account for all the comparisons.
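To see why the block size matters, here is a sketch using Holm's step-down procedure, a simple classical stepdown correction (the paper uses a related, resampling-based stepdown method; Holm is used here only to illustrate the mechanics). The 0.021 is the vitamin D p-value from the paper; the other nine raw p-values are made up to stand in for the other outcomes tested.

```python
def holm_adjust(pvals):
    """Return Holm step-down adjusted p-values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        # and adjusted p-values are forced to be non-decreasing.
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

# Vitamin D's 0.021 is from the paper; the other nine are hypothetical.
raw = [0.021, 0.04, 0.08, 0.15, 0.30, 0.45, 0.55, 0.70, 0.85, 0.95]

# "Adjusted" within a block of one: nothing happens.
print(f"{holm_adjust([0.021])[0]:.3f}")  # 0.021
# Adjusted over the whole family of ten tests: 10 * 0.021.
print(f"{holm_adjust(raw)[0]:.3f}")      # 0.210 -- no longer below 0.05
```

The procedure is doing exactly what it is told: correcting within the family of tests it is handed. Hand it a family of one, and the "correction" is a no-op.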

Where does all this leave us? Right where we started: with a really interesting question and no firm answer. Do early childhood interventions lead to better health later in life? Maybe. I can't tell from this study. And that's sad, because if the hypothesis is true, it's really important.

For a rejoinder from author (and Nobel Laureate) James Heckman, please click here.

This post was originally distributed as part of the Evidence Initiative, and appears on the website developed and maintained by the Laura and John Arnold Foundation ("LJAF"). LJAF created the site as part of its broader effort to encourage governments and nonprofit organizations to help build the evidence base for social interventions and to consider reliable evidence as one of the primary factors in their decisions. Cited material should be attributed to LJAF.  For more information about LJAF, please visit www.arnoldfoundation.org.