“It has been claimed and demonstrated that many (and possibly most) of the conclusions drawn from biomedical research are probably false.”
Today’s paper: Button, Ioannidis, et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience.
NOTE: low statistical power means that “the chance of discovering effects that are genuinely true is low”. It also means that “low-powered studies produce more false negatives than high-powered studies.”
A study with low statistical power has a reduced chance of detecting a true effect, but it is less well appreciated that low power also reduces the likelihood that a statistically significant result reflects a true effect. Here, we show that the average statistical power of studies in the neurosciences is very low. The consequences of this include overestimates of effect size and low reproducibility of results. There are also ethical dimensions to this problem, as unreliable research is inefficient and wasteful. Improving reproducibility in neuroscience is a key priority and requires attention to well-established but often ignored methodological principles.
Quotes & Notes
Bias built into the system:
A central cause for this important problem is that researchers must publish in order to succeed, and publishing is a highly competitive enterprise, with certain kinds of findings more likely to be published than others.
Even if everything else about a study is sound, when statistical power is low we run into three big problems (illustrated in the simulation sketch after this list):
- “the low probability of finding true effects”;
- “the low positive predictive value when an effect is claimed”;
- “an exaggerated estimate of the magnitude of the effect when a true effect is discovered”.
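To make these three problems concrete, here is a minimal simulation sketch (my own illustration, not from the paper). It assumes a true effect of Cohen's d = 0.5 whenever an effect exists, a hypothetical 1-in-10 chance that any probed effect is real, and small two-group t-tests with 15 subjects per group:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N_PER_GROUP = 15   # small sample size, chosen to illustrate low power
D_TRUE = 0.5       # assumed true effect size when an effect exists
P_REAL = 0.10      # assumed chance that any probed effect is real (~1:9 odds)
ALPHA = 0.05
N_SIMS = 20_000

is_real = rng.random(N_SIMS) < P_REAL
sig = np.zeros(N_SIMS, dtype=bool)
est_d = np.zeros(N_SIMS)
for i in range(N_SIMS):
    d = D_TRUE if is_real[i] else 0.0
    treat = rng.normal(d, 1, N_PER_GROUP)  # treatment group
    ctrl = rng.normal(0, 1, N_PER_GROUP)   # control group
    _, p = stats.ttest_ind(treat, ctrl)
    sig[i] = p < ALPHA
    est_d[i] = treat.mean() - ctrl.mean()  # crude effect estimate (SD is 1 by construction)

print(f"power: {sig[is_real].mean():.2f}")  # problem 1: ~0.26, far below the usual 0.80 target
print(f"PPV:   {is_real[sig].mean():.2f}")  # problem 2: ~0.37 -- most "hits" are false positives
print(f"inflation: {est_d[is_real & sig].mean() / D_TRUE:.1f}x")  # problem 3: ~1.9x exaggeration
```

With only 15 subjects per group, the simulated studies catch the real effect about a quarter of the time, most “significant” results are false positives, and the effects that do reach significance look almost twice as large as they really are.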
Low power is also “associated with other biases”:
- “low-powered studies are more likely to provide a wide range of estimates of the magnitude of an effect”;
- “publication bias, selective data analysis and selective reporting of outcomes are more likely to affect low-powered studies”;
- “small studies may be of lower quality in other aspects of their design as well”.
What about neuroscience, specifically?
Our results indicate that the median statistical power in neuroscience is 21%.
What are the implications?
Our results indicate that the average statistical power of studies in the field of neuroscience is probably no more than between ~8% and ~31%, on the basis of evidence from diverse subfields within neuroscience. If the low average power we observed across these studies is typical of the neuroscience literature as a whole, this has profound implications for the field. A major implication is that the likelihood that any nominally significant finding actually reflects a true effect is small.
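A quick back-of-the-envelope check of that claim (my arithmetic, not the paper’s table): plug the reported 21% median power into the standard positive predictive value formula, PPV = (1 − β)R / ((1 − β)R + α), where R is the pre-study odds that a probed effect is real (the paper also models a bias term, omitted here):

```python
power, alpha = 0.21, 0.05
for R in (1.0, 0.25, 0.10):  # hypothetical pre-study odds of 1:1, 1:4, 1:10
    ppv = power * R / (power * R + alpha)
    print(f"pre-study odds {R}: PPV = {ppv:.2f}")
# -> 0.81 at 1:1 odds, 0.51 at 1:4, 0.30 at 1:10
```

Unless a field only ever tests hypotheses that are already even-odds to be true, a large share of its “significant” findings will be false positives.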
There are also ethical implications, given that many neuroscience studies involve animals that are killed in the course of the research.
We argue that it is important to appreciate the waste associated with an underpowered study — even a study that achieves only 80% power still presents a 20% possibility that the animals have been sacrificed without the study detecting the underlying true effect. If the average power in neuroscience animal model studies is between 20–30%, as we observed in our analysis above, the ethical implications are clear.
Low power therefore has an ethical dimension — unreliable research is inefficient and wasteful. This applies to both human and animal research. The principles of the ‘three Rs’ in animal research (reduce, refine and replace) require appropriate experimental design and statistics — both too many and too few animals present an issue as they reduce the value of research outputs. A requirement for sample size and power calculation is included in the Animal Research: Reporting In Vivo Experiments (ARRIVE) guidelines, but such calculations require a clear appreciation of the expected magnitude of effects being sought.
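As a reference point, here is a minimal sketch of the kind of a priori sample-size calculation those guidelines ask for, using statsmodels; the effect size and targets are illustrative assumptions, not numbers from the paper:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# assumed targets: 80% power to detect a medium effect (Cohen's d = 0.5) at alpha = 0.05
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"subjects per group: {n:.0f}")  # ~64

# halve the expected effect and the required sample roughly quadruples
print(analysis.solve_power(effect_size=0.25, alpha=0.05, power=0.8))  # ~252
```

That sensitivity is exactly why the authors stress a clear appreciation of the expected magnitude of effects: an honest power calculation is only as good as the effect-size estimate that goes into it.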
Small, low-powered studies are endemic in neuroscience. Nevertheless, there are reasons to be optimistic.
Why? Because the onslaught of studies like this one exposing the problem has pushed many fields to take it seriously and attempt to correct it.
Some fields are confronting the problem of the poor reliability of research findings that arises from low-powered studies. For example, in genetic epidemiology sample sizes increased dramatically with the widespread understanding that the effects being sought are likely to be extremely small.
But don’t get complacent!
Nevertheless, we should not assume that science is effectively or efficiently self-correcting. There is now substantial evidence that a large proportion of the evidence reported in the scientific literature may be unreliable. Acknowledging this challenge is the first step towards addressing the problematic aspects of current scientific practices and identifying effective solutions.
Now go lift something heavy,