Wanted: scientific immune system to identify weak studies getting lots of attention

Apr 23, 2018

To keep the scientific enterprise oriented toward finding truth, it is important to reduce the impact of problematic studies in the scientific literature. Studies can be problematic in many ways (e.g. a social science study lacking a control for genetic confounding), but one problem that is relatively simple to identify automatically is low precision due to small sample size. Since such studies are too imprecise to tell us much about reality, they should be given very little attention. Unfortunately, something like the opposite may be true: small studies with flashy results are given worldwide media attention.

Here's my idea. We need a database, preferably public, of all scientific papers with full text as they are published. It doesn't need complete backlog coverage, because we are just trying to reduce incoming damage to the literature from uninformative papers; it is too late for the old ones. As each new paper comes out, it is put in the database along with metadata extracted from it, such as sample sizes, statistical tests, standard errors, and whatever other information one can find. Then we calculate some kind of overall informativeness score for the paper, which could be something as simple as the replication index or median observed power. We also monitor every paper's altmetrics score, which tracks attention to papers. Then we identify papers with weak statistics and high attention. A scientific team reviewing the flagged papers can then write a specific response to each one in an attempt to reduce its impact on the literature.
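Here is a minimal sketch of the scoring and flagging step in Python, standard library only. Everything in it is an assumption made for illustration: the `Paper` record, the `attention` field standing in for an altmetrics-style score, and both thresholds are hypothetical, and observed power is computed with a normal approximation from two-sided p values rather than from each paper's actual test statistics.

```python
from dataclasses import dataclass
from statistics import NormalDist, median
from typing import List

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)  # critical z for two-sided alpha = .05


def observed_power(p: float) -> float:
    """Observed power implied by a two-sided p value (normal approximation)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    z = STD_NORMAL.inv_cdf(1 - p / 2)
    return 1 - STD_NORMAL.cdf(Z_CRIT - z) + STD_NORMAL.cdf(-Z_CRIT - z)


def median_observed_power(p_values: List[float]) -> float:
    return median(observed_power(p) for p in p_values)


@dataclass
class Paper:  # hypothetical record, not a real database schema
    title: str
    p_values: List[float]  # focal p values extracted from the full text
    attention: float       # hypothetical altmetrics-style attention score


def flag_papers(papers, max_power=0.6, min_attention=100.0):
    """Return papers with weak statistics AND high attention.

    Both cut-offs are arbitrary placeholders, not recommended values.
    """
    return [
        paper for paper in papers
        if median_observed_power(paper.p_values) < max_power
        and paper.attention >= min_attention
    ]


papers = [
    Paper("small flashy study", [0.04, 0.03, 0.049], attention=900),
    Paper("small ignored study", [0.04, 0.045], attention=5),
    Paper("large popular study", [0.0001, 0.0005], attention=1500),
]
flagged = flag_papers(papers)
print([paper.title for paper in flagged])
```

With these made-up inputs only the small, high-attention paper is flagged. A real system would need a much better informativeness score than this, but the structure (score, attention, intersection of the two) is the whole idea.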

I'll give two case studies of what I have in mind.

Yet another trans-generational epigenetics study

BBC: Exercise benefits to the brain 'may be passed on'
Cell Reports: RNA-Dependent Intergenerational Inheritance of Enhanced Synaptic Plasticity after Environmental Enrichment
The altmetrics info shows a lot of attention. Google finds a lot more coverage in big mainstream newspapers, but, like the BBC above, they don't even bother to link to the study, so it's not easy to tie the coverage to the paper.

So what's the study? Besides being a trans-generational epigenetics study, itself a red flag, the study does not list the sample size anywhere, not even in the methods section. However, we can infer it from the reported degrees of freedom, which are 10. My guess is that it is a balanced 2x6 design (two groups of 6), i.e. a study of 12 mice is causing worldwide media attention! Looking at the stats makes us even more worried. There are 10 p values reported exactly: p = 0.17, p = 0.10, p = 0.07, p = 0.018, p = 0.04, p = 0.01, p = 0.01, p = 0.28, p = 0.19, p = 0.08. All values fall between 0.01 and 0.28, which is very suspicious! We can't do a formal test for too little variation here because some of these tests concern related hypotheses, so the values are correlated by design, violating the independence assumption of the TIVA test. But we are still pretty skeptical, because a study of n = 12 can produce pretty much any result imaginable and is basically useless.
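The two checks above can be sketched in a few lines of Python. This is only a descriptive look, not a formal TIVA test: the function names are my own, the sample size formula assumes an independent-groups t test (df = N minus the number of groups), and the z-scores come from a normal approximation to the two-sided p values.

```python
from statistics import NormalDist, variance

# The ten exactly reported p values from the paper, as listed above.
reported_p = [0.17, 0.10, 0.07, 0.018, 0.04, 0.01, 0.01, 0.28, 0.19, 0.08]


def total_n_from_df(df: int, groups: int = 2) -> int:
    """Total N for an independent-groups t test, where df = N - groups."""
    return df + groups


def z_from_p(p: float) -> float:
    """Absolute z-score corresponding to a two-sided p value."""
    return NormalDist().inv_cdf(1 - p / 2)


z_scores = [z_from_p(p) for p in reported_p]
z_variance = variance(z_scores)  # sample variance of the z-scores

print("implied total N:", total_n_from_df(10))
print("variance of z-scores:", round(z_variance, 2))
# For independent tests the variance of z-scores should be about 1 or more;
# a much smaller value is what TIVA treats as a warning sign. Because these
# tests are not independent, read the number as descriptive only.
```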

Trusting co-partisans more

Epistemic Spillovers: Learning Others’ Political Views Reduces the Ability to Assess and Use Their Expertise in Nonpolitical Domains

The study is getting attention from the heterodox community on Twitter (complete listing of people posting the link on Twitter):

https://twitter.com/JonHaidt/status/987782578714312704

The methods section:

2.1. Participants American residents over 18 years of age who speak English were recruited on Amazon Mechanical Turk. All participants provided demographic information (see supplementary materials). 154 participants completed the first part of the task (Learning Stage), out of which 97 participants (34 females and 63 males, aged 20-58 years M = 34.81, SD = 9.59) completed the second stage (Choice Stage). All participants were paid $2.50 for completing the first stage of the experiment and were told they could earn a bonus of $2.50 to $7.50 based on their performance. Thus, they had an incentive to perform well. Because in reality participant performance was held constant at 50% all participants who completed the entire experiment were paid a $5 bonus.

In other words, this is a design for which they could easily have collected more data. Why stop at such a small sample? That makes me very suspicious of optional stopping. The splashy result is described as "t(96) = -2.10, p = .038, d = -.37". I suggest there is no reason to read the rest of the paper until they bump the sample size.
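To show why optional stopping is worth worrying about in general, here is a small simulation sketch in Python. It says nothing about what these authors actually did. All of the settings are my own assumptions: the true effect is zero, the test is a z test with known SD = 1, the maximum sample is 150, and the hypothetical peeking researcher checks after every 10 participants starting at 20.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided alpha = .05


def significant_at_some_look(rng, looks):
    """Simulate one null study; True if any look crosses |z| > Z_CRIT."""
    total = 0.0
    n = 0
    for look in looks:  # looks must be increasing sample sizes
        while n < look:
            total += rng.gauss(0.0, 1.0)  # null data: mean 0, SD 1
            n += 1
        z = total / n ** 0.5  # sample mean times sqrt(n), known SD = 1
        if abs(z) > Z_CRIT:
            return True  # researcher stops and reports "significant"
    return False


def false_positive_rate(looks, n_sims=2000, seed=1):
    rng = random.Random(seed)
    hits = sum(significant_at_some_look(rng, looks) for _ in range(n_sims))
    return hits / n_sims


fixed_rate = false_positive_rate([150])                        # one test at the end
peeking_rate = false_positive_rate(list(range(20, 151, 10)))   # test at every look

print("fixed-n false positive rate:", fixed_rate)
print("peeking false positive rate:", peeking_rate)
```

The fixed-n rate should land near the nominal 5%, while the peeking rate should come out well above it, which is why a conveniently small stopping point with p just under .05 is a warning sign.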

Just Emil Kirkegaard Things
