Maybe you've heard about the huge psychological replication study that was released on August 28th. The headline finding is this: 270 psychologists jointly attempted to replicate the results of 100 recent studies in three top psychology journals. In fewer than 40% of cases were the researchers able to replicate the originally reported effect.
But here's perhaps an even more telling finding: Only 47% of the originally reported effect sizes were within the 95% confidence interval of the replication effect size. In other words, if you just used the replication study as your basis for guessing the real effect size, you would not expect the real effect size to be as large as the effect size originally reported. [Note 1] This reported result inspired me to look at the raw data, to try a related analysis that the original replication study did not appear to report: What percentage of the replications find a significantly lower effect size than the original reported study? By my calculations: 36/95, or 38%. [See Note 2 for what I did.]
(A rather surprising additional ten studies showed statistically marginal trends toward lower effect sizes, which it is tempting to interpret as a combination of a non-effect with a poorly powered original study or replication. A representative case is one study with an effect size of r = .22 on 96 participants and a replication effect size of r = .02 on 108 participants (one-tailed p value for difference between the r's = .07). Thus, it seems likely that 38% is a conservative estimate of the tendency toward lower effect size in replications.)
This study-by-study statistical comparison of effect sizes is useful because it helps distinguish the file drawer problem from what we might call the invisible factor problem.
The file drawer problem is this: Researchers are more likely to publish statistically significant findings than findings that show no statistically significant effect. Chance results will sometimes reach statistical significance, and if it is mostly these results that get published, it might look like there's a real effect when actually there is none.
The invisible factor problem is this: There are a vast number of unreported features of every experiment. Possibly one of those unreported features, invisible to the study's readership, is an important contributor to the reported findings. In infancy research for example, it's not common to report the experimenter's pre-testing interactions with the infant, if any, but pre-testing interactions might have a big effect. In cognitive research, it's not common to report what time of day participants performed the tasks, but time of day can influence arousal and performance. And so on.
The file drawer problem is normally managed in meta-analysis by assuming a substantial number of unpublished null-result studies (maybe five times as many as the published studies) and then seeing if the result still proves significant in a merged analysis. But this is only an adequate approach if the only risk to be considered is a chance tendency for a non-effect to show up as significant in some studies. If, on the other hand, there are a large number of invisible factors, or moderators, that dependably confound studies, leading to statistically significant positive results other than by chance, standard meta-analytic file-drawer compensations will not suffice. The invisible factors might be large and non-chance, unintentionally sought out and settled upon by well-meaning researchers, perhaps even passed along teacher to student. ("It works best if you do it like this.")
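The file-drawer compensation described above can be sketched in a few lines. This is only an illustration, assuming Stouffer's method as the way of merging studies (one common choice, not necessarily the one any given meta-analysis uses). The z scores are hypothetical, and the function name `stouffer_z` is made up for this sketch.

```python
import math

def stouffer_z(z_scores, n_unpublished_nulls=0):
    """Combine study z scores with Stouffer's method.

    Each assumed unpublished null-result study contributes z = 0,
    so it adds nothing to the sum but enlarges the study count.
    """
    total_studies = len(z_scores) + n_unpublished_nulls
    return sum(z_scores) / math.sqrt(total_studies)

# Hypothetical z scores for four published studies (not real data).
published = [2.1, 2.3, 1.9, 2.5]

# Assume five times as many unpublished null results sit in file drawers.
drawer = 5 * len(published)

z_pub = stouffer_z(published)
z_adj = stouffer_z(published, drawer)

# One-tailed .05 threshold for a standard normal z.
survives = z_adj > 1.645

print(f"published only: z = {z_pub:.2f}")
print(f"with {drawer} assumed nulls: z = {z_adj:.2f}, still significant: {survives}")
```

Notice what the correction does and doesn't do: it dilutes the published results with assumed chance-level studies. It has nothing to say about published results that are significant for non-chance reasons.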
Here's how I think psychological research sometimes goes. You try an experiment one way and it "fails" -- that is, it doesn't produce the hoped-for result. So you try another way and it fails again. So then you try a third way and it succeeds. Maybe to make sure it's not chance, you do it that same way again and it still succeeds, so you publish. But there might be no real underlying effect of the sort you think there is. What you might have done is find the right set of moderating factors (time of day, nonverbal experimenter cues, whatever), to get the pattern of results you want. If those factors are visible -- that is, reported in the published study -- then others can evaluate and critique and try to manipulate them. But if those factors are invisible, then you will have an irreproducible result, but one not due to chance. In a way this is a file-drawer effect, since null results are disproportionately non-reported, but it's one driven by biased search for experimental procedures that "succeed" because of real moderating factors rather than just chance fluctuations.
If failure of replication in psychology is due to publishing results whose p values by mere statistical chance happen to fall below the threshold for statistical significance, then most failed replications will not be statistically significantly different from the originally reported results -- just closer to zero and non-significant. But if the failure of replication in psychology is due to invisible moderating factors unreported in the original experiment, then failed replications with decent statistical power will tend to find significantly different results from the original experiment.
I think that is what we see.
[revised Sep. 10]
Related posts:
What a Non-Effect Looks Like (Aug. 7, 2013)
Meta-Analysis of the Effect of Religion on Crime: The Missing Positive Tail (Apr. 11, 2014)
Psychology Research in the Age of Social Media (Jan. 7, 2015)
Note 1: In no case was the original study's effect size outside the 95% interval for the replication study because the original study's effect size was too low.
Note 2: I used the r's and N's reported in the study's z-to-r conversions for non-excluded studies, then plugged the ones that were not obviously either significant or non-significant one-by-one into Lowry's online calculator for significant difference between correlation coefficients, using one-tailed p values and a significance threshold of p < .05. Note that this analysis is different from and more conservative than simply looking at whether the 95% CI of the replication includes the effect size of the original study, since it allows for statistical error in the original study rather than assuming a fixed original effect size.
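For readers who want to try the comparison in Note 2 themselves, here is a minimal sketch. It assumes the standard Fisher r-to-z test for the difference between two independent correlations, which may differ in detail from what the online calculator does. It also builds the replication's 95% CI from the same Fisher transformation, which is an assumption and not necessarily how the replication study computed its intervals. The function names are made up for illustration. The inputs are the rounded figures from the representative case above (r = .22, N = 96 versus r = .02, N = 108), so the p value will only approximately match the one reported there.

```python
import math

def one_tailed_p_for_difference(r_orig, n_orig, r_rep, n_rep):
    """Fisher r-to-z test that the original r exceeds the replication r.

    Both correlations are treated as noisy estimates, so the standard
    error includes a term for each sample.
    """
    z_diff = math.atanh(r_orig) - math.atanh(r_rep)
    se = math.sqrt(1.0 / (n_orig - 3) + 1.0 / (n_rep - 3))
    z = z_diff / se
    # Upper-tail probability of a standard normal at z.
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p

def replication_ci(r_rep, n_rep, z_crit=1.96):
    """Approximate 95% CI for the replication r via the Fisher transformation."""
    center = math.atanh(r_rep)
    half_width = z_crit / math.sqrt(n_rep - 3)
    return math.tanh(center - half_width), math.tanh(center + half_width)

# Rounded values from the representative case in the post.
r_orig, n_orig = 0.22, 96
r_rep, n_rep = 0.02, 108

z, p = one_tailed_p_for_difference(r_orig, n_orig, r_rep, n_rep)
low, high = replication_ci(r_rep, n_rep)
original_outside_ci = not (low <= r_orig <= high)

print(f"difference test: z = {z:.2f}, one-tailed p = {p:.3f}")
print(f"replication 95% CI: ({low:.3f}, {high:.3f})")
print(f"original r outside replication CI: {original_outside_ci}")
```

With these inputs the original r falls just outside the replication's interval, yet the difference test stays short of p < .05. That is the sense in which the difference test is the more conservative of the two: it counts the original study's sampling error too.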