But here's perhaps an even more telling finding: Only 47% of the originally reported effect sizes were within the 95% confidence interval of the replication effect size. In other words, if you just used the replication study as your basis for guessing the real effect size, then in over half the cases you would not expect the real effect size to be as large as the effect size originally reported. [Note 1] This result inspired me to look at the raw data and try a related analysis that the replication report did not appear to include: What percentage of the replications find a significantly lower effect size than the original study? By my calculations: 36/95, or 38%. [See Note 2 for what I did.]
(A rather surprising additional ten studies showed statistically marginal trends toward lower effect sizes, which it is tempting to interpret as a combination of a non-effect with a poorly powered original study or replication. A representative case is one study with an effect size of r = .22 on 96 participants and a replication effect size of r = .02 on 108 participants (one-tailed p value for the difference between the r's = .07). Thus, it seems likely that 38% is a conservative estimate of the tendency toward lower effect sizes in replications.)
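For the statistically inclined, here's a quick sketch of that comparison in code, using the standard Fisher r-to-z test for the difference between two independent correlations (which is, I believe, what Lowry's calculator implements; see Note 2). Only the r's and N's come from the representative case above; the rest is illustrative.

```python
# Quick check of the representative case above (r = .22, N = 96 vs.
# r = .02, N = 108) using the Fisher r-to-z test for the difference
# between two independent correlations.
from math import atanh, sqrt, erfc

z_orig, z_rep = atanh(0.22), atanh(0.02)          # Fisher r-to-z
se_diff = sqrt(1 / (96 - 3) + 1 / (108 - 3))      # SE of the difference in z's
z = (z_orig - z_rep) / se_diff
p_one_tailed = 0.5 * erfc(z / sqrt(2))            # upper-tail normal p value
print(round(z, 2), round(p_one_tailed, 3))        # ~1.43 and ~.07-.08: marginal
```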
This study-by-study statistical comparison of effect sizes is useful because it helps distinguish the file drawer problem from what we might call the invisible factor problem.
The file drawer problem is this: Researchers are more likely to publish statistically significant findings than findings that show no statistically significant effect. Even when there is no real effect, some studies will come out statistically significant by chance, and if mostly those studies are published, it can look like there's a real effect when in fact there is none.
The invisible factor problem is this: There are a vast number of unreported features of every experiment. Possibly one of those unreported features, invisible to the study's readership, is an important contributor to the reported findings. In infancy research, for example, it's not common to report the experimenter's pre-testing interactions with the infant, if any, but pre-testing interactions might have a big effect. In cognitive research, it's not common to report what time of day participants performed the tasks, but time of day can influence arousal and performance. And so on.
The file drawer problem is normally managed in meta-analysis by assuming a substantial number of unpublished null-result studies (maybe five times as many as the published studies) and then seeing if the result still proves significant in a merged analysis. But this is only an adequate approach if the only risk to be considered is a chance tendency for a non-effect to show up as significant in some studies. If, on the other hand, there are a large number of invisible factors, or moderators, that dependably confound studies, leading to statistically significant positive results other than by chance, standard meta-analytic file-drawer compensations will not suffice. The invisible factors might be large and non-chance, unintentionally sought out and settled upon by well-meaning researchers, perhaps even passed along teacher to student. ("It works best if you do it like this.")
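To make the standard compensation concrete, here's a minimal sketch of one common version of it, assuming studies are merged by Stouffer's method (summing z-scores and dividing by the square root of the number of studies) and assuming each unpublished study contributes z = 0. The z-scores and counts below are made up for illustration.

```python
# Sketch of a simple file-drawer robustness check: combine the published
# studies' z-scores with an assumed pool of unpublished null results
# (z = 0 each) via Stouffer's method, and see whether the merged result
# is still significant.
from math import sqrt

def stouffer_with_file_drawer(published_z, n_unpublished):
    """Combined z when n_unpublished null studies (z = 0) are added."""
    k = len(published_z) + n_unpublished
    return sum(published_z) / sqrt(k)   # null studies add 0 to the numerator

# Hypothetical example: ten published studies, each z = 2.0,
# plus five times as many assumed unpublished nulls.
published = [2.0] * 10
combined = stouffer_with_file_drawer(published, n_unpublished=5 * len(published))
print(round(combined, 2), combined > 1.645)  # still significant at one-tailed .05?
```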
Here's how I think psychological research sometimes goes. You try an experiment one way and it "fails" -- that is, it doesn't produce the hoped-for result. So you try another way and it fails again. So then you try a third way and it succeeds. Maybe to make sure it's not chance, you do it that same way again and it still succeeds, so you publish. But there might be no real underlying effect of the sort you think there is. What you might have done is find the right set of moderating factors (time of day, nonverbal experimenter cues, whatever) to get the pattern of results you want. If those factors are visible -- that is, reported in the published study -- then others can evaluate and critique and try to manipulate them. But if those factors are invisible, then you will have an irreproducible result, but one not due to chance. In a way this is a file-drawer effect, since null results are disproportionately non-reported, but it's one driven by a biased search for experimental procedures that "succeed" because of real moderating factors rather than just chance fluctuations.
If failure of replication in psychology is due to publishing results that by mere statistical chance happen to fall below the threshold for statistical significance, then most failed replications will not be statistically significantly different from the originally reported results -- just closer to zero and non-significant. But if the failure of replication in psychology is due to invisible moderating factors unreported in the original experiment, then failed replications with decent statistical power will tend to find significantly different results from the original experiment.
I think that is what we see.
[revised Sep. 10]
----------------------------------------
Related posts:
What a Non-Effect Looks Like (Aug. 7, 2013)
Meta-Analysis of the Effect of Religion on Crime: The Missing Positive Tail (Apr. 11, 2014)
Psychology Research in the Age of Social Media (Jan. 7, 2015)
----------------------------------------
Note 1: In no case was the original study's effect size outside the 95% interval for the replication study because the original study's effect size was too low.
Note 2: I used the r's and N's reported in the study's z-to-r conversions for non-excluded studies, then plugged the ones that were not obviously either significant or non-significant one by one into Lowry's online calculator for the significance of the difference between two correlation coefficients, using one-tailed p values and a significance threshold of p < .05. Note that this analysis is different from, and more conservative than, simply looking at whether the 95% CI of the replication includes the effect size of the original study, since it allows for statistical error in the original study rather than assuming a fixed original effect size.
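Here's a rough sketch of that tally in code rather than via the online calculator, using the same Fisher r-to-z comparison as the sketch above. The short list of (r_original, N_original, r_replication, N_replication) tuples is only a placeholder; the real values come from the project's posted z-to-r conversions.

```python
# Sketch of the Note 2 tally: count how many replications show an effect
# size significantly *lower* than the original (one-tailed p < .05),
# comparing independent correlations via Fisher's z.
from math import atanh, sqrt, erfc

def one_tailed_p(r_orig, n_orig, r_rep, n_rep):
    """p for the hypothesis that the original r exceeds the replication r."""
    z = (atanh(r_orig) - atanh(r_rep)) / sqrt(1/(n_orig - 3) + 1/(n_rep - 3))
    return 0.5 * erfc(z / sqrt(2))

# Placeholder rows; the real (r, N) pairs come from the project's data.
study_pairs = [(0.22, 96, 0.02, 108), (0.50, 40, 0.45, 120), (0.38, 80, 0.05, 150)]

lower = sum(one_tailed_p(*row) < .05 for row in study_pairs)
print(f"{lower} of {len(study_pairs)} replications significantly lower")
```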
Are you aware of Kenneth Gergen's Toward Transformation in Social Knowledge? He criticizes the idea that psychology (more specifically, social psychology) is building up an understanding of human nature the way physics builds up an understanding of reality. The reason is that so much human behavior is contingent on eminently changeable circumstances. For example, how confident women are will vary greatly depending on which population is studied, where it is, and when it is. An earlier version of Gergen's idea shows up in his 1973 paper "Social Psychology as History".
I wonder if this has been factored into the replication study. I recall reading Ian Hacking's essay in Arguing About Human Nature which documents some examples where social science research, when disseminated into the population, changed the population. Gergen also notes that such research frequently ends up impacting people, breaking down the traditional subject/object dichotomy of scientific research, where the object is not impacted by the subject.
I don't think there is any evidence for the reasoning of hidden moderators - your analysis cannot support your conclusion. See http://daniellakens.blogspot.nl/2015/08/power-of-replications-in.html
Aren't studies a bit rubbish? They don't use a control, for example. They are kind of the gossip of the scientific world.
I ran into a video outlining all the issues studies have, once. Wish I had the link. But the lack of a control was a biggie.
The effects of your invisible factors will manifest themselves as significant between-study heterogeneity (overdispersion) in a meta-analysis.
Thanks for the comments, folks!
Luke -- thanks for the suggestion! I haven't read Gergen, but that sounds pretty interesting. I do know some of Hacking's work on this issue.
Daniel -- I've left a comment over at your own post.
Callan -- I think a lot of studies have problems, but I wouldn't go as far as you!
David -- yes, that sounds right. And in the meta-analyses I've seen, there is a lot of between-study heterogeneity, even among individual studies with large N's, but the authors of the meta-analyses tend not to make much of it.
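For what it's worth, here's a minimal sketch of how that between-study heterogeneity is usually quantified (Cochran's Q and the I-squared statistic), with made-up effect sizes and standard errors purely for illustration.

```python
# Minimal sketch of how between-study heterogeneity is usually quantified:
# Cochran's Q and I^2, from each study's effect estimate and standard error.
# The numbers below are invented for illustration.
import numpy as np
from scipy.stats import chi2

def heterogeneity(effects, ses):
    effects, ses = np.asarray(effects), np.asarray(ses)
    w = 1 / ses**2                          # inverse-variance weights
    pooled = np.sum(w * effects) / np.sum(w)
    q = np.sum(w * (effects - pooled)**2)   # Cochran's Q
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100       # % of variability beyond chance
    return q, chi2.sf(q, df), i2

# Hypothetical effect sizes (e.g., Fisher-z correlations) and standard errors:
q, p, i2 = heterogeneity([0.10, 0.45, 0.05, 0.60, 0.20],
                         [0.08, 0.09, 0.07, 0.10, 0.08])
print(f"Q = {q:.1f}, p = {p:.4f}, I^2 = {i2:.0f}%")
```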
So if psychological experimentation is unreliable, what isn't? What do we do then? Is conceptual analysis any surer?