Estimating Reproducibility of Psychology (No. 151): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies that ask why a particular result could or could not be replicated. The main reason is probably that examining every study in detail is a daunting task. Another reason is that replication failures can have several causes, and each study may have a different explanation. This makes it important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original study. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Special Introduction

Article 151 “With a Clean Conscience: Cleanliness Reduces the Severity of Moral Judgments” by Simone Schnall and colleagues has been the subject of heated debates among social psychologists.  The main finding of the article failed to replicate in an earlier replication attempt (Johnson, Cheung, & Donnellan, 2012).  In response to the replication failure, Simone Schnall suggested that the replication study was flawed and stood by her original findings.  This response led me to publish my first R-Index blog post that suggested the original results were not as credible as they seem because Simone Schnall was trained to use questionable research practices that produce significant results with low replicability. She was simply not aware of the problems of using these methods. However, Simone Schnall was not happy with my blog post and when I refused to take it down, she complained to the University of Toronto about it. UofT found that the blog post did not violate ethical standards.

The background is important because the OSC replication study was one of the replication studies that were published earlier and criticized by Schnall. Thus, it is necessary to revisit Schnall’s claim that the replication failure can be attributed to problems with the replication study.

Summary of Original Article 

The article “With a Clean Conscience: Cleanliness Reduces the Severity of Moral Judgments” was published in Psychological Science. The article has been cited 197 times overall and 20 times in 2017.


The article extends previous research that suggested a connection between feelings of disgust and moral judgments.  The article reports two experiments that test the complementary hypothesis that thoughts of purity make moral judgments less severe.  Study 1 used a priming manipulation. Study 2 evoked disgust followed by self-purification. Results in both studies confirmed this prediction.

Study 1

Forty undergraduate students (n = 20 per cell) participated in Study 1.

Half of the participants were primed with a scrambled sentence task that contained cleanliness words (e.g. pure, washed).  The other half did a scrambled sentence task with neutral words.

Right after the priming procedure, participants rated how morally wrong an action was in a series of six moral dilemmas.

The ANOVA showed a marginally significant mean difference, F(1,38) = 3.63, p = .064. The result was reported with p-rep = .90, an experimental statistic used in Psychological Science from 2005 to 2009 that was partially motivated by an attempt to soften the strict distinction between p-values just above or below .05. Although a p-value of .064 is not meaningfully different from a p-value of .04, neither p-value suggests that a result is highly replicable. A p-value of .05 corresponds to 50% replicability (with large uncertainty around this point estimate), and the estimate is inflated if questionable research methods were used to produce it.

Study 2

Study 2 could have followed up the weak evidence of Study 1 with a larger sample to increase statistical power.  However, the sample size in Study 2 was nearly the same (N = 44).

Participants first watched a disgusting film clip.  Half (n = 21) of the participants then washed their hands before rating moral dilemmas.  The other half (n = 22) did not wash their hands.

The ANOVA showed a significant difference between the two conditions, F(1,41) = 7.81, p = .008.

Replicability Analysis 

Study     N    Test              p-value   z      Observed power (OP)
Study 1   40   F(1,38) = 3.63    0.064     1.85   0.58*
Study 2   44   F(1,41) = 7.81    0.008     2.66   0.76

*  using p < .10 as criterion for power analysis
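The z and OP columns can be recomputed from the reported p-values. The Python sketch below assumes the usual conventions behind observed power: a two-tailed p-value is converted into an absolute z-score, and observed power is the probability that a new z-score centered on the observed one exceeds the critical value (the negligible opposite tail is ignored). The function names are illustrative and are not taken from any published package.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def p_to_z(p_two_tailed):
    """Convert a two-tailed p-value into an absolute z-score."""
    return STD_NORMAL.inv_cdf(1 - p_two_tailed / 2)


def observed_power(p_two_tailed, alpha=0.05):
    """Chance that a same-sized study is significant at the two-tailed alpha,
    assuming the true effect equals the observed z-score."""
    z_crit = p_to_z(alpha)
    return STD_NORMAL.cdf(p_to_z(p_two_tailed) - z_crit)


# Study 1 is evaluated against p < .10, Study 2 against p < .05.
power_study1 = observed_power(0.064, alpha=0.10)
power_study2 = observed_power(0.008, alpha=0.05)
print(f"Study 1: z = {p_to_z(0.064):.2f}, observed power = {power_study1:.2f}")
print(f"Study 2: z = {p_to_z(0.008):.2f}, observed power = {power_study2:.2f}")
```

Because the inputs are rounded p-values, the recomputed z-scores can differ from the table in the second decimal. Calling observed_power(0.05) returns .50, which is the sense in which a p-value of .05 corresponds to 50% observed power.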

With two studies it is difficult to predict replicability because observed power in a single study is strongly influenced by sampling error.  Individually, Study 1 has a low replicability index because the success (p < .10) was achieved with only 58% power. The inflation index (100 – 58 = 42) is high and the R-Index, 58 – 42 = 16, is low.

Combining both studies still produces a low R-Index (Median Observed Power = 67, Inflation = 33, R-Index = 67 – 33 = 34).
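The R-Index arithmetic used here can be written out in a few lines. This is a minimal Python sketch that assumes the definitions implied by the numbers above: inflation is the reported success rate minus the median observed power, and the R-Index is the median observed power minus the inflation. The function name and the use of proportions instead of percentages are my own choices for illustration.

```python
from statistics import median


def r_index(observed_powers, success_rate):
    """Return (median observed power, inflation, R-Index) as proportions."""
    median_power = median(observed_powers)
    inflation = success_rate - median_power
    return median_power, inflation, median_power - inflation


# Study 1 alone: one reported success with 58% observed power.
_, _, r_study1 = r_index([0.58], success_rate=1.0)
print(f"Study 1 R-Index = {r_study1:.2f}")

# Both studies: two reported successes with 58% and 76% observed power.
median_power, inflation, r_both = r_index([0.58, 0.76], success_rate=1.0)
print(f"median power = {median_power:.2f}, "
      f"inflation = {inflation:.2f}, R-Index = {r_both:.2f}")
```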

My original blog post pointed out that we can predict replicability based on a researcher’s typical R-Index. If a researcher typically conducts studies with high power, a p-value of .04 will sometimes occur due to bad luck, but the replication study is likely to be successful with a lower p-value because bad luck does not repeat itself.

In contrast, if a researcher conducts low powered studies, a p-value of .04 is a lucky outcome and the replication study is unlikely to be lucky again and therefore more likely to produce a non-significant result.

Since I published the blog post, Jerry Brunner and I have developed a new statistical method that allows meta-psychologists to take a researcher’s typical research practices into account. This method is called z-curve.

The figure below shows the z-curve for automatically extracted test statistics from articles by Simone Schnall from 2003 to 2017.  Trend analysis showed no major changes over time.

 

For some help with reading these plots check out this blog post.

The Figure shows a few things. First, it shows that the peak (mode) of the distribution is at z = 1.96, which corresponds to the criterion for significance (p < .05, two-tailed).  The steep drop on the left is not explained by normal sampling error and reveals the influence of QRPs (this is not unique to Schnall; the plot is similar for other social psychologists).  The grey line is a rough estimate of the proportion of non-significant results that would be expected given the distribution of significant results.  The discrepancy between the proportion of actual non-significant results and the grey line shows the extent of the influence of QRPs.

[Figure: z-curve of test statistics from articles by Simone Schnall, 2003 to 2017]

Once QRPs are present, observed power of significant results is inflated. The average estimate is 48%. However, actual power varies.  The estimates below the x-axis show power estimates for different ranges of z-scores.  Even z-scores between 2.5 and 3 have only an average power estimate of 38%.  This implies that the z-score of 2.66 in Study 2 has a bias-corrected observed power of less than 50%. And as 50% power corresponds to p = .05, this implies that a bias-corrected p-value is not significant.

A new way of using z-curve is to fit z-curve with different proportions of false positive results and to compare the fit of these models.

[Figure: fit of z-curve models with different proportions of false positive results]

The plot shows that models with 0 or 20% false positives fit the data about equally well, but a model with 40% false positives leads to notably worse model fit. Although this new feature is still in development, the results suggest that few of Schnall’s results are strictly false positives, but that many of her results may be difficult to replicate because QRPs produced inflated effect sizes and much larger samples might be needed to produce significant results (e.g., N > 700 is needed for 80% power with a small effect size, d = .2).
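The sample size figure for d = .2 can be checked with a standard power calculation. The Python sketch below assumes a two-sided comparison of two independent groups of equal size and uses the normal approximation; an exact calculation based on the t distribution gives a slightly larger number. The function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group n for a two-sided, two-sample comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-tailed alpha
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


n = n_per_group(d=0.2)
total_n = 2 * n
print(f"n per group = {n}, total N = {total_n}")
```

Under these assumptions the total sample is well above 700, which is more than 15 times the size of either original study.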

In conclusion, given the evidence for the presence of QRPs and the weak evidence for the cleanliness hypothesis, it is unlikely that equally underpowered studies would replicate the effect. At the same time, larger studies might produce significant results with weaker effect sizes. Given the large sampling error in small samples, it is impossible to say how small the effects would be and how large samples would have to be to have high power to detect them.

Actual Replication Study

The replication study was carried out by Johnson, Cheung, and Donnellan.

Johnson et al. conducted replication studies of both studies with considerably larger samples.

Study 1 was replicated with 208 participants (vs. 40 in original study).

Study 2 was replicated with 126 participants (vs. 44 in original study).

Even if some changes in experimental procedures would have slightly lowered the true effect size, the larger samples would have compensated for this by reducing sampling error.

However, neither replication produced a significant result.

Study 1: F(1, 206) = 0.004, p = .95

Study 2: F(1, 124) = 0.001, p = .97.

Just as two p-values of .05 and .07 are an unlikely pair, it is also unlikely to obtain two p-values of .95 and .97, even if the null hypothesis is true, because sampling error produces spurious mean differences. When the null hypothesis is true, p-values have a uniform distribution, and we would expect 10% of p-values to fall between .9 and 1. To observe this event twice in a row has a probability of .10 * .10 = .01. Unusual events do sometimes happen by chance, but defenders of the original research could use this observation to suggest “reverse p-hacking,” a term coined by Fritz Strack to insinuate that replication researchers can have an interest in making original effects go away. Although I do not believe that this was the case here, it would be unscientific to ignore the surprising similarity of these two p-values.
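The uniform distribution of p-values under the null hypothesis can be illustrated with a small simulation. This Python sketch is only an illustration under simplifying assumptions: each study is reduced to a single standard normal z-score with a true effect of zero, the two studies are independent, and the seed and the number of simulated pairs are arbitrary.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
rng = random.Random(2015)  # fixed seed so the run is repeatable


def null_p_value():
    """Two-tailed p-value of a z-test when the null hypothesis is true."""
    z = rng.gauss(0, 1)
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


n_pairs = 50_000
pairs = [(null_p_value(), null_p_value()) for _ in range(n_pairs)]

# Share of first-study p-values at or above .90 (theory: .10).
share_single = sum(p1 >= 0.9 for p1, _ in pairs) / n_pairs
# Share of pairs in which both p-values are at or above .90 (theory: .01).
share_both = sum(p1 >= 0.9 and p2 >= 0.9 for p1, p2 in pairs) / n_pairs
print(f"single: {share_single:.3f}, both: {share_both:.3f}")
```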

The authors conducted two more replication studies. These studies also produced non-significant results, with p = .31 and p = .27.  Thus, the similarity of the first two p-values was just a statistical fluke, just like some suspiciously similar  p-values of .04 are sometimes just a chance finding.

Schnall’s Response 

In a blog post, Schnall comments on the replication failure.  She starts with the observation that publishing failed replications is breaking with old traditions.

One thing, though, with the direct replications, is that now there can be findings where one gets a negative result, and that’s something we haven’t had in the literature so far, where one does a study and then it doesn’t match the earlier finding. 

Schnall is concerned that a failed replication could damage the reputation of the original researcher, if the failure is attributed either to a lack of competence or a lack of integrity.

Some people have said that well, that is not something that should be taken personally by the researcher who did the original work, it’s just science. These are usually people outside of social psychology because our literature shows that there are two core dimensions when we judge a person’s character. One is competence—how good are they at whatever they’re doing. And the second is warmth or morality—how much do I like the person and is it somebody I can trust.

Schnall believes that direct replication studies were introduced as a crime control measure in response to the revelation that Diederik Stapel had made up data in over 50 articles. This violation of research integrity is called fabrication. However, direct replication studies are not an effective way to detect fabrication (Stroebe and Strack, 2014).

In social psychology we had a problem a few years ago where one highly prominent psychologist turned out to have defrauded and betrayed us on an unprecedented scale. Diederik Stapel had fabricated data and then some 60-something papers were retracted… This is also when this idea of direct replications was developed for the first time where people suggested that to be really scientific we should do what the clinical trials do rather our regular [publish conceptual replication studies that work] way of replication that we’ve always done.

Schnall overlooks that another reason for direct replications was concern about falsification.

Falsification is manipulating research materials, equipment, or processes, or changing or omitting data or results such that the research is not accurately represented in the research record (The Office of Research Integrity)

In 2011/2012 numerous articles suggested that falsification is a much bigger problem than fabrication and direct replications were used to examine whether falsified evidence also produced false positive results that could not be replicated.  Failures in direct replications are at least in part due to the use of questionable research practices that inflate effect sizes and success rates.

Today it is no longer a secret that many studies failed to replicate because original studies reported inflated effect sizes (OSC, 2015). Given the widespread use of QRPs, especially in experimental social psychology, replication failures are the norm. In this context, it makes sense that individual researchers feel attacked if one of their studies is replicated.

There’s been a disproportional number of studies that have been singled out simply because they’re easy to conduct and the results are surprising to some people outside of the literature

Why me? However, the OSC (2015) project did not single out individual researchers. It put any study that was published in JPSP or Psychological Science in 2008 up for replication. Maybe the ease of replication was a factor.

Schnall’s next complaint is that failures to replicate are treated as more credible than successful original studies.

Often the way these replications are interpreted is as if one single experiment disproves everything that has come before. That’s a bit surprising, especially when a finding is negative, if an effect was not confirmed. 

This argument ignores two things. First, it ignores that original researchers have a motivated bias to show a successful result.  Researchers who conduct direct replication studies are open to finding a positive or a negative result.  Second, Schnall ignores sample size.  Her original Study 1 had a sample size of N = 40.  The replication study had a sample size of N = 208.  Studies with larger samples have less sampling error and are more robust to violations of statistical assumptions underlying significance tests.  Thus, there are good reasons to believe the results of the failed replication studies more than the results of Schnall’s small original study.

Her next issue was that a special issue published a failed replication without peer review.  This led to some controversy, but it is not the norm.  More important, Schnall overstates the importance of traditional, anonymous, pre-publication peer-review.

It may not seem like a big deal but peer review is one of our laws; these are our publication ethics to ensure that whatever we declare as truth is unbiased. 

Pre-publication peer-review does not ensure that published results are unbiased. The OSC (2015) results clearly show that published results were biased in favor of supporting researchers’ hypotheses. Traditional peer-review does not check whether researchers used QRPs or not. Peer-review does not end once a result is published. It is possible to evaluate the results of original studies or replication studies even after the results are published.

And this is what Schnall did. She looked at the results and claimed that there was a mistake in the replication study.

I looked at their data, looked at their paper and I found what I consider a statistical problem.

However, others looked at the data and didn’t agree with her. This led Schnall to consider replications a form of bullying.

“One thing I pointed to was this idea of this idea of replication bullying, that now if a finding doesn’t replicate, people take to social media and declare that they “disproved” an effect, and make inappropriate statements that go well beyond the data.”

It is of course ridiculous to think of failed replication studies as a form of bullying. We would not need to conduct empirical studies if only successful replication studies were allowed to be published. Apparently some colleagues tried to point this out to Schnall.

Interestingly, people didn’t see it that way. When I raised the issue, some people said yes, well, it’s too bad she felt bullied but it’s not personal and why can’t scientists live up to the truth when their finding doesn’t replicate?

Schnall could not see it this way.  According to her, there are only two reasons why a replication study may fail.

If my finding is wrong, there are two possibilities. Either I didn’t do enough work and/or reported it prematurely when it wasn’t solid enough or I did something unethical.

In reality there are many more reasons for a replication failure. One possible explanation is that the original result was an honest false positive finding.  The very notion of significance testing implies that some published findings can be false positives and that only future replication studies can tell us which published findings are false positives.  So a simple response to a failed replication is simply to say that it probably was a false positive result and that is the end of the story.

But Schnall does not believe that it is a false positive result ….

because so far I don’t know of a single person who failed to replicate that particular finding that concerned the effect of physical cleanliness and moral cleanliness. In fact, in my lab, we’ve done some direct replications, not conceptual replications, so repeating the same method. That’s been done in my lab, that’s been done in a different lab in Switzerland, in Germany, in the United States and in Hong Kong; all direct replications. As far as I can tell it is a solid effect.

The problem with this version of the story is that it is impossible to get significant results again and again with small samples, even if the effect is real.  So, it is not credible that Schnall was able to get significant results in many unpublished results and never obtained a contradictory result (Schimmack, 2012).
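A rough calculation shows how quickly a perfect record becomes improbable. The Python sketch below assumes independent studies that all have the same power; the value of 48% is the average z-curve estimate reported earlier in this post, and the count of five studies is a hypothetical number chosen only for illustration.

```python
def prob_all_significant(power, n_studies):
    """Chance that every one of n independent studies is significant."""
    return power ** n_studies


# 0.48 is the average power estimate from the z-curve analysis;
# five studies is a hypothetical count chosen for illustration.
p_all = prob_all_significant(power=0.48, n_studies=5)
print(f"P(5 of 5 significant) = {p_all:.3f}")
```

Even with a real effect, an unbroken string of significant results from studies with this level of power would be expected only a few times in a hundred.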

Despite many reasonable comments about the original study and the replication studies (e.g., sample size, QRPs, etc.), Schnall cannot escape the impression that replication researchers have an agenda to tear down good research.

Then the quality criteria are oftentimes not nearly as high as for the original work. The people who are running them sometimes have motivations to not necessarily want to find an effect as it appears.

This accusation motivated me to publish my first blog post and to elaborate on this study from the OSC reproducibility project. There is ample evidence that QRPs contributed to replication failures. In contrast, there is absolutely no empirical evidence that replication researchers deliberately produced non-significant results, and as far as I know Schnall has not yet apologized for her unfounded accusation.

One reason for her failure to apologize is probably that many social psychologists expressed support for Schnall either in public or mostly in private.

I raised these concerns about the special issue, I put them on a blog, thinking I would just put a few thoughts out there. That blog had some 17,000 hits within a few days. I was flooded with e-mails from the community, people writing to me to say things like “I’m so glad that finally somebody’s saying something.” I even received one e-mail from somebody writing to me anonymously, expressing support but not wanting to reveal their name. Each and every time I said: “Thank you for your support. Please also speak out. Please say something because we need more people to speak out openly. Almost no one did so.”

Schnall overlooks a simple solution to the problem.  Social psychologists who feel attacked by failed replications could simply preregister their own direct replications with large samples and show that their results do replicate.  This solution was suggested by Daniel Kahneman in 2012 in response to a major replication failure of a study by John Bargh that cast doubt on social priming effects.

What social psychology needs to do as a field is to consider our intuitions about how we make judgments, about evidence, about colleagues, because some of us have been singled out again and again and again. And we’ve been put under suspicion; whole areas of research topics such as embodied cognition and priming have been singled out by people who don’t work on the topics. False claims have been made about replication findings that in fact are not as conclusive as they seem. As a field we have to set aside our intuitions and move ahead with due process when we evaluate negative findings. 

However, what is most telling is the complete absence of direct replications by experimental social psychologists to demonstrate that their published results can be replicated.  The first major replication attempt by Vohs and Schmeichel just failed to replicate ego-depletion in a massive self-replication attempt.

In conclusion, it is no longer a secret that experimental social psychologists have used questionable research practices to produce more significant results than unbiased studies would produce.  The response to this crisis of confidence has been denial.

 

 

 

 
