Dr. Ulrich Schimmack Blogs about Replicability

"For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication" (Cohen, 1994).

DEFINITION OF REPLICABILITY

In empirical studies with sampling error, replicability refers to the probability that a study with a significant result would produce a significant result again in an exact replication of the original study with the same sample size and significance criterion (Schimmack, 2017).
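Under this definition, replicability is simply the statistical power of the original design evaluated at the (unknown) population effect size. Here is a minimal sketch, assuming a hypothetical two-group design; the effect size, sample size, and the use of statsmodels are illustrative choices, not values or tools taken from any particular study discussed on this blog.

```python
# Sketch: the replicability of an exact replication equals the power of the
# original design at the true population effect size.
# The numbers below are hypothetical illustrations.
from statsmodels.stats.power import TTestIndPower

true_d = 0.4      # hypothetical population effect size (Cohen's d)
n_per_group = 50  # hypothetical sample size per group
alpha = 0.05      # conventional significance criterion

power = TTestIndPower().power(effect_size=true_d, nobs1=n_per_group,
                              alpha=alpha, alternative='two-sided')
print(f"Probability that an exact replication is significant: {power:.2f}")
```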

The purpose of the R-Index blog is to increase the replicability of published results in psychological science and to alert consumers of psychological research about problems in published articles.

To evaluate the credibility or “incredibility” of published research, my colleagues and I developed several statistical tools, such as the Incredibility Test (Schimmack, 2012), the Test of Insufficient Variance (Schimmack, 2014), and z-curve (Version 1.0: Brunner & Schimmack, 2020; Version 2.0: Bartos & Schimmack, 2021).

I have used these tools to demonstrate that several claims in psychological articles are incredible (a.k.a., untrustworthy), starting with Bem’s (2011) outlandish claims of time-reversed causal pre-cognition (Schimmack, 2012). Bem’s article triggered a crisis of confidence in the credibility of psychology as a science.

Over the past decade it has become clear that many other seemingly robust findings are also highly questionable. For example, I showed that many claims in Nobel Laureate Daniel Kahneman’s book “Thinking, Fast and Slow” are based on shaky foundations (Schimmack, 2020). An entire book on unconscious priming effects, by John Bargh, also ignores replication failures and lacks credible evidence (Schimmack, 2017). The hypothesis that willpower is fueled by blood glucose and easily depleted is also not supported by empirical evidence (Schimmack, 2016). In general, many claims in social psychology are questionable and require new evidence to be considered scientific (Schimmack, 2020).

Each year I post new information about the replicability of research in 120 psychology journals (Schimmack, 2021). I also started providing information about the replicability of individual researchers and provide guidelines on how to evaluate their published findings (Schimmack, 2021).

Replication is essential for an empirical science, but it is not sufficient. Psychology also has a validation crisis (Schimmack, 2021). That is, measures are often used before it has been demonstrated how well they measure what they are supposed to measure. For example, psychologists have claimed that they can measure individuals’ unconscious evaluations, but there is no evidence that unconscious evaluations even exist (Schimmack, 2021a, 2021b).

If you are interested in my story how I ended up becoming a meta-critic of psychological science, you can read it here (my journey). 

Prediction Markets of Replicability

Abstract

I reinvestigate the performance of prediction markets for the Open Science Collaboration replicability project. I show that performance varied considerably across the two markets, with the second market failing to replicate the excellent performance of the first. I also show that the markets did not perform significantly better than a “burn everything to the ground” rule that bets on failure every time. I then suggest a simple rule that can easily be applied to published studies: treat only results with p-values below .005 as significant. Finally, I discuss betting on future studies as a way to optimize the allocation of resources for new research.

Introduction

For decades, psychologists failed to properly test their hypotheses. Statistically significant results in journals are meaningless because published results are selected for significance. A project that attempted to replicate 100 studies from three psychology journals found that only 37% (36/97) of the published significant results could be replicated (Open Science Collaboration, 2015).

Unfortunately, it is impossible to rely on actual replication studies to examine the credibility of the thousands of findings that have been reported over the years. Dreber, Pfeiffer, Almenberg, Isaksson, Wilson, Chen, Nosek, and Johannesson (2015) proposed prediction markets as a solution to this problem. Prediction markets rely on a small number of traders to bet on the outcomes of replication studies. Traders can earn a small amount of real money by betting on studies that actually replicate.

To examine the forecasting abilities of prediction markets, Dreber et al. (2015) conducted two studies. The first study with 23 studies started in November 2012 and lasted two months (N = 47 participants). The second study with 21 studies started in October 2014 (N = 45 participants). The two studies are very similar to each other. Thus, Study 2 can be considered a close replication of Study 1.

Studies were selected from the set of 100 studies based on time of completion. To pay participants, studies were chosen that were scheduled to be completed within two months after the completion of the prediction market. It is unclear how completion time may have influenced the type of studies that were included or their outcomes.

The published article reports the aggregated results across the two studies. A market price above 50% was considered a prediction of a successful replication and a market price below 50% a prediction of a replication failure. The key finding was that “the prediction markets correctly predict the outcome of 71% of the replications (29 of 41 studies)” (p. 15344). The authors compare this finding to a proverbial coin flip, which implies a prediction accuracy of 50%, and find that 71% is statistically significantly higher than 50%.
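To make the benchmark comparison concrete, here is a minimal sketch of one way to test the reported 29-of-41 accuracy against a 50% coin-flip benchmark. It uses an exact binomial test via scipy; this is my own illustration and not necessarily the test reported in the original article.

```python
# Sketch: compare 29/41 correct market predictions against a 50% coin flip.
# Illustrative only; the original article may have used a different statistic.
from scipy.stats import binomtest

result = binomtest(k=29, n=41, p=0.5, alternative='greater')
print(f"Observed accuracy: {29/41:.2f}, one-sided p = {result.pvalue:.3f}")
```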

Below I conduct some additional statistical analyses of the open data. First, we can compare the performance of the prediction markets with a different prediction rule. Given the higher prevalence of replication failures than successes, a simple rule is to use the higher base rate and predict that every study will fail to replicate. As the replication success rate for the total set of 97 studies was 37%, this prediction rule is correct 1 - .37 = 63% of the time. For the 41 studies included in the prediction markets, the success rate of the replication studies was also about 37% (15/41). Somewhat surprisingly, the success rates were also close to 37% for Prediction Market 1, 32% (7/22), and Prediction Market 2, 42% (8/19).

In comparison to the simple rule that nothing published in psychology journals replicates, the prediction markets are no longer statistically significantly better, chi2(1) = 1.82, p = .177.

Closer inspection of the data also reveals a notable difference in the performance of the two prediction markets. Table 1 shows the results separately for Prediction Markets 1 and 2. Whereas the performance of the first market was nearly perfect, 91% (20/22), the second market performed only at chance level (a coin flip), 53% (10/19). Despite the small sample size, the success rates of the two markets are statistically significantly different, chi2(1) = 5.78, p = .016.
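For readers who want to verify this comparison, the counts reported above (20/22 vs. 10/19 correct predictions) can be arranged in a 2 x 2 table and tested with a standard chi-square test with continuity correction; the scipy call below is my own illustration, and it reproduces the reported statistic.

```python
# Sketch: chi-square test comparing the accuracy of the two prediction markets.
# Counts are taken from the text: market 1 = 20 correct / 2 incorrect,
# market 2 = 10 correct / 9 incorrect.
from scipy.stats import chi2_contingency

table = [[20, 2],   # market 1: correct, incorrect
         [10, 9]]   # market 2: correct, incorrect
chi2, p, dof, expected = chi2_contingency(table)  # Yates correction by default
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")   # approx. chi2(1) = 5.78, p = .016
```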

There is no obvious explanation for this discrepancy. The averaged result reported in the article can still be considered the best estimate of prediction markets’ performance, but trust in the method is diminished by the fact that a close replication of the first market’s excellent performance failed. Not reporting the outcomes of the two markets separately could be considered a questionable reporting decision.

The main appeal of prediction markets over the nihilistic trash-everything rule is that decades of research presumably did produce some replicable findings, and a useful prediction method should identify them. However, prediction markets take a long time, cost money, and their accuracy is currently uncertain. A better solution might be to find rules that can be applied easily to large sets of studies (Yang, Wu, & Uzzi, 2020). One such rule is suggested by the relationship between strength of evidence and replicability: the stronger the evidence against the null hypothesis (i.e., the lower the p-value), the more likely it is that the original study had high power and that the result will replicate. There is no clear criterion for a cut-off point to optimize prediction, but the results of the replication project can be used to validate cut-off points empirically.

One suggestion has been to consider only p-values below .005 as statistically significant (Benjamin et al., 2017). This rule is especially useful when selection bias is present. Selection bias can produce many low-powered results with p-values between .05 and .005, whereas p-values below .005 are more difficult to produce with statistical tricks.

The p < .005 rule correctly predicted the replication outcome for 63% (26/41) of the 41 studies included in the prediction markets, a difference of 4 studies compared to the markets. The rule also did better for the first market, 81% (18/22), than for the second market, 42% (8/19).

Table 2 shows that the main difference between the two markets was that the first market contained more studies with questionable p-values between .05 and .005 (15 vs. 6). For the second market, the rule overpredicted successes, and there were more false (8) than correct (5) predictions. This finding is consistent with examinations of the total set of replication studies in the replicability project (Schimmack, 2015). Based on this observation, I recommended a 4-sigma rule, p < .00006. With this rule, the overall success rate increases to 68% (28/41), an improvement of 2 studies. However, an inspection of correctly predicted successes shows that the 4-sigma rule correctly predicts only 5 of the 15 successes (33%), whereas the p < .005 rule correctly predicted 10 of the 15 successes (67%). Thus, the p < .005 rule has the advantage that it salvages more studies.
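Threshold rules like these are trivial to automate. Below is a minimal sketch that scores the p < .005 rule and the 4-sigma rule against replication outcomes; the data arrays are placeholders standing in for the open replication data, not the actual values, and the function name is my own.

```python
# Sketch: score simple p-value threshold rules against replication outcomes.
# original_p and replicated are placeholders, not the actual open data.
import numpy as np

original_p = np.array([0.001, 0.03, 0.00001, 0.04, 0.004])  # original p-values
replicated = np.array([True, False, True, False, True])     # replication significant?

def rule_accuracy(p_values, outcomes, threshold):
    """Predict success whenever the original p-value is below the threshold."""
    predictions = p_values < threshold
    return np.mean(predictions == outcomes)

for threshold in (0.005, 0.00006):   # p < .005 rule and 4-sigma rule
    acc = rule_accuracy(original_p, replicated, threshold)
    print(f"p < {threshold}: accuracy = {acc:.2f}")
```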

Conclusion

Meta-scientists are still scientists, and scientists are motivated to present their results in the best possible light. This is also true for Dreber et al.’s (2015) article. The authors enthusiastically recommend prediction markets as a tool “to quickly identify findings that are unlikely to replicate.” Based on their meta-analysis of two prediction markets with a total of just 41 studies, the authors conclude that “prediction markets are well suited” to assess the replicability of published results in psychology. I am not the only one to point out that this conclusion is exaggerated (Yang, Wu, & Uzzi, 2020). First, prediction markets are not quick at identifying replicable results, especially compared to simply checking whether the exact p-value of the original study is below .005. It is telling that nobody has used prediction markets to forecast the outcomes of new replication studies. One problem is that a market requires a set of studies, which makes it difficult to use markets to predict the outcomes of single studies. It is also unclear how well prediction markets really work. The original article omitted the fact that the market worked extremely well in the first study and not at all in the second, a statistically significant difference. The outcome seems to depend a lot on the selection of studies in the market. Finally, I showed that a simple statistical rule can predict replication outcomes nearly as well as prediction markets.

There is also no reason to restrict betting to sets of studies. One could set up betting for individual studies, just as individuals can bet on the outcome of a single match in sports or a single election. Betting might be more usefully employed to predict the outcomes of original studies than to vet the outcomes of replication studies. For example, if everybody bets that a study will produce a significant result, there appears to be little uncertainty about the outcome, and the actual study may be a waste of resources. One concern in psychology is that many studies merely produce significant p-values for obvious predictions. Betting on effect sizes would help to make effect sizes more important. If everybody bets on a very small effect size, a study might not be worth running because the expected effect is trivial, even if it is greater than zero. Betting on effect sizes could also be used as an informal power analysis to determine the sample size of the actual study.

References

Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., Nosek, B. A., & Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences, 112(50), 15343–15347. https://doi.org/10.1073/pnas.1516179112

Bill von Hippel and Ulrich Schimmack discuss Bill’s Replicability Index

Background:

We had never met in person or interacted professionally before we embarked on this joint project. This blog post emerged from a correspondence that was sparked by one of Uli’s posts on the Replicability-Index blog. We have carried the conversational nature of that correspondence over to the post itself.

Bill: Uli, before we dive into the topic at hand, can you provide a super brief explanation of how the Replicability Index works? Just a few sentences for those who might be new to your blog and don’t understand how one can assess the replicability of the findings from a single paper.

Uli: The replicability of a finding is determined by the true power of the study, and true power depends on the sample size and the population effect size. We have to estimate replicability because the population effect size is unknown, but studies with higher power are more likely to produce smaller p-values. We can convert p-values into a measure of observed power. For a single statistical test this estimate is extremely noisy, but it is the best guess we can make. So, a result with a p-value of .05 (50% observed power) is less likely to replicate than a result with a p-value of .005 (80% observed power). A value of 50% may still look good, but observed power is inflated when we condition on significance. To correct for this inflation, the R-Index subtracts the difference between the success rate and observed power (the inflation) from observed power. For a single test, the success rate is 1 (100%) because a significant result was observed. This means that an observed power of 50% produces an R-Index of 50% – (100% – 50%) = 0%. In contrast, an observed power of 80% produces an R-Index of 80% – (100% – 80%) = 60%. The main problem is that observed p-values are highly variable. It is therefore better to use several p-values and to compute the R-Index based on average observed power. For larger sets of studies, a more sophisticated method called z-curve can produce actual estimates of power. It can also be used to estimate the false positive risk. Sorry if this is not super brief.
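For readers who want to see the arithmetic, here is a minimal sketch of the conversion from a two-sided p-value to observed power (assuming a z-test with alpha = .05, as in the description above) and the resulting R-Index for a single significant result. The scipy code and function names are my own illustration; it reproduces the 0% and 60% figures Uli mentions.

```python
# Sketch: observed power and R-Index for a single significant result.
# Observed power is computed from the two-sided p-value, assuming a z-test
# with alpha = .05; the success rate for a single significant result is 1.
from scipy.stats import norm

def observed_power(p, alpha=0.05):
    z_obs = norm.ppf(1 - p / 2)       # convert two-sided p-value to z-score
    z_crit = norm.ppf(1 - alpha / 2)  # critical value, 1.96 for alpha = .05
    return 1 - norm.cdf(z_crit - z_obs)

def r_index(p, success_rate=1.0):
    power = observed_power(p)
    inflation = success_rate - power  # how much observed power is inflated
    return power - inflation

for p in (0.05, 0.005):
    print(f"p = {p}: observed power = {observed_power(p):.2f}, "
          f"R-Index = {r_index(p):.2f}")
# p = .05  -> observed power ~ .50, R-Index ~ .00
# p = .005 -> observed power ~ .80, R-Index ~ .60
```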

Bill: That wasn’t super brief, but it was super informative. It does raise a worry in me, however. Essentially, your formula stipulates that every p value of .05 is inherently unreliable. Do we have any empirical evidence that the true replicability of p = .05 is functionally zero?

Uli:  The inference is probabilistic. Sometimes p = .05 will occur with high power (80%). However, empirically we know that p = .05 more often occurs with low power. The Open Science Collaboration project showed that results with p > .01 rarely replicated, whereas results with p < .005 replicated more frequently. Thus, in the absence of other information, it is rational to bet on replication failures when the p-value is just or marginally significant.

Bill: Good to know. I’ll return to this “absence of other information” issue later in our blog, but in the meantime, back to our story…I was familiar with your work prior to this year, as Steve Lindsay kept the editorial staff at Psychological Science up to date with your evaluations of the journal. But your work became much more personally relevant on January 19, when you wrote a blog post on “Personalized p-values for social/personality psychologists.”

Initially, I was curious if I would be on the list, hoping you had taken the time to evaluate my work so that I could see how I was doing. Your list was ordered from the highest replicability to lowest, so when I hadn’t seen my name by the half-way point, my curiosity changed to trepidation. Alas, there I was – sitting very near the bottom of your list with one of the lowest replicability indices of all the social psychologists you evaluated.

I was aghast. I had always thought we maintained good data practices in our lab: We set a desired N at the outset and never analyzed the data until we were done (having reached our N or occasionally run out of participants); we followed up any unexpected findings to be sure they replicated before reporting them, etc. But then I thought about the way we used to run the lab before the replication crisis:

  1. We never reported experiments that failed to support our hypotheses, but rather tossed them in the file drawer and tried again.
  2. When an effect had a p value between .10 and the .05 cut-off, we tried various covariates/control variables to see if they would push our effect over that magical line. Of course we reported the covariates, but we never reported their ad-hoc nature – we simply noted that we included them.
  3. We typically ran studies that were underpowered by today’s standards, which meant that the effects we found were bouncy and could easily be false positives.
  4. When we found an effect on one set of measures and not another, sometimes we didn’t report the measures that didn’t work.

The upshot of these reflections was that I emailed you to get the details on my individual papers to see where the problems were (which led to my next realization; I doubt I would have bothered contacting you if my numbers had been better). So here’s my first question: How many people have contacted you about this particular blog and is there any evidence that people are taking it seriously?

Uli:  I have been working on replicability for 10 years now. The general response to my work has been to ignore it, with the justification that it is not peer-reviewed. I recall only two requests: one to evaluate a department and one to evaluate an individual. However, the R-Index analysis is easy to do, and Mickey Inzlicht published a blog post about his self-analysis. I don’t know how many researchers have evaluated their work in private. It is harder to evaluate how many people take my work seriously. The main indicator of impact is the number of views of my blog posts, which has increased from 50,000 in 2015 to 250,000 in 2020. The publication of the z-curve package for R has also generated interest among researchers in conducting their own analyses.

Bill: That’s super impressive. I don’t think my entire body of research has been viewed anywhere near 250K times.

OK, once again, back to our story. When you sent me the data file on my papers, initially I was unhappy that you only used a subset of my empirical articles (24 of the 70 or so empirical papers I’ve published) and that your machine coding had introduced a bit of error into the process. But we decided to turn this into a study of sorts, so we focused on those 24 papers and the differences that would emerge as a function of machine vs. hand-coding and as a function of how many stats we pulled out of each experiment (all the focal statistics vs. just one stat for each experiment). Was that process useful for you? If so, what did you learn from it?

Uli: First of all, I was pleased that you engaged with the results. More importantly, I was also curious how the results would compare to hand-coding. I had done some comparisons for other social psychologists and had enough confidence in the results to post them, but I am aware that the method is not flawless and can produce misleading results in individual cases. I am also aware that my own hand-coding can be biased. So, your offer to do your own coding was a fantastic opportunity to examine the validity of my results.

Bill: Great. I’m still a little unsure what I’ve learned from this particular proctology exam, so let’s see what we can figure out here. If you’ll humor me, let’s start with our paper that has the lowest replicability index in your subsample – no matter which way we calculate it, we find less than a 50% chance that it will replicate. It was published in 2005 in PSPB and took an evolutionary approach to grandparenting. Setting aside the hypothesis, the relevant methods were as follows:

  1. We recruited all of the participants who were available that year in introductory psychology, so our N was large, but determined by external constraints.
  2. We replicated prior findings that served as the basis of our proposed effect.
  3. The test of our new hypothesis yielded a marginally significant interaction (F(1, 412) = 2.85, p < .10). In decomposing the interaction, we found a simple effect where it was predicted (F(1, 276) = 5.92, p < .02) and no simple effect where it wasn’t predicted (F < 1, ns. – *apologies for the imprecise reporting practices).

Given that: 1) we didn’t exclude any data (a poor practice we sometimes engaged in by reporting some measures and not others, but not in this paper), 2) we didn’t include any ad-hoc control variables (a poor practice we sometimes engaged in, but not in this paper), 3) we didn’t run any failed studies that were tossed out (a poor practice we regularly engaged in, but not in this paper), and 4) we reported the a priori test of our hypothesis exactly as planned…what are we to conclude from the low replicability index? Is the only lesson here that marginally significant interactions are highly unlikely to replicate? What advice would you have given me in 2004, if I had shown you these data and said I wanted to write them up?

Uli: There is a lot of confusion about research methods, the need for preregistration, and the proper interpretation of results. First, there is nothing wrong with the way you conducted the study. The problems arise when the results are interpreted as a successful replication of prior studies. Here is why. First, we do not know whether the prior studies used questionable research practices and reported inflated effect sizes. Second, the new findings are reported without information about effect sizes. What we really would like to know is the confidence interval around the predicted interaction effect, which would be the difference in effect sizes between the two conditions. With a p-value greater than .05, we know that the 95% CI includes a value of 0. So, we cannot reject the hypothesis that the two conditions do not differ at that level of confidence. We can accept more uncertainty by using a 90% or 80% confidence interval, but we would still want to know which effect sizes we can reject. It would also be important to specify what effect sizes would be considered too small to warrant a theory that predicts this interaction effect. Finally, the results suggest that the sample size of about 400 participants was still too small to have good power to detect and replicate the effect. A conclusive study would require a larger sample.
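To make the sample-size point concrete, here is a rough, hedged calculation based on the reported simple effect, F(1, 276) = 5.92, the stronger of the two effects described above. Because observed effect sizes are likely inflated by selection, the resulting total N should be read as a lower bound rather than a recommendation; the statsmodels-based code is my own sketch, not part of either party's actual analysis.

```python
# Sketch: rough sample-size calculation for replicating the simple effect
# F(1, 276) = 5.92 reported above. The observed effect size is probably
# inflated, so the required N is best treated as a lower bound.
import numpy as np
from statsmodels.stats.power import FTestAnovaPower

F, df_effect, df_error = 5.92, 1, 276
eta_sq = (F * df_effect) / (F * df_effect + df_error)  # partial eta squared
cohens_f = np.sqrt(eta_sq / (1 - eta_sq))              # convert to Cohen's f

for target_power in (0.80, 0.90):
    n_total = FTestAnovaPower().solve_power(effect_size=cohens_f, nobs=None,
                                            alpha=0.05, power=target_power,
                                            k_groups=2)
    print(f"Power {target_power:.0%}: total N ~ {int(np.ceil(n_total))}")
```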

Bill: Hmm, very interesting. But let me clarify one thing before we go on. In this study, the replication of prior effects that I mentioned wasn’t the marginal interaction that yielded the really low replicability index. Rather, it was a separate main effect, whereby participants felt closest to their mother’s mother, next to their mother’s father, next to their father’s mother, and last to their father’s father. The pairwise comparisons were as follows: “participants felt closer to mothers’ mothers than mothers’ fathers, F(1,464) = 35.88, p < .001, closer to mothers’ fathers than fathers’ mothers, F(1, 424) = 3.96, p < .05, and closer to fathers’ mothers than fathers’ fathers, F(1, 417) = 4.88, p < .03.”

We were trying to build on that prior effect by explaining the difference in feelings toward father’s mothers and mothers’ fathers, and that’s where we found the marginal interaction (which emerged as a function of a third factor that we had hypothesized would moderate the main effect).

I know it’ll slow things down a bit, but I’m inclined to take your advice and rerun the study with a larger sample, as you’ve got me wondering whether this marginal interaction and simple effect are just random junk or meaningful. We could run the study with one or two thousand people on Prolific pretty cheaply, as it only involves a few questions.

Shall we give it a try before we go on? In the spirit of both of us trying to learn something from this conversation, you could let me know what sample size would satisfy you as giving us adequate power to attempt a replication of the simple effect that I’ve highlighted above. I suspect that the sample size required to have adequate power for a replication of the marginal interaction would be too expensive, but I believe a study that is sufficiently powered to detect that simple effect will reveal an interaction of at least the magnitude we found in that paper (as I still believe in the hypothesis we were testing).

If that suits you, I’m happy to post this opener on your blog and then return in a few weeks with the results of the replication effort and the goal of completing our conversation.

Uli:  Testing our different intuitions about this particular finding with empirical data is definitely interesting, but I am a bit puzzled about the direction this discussion has taken. It is surely interesting to see whether this particular finding is real and can be replicated. Let’s assume for the moment that it does replicate. This, unfortunately, increases the chances that some of the other studies in the z-curve are even less likely to replicate, because there is clear evidence of selection bias and a low overall probability of replication. Think about it as an urn with 9 red marbles and 1 green marble. Red ones do not replicate and green ones do. After we pick the green marble on the first try, only red marbles are left.

One of the biggest open questions is what researchers actually did to get too many significant results. We have a few accounts of studies with non-significant results that were dropped, and anonymous surveys show that a variety of questionable research practices were used. Even though these practices are different from fraud and may have occurred without intent, researchers have been very reluctant to talk about the mistakes they made in the past. Carney walked away from power posing by admitting to the use of statistical shortcuts. I wonder whether you can tell us a bit more about the practices that led to the low EDR (expected discovery rate) estimate for your focal tests. I know it is a big ask, but I also know that young social psychologists would welcome open disclosure of past practices. As Mickey Inzlicht always tells me, “Hate the sin. Love the sinner.” My own z-curve is also mediocre, and I am currently in the process of examining my past articles to see which ones I still believe and which ones I no longer believe.

Bill: Fair question Uli – I’ve made more mistakes than I care to remember! But (at least until your blog post) I’ve comforted myself in the belief that peer-review corrected most of them and that the work I’ve published is pretty solid. So forgive me for banging on for so long, but I have a two-part answer to your question. Part 1 refers back to your statement above, about your work being “in the absence of other information”, and also incorporates your statement above about red and green marbles in an urn. And Part 2 builds on Part 1 by digging through my studies with low Replicability indices and letting you know whether (and if so where) I think they were problematic.

Part 1: “In the absence of other information” is a really important caveat. I understand that it’s the basis of your statistical approach, but of course research isn’t conducted in the absence of other information. In my own case, some of my hypotheses were just hunches about the world, based on observations or possible links between other ideas. I have relatively little faith in these hypotheses and have abandoned them frequently in the face of contrary or inconsistent evidence. But some of my hypotheses are grounded in a substantial literature or prior theorizing that strike me as rock solid. The Darwinian Grandparenting paper is just such an example, and thus it seems like a perfect starting point. The logic is so straightforward and sensible that I’d be very surprised if it’s not true. As a consequence, despite the weak statistical support for it, I’m putting my money on it to replicate (and it’s just self-report, so super easy to conduct a replication online).

And this line of reasoning leads me to dispute your “red and green marbles in the urn” metaphor. Your procedure doesn’t really tell us how many marbles are in the urn of these two colors. Rather, your procedure makes a guess about the contents of the urn, and that guess intentionally ignores all other information. Thus, I’d argue that a successful or failed replication of the grandparenting paper tells us nothing at all about the probability of replicating other papers I’ve published, as I’m bringing additional information to bear on the problem by including the theoretical strength of the claims being made in the paper. In other words, I believe your procedure has grossly underestimated the replicability of this paper by focusing only on the relevant statistics and ignoring the underlying theory. That doesn’t mean your procedure has no value, but it does mean that it’s going to make predictable mistakes.

Part 2: Here I’m going to focus on papers that I first authored, as I don’t think it’s appropriate for me to raise concerns about work that other people led without involving them in this conversation. With that plan in mind, let’s start at the bottom of the replication list you made for me in your collection of 24 papers and work our way up.

  1. Darwinian Grandparenting – discussed above and currently in motion to replicate (Laham, S. M., Gonsalkorale, K., & von Hippel, W. (2005). Darwinian grandparenting: Preferential investment in more certain kin. Personality and Social Psychology Bulletin, 31, 63-72.)
  2. The Chicken-Foot paper – I love this paper but would never conduct it that way now. The sample was way too small and the paper only allowed for a single behavioral DV, which was how strongly participants reacted when they were offered a chicken foot to eat. As a consequence, it was very under-powered. Although we ran that study twice, first as a pilot study in an undergraduate class with casual measurement and then in the lab with hidden cameras, and both studies “worked”, the first one was too informal and the second one was too small and would never be published today. Do I believe it would replicate? The effect itself is consistent with so many other findings that I continue to believe in it, but I would never place my money on replicating this particular empirical demonstration without a huge sample to beat down the inevitable noise (which must have worked in our favor the first time).

(von Hippel, W., & Gonsalkorale, K. (2005). “That is bloody revolting!” Inhibitory control of thoughts better left unsaid. Psychological Science, 16, 497-500.)

  • Stereotyping Against Your Will – this paper was wildly underpowered, but I think its low R-Index reflects the fact that in our final data set you asked me to choose just a single statistic for each experiment. In this study there were a few key findings with different measures and they all lined up as predicted, which gave me a lot more faith in it. Since its publication 20 years ago, we (and others) have found evidence consistent with it in a variety of different types of studies. I think we’ve failed to find the predicted effect in one or maybe two attempts (which ended up in the circular file, as all my failed studies did prior to the replication crisis), but all other efforts have been successful and are published. When we included all the key statistics from this paper in our replicability analysis, it had an R-Index of .79, which may be a better reflection of the reliability of the results.

Important caveat: With all that said, the original data collection included three or four different measures of stereotyping, only one of which showed the predicted age effect. I never reported the other measures, as the goal of the paper was to see if inhibition would mediate age differences in stereotyping and prejudice. In retrospect that’s clearly problematic, but at the time it seemed perfectly sensible, as I couldn’t mediate an effect that didn’t exist. On the positive side, the experiment included only two measures of prejudice, and both are reported in the paper.

(von Hippel, W., Silver, L. A., & Lynch, M. E. (2000). Stereotyping against your will: The role of inhibitory ability in stereotyping and prejudice among the elderly. Personality and Social Psychology Bulletin, 26, 523-532.)

  • Inhibitory effect of schematic processing on perceptual encoding – given my argument above that your R-index makes more sense when we include all the focal stats from each experiment, I’ve now shifted over to the analysis you conducted on all of my papers, including all of the key stats that we pulled out by hand (ignoring only results with control variables, etc.). That analysis yields much stronger R-indices for most of my papers, but there are still quite a few that are problematic. Sadly, this paper is the second from the bottom on my larger list. I say sadly because it’s my dissertation. But…when I reflect back on it, I remember numerous experiments that failed. I probably ran two failed studies for each successful one. At the time, no one was interested in them, and it didn’t occur to me that I was engaging in poor practices when I threw them in the bin. The main conclusion I came to when I finished the project was that I didn’t want to work on it anymore as it seemed like I spent all my time struggling with methodological details trying to get the experiments to work. Maybe each successful study was the one that found just the right methods and materials (as I thought at the time), but in hindsight I suspect not. And clearly the evidentiary value for the effect is functionally zero if we collapse across all the studies I ran. With that said, the key finding followed from prior theory in a pretty straightforward manner and we later found evidence for the proposed mechanism (which we published in a follow-up paper*). I guess I’d conclude from all this that if other people have found the effect since then, I’d believe in it, but I can’t put any stock in my original empirical demonstration.

(von Hippel, W., Jonides, J., Hilton, J. L., & Narayan, S. (1993). Inhibitory effect of schematic processing on perceptual encoding. Journal of Personality and Social Psychology, 64, 921-935.

*von Hippel, W., & Hawkins, C. (1994). Stimulus exposure time and perceptual memory. Perception and Psychophysics, 56, 525-535.)

  • The Linguistic Intergroup Bias (LIB) as an Indicator of Prejudice. This is the only other paper on which I was first author that gets an R-index of less than .5 when you include all the focal stats in the analysis. I have no doubt it’s because, like all of my work at the time, it was wildly under-powered and the effects weren’t very strong. Nonetheless, we’ve used the LIB many times since, and although we haven’t found the predicted results every time, I believe it works pretty reliably. Of course, I could easily be wrong here, so I’d be very interested if any readers of this blog have conducted studies using the LIB as an indicator of prejudice, and if so, whether it yielded the predicted results.

(von Hippel, W., Sekaquaptewa, D., & Vargas, P. (1997). The Linguistic Intergroup Bias as an implicit indicator of prejudice. Journal of Experimental Social Psychology, 33, 490-509.)

  • All my articles published in the last ten years with a low R-Index – Rather than continuing to torture readers with the details of each study, in this last section I’ve gone back and looked at all my papers published in the last 10 years with an R-Index less than .70 based on all the key statistics (not just a single stat for each experiment). This exercise yields 9 out of 25 empirical papers with an R-Index ranging from .30 to .62 (with five other papers for which an R-Index apparently couldn’t be calculated). The evidentiary value of these 9 papers is clearly in doubt, despite the fact that they were published at a time when we should have known better. So what’s going on here? Six of them were conducted on special samples that are expensive to run or incredibly difficult to recruit (e.g., people who have suffered a stroke, people who inject drugs, studies in an fMRI scanner), and as a result they are all underpowered. Perhaps we shouldn’t be doing that work, as we don’t have enough funding in our lab to run the kind of sample sizes necessary to have confidence in the small effects that often emerge. Or perhaps we should publish the papers anyway, and let readers decide if the effects are sufficiently meaningful to be worthy of further investigation. I’d be curious to hear your thoughts on this, Uli. Of the remaining three papers, one reports all four experiments we ran prior to publication, but it has since proven difficult to replicate and I have my doubts about it (let’s call that a clear predictive win for the R-Index). Another is well powered and largely descriptive without much hypothesis testing, and I’m not sure an R-Index makes sense for it. And the last one is underpowered (despite being run on undergraduates), so we clearly should have done better.

What do I conclude from this exercise? A consistent theme in our findings that have a low R-Index is that they have small sample sizes and report small effects. Some of those probably reflect real findings, but others probably don’t. I suspect the single greatest threat to their validity (beyond the small sample sizes) was the fact that until very recently we never reported experiments that failed. In addition, sometimes we didn’t report measures we had gathered if they didn’t work out as planned and sometimes we added control variables into our equations in an ad-hoc manner. Failed experiments, measures that don’t work, and impactful ad-hoc controls are all common in science and reflect the fact that we learn what we’re doing as we go. But the capacity for other people to evaluate the work and its evidentiary value is heavily constrained when we don’t report those decisions. In retrospect, I deeply regret placing a greater emphasis on telling a clear story than on telling a transparent and complete story.

Has this been a wildly self-serving tour through the bowels of a social psychology lab whose R-index is in the toilet? Research on self-deception suggests you should probably decide for yourself.

Uli:  Thank you for your candid response. I think for researchers our age (not sure really how old you are) it will be easier to walk away from some articles published in the anything-goes days of psychological science because we still have time to publish some new and better work. As time goes on, it may become easier for everybody to acknowledge mistakes and become less defensive. I hope that your courage and our collaboration encourage more people to realize that the value of a researcher is not measured in terms of number of publications or citations. Research is like art and not every van Gogh is a masterpiece. We are lucky if we make at least one notable contribution to our science. So, losing a few papers to replication failures is normal. Let’s see what the results of the replication study will show.

Bill: I couldn’t agree with you more! (Except for the ‘courage’ part; I’m trepidatious as hell that my work won’t be taken seriously [or funded] anymore. But so be it.) I’ll be in touch as soon as we get ethics approval and run our replication study…

Self-Replications in JPSP: A Meta-Analysis

Preface

This blog post was inspired by my experience of receiving a rejection of a replication manuscript. We replicated Diener et al.’s (1995) JPSP article on the personality structure of affect. For the most part, it was a successful replication and a generalization to non-student samples. The manuscript was desk rejected because the replication study was not close enough in terms of the items and methods that we used. I was shocked that JPSP would reject replication studies, which made me wonder what the criteria for acceptance are.

Abstract

In 2015, JPSP started to publish online-only replication articles. I examined what these articles have revealed about the replicability of articles published in JPSP before 2015. Only 21 such articles were published between 2015 and 2020. Only 7 of these articles reported replications of JPSP articles, and one of them included replications of 3 original articles. Out of these 9 replications, 6 were successful and 3 were failures. This finding shows once more that psychologists do everything in their power to appear trustworthy without doing the things that are required to gain or regain trust. While fabulous review articles tout the major reforms that have been made (Nelson et al., 2019), the reality is often much less glamorous. It remains unclear which articles in JPSP can be trusted, and selection for significance may undermine the value of self-replications in JPSP.

Introduction

The past decade has revealed a replication crisis in social psychology. First, Bargh’s famous elderly priming study, in which primed participants walked more slowly, did not replicate. Then, only 14 out of 55 (25%) significant results could be replicated in an investigation of the replicability of social psychology (Open Science Collaboration, 2015). While some social psychologists tried to dismiss this finding, additional evidence further confirmed that social psychology has a replication crisis (Motyl et al., 2017). A statistical method that corrects for publication bias and other questionable research practices estimates a replication rate of 43% (Schimmack, 2020). This estimate was replicated with a larger dataset of the most cited articles by eminent social psychologists (49%; Schimmack, 2021). However, these statistical estimates assume that it is possible to replicate studies exactly, whereas actual replication studies are often conceptual replications that vary in some attributes. Most often, the populations of the original and replication studies differ. Due to regression to the mean, effect sizes in replication studies are likely to be weaker. Thus, the statistical estimates are likely to overestimate the success rate of actual replication studies (Bartos & Schimmack, 2021). In short, 49% is an upper limit, and we can currently conclude that the actual replication rate is somewhere between 25% and 50%. This is also consistent with analyses of statistical power in social psychology (Cohen, 1962; Sedlmeier & Gigerenzer, 1989).

There are two explanations for the emergence of replication failures in the past decade. One explanation is that social psychologists simply did not realize the importance of replication studies and forgot to replicate their findings. They only learned about the need to replicate findings in 2011, and when they started conducting replication studies, they realized that many of their findings are not replicable. Consistent with this explanation, Nelson, Simmons, and Simonsohn (2019) report that out of over 1,000 curated replication attempts, 96% have been conducted since 2011. The problem with this explanation is that it is not true. Psychologists have conducted replication studies since the beginning of their science. Since the late 1990s, many articles in social psychology have reported at least two and sometimes many more conceptual replication studies. Bargh reported two close replications of his elderly priming study in an article with four studies (Bargh et al., 1996).

The real reason for the replication crisis is that social psychologists selected studies for significance (Motyl et al., 2017; Schimmack, 2021; Sterling, 1959; Sterling et al., 1995). As a result, only replication studies with significant results were published. What changed in 2011 is that researchers suddenly were able to circumvent censorship at traditional journals and publish replication failures in new journals that were less selective, which, in this case, was a good thing (Doyen, Klein, Pichon, & Cleeremans, 2012; Ritchie, Wiseman, & French, 2012). The problem with this explanation is that it is true, but it makes psychological scientists look bad. Even undergraduate students with little formal training in philosophy of science realize that selective publishing of successful studies is inconsistent with the goal of searching for the truth (Ritchie, 2021). However, euphemistic descriptions of the research culture before 2011 avoid mentioning questionable research practices (Weir, 2015) or describe these practices as honest (Nelson et al., 2019). Even suggestions that these practices were at best honest mistakes are often met with hostility (Fiske, 2016). Rather than cleaning up the mess that has been created by selection for significance, social psychologists avoid discussing their past practices in order to hide replication failures. As a result, not much progress has been made in vetting the credibility of thousands of published articles that may provide no empirical support for their claims because the results might not replicate.

In short, social psychology suffers from a credibility crisis. The question is what social psychologists can do to restore credibility and to regain trust in their published results. For new studies this can be achieved by avoiding the pitfalls of the past. For example, studies can be pre-registered and journals may accept articles before the results are known. But what should researchers, teachers, students, and the general public do with the thousands of published results?

One solution to this problem is to conduct replication studies of published findings and to publish the results of these studies whether they are positive or negative. In their fantastic (i.e., imaginative or fanciful; remote from reality) review article, Nelson et al. (2019) proclaim that “top journals [are] routinely publishing replication attempts, both failures and successes” (p. 512). That would be wonderful if it were true, but top journals are considered top journals because they are highly selective in what they publish and they have limited journal space. So, every replication study competes with an article that reports an intriguing, exciting, and groundbreaking new discovery. Editors would need superhuman strength to resist the temptation to publish the sexy new finding rather than a replication of an article from 1977 or 1995. Surely, there are specialized journals for such laudable efforts that make an important contribution to science but, unfortunately, do not meet the high threshold of a top journal that has to maintain its status as a top journal.

The Journal of Personality and Social Psychology found an ingenious solution to this problem. To avoid competition with groundbreaking new research, replication studies can be published in the journal, but only online. Thus, these extra articles do not count towards the limited page budget that is needed to ensure high profit margins for the predatory (i.e., for-profit) publisher. Here, I examine which articles JPSP has published as online-only replications.

Data

The first online-only replication article was published in 2015. Since then, JPSP has published 21 such articles (not counting 2021).

In the years from 1965 to 2014, JPSP published 9,428 articles. Thus, the 21 replication articles provide new, credible evidence for only 21/9,428 = 0.22% of the articles that were published before 2015, when selection bias undermined the credibility of the reported evidence. Despite the small sample size, it is interesting to examine the nature and the outcomes of the studies reported in these 21 articles.

1. SUCCESS

Eschleman, K. J., Bowling, N. A., & Judge, T. A. (2015). The dispositional basis of attitudes: A replication and extension of Hepler and Albarracín (2013). Journal of Personality and Social Psychology, 108(5), e1–e15. https://doi.org/10.1037/pspp0000017

Hepler, J., & Albarracín, D. (2013). Attitudes without objects: Evidence for a dispositional attitude, its measurement, and its consequences. Journal of Personality and Social Psychology, 104(6), 1060–1076. https://doi.org/10.1037/a0032282

The original authors introduced a new measure called the Dispositional Attitude Measure (DAM). The replication study was designed to examine whether the DAM shows discriminant validity compared to an existing measure, the Neutral Objects Satisfaction Questionnaire (NOSQ). The replication studies replicated the previous findings, but also suggested that DAM and NOSQ are overlapping measures of the same construct. If we focus narrowly on replicability, this replication study is a success.

2. FAILURE

Van Dessel, P., De Houwer, J., Roets, A., & Gast, A. (2016). Failures to change stimulus evaluations by means of subliminal approach and avoidance training. Journal of Personality and Social Psychology, 110(1), e1–e15. https://doi.org/10.1037/pspa0000039

This article failed to find evidence that subliminal approach and avoidance training changes stimulus evaluations, an effect reported by Kawakami et al. (2007). Thus, this article counts as a failure.

Kawakami, K., Phills, C. E., Steele, J. R., & Dovidio, J. F. (2007). (Close) distance makes the heart grow fonder: Improving implicit racial attitudes and interracial interactions through approach behaviors. Journal of Personality and Social Psychology, 92, 957–971. http://dx.doi.org/10.1037/0022-3514.92.6.957

Citation counts suggest that the replication failure has reduced citations of the original article, although 4 articles have already cited it in 2021.

Most worrisome, an Annual Review of Psychology chapter (edited by Susan Fiske) perpetuates the idea that subliminal stimuli could reduce prejudice: “Interventions seeking to automate more positive responses to outgroup members may train people to have an “approach” response to Black faces (e.g., by pulling a joystick toward themselves when Black faces appear on a screen; see Kawakami et al. 2007)” (Paluck, Porat, Clark, & Green, 2021, p. 543). The chapter does not cite the replication failure.

3. SUCCESS

Rieger, S., Göllner, R., Trautwein, U., & Roberts, B. W. (2016). Low self-esteem prospectively predicts depression in the transition to young adulthood: A replication of Orth, Robins, and Roberts (2008). Journal of Personality and Social Psychology, 110(1), e16–e22. https://doi.org/10.1037/pspp0000037

The original article used a cross-lagged panel model to claim that low self-esteem causes depression (rather than depression causing low self-esteem).

Orth, U., Robins, R. W., & Roberts, B. W. (2008). Low self-esteem prospectively predicts depression in adolescence and young adulthood. Journal of Personality and Social Psychology, 95(3), 695–708. https://doi.org/10.1037/0022-3514.95.3.695

The replication study showed the same results. In this narrow sense it is a success.

The same year, JPSP also published an “original” article that showed the same results.

Orth, U., Robins, R. W., Meier, L. L., & Conger, R. D. (2016). Refining the vulnerability model of low self-esteem and depression: Disentangling the effects of genuine self-esteem and narcissism. Journal of Personality and Social Psychology, 110(1), 133–149. https://doi.org/10.1037/pspp0000038

Last year, the authors published a meta-analysis of 10 studies that all consistently show the main result.

Orth, U., Clark, D. A., Donnellan, M. B., & Robins, R. W. (2021). Testing prospective effects in longitudinal research: Comparing seven competing cross-lagged models. Journal of Personality and Social Psychology, 120(4), 1013-1034. http://dx.doi.org/10.1037/pspp0000358

The high replicability of the key finding in these articles is not surprising because it is a statistical artifact (Schimmack, 2020). The authors also knew about this because I told them as a reviewer when their first manuscript was under review at JPSP, but neither the authors nor the editor seemed to care. In short, statistical artifacts are highly replicable.

4. SUCCESS

Davis, D. E., Rice, K., Van Tongeren, D. R., Hook, J. N., DeBlaere, C., Worthington, E. L., Jr., & Choe, E. (2016). The moral foundations hypothesis does not replicate well in Black samples. Journal of Personality and Social Psychology, 110(4), e23–e30. https://doi.org/10.1037/pspp0000056

The main focus of this “replication” article was to test the generalizability of the key finding in Graham, Haidt, and Nosek’s (2009) original article to African Americans. They also examined whether the results replicate in White samples.

Graham, J., Haidt, J., & Nosek, B. A. (2009). Liberals and conservatives rely on different sets of moral foundations. Journal of Personality and Social Psychology, 96(5), 1029–1046. https://doi.org/10.1037/a0015141

Study 1 found weak evidence that the relationship between political conservatism and authority differs across racial groups, beta = .25 vs. beta = .47, chi2(1) = 3.92, p = .048. Study 2 replicated this finding, beta = .43 vs. beta = .00, chi2(1) = 7.04, but the p-value (.008) was still above .005. While stronger evidence for the moderating effect of race is needed, the study counts as a successful replication of the relationship among White or predominantly White samples.

5. EXCLUDED

Crawford, J. T., Brandt, M. J., Inbar, Y., & Mallinas, S. R. (2016). Right-wing authoritarianism predicts prejudice equally toward “gay men and lesbians” and “homosexuals”. Journal of Personality and Social Psychology, 111(2), e31–e45. https://doi.org/10.1037/pspp0000070

This article reports replication studies, but the original studies were not published in JPSP. Thus, the results provide no information about the replicability of JPSP articles.

Rios, K. (2013). Right-wing authoritarianism predicts prejudice against “homosexuals” but not “gay men and lesbians.” Journal of Experimental Social Psychology, 49, 1177–1183. http://dx.doi.org/10.1016/j.jesp.2013.05.013

6. EXCLUDED

Panero, M. E., Weisberg, D. S., Black, J., Goldstein, T. R., Barnes, J. L., Brownell, H., & Winner, E. (2016). Does reading a single passage of literary fiction really improve theory of mind? An attempt at replication. Journal of Personality and Social Psychology, 111(5), e46–e54. https://doi.org/10.1037/pspa0000064

I excluded this article because it did not replicate a JPSP article. The original article was published in Science. Thus, the outcome of this replication study tells us nothing about the replicability of articles published in JPSP.

7. EXCLUDED

Twenge, J. M., Carter, N. T., & Campbell, W. K. (2017). Age, time period, and birth cohort differences in self-esteem: Reexamining a cohort-sequential longitudinal study. Journal of Personality and Social Psychology, 112(5), e9–e17. https://doi.org/10.1037/pspp0000122

This article challenges the conclusions of the original article and presents new analyses using the same data. Thus, it is not a replication study.

8. EXCLUDED

Gebauer, J. E., Sedikides, C., Schönbrodt, F. D., Bleidorn, W., Rentfrow, P. J., Potter, J., & Gosling, S. D. (2017). The religiosity as social value hypothesis: A multi-method replication and extension across 65 countries and three levels of spatial aggregation. Journal of Personality and Social Psychology, 113(3), e18–e39. https://doi.org/10.1037/pspp0000104

This article is a successful self-replication of an article by the first two authors. The original article was published in Psychological Science. Thus, it does not provide evidence about the replicability of JPSP articles.

Gebauer, J. E., Sedikides, C., & Neberich, W. (2012). Religiosity, social self-esteem, and psychological adjustment: On the cross-cultural specificity of the psychological benefits of religiosity. Psychological Science, 23, 158–160. http://dx.doi.org/10.1177/0956797611427045

9. EXCLUDED

Siddaway, A. P., Taylor, P. J., & Wood, A. M. (2018). Reconceptualizing Anxiety as a Continuum That Ranges From High Calmness to High Anxiety: The Joint Importance of Reducing Distress and Increasing Well-Being. Journal of Personality and Social Psychology, 114(2), e1–e11. https://doi.org/10.1037/pspp0000128

This article replicates an original study published in Psychological Assessment. Thus, it does not tell us anything about the replicability of research in JPSP.

Vautier, S., & Pohl, S. (2009). Do balanced scales assess bipolar constructs? The case of the STAI scales. Psychological Assessment, 21, 187–193. http://dx.doi.org/10.1037/a0015312

10. EXCLUDED

Hounkpatin, H. O., Boyce, C. J., Dunn, G., & Wood, A. M. (2018). Modeling bivariate change in individual differences: Prospective associations between personality and life satisfaction. Journal of Personality and Social Psychology, 115(6), e12-e29. http://dx.doi.org/10.1037/pspp0000161

This article is a method article. The word replication does not appear once in it.

11. EXCLUDED

Burns, S. M., Barnes, L. N., McCulloh, I. A., Dagher, M. M., Falk, E. B., Storey, J. D., & Lieberman, M. D. (2019). Making social neuroscience less WEIRD: Using fNIRS to measure neural signatures of persuasive influence in a Middle East participant sample. Journal of Personality and Social Psychology, 116(3), e1–e11. https://doi.org/10.1037/pspa0000144

“In this study, we demonstrate one approach to addressing the imbalance by using portable neuroscience equipment in a study of persuasion conducted in Jordan with an Arabic-speaking sample. Participants were shown persuasive videos on various health and safety topics while their brain activity was measured using functional near infrared spectroscopy (fNIRS). Self-reported persuasiveness ratings for each video were then recorded. Consistent with previous research conducted with American subjects, this work found that activity in the dorsomedial and ventromedial prefrontal cortex predicted how persuasive participants found the videos and how much they intended to engage in the messages’ endorsed behaviors.”

This article reports a conceptual replication study. It uses a different population (Jordan rather than the US) and a different methodology. Because a key finding did replicate, it might be considered a successful replication, but a failure could have been attributed to the differences in population and methodology, and it is not clear that a failure would have been reported. The study should have been conducted as a registered report.

12. FAILURE

Wilmot, M. P., Haslam, N., Tian, J., & Ones, D. S. (2019). Direct and conceptual replications of the taxometric analysis of type a behavior. Journal of Personality and Social Psychology, 116(3), e12–e26. https://doi.org/10.1037/pspp0000195

This article fails to replicate the claim that Type A and Type B are distinct types rather than extremes on a continuum. Thus, this article counts as a failure.

Strube, M. J. (1989). Evidence for the Type in Type A behavior: A taxometric analysis. Journal of Personality and Social Psychology, 56(6), 972–987. https://doi.org/10.1037/0022-3514.56.6.972

It is difficult to evaluate the impact of this replication failure because the replication study was just published and the original article received hardly any citations in recent years. Overall, it has received 68 citations since 1989.

13. EXCLUDED

Kim, J., Schlegel, R. J., Seto, E., & Hicks, J. A. (2019). Thinking about a new decade in life increases personal self-reflection: A replication and reinterpretation of Alter and Hershfield’s (2014) findings. Journal of Personality and Social Psychology, 117(2), e27–e34. https://doi.org/10.1037/pspp0000199

This article replicated an original article published in PNAS. It therefore cannot be used to examine the replicability of articles published in JPSP.

Alter, A. L., & Hershfield, H. E. (2014). People search for meaning when they approach a new decade in chronological age. Proceedings of the National Academy of Sciences of the United States of America, 111, 17066–17070. http://dx.doi.org/10.1073/pnas.1415086111

14. EXCLUDED

Mõttus, R., Sinick, J., Terracciano, A., Hřebíčková, M., Kandler, C., Ando, J., Mortensen, E. L., Colodro-Conde, L., & Jang, K. L. (2019). Personality characteristics below facets: A replication and meta-analysis of cross-rater agreement, rank-order stability, heritability, and utility of personality nuances. Journal of Personality and Social Psychology, 117(4), e35–e50. https://doi.org/10.1037/pspp0000202

This article replicates a previous study of personality structure with the same items and methods in a different sample. The results are a close replication. Thus, it is a success, but it is excluded because the original study was published in 2017 and therefore does not shed light on the replicability of articles published in JPSP before 2015.

Mõttus, R., Kandler, C., Bleidorn, W., Riemann, R., & McCrae, R. R. (2017). Personality traits below facets: The consensual validity, longitudinal stability, heritability, and utility of personality nuances. Journal of Personality and Social Psychology, 112, 474–490. http://dx.doi.org/10.1037/pspp0000100

15. EXCLUDED

van Scheppingen, M. A., Chopik, W. J., Bleidorn, W., & Denissen, J. J. A. (2019). Longitudinal actor, partner, and similarity effects of personality on well-being. Journal of Personality and Social Psychology, 117(4), e51–e70. https://doi.org/10.1037/pspp0000211

This study is a replication and extension of a previous study that examined the influence of personality on well-being in couples. A key finding was that personality similarity explained very little variance in well-being. While evidence for the lack of an effect is important, the replication crisis is about the reporting of too many significant results. A concern could be that the prior article reported a false negative result, but the studies were based on large samples with high power to detect even small effects.

Dyrenforth, P. S., Kashy, D. A., Donnellan, M. B., & Lucas, R. E. (2010). Predicting relationship and life satisfaction from personality in nationally representative samples from three countries: The relative importance of actor, partner, and similarity effects. Journal of Personality and Social Psychology, 99, 690–702. http://dx.doi.org/10.1037/a0020385

16. EXCLUDED

Buttrick, N., Choi, H., Wilson, T. D., Oishi, S., Boker, S. M., Gilbert, D. T., Alper, S., Aveyard, M., Cheong, W., Čolić, M. V., Dalgar, I., Doğulu, C., Karabati, S., Kim, E., Knežević, G., Komiya, A., Laclé, C. O., Ambrosio Lage, C., Lazarević, L. B., . . . Wilks, D. C. (2019). Cross-cultural consistency and relativity in the enjoyment of thinking versus doing. Journal of Personality and Social Psychology, 117(5), e71–e83. https://doi.org/10.1037/pspp0000198

This article mainly aims to examine the cross-cultural generality of a previous study by Wilson et al. (2014). Moreover, the original study was published in Science. Thus, it does not help to examine the replicability of research published in JPSP before 2015.

Wilson, T. D., Reinhard, D. A., Westgate, E. C., Gilbert, D. T., Ellerbeck, N., Hahn, C., . . . Shaked, A. (2014). Just think: The challenges of the disengaged mind. Science, 345, 75–77. http://dx.doi.org/10.1126/science.1250830

17. MIXED

Yeager, D. S., Krosnick, J. A., Visser, P. S., Holbrook, A. L., & Tahk, A. M. (2019). Moderation of classic social psychological effects by demographics in the U.S. adult population: New opportunities for theoretical advancement. Journal of Personality and Social Psychology, 117(6), e84-e99. http://dx.doi.org/10.1037/pspa0000171

17a. EXCLUDED

This article reports replications of seven original studies and also examined whether the results are moderated by demographics (age / student status). The first effect is conformity to a simply presented descriptive norm (Asch, 1952; Cialdini, 2003; Sherif, 1936).
[None of these references are from JPSP]

17b. SUCCESS

The effect of a content-laden persuasive message on attitudes as moderated by argument quality and need for cognition (e.g., Cacioppo, Petty, & Morris, 1983).

Cacioppo, J. T., Petty, R. E., & Morris, K. J. (1983). Effects of need for cognition on message evaluation, recall, and persuasion. Journal of Personality and Social Psychology, 45, 805–818. http://dx.doi.org/10.1037/0022-3514.45.4.805

17c. EXCLUDED

Base-rate underutilization (using the “lawyer/engineer” problem; Kahneman & Tversky, 1973). [not in JPSP]

17d. EXCLUDED

The conjunction fallacy (using the “Linda” problem; Tversky & Kahneman, 1983). [not in JPSP]

17e. EXCLUDED

Underappreciation of the law of large numbers (using the “hospital” problem; Tversky & Kahneman, 1974). [Not in JPSP]

17f. SUCCESS

The false consensus effect (e.g., Ross, Greene, & House, 1977).

Ross, L., Greene, D., & House, P. (1977). The “false consensus effect”: An egocentric bias in social perception and attribution processes. Journal of Experimental Social Psychology, 13, 279–301. http://dx.doi.org/10.1016/0022-1031(77)90049-X

17g. FAILURE

The effect of “ease of retrieval” on self-perceptions (e.g., Schwarz et al., 1991).

Schwarz, N., Bless, H., Strack, F., Klumpp, G., Rittenauer-Schatka, H., & Simons, A. (1991). Ease of retrieval as information: Another look at the availability heuristic. Journal of Personality and Social Psychology, 61, 195–202. http://dx.doi.org/10.1037/0022-3514.61.2.195

Because the replication failure was just published, it is not possible to examine whether it had any effect on citations.

In sum, the Yeager et al. (2019) article successfully replicated 2 JPSP articles and failed to replicate 1.

18. SUCCESS

Van Dessel, P., De Houwer, J., Gast, A., Roets, A., & Smith, C. T. (2020). On the effectiveness of approach-avoidance instructions and training for changing evaluations of social groups. Journal of Personality and Social Psychology, 119(2), e1–e14. https://doi.org/10.1037/pspa0000189

This is another replication of the Kawakami et al. (2007) article, but it focuses on Experiment 1, which did not use subliminal stimuli. The article reports successful replications in Study 1, t(61) = 1.72, p = .045 (one-tailed), and Study 3, t(981) = 2.19, p = .029, t(362) = 2.76, p = .003. Thus, this article counts as a success. It should be noted, however, that these effects disappear in studies with a delay between the training and testing sessions (Lai et al., 2016).

Kawakami, K., Phills, C. E., Steele, J. R., & Dovidio, J. F. (2007). (Close) distance makes the heart grow fonder: Improving implicit racial attitudes and interracial interactions through approach behaviors. Journal of Personality and Social Psychology, 92, 957–971. http://dx.doi.org/10.1037/0022-3514.92.6.957

19. EXCLUDED

Aknin, L. B., Dunn, E. W., Proulx, J., Lok, I., & Norton, M. I. (2020). Does spending money on others promote happiness?: A registered replication report. Journal of Personality and Social Psychology, 119(2), e15–e26. https://doi.org/10.1037/pspa0000191

This article replicated a study that was published in Science. It therefore does not tell us anything about the replicability of articles published in JPSP.

Dunn, E. W., Aknin, L. B., & Norton, M. I. (2008). Spending money on others promotes happiness. Science, 319, 1687–1688. http://dx.doi.org/10.1126/science.1150952

20. EXCLUDED

Calderon, S., Mac Giolla, E., Ask, K., & Granhag, P. A. (2020). Subjective likelihood and the construal level of future events: A replication study of Wakslak, Trope, Liberman, and Alony (2006). Journal of Personality and Social Psychology, 119(5), e27–e37. https://doi.org/10.1037/pspa0000214

Although this article reports two replication studies (both failures), the original studies were published in a different journal. Thus, the results do not provide information about the replicability of research published in JPSP.

Wakslak, C. J., Trope, Y., Liberman, N., & Alony, R. (2006). Seeing the forest when entry is unlikely: Probability and the mental representation of events. Journal of Experimental Psychology: General, 135, 641–653. http://dx.doi.org/10.1037/0096-3445.135.4.641

21. EXCLUDED

Burnham, B. R. (2020). Are liberals really dirty? Two failures to replicate Helzer and Pizarro’s (2011) study 1, with meta-analysis. Journal of Personality and Social Psychology, 119(6), e38–e42. https://doi.org/10.1037/pspa0000238

Although this article reports two replication studies (both failures), the original studies were published in a different journal. Thus, the results do not provide information about the replicability of research published in JPSP.

Helzer, E. G., & Pizarro, D. A. (2011). Dirty liberals! Reminders of physical cleanliness influence moral and political attitudes. Psychological Science, 22, 517–522. http://dx.doi.org/10.1177/0956797611402514

Results

Out of the 21 articles published under the e-replication format, only 7 articles report replications of studies published in JPSP before 2015. One of these articles reports replications of three original articles, and two of them report replications of different studies from the same original article (one failure, one success; Kawakami et al., 2007). Thus, there are a total of 9 replications, with 6 successes and 3 failures. This is a success rate of 67%, 95%CI = 31% to 98%.

The first observation is that the number of replication studies of studies published in JPSP is abysmally low. It is not clear why this is the case. Either researchers are not interested in conducting replication studies or JPSP is not accepting all submissions of replication studies for publication. Only the editors of JPSP know.

The second observation is that JPSP published more successful than failed replications. This is inconsistent with the results of the Open Science Collaboration project and with predictions based on statistical analyses of the p-values in JPSP articles (Open Science Collaboration, 2015; Schimmack, 2020, 2021). Although this difference may simply be sampling error because the sample of replication studies in JPSP is so small, it is also possible that the high success rate reflects systematic factors that select for significance.

First, researchers may be less motivated to conduct studies with a low probability of success, especially in research areas that have been tarnished by the replication crisis. Who still wants to do priming studies in 2021? Thus, bad research that was published before 2015 may simply die out. The problem with this slow death model of scientific self-correction is that old studies continue to be cited as evidence. Thus, JPSP should solicit replication studies of prominent articles with high citations even if these replication studies may produce failures.

Second, it is unfortunately possible that editors at JPSP prefer to publish studies that report successful outcomes rather than replication failures. To reassure consumers of JPSP research, editors should make it transparent whether replication studies get rejected and why. Given the e-only format, it is not clear why any replication studies would be rejected.

Conclusion

Unfortunately, the results of this meta-analysis show once more that psychologists do everything in their power to appear trustworthy without doing the things that are required to gain or regain trust. While fabulous review articles tout the major reforms that have been made (Nelson et al., 2019), the reality is often much less glamorous. Trying to get social psychologists to openly admit that they made (honest) mistakes in the past and to correct themselves is akin to getting Trump to admit that he lost the 2020 election. Most of the energy is wasted on protecting the collective self-esteem of card-carrying social psychologists in the face of objective, scientific criticism of their past and current practices. It remains unclear which results in JPSP are replicable and provide solid foundations for a science of human behavior and which results are nothing but figments of social psychologists’ imagination. Thus, I can only warn consumers of social psychological research to be as careful as they would be when they are trying to buy a used car. Often the sales pitch is better than the product (Ritchie, 2020; Singal, 2021).

Prejudice is in the Brain: So What?

Introduction

Social psychology aims to study real-world problems with the tools of experimental psychology. The classic obedience studies by Milgram aimed to provide insights into the Holocaust by examining participants’ reactions to a sadistic experimenter. In this tradition, social psychologists have studied prejudice against African Americans since the beginning of experimental social psychology.

As Amodio and Cikara (2021) note, these studies aim to answer questions such as “How do humans learn to favor some groups over others?” and “Why does merely knowing a person’s ethnicity or nationality affect how we see them, the emotions we feel toward them, and the way we treat them?”

In their chapter, the authors review the social neuroscience of prejudice. Looking for prejudice in the brain makes a lot of sense. A person without a brain is not prejudiced, so we know that prejudice is somewhere in the brain. The problem for neuroscience is that the brain is complex and it is not easy to figure out how it does what it does. Another problem is that prejudice is most relevant when individuals act on their prejudices. For example, it is possible that prejudice contributes to the high prevalence of police shootings that involve Black civilians (Schimmack & Carlson, 2020). However, the measurement of brain activity often requires repeated measurements of many trials under constrained laboratory conditions.

In short, using neuroscience to understand prejudice faces some practical challenges. I was therefore skeptical that this research has produced much useful information about prejudice. When I voiced my skepticism on Twitter, Amodio called me a bully. I therefore took a closer look at the chapter to see whether my skepticism was reasonable or whether I was uninformed and prejudiced against social neuroscientists.

Face Processing

To act on prejudice, the brain has to detect a difference between White and Black faces. Research shows that people do indeed notice these differences, especially in faces that were selected to be clear examples of the two categories. If asked to indicate whether a face is White or Black, responses can be made within a few hundred milliseconds.

The authors acknowledge that we do not need social neuroscience to know this: “behavioral studies suggest that social categorization occurs quickly (Macrae & Bodenhausen, 2000)” (p. 3), but they suggest that neuroscience produces additional information. Unfortunately, the meaning of early brain signals in the EEG is often unclear. Thus, the main conclusion that the authors can draw from these findings is that they provide “support for the early detection and categorization of race.” In other words, when the brain sees a Black person, we usually notice that the person is Black. It is not clear, however, that any of these early brain measures reflect the evaluation of the person, which is really the core of prejudice. Categorization is necessary, but not sufficient, for prejudice. Thus, this research does not really help us to understand why noticing that somebody is Black leads to negative evaluations of this person.

Another problem with this research is the artificial nature of the task. Rather than presenting a heterogeneous set of faces that is representative of participants’ social environment, participants see 50% White and 50% Black faces. Every White person who has suddenly found themselves in a situation where 50% or more of the people are Black may notice that they respond differently to this situation. The brain may look very different in these situations than in situations where race is not salient. In addition, the faces are strangers. Thus, these studies have no relevance for prejudice in work settings with colleagues or in classrooms where teachers know their students. This lack of ecological validity is of course not unique to brain studies of prejudice. It applies also to behavioral experiments.

The only interesting and surprising claim in this section is that Black participants respond to White faces just like White participants respond to Black faces. “Research with Black participants, in addition to White participants, has replicated this pattern and clarified that it is typically larger to racial outgroup faces rather than Black faces per se” (p. 5). The statement is a bit muddled because the out-group for Black participants is White.

Looking up the results from Dickter and Bartholow (2007) shows a clear participant x picture interaction effect (i.e., responses on opposite-race trials differ from responses on same-race trials). While the effect for White participants is clearly significant, the effect for Black participants is not, F(1,13) = 4.46, p = .054, but was misreported as significant, p < .05. The second study did not examine Black participants or faces. It also did not include White participants. It showed that Asian participants responded more strongly to the outgroup (White) than the in-group (Asian), F(1, 19) = 17.06, p = .0006. The lack of a White group of participants is puzzling. The third study had the largest sample of Black and White participants (Volpert-Esmond & Bartholow, 2019), but did not replicate Dickter and Bartholow’s findings: “the predicted Target race × Participant race interaction was not significant, b=−0.09, t(61.3)=−1.47, p=.146”. I have seen shady citations and failures to cite disconfirming evidence, but it is pretty rare for authors to simply list a disconfirming study as if it produced consistent evidence. In conclusion, there is no clear evidence about how minority groups respond to faces of different groups because most of the research is done by White researchers at White universities with White students.

The details are of course not important because the authors’ main goal is to sell social neuroscience. “Social neuroscience research has significantly advanced our understanding of the social categorization process” (p. 11). A close reading shows that this is not the case and that it is unclear what early brain signals mean and how they are modulated by the context, the race of participants, and the race of faces.

How is prejudice learned, represented, and activated?

Studying learning is a challenging task in an experimental context. To measure learning some form of memory task must be administered. Moreover, this assessment has to be preceded by a learning task. To make learning experiments more realistic, it is ideal to have a retention interval between the learning and the memory task. However, most studies in psychology are one-shot laboratory studies. Thus, the ecological validity of learning studies is low. Not surprisingly, the chapter contains no studies that examine neurological responses during learning or memory tasks related to prejudice.

Instead, the chapter reviews circumstantial evidence that may be related to prejudice. First, the authors review the general literature on Pavlovian aversive conditioning. However, they provide no evidence that prejudice is rooted in fear conditioning. In fact, many White Americans in White parts of the country are prejudiced without ever having had threatening interactions with Black Americans. Not surprisingly, even the authors note that fear conditioning is not the most plausible root of prejudice.

“Some research has attempted to demonstrate a Pavlovian basis of prejudice using prepared fear or reversal learning paradigms (Dunsmoor et al., 2016; Olsson et al., 2005), but these results have been inconclusive regarding a prepared fear to Black faces (among White Participants) or have failed to replicate (Mallan et al., 2009; Molapour et al., 2015; Navarrete et al., 2009; Navarette et al., 2012). To our knowledge, research has not yet directly tested the hypothesis that social prejudice can be formed through Pavlovian aversive conditioning” (p. 14)

As processing of feared objects often involves the amygdala, one would expect White brains to show an amygdala response to Black faces. Contrary to this prediction, “most fMRI studies of race perception have not observed a difference in amygdala response to viewing racial outgroup compared with ingroup members (e.g., Beer et al., 2008; Gilbert et al., 2012; Golby et al., 2001; Knutson et al., 2007; Mattan et al., 2018; Phelps et al., 2000; Richeson et al., 2003; Ronquillo et al., 2005; Stanley et al., 2012; Telzer et al., 2013; Van Bavel et al., 2008, 2011).” (p. 15). The large number of studies shows how many resources were wasted on a hypothesis that is not grounded in an understanding of racism in the United States.

The chapter then reviews research on stereotypes. The main insight provided here is that “while the neural basis of stereotyping remains understudied, existing research consistently identifies the ATL (anterior temporal lobe) as supporting the representation of social stereotypes” (p. 17). However, it remains unclear what we learn about prejudice from this finding. If stereotypes were supported by some other brain area, would this change prejudice in some important way?

The authors next examine the involvement of instrumental learning in prejudice. “Although social psychologists have long hypothesized a role for instrumental learning in attitudes and social behavior (e.g., Breckler, 1984), this idea has only recently been tested using contemporary reinforcement learning paradigms and computational modeling (Behrens et al., 2009; Hackel & Amodio, 2018).” (p. 19). Checking Hackel and Amodio (2018) shows that this review article does not mention prejudice. Other statements have nothing to do with prejudice, but rather explain why prejudice may not influence responses to all group members. “Behavioral studies confirm that people incrementally update their attitudes about both persons (Hackel et al., 2019)” (p. 19). The authors want us to believe that “a model of instrumental prejudice may help to understand aspects of implicit prejudice” (p. 20), but they fail to make clear how instrumental learning is related to prejudice, let alone implicit prejudice.

The section on prejudice as habits starts with a wrong premise. “Habits: A basis for automatic prejudice? Automatic prejudices are often likened to habits; they appear to emerge from repeated negative experiences with outgroup members, unfold without intention, and resist change (Devine, 1989).” Devine’s (1989) classic subliminal priming study has not been replicated, and the robustness of subliminal priming findings in general has been questioned. Moreover, the study has been questioned on methodological grounds and it has been shown that classifying an individual as Black does not automatically trigger negative responses. The main reason why prejudice is not a habit is that it often requires many repeated instances to form a habit, and many White individuals have too little contact with Black individuals to form prejudice habits. The whole section is irrelevant because the authors note that “social neuroscience has yet to investigate the role of habit in prejudice” (p. 21). We can only hope that funding agencies are smart enough not to waste money on this kind of research.

This whole section ends with the following summary.

“A major contribution of social neuroscience research on prejudice has been to link different aspects of prejudice—stereotypes, affective bias, and discriminatory actions—to neurocognitive models of learning and memory. It reveals that intergroup bias, and implicit bias in particular, is not one phenomenon, but a set of different processes that may be formed, represented in the mind, expressed in behavior, and potentially changed via distinct interventions.” In short, we do not know anything more about prejudice than we would know without social neuroscience.

Effects of prejudice on perception

The first topic is face perception. Behavioral studies show that individuals tend to be better able to discriminate between faces of their own group than faces of another group. Faces are processed in a brain area called the fusiform gyrus. A study by Golby et al. (2001) with 10 White and 10 Black participants confirmed this finding for White Americans, t(8) = 2.10, p = .03, but not for African Americans, t(9) = 0.63. Given the small sample size, the interaction is not significant in this study. The more important finding was that the fusiform gyrus showed more activation to same-race faces, t(18) = 2.58, p = .02. Inconsistent with the behavioral data, African American participants showed as much same-race activation of the fusiform face area as White participants. Over the past two decades, this preliminary study has been cited over 300 times. We would expect a review in 2021 to include follow-up and replication studies, but the preliminary results of this seminal study are offered as evidence as if they are conclusive. Yet, in 2021 it is clear that many results with just significant p-values, p > .005, often do not replicate. The authors seem to be blissfully unaware of the replication crisis. I was able to find a recent study that examined own-group bias for White participants only, with three age groups. The study replicated findings that White participants show more activation to White faces than to Black faces, especially for adolescents and adults. The study also linked this finding to prejudice, but I will discuss these results later because it was not the focus of the review article.

In short, behavioral studies have demonstrated that White Americans have difficulties distinguishing between Black faces. This has led to false convictions based on misidentification by eyewitnesses. Expert testimony by psychologists has helped to draw awareness to this problem. Social neuroscience shows that this problem is correlated with activity in the fusiform gyrus. It is not clear, however, how knowledge about the localization of face processing in the brain provides a deeper understanding of the problem.

The authors suggest, however, that face processing may directly lead to discriminatory behavior, based on an article by Krosch and Amodio (2019). In a pair of experiments, White participants were given a small or large amount of money and then had to allocate it to White or Black recipients based on some superficial impression of deservingness. In Study 1 (N = 81, 10 excluded), EEG responses to the faces showed a greater N170 response to Black faces, but only when resources were scarce, two-way interaction F(1, 69) = 4.97, p = .029. Furthermore, the results showed a significant mediation effect on resource allocation, b = .14, se = .09, p = .039. Study 2 used fMRI (N = 35, 5 excluded). This study showed the race effect on the fusiform gyrus, but only in the scarcity condition, F(1, 28) = 7.16, p = .012. Despite the smaller sample size, the mediation analysis was also significant, b = .43, se = .17, t = 2.64, p = .014. While the conceptual replication of the finding across two different studies with different brain measures makes these results look credible, the fact that all critical tests produced just significant results, p > .01, undermines the credibility of these findings (Schimmack, 2012). The most powerful test of credibility for a small set of tests is the Test of Insufficient Variance (Schimmack, 2014; Renkewitz & Keiner, 2019). The test first converts the p-values into z-scores. It then compares the observed variance to the expected variance of 1. The observed variance for these four p-values is much smaller, V = .05. A chi-square test shows that the probability of this outcome by chance is p = .013. Thus, it is unlikely that sampling error alone produced this restricted amount of variation. A more likely explanation is that the authors used questionable research practices to produce a perfect picture of significant results when the actual studies had insufficient power to produce significant results even if the main hypotheses are true. The main problem is the mediation analyses, which rely on correlations in small samples. It has been shown that many mediation analyses cannot be trusted because they are biased by questionable research practices.
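For readers who want to check this calculation themselves, here is a minimal sketch in R of the Test of Insufficient Variance for the four critical p-values listed above (the results may differ slightly due to rounding of the reported p-values):

# Test of Insufficient Variance (TIVA) for the four critical p-values above
p <- c(.029, .039, .012, .014)     # reported two-sided p-values of the focal tests
z <- qnorm(1 - p/2)                # convert p-values to z-scores
v <- var(z)                        # observed variance, about .05
k <- length(z)
# Under random sampling error alone, the expected variance of the z-scores is 1;
# (k - 1) * v then follows a chi-square distribution with k - 1 degrees of freedom.
pchisq((k - 1) * v, df = k - 1)    # probability of a variance this small, about .013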

Effects of prejudice on emotion

Emotion is the most important topic for understanding prejudice. Whereas attitudes are broad dispositions to evaluate members of a specific group positively or negatively, emotions are the actual, momentary affective reactions to members of these groups. Ideally, neuroscience would be able to provide objective measures of emotions. These measures would reveal whether a White person responds with negative feelings in an interaction with a Black person. Obtaining objective, physiological indicators of emotions has been the holy grail of emotion research. First attempts to locate emotions in the body failed. Facial expressions (smiles and frowns) can provide valid information, but facial expressions can be controlled and do not always occur in response to emotional stimuli. Thus, the ability to measure brain activity seemed to open the door for objective measures of emotions. However, attempts to find signals of emotional valence in the EEG have failed. fMRI research focused on amygdala activity as a signal of fear, but later research showed that the amygdala also responds to some positive stimuli, specifically erotic stimuli. Given this disappointing history, I was curious to see what the latest social neuroscience research on emotion has uncovered.

As it turns out, this section provides no new insights into emotional responses to members of an outgroup. The main focus is on empathy (in the context of taking the perspective of an in-group or out-group member) and on guilt. The main reason why fear or hate are not explored is probably that there are no known neural correlates of these emotions and that research with undergraduate students responding to pictures of Black and White faces is unlikely to elicit strong emotions.

In short, the main topic where neuroscience could make a contribution suffers from a lack of valid measures of emotions in the brain.

Effects of prejudice on decision making

Emotional responses would be less of a problem if individuals did not act on their emotions. Most adult individuals learn to regulate their emotions and to inhibit undesirable behaviors. The reason prejudice is a problem for minority groups is that some White individuals do not feel a need to regulate their negative emotions towards African Americans or that they lack the ability to do so in some situations, which is often called implicit bias. Thus, understanding how the brain is involved in actual behaviors is even more important than understanding its contribution to emotions. Although of prime importance, this section is short and contains few citations. One reference is to the resource allocation study by Krosch and Amodio that I reviewed in detail earlier. Blissfully unaware of the questions raised about oxytocin research, the authors also cite a study with oxytocin administration (Marsh et al., 2017). Thus, there is no research reviewed here that illuminates what the brain is doing when White individuals discriminate against African Americans. This does not stop the authors from making a big summary statement that “social neuroscience research has refined our understanding of how prejudice influences the visual processing of faces, intergroup emotion, and decision-making processes, particularly as each type of response pertains to behavior” (p. 34).

Self-regulation of Prejudice

This section starts off with a study by Amodio et al. (2004) and the claim that the results of this study have been replicated in numerous studies (Amodio et al., 2006; 2008; Amodio & Swencionis, 2018; Bartholow et al., 2006; Beer et al., 2008; Correll et al., 2006; Hughes et al., 2017). The main claim based on these studies is that self-regulation of prejudice relies on “detection of bias and initiation of control, in dACC—a process that can operate rapidly and in the absence of deliberation, and which can explain individual differences in prejudice control failures” (p. 39).

Amodio et al.’s (2004) study used the weapons-identification task. This is an artificial task that puts participants in the position of a police officer who has to make a split-second decision about whether a civilian is holding a gun or some other object (e.g., a cell phone). Respondents have to respond as quickly as possible whether the object is a gun or not. The race of the civilians is manipulated to examine racial biases. A robust finding is that White participants are faster to identify guns after seeing a Black face than a White face and slower to identify a tool after seeing a Black face than a White face. On some trials, participants also make mistakes. When the brain of participants notices that a mistake was made, EEG shows a distinct signal that is called the error-related negativity (ERN). The key finding in this article is that the ERN was more pronounced when participants identified a tool as a gun in trials with Black faces than in trials with White faces, t(33) = 2.94, p = .006. Correlational analysis suggested that participants with larger ERNs after mistakes with Black faces learned from their mistakes and reduced their errors, r(32) = -.50, p = .004. These results show that at least some individuals are aware when prejudice influences their behaviors and control their behaviors to avoid acting on their prejudice. It is difficult to generalize from this study to regulation of prejudice in real life because the task is artificial and most situations provide only ambiguous feedback about the appropriateness of actions. Even the behavior here is a mere identification rather than an actual behavior such as a shoot or no-shoot decision, which might produce more careful responses and fewer errors, especially in more realistic training scenarios (Andersen zzz).

Another limitation of these studies is the reliance on a few pictures to represent the large diversity of Black and White people.

All replication studies seem to have used the same faces. Therefore, it is unclear how generalizable these results are and how much contextual factors (e.g., gender, age, clothing, location) might moderate the effect.

Some limitations of the generalizability were reported by Amodio and Swencionis (2018). The racial bias effect was eliminated (no longer statistically significant) when 80% of trials showed Black faces with tools rather than guns. This finding is not predicted by models that assume racial bias often has an implicit (automatic and uncontrollable) effect on behavior. Here it seems that simple knowledge about the low frequency of Black people with guns was sufficient to block the behavioral expression of prejudice. Study 4 measured EEG, but did not report ERN results.

The summary of this section concludes that “social neuroscience research on prejudice control has significantly expanded psychological theory by identifying and distinguishing multiple mechanisms of control” (p. 39). I would disagree. The main finding appears to be that the brain sometimes fails to notice that it made an error and that lack of awareness of these errors prohibits correcting them. However, the studies are designed to produce errors in the first place to be able to measure the ERN. Without time pressure, few errors would be made, and as Amodio and Swencionis showed, racial bias depends on the specific context. That being said, lack of awareness may cause sustained prejudice in the real world. One important role of diversity training is to make majority members aware of behaviors that hurt minority members. Awareness of the consequences should reduce the frequency of these behaviors because they are controllable, as the reviewed research suggests.

The conclusion section repeats the claim that the review highlights “major theoretical advances produced by this literature to date” (p. 42). However, this claim rings hollow in comparison to the dearth of findings that inform our understanding of prejudice. The main problem for social neuroscience of prejudice is that the core component of prejudice, negative affect, has no clear neural correlates in EEG or fMRI measures of the brain, and that experimental designs suitable for neuroscience have low ecological validity. The authors suggest that this may change in the future. They provide a study with Black and White South Africans as an example. The study measured fMRI while participants viewed short video-clips of Black and White individuals in distress. The videos were taken from the South African Truth and Reconciliation Commission. The key finding was that brain signals related to empathy showed an in-group bias. Both groups responded more to distress by members of their own group. The fact that this study is offered as an example of greater ecological validity shows how difficult it is for social neuroscience to study prejudice in realistic settings, where one individual responds to another individual and their behavior is influenced by prejudice. The authors also point to technological advances as a way to increase ecological validity. Wearable neuroimaging makes it possible to measure the brain in naturalistic settings, but it is not clear what brain signals would produce valuable information about prejudice.

My main concern is that social neuroscience research on prejudice takes away resources from other, in my opinion more important, prejudice research that focuses on actual behaviors in the real world. I am not the only one who has observed that the focus on cognition and the brain has crowded out research on actual behaviors (Baumeister, Vohs, & Funder, 2007; Cesario, 2021). If a funding agency can spend a million dollars on a grant to study the brains of undergraduate students while they look at Black and White faces or on the shooting errors of police officers in realistic simulations, I would give the money to the study of actual behavior. There is also a dearth of research on prejudice from the perspective of the victims. They know best what prejudice is and how it affects them. There needs to be more diversity in research, and White researchers should collaborate with Black researchers who can draw on personal experiences and deep cultural knowledge that White researchers lack or fail to use in their research. Finally, the incentive structure needs to change. Prejudice researchers are rewarded like all other researchers for publishing in prestigious journals that are controlled by White researchers. Even journals dedicated to social issues have this systemic bias. Prejudice research more than any other field needs to ensure equity, diversity, and inclusion at all levels. Moving social neuroscience of prejudice out of White social cognition research into a diverse and interdisciplinary field might help to ensure that these studies actually inform our understanding of prejudice. Thus, a reallocation of funding is needed to ensure that funding for prejudice research benefits African Americans and other minority groups.

Z-Curve: An even better p-curve

In 2011, it dawned on psychologists that something was wrong with their science. Daryl Bem had just published an article with nine studies that showed an incredible finding. Participants’ responses were influenced by random events that had not yet occurred. Since then, the flaws in research practices have become clear and it has been shown that they are not limited to mental time travel (Schimmack, 2020). For decades, psychologists assumed that statistically significant results reveal true effects and reported only statistically significant results (Sterling, 1959). However, selective reporting of significant results undermines the purpose of significance testing to distinguish true and false hypotheses. If only significant results are reported, most published results could be false positive results like those reported by Bem (2011).

Selective reporting of significant results also undermines the credibility of meta-analyses (Rosenthal, 1979), which explains why meta-analyses also suggest humans possess psychic abilities (Bem & Honorton, 1994). This sad state of affairs stimulated renewed interest in methods that detect selection for significance (Schimmack, 2012) and methods that correct for publication bias in meta-analyses. Here I focus on a comparison of p-curve (Simonsohn et al., 2014a, Simonsohn et al., 2014b) and z-curve (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020).

P-Curve

P-curve is a name for a family of statistical tests that have been combined into the p-curve app that researchers can use to conduct p-curve analyses, henceforth called p-curve. The latest version of p-curve is version 4.06, which was last updated on November 30, 2017 (p-curve.com).

The first part of a p-curve analysis is a p-curve plot. A p-curve plot is a histogram of all significant p-values where p-values are placed into five bins, namely p-values ranging from 0 to .01, .01 to .02, .02 to .03, .03 to .04, and .04 to .05. If the set of studies contains mostly studies with true effects that have been tested with moderate to high power, the plot shows decreasing frequencies as p-values increase (more p-values between 0 and .01 than between .04 and .05). This pattern has been called a right-skewed distribution by the p-curve authors. If the distribution is flat or reversed (more p-values between .04 and .05 than between 0 and .01), most p-values may be false positive results.
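To make the binning explicit, here is a minimal sketch in R of how a p-curve plot can be constructed from a set of p-values; the p-values are hypothetical and only serve to illustrate a right-skewed distribution:

# Sketch of a p-curve plot: a histogram of significant p-values in five bins
p <- runif(200)^3                                   # hypothetical p-values, right-skewed
p_sig <- p[p < .05]                                 # p-curve uses only significant results
bins <- cut(p_sig, breaks = c(0, .01, .02, .03, .04, .05))
barplot(table(bins), xlab = "p-value bin", ylab = "Frequency")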

The main limitation of p-curve plots is that it is difficult to evaluate ambiguous cases. To aid in the interpretation of p-curve plots, p-curve also provides statistical tests of evidential value. One test is a significance test against the null-hypothesis that all significant p-values are false positive results. If this null-hypothesis can be rejected with the traditional alpha criterion of .05, it is possible to conclude that at least some of the significant results are not false positives.
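The logic of such a test can be sketched as follows; this is only an illustration of the idea, and the p-curve app’s actual implementation differs in its details. Under the null-hypothesis that all significant results are false positives, the significant p-values are uniformly distributed between 0 and .05, so the rescaled values p/.05 are uniform between 0 and 1 and can be combined with a standard method such as Fisher’s method:

# Sketch of a test against the null that all significant p-values are false positives
p_sig <- c(.001, .004, .011, .032)              # hypothetical significant p-values
pp <- p_sig / .05                               # uniform on (0, 1) if all are false positives
chisq <- -2 * sum(log(pp))                      # Fisher's method of combining p-values
pchisq(chisq, df = 2 * length(pp), lower.tail = FALSE)   # small value = evidential value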

The main problem with significance tests is that they do not provide information about effect sizes. A right-skewed p-curve with a significant p-value may be due to weak evidence with many false positive results or strong evidence with few false positives.

To address this concern, the p-curve app also provides an estimate of statistical power. This estimate assumes that the studies in the meta-analysis are homogeneous because power is a conditional probability under the assumption that an effect is present. Thus, power does not apply to a meta-analysis of studies that contain true positive and false positive results because power is not defined for false positive results.

To illustrate the interpretation of p-curve analysis, I conducted a meta-analysis of all studies published by Leif D. Nelson, one of the co-authors of p-curve analysis. I found 119 studies with codable data and coded the most focal hypothesis for each of these studies. I then submitted the data to the online p-curve app. Figure 1 shows the output.

Visual inspection of the p-curve plot shows a right-skewed distribution with 57% of the p-values between 0 and .01 and only 6% of p-values between .04 and .05. The statistical test against the null-hypothesis that all of the significant p-values are false positives is highly significant. Thus, at least some of the p-values are likely to be true positives. Finally, the power estimate is very high, 97%, with a tight confidence interval ranging from 96% to 98%. Somewhat redundant with this information, the p-curve app also provides a significance test for the hypothesis that power is less than 33%. This test is not significant, which is not surprising given the estimated power of 97%.

The next part of a p-curve output provides more details about the significance tests, but does not add more information.

The next part provides users with an interpretation of the results.

The interpretation informs readers that this set of p-values provides evidential value. Somewhat surprisingly, this automated interpretation does not mention the power estimate to quantify the strength of evidence. The focus on p-values is problematic because p-values are influenced by the number of tests. The p-value could be lower with 100 studies with 40% power than with 10 studies with 99% power. As significance tests are redundant with confidence intervals, it is sufficient to focus on the confidence interval of the power estimate. With a 90% confidence interval ranging from 96% to 98%, we would be justified in concluding that this set of p-values provides strong support for the hypotheses tested in Nelson’s articles.

Z-Curve

Like p-curve, z-curve analyses also start with a plot of the p-values. The main difference is that p-values are converted into z-scores using the formula for the inverse normal distribution; z = qnorm(1-p/2). The second difference is that significant and non-significant p-values are plotted. The third difference is that z-curve plots have a much finer resolution than p-curve plots. Whereas p-curve bins all z-scores from 2.58 to infinity into one bin (p < .01), z-curve uses the information about the distribution of z-scores all the way up to z = 6 (p = .000000002; 1/500,000,000).
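As a minimal illustration with made-up p-values, the conversion and plotting can be done in a few lines of R:

# Convert two-sided p-values to absolute z-scores and draw a simple z-curve style histogram
p <- c(.30, .09, .04, .020, .011, .003, .0004)     # hypothetical significant and non-significant p-values
z <- qnorm(1 - p/2)                                # inverse normal transformation
hist(z, breaks = seq(0, 6, by = .2), xlim = c(0, 6),
     xlab = "z-score", main = "z-curve plot (sketch)")
abline(v = qnorm(1 - .05/2), col = "red")          # significance criterion, z = 1.96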

Visual inspection of the z-curve plot reveals something that the p-curve plot does not show, namely there is clear evidence for the presence of selection bias. Whereas p-curve suggests that “highly” significant results (0 to .01) are much more common than “just” significant results (.04 to .05), z-curve shows that just significant results (.05 to .005) are much more frequent than highly significant (p < .005) results. The difference is due to the implicit definition of high and low in the two plots. The high frequency of highly significant (p < .01) results in the p-curve plots is due to the wide range of values that are lumped together into this bin. Once it is clear that many p-values are clustered just below .05 (z > 1.96, the vertical red line), it is immediately notable that there are too few just non-significant (z < 1.96) values. This steep drop is not consistent with random sampling error. To summarize, z-curve plots provide more information than p-curve plots. Whereas z-curve plots make the presence of selection for significance visible, p-curve plots provide no means to evaluate selection bias. Even worse, right skewed distributions are often falsely interpreted as evidence that there is no selection for significance. This example shows that notable right-skewed distributions can be found even when selection bias is present.

The second part of a z-curve analysis uses a finite mixture model to estimate two statistical parameters of the data. These parameters are called the estimated discovery rate and the estimated replication rate (Bartos & Schimmack, 2021). Another term for these parameters is mean power before selection and mean power after selection for significance (Brunner & Schimmack, 2020). The meaning of these terms is best understood with a simple example where a researcher tests 100 false hypotheses and 100 true hypotheses with 100% power. The outcome of these studies is a mix of significant and non-significant p-values. The expected value for the frequency of significant p-values is 100 for the 100 true hypotheses tested with 100% power and 5 for the 100 false hypotheses, which produce 5 significant results when alpha is set to 5%. Thus, we are expecting 105 significant results and 95 non-significant results. Although we know the percentages of true and false hypotheses in this example, this information is not available with real data. Thus, any estimate of average power changes the meaning of power. It now includes false hypotheses with a power equal to alpha. We call this unconditional power to distinguish it from the typical meaning of power conditioned on a true hypothesis.

It is now possible to compute mean unconditional power for two populations of studies. One population consists of all studies that were conducted. In this example, this population consists of all 200 studies (100 true, 100 false hypotheses). The average power for these 200 studies is easy to compute as (.05*100 + 1*100)/200 = 52.5%. The second population of studies focuses only on the significant studies. After selecting only significant studies, mean unconditional power is (.05*5 + 1*100)/105 = 95.5%. The reason why power is so much higher after selection for significance is that the significance filter keeps most false hypotheses out of the population of studies with a significant result (95 of the 100 studies to be exact). Thus, power is mostly determined by the true hypotheses that were tested with perfect power. Of course, real data are not as clean as this simple example, but the same logic applies to all sets of studies with a diverse range of power values for individual studies (Brunner & Schimmack, 2020).
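These two numbers can be verified with a few lines of R for this toy example:

# Toy example: 100 false hypotheses (power = alpha = .05) and 100 true hypotheses with 100% power
power <- c(rep(.05, 100), rep(1, 100))
mean(power)                        # mean unconditional power before selection: .525
# Each study produces a significant result with probability equal to its power,
# so weighting by power gives the mean power of the significant (selected) studies.
weighted.mean(power, w = power)    # mean unconditional power after selection: about .955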

Mean power before selection for significance determines the percentage of significant results for a number of tests. With 50% mean power before selection, 100 tests are expected to produce 50 significant results (Brunner & Schimmack, 2020). It is common to refer to statistically significant results as discoveries (Soric, 1989). Importantly, discoveries could be true or false, just like a significant result could be a true effect or a type-I error. In our example, there were 105 discoveries. Normally we would not know that 100 of these discoveries are true discoveries. All we know is the percentage of significant results. I use the term estimated discovery rate (EDR) to refer to mean unconditional power before selection, which is a mouthful. In short, the EDR is an estimate of the percentage of significant results in a series of statistical tests.

Mean power after selection for significance is relevant because the power of significant results determines the probability that a significant result can be successfully replicated in a direct replication study with the same sample size (Brunner & Schimmack, 2020). Using the EDR would be misleading. In the present example, the EDR of 52.5% would dramatically underestimate the replicability of significant results, which is actually 95.5%. Using the EDR would punish researchers who conduct high-powered tests of true and false hypotheses. To assess the replicability of these researchers, it is necessary to compute power only for the studies that produced significant results. The problem with traditional meta-analyses is that selection for significance leads to inflated effect size estimates even if the researcher reported all non-significant results. To estimate the replicability of the significant results, the data are conditioned on significance, which inflates replicability estimates. Z-curve models this selection process and corrects for regression to the mean in the estimation of mean unconditional power after selection for significance. I call this statistic the estimated replication rate. The reason is that mean unconditional power after selection for significance determines the percentage of significant results that is expected in direct replication studies of studies with a significant result. In short, the ERR is the probability that a direct replication study with the same sample size produces a significant result.
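Readers who want to estimate the EDR and ERR for their own set of test statistics can use the zcurve package in R (Bartos & Schimmack, 2021). The sketch below uses simulated z-scores purely for illustration; the exact arguments and output should be checked against the package documentation:

# Sketch of a z-curve analysis with the zcurve R package
# install.packages("zcurve")
library(zcurve)
set.seed(1)
z <- abs(rnorm(300, mean = 2, sd = 1))   # hypothetical absolute z-scores of focal tests
fit <- zcurve(z)                         # fit the finite mixture model to the significant z-scores
summary(fit)                             # reports the ERR and EDR with confidence intervals
plot(fit)                                # z-curve plot with the fitted density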

I start discussion of the z-curve results for Nelson’s data with the estimated replication rate because this estimate is conceptually similar to the power estimate in the p-curve analysis. Both estimates focus on the population of studies with significant results and correct for selection for significance. Thus, one would expect similar results. However, the p-curve estimate of 97%, 95%CI = 96% to 98%, is very different from the z-curve estimate of 52%, 95%CI = 40% to 68%. The confidence intervals do not overlap, showing that the difference between these estimates is statistically significant itself.

The explanation for this discrepancy is that p-curve estimates are inflated estimates of the ERR when power is heterogeneous (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). This is true even if effect sizes are homogeneous and studies vary only in sample sizes (Brunner, 2018). The p-curve authors have been aware of this problem since 2018 (Datacolada), but have not updated the p-curve app in response to this criticism of their app. The present example shows that using the p-curve app can lead to extremely misleading conclusions. Whereas p-curve suggests that nearly every study by Nelson would produce a significant result again in a direct replication attempt, the correct z-curve estimate suggests that only every other result would replicate successfully. This difference is not only statistically significant, but also practically significant in the evaluation of Nelson’s work.

In sum, p-curve is not only redundant with z-curve. It also produces false information about the strength of evidence in a set of p-values.

Unlike p-curve, z-curve.2.0 also estimates the discovery rate based on the distribution of the significant p-values. The results are shown in Figure 2 as the grey curve in the range of non-significant results. As can be seen, while z-curve predicts a large number of non-significant results, the actual studies reported very few non-significant results. This suggests selection for significance. To quantify the amount of selection bias, it is possible to compare the observed discovery rate (i.e., the actual percentage of significant results), 87%, to the estimated discovery rate, EDR = 27%. The 95% confidence interval around the EDR can be used for a significance test. As 87% is well outside the 95%CI of the EDR, 5% to 51%, the results provide strong evidence that the reported results were selected from a larger set of tests with non-significant results that were not reported. In this specific case, this inference is consistent with the authors’ admission that questionable research practices were used (Simmons, Nelson, & Simonsohn, 2011).

“Our best guess was that so many published findings were false because researchers were conducting many analyses on the same data set and just reporting those that were statistically significant, a behavior that we later labeled “p-hacking” (Simonsohn, Nelson, & Simmons, 2014). We knew many researchers—including ourselves—who readily admitted to dropping dependent variables, conditions, or participants to achieve significance.” (Simmons, Nelson, & Simonsohn, 2018, p. 255).

The p-curve authors also popularized the idea that selection for significance may have produced many false positive results (Simmons et al., 2011). However, p-curve does not provide an estimate of the false positive risk. In contrast, z-curve provides information about the false discovery risk because the false discovery risk is a direct function of the discovery rate. Using the EDR with Soric's formula shows that the false discovery risk for Nelson's studies is 14%, but due to the small number of tests, the 95%CI around this estimate ranges from 5% to 100%. Thus, even though the ERR suggests that half of the studies can be replicated, it is possible that the other half of the studies contains a fairly large number of false positive results. Without the identification of moderator variables, it would be impossible to say whether a specific result is a true or a false discovery.
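For readers who want to check the numbers, Soric's formula gives the maximum false discovery rate as a simple function of the discovery rate and alpha. The sketch below is my own illustration; it reproduces the 14% point estimate from the EDR of 27% and the 5% to 100% range from the EDR confidence interval.

```python
def soric_fdr(discovery_rate, alpha=0.05):
    """Soric's upper bound on the false discovery risk:
    FDR_max = (1/DR - 1) * alpha / (1 - alpha)."""
    return (1 / discovery_rate - 1) * alpha / (1 - alpha)

# Point estimate: EDR of 27% with alpha = .05 (values reported above)
print(f"{soric_fdr(0.27):.0%}")   # ~14%

# The same formula applied to the bounds of the EDR confidence interval
print(f"{soric_fdr(0.51):.0%}")   # ~5%
print(f"{soric_fdr(0.05):.0%}")   # 100%
```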

The ability to estimate the false positive risk makes it possible to identify a subset of studies with a low false positive risk by lowering alpha. Lowering alpha reduces the false positive risk for two reasons. First, it follows logically that a lower alpha produces a lower false positive risk. For example, in the prior example with 100 true and 100 false hypotheses, an alpha of 5% produced 105 significant results that included 5 false positive results, and the false positive rate was 5/105 = 4.76%. Lowering alpha to 1% produces only 101 significant results, and the false positive rate is 1/101 = 0.99%. Second, questionable research practices are much more likely to produce false positive results with alpha = .05 than with alpha = .01.
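The arithmetic of the first point can be written out explicitly. This is a sketch of the idealized example, assuming the true hypotheses are still detected at both alpha levels:

```python
# Sketch of the first point, using the idealized example of 100 true and 100
# false hypotheses and assuming the true hypotheses are detected at both alpha
# levels (an idealization; real power drops somewhat when alpha is lowered).
n_true, n_false = 100, 100

for alpha in (0.05, 0.01):
    true_positives = n_true            # all true hypotheses reach significance
    false_positives = n_false * alpha  # expected false positives among false hypotheses
    total_sig = true_positives + false_positives
    fdr = false_positives / total_sig
    print(f"alpha = {alpha}: {total_sig:.0f} significant results, "
          f"false positive rate = {fdr:.2%}")
# alpha = 0.05: 105 significant results, false positive rate = 4.76%
# alpha = 0.01: 101 significant results, false positive rate = 0.99%
```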

In a z-curve analysis, alpha can be set to different values to examine the false positive risk. A reasonable criterion is to aim for a false discovery risk of 5%, which many psychologists falsely assume is the goal of setting alpha to 5%. For Nelson's 109 publications, alpha can be lowered to .01 to achieve a false discovery risk of 5%.

With alpha = .01, there are still 60 out of 119 (50%) significant results. It is therefore not necessary to dismiss all of the published results just because some results were obtained with questionable research practices.

For Nelson's studies, a plausible moderator is timing. As Nelson and colleagues reported, he used QRPs before he himself drew attention to the problems with these practices. In response, he may have changed his research practices. To test this hypothesis, it is possible to fit separate z-curve analyses to articles published before and after 2012 (due to publication lag, articles published in 2012 are likely to still contain QRPs).

Consistent with the hypothesis, the EDR for 2012 and before is only 11%, 95%CI = 5% to 31%, and the false discovery risk increases to 42%, 95%CI = 12% to 100%. Even with alpha = .01, the FDR is still 11%, and with alpha = .005 it is still 10%. With alpha = .001, it is reduced to 2% and 18 results remain significant. Thus, most of the published results lack credible evidence against the null-hypothesis.

Results look very different after 2012. The EDR is 83% and not different from the ODR, suggesting no evidence that selection for significance occurred. The high EDR implies a low false discovery risk even with the conventional alpha criterion of 5%. Thus, all 40 results with p < .05 provide credible evidence against the null-hypothesis.

To see how misleading p-curves can be, I also conducted a p-curve analysis for the studies published in the years up to 2012. The p-curve analysis shows merely that the studies have evidential value and provides a dramatically inflated estimate of power (84% vs. 35%). It does not show evidence that p-values are selected for significance and it does not provide information to distinguish p-hacked studies from studies with evidential value.

Conclusion

P-Curve was a first attempt to take the problem of selection for significance seriously and to evaluate whether a set of studies provides credible evidence against the null-hypothesis (evidential value). Here I showed that p-curve has serious limitations and provides misleading information about the strength of evidence against the null-hypothesis.

I showed that all of the information that is provided by a p-curve analysis is also provided by a z-curve analysis. Moreover, z-curve provides additional information about the presence of selection bias and the risk of false positive results. I also showed how alpha levels can be adjusted to separate significant results with weak and strong evidence and to select credible findings even when selection for significance is present.

As z-curve does everything that p-curve does and more, the rational choice is to use z-curve for the meta-analysis of p-values.

Replicability Audit of Ap Dijksterhuis

Abstract

This blog post reports a replicability audit of Ap Dijksterhuis's 48 most highly cited articles, which provide the basis for his H-Index of 48 (WebofScience, 4/23/2021). The z-curve analysis shows a lack of evidential value and a high false positive risk. Rather than dismissing all findings, it is possible to salvage 10 findings by setting alpha to .001 to maintain a false positive risk below 5%. The main article that contains evidential value was published in 2016. Based on these results, I argue that 47 of the 48 articles do not contain credible empirical information that supports the claims in these articles. These articles should not be cited as if they contain empirical evidence.

INTRODUCTION

“Trust is good, but control is better”  

Since 2011, it has become clear that social psychologists misused the scientific method. It was falsely assumed that a statistically significant result ensures that a finding is not a statistical fluke. This assumption is false for two reasons. First, even if the scientific method is used correctly, statistical significance can occur without a real effect in 5% of all tests of false hypotheses. This is a low risk if most studies test true hypotheses with high statistical power, which produces a high discovery rate. However, if many false hypotheses are tested and true hypotheses are tested with low power, the discovery rate is low and the false discovery risk is high. Unfortunately, the true discovery rate is not known because social psychologists published only significant results. This selective reporting of significant results renders statistical significance insignificant. In theory, all published results could be false positive results.

The question is what we, the consumers of social psychological research, should do with thousands of studies that provide only questionable evidence. One solution is to “burn everything to the ground” and start fresh. Another solution is to correct the mistake in the application of the scientific method. I compare this correction to the repair of the Hubble telescope (https://www.nasa.gov/content/hubbles-mirror-flaw). Only after the Hubble telescope was launched into space was it discovered that a mistake had been made in the creation of the mirror. Replacing the mirror in space was impractical. As a result, a correction was made to take the discrepancy in the data into account.

The same can be done with significance testing. To correct for the misuse of the scientific method, the criterion for statistical significance can be lowered to ensure an acceptably low risk of false positive results. One solution is to apply this correction to articles on a specific topic or to articles in a particular journal. Here, I focus on authors for two reasons. First, authors are likely to use a specific approach to research that depends on their training and their field of study. Elsewhere I demonstrated that researchers differ considerably in their research practices (Schimmack, 2021). More controversially, I also think that authors are accountable for their research practices. If they realize that they made mistakes, they could help the research community by admitting to their mistakes and retracting articles or at least expressing their loss of confidence in some of their work (Rohrer et al., 2020).

Ap Dijksterhuis

Ap Dijksterhuis is a well-known social psychologist. His main focus has been on unconscious processes. Starting in the 1990s, social psychologists became fascinated by unconscious and implicit processes. This triggered what some call an implicit revolution (Greenwald & Banaji, 1995). Dijksterhuis has been prolific and his work is highly cited, which earned him an H-Index of 48 in WebOfScience.

However, after 2011 it became apparent that many findings in this literature are difficult to replicate (Kahneman, 2012). A large replication project also failed to replicate one of Dijksterhuis’s results (O’Donnell et al., 2018). It is therefore interesting and important to examine the credibility of Dijksterhuis’s studies.

Data

I used WebofScience to identify the most cited articles by Dijksterhuis (datafile). I then coded empirical articles until the number of coded articles matched the number of citations (i.e., the 48 articles that define the H-Index of 48). The 48 articles reported 105 studies with a codable focal hypothesis test.

The total number of participants was 7,470 with a median sample size of N = 57 participants. For each focal test, I first computed the exact two-sided p-value and then converted it into a z-score (the inverse-normal transformation of 1 - p/2). Consistent with practices in social psychology, all reported studies supported predictions, even when the results were not strictly significant. The success rate for p < .05 (two-tailed) was 100/105 = 95%, which has been typical for social psychology for decades (Sterling, 1959).
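For readers unfamiliar with this conversion, the sketch below shows the standard transformation from a two-sided p-value to a z-score; the p-values in it are made-up placeholders, not values from the coded studies.

```python
from scipy.stats import norm

# Convert exact two-sided p-values into z-scores: z = Phi^-1(1 - p/2).
# The p-values below are made-up placeholders for illustration only.
p_values = [0.049, 0.021, 0.003, 0.11]

z_scores = [norm.ppf(1 - p / 2) for p in p_values]
print([round(z, 2) for z in z_scores])   # e.g., 1.97 for p = .049 (just significant)
```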

The z-scores were submitted to a z-curve analysis (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). The first part of a z-curve analysis is the z-curve plot (Figure 1).

The vertical red line at z = 1.96 represents the significance criterion with alpha = .05 (two-tailed). The figure shows that most p-values are just significant with z-scores just above 1.96. The distribution of z-scores is abnormal in the sense that random sampling error alone cannot produce the steep drop on the left side of the significance criterion. This provides visual evidence of selection for significance.

The second part of a z-curve analysis is to fit a finite mixture model to the distribution of the significant z-scores (z > 1.96). The model tries to match the distribution as closely as possible. The best fitting curve is shown with the grey/black checkered line. It is notable that the actual data decrease a bit more steeply than the grey curve; even the best-fitting model has trouble matching this steep drop. This suggests that significance was obtained with massive p-hacking, which produces an abundance of just significant results. This is confirmed with a p-curve analysis that shows more p-values between .04 and .05 than p-values between 0 and .01; 24% vs. 19%, respectively (Simonsohn et al., 2014).

The main implication of a left-skewed p-curve is that most significant results do not provide evidence against the null-hypothesis. This is confirmed by the z-curve analysis. A z-curve analysis projects the model based on significant results into the range of non-significant results. This makes it possible to estimate how many tests were conducted to produce the observed significant results (assuming a simple selection model). The results for these data suggest that the reported significant results are only 5% of all statistical tests, which is what would be expected if only false hypotheses were tested. As a result, the false positive risk is 100%. Z-curve also computes bootstrapped confidence intervals around these estimates. The upper bound for the estimated discovery rate is 12%. Thus, most of the studies had a very low chance of producing a significant result, even if they did not test a false hypothesis (low statistical power). With a low discovery rate of 12%, the risk that a significant result is a false positive result is still 39%. This is unacceptably high.

The estimated replication rate of 7% is slightly higher than the estimated discovery rate of 5%. This suggests some heterogeneity across the studies which leads to higher power for studies that produced significant results. However, even 7% replicability is very low. Thus, most studies are expected to produce a non-significant result in a replication attempt.

Based on these results, it would be reasonable to burn everything to the ground and to dismiss the claims made in these 48 articles as empirically unfounded. However, it is also possible to reduce the false positive risk by increasing the significance threshold. With alpha = .01 the FDR is 19%, with alpha = .005 it is 10%, and with alpha = .001 it is 2%. So, to keep the false positive risk below 5%, it is possible to set alpha to .001. This renders most findings non-significant, but 10 findings remain significant.

One finding is evidence that liking of one's initials has retest reliability. A more interesting finding is that 4 significant (p < .001) results were obtained in the most recent (2016) article, which also included pre-registered studies. This suggests that Dijksterhuis changed his research practices in the wake of the replicability crisis. Thus, new articles that have not yet garnered a lot of citations may be more credible, but the pre-2011 articles lack credible empirical evidence for most of the claims made in these articles.

DISCLAIMER 

It is nearly certain that I made some mistakes in the coding of Ap Dijksterhuis’s articles. However, it is important to distinguish consequential and inconsequential mistakes. I am confident that I did not make consequential errors that would alter the main conclusions of this audit. However, control is better than trust and everybody can audit this audit.  The data are openly available and the z-curve code is also openly available.  Thus, this replicability audit is fully transparent and open to revision.

Moreover, the results are broadly consistent with the z-curve results based on automated extraction of test statistics (Schimmack, 2021). Based on automated coding, Dijksterhuis has an EDR of 17%, with a rank of 312 out of 357 social psychologists. The reason for the higher EDR is that automated coding does not distinguish between focal and non-focal tests, and focal tests tend to have lower power and a higher risk of being false positives.

If you found this audit interesting, you might also be interested in other replicability audits (Replicability Audits).

How Prevalent Is Dumb P-Hacking?

Abstract

It has been proposed that psychologists used a number of p-hacking methods to produce false positive results. In this post, I examine the prevalence of two p-hacking methods, namely the use of covariates and peeking (optional stopping) until a significant result is obtained. The evidence suggests that these strategies are not very prevalent. One explanation for this is that they are less efficient than other p-hacking methods. File-drawering of small studies and inclusion of multiple dependent measures are more likely to be the main questionable practices that inflate effect sizes and success rates in psychology.

Article

P-hacking refers to statistical practices that inflate effect size estimates and increase the probability of false positive results (Simonsohn et al., 2014). There are a number of questionable practices that can be used to achieve this goal. Three major practices are (a) continuing to add participants until significance is reached, (b) adding covariates, and (c) testing multiple dependent variables.

In a previous blog post, I pointed out that two of these practices are rather dumb because they require more resources than simply running many studies with small samples and putting non-significant results in the file drawer (Schimmack, 2021). The dumbest p-hacking method is to continue data collection until significance is reached. Even with sampling until N = 200, the majority of studies remain non-significant. The predicted pattern is a continuous decline in the frequencies of final sample sizes with increasing N. The second dumb strategy is to measure additional variables and to use them as covariates. It is smarter to add additional variables as dependent variables.

Simonsohn et al. (2014) suggested that it is possible to detect the use of dumb p-hacking methods by means of p-curve plots. Repeated sampling and the use of covariates produce markedly left-skewed (monotonic decreasing) p-curves. Schimmack (2021) noted that left-skewed p-curves are actually very rare. Figure 1 shows the p-curve for the most cited articles of 71 social psychologists (k = 2,570). The p-curve is clearly right-skewed.

I then examined p-curves for individual social psychologists (k ~ 30). The worst p-curve was flat, but not left-skewed.

The most plausible explanation for this finding is that no social psychologist tested only false hypotheses. Because studies of true effects produce right skew, p-hacking cannot be detected by left-skewed p-curves when it is mixed with true effects.

I therefore examined the use of dumb p-hacking strategies in other ways. The use of covariates is easy to detect by coding whether studies used covariates or not. If researchers try multiple covariates, the chances that a result becomes significant with a covariate are higher than the chances of getting a significant result without a covariate. Thus, we should see more results with covariates, and the frequency of studies with covariates provides some information about the prevalence of covariate hacking. I also distinguished between strictly experimental studies and correlational studies because covariates are more likely to be used in correlational studies for valid reasons. Figure 3 shows that the use of covariates in experimental studies is fairly rare (8.6%). If researchers tried only one covariate, this would limit the number of studies that were p-hacked with covariates to 17.2%, but the true frequency is likely to be much lower because p-hacking with a single covariate barely increases the chances of a significant result.

To examine the use of peeking, I first plotted the histogram of sample sizes. I limited the plot to studies with N < 200 to make the distribution of small sample sizes more visible.

There is no evidence that researchers start with very small sample sizes (n = 5) and publish as soon as they get significance (simulation by Simonsohn et al., 2014). This would have produced a high frequency of studies with N = 10. The peak around N = 40 suggests that many researchers use n = 20 as a rule of thumb for the allocation of participants to cells in two-group designs. Another bump around N = 80 is explained by the same rule for 2 x 2 designs that are popular among social psychologists. N = 100 seems to be another rule of thumb. Except for these peaks, the distribution does show a decreasing trend, suggesting that peeking was used. However, there is also no evidence that researchers simply stop after n = 15 when results are not significant (Simonsohn et al., 2014, simulation).

If the decreasing trend is due to peeking, sample sizes would be uncorrelated with the strength of the evidence. Otherwise, studies with larger samples have less sampling error and stronger evidence against the null-hypothesis. To test this prediction, I regressed p-values transformed into z-scores onto sampling error (1/sqrt(N)). I included the use of covariates and the nature of the study (experimental vs. correlational) as additional predictors.
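A sketch of this regression is shown below. The column names are hypothetical placeholders for the coded variables, not the actual coding scheme.

```python
# Sketch of the regression described above, assuming a data frame with
# hypothetical columns: 'z' (p-value converted to a z-score), 'N' (total
# sample size), 'covariates' (0/1), and 'experimental' (0/1).
import numpy as np
import pandas as pd
import statsmodels.api as sm

def strength_of_evidence_regression(df: pd.DataFrame):
    X = pd.DataFrame({
        "sampling_error": 1 / np.sqrt(df["N"]),  # 1/sqrt(N)
        "covariates": df["covariates"],
        "experimental": df["experimental"],
    })
    X = sm.add_constant(X)
    return sm.OLS(df["z"], X).fit()

# model = strength_of_evidence_regression(coded_studies)
# print(model.summary())
```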

The strength of evidence increased with decreasing sampling error both without covariates, z = 3.73, and with covariates, z = 3.30. These results suggest that many studies tested a true effect because a true effect is necessary for the strength of evidence against the null-hypothesis to increase with sample size. To conclude, peeking may have been used, but not at excessive levels that would produce many low z-scores in large samples.

The last analysis was used to examine whether social psychologists have used questionable research practices. The difference between p-hacking and questionable research practices is that p-hacking excludes publication bias (not reporting entire studies). The focus on questionable research practices has the advantage that it is no longer necessary to distinguish between selective reporting of analyses or entire studies. Most researchers are likely to use both p-hacking and publication bias and both practices inflate effect sizes and lower replicability. Thus, it is not important to distinguish between p-hacking and publication bias.

The results show clear evidence that social psychologists used questionable research practices to produce an abundance of significant results. Even without counting marginally significant results, the success rate is 89%, but the actual power to produce these significant results is estimated to be just 26%. This shows that a right-skewed p-curve does not tell us how much questionable research practices contributed to significant results. A low discovery rate of 26% translates into a maximum false discovery rate of 15%. This would suggest that one reason for the lack of left-skewed p-curves is that p-hacking of true null-hypotheses is fairly rare. A bigger problem is that p-hacking of real effects in small samples produces vastly inflated effect size estimates. However, the 95% confidence interval around this estimate reaches all the way to 47%. Thus, it cannot be ruled out that a substantial number of results was obtained from true null-hypotheses by using p-hacking methods that do not produce a marked left skew, combined with publication bias.

Smart P-Hackers Have File-Drawers and Are Not Detected by Left-Skewed P-Curves

Abstract

In the early 2010s, two articles suggested that (a) p-hacking is common, (b) false positives are prevalent, and (c) left-skewed p-curves reveal p-hacking that produces false positive results (Simmons et al., 2011; Simonsohn, 2014a). However, empirical applications of p-curve have produced few left-skewed p-curves. This raises questions about the absence of left-skewed p-curves. One explanation is that some p-hacking strategies do not produce notable left skew and that these strategies may be used more often because they require fewer resources. Another explanation could be that file-drawering is much more common than p-hacking. Finally, it could be that most of the time p-hacking is used to inflate true effect sizes rather than to chase false positive results. P-curve plots do not allow researchers to distinguish these alternative hypotheses. Thus, p-curve should be replaced by more powerful tools that detect publication bias or p-hacking and estimate the amount of evidence against the null-hypothesis. Fortunately, there is an app for this (zcurve package).

Introduction

Simonsohn, Nelson, and Simmons (2014) coined the term p-hacking for a set of questionable research practices that increase the chances of obtaining a statistically significant result. In the worst case scenario, p-hacking can produce significant results without a real effect. In this case, the statistically significant result is entirely explained by p-hacking.

Simonsohn et al. (2014) make a clear distinction between p-hacking and publication bias. Publication bias is unlikely to produce a large number of false positive results because it requires 20 attempts to produce a single significant result in either direction or 40 attempts to get a significant result with a predicted direction. In contrast, “p-hacking can allow researchers to get most studies to reveal significant relationships between truly unrelated variables (Simmons et al., 2011)” (p. 535).

There have been surprisingly few investigations of the best way to p-hack studies. Some p-hacking strategies may work in simulation studies that do not impose limits on resources, but they may not be practical in real applications of p-hacking. I postulate that the main goal of p-hacking is to get significant results with minimal resources rather than with a minimum number of studies and that p-hacking is more efficient with a file drawer of studies that are abandoned.

Simmons et al. (2011) and Simonsohn et al. (2014) suggest one especially dumb p-hacking strategy, namely simply collecting more data until a significant result emerges.

“For example, consider a researcher who p-hacks by analyzing data after every five per-condition participants and ceases upon obtaining significance.” (Simonsohn et al., 2014).

This strategy is known to produce more p-values close to .04 than .01.

The main problem with this strategy is that sample sizes can get very large before a significant result emerges. I limited the maximum sample size before a researcher would give up to N = 200. A limit of N = 200 makes sense because it would allow a researcher to run 20 studies with a starting sample size of N = 10 to get a significant result. The p-curve plot shows a distribution similar to the simulation in the p-curve article.
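The following sketch reproduces this kind of optional-stopping simulation under my reading of the setup: two independent groups, no true effect, a test after every five participants per condition, stopping at two-sided p < .05 in either direction, and giving up at N = 200. With these assumptions, the simulated rate lands near the ~25% reported next.

```python
# Sketch of the optional-stopping ("peeking") simulation described above.
# Assumptions (mine): two independent groups, no true effect, analysis after
# every 5 participants per condition, stop at two-sided p < .05, give up at
# N = 200 total (n = 100 per cell).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_sim, step, max_n_per_cell = 10_000, 5, 100

successes = 0
for _ in range(n_sim):
    a = rng.normal(size=step)
    b = rng.normal(size=step)
    while True:
        if ttest_ind(a, b).pvalue < .05:   # peek after every batch
            successes += 1
            break
        if a.size >= max_n_per_cell:       # give up and file-drawer the study
            break
        a = np.concatenate([a, rng.normal(size=step)])
        b = np.concatenate([b, rng.normal(size=step)])

print(f"Proportion of 'successful' studies: {successes / n_sim:.1%}")
```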

The success rate was 25%. This means, 75% of studies with N = 200 produced a non-significant result that had to be put in the file-drawer. Figure 2 shows the distribution of sample sizes for the significant results.

The key finding is that the chances of a significant result drop drastically after the first attempt. The reason is that the most favorable random results already produce a significant result in the first test; the samples that remain non-significant tend to contain less favorable data. It would therefore be better to start a new study, because the chances of getting a significant result are higher than when adding participants after an unsuccessful attempt. In short, just adding participants to get significance is a dumb p-hacking method.

Simonsohn et al. (2014) do not disclose their stopping rule, but they do show that they got only 5.6% significant results, compared to the 25% with N = 200. This means they stopped much earlier. Simulations suggest that they stopped when N = 30 (n = 15 per cell) did not produce a significant result (1 million simulations, success rate = 5.547%). The success rates for N = 10, 20, and 30 were 2.5%, 1.8%, and 1.3%, respectively. These probabilities can be compared to a probability of 2.5% for each new study with N = 10. It is clear that running three separate studies is a more efficient strategy than adding participants until N reaches 30. Moreover, neither strategy avoids producing a file drawer. To avoid a file drawer, researchers would need to combine several questionable research practices (Simmons et al., 2011).

Simmons et al. (2011) proposed that researchers can add covariates to increase the number of statistical tests and to increase the chances of producing a significant result. Another option is to include several dependent variables. To simplify the simulation, I am assuming that dependent variables and covariates are independent of each other. Sample size has no influence on these results. To make the simulation consistent with typical results in actual studies, I used n = 20 per cell. Adding covariates or additional dependent variables requires the same amount of resources. For example, participants make additional ratings for one more item and this item is either used as a covariate or as a dependent variable. Following Simmons et al. (2011), I first simulated a scenario with 10 covariates.

The p-curve plot is similar to the repeated peeking plot and is clearly left-skewed. The success rate, however, is disappointing. Only 4.48% of results were statistically significant. This suggests that collecting data to be used as covariates is another dumb p-hacking strategy.

Adding dependent variables is much more efficient. In the simple scenario with independent DVs (11 tests in total), the probability of obtaining a significant result equals 1-(1-.025)^11 = 24.31%. A simulation with 100,000 trials produced a percentage of 24.55%. More important, the p-curve is flat.
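A quick Monte Carlo check of this calculation is sketched below; it assumes 11 independent dependent variables (my reading of the exponent in the formula), n = 20 per cell, no true effects, and a fixed predicted direction.

```python
# Sketch: multiple-DV p-hacking with 11 independent DVs, n = 20 per cell, no
# true effects. A "success" is any DV significant in the predicted direction
# (two-tailed p < .05 with the expected sign, i.e., a 2.5% chance per DV).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
n_sim, n_dvs, n_per_cell = 20_000, 11, 20

analytic = 1 - (1 - 0.025) ** n_dvs

a = rng.normal(size=(n_sim, n_per_cell, n_dvs))
b = rng.normal(size=(n_sim, n_per_cell, n_dvs))
res = ttest_ind(a, b, axis=1)                      # one t-test per study and DV
hit = np.any((res.pvalue < .05) & (res.statistic > 0), axis=1)

print(f"analytic: {analytic:.2%}, simulated: {hit.mean():.2%}")   # both ~24.3%
```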

Correlation among the dependent variables produces a slightly left-skewed distribution, but not as much as the other p-hacking methods. With a population correlation of r = .3, the percentages are 17% for p < .01 and 22% for p between .04 and .05.

These results provide three insights into p-hacking that have been overlooked. First, some p-hacking methods are more effective than others. Second, the amount of left-skewness varies across p-hacking methods. Third, efficient p-hacking produces a fairly large file-drawer of studies with non-significant results because it is inefficient to add participants to data that failed to produce a significant result.

Implications

False P-curve Citations

The p-curve authors made it fairly clear what p-curve does and what it does not do. The main point of a p-curve analysis is to examine whether a set of significant results was obtained at least partially with some true effects. That is, at least in a subset of the studies the null-hypothesis was false. The authors call this evidential value. A right-skewed p-curve suggests that a set of significant results have evidential value. This is the only valid inference that can be drawn from p-curve plots.

“We say that a set of significant findings contains evidential value when we can rule out selective reporting as the sole [italics added] explanation of those findings” (p. 535).

The emphasis on selective reporting as the sole explanation is important. A p-curve that shows evidential value can still be biased by p-hacking and publication bias, which can lead to inflated effect size estimates.

To make sure that I interpret the article correctly, I asked one of the authors on Twitter, and the reply confirmed that p-curve is not a bias test, but strictly a test of whether some real effects contributed to a right-skewed p-curve. The answer also explains why the p-curve authors did not care about testing for bias. They assume that bias is almost always present, which makes it unnecessary to test for it.

Although the authors stated the purpose of p-curve plots clearly, many meta-analysts have misunderstood the meaning of a p-curve analysis and have drawn false conclusions about right-skewed p-curves. For example, Rivers (2017) writes that a right-skewed p-curve suggests “that the WIT effect is a) likely to exist, and b) unlikely biased by extensive p-hacking.” The first inference is correct. The second one is incorrect because p-curve is not a bias detection method. A right-skewed p-curve could be a mixture of real effects and bias due to selective reporting.

Rivers also makes a misleading claim that a flat p-curve shows the lack of evidential value, whereas “a significantly left-skewed distribution indicates that the effect under consideration may be biased by p-hacking.” These statements are wrong because a flat p-curve can also be produced by p-hacking, especially when a real effect is also present.

Rivers is by no means the only one who misinterpreted p-curve results. Using the 10 most highly cited articles that applied p-curve analysis, we can see the same mistake in several articles. A tutorial for biologists claims “p-curve can, however, be used to identify p-hacking, by only considering significant findings” (Head, 2015, p. 3). Another tutorial for biologists repeats this false interpretation of p-curves. “One proposed method for identifying P-hacking is ‘P-curve’ analysis” (Parker et al., 2016, p. 714). A similar false claim is made by Polanin et al. (2016). “The p-curve is another method that attempts to uncover selective reporting, or “p-hacking,” in primary reports (Simonsohn, Nelson, Leif, & Simmons, 2014)” (p. 211). The authors of a meta-analysis of personality traits claim that they conduct p-curve analyses “to check whether this field suffers from publication bias” (Muris et al., 2017, p. 186). Another meta-analysis on coping also claims “p-curve analysis (Simonsohn, Nelson, & Simmons, 2014) allows the detection of selective reporting by researchers who “file-drawer” certain parts of their studies to reach statistical significance” (Cheng et al., 2014, p. 1594).

Shariff et al.’s (2016) article on religious priming effects provides a better explanation of p-curve, but their final conclusion is still misleading. “These results suggest that the body of studies reflects a true effect of religious priming, and not an artifact of publication bias and p-hacking.” (p. 38). The first part is correct, but the second part is misleading. The correct claim would be “not solely the result of publication bias and p-hacking”, but it is possible that publication bias and p-hacking inflate effect size estimates in this literature. The skew of p-curves simply does not tell us about this. The same mistake is made by Weingarten et al. (2016). “When we included all studies (published or unpublished) with clear hypotheses for behavioral measures (as outlined in our p-curve disclosure table), we found no evidence of p-hacking (no left-skew), but dual evidence of a right-skew and flatter than 33% power.” (p. 482). While a left-skewed p-curve does reveal p-hacking, the absence of left-skew does not ensure that p-hacking was absent. The same mistake is made by Steffens et al. (2017), who interpret a right-skewed p-curve as evidence “that the set of studies contains evidential value and that there is no evidence of p-hacking or ambitious p-hacking” (p. 303).

Although some articles correctly limit the interpretation of the p-curve to the claim that the data contain evidential value (Combs et al., 2015; Rand, 2016; Siks et al., 2018), the majority of applied p-curve articles falsely assume that p-curve can reveal the presence or absence of p-hacking or publication bias. This is incorrect. A left-skewed p-curve does provide evidence of p-hacking, but the absence of left-skew does not imply that p-hacking is absent.

How prevalent are left-skewed p-curves?

After 2011, psychologists were worried that many published results might be false positive results that were obtained with p-hacking (Simmons et al., 2011). As p-hacking in the absence of a real effect does produce left-skewed p-curves, one might expect that a large percentage of p-curve analyses revealed left-skewed distributions. However, empirical examples of left-skewed p-curves are extremely rare. Take power-posing as an example. It is widely assumed these days that the original evidence for power-posing was obtained with p-hacking and that the real effect size of power-posing is negligible. Thus, power-posing would be expected to show a left-skewed p-curve.

Simmons and Simonsohn (2017) conducted a p-curve analysis of the power-posing literature. They did not observe a left-skewed p-curve. Instead, the p-curve was flat, which justifies the conclusion that the studies contain no evidential value (i.e., we cannot reject the null-hypothesis that all studies tested a true null-hypothesis). The interpretation of this finding is misleading.

“In this Commentary, we rely on p-curve analysis to answer the following question: Does the literature reviewed by Carney et al. (2015) suggest the existence of an effect once one accounts for selective reporting? We conclude that it does not. The distribution of p values from those 33 studies is indistinguishable from what would be expected if (a) the average effect size were zero and (b) selective reporting (of studies or analyses) were solely responsible for the significant effects that were published”

The interpretation focuses only on selective reporting (or testing of independent DVs) as a possible explanation for the lack of evidential value. However, the authors usually emphasize p-hacking as the most likely explanation for significant results without evidential value. Ignoring p-hacking is deceptive because a flat p-curve can occur as a combination of p-hacking and a real effect, as the authors showed themselves (Simonsohn et al., 2014).

Another problem is that significance testing is also one-sided. A right-skewed p-curve can be used to reject the null-hypothesis that all studies are false positives, but the absence of significant right skew cannot be used to infer the lack of evidential value. Thus, p-curve cannot be used to establish that there is no evidential value in a set of studies.

There are two explanations for the surprising lack of left-skewed p-curves in actual studies. First, p-hacking may be much less prevalent than is commonly assumed and the bigger problem is publication bias which does not produce a left-skewed distribution. Alternatively, false positive results are much rarer than has been assumed in the wake of the replication crisis. The main reason for replication failures could be that published studies report inflated effect sizes and that replication studies with unbiased effect size estimates are underpowered and produce false negative results.

How useful are Right-skewed p-curves?

In theory, left skew is diagnostic of p-hacking, but in practice left skew is rarely observed. This leaves right skew as the only diagnostic information of p-curve plots. Right skew can be used to reject the null-hypothesis that all of the significant results tested a true null-hypothesis. The problem with this information is shared by all significance tests: it does not provide evidence about the effect size. In this case, it does not provide evidence about the percentage of significant results that are false positives (the false positive risk), nor does it quantify the strength of evidence.

This problem has been addressed by other methods that quantify how strong the evidence against the null-hypothesis is. Confusingly, the p-curve authors also used the term p-curve for a method that estimates the strength of evidence in terms of the unconditional power of the set of studies (Simonsohn et al., 2014b). The problem with these power estimates is that they are biased when studies are heterogeneous (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). Simulation studies show that z-curve is a superior method to quantify the strength of evidence against the null-hypothesis. In addition, z-curve.2.0 provides information about the false positive risk; that is, the maximum percentage of significant results that may be false positives.

In conclusion, p-curve plots no longer produce meaningful information. Left-skew can be detected in z-curves plots as well as in p-curve plots and is extremely rare. Right skew is diagnostic of evidential value, but does not quantify the strength of evidence. Finally, p-curve plots are not diagnostic when data contain evidential value and bias due to p-hacking or publication bias.

Hey Social Psychologists: Don’t Mess with Suicide!

Ten years ago, the foundations of psychological science were shaken by the realization that the standard scientific method of psychological science is faulty. Since then it has become apparent that many classic findings are not replicable and many widely used measures are invalid, especially in social psychology (Schimmack, 2020).

However, it is not uncommon to read articles in 2021 that ignore the low credibility of published results. There are too many of these pseudo-scientific articles, but some articles matter more than others, at least to me. I do care about suicide, and like many people my age, I know people who have committed suicide. I was therefore concerned when I saw a review article that examines suicide from a dual-process perspective.

Automatic and controlled antecedents of suicidal ideation and action: A dual-process conceptualization of suicidality.

My main concern about this article is that dual-process models in social cognition are based on implicit priming studies with low replicability and implicit measures with low validity (Schimmack, 2021a, 2021b). It is therefore unclear how dual-process models can help us to understand and prevent suicides.

After reading the article, it is clear that the authors make many false statements and present questionable studies that have never been replicated as if they produce a solid body of empirical evidence.

Introduction of the Article

The introduction cites outdated studies that have either not been replicated or produced replication failures.

“Our position is that even these integrative models omit a fundamental and well-established dynamic of the human mind: that complex human behavior is the result of an interplay between relatively automatic and relatively controlled modes of thought (e.g., Sherman et al., 2014). From basic processes of impression formation (e.g., Fiske et al., 1999) to romantic relationships (e.g., McNulty & Olson, 2015) and intergroup relations (e.g., Devine, 1989), dual-process frameworks that incorporate automatic and controlled cognition have provided a more complete understanding of a broad array of social phenomena.”

This is simply not true. For example, there is no evidence that we implicitly love our partners when we consciously hate them or vice versa, and there is no evidence that prejudice occurs outside of awareness.

Automatic cognitions can be characterized as unintentional (i.e., inescapably activated), uncontrollable (i.e., difficult to stop), efficient in operation (i.e., requiring few cognitive resources), and/or unconscious (Bargh, 1994) and are typically captured with implicit measures.

This statement ignores many articles that have criticized the assumption that implicit measures measure implicit constructs. Even the proponent of the most widely used implicit measure have walked back this assumption (Greenwald & Banaji, 2017).

The authors then make the claim that implicit measures of suicide have incremental predictive validity of suicidal behavior.

For example, automatic associations between the self and death predict suicidal ideation and action beyond traditional explicit (i.e., verbal) responses (Glenn et al., 2017).

This claim has been made repeatedly by proponents of implicit measures, so I meta-analyzed the small set of studies that tested this prediction (Schimmack, 2021). Some of these studies produced non-significant results and the literature showed evidence that questionable research practices were used to produce significant results. Overall, the evidence is inconclusive. It is therefore incorrect to point to a single study as if there is clear evidence that implicit measures of suicidality are valid.

Further statements are also based on outdated research and a single reference.

“Research on threat has consistently shown that people preferentially process dangers to physical harm by prioritizing attention, response, and recall regarding threats (e.g., Öhman & Mineka, 2001).”

There have been many proposals about which stimuli attract attention, and threatening stimuli are by no means the only attention-grabbing stimuli. Sexual stimuli also attract attention, and, in general, arousal rather than valence or threat is a better predictor of attention (Schimmack, 2005).

It is also not clear how threatening stimuli are relevant for suicide, which is related to depression rather than to anxiety disorders.

The introduction of implicit measures totally disregards the controversy about the validity of implicit measures or the fact that different implicit measures of the same construct show low convergent validity.

Much has been written about implicit measures (for reviews, see De Houwer et al., 2009; Fazio & Olson, 2003; March et al., 2020; Nosek et al., 2011; Olson & Fazio, 2009), but for the present purposes, it is important to note the consensus that implicit measures index the automatic properties of attitudes.

More relevant are claims that implicit measures have been successfully used to understand a variety of clinical topics.

The application of a dual-process framework has consequently improved explanation and prediction in a number of areas involving mental health, including addiction (Wiers & Stacy, 2006), anxiety (Teachman et al., 2012), and sexual assault (Widman & Olson, 2013). Much of this work incorporates advances in implicit measurement in clinical domains (Roefs et al., 2011).

The authors then make the common mistake of conflating self-deception and other-deception. The notion of implicit motives that can influence behavior without awareness implies self-deception. An alternative rationale for the use of implicit measures is that they are better measures of consciously accessible thoughts and feelings that individuals are hiding from others. Here we do not need to assume a dual-process model. We simply have to assume that self-report measures are easy to fake, whereas implicit measures can reveal the truth because they are difficult to fake. Thus, even incremental predictive validity does not automatically support a dual-process model of suicide. However, this question is only relevant if implicit measures of suicidality show incremental predictive validity, which has not been demonstrated.

Consistent with the idea that such automatic evaluative associations can predict suicidality later, automatic spouse-negative associations predicted increases in suicidal ideation over time across all three studies, even after accounting for their controlled counterparts (McNulty et al., 2019).

Conclusion Section

In the conclusion section, the authors repeat their false claim that implicit measures of suicidality reflect valid variance in implicit suicidality and that they are superior to explicit measures.

“As evidence of their impact on suicidality has accumulated, so has the need for incorporating automatic processes into integrative models that address questions surrounding how and under what circumstances automatic processes impact suicidality, as well as how automatic and controlled processes interact in determining suicide-relevant outcomes.”

Implicit measures are better-suited to assess constructs that are more affective (Kendrick & Olson, 2012), spontaneous (e.g., Phillips & Olson, 2014), and uncontrollable (e.g., Klauer & Teige-Mocigemba, 2007).

As recent work has shown (e.g., Creemers et al., 2012; Franck, De Raedt, Dereu, et al., 2007; Franklin et al., 2016; Glashouwer et al., 2010; Glenn et al., 2017; Hussey et al., 2016; McNulty et al., 2019; Nock et al., 2010; Tucker, Wingate, et al., 2018), the psychology of suicidality requires formal consideration of automatic processes, their proper measurement, and how they relate to one another and corresponding controlled processes.

We have articulated a number of hypotheses, several already with empirical support, regarding interactions between automatic and controlled processes in predicting suicidal ideation and lethal acts, as well as their combination into an integrated model.

Then they finally mention the measurement problems of implicit measures.

Research utilizing the model should be mindful of specific challenges. First, although the model answers calls to diversify measurement in suicidality research by incorporating implicit measures, such measures are not without their own problems. Reaction time measures often have problematically low reliabilities, and some include confounds (e.g., Olson et al., 2009). Further, implicit and explicit measures can differ in a number of ways, and structural differences between them can artificially deflate their correspondence (Payne et al., 2008). Researchers should be aware of the strengths and weaknesses of implicit measures.

Evaluation of the Evidence

Here I provide a brief summary of the actual results of studies cited in the review article so that readers can make up their own mind about the relevance and credibility of the evidence.

Creemers, D. H., Scholte, R. H., Engels, R. C., Prinstein, M. J., & Wiers, R. W. (2012). Implicit and explicit self-esteem as concurrent predictors of suicidal ideation, depressive symptoms, and loneliness. Journal of Behavior Therapy and Experimental Psychiatry, 43(1), 638–646

Participants: 95 undergraduate students
Implicit Construct / Measure: Implicit self-esteem / Name Letter Task
Dependent Variables: depression, loneliness, suicidal ideation
Results: No significant direct relationship. Interaction between explicit and implicit self-esteem for suicidal ideation only, b = .28.

Franck, E., De Raedt, R., & De Houwer, J. (2007). Implicit but not explicit self-esteem predicts future depressive symptomatology. Behaviour Research and Therapy, 45(10), 2448–2455.

Participants: 28 clinically depressed patients; 67 not-depressed participants.
Implicit Construct / Measure: Implicit self-esteem / Name Letter Task
Dependent Variable: change in depression controlling for T1
Result: However, after controlling for initial symptoms of depression, implicit, t(48) = 2.21, p = .03, b = .25, but not explicit self-esteem, t(48) = 1.26, p = .22, b = .17, proved to be a significant predictor for depressive symptomatology at 6 months follow-up.

Franck, E., De Raedt, R., Dereu, M., & Van den Abbeele, D. (2007). Implicit and explicit self-esteem in currently depressed individuals with and without suicidal ideation. Journal of Behavior Therapy and Experimental Psychiatry, 38(1), 75–85.

Participants: Depressed patients with suicidal ideation (N = 15), depressed patients without suicidal ideation (N = 14) and controls (N = 15)
Implicit Construct / Measure: Implicit self-esteem / IAT
Dependent variable. Group status
Contrast analysis revealed that the currently depressed individuals with suicidal ideation showed a significantly higher implicit self-esteem as compared to the currently depressed individuals without suicidal ideation, t(43) = 3.0, p < 0.01. Furthermore, the non-depressed controls showed a significantly higher implicit self-esteem as compared to the currently depressed individuals without suicidal ideation, t(43) = 3.7, p < 0.001.
[this finding implies that suicidal depressed patients have HIGHER implicit self-esteem than depressed patients who are not suicidal].

Glashouwer, K. A., de Jong, P. J., Penninx, B. W., Kerkhof, A. J., van Dyck, R., & Ormel, J. (2010). Do automatic self-associations relate to suicidal ideation? Journal of Psychopathology and Behavioral Assessment, 32(3), 428–437.

Participants: General population (N = 2,837)
Implicit Constructs / Measure: Implicit depression, Implicit Anxiety / IAT
Dependent variable: Suicidal Ideation, Suicide Attempt
Results: simple correlations
Depression IAT – Suicidal Ideation, r = .22
Depression IAT – Suicide Attempt, r = .12
Anxiety IAT – Suicide Ideation, r = .18
Anxiety IAT – Suicide Attempt, r = .11
Controlling for Explicit Measures of Depression / Anxiety
Depression IAT – Suicidal Ideation, b = .024, p = .179
Depression IAT – Suicide Attempt, b = .037, p = .061
Anxiety IAT – Suicide Ideation, b = .024, p = .178
Anxiety IAT – Suicide Attempt, b = .039, p = .046

Glenn, J. J., Werntz, A. J., Slama, S. J., Steinman, S. A., Teachman, B. A., & Nock, M. K. (2017). Suicide and self-injury-related implicit cognition: A large-scale examination and replication. Journal of Abnormal Psychology, 126(2), 199–211.

Participants: Self-selected online sample with high rates of self-harm (> 50%). Ns = 3,115, 3,114
Implicit Constructs / Measure: Self-Harm, Death, Suicide / IAT
Dependent variables: Group differences (non-suicidal self-injury / control; suicide attempt / control)
Results:
Non-suicidal self-injury versus control
Self-injury IAT, d = .81/.97; Death IAT d = .52/.61, Suicide IAT d = .58/.72
Suicide Attempt versus control
Self-injury IAT, d = .52/.54; Death IAT d = .37/.32, Suicide IAT d = .54/.67
[these results show that self-ratings and IAT scores reflect a common construct; they do not show discriminant validity; no evidence that they measure distinct constructs and they do not show incremental predictive validity]

Hussey, I., Barnes-Holmes, D., & Booth, R. (2016). Individuals with current suicidal ideation demonstrate implicit “fearlessness of death.” Journal of Behavior Therapy and Experimental Psychiatry, 51, 1–9.

Participants: 23 patients with suicidal ideation and 25 controls (university students)
Implicit Constructs / Measure: Death attitudes (general / personal) / IRAP
Dependent variable: Group difference
Results: No main effects were found for either group (p = .08). Critically, however, a three-way interaction effect was found between group, IRAP type, and trial-type, F(3, 37) = 3.88, p = .01. Specifically, the suicidal ideation group produced a moderate “my death-not-negative” bias (M = .29, SD = .41), whereas the normative group produced a weak “my death-negative” bias (M = -.12, SD = .38, p < .01). This differential performance was of a very large effect size (Hedges’ g = 1.02).
[This study suggests that evaluations of personal death show stronger relationships than generic death]

McNulty, J. K., Olson, M. A., & Joiner, T. E. (2019). Implicit interpersonal evaluations as a risk factor for suicidality: Automatic spousal attitudes predict changes in the probability of suicidal thoughts. Journal of Personality and Social Psychology, 117(5), 978–997

Participants. Integrative analysis of 399 couples from 3 longitudinal studies of marriage.
Implicit Construct / Measure: Partner attitudes / evaluative priming task
Dependent variable: Change in suicidal thoughts (yes/no) over time
Result: (preferred scoring method)
without covariates, b = -.69, se = .27, p = .010.
with covariate, b = -.64, se = .29, p = .027

Nock, M. K., Park, J. M., Finn, C. T., Deliberto, T. L., Dour, H. J., & Banaji, M. R. (2010). Measuring the suicidal mind: Implicit cognition predicts suicidal behavior. Psychological Science, 21(4), 511–517.

Participants. 157 patients with mental health problems
Implicit Construct / Measure: death attitudes / IAT
Dependent variable: Prospective Prediction of Suicide
Result: controlling for prior attempts / no explicit covariates
b = 1.85, SE = 0.94, z = 2.03, p = .042

Tucker, R. P., Wingate, L. R., Burkley, M., & Wells, T. T. (2018). Implicit Association with Suicide as Measured by the Suicide Affect Misattribution Procedure (S-AMP) predicts suicide ideation. Suicide and Life-Threatening Behavior, 48(6), 720–731.

Participants. 138 students oversampled for suicidal ideation
Implicit Construct / Measure: suicide attitudes / AMP
Dependent variable: Suicidal Ideation
Result: simple correlation, r = .24
regression controlling for depression, b = .09, se = .04, p = .028

Taken together, the references show a mix of constructs, measures, and outcomes, and p-values cluster just below .05. Not one of these p-values is below .005. Moreover, many studies relied on small convenience samples. The most informative study is the one by Glashouwer et al., which examined incremental predictive validity of a depression IAT in a large, population-wide sample. The result was not significant and the effect size was less than r = .1. Thus, the references do not provide compelling evidence for dual-attitude models of depression.

Conclusion

Social psychologists have abused the scientific method for decades. Over the past decade, criticism of their practices has become louder, but many social psychologists ignore this criticism and continue to abuse significance testing and to misrepresent the results as if they provide empirical evidence that can inform our understanding of human behavior. This article is just another example of the unwillingness of social psychologists to “clean up their act” (Kahneman, 2012). Readers of this article should be warned that the claims made in it are not scientific. Fortunately, there is credible research on depression and suicide outside of social psychology.