Background: A previous blog post shared a conversation between Bill von Hippel and Ulrich Schimmack about Bill’s Replicability Index (Part 1). To recapitulate, I had posted statistical replicability estimates for several hundred social psychologists (Personalized P-Values). Bill’s scores suggested that many of his results with p-values just below .05 might not be replicable. Bill was dismayed by his low R-Index but thought that some of his papers with very low values might be more replicable than the R-Index would indicate. He suggested that we put the R-Index results to an empirical test. He chose his paper with the weakest statistical evidence (interaction p = .07) for a replication study. We jointly agreed on the replication design and sample size. In just three weeks the study was approved, conducted, and the results were analyzed. Here we discuss the results.
… Three Weeks Later
Bill: Thanks to rapid turnaround at our university IRB, the convenience of modern data collection, and the programming skills of Sam Pearson, we have now completed our replication study on Prolific. We posted the study for 2,000 participants, and 2,031 people signed up. For readers who are interested in a deeper dive, the data file is available at https://osf.io/cu68f/ and the pre-registration at https://osf.io/7ejts.
To cut to the chase, this one is a clear win for Uli’s R-Index. We successfully replicated the standard effect documented in the prior literature (see Figure A), but there was not even a hint of our predicted moderation of that effect, which was the key goal of this replication exercise (see Figure B: Interaction F(1,1167)=.97, p=.325, and the nonsignificant mean differences don’t match predictions). Although I would have obviously preferred to replicate our prior work, given that we failed to do so, I’m pleased that there’s no hint of the effect so I don’t continue to think that maybe it’s hiding in there somewhere. For readers who have an interest in the problem itself, let me devote a few paragraphs to what we did and what we found. For those who are not interested in Darwinian Grandparenting, please skip ahead to Uli’s response.
Previous work has established that people tend to feel closest to their mother’s mother, then their mother’s father, then their father’s mother, and last their father’s father. We replicated this finding in our prior paper and replicated it again here as well. The evolutionary idea underlying the effect is that our mother’s mother knows with certainty that she’s related to us, so she puts greater effort into our care than other grandparents (who do not share her certainty), and hence we feel closest to her. Our mother’s father and father’s mother both have one uncertain link (due to the possibility of cuckoldry), and hence put less effort into our care than our mother’s mother, so we feel a little less close to them. Last on the list is our father’s father, who has two uncertain links to us, and hence we feel least close to him.
The puzzle that motivated our previous work lies in the difference between our mother’s father and father’s mother; although both have one uncertain link, most studies show that people feel closer to their mother’s father than their father’s mother. The explanation we had offered for this effect was based on the idea that our father’s mother often has daughters who often have children, providing her with a more certain outlet for her efforts and affections. According to this possibility, we should only feel closer to our mother’s father than our father’s mother when the latter has grandchildren via daughters, and that is what our prior paper had documented (in the form of a marginally significant interaction and predicted simple effects).
Our clear failure to replicate that finding suggests an alternative explanation for the data in Figure A:
- People are closer to their maternal grandparents than their paternal grandparents (possibly for the reasons of genetic certainty outlined above).
- People are closer to their grandmothers than their grandfathers (possibly because women tend to be more nurturant than men and more involved in childcare).
- As a result of these two main effects, people tend to be closer to their mother’s father than their father’s mother, and this particular difference emerges regardless of the presence or absence of other, more certain kin.
Does our failure to replicate mean that the presence or absence of more certain kin has no impact on grandparenting? Clearly not in the manner I expected, but that doesn’t mean it has no effect. Consider the following (purely exploratory, non-preregistered) analyses of these same data: After failing to find the predicted interaction above, I ran a series of regression analyses, in which closeness to maternal and paternal grandparents were the dependent variables and number of cousins via fathers’ and mothers’ brothers and sisters were the predictor variables. The results are the same whether we’re looking at grandmothers or grandfathers, so for the sake of simplicity, I’ve collapsed the data into closeness to paternal grandparents and closeness to maternal grandparents. Here are the regression tables:
We see three very small but significant findings here (all of which require replication before we have any confidence in them). First, people feel closer to their paternal grandparents to the degree that those grandparents are not also maternal grandparents to someone else (i.e., more cousins through fathers’ sisters are associated with less closeness to paternal grandparents). Second, people feel closer to their paternal grandparents to the degree that their maternal grandparents have more grandchildren through daughters other than their mother (i.e., more cousins through mothers’ sisters are associated with more closeness to paternal grandparents). Third, people feel closer to their maternal grandparents to the degree that those grandparents are not also maternal grandparents to someone else (i.e., more cousins through mothers’ sisters are associated with less closeness to maternal grandparents). Note that none of these effects emerged via cousins through father’s or mother’s brothers. These findings strike me as worthy of follow-up, as they suggest that the presence or absence of equally or more certain kin does indeed have a (very small) influence on grandparents in a manner that evolutionary theory would predict (even if I didn’t predict it myself).
Uli: Wow, I am impressed how quickly research with large samples can be done these days. That is good news for the future of social psychology, at least the studies that are relatively easy to do.
Bill: Agreed! But benefits rarely come without cost, and studies on the web are no exception. In this case, the ease of working on the web also distorts our field by pushing us to do the kind of work that is ‘web-able’ (e.g., self-report) or by getting us to wangle the methods to make them work on the web. Be that as it may, this study was a no-brainer, as it was my lowest R-Index and pure self-report. Unfortunately, my other papers with really low R-Indices aren’t as easy to go back and retest (although I’m now highly motivated to try).
Uli: Of course, I am happy that R-Index made the correct prediction, but N = 1 is not that informative.
Bill: Consider this N+1, as it adds to your prior record.
Uli: Fortunately, R-Index does make good, although by no means perfect, predictions in general: https://replicationindex.com/2021/05/16/pmvsrindex/.
Bill: Very interesting.
Uli: Maybe you set yourself up for failure by picking a marginally significant result.
Bill: That was exactly my goal. I still believed in the finding, so it was a great chance to pit your method against my priors. Not much point in starting with one of my results that we both agree is likely to replicate.
Uli: The R-Index analysis implied that we should only trust your results with p < .001.
Bill: That seems overly conservative to me, but of course I’m a biased judge of my own work. Out of curiosity, is that p value better when you analyze all my critical stats rather than just one per experiment? This strikes me as potentially important, because almost none of my papers would have been accepted based on just a single statistic; rather, they typically depend on a pattern of findings (an issue I mentioned briefly in our blog).
Uli: The rankings are based on automatic extraction of test statistics. Selecting focal tests would only lead to an even more conservative alpha criterion. To evaluate the alpha = .001 criterion, it is not fair to use a single p = .07 result. Looking at the original article about grandparent relationships, I see p < .001 for mother’s mother vs. mother’s father relationships. The other contrasts are just significant and do not look credible according to R-Index (predicting failure for same N). However, they are clearly significant in the replication study. So, R-Index made two correct predictions (one failure and one success), and two wrong predictions. Let’s call it a tie. 🙂
Bill: Kind of you, but still a big win for the R-Index. It’s important to keep in mind that many prior papers had found the other contrasts, whereas we were the first to propose and find the specific moderation highlighted in our paper. So a reasonable prior would set the probability much higher to replicate the other effects, even if we accept that many prior findings were produced in an era of looser research standards. And that, in turn, raises the question of whether it’s possible to integrate your R-Index with some sort of Bayesian prior to see if it improves predictive ability.
Your prediction markets v. R-Index blog makes the very good point that simple is better and the R-Index works awfully well without the work involved in human predictions. But when I reflect on how I make such predictions (I happened to be a participant in one of the early prediction market studies and did very well), I’m essentially asking whether the result in question is a major departure from prior findings or an incremental advance that follows from theory. When the former, I say it won’t replicate without very strong statistical evidence. When the latter, I say it will replicate. Would it be possible to capture that sort of Bayesian processing via machine learning and then use it to supplement the R-Index?
Uli: There is an article that tried to do this. Performance was similar to prediction markets. However, I think it is more interesting to examine the actual predictors that may contribute to the prediction of replication outcomes. For example, we know cognitive psychology and within-subject designs are more replicable than social psychology and between-subject designs. I don’t think, however, we will get very far based on single questionable studies. Bias-corrected meta-analysis may be the only way to salvage robust findings from the era of p-hacking.
To broaden the perspective from this single article to your other articles, one problem with the personalized p-values is that they are aggregated across time. This may lead to overly conservative alpha levels (p < .001) for new research that was conducted in accordance with new rules about transparency, while the rules may be too liberal for older studies that were conducted in a time when awareness about the problems of selection for significance was lacking (say, before 2013). Inspired by the “loss of confidence project” (Rohrer et al., 2021), I want to give authors the opportunity to exclude articles from their R-Index analysis that they no longer consider credible themselves. To keep track of these loss-of-confidence declarations, I am proposing to use PubPeer (https://pubpeer.com/). Once an author posts a note on PubPeer that declares loss of confidence in the empirical results of an article, the article will be excluded from the R-Index analysis. Thus, authors can improve their standing in the rankings and, more importantly, change the alpha level to a more liberal level (e.g., from .005 to .01) by (a) publicly declaring loss of confidence in a finding and (b) publishing new research with studies that have more power and honestly report non-significant results.
I hope that the incentive to move up in the rankings will increase the low rate of loss of confidence declarations and help us to clean up the published record faster. Declarations could also be partial. For example, for the 2005 article, you could post a note on PubPeer that the ordering of the grandparent relationships was successfully replicated and the results for cousins were not with a link to the data and hopefully eventually a publication. I would then remove this article from the R-Index analysis. What do you think about this idea?
Bill: I think this is a very promising initiative! The problem, as I see it, is that authors are typically the last ones to lose confidence in their own work. When I read through the recent ‘loss of confidence’ reports, I was pretty underwhelmed by the collection. Not that there was anything wrong with the papers in there, but rather that only a few of them surprised me.
Take my own case as an example. I obviously knew it was possible my result wouldn’t replicate, but I was very willing to believe what turned out to be a chance fluctuation in the data because it was consistent with my hypothesis. Because I found that hypothesis-consistent chance fluctuation on my first try, I would never have stated I have low confidence in it if you hadn’t highlighted it as highly improbable. In other words, there’s no chance I’d have put that paper on a ‘loss of confidence’ list without your R-Index telling me it was crap and even then it took a failure to replicate for me to realize you were right.
Thus, I would guess that uptake into the ‘loss of confidence’ list would be low if it emphasizes work that people feel was sloppy in the first place, not because people are liars, but because people are motivated reasoners.
With that said, if the collection also emphasizes work that people have subsequently failed to replicate, and hence have lost confidence in it, I think it would be used much more frequently and could become a really valuable corrective. When I look at the Darwinian Grandparenting paper, I see that it’s been cited over 150 times on google scholar. I don’t know how many of those papers are citing it for the key moderation effect that we now know doesn’t replicate, but I hope that no one else will cite it for that reason after we publish this blog. No one wants other investigators to waste time following up their work once they realize the results aren’t reliable.
Uli: (feeling a bit blue today). I am not very optimistic that authors will take note of replication failures. Most studies are not conducted after a careful review of the existing literature or a meta-analysis that takes publication bias into account. As a result, citations in articles are often picked because they help to support a finding in an article. While p-hacking of data may have decreased over the past decade in some areas, cherry-picking of references is still common and widespread. I am not really sure how we can speed up the self-correction of science. My main hope is that meta-analyses are going to improve and take publication bias more seriously. Fortunately, new methods show promising results in debiasing effect size estimates (Bartoš, Maier, Wagenmakers, Doucouliagos, & Stanley, 2021). Z-curve is also being used by meta-analysts, and we are hopeful that z-curve 2.0 will soon be accepted for publication in Meta-Psychology (Bartos & Schimmack, 2021). Unfortunately, it will take another decade for these methods to become mainstream, and meanwhile many resources will be wasted on half-baked ideas that are grounded in a p-hacked literature. I am not optimistic that psychology will become a rigorous science during my lifetime. So, I am trying to make the best of it. Fortunately, I can just do something else when things are too depressing, like sitting in my backyard and watching Germany win at the Euro cup. Life is good; psychological science, not so much.
Bill: I don’t blame you for your pessimism, but I completely disagree. You see a science that remains flawed when we ought to know better, but I see a science that has improved dramatically in the 35 years since I began working in this field. Humans are wildly imperfect actors who did not evolve to be dispassionate interpreters of data. We hope that training people to become scientists will debias them – although the data suggest that it doesn’t – and then we double down by incentivizing scientists to publish results that are as exciting as possible as rapidly as possible.
Thankfully, bias is both the problem and the solution, as other scientists are biased in favor of their theories rather than ours, and out of this messy process the truth eventually emerges. The social sciences are a dicier proposition in this regard, as our ideologies intersect with our findings in ways that are less common in the physical and life sciences. But so long as at least some social scientists feel free to go wherever the data lead them, I think our science will continue to self-correct, even if the process often seems painfully slow.
Uli: Your response to my post is a sign that progress is possible, but 1 out of 400 may just be the exception to the rule that researchers never question their own results. Even researchers who know better become promoters of their own theories, especially when they become popular. I think the only way to curb false enthusiasm is to leave the evaluation of theories (review articles, meta-analyses) to independent scientists. The idea that one scientist can develop and evaluate a theory objectively is simply naive. Leaders of a paradigm are like strikers in soccer: they need to have blinders on to risk failure. We need meta-psychologists to distinguish real contributions from false ones. In this way, meta-psychologists are like referees. Referees are not glorious heroes, but they are needed for a good soccer game, and they have the power to call off a goal because a player was offside or used their hands. The problem for science is the illusion that scientists can control themselves.
The past decade has revealed many flaws in the way psychologists conduct empirical tests of theories. The key problem is that psychologists lacked an accepted strategy to conclude that a prediction was not supported. This fundamental flaw can be traced back to Fisher’s introduction of significance testing. In Fisher’s framework, the null-hypothesis is typically specified as the absence of an effect in either direction; that is, the effect size is exactly zero. Significance testing examines how much empirical results deviate from this prediction. If the probability of the result, or of even more extreme deviations, is less than 5%, the null-hypothesis is rejected. However, if the p-value is greater than .05, no inference can be drawn, because there are two explanations for this outcome: either the null-hypothesis is true, or it is false and the result is a false negative. The probability of such false negatives is unspecified in Fisher’s framework. This asymmetrical approach to significance testing continues to dominate psychological science.
Criticism of this one-sided approach to significance testing is nearly as old as nil-hypothesis significance testing itself (Greenwald, 1975; Sterling, 1959). Greenwald’s (1975) article is notable because it provided a careful analysis of the problem and it pointed towards a solution to this problem that is rooted in Neyman-Pearson’s alternative to Fisher’s significance testing. Greenwald (1975) showed how it is possible to “Accept the Null-Hypothesis Gracefully” (p. 16).
“Use a range, rather than a point, null hypothesis. The procedural recommendations to follow are much easier to apply if the researcher has decided, in advance of data collection, just what magnitude of effect on a dependent measure or measure of association is large enough not to be considered trivial. This decision may have to be made somewhat arbitrarily but seems better to be made somewhat arbitrarily before data collection than to be made after examination of the data.” (p. 16).
The reason is simple: it is impossible to provide evidence for the nil-hypothesis that an effect size is exactly zero, just like it is impossible to show that an effect size equals any other precise value (e.g., r = .1). Although Greenwald made this sensible suggestion over 40 years ago, it is nearly impossible to find articles that specify a range of effect sizes a priori (e.g., we expected the effect size to be in the range between r = .3 and r = .5, or we expected the correlation to be larger than r = .1).
Bad training continues to be a main reason for the lack of progress in psychological science. However, other factors also play a role. First, specifying effect sizes a priori has implications for the specification of sample sizes. A researcher who declares that effect sizes as small as r = .1 are meaningful and expected needs large samples to obtain precise effect size estimates. For example, assuming the population correlation is r = .2 and a researcher wants to show that it is at least r = .1, a one-sided test with alpha = .05 and 95% power (i.e., the probability of a successful outcome) requires N = 1,035. As most sample sizes in psychology are below N = 200, most studies simply lack the precision to test hypotheses that predict small effects. A solution might be to focus on hypotheses that predict large effect sizes. However, showing that a population correlation of r = .4 is greater than r = .3 still requires N = 833 participants. In fact, most studies in psychology barely have enough power to demonstrate that moderate correlations, r = .3, are greater than zero (N = 138). In short, most studies are too small to provide evidence for the null-hypothesis that effect sizes are smaller than a minimum effect size. Not surprisingly, psychological theories are rarely abandoned because empirical results seemed to support the null-hypothesis.
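Sample-size figures like these can be reproduced with a standard Fisher z-transformation power calculation. The sketch below is my own illustration (the function name and exact rounding are mine, not from the original post); it tests a correlation against a non-zero minimum effect size, treating the standard error of Fisher’s z as 1/sqrt(N − 3).

```python
import math
from statistics import NormalDist

def n_for_correlation_test(rho, r_min, alpha=0.05, power=0.95, one_sided=True):
    """Approximate N needed to show that a true correlation rho
    exceeds a minimum effect size r_min, via Fisher's z-transformation
    (the standard error of z is approximately 1 / sqrt(N - 3))."""
    z_rho = math.atanh(rho)    # Fisher z of the assumed population correlation
    z_min = math.atanh(r_min)  # Fisher z of the minimum (null) effect size
    z_alpha = NormalDist().inv_cdf(1 - (alpha if one_sided else alpha / 2))
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / (z_rho - z_min)) ** 2 + 3)

# rho = .2 tested against a minimum of r = .1, one-sided alpha = .05, 95% power
print(n_for_correlation_test(0.2, 0.1))  # close to the N = 1,035 quoted above
```

The same function with r_min = 0 shows why nil-hypothesis tests are so much cheaper than minimum-effect tests: the denominator (the gap between the two Fisher z values) is much larger when the null is zero.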
However, occasionally studies do have large samples, and it would be possible to follow Greenwald’s (1975) recommendation to specify a minimum effect size a priori. For example, Greenwald and colleagues conducted a study with N = 1,411 participants who reported their intentions to vote for Obama or McCain in the 2008 US elections. The main hypothesis was that implicit measures of racial attitudes like the race IAT would add to the prediction because some White Democrats might not vote for a Black Democratic candidate. It would have been possible to specify a minimum effect size based on a meta-analysis that was published in the same year. This meta-analysis of smaller studies suggested that the average race IAT-criterion correlation was r = .236, the explicit-criterion correlation was r = .186, and the explicit-implicit correlation was only r = .117. Given the lower estimates for the explicit measures and the low explicit-implicit correlation, a regression analysis would only slightly reduce the effect size for the incremental predictive validity of the race IAT, b = .225. Thus, it would have been possible to test the hypothesis that the effect size is at least b = .1, which would imply that adding the race IAT as a predictor explains at least 1% additional variance in voting behaviors.
In reality, the statistical analyses were conducted with prejudice against the null-hypothesis. First, Greenwald et al. (2009) noted that “conservatism and symbolic racism were the two strongest predictors of voting intention (see Table 1)” (p. 247).
A straightforward way to test the hypothesis that the race IAT contributes to the prediction of voting would simply add the standardized race IAT as an additional predictor and use the regression coefficient to test the prediction that implicit bias, as measured with the race IAT, contributes to voting against Obama. A more stringent test of incremental predictive validity would also include the other explicit prejudice measures, because measurement error alone can produce incremental predictive validity for measures of the same construct. However, this is not what the authors did. Instead, they examined whether the four racial attitude measures jointly predicted variance in addition to political orientation. This was the case, with 2% additional explained variance (p < .001). However, this result does not tell us anything about the unique contribution of the race IAT. The unique contributions of the four measures were not reported. Instead, another regression model tested whether the race IAT and a second implicit measure (the Affective Misattribution Procedure) explained incremental variance in addition to political orientation. In this model, “the pair of implicit measures incrementally predicted only 0.6% of voting intention variance, p = .05” (p. 247). This model also does not tell us anything about the importance of the race IAT because it was not reported how much of the joint contribution was explained by the race IAT alone. The inclusion of the AMP also makes it impossible to test the statistical significance of the race IAT because most of the prediction may come from the shared variance between the two implicit measures, r = .218. Most important, the model does not test whether the race IAT predicts voting above and beyond explicit measures, including symbolic racism.
Another multiple regression analysis entered symbolic racism and the two implicit measures. In this analysis, the two implicit measures combined explained an additional 0.7% of the variance, but this was not statistically significant, p = .07.
They then fitted the model with all predictor variables. In this model, the four attitude measures explained an additional 1.3% of the variance, p = .01, but no information is provided about the unique contribution of the race IAT or the joint contribution of the two implicit measures. The authors merely comment that “among the four race attitude measures, the thermometer difference measure was the strongest incremental predictor and was also the only one of the four that was individually statistically significant in their simultaneous entry after both symbolic racism and conservatism” (p. 247).
To put it mildly, the presented results carefully avoid reporting the crucial result about the incremental predictive validity of the race IAT after explicit measures of prejudice are entered into the equation. Adding the AMP only creates confusion because the empirical question is how much the race IAT adds to the prediction of voting behavior. Whether this variance is shared with another implicit measure or not is not relevant.
Table 1 can be used to obtain the results that were not reported in the article. A regression analysis shows a standardized effect size estimate of 0.000 with a 95%CI that ranges from -.047 to .046. The upper limit of this confidence interval is below the minimum effect size of .1 that was used to specify a reasonable null-hypothesis. Thus, the only study that had sufficient precision to test the incremental predictive validity of the race IAT shows that the IAT does not make a meaningful, notable, practically significant contribution to the prediction of racial bias in voting. In contrast, several self-report measures did show that racial bias influenced voting behavior above and beyond the influence of political orientation.
Greenwald et al.’s (2009) article illustrates Greenwald’s (1975) prejudice against the null-hypotheses. Rather than reporting a straightforward result, they present several analyses that disguise the fact that the race IAT did not predict voting behavior. Based on these questionable analyses, the authors misrepresent the findings. For example, they claim that “both the implicit and explicit (i.e., self-report) race attitude measures successfully predicted voting.” They omit that this statement is only correct when political orientation and symbolic racism are not used as predictors.
They then argue that their results “supplement the substantial existing evidence that race attitude IAT measures predict individual behavior (reviewed by Greenwald et al., 2009)” (p. 248). This statement is false. The meta-analysis suggested that incremental predictive validity of the race IAT is r ~ .2, whereas this study shows an effect size of r ~ 0 when political orientation is taken into account.
The abstract, often the only information that is available or read, further misleads readers. “The implicit race attitude measures (Implicit Association Test and Affect Misattribution Procedure) predicted vote choice independently of the self-report race attitude measures, and also independently of political conservatism and symbolic racism. These findings support construct validity of the implicit measures” (p. 242). Careful reading of the results section shows that the statement refers to separate analyses in which implicit measures are tested controlling for explicit attitude ratings OR political orientation OR symbolic racism. The new results presented here show that the race IAT does not predict voting controlling for explicit attitudes AND political orientation AND symbolic racism.
The deceptive analysis of these data has led to many citations claiming that the race IAT is an important predictor of actual behavior. For example, in their popular book “Blindspot,” Banaji and Greenwald list this study as an example that “the Race IAT predicted racially discriminatory behavior. A continuing stream of additional studies that have been completed since publication of the meta-analysis likewise supports that conclusion. Here are a few examples of race-relevant behaviors that were predicted by automatic White preference in these more recent studies: voting for John McCain rather than Barack Obama in the 2008 U.S. presidential election” (p. 49).
Kurdi and Banaji (2017) use the study to claim that “investigators have used implicit race attitudes to predict widely divergent outcome measures” (p. 282), without noting that even the reported results showed less than 1% incremental predictive validity. A review of prejudice measures features this study as an example of predictive validity (Fiske & North, 2014).
Of course, a single study with a single criterion is insufficient to accept the null-hypothesis that the race IAT lacks incremental predictive validity. A new meta-analysis by Kurdi with Greenwald as co-author provides new evidence about the typical amount of incremental predictive validity of the race IAT. The only problem is that this information is not provided. I therefore analyzed the open data to get this information. The meta-analytic results suggest an implicit-criterion correlation of r = .100, se = .01, an explicit-criterion correlation of r = .127, se = .02, and an implicit-explicit correlation of r = .139, se = .022. A regression analysis yields an estimate of the incremental predictive validity of the race IAT of .084, 95%CI = .040 to .121. While this effect size is statistically significant in a test against the nil-hypothesis, it is also statistically different from Greenwald et al.'s (2009) estimate of b = .225. Moreover, the point estimate is below .1, which could be used to affirm the null-hypothesis, but the confidence interval includes a value of .1. Thus, there is a 20% chance (an 80%CI would not include .1) that the effect size is greater than .1, but it is unlikely (p < .05) that it is greater than .12.
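The regression estimate can be reproduced directly from the three meta-analytic correlations with the standard formula for a standardized regression weight with two predictors. A minimal sketch (the function name is mine; the correlations are the ones reported above):

```python
def incremental_beta(r_ic, r_ec, r_ie):
    """Standardized weight of the implicit measure (I) when the criterion (C)
    is regressed on the implicit and explicit (E) measures jointly."""
    return (r_ic - r_ec * r_ie) / (1 - r_ie ** 2)

# Meta-analytic correlations reported above
beta_iat = incremental_beta(r_ic=0.100, r_ec=0.127, r_ie=0.139)  # ~.084
```

The point estimate of .084 for the race IAT falls out of the three correlations alone; the confidence interval additionally requires the standard errors.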
Greenwald and Lai (2020) wrote an Annual Review article about implicit measures. It mentions that estimates of the predictive validity of IATs have decreased from r = .274 (Greenwald et al., 2009) to r = .097 (Kurdi et al., 2019). No mention is made of a range of effect sizes that would support the null-hypothesis that implicit measures do not add to the prediction of prejudice because they do not measure an implicit cause of behavior that is distinct from causes of prejudice that are reflected in self-report measures. Thus, Greenwald fails to follow the advice of his younger self to provide a strong test of a theory by specifying effect sizes that would provide support for the null-hypothesis and against his theory of implicit cognitions.
It is not only ironic to illustrate the prejudice against falsification with Greenwald's own research. It also shows that the one-sided testing of theories that avoids failures is not merely the result of inadequate training in statistics or philosophy of science. After all, Greenwald demonstrated that he is well aware of the problems with nil-hypothesis testing. Thus, only motivated biases can explain the one-sided examination of the evidence. Once researchers have made a name for themselves, they are no longer neutral observers like judges or juries. They are more like prosecutors who will try as hard as possible to get a conviction and ignore evidence that may support a not-guilty verdict. To make matters worse, science does not really have an adversarial system in which a defense lawyer stands up for the defendant (i.e., the null-hypothesis), and no evidence can be presented to support the defendant.
Once we realize the power of motivated reasoning, it is clear that we need to separate the work of theory development and theory evaluation. We cannot let researchers who developed a theory conduct meta-analyses and write review articles, just like we cannot ask film directors to write their own movie reviews. We should leave meta-analyses and reviews to a group of theoretical psychologists who do not conduct original research. As grant money for original research is extremely limited and a lot of time and energy is wasted on grant proposals, there is ample capacity for psychologists to become meta-psychologists. Their work also needs to be evaluated differently. The aim of meta-psychology is not to make novel discoveries, but to confirm that claims by original researchers about their discoveries are actually robust, replicable, and credible. Given the well-documented bias in the published literature, a lot of work remains to be done.
After I posted this post, I learned about a published meta-analysis and new studies of incidental anchoring by David Shanks and colleagues that came to the same conclusion (Shanks et al., 2020).
“The most expensive car in the world costs $5 million. How much does a new BMW 530i cost?”
According to anchoring theory, information about the most expensive car can lead to higher estimates of the cost of a BMW. Anchoring effects have been demonstrated in many credible studies since the 1970s (Tversky & Kahneman, 1974).
A more controversial claim is that anchoring effects occur even when the numbers are unrelated to the question and presented incidentally (Critcher & Gilovich, 2008). In one study, participants saw a picture of a football player and were asked to estimate the likelihood that the player would sack the quarterback in the next game. The number on the player's jersey was manipulated to be 54 or 94. The study produced a statistically significant result suggesting that a higher number makes people give higher likelihood judgments. This study started a small literature on incidental anchoring effects. A variation on this theme are studies that present numbers so briefly on a computer screen that most participants do not actually see them. This is called subliminal priming. Allegedly, subliminal priming also produces anchoring effects (Mussweiler & Englich, 2005).
Since 2011, many psychologists have become skeptical about whether statistically significant results in published articles can be trusted. The reason is that researchers only published results that supported their theoretical claims, even when the claims were outlandish. For example, significant results suggested that extraverts can foresee where pornographic images are displayed on a computer screen even before the computer randomly selects the location (Bem, 2011). No psychologist, except Bem, believes these findings. More problematic is that many other findings are equally incredible. A replication project found that only 25% of results in social psychology could be replicated (Open Science Collaboration, 2015). So, the question is whether incidental and subliminal anchoring are more like classic anchoring or more like extrasensory perception.
There are two ways to assess the credibility of published results when publication bias is present. One approach is to conduct credible replication studies that are published independent of the outcome of a study. The other approach is to conduct a meta-analysis of the published literature that corrects for publication bias. A recent article used both methods to examine whether incidental anchoring is a credible effect (Kvarven et al., 2020). In this article, the two approaches produced inconsistent results. The replication study produced a non-significant result with a tiny effect size, d = .04 (Klein et al., 2014). However, even with bias-correction, the meta-analysis suggested a significant, small to moderate effect size, d = .40.
The data for the meta-analysis were obtained from an unpublished thesis (Henriksson, 2015). I suspected that the meta-analysis might have coded some studies incorrectly. Therefore, I conducted a new meta-analysis, using the same studies and one new study. The main difference between the two meta-analyses is that I coded studies based on the focal hypothesis test that was used to claim evidence for incidental anchoring. The p-values were then transformed into fisher-z transformed correlations and sampling errors, 1/sqrt(N – 3), based on the sample sizes of the studies.
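Under a normal approximation, this conversion is a one-liner: the two-sided p-value is converted back to a z-statistic, and multiplying by the sampling error yields the Fisher-z effect size. A stdlib-only sketch (the function name is mine):

```python
from math import sqrt
from statistics import NormalDist

def p_to_fisher_z(p, n):
    """Convert a two-sided p-value and total sample size N into a
    Fisher-z transformed correlation and its sampling error."""
    z_stat = NormalDist().inv_cdf(1 - p / 2)  # p-value back to z-statistic
    se = 1 / sqrt(n - 3)                      # sampling error of Fisher z
    return z_stat * se, se

fz, se = p_to_fisher_z(p=0.05, n=100)  # fz ~ .199, se ~ .102
```

A just-significant result in a sample of 100 thus corresponds to a Fisher-z correlation of about .2, which is why small significant studies necessarily report sizable effect sizes.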
Whereas the old meta-analysis suggested that there is no publication bias, the new meta-analysis showed a clear relationship between sampling error and effect sizes, b = 1.68, se = .56, z = 2.99, p = .003. Correcting for publication bias produced a non-significant intercept, b = .039, se = .058, z = 0.672, p = .502, suggesting that the real effect size is close to zero.
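This bias test is a PET-style meta-regression: effect sizes are regressed on their sampling errors, and the intercept estimates the effect size in a hypothetical study without sampling error. A minimal numpy sketch with made-up numbers (not the actual anchoring studies) that illustrates the logic:

```python
import numpy as np

# Hypothetical pattern of pure publication bias: effect sizes rise
# linearly with sampling error, and the bias-corrected effect is zero.
se = np.array([0.05, 0.10, 0.15, 0.20, 0.25])
fz = 0.0 + 1.5 * se

X = np.column_stack([np.ones_like(se), se])        # intercept + sampling error
intercept, slope = np.linalg.lstsq(X, fz, rcond=None)[0]
# intercept ~ 0 (bias-corrected effect), slope ~ 1.5 (publication bias)
```

A positive slope means smaller studies report larger effects, the signature of selection for significance; the near-zero intercept mirrors the result reported above.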
Figure 1 shows the regression line for this model in blue and the results from the replication study in green. We see that the blue and green lines intersect when sampling error is close to zero. As sampling error increases because sample sizes are smaller, the blue and green lines diverge more and more. This shows that effect sizes in small samples are inflated by selection for significance.
However, there is some statistically significant variability in the effect sizes, I2 = 36.60%, p = .035. To further examine this heterogeneity, I conducted a z-curve analysis (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). A z-curve analysis converts p-values into z-statistics. The histogram of these z-statistics shows publication bias, when z-statistics cluster just above the significance criterion, z = 1.96.
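The conversion that z-curve performs is simple: a two-sided p-value is mapped to the absolute z-statistic that would produce it. A stdlib-only sketch:

```python
from statistics import NormalDist

def p_to_z(p):
    """Two-sided p-value to the absolute z-statistic that z-curve models."""
    return NormalDist().inv_cdf(1 - p / 2)

# Just-significant p-values land right above the criterion of 1.96
z_crit = p_to_z(0.05)   # ~1.96
z_just = p_to_z(0.049)  # ~1.97
```

When many published p-values fall just below .05, the resulting z-statistics pile up just above 1.96, which is exactly the pattern z-curve uses to diagnose publication bias.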
Figure 2 shows a big pile of just significant results. As a result, the z-curve model predicts a large number of non-significant results that are absent. While the published articles have a 73% success rate, the observed discovery rate, the model estimates that the expected discovery rate is only 6%. That is, for every 100 tests of incidental anchoring, only 6 studies are expected to produce a significant result. To put this estimate in context, with alpha = .05, 5 studies are expected to be significant based on chance alone. The 95% confidence interval around this estimate includes 5% (chance level) and has an upper limit of 26%. Thus, researchers who reported significant results did so based on studies with very low power, and they needed luck or questionable research practices to get significant results.
A low discovery rate implies a high false positive risk. With an expected discovery rate of 6%, the false discovery risk is 76%. This is unacceptable. To reduce the false discovery risk, it is possible to lower the alpha criterion for significance. In this case, lowering alpha to .005 produces a false discovery risk of 5%. This leaves 5 studies that are significant.
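Soric's (1989) upper bound on the false discovery risk follows directly from the discovery rate and alpha. A minimal sketch (the risk reported by z-curve can differ slightly because the expected discovery rate is rounded in the text):

```python
def soric_fdr(edr, alpha=0.05):
    """Soric's (1989) maximum false discovery risk implied by a
    discovery rate `edr` under significance criterion `alpha`."""
    return (1 / edr - 1) * alpha / (1 - alpha)

soric_fdr(1.0)    # 0.0: if every test succeeds, none need be false
soric_fdr(0.05)   # 1.0: discoveries at chance level may all be false
soric_fdr(0.065)  # ~.76, in line with the risk reported above
```

The formula makes clear why lowering alpha helps: with a fixed discovery rate, a smaller alpha shrinks the numerator of the bound.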
One notable study with strong evidence, z = 3.70, examined anchoring effects for actual car sales. The data came from an actual auction of classic cars. The incidental anchors were the prices of the previous bid for a different vintage car. Based on sales data of 1,477 cars, the authors found a significant effect, b = .15, se = .04 that translates into a standardized effect size of d = .2 (fz = .087). Thus, while this study provides some evidence for incidental anchoring effects in one context, the effect size estimate is also consistent with the broader meta-analysis that effect sizes of incidental anchors are fairly small. Moreover, the incidental anchor in this study is still in the focus of attention and in some way related to the actual bid. Thus, weaker effects can be expected for anchors that are not related to the question at all (a player’s number) or anchors presented outside of awareness.
In sum, the published evidence for incidental anchoring cannot be trusted at face value. Consistent with research practices in psychology more generally, studies on incidental and subliminal anchoring suffer from publication bias that undermines the credibility of the published results. Unbiased replication studies and bias-corrected meta-analyses suggest that incidental anchoring effects are either very small or zero. Thus, there exists currently no empirical support for the notion that irrelevant numeric information can bias numeric judgments. More research on anchoring effects that corrects for publication bias is needed.
Social psychology suffers from a replication crisis because publication bias undermines the evidential value of published significant results. Meta-analyses that do not correct for publication bias are biased and cannot be used to estimate effect sizes. Here I show that a meta-analysis of the ease-of-retrieval effect (Weingarten & Hutchinson, 2018) did not fully correct for publication bias and that 200 significant results for the ease-of-retrieval effect can be fully explained by publication bias. This conclusion is consistent with the results of the only registered replication study of ease of retrieval (Groncki et al., 2021). As a result, there is no empirical support for the ease-of-retrieval effect. Implications for the credibility of social psychology are discussed.
Until 2011, social psychology appeared to have made tremendous progress. Daniel Kahneman (2011) reviewed many of the astonishing findings in his book “Thinking, Fast and Slow.” His book used Schwarz et al.'s (1991) ease-of-retrieval research as an example of rigorous research on social judgments.
The ease-of-retrieval paradigm is simple. Participants are assigned to one of two conditions. In one condition, they are asked to recall a small number of examples from memory. The number is chosen to make this easy. In the other condition, participants are asked to recall a larger number of examples. The number is chosen so that it is hard to come up with the requested number of examples. This task is used to elicit a feeling of ease or difficulty. Hundreds of studies have used this paradigm to study the influence of ease of retrieval on a variety of judgments.
In the classic studies that introduced the paradigm, participants were asked to retrieve a few or many examples of assertiveness behaviors before answering a question about their assertiveness. Three studies suggested that participants based their personality judgments on the ease of retrieval.
However, this straightforward finding is not always found. Kahneman points out that participants sometimes do not rely on the ease of retrieval. Paradoxically, they sometimes rely on the number of examples they retrieved even though the number was given by the experimenter. What made ease-of-retrieval a strong theory was that ease of retrieval researchers seemed to be able to predict the conditions that made people use ease as information and the conditions when they would use other information. “The proof that you truly understand a pattern of behavior is that you know how to reverse it” (Kahneman, 2011).
This success story had one problem. It was not true. In 2011, it became apparent that social psychologists used questionable research practices to produce significant results. Thus, rather than making amazing predictions about the outcomes of studies, they searched for statistical significance and then claimed that they had predicted these effects (John, Loewenstein, & Prelec, 2012; Kerr, 1998). Since 2011, it has become clear that only a small percentage of results in social psychology can be replicated without questionable practices (Open Science Collaboration, 2015).
I had my doubts about the ease-of-retrieval literature because I had heard rumors that researchers were unable to replicate these effects, but it was common not to publish these replication failures. My suspicions appeared to be confirmed when John Krosnick gave a talk about a project that replicated 12 experiments in a large nationally representative sample. All but one experiment were successfully replicated. The exception was the ease-of-retrieval study, a direct replication of Schwarz et al.'s (1991) assertiveness studies. These results were published several years later (Yeager et al., 2019).
I was surprised when Weingarten and Hutchinson (2018) published a detailed and comprehensive meta-analysis of published and unpublished ease-of-retrieval studies and found evidence for a moderate effect size (d ~ .4) even after correcting for publication bias. This conclusion, based on many small studies, seemed inconsistent with the replication failure in the large nationally representative sample (Yeager et al., 2019). Moreover, the first pre-registered direct replication of Schwarz et al. (1991) also produced a replication failure (Groncki et al., 2021). One possible explanation for the discrepancy between the meta-analytic results and the replication results could be that the meta-analysis did not fully correct for publication bias. To test this hypothesis, I used the openly shared data to examine the robustness of the effect size estimate. I also conducted a new meta-analysis that included studies published after 2014, using a different coding scheme that codes only one focal hypothesis test per study. The results show that the effect size estimate in Weingarten and Hutchinson's (2018) meta-analysis is not robust and depends heavily on outliers. I also find that their coding scheme attenuates the detection of bias, which leads to inflated effect size estimates. The new meta-analysis shows an effect size estimate close to zero. It also shows that heterogeneity is fully explained by publication bias.
Reproducing the Original Meta-Analysis
All effect sizes are Fisher-z transformed correlation coefficients. The predictor is the standard error; 1/sqrt(N – 3). Figure 1 reproduces the funnel plot in Weingarten and Hutchinson (2018), with the exception that sampling error is plotted on the x-axis and effect sizes are plotted on the y-axis.
Figure 1 also includes the predictions (regression lines) for three models. The first model is an unweighted average. This model assumes that there is no publication bias. The straight orange line shows that this model assumes an average effect size of z = .23 for all sample sizes. The second model assumes that there is publication bias and that bias increases in a linear fashion with sampling error. The slope of the blue regression line is significant and suggests that publication bias is present. The intercept of this model can be interpreted as the unbiased effect size estimate (Stanley, 2017). The intercept is z = .115 with a 95% confidence interval that ranges from .036 to .193. These results reproduce the results in Weingarten and Hutchinson (2018) closely, but not exactly, r = .104, 95%CI = .034 to .172. Simulation studies suggest that this effect size estimate underestimates the true effect size when the intercept is significantly different from zero (Stanley, 2017). In this case, it is recommended to use the variance (sampling error squared) as a model of publication bias. The red curve shows the predictions of this model. Most importantly, the intercept is now nearly at the same level as the model without publication bias, z = .221, 95%CI = .174 to .267. Once more, these results closely reproduce the published results, r = .193, 95%CI = .153 to .232.
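The red curve corresponds to a PEESE-style model: the same meta-regression, but with the variance (sampling error squared) as the bias predictor, which flattens the correction near zero sampling error. A numpy sketch with illustrative numbers (not the actual ease-of-retrieval data):

```python
import numpy as np

# Hypothetical data: a true effect of .20 plus bias growing with the variance
se = np.array([0.05, 0.10, 0.15, 0.20, 0.25])
fz = 0.20 + 3.0 * se ** 2

X = np.column_stack([np.ones_like(se), se ** 2])   # intercept + variance
intercept, slope = np.linalg.lstsq(X, fz, rcond=None)[0]
# intercept ~ .20: PEESE's bias-corrected effect size estimate
```

Because the quadratic predictor is nearly flat for precise studies, this model corrects less aggressively than the linear model, which is why its intercept lands closer to the uncorrected average.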
The problem with unweighted models is that data points from small studies are given the same weight as data points from large studies. In this particular case, small studies are given even more weight than larger studies because studies with extremely small sample sizes (N < 20) are outliers, and outliers are weighted more heavily in regression analysis. Inspection of the scatter plot shows that 7 studies with sample sizes less than 10 (5 per condition) have a strong influence on the regression line. As a result, all three regression lines in Figure 1 overestimate effect sizes for studies with more than 100 participants. Thus, the intercept overestimates the effect sizes for large studies, including Yeager et al.'s (2019) study with N = 1,323 participants. In short, the effect size estimate in the meta-analysis is strongly influenced by 7 data points that represent fewer than 100 participants.
A simple solution to this problem is to weight observations by sample size so that larger samples are given more weight. This is actually the default option for many meta-analysis programs, such as the metafor package in R (Viechtbauer, 2010). Thus, I reran the same analyses with observations weighted by sample size. Figure 2 shows the results. In Figure 2, the size of the data points reflects their weights. The most important difference in the results is that the intercept for the model with a linear effect of sampling error is practically zero and not statistically significant, z = .006, 95%CI = -.040 to .052. The confidence interval is small enough to infer that the typical effect size is close enough to zero to accept the null-hypothesis.
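Weighting by sample size is the standard weighted-least-squares trick: each row of the regression is scaled by the square root of its weight. A numpy sketch with illustrative numbers (not the actual studies):

```python
import numpy as np

n = np.array([10, 50, 200, 1000])     # sample sizes serve as weights
se = 1 / np.sqrt(n - 3)
fz = 0.0 + 1.2 * se                   # hypothetical pure-bias pattern

w = np.sqrt(n)                        # square root of the weights
X = np.column_stack([np.ones_like(se), se])
coef = np.linalg.lstsq(X * w[:, None], fz * w, rcond=None)[0]
# coef[0] ~ 0: with large studies dominating, the intercept stays near zero
```

Because the sampling error of Fisher-z is 1/sqrt(N − 3), sample-size weights are nearly proportional to inverse-variance weights, the usual meta-analytic default.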
Proponents of ease-of-retrieval will, however, not be satisfied with this answer. First, inspection of Figure 2 shows that the intercept is now strongly influenced by a few large samples. Moreover, the model does show heterogeneity in effect sizes, I2 = 33.38%, suggesting that at least some of the significant results were based on real effects.
Coding of Studies
Effect size meta-analysis evolved without serious consideration of publication bias. Although publication bias has been known to be present since meta-analysis was invented (Sterling, 1959), it was often an afterthought rather than part of the meta-analytic model (Rosenthal, 1979). Without having to think about publication bias, it became a common practice to code individual studies without a focus on the critical test that was used to publish a study. This practice obscures the influence of publication bias and may lead to an overestimation of the average effect size. To illustrate this, I am going to focus on the 7 data points in Figure 1 that were coded with sample sizes less than 10.
Six of the observations stem from an unpublished dissertation by Bares (2007) that was supervised by Norbert Schwarz. The dissertation was a study with children. The design had the main manipulation of ease of retrieval (few vs. many) as a between-subjects factor. Additional factors were gender, age (kindergartners vs. second graders), and 5 content domains (books, shy, friendly, nice, mean). The key dependent variables were frequency estimates. The total sample size was 198, with 98 participants in the easy condition and 100 in the difficult condition. The hypothesis was that ease-of-retrieval would influence judgments independent of gender or content. However, rather than testing the overall main effect across all participants, the dissertation presents analyses separately for different ages and contents. This led to the coding of this study, which had a reasonable sample size of N = 198, as 20 effects with sample sizes of N = 7 to 9. Only six of these effects were included in the meta-analysis. Thus, the meta-analysis added 6 studies with non-significant results, when there was only one study with non-significant results that was put in the file drawer. As a result, the meta-analysis no longer represents the amount of publication bias in the ease-of-retrieval literature. Adding these six effects to the meta-analysis makes the data look less biased and attenuates the regression of effect sizes on sampling error, which in turn leads to a higher intercept. Thus, traditional coding of effect sizes in meta-analyses can lead to inflated effect size estimates even in models that aim to correct for publication bias.
An Updated Meta-Analysis of Ease-of-Retrieval
Building on Weingarten and Hutchinson’s (2018) meta-analysis, I conducted a new meta-analysis that relied on test statistics that were reported to test ease-of-retrieval effects. I only used published articles because the only reason to search for unpublished studies is to correct for publication bias. However, Weingarten and Hutchinson’s meta-analysis showed that publication bias is still present even with a diligent attempt to obtain all data. I extended the time frame of the meta-analysis by searching for new publications since the last year that was included in Weingarten and Hutchinson’s meta-analysis (i.e., 2014). For each study, I looked for the focal hypothesis test of the ease-of-retrieval effect. In some studies, this was a main effect. In other studies, it was a post-hoc test following an interaction effect. The exact p-values were converted into t-values and t-values were converted into fisher-z scores as effect sizes. Sampling error was based on the sample size of the study or the subgroup in which the ease of retrieval effect was predicted. For the sake of comparability, I again show unweighted and weighted results.
The effect size estimate for the random effects model that ignores publication bias is z = .340, 95%CI = .317 to .363. This would be a moderate effect size (d ~ .6). The model also shows a moderate amount of heterogeneity, I2 = 33.48%. Adding sampling error as a predictor dramatically changes the results. The effect size estimate is now practically zero, z = .020, and the 95%CI is small enough to conclude that any effect would be small, 95%CI = -.048 to .088. Moreover, publication bias fully explains heterogeneity, I2 = 0.00%. Based on this finding, it is not recommended to use the variance as a predictor (Stanley, 2017). However, for the sake of comparison, Figure 3 also shows the results for this model. The red curve shows that the model makes similar predictions in the middle, but overestimates effect sizes for large samples and for small samples. Thus, the intercept is not a reasonable estimate of the average effect size, z = .183, 95%CI = .144 to .222. In conclusion, the new coding shows clearer evidence of publication bias, and even the unweighted analysis shows no evidence that the average effect size differs from zero.
Figure 4 shows that the weighted models produce very similar results to the unweighted results.
The key finding is that the intercept is not significantly different from zero, z = -.016, 95%CI = -.053 to .022. The upper bound of the 95%CI corresponds to an effect size of r = .022 or d = .04. Thus, the typical ease of retrieval effect is practically zero and there is no evidence of heterogeneity.
Meta-analysis treats individual studies as interchangeable tests of a single hypothesis. This makes sense when all studies are more or less direct replications of the same experiment. However, meta-analyses in psychology often combine studies that vary in important details, such as the population (adults vs. children) and the dependent variables (frequency judgments vs. attitudes). Even if a meta-analysis showed a significant average effect size, it would remain unclear which particular conditions show the effect and which ones do not. This is typically examined in moderator analyses, but when publication bias is strong and effect sizes are dramatically inflated, moderator analyses have low power to detect signals in the noise.
In Figure 4, real moderators would produce systematic deviations from the blue regression line. As these residuals are small and strongly influenced by sampling error, finding a moderator is like looking for a needle in a haystack. To do so, it is useful to look for individual studies that produced more credible results than the average study. A new tool that can be used for this purpose is z-curve (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020).
Z-curve does not decompose p-values into separate effect size and sampling error components. Rather it converts p-values into z-scores and models the distribution of z-scores with a finite mixture model. The results provide complementary information about publication bias that does not rely on variation in sample sizes. As correlations between sampling error and effect sizes can be produced by other factors, z-curve provides a more direct test of publication bias.
Z-curve also provides information about the false positive risk in individual studies. If a literature has a low discovery rate (many studies produce non-significant results), the false discovery risk is high (Soric, 1989). Z-curve estimates the size of the file drawer and provides a corrected estimate of the expected discovery rate. To illustrate z-curve, I fitted z-curve to the ease-of-retrieval studies in the new meta-analysis (Figure 5).
Visual inspection shows that most z-statistics are just above the criterion for statistical significance z = 1.96. This corresponds to the finding that most effect sizes are about 2 times the magnitude of sampling error, which produces a just significant result. The z-curve shows a rapid decline of z-statistics as z-values increase. The z-curve model uses the shape of this distribution to estimate the expected discovery rate; that is, the proportion of significant results that are observed if all tests that were conducted were available. The estimate of 8% implies that most ease-of-retrieval tests are extremely underpowered and can only produce significant results with the help of sampling error. Thus, most of the observed effect size estimates in Figure 4 reflect sampling error rather than any population effect sizes.
The expected discovery rate can be compared to the observed discovery rate to assess the amount of publication bias. The observed discovery rate is simply the percentage of significant results for the 128 studies. The observed discovery rate is 83% and would be even higher if marginally significant results, p < .10, z > 1.65, were counted as significant. Thus, the observed discovery rate is 10 times higher than the expected discovery rate. This shows massive publication bias.
The difference between the expected and observed discovery rate is also important for the assessment of the false positive risk. As Soric (1989) showed, the risk of false positives increases as the discovery rate decreases. The observed discovery rate of 83% implies that the false positive risk is very small (1%). Thus, readers of journals are given the illusion that ease-of-retrieval effects are robust and that researchers have a very good understanding of the conditions that produce the effect. Hence Kahneman's praise of researchers' ability to show the effect and to reverse it seemingly at will. The z-curve results show that this is an illusion because researchers only publish results when a study was successful. With an expected discovery rate of 8%, the false discovery risk is 61%. Thus, there is a high risk that many of the published results are false positives, which is consistent with the effect size estimates close to zero in the bias-corrected meta-analysis.
One solution to reduce the false-positive risk is to lower the significance criterion (Benjamin et al., 2017). Z-curve can be fitted with different alpha-levels to examine the influence on the false positive risk. By setting alpha to .005, the false positive risk is below 5% (Figure 6).
This leaves 36 studies that may have produced a real effect. A true positive result does not mean that a direct replication study will produce a significant result. To estimate replicability, we can select only the studies with p < .005 (z > 2.8) and fit z-curve to these studies using the standard significance criterion of .05. The false discovery risk inches up a bit but may be considered acceptable at 8%. However, the expected replication rate with the same sample sizes is only 47%. Thus, replication studies need to increase sample sizes to avoid false negative results.
Five of the studies with strong evidence are by Sanna and colleagues. This is noteworthy because Sanna retracted 8 articles, including an article with ease-of-retrieval effects under suspicion of fraud (Yong, 2012). It is therefore unlikely that these studies provide credible evidence for ease-of-retrieval effects.
An article with three studies reported consistently strong evidence (Ofir et al., 2008). All studies manipulated the ease of recall of products and found that recalling a few low priced items made participants rate a store as less expensive than recalling many low priced items. It seems simple enough to replicate this study to test the hypothesis that ease of retrieval effects influence judgments of stores. Ease of retrieval may have a stronger influence for these judgments because participants may have less chronically accessible and stable information to make these judgments. In contrast, assertiveness judgments may be harder to move because people have highly stable self-concepts that show very little situational variance (Eid & Diener, 2004; Anusic & Schimmack, 2016).
Another article that provided three studies examined willingness to pay for trips to England (Sinha & Naykankuppam, 2013). A major difference to other studies was that this study supplied participants with information about tourist destinations in England and after a delay used recall of this information to manipulate ease of retrieval. Once more, ease-of-retrieval may have had an effect in these studies because participants had little chronically accessible information to make willingness-to-pay judgments.
A third three-study article with strong evidence found that participants rated the quality of their memory for specific events (e.g., New Year’s Eve) worse when they were asked to recall many (vs. few) facts about the event (Echterhoff & Hirst, 2006). These results suggest that ease of retrieval is used for judgments about memory, but may not influence other judgments.
The focus on individual studies shows why moderator analyses in effect-size meta-analysis often produce non-significant results. Most of the moderators that can be coded are not relevant, whereas moderators that are relevant can be limited to a single article and are not coded.
The Original Paradigm
It is not clear why Schwarz et al. (1991) decided to manipulate personality ratings of assertiveness. A look into the personality literature suggests that these judgments are often made quickly and with high temporal stability. Thus, they seemed a challenging target to demonstrate the influence of situational effects.
It was also risky to conduct these studies with small sample sizes that require large effect sizes to produce significant results. Nevertheless, the first study with 36 participants produced an encouraging, marginally significant result, p = .07. Study 2 followed up on this result with a larger sample to boost power and did produce a significant result, F(1, 142) = 6.35, p = .01. However, observed power (70%) was still below the recommended level of 80%. Thus, the logical next step would have been to test the effect again with an even larger sample. Instead, the authors tested a moderator hypothesis in a smaller sample, which surprisingly produced a significant three-way interaction, F(1, 70) = 9.75, p < .001. Despite this strong interaction, the predicted ease-of-retrieval effects were not statistically significant because the cell sizes were very small, assertive: t(18) = 1.55, p = .14, unassertive: t(18) = 1.91, p = .07.
It is unlikely that three underpowered studies would all produce supportive evidence (Schimmack, 2012), suggesting that the reported results were selected from a larger set of tests. This hypothesis can be tested with the Test of Insufficient Variance (TIVA), a bias test for small sets of studies (Renkewitz & Keiner, 2019; Schimmack, 2015). TIVA shows that the variation in p-values is less than expected, but the evidence is not conclusive. Nevertheless, even if the authors were just lucky, future studies would be expected to produce non-significant results unless sample sizes were increased considerably. However, most direct replication studies of the original design used equally small sample sizes, yet reported successful outcomes.
Yahalom and Schul (2016) reported a successful replication in another small sample (N = 20), with an inflated effect size estimate, t(18) = 2.58, p < .05, d = 1.15. Rather than showing the robustness of the effect, this strengthens the evidence that bias is present, TIVA p = .05. Another study in the same article found evidence for the effect again, but only when participants were instructed to hear some background noise and not when they were instructed to listen to background noise, t(25) = 2.99, p = .006. The bias test remains significant, TIVA p = .05. Kuehnen did not find the effect, but claimed an interaction with item order for questions about ease of retrieval and assertiveness. A non-significant trend emerged when ease-of-retrieval questions were asked first, which was not reported, t(34) = 1.10, p = .28. The bias test remains significant, TIVA p = .08. More evidence from small samples comes from Caruso (2008). In Study 1a, 30 participants showed an ease-of-retrieval effect, F(1, 56) = 6.91, p = .011. The bias test remains significant, TIVA p = .06. In Study 1b with more participants (N = 55), the effect was not significant, F(1, 110) = 1.05, p = .31. The bias test remains significant despite the non-significant result, TIVA p = .08. Tormala et al. (2007) added another just-significant result with 79 participants, t(72) = 1.97, p = .05. This only strengthens the evidence of bias, TIVA p = .05. Yahalom and Schul (2013) also found a just-significant effect with 130 students, t(124) = 2.54, only to strengthen the evidence of bias, TIVA p = .04. Study 2 reduced the number of participants to 40, yet reported a significant result, F(1, 76) = 8.26, p = .005. Although this p-value nearly reached the .005 level, there is no theoretical explanation why this particular direct replication of the original finding should have produced a real effect. Evidence for bias remains significant, TIVA p = .05.
Study 3 reverted to a marginally significant result that only strengthens the evidence of bias, t(114) = 1.92, p = .06, TIVA p = .02. Greifeneder and Bless (2007) manipulated cognitive load and found the predicted trend only in the low-load condition, t(76) = 1.36, p = .18. Evidence for bias remained unchanged, TIVA p = .02.
In conclusion, from 1991 to 2016 published studies appeared to replicate the original findings, but this evidence is not credible because there is evidence of publication bias. Not a single one of these studies produced a p-value below .005, which has been suggested as a significance level that keeps the type-I error rate at an acceptable level (Benjamin et al., 2017).
Even meta-analyses of these small studies that correct for bias are inconclusive because sampling error is large and effect size estimates are imprecise. The only way to provide strong and credible evidence is to conduct a transparent and ideally pre-registered replication study with a large sample. One study like this was published by Yeager et al. (2019). With N = 1,325 participants the study failed to show a significant effect, F(1, 1323) = 1.31, p = .25. Groncki et al. (2021) conducted the first pre-registered replication study with N = 659 participants. They also ended up with a non-significant result, F(1, 657) = 1.34, p = .25.
These replication failures are only surprising if the inflated observed discovery rate is used to predict the outcome of future studies. Based on that rate, we would expect an 80% probability of significant results, and an even higher probability given the larger sample sizes. However, once publication bias is taken into account, the expected discovery rate is only 8%, and even large samples will not produce significant results if the true effect size is close to zero.
In conclusion, the clear evidence of bias and the replication failures in two large replication studies suggest that the original findings were only obtained with luck or with questionable research practices. However, naive interpretation of these results created a literature with over 200 published studies without a real effect. In this regard, ease of retrieval is akin to the ego-depletion literature that is now widely considered invalid (Inzlicht, Werner, Briskin, & Roberts, 2021).
2011 was a watershed moment in the history of social psychology. It split social psychology into two camps. One camp denies that questionable research practices undermine the validity of published results and continues to rely on published studies as credible empirical evidence (Schwarz & Strack, 2016). The other camp assumes that most published results are false positives and trusts only new studies that are published following open science practices, with badges for sharing of materials and data and ideally pre-registration.
Meta-analyses can help to find a middle ground by examining carefully whether published results can be trusted, even if some publication bias is present. To do so, meta-analyses have to take publication bias seriously. Given the widespread use of questionable practices in social psychology, we have to assume that bias is present (Schimmack, 2020). Published meta-analyses that did not properly correct for publication bias can at best provide an upper limit for effect sizes, but they cannot establish that an effect exists or that the effect size has practical significance.
Weingarten and Hutchinson (2018) tried to correct for publication bias by using the PET-PEESE approach (Stanley, 2017). This is currently the best bias-correction method, but it is by no means perfect (Hong & Reed, 2021; Stanley, 2017). Here I demonstrated one pitfall in the use of PET-PEESE. Coding of studies that does not match the bias in the original articles can obscure the amount of bias and lead to inflated effect size estimates, especially if the PET model is incorrectly rejected and the PEESE results are accepted at face value. As a result, the published effect size of r = .2 (d = .4) was dramatically inflated and new results suggest that the effect size is close to zero.
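For readers unfamiliar with the method: PET regresses observed effect sizes on their standard errors with inverse-variance weights, and its intercept estimates the effect of a hypothetical study with SE = 0; if the PET intercept differs significantly from zero, the PEESE estimate (regressing on the variance, SE squared) is used instead. The published analyses used R; this is a stdlib-only Python sketch of the two regressions, omitting the significance test of the PET intercept that governs the switch:

```python
def wls(x, y, w):
    """Weighted least squares for a one-predictor regression.
    Returns (intercept, slope)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b1 = (sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
          / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)))
    return ybar - b1 * xbar, b1


def pet_peese(d, se):
    """Return (PET intercept, PEESE intercept) for effect sizes d with
    standard errors se, using inverse-variance weights. In practice, the
    PEESE estimate is only used if the PET intercept is significantly
    different from zero; that conditional test is omitted here."""
    w = [1 / s ** 2 for s in se]
    pet_intercept, _ = wls(se, d, w)
    peese_intercept, _ = wls([s ** 2 for s in se], d, w)
    return pet_intercept, peese_intercept
```

The pitfall described above shows up directly in this setup: if the coded effect sizes do not carry the small-study bias of the original articles, the slope on SE flattens, PET is rejected, and the (larger) PEESE intercept is accepted at face value.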
I also showed in a z-curve analysis that the false positive risk for published ease-of-retrieval studies is high because the expected discovery rate is low and the file drawer of unpublished studies is large. To reduce the false positive risk, I recommend adjusting the significance level to alpha = .005, which is consistent with other calls for more stringent criteria to claim discoveries (Benjamin et al., 2017). Based on this criterion, neither the original studies nor any direct replications of the original studies were significant. A few variations of the paradigm may have produced real effects, but pre-registered replication studies are needed to examine this question. For now, ease of retrieval is a theory without credible evidence.
For many social psychologists, these results are shocking and hard to believe. However, the results are by no means unique to the ease-of-retrieval literature. It has been estimated that only 25% to 50% of published results in social psychology can be replicated (Schimmack, 2020). Other large literatures such as implicit priming, ego-depletion, and facial feedback have also been questioned by rigorous meta-analyses and large replication studies.
For methodologists, the replication crisis in psychology is not a surprise. They have warned for decades that selection for significance renders significant results insignificant (Sterling, 1959) and that sample sizes are too small (Cohen, 1962). To avoid similar mistakes in the future, researchers should conduct continuous power analyses and bias tests. As demonstrated here for the assertiveness paradigm, bias tests ring the alarm bells from the start and continue to show bias. In the future, we do not need to wait 40 years to realize that researchers are chasing an elusive phenomenon. Sample sizes need to be increased or research needs to stop. Amassing a literature of 200 studies with a median sample size of N = 53 and 8% power is a mistake that should not be repeated.
Social psychologists should be the least surprised that they fooled themselves into believing their results. After all, they have demonstrated with credible studies that confirmation bias has a strong influence on human information processing. They should therefore embrace open science, bias checks, and replication studies as the necessary evils that minimize confirmation bias and make scientific progress possible.
We have never met in person or interacted professionally before we embarked on this joint project. This blog emerged from a correspondence that was sparked by one of Uli’s posts in the Replicability-Index blog. We have carried the conversational nature of that correspondence over to the blog itself.
Bill: Uli, before we dive into the topic at hand, can you provide a super brief explanation of how the Replicability Index works? Just a few sentences for those who might be new to your blog and don’t understand how one can assess the replicability of the findings from a single paper.
Uli: The replicability of a finding is determined by the true power of a study, and true power depends on the sample size and the population effect size. We have to estimate replicability because the population effect size is unknown, but studies with higher power are more likely to produce smaller p-values. We can convert p-values into a measure of observed power. For a single statistical test this estimate is extremely noisy, but it is the best guess we can make. So, a result with a p-value of .05 (50% observed power) is less likely to replicate than a result with a p-value of .005 (80% observed power). A value of 50% may still look good, but observed power is inflated if we condition on significance. To correct for this inflation, the R-Index subtracts the difference between the success rate and observed power from observed power. For a single test, the success rate is 1 (100%) because a significant result was observed. This means that observed power of 50% produces an R-Index of 50% – (100% – 50%) = 0. In contrast, 80% power still produces an R-Index of 80% – (100% – 80%) = 60%. The main problem is that observed p-values are highly variable. It is therefore better to use several p-values and to compute the R-Index based on average power. For larger sets of studies, a more sophisticated method called z-curve can produce actual estimates of power. It can also be used to estimate the false positive risk. Sorry if this is not super brief.
Bill: That wasn’t super brief, but it was super informative. It does raise a worry in me, however. Essentially, your formula stipulates that every p-value of .05 is inherently unreliable. Do we have any empirical evidence that the true replicability of p = .05 is functionally zero?
Uli: The inference is probabilistic. Sometimes p = .05 will occur with high power (80%). However, empirically we know that p = .05 more often occurs with low power. The Open Science Collaboration project showed that results with p > .01 rarely replicated, whereas results with p < .005 replicated more frequently. Thus, in the absence of other information, it is rational to bet on replication failures when the p-value is just or marginally significant.
Bill: Good to know. I’ll return to this “absence of other information” issue later in our blog, but in the meantime, back to our story…I was familiar with your work prior to this year, as Steve Lindsay kept the editorial staff at Psychological Science up to date with your evaluations of the journal. But your work was made much more personally relevant on January 19, when you wrote a blog on “Personalized p-values for social/personality psychologists.”
Initially, I was curious if I would be on the list, hoping you had taken the time to evaluate my work so that I could see how I was doing. Your list was ordered from the highest replicability to lowest, so when I hadn’t seen my name by the half-way point, my curiosity changed to trepidation. Alas, there I was – sitting very near the bottom of your list with one of the lowest replicability indices of all the social psychologists you evaluated.
I was aghast. I had always thought we maintained good data practices in our lab: We set a desired N at the outset and never analyzed the data until we were done (having reached our N or occasionally run out of participants); we followed up any unexpected findings to be sure they replicated before reporting them, etc. But then I thought about the way we used to run the lab before the replication crisis:
- We never reported experiments that failed to support our hypotheses, but rather tossed them in the file drawer and tried again.
- When an effect had a p value between .10 and the .05 cut-off, we tried various covariates/control variables to see if they would push our effect over that magical line. Of course we reported the covariates, but we never reported their ad-hoc nature – we simply noted that we included them.
- We typically ran studies that were underpowered by today’s standards, which meant that the effects we found were bouncy and could easily be false positives.
- When we found an effect on one set of measures and not another, sometimes we didn’t report the measures that didn’t work.
The upshot of these reflections was that I emailed you to get the details on my individual papers to see where the problems were (which led to my next realization; I doubt I would have bothered contacting you if my numbers had been better). So here’s my first question: How many people have contacted you about this particular blog and is there any evidence that people are taking it seriously?
Uli: I have been working on replicability for 10 years now. The general response to my work is to ignore it with the justification that it is not peer-reviewed. I also recall only two requests. One to evaluate a department and one to evaluate an individual. However, the R-Index analysis is easy to do and Mickey Inzlicht published a blog post about his self-analysis. I don’t know how many researchers have evaluated their work in private. It is harder to evaluate how many people take my work seriously. The main indicator of impact is the number of views of my blogs which has increased from 50,000 in 2015 to 250,000 in 2020. The publication of the z-curve package for R also has generated interest among researchers to conduct their own analyses.
Bill: That’s super impressive. I don’t think my entire body of research has been viewed anywhere near 250K times.
OK, once again, back to our story. When you sent me the data file on my papers, initially I was unhappy that you only used a subset of my empirical articles (24 of the 70 or so empirical papers I’ve published) and that your machine coding had introduced a bit of error into the process. But we decided to turn this into a study of sorts, so we focused on those 24 papers and the differences that would emerge as a function of machine vs. hand-coding and as a function of how many stats we pulled out of each experiment (all the focal statistics vs. just one stat for each experiment). Was that process useful for you? If so, what did you learn from it?
Uli: First of all, I was pleased that you engaged with the results. More important, I was also curious how the results would compare to hand-coding. I had done some comparisons for other social psychologists and I had some confidence in the results to post them, but I am aware that the method is not flawless and can produce misleading results in individual cases. I am also aware that my own hand-coding can be biased. So, for you to offer to do your own coding was a fantastic opportunity to examine the validity of my results.
Bill: Great. I’m still a little unsure what I’ve learned from this particular proctology exam, so let’s see what we can figure out here. If you’ll humor me, let’s start with our paper that has the lowest replicability index in your subsample – no matter which way we calculate it, we find less than a 50% chance that it will replicate. It was published in 2005 in PSPB and took an evolutionary approach on grandparenting. Setting aside the hypothesis, the relevant methods were as follows:
- We recruited all of the participants who were available that year in introductory psychology, so our N was large, but determined by external constraints.
- We replicated prior findings that served as the basis of our proposed effect.
- The test of our new hypothesis yielded a marginally significant interaction (F(1, 412) = 2.85, p < .10). In decomposing the interaction, we found a simple effect where it was predicted (F(1, 276) = 5.92, p < .02) and no simple effect where it wasn’t predicted (F < 1, ns. – *apologies for the imprecise reporting practices).
Given that: 1) we didn’t exclude any data (a poor practice we sometimes engaged in by reporting some measures and not others, but not in this paper), 2) we didn’t include any ad-hoc control variables (a poor practice we sometimes engaged in, but not in this paper), 3) we didn’t run any failed studies that were tossed out (a poor practice we regularly engaged in, but not in this paper), and 4) we reported the a priori test of our hypothesis exactly as planned…what are we to conclude from the low replicability index? Is the only lesson here that marginally significant interactions are highly unlikely to replicate? What advice would you have given me in 2004, if I had shown you these data and said I wanted to write them up?
Uli: There is a lot of confusion about research methods, the need for preregistration, and the proper interpretation of results. First, there is nothing wrong with the way you conducted the study. The problems arise when the results are interpreted as a successful replication of prior studies. Here is why. First, we do not know whether prior studies used questionable research practices and reported inflated effect sizes. Second, the new findings are reported without information about effect sizes. What we really would like to know is the confidence interval around the predicted interaction effect, which would be the difference in the effect sizes between the two conditions. With a p-value greater than .05, we know that the 95%CI includes a value of 0. So, we cannot reject the hypotheses that the two conditions differ at that level of confidence. We can increase uncertainty in the conclusion by using a 90% or 80% confidence interval, but we still would want to know what effect sizes we can reject. It would also be important to specify what effect sizes would be considered too small to warrant a theory that predicts this interaction effect. Finally, the results suggest that the sample size of about 400 participants was still too small to have good power to detect and replicate the effect. A conclusive study would require a larger sample.
Bill: Hmm, very interesting. But let me clarify one thing before we go on. In this study, the replication of prior effects that I mentioned wasn’t the marginal interaction that yielded the really low replicability index. Rather, it was a separate main effect, whereby participants felt closest to their mother’s mother, next to their mother’s father, next to their father’s mother, and last to their father’s father. The pairwise comparisons were as follows: “participants felt closer to mothers’ mothers than mothers’ fathers, F(1,464) = 35.88, p < .001, closer to mothers’ fathers than fathers’ mothers, F(1, 424) = 3.96, p < .05, and closer to fathers’ mothers than fathers’ fathers, F(1, 417) = 4.88, p < .03.”
We were trying to build on that prior effect by explaining the difference in feelings toward father’s mothers and mothers’ fathers, and that’s where we found the marginal interaction (which emerged as a function of a third factor that we had hypothesized would moderate the main effect).
I know it’ll slow things down a bit, but I’m inclined to take your advice and rerun the study with a larger sample, as you’ve got me wondering whether this marginal interaction and simple effect are just random junk or meaningful. We could run the study with one or two thousand people on Prolific pretty cheaply, as it only involves a few questions.
Shall we give it a try before we go on? In the spirit of both of us trying to learn something from this conversation, you could let me know what sample size would satisfy you as giving us adequate power to attempt a replication of the simple effect that I’ve highlighted in red above. I suspect that the sample size required to have adequate power for a replication of the marginal interaction would be too expensive, but I believe a study that is sufficiently powered to detect that simple effect will reveal an interaction of at least the magnitude we found in that paper (as I still believe in the hypothesis we were testing).
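As a rough guide to the sample-size question: if the highlighted simple effect is the mothers’-fathers vs. fathers’-mothers comparison, F(1, 424) = 3.96 with roughly 426 participants implies a standardized difference of about d = .19 (using the approximation d ≈ 2√(F/N)). Treating it as a between-subjects comparison under the normal approximation, 80% power at alpha = .05 then needs on the order of 420 participants per group, i.e., roughly 850 in total; a within-subject replication, as in the original design, would need fewer. A sketch, with the effect-size conversion explicitly an approximation:

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Per-group n for a two-sample comparison of a standardized mean
    difference d (normal approximation, two-sided alpha)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


# Approximate d implied by F(1, 424) = 3.96 with roughly N = 426 participants.
d = 2 * sqrt(3.96 / 426)          # about 0.19
required_n = n_per_group(d)       # roughly 420 per group
```

The familiar benchmark checks out: for d = 0.5 the same formula gives about 63 per group, close to the textbook value of 64.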
If that suits you, I’m happy to post this opener on your blog and then return in a few weeks with the results of the replication effort and the goal of completing our conversation.
Uli: Testing our different intuitions about this particular finding with empirical data is definitely interesting, but I am a bit puzzled about the direction this discussion has taken. It is surely interesting to see whether this particular finding is real and can be replicated. Let’s assume for the moment that it does. This unfortunately, increases the chances that some of the other studies in the z-curve are even less likely to be replicated because there is clear evidence of selection bias and a low probability of replication. Think about it as an urn with 9 red and 1 green marble. Red ones do not replicate and green ones do replicate. After we pick the green marble on the first try, there are only red marbles left.
One of the biggest open questions is what researchers actually did to get too many significant results. We have a few accounts of studies with non-significant results that were dropped and anonymous surveys show that a variety of questionable research practices were used. Even though these practices are different from fraud and may have occurred without intent, researchers have been very reluctant to talk about the mistakes they made in the past. Carney walked away from power posing by admitting to the use of statistical shortcuts. I wonder whether you can tell us a bit more about the practices that led to the low EDR estimate for your focal tests. I know it is a big ask, but I also know that young social psychologists would welcome open disclosure of past practices. As Mickey Inzlicht always tells me “Hate the sin. Love the sinner.” As my own z-curve shows, I also have a mediocre z-curve and I am currently in the process of examining my past articles to see which ones I still believe and which ones I no longer believe.
Bill: Fair question Uli – I’ve made more mistakes than I care to remember! But (at least until your blog post) I’ve comforted myself in the belief that peer-review corrected most of them and that the work I’ve published is pretty solid. So forgive me for banging on for so long, but I have a two-part answer to your question. Part 1 refers back to your statement above, about your work being “in the absence of other information”, and also incorporates your statement above about red and green marbles in an urn. And Part 2 builds on Part 1 by digging through my studies with low Replicability indices and letting you know whether (and if so where) I think they were problematic.
Part 1: “In the absence of other information” is a really important caveat. I understand that it’s the basis of your statistical approach, but of course research isn’t conducted in the absence of other information. In my own case, some of my hypotheses were just hunches about the world, based on observations or possible links between other ideas. I have relatively little faith in these hypotheses and have abandoned them frequently in the face of contrary or inconsistent evidence. But some of my hypotheses are grounded in a substantial literature or prior theorizing that strike me as rock solid. The Darwinian Grandparenting paper is just such an example, and thus it seems like a perfect starting point. The logic is so straightforward and sensible that I’d be very surprised if it’s not true. As a consequence, despite the weak statistical support for it, I’m putting my money on it to replicate (and it’s just self-report, so super easy to conduct a replication online).
And this line of reasoning leads me to dispute your “red and green marbles in the urn” metaphor. Your procedure doesn’t really tell us how many marbles are in the urn of these two colors. Rather, your procedure makes a guess about the contents of the urn, and that guess intentionally ignores all other information. Thus, I’d argue that a successful or failed replication of the grandparenting paper tells us nothing at all about the probability of replicating other papers I’ve published, as I’m bringing additional information to bear on the problem by including the theoretical strength of the claims being made in the paper. In other words, I believe your procedure has grossly underestimated the replicability of this paper by focusing only on the relevant statistics and ignoring the underlying theory. That doesn’t mean your procedure has no value, but it does mean that it’s going to make predictable mistakes.
Part 2: Here I’m going to focus on papers that I first authored, as I don’t think it’s appropriate for me to raise concerns about work that other people led without involving them in this conversation. With that plan in mind, let’s start at the bottom of the replication list you made for me in your collection of 24 papers and work our way up.
- Darwinian Grandparenting – discussed above and currently in motion to replicate (Laham, S. M., Gonsalkorale, K., & von Hippel, W. (2005). Darwinian grandparenting: Preferential investment in more certain kin. Personality and Social Psychology Bulletin, 31, 63-72.)
- The Chicken-Foot paper – I love this paper but would never conduct it that way now. The sample was way too small and the paper only allowed for a single behavioral DV, which was how strongly participants reacted when they were offered a chicken foot to eat. As a consequence, it was very under-powered. Although we ran that study twice, first as a pilot study in an undergraduate class with casual measurement and then in the lab with hidden cameras, and both studies “worked”, the first one was too informal and the second one was too small and would never be published today. Do I believe it would replicate? The effect itself is consistent with so many other findings that I continue to believe in it, but I would never place my money on replicating this particular empirical demonstration without a huge sample to beat down the inevitable noise (which must have worked in our favor the first time).
(von Hippel, W., & Gonsalkorale, K. (2005). “That is bloody revolting!” Inhibitory control of thoughts better left unsaid. Psychological Science, 16, 497-500.)
- Stereotyping Against your Will – this paper was wildly underpowered, but I think its low R-index reflects the fact that in our final data set you asked me to choose just a single statistic for each experiment. In this study there were a few key findings with different measures and they all lined up as predicted, which gave me a lot more faith in it. Since its publication 20 years ago, we (and others) have found evidence consistent with it in a variety of different types of studies. I think we’ve failed to find the predicted effect in one or maybe two attempts (which ended up in the circular file, as all my failed studies did prior to the replication crisis), but all other efforts have been successful and are published. When we included all the key statistics from this paper in our Replicability analysis, it has an R-index of .79, which may be a better reflection of the reliability of the results.
Important caveat: With all that said, the original data collection included three or four different measures of stereotyping, only one of which showed the predicted age effect. I never reported the other measures, as the goal of the paper was to see if inhibition would mediate age differences in stereotyping and prejudice. In retrospect that’s clearly problematic, but at the time it seemed perfectly sensible, as I couldn’t mediate an effect that didn’t exist. On the positive side, the experiment included only two measures of prejudice, and both are reported in the paper.
(von Hippel, W., Silver, L. A., & Lynch, M. E. (2000). Stereotyping against your will: The role of inhibitory ability in stereotyping and prejudice among the elderly. Personality and Social Psychology Bulletin, 26, 523-532.)
- Inhibitory effect of schematic processing on perceptual encoding – given my argument above that your R-index makes more sense when we include all the focal stats from each experiment, I’ve now shifted over to the analysis you conducted on all of my papers, including all of the key stats that we pulled out by hand (ignoring only results with control variables, etc.). That analysis yields much stronger R-indices for most of my papers, but there are still quite a few that are problematic. Sadly, this paper is the second from the bottom on my larger list. I say sadly because it’s my dissertation. But…when I reflect back on it, I remember numerous experiments that failed. I probably ran two failed studies for each successful one. At the time, no one was interested in them, and it didn’t occur to me that I was engaging in poor practices when I threw them in the bin. The main conclusion I came to when I finished the project was that I didn’t want to work on it anymore as it seemed like I spent all my time struggling with methodological details trying to get the experiments to work. Maybe each successful study was the one that found just the right methods and materials (as I thought at the time), but in hindsight I suspect not. And clearly the evidentiary value for the effect is functionally zero if we collapse across all the studies I ran. With that said, the key finding followed from prior theory in a pretty straightforward manner and we later found evidence for the proposed mechanism (which we published in a follow-up paper*). I guess I’d conclude from all this that if other people have found the effect since then, I’d believe in it, but I can’t put any stock in my original empirical demonstration.
(von Hippel, W., Jonides, J., Hilton, J. L., & Narayan, S. (1993). Inhibitory effect of schematic processing on perceptual encoding. Journal of Personality and Social Psychology, 64, 921-935.
*von Hippel, W., & Hawkins, C. (1994). Stimulus exposure time and perceptual memory. Perception and Psychophysics, 56, 525-535.)
- The Linguistic Intergroup Bias (LIB) as an Indicator of Prejudice. This is the only other paper on which I was first author that gets an R-index of less than .5 when you include all the focal stats in the analysis. I have no doubt it’s because, like all of my work at the time, it was wildly under-powered and the effects weren’t very strong. Nonetheless, we’ve used the LIB many times since, and although we haven’t found the predicted results every time, I believe it works pretty reliably. Of course, I could easily be wrong here, so I’d be very interested if any readers of this blog have conducted studies using the LIB as an indicator of prejudice, and if so, whether it yielded the predicted results.
(von Hippel, W., Sekaquaptewa, D., & Vargas, P. (1997). The Linguistic Intergroup Bias as an implicit indicator of prejudice. Journal of Experimental Social Psychology, 33, 490-509.)
- All my articles published in the last ten years with a low R-index – Rather than continuing to torture readers with the details of each study, in this last section I’ve gone back and looked at all my papers published in the last 10 years with an R-index less than .70 based on all the key statistics (not just a single stat for each experiment). This exercise yields 9 out of 25 empirical papers with an R-index ranging from .30 to .62 (with five other papers for which an R-index apparently couldn’t be calculated). The evidentiary value of these 9 papers is clearly in doubt, despite the fact that they were published at a time when we should have known better. So what’s going on here? Six of them were conducted on special samples that are expensive to run or incredibly difficult to recruit (e.g., people who have suffered a stroke, people who inject drugs, studies in an fMRI scanner), and as a result they are all underpowered. Perhaps we shouldn’t be doing that work, as we don’t have enough funding in our lab to run the kind of sample sizes necessary to have confidence in the small effects that often emerge. Or perhaps we should publish the papers anyway, and let readers decide if the effects are sufficiently meaningful to be worthy of further investigation. I’d be curious to hear your thoughts on this, Uli. Of the remaining three papers, one reports all four experiments we ran prior to publication, but it has since proven difficult to replicate and I have my doubts about it (let’s call that a clear predictive win for the R-Index). Another is well powered and largely descriptive without much hypothesis testing, and I’m not sure an R-index makes sense for it. And the last one is underpowered (despite being run on undergraduates), so we clearly should have done better.
What do I conclude from this exercise? A consistent theme in our findings with a low R-index is that they have small sample sizes and report small effects. Some of those probably reflect real findings, but others probably don’t. I suspect the single greatest threat to their validity (beyond the small sample sizes) was the fact that until very recently we never reported experiments that failed. In addition, sometimes we didn’t report measures we had gathered if they didn’t work out as planned, and sometimes we added control variables into our equations in an ad-hoc manner. Failed experiments, measures that don’t work, and impactful ad-hoc controls are all common in science and reflect the fact that we learn what we’re doing as we go. But the capacity for other people to evaluate the work and its evidentiary value is heavily constrained when we don’t report those decisions. In retrospect, I deeply regret placing a greater emphasis on telling a clear story than on telling a transparent and complete story.
Has this been a wildly self-serving tour through the bowels of a social psychology lab whose R-index is in the toilet? Research on self-deception suggests you should probably decide for yourself.
Uli: Thank you for your candid response. I think for researchers our age (not sure really how old you are) it will be easier to walk away from some articles published in the anything-goes days of psychological science because we still have time to publish some new and better work. As time goes on, it may become easier for everybody to acknowledge mistakes and become less defensive. I hope that your courage and our collaboration encourage more people to realize that the value of a researcher is not measured in terms of number of publications or citations. Research is like art and not every van Gogh is a masterpiece. We are lucky if we make at least one notable contribution to our science. So, losing a few papers to replication failures is normal. Let’s see what the results of the replication study will show.
Bill: I couldn’t agree with you more! (Except for the ‘courage’ part; I’m trepidatious as hell that my work won’t be taken seriously [or funded] anymore. But so be it.) I’ll be in touch as soon as we get ethics approval and run our replication study…
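For readers unfamiliar with the metric discussed above: the R-Index combines the median observed power of a set of results with the rate of significant results, subtracting the inflation of the success rate above observed power as a penalty. Below is a minimal, standard-library-only sketch of that computation; the actual analyses behind the Personalized P-Values use more sophisticated tooling (z-curve), so treat this only as an illustration of the basic idea.

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def observed_power(z: float, crit: float = 1.96) -> float:
    """Post-hoc power of a two-sided z-test, treating |z| as the true effect."""
    return (1 - norm_cdf(crit - abs(z))) + norm_cdf(-crit - abs(z))

def r_index(z_values, crit: float = 1.96) -> float:
    """R-Index = median observed power minus the inflation of the success rate
    (success rate minus median observed power)."""
    powers = sorted(observed_power(z, crit) for z in z_values)
    n = len(powers)
    mop = powers[n // 2] if n % 2 else (powers[n // 2 - 1] + powers[n // 2]) / 2
    success_rate = sum(abs(z) > crit for z in z_values) / n
    return mop - (success_rate - mop)

# A paper full of just-significant results earns a low R-Index:
print(round(r_index([1.9, 2.0, 2.1, 2.2]), 2))  # → 0.32
```

A result exactly at the significance threshold has an observed power of only about 50%, so a set of just-significant p-values signals a significant-result rate well above what the data could reliably deliver.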
This blog post was inspired by my experience of receiving a rejection of a replication manuscript. We replicated Diener et al.’s (1995) JPSP article on the personality structure of affect. For the most part, it was a successful replication and a generalization to non-student samples. The manuscript was desk rejected because the replication study was not close enough to the original in terms of the items and methods that we used. I was shocked that JPSP would reject replication studies, which made me wonder what the criteria for acceptance are.
In 2015, JPSP started to publish online-only replication articles. I examined what these articles have revealed about the replicability of articles published in JPSP before 2015. There were only 21 such articles published between 2015 and 2020. Only 7 of these articles reported replications of JPSP articles; one included replications of 3 articles. Out of these 9 replications, 6 were successful and 3 were failures. This finding shows once more that psychologists do everything in their power to appear trustworthy without doing the things that are required to gain or regain trust. While fabulous review articles tout the major reforms that have been made (Nelson et al., 2019), the reality is often much less glamorous. It remains unclear which articles in JPSP can be trusted, and selection for significance may undermine the value of self-replications in JPSP.
The past decade has revealed a replication crisis in social psychology. First, Bargh’s famous elderly walking study did not replicate. Then, only 14 out of 55 (25%) significant results could be replicated in an investigation of the replicability of social psychology (Open Science Collaboration, 2015). While some social psychologists tried to dismiss this finding, additional evidence further confirmed that social psychology has a replication crisis (Motyl et al., 2017). A statistical method that corrects for publication bias and other questionable research practices estimates a replication rate of 43% (Schimmack, 2020). This estimate was replicated with a larger dataset of the most cited articles by eminent social psychologists (49%; Schimmack, 2021). However, the statistical estimates assume that it is possible to replicate studies exactly, whereas actual replication studies are often conceptual replications that vary in some attributes. Most often, the populations of the original and replication studies differ. Due to regression to the mean, effect sizes in replication studies are likely to be weaker. Thus, the statistical estimates are likely to overestimate the success rate of actual replication studies (Bartos & Schimmack, 2021). Thus, 49% is an upper limit, and we can currently conclude that the actual replication rate is somewhere between 25% and 50%. This is also consistent with analyses of statistical power in social psychology (Cohen, 1962; Sedlmeier & Gigerenzer, 1989).
There are two explanations for the emergence of replication failures in the past decade. One explanation is that social psychologists simply did not realize the importance of replication studies and forgot to replicate their findings. They only learned about the need to replicate findings in 2011, and when they started conducting replication studies, they realized that many of their findings are not replicable. Consistent with this explanation, Nelson, Simmons, and Simonsohn (2019) report that out of over 1,000 curated replication attempts, 96% have been conducted since 2011. The problem with this explanation is that it is not true. Psychologists have conducted replication studies since the beginning of their science. Since the late 1990s, many articles in social psychology reported at least two and sometimes many more conceptual replication studies. Bargh reported two close replications of his elderly priming study in an article with four studies (Bargh et al., 1996).
The real reason for the replication crisis is that social psychologists selected studies for significance (Motyl et al., 2017; Schimmack, 2021; Sterling, 1959; Sterling et al., 1995). As a result, only replication studies with significant results were published. What changed in 2011 is that researchers suddenly were able to circumvent censorship at traditional journals and publish replication failures in new journals that were less selective, which, in this case, was a good thing (Doyen, Klein, Pichon, & Cleeremans, 2012; Ritchie, Wiseman, & French, 2012). The problem with this explanation is that it is true, but it makes psychological scientists look bad. Even undergraduate students with little formal training in philosophy of science realize that selective publishing of successful studies is inconsistent with the goal of searching for the truth (Ritchie, 2021). However, euphemistic descriptions of the research culture before 2011 avoid mentioning questionable research practices (Weir, 2015) or describe these practices as honest (Nelson et al., 2019). Even suggestions that these practices were at best honest mistakes are often met with hostility (Fiske, 2016). Rather than cleaning up the mess that has been created by selection for significance, social psychologists avoid discussion of their practices to hide replication failures. As a result, not much progress has been made in vetting the credibility of thousands of published articles that provide no empirical support for their claims because most of these results might not replicate.
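The consequences of selection for significance can be illustrated with a small simulation (the numbers here are hypothetical and chosen only for illustration): when modestly powered studies are run in large numbers and only the significant ones are published, the published literature shows a 100% success rate, while exact replications of those same published studies succeed only at the rate of the true power.

```python
import math
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical parameters: a modest true effect studied with small samples.
d, n, k = 0.3, 50, 100_000            # true effect size, per-group n, studies run
ncp = d * math.sqrt(n / 2)            # expected z of a two-sample test

z_orig = rng.normal(ncp, 1.0, k)      # z values of all studies actually run
published = z_orig > 1.96             # selection for significance (one-tailed for simplicity)

# Exact, independent replications of only the published studies
z_rep = rng.normal(ncp, 1.0, published.sum())

true_power = 1 - 0.5 * (1 + math.erf((1.96 - ncp) / math.sqrt(2)))
print(f"true power of each study:      {true_power:.2f}")                        # ~0.32
print(f"success rate in print:         {(z_orig[published] > 1.96).mean():.2f}")  # 1.00 by construction
print(f"replication rate of published: {(z_rep > 1.96).mean():.2f}")              # ~ true power, not 1.00
```

The gap between the published success rate and the replication rate is exactly the inflation that selection for significance produces; no fraud is needed, only a file drawer.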
In short, social psychology suffers from a credibility crisis. The question is what social psychologists can do to restore credibility and to regain trust in their published results. For new studies this can be achieved by avoiding the pitfalls of the past. For example, studies can be pre-registered and journals may accept articles before the results are known. But what should researchers, teachers, students, and the general public do with the thousands of published results?
One solution to this problem is to conduct replication studies of published findings and to publish the results of these studies whether they are positive or negative. In their fantastic (i.e., imaginative or fanciful; remote from reality) review article, Nelson et al. (2019) proclaim that “top journals [are] routinely publishing replication attempts, both failures and successes” (p. 512). That would be wonderful, if it were true, but top journals are considered top journals because they are highly selective in what they publish and they have limited journal space. So, every replication study competes with an article that has an intriguing, exciting, and groundbreaking new discovery. Editors would need superhuman strength to resist the temptation to publish the sexy new finding and instead publish a replication of an article from 1977 or 1995. Surely, there are specialized journals for this laudable effort, which makes an important contribution to science but unfortunately does not meet the high threshold of a top journal that has to maintain its status as a top journal.
The Journal of Personality and Social Psychology found an ingenious solution to this problem. To avoid competition with groundbreaking new research, replication studies can be published in the journal, but only online. Thus, these extra articles do not count towards the limited page numbers that are needed to ensure high profit margins for predatory (i.e., for-profit) publishers. Here, I examined what articles JPSP has published as online-only publications.
The first e-only replication study was published in 2015. Over the past five years, JPSP has published 21 articles as e-replications (not counting 2021).
In the years from 1965 to 2014, JPSP published 9,428 articles. Thus, the 21 replication articles provide new, credible evidence for 21/9,428 = 0.22% of the articles that were published before 2015, when selection bias undermined the credibility of the evidence in these articles. Despite the small sample size, it is interesting to examine the nature and the outcome of the studies reported in these 21 articles.
Eschleman, K. J., Bowling, N. A., & Judge, T. A. (2015). The dispositional basis of attitudes: A replication and extension of Hepler and Albarracín (2013). Journal of Personality and Social Psychology, 108(5), e1–e15. https://doi.org/10.1037/pspp0000017
Hepler, J., & Albarracín, D. (2013). Attitudes without objects: Evidence for a dispositional attitude, its measurement, and its consequences. Journal of Personality and Social Psychology, 104(6), 1060–1076. https://doi.org/10.1037/a0032282
The original authors introduced a new measure called the Dispositional Attitude Measure (DAM). The replication study was designed to examine whether the DAM shows discriminant validity compared to an existing measure, the Neutral Objects Satisfaction Questionnaire (NOSQ). The replication studies replicated the previous findings, but also suggested that DAM and NOSQ are overlapping measures of the same construct. If we focus narrowly on replicability, this replication study is a success.
Van Dessel, P., De Houwer, J., Roets, A., & Gast, A. (2016). Failures to change stimulus evaluations by means of subliminal approach and avoidance training. Journal of Personality and Social Psychology, 110(1), e1–e15. https://doi.org/10.1037/pspa0000039
This article failed to replicate evidence, reported by Kawakami et al. (2007), that subliminal stimuli can change evaluations. Thus, this article counts as a failure.
Kawakami, K., Phills, C. E., Steele, J. R., & Dovidio, J. F. (2007). (Close) distance makes the heart grow fonder: Improving implicit racial attitudes and interracial interactions through approach behaviors. Journal of Personality and Social Psychology, 92, 957–971.
Citation counts suggest that the replication failure has reduced citations, although 4 articles already cited it in 2021.
Most worrisome, an Annual Review of Psychology chapter (editor Susan Fiske) perpetuates the idea that subliminal stimuli could reduce prejudice. “Interventions seeking to automate more positive responses to outgroup members may train people to have an “approach” response to Black faces (e.g., by pulling a joystick toward themselves when Black faces appear on a screen; see Kawakami et al. 2007)” (Paluck, Porat, Clark, & Green, 2021, p. 543). The chapter does not cite the replication failure.
Rieger, S., Göllner, R., Trautwein, U., & Roberts, B. W. (2016). Low self-esteem prospectively predicts depression in the transition to young adulthood: A replication of Orth, Robins, and Roberts (2008). Journal of Personality and Social Psychology, 110(1), e16–e22. https://doi.org/10.1037/pspp0000037
The original article used a cross-lagged panel model to claim that low self-esteem causes depression (rather than depression causing low self-esteem).
Orth, U., Robins, R. W., & Roberts, B. W. (2008). Low self-esteem prospectively predicts depression in adolescence and young adulthood. Journal of Personality and Social Psychology, 95(3), 695–708.
The replication study showed the same results. In this narrow sense it is a success.
The same year, JPSP also published an “original” article that showed the same results.
Orth, U., Robins, R. W., Meier, L. L., & Conger, R. D. (2016). Refining the vulnerability model of low self-esteem and depression: Disentangling the effects of genuine self-esteem and narcissism. Journal of Personality and Social Psychology, 110(1), 133–149. https://doi.org/10.1037/pspp0000038
Last year, the authors published a meta-analysis of 10 studies that all consistently show the main result.
Orth, U., Clark, D. A., Donnellan, M. B., & Robins, R. W. (2021). Testing prospective effects in longitudinal research: Comparing seven competing cross-lagged models. Journal of Personality and Social Psychology, 120(4), 1013-1034. http://dx.doi.org/10.1037/pspp0000358
The high replicability of the key finding in these articles is not surprising because it is a statistical artifact (Schimmack, 2020). The authors also knew about this because I told them as a reviewer when their first manuscript was under review at JPSP, but neither the authors nor the editor seemed to care. In short, statistical artifacts are highly replicable.
Davis, D. E., Rice, K., Van Tongeren, D. R., Hook, J. N., DeBlaere, C., Worthington, E. L., Jr., & Choe, E. (2016). The moral foundations hypothesis does not replicate well in Black samples. Journal of Personality and Social Psychology, 110(4), e23–e30. https://doi.org/10.1037/pspp0000056
The main focus of this “replication” article was to test the generalizability of the key finding in Graham, Haidt, and Nosek’s (2009) original article to African Americans. They also examined whether the results replicate in White samples.
Graham, J., Haidt, J., & Nosek, B. A. (2009). Liberals and conservatives rely on different sets of moral foundations. Journal of Personality and Social Psychology, 96(5), 1029–1046. https://doi.org/10.1037/a0015141
Study 1 found weak evidence that the relationship between political conservatism and authority differs across racial groups, beta = .25 vs. beta = .47, χ²(1) = 3.92, p = .048. Study 2 replicated this finding, beta = .43 vs. beta = .00, χ²(1) = 7.04, but the p-value (.008) was still above .005. While stronger evidence for the moderator effect of race is needed, the study counts as a successful replication of the relationship among White or predominantly White samples.
Crawford, J. T., Brandt, M. J., Inbar, Y., & Mallinas, S. R. (2016). Right-wing authoritarianism predicts prejudice equally toward “gay men and lesbians” and “homosexuals”. Journal of Personality and Social Psychology, 111(2), e31–e45. https://doi.org/10.1037/pspp0000070
This article reports replication studies, but the original studies were not published in JPSP. Thus, the results provide no information about the replicability of JPSP.
Rios, K. (2013). Right-wing authoritarianism predicts prejudice against “homosexuals” but not “gay men and lesbians.” Journal of Experimental Social Psychology, 49, 1177–1183. http://dx.doi.org/10.1016/j.jesp.2013.05.013
Panero, M. E., Weisberg, D. S., Black, J., Goldstein, T. R., Barnes, J. L., Brownell, H., & Winner, E. (2016). Does reading a single passage of literary fiction really improve theory of mind? An attempt at replication. Journal of Personality and Social Psychology, 111(5), e46–e54. https://doi.org/10.1037/pspa0000064
I excluded this article because it did not replicate a JPSP article. The original article was published in Science. Thus, the outcome of this replication study tells us nothing about the replicability of articles published in JPSP.
Twenge, J. M., Carter, N. T., & Campbell, W. K. (2017). Age, time period, and birth cohort differences in self-esteem: Reexamining a cohort-sequential longitudinal study. Journal of Personality and Social Psychology, 112(5), e9–e17. https://doi.org/10.1037/pspp0000122
This article challenges the conclusions of the original article and presents new analyses using the same data. Thus, it is not a replication study.
Gebauer, J. E., Sedikides, C., Schönbrodt, F. D., Bleidorn, W., Rentfrow, P. J., Potter, J., & Gosling, S. D. (2017). The religiosity as social value hypothesis: A multi-method replication and extension across 65 countries and three levels of spatial aggregation. Journal of Personality and Social Psychology, 113(3), e18–e39. https://doi.org/10.1037/pspp0000104
This article is a successful self-replication of an article by the first two authors. The original article was published in Psychological Science. Thus, it does not provide evidence about the replicability of JPSP articles.
Gebauer, J. E., Sedikides, C., & Neberich, W. (2012). Religiosity, social self-esteem, and psychological adjustment: On the cross-cultural specificity of the psychological benefits of religiosity. Psychological Science, 23, 158–160. http://dx.doi.org/10.1177/0956797611427045
Siddaway, A. P., Taylor, P. J., & Wood, A. M. (2018). Reconceptualizing Anxiety as a Continuum That Ranges From High Calmness to High Anxiety: The Joint Importance of Reducing Distress and Increasing Well-Being. Journal of Personality and Social Psychology, 114(2), e1–e11. https://doi.org/10.1037/pspp0000128
This article replicates an original study published in Psychological Assessment. Thus, it does not tell us anything about the replicability of research in JPSP.
Vautier, S., & Pohl, S. (2009). Do balanced scales assess bipolar constructs? The case of the STAI scales. Psychological Assessment, 21, 187–193. http://dx.doi.org/10.1037/a0015312
Hounkpatin, H. O., Boyce, C. J., Dunn, G., & Wood, A. M. (2018). Modeling bivariate change in individual differences: Prospective associations between personality and life satisfaction. Journal of Personality and Social Psychology, 115(6), e12-e29. http://dx.doi.org/10.1037/pspp0000161
This article is a method article. The word replication does not appear once in it.
Burns, S. M., Barnes, L. N., McCulloh, I. A., Dagher, M. M., Falk, E. B., Storey, J. D., & Lieberman, M. D. (2019). Making social neuroscience less WEIRD: Using fNIRS to measure neural signatures of persuasive influence in a Middle East participant sample. Journal of Personality and Social Psychology, 116(3), e1–e11. https://doi.org/10.1037/pspa0000144
“In this study, we demonstrate one approach to addressing the imbalance by using portable neuroscience equipment in a study of persuasion conducted in Jordan with an Arabic-speaking sample. Participants were shown persuasive videos on various health and safety topics while their brain activity was measured using functional near infrared spectroscopy (fNIRS). Self-reported persuasiveness ratings for each video were then recorded. Consistent with previous research conducted with American subjects, this work found that activity in the dorsomedial and ventromedial prefrontal cortex predicted how persuasive participants found the videos and how much they intended to engage in the messages’ endorsed behaviors.”
This article reports a conceptual replication study. It uses a different population (Jordan vs. the US) and a different methodology. As the key finding did replicate, it might be considered a successful replication, but a failure could have been attributed to the differences in population and methodology. It is also not clear that a failure would have been reported. The study should have been conducted as a registered report.
Wilmot, M. P., Haslam, N., Tian, J., & Ones, D. S. (2019). Direct and conceptual replications of the taxometric analysis of type a behavior. Journal of Personality and Social Psychology, 116(3), e12–e26. https://doi.org/10.1037/pspp0000195
This article fails to replicate the claim that Type A and Type B are distinct types rather than extremes on a continuum. Thus, this article counts as a failure.
Strube, M. J. (1989). Evidence for the Type in Type A behavior: A taxometric analysis. Journal of Personality and Social Psychology, 56(6), 972–987.
It is difficult to evaluate the impact of this replication failure because the replication study was just published and the original article received hardly any citations in recent years. Overall, it has 68 citations since 1989.
Kim, J., Schlegel, R. J., Seto, E., & Hicks, J. A. (2019). Thinking about a new decade in life increases personal self-reflection: A replication and reinterpretation of Alter and Hershfield’s (2014) findings. Journal of Personality and Social Psychology, 117(2), e27–e34. https://doi.org/10.1037/pspp0000199
This article replicated an original article published in PNAS. It therefore cannot be used to examine the replicability of articles published in JPSP.
Alter, A. L., & Hershfield, H. E. (2014). People search for meaning when they approach a new decade in chronological age. Proceedings of the National Academy of Sciences of the United States of America, 111, 17066–17070. http://dx.doi.org/10.1073/pnas.1415086111
Mõttus, R., Sinick, J., Terracciano, A., Hřebíčková, M., Kandler, C., Ando, J., Mortensen, E. L., Colodro-Conde, L., & Jang, K. L. (2019). Personality characteristics below facets: A replication and meta-analysis of cross-rater agreement, rank-order stability, heritability, and utility of personality nuances. Journal of Personality and Social Psychology, 117(4), e35–e50. https://doi.org/10.1037/pspp0000202
This article replicates a previous study of personality structure with the same items and methods in a different sample. The results are a close replication. Thus, it is a success, but the study is excluded because the original study was published in 2017 and therefore does not shed light on the replicability of articles published in JPSP before 2015.
Mõttus, R., Kandler, C., Bleidorn, W., Riemann, R., & McCrae, R. R. (2017). Personality traits below facets: The consensual validity, longitudinal stability, heritability, and utility of personality nuances. Journal of Personality and Social Psychology, 112, 474–490. http://dx.doi.org/10.1037/pspp0000100
van Scheppingen, M. A., Chopik, W. J., Bleidorn, W., & Denissen, J. J. A. (2019). Longitudinal actor, partner, and similarity effects of personality on well-being. Journal of Personality and Social Psychology, 117(4), e51–e70. https://doi.org/10.1037/pspp0000211
This study is a replication and extension of a previous study that examined the influence of personality on well-being in couples. A key finding was that personality similarity explained very little variance in well-being. While evidence for the lack of an effect is important, the replication crisis is about the reporting of too many significant results. A concern could be that the prior article reported a false negative result, but the studies were based on large samples with high power to detect even small effects.
Dyrenforth, P. S., Kashy, D. A., Donnellan, M. B., & Lucas, R. E. (2010). Predicting relationship and life satisfaction from personality in nationally representative samples from three countries: The relative importance of actor, partner, and similarity effects. Journal of Personality and Social Psychology, 99, 690–702. http://dx.doi.org/10.1037/a0020385
Buttrick, N., Choi, H., Wilson, T. D., Oishi, S., Boker, S. M., Gilbert, D. T., Alper, S., Aveyard, M., Cheong, W., Čolić, M. V., Dalgar, I., Doğulu, C., Karabati, S., Kim, E., Knežević, G., Komiya, A., Laclé, C. O., Ambrosio Lage, C., Lazarević, L. B., . . . Wilks, D. C. (2019). Cross-cultural consistency and relativity in the enjoyment of thinking versus doing. Journal of Personality and Social Psychology, 117(5), e71–e83. https://doi.org/10.1037/pspp0000198
This article mainly aims to examine the cross-cultural generality of a previous study by Wilson et al. (2014). Moreover, the original study was published in Science. Thus, it does not help to examine the replicability of research published in JPSP before 2015.
Wilson, T. D., Reinhard, D. A., Westgate, E. C., Gilbert, D. T., Ellerbeck, N., Hahn, C., . . . Shaked, A. (2014). Just think: The challenges of the disengaged mind. Science, 345, 75–77. http://dx.doi.org/10.1126/science.1250830
Yeager, D. S., Krosnick, J. A., Visser, P. S., Holbrook, A. L., & Tahk, A. M. (2019). Moderation of classic social psychological effects by demographics in the U.S. adult population: New opportunities for theoretical advancement. Journal of Personality and Social Psychology, 117(6), e84-e99. http://dx.doi.org/10.1037/pspa0000171
This article reports replications of seven classic effects and also examined whether the results are moderated by age and student status:
- Conformity to a simply presented descriptive norm (Asch, 1952; Cialdini, 2003; Sherif, 1936). [None of these references are from JPSP]
- The effect of a content-laden persuasive message on attitudes as moderated by argument quality and need for cognition (e.g., Cacioppo, Petty, & Morris, 1983). Cacioppo, J. T., Petty, R. E., & Morris, K. J. (1983). Effects of need for cognition on message evaluation, recall, and persuasion. Journal of Personality and Social Psychology, 45, 805–818.
- Base-rate underutilization (using the “lawyer/engineer” problem; Kahneman & Tversky, 1973). [Not in JPSP]
- The conjunction fallacy (using the “Linda” problem; Tversky & Kahneman, 1983). [Not in JPSP]
- Underappreciation of the law of large numbers (using the “hospital” problem; Tversky & Kahneman, 1974). [Not in JPSP]
- The false consensus effect (e.g., Ross, Greene, & House, 1977). Ross, L., Greene, D., & House, P. (1977). The “false consensus effect”: An egocentric bias in social perception and attribution processes. Journal of Experimental Social Psychology, 13, 279–301. [Not in JPSP]
- The effect of “ease of retrieval” on self-perceptions (e.g., Schwarz et al., 1991). Schwarz, N., Bless, H., Strack, F., Klumpp, G., Rittenauer-Schatka, H., & Simons, A. (1991). Ease of retrieval as information: Another look at the availability heuristic. Journal of Personality and Social Psychology, 61, 195–202.
Because the replication failure was just published, it is not possible to examine whether it had any effect on citations.
In sum, the article successfully replicated 2 JPSP articles and failed to replicate 1.
Van Dessel, P., De Houwer, J., Gast, A., Roets, A., & Smith, C. T. (2020). On the effectiveness of approach-avoidance instructions and training for changing evaluations of social groups. Journal of Personality and Social Psychology, 119(2), e1–e14. https://doi.org/10.1037/pspa0000189
This is another replication of the Kawakami et al. (2007) article, but it focuses on Experiment 1, which did not use subliminal stimuli. This article reports successful replications in Study 1, t(61) = 1.72, p = .045 (one-tailed), and Study 3, t(981) = 2.19, p = .029; t(362) = 2.76, p = .003. Thus, this article counts as a success. It should be noted, however, that these effects disappear in studies with a delay between the training and testing sessions (Lai et al., 2016).
Kawakami, K., Phills, C. E., Steele, J. R., & Dovidio, J. F. (2007). (Close) distance makes the heart grow fonder: Improving implicit racial attitudes and interracial interactions through approach behaviors. Journal of Personality and Social Psychology, 92, 957–971.
Aknin, L. B., Dunn, E. W., Proulx, J., Lok, I., & Norton, M. I. (2020). Does spending money on others promote happiness?: A registered replication report. Journal of Personality and Social Psychology, 119(2), e15–e26. https://doi.org/10.1037/pspa0000191
This article replicated a study that was published in Science. It therefore does not tell us anything about the replicability of articles published in JPSP.
Dunn, E. W., Aknin, L. B., & Norton, M. I. (2008). Spending money on others promotes happiness. Science, 319, 1687–1688. http://dx.doi.org/10.1126/science.1150952
Calderon, S., Mac Giolla, E., Ask, K., & Granhag, P. A. (2020). Subjective likelihood and the construal level of future events: A replication study of Wakslak, Trope, Liberman, and Alony (2006). Journal of Personality and Social Psychology, 119(5), e27–e37. https://doi.org/10.1037/pspa0000214
Although this article reports two replication studies (both failures), the original studies were published in a different journal. Thus, the results do not provide information about the replicability of research published in JPSP.
Wakslak, C. J., Trope, Y., Liberman, N., & Alony, R. (2006). Seeing the forest when entry is unlikely: Probability and the mental representation of events. Journal of Experimental Psychology: General, 135, 641–653. http://dx.doi.org/10.1037/0096-3445.135.4.641
Burnham, B. R. (2020). Are liberals really dirty? Two failures to replicate Helzer and Pizarro’s (2011) study 1, with meta-analysis. Journal of Personality and Social Psychology, 119(6), e38–e42. https://doi.org/10.1037/pspa0000238
Although this article reports two replication studies (both failures), the original studies were published in a different journal. Thus, the results do not provide information about the replicability of research published in JPSP.
Helzer, E. G., & Pizarro, D. A. (2011). Dirty liberals! Reminders of physical cleanliness influence moral and political attitudes. Psychological Science, 22, 517–522. http://dx.doi.org/10.1177/0956797611402514
Out of the 21 articles published under the e-replication format, only 7 report replications of studies published in JPSP before 2015. One of these articles reports replications of three original articles, and two of them report replications of different studies from the same original article (one failure, one success; Kawakami et al., 2007). Thus, there are a total of 9 replications, with 6 successes and 3 failures. This is a success rate of 67%, 95%CI = 31% to 98%.
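The post does not state how this confidence interval was computed; as a sanity check, an exact (Clopper–Pearson) interval for 6 successes out of 9 can be obtained from the beta distribution. Different interval methods give somewhat different bounds, so the numbers need not match the post's exactly. This is my own sketch, not the analysis code behind the post.

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion
    with k successes out of n trials."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

lo, hi = clopper_pearson(6, 9)
print(f"6/9 successes: {lo:.0%} to {hi:.0%}")
```

The width of this interval makes the main point regardless of method: with only 9 replications, the observed 67% success rate is compatible with anything from roughly one-third to over 90% replicability.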
The first observation is that the number of replications of studies published in JPSP is abysmally low. It is not clear why this is the case. Either researchers are not interested in conducting replication studies, or JPSP is not accepting all submitted replication studies for publication. Only the editors of JPSP know.
The second observation is that JPSP published more successful than failed replications. This is inconsistent with the results of the Open Science Collaboration project and with predictions based on statistical analyses of the p-values in JPSP articles (Open Science Collaboration, 2015; Schimmack, 2020, 2021). Although this difference may simply be sampling error, because the sample of replication studies in JPSP is so small, it is also possible that the high success rate reflects systematic factors that select for significance.
First, researchers may be less motivated to conduct studies with a low probability of success, especially in research areas that have been tarnished by the replication crisis. Who still wants to do priming studies in 2021? Thus, bad research that was published before 2015 may simply die out. The problem with this slow death model of scientific self-correction is that old studies continue to be cited as evidence. Thus, JPSP should solicit replication studies of prominent articles with high citations even if these replication studies may produce failures.
Second, it is unfortunately possible that editors at JPSP prefer to publish studies that report successful outcomes rather than replication failures. To reassure consumers of JPSP, editors should make it transparent whether replication studies are rejected and why they are rejected. Given the e-only format, it is not clear why any replication study would be rejected.
Unfortunately, the results of this meta-analysis show once more that psychologists do everything in their power to appear trustworthy without doing the things that are required to gain or regain trust. While fabulous review articles tout the major reforms that have been made (Nelson et al., 2019), the reality is often much less glamorous. Trying to get social psychologists to openly admit that they made (honest) mistakes in the past and to correct themselves is akin to getting Trump to admit that he lost the 2020 election. Most of the energy is wasted on protecting the collective self-esteem of card-carrying social psychologists in the face of objective, scientific criticism of their past and current practices. It remains unclear which results in JPSP are replicable and provide solid foundations for a science of human behavior and which results are nothing but figments of social psychologists’ imagination. Thus, I can only warn consumers of social psychological research to be as careful as they would be when trying to buy a used car. Often the sales pitch is better than the product (Ritchie, 2020; Singal, 2021).
It has been proposed that psychologists used a number of p-hacking methods to produce false positive results. In this post, I examine the prevalence of two p-hacking methods, namely the use of covariates and peeking (i.e., checking results repeatedly and collecting more data until a significant result is obtained). The evidence suggests that these strategies are not very prevalent. One explanation for this is that they are less efficient than other p-hacking methods. File-drawering of small studies and the inclusion of multiple dependent measures are more likely to be the main questionable practices that inflate effect sizes and success rates in psychology.
P-hacking refers to statistical practices that inflate effect size estimates and increase the probability of false positive results (Simonsohn et al., 2014). There are a number of questionable practices that can be used to achieve this goal. Three major practices are (a) continuing to add participants until significance is reached, (b) adding covariates, and (c) testing multiple dependent variables.
In a previous blog post, I pointed out that two of these practices are rather dumb because they require more resources than simply running many studies with small samples and putting non-significant results in the file drawer (Schimmack, 2021). The dumbest p-hacking method is to continue data collection until significance is reached. Even with sampling until N = 200, the majority of studies remain non-significant. The predicted pattern is a continuous decline in frequencies with increasing sample sizes. The second dumb strategy is to measure additional variables and use them as covariates. It is smarter to add additional variables as dependent variables.
Simonsohn et al. (2014) suggested that it is possible to detect the use of these dumb p-hacking methods by means of p-curve plots. Repeated sampling and the use of covariates produce markedly left-skewed (monotonically decreasing) p-curves. However, Schimmack (2021) noted that left-skewed p-curves are actually very rare. Figure 1 shows the p-curve for the most cited articles of 71 social psychologists (k = 2,570). The p-curve is clearly right-skewed.
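The left-skew prediction for optional stopping is easy to verify by simulation. The sketch below (my own illustration, not Simonsohn et al.'s code) repeatedly "peeks" at a one-sample z-test of a true null hypothesis, adding observations until significance is reached or N = 200. Among the resulting significant p-values, values just below .05 outnumber values below .01, the signature of a left-skewed p-curve.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_sims, start, step, max_n = 5000, 10, 5, 200

# True effect is zero: every significant result is a false positive.
x = rng.standard_normal((n_sims, max_n))
csum = np.cumsum(x, axis=1)

sig_p = []
for i in range(n_sims):
    for n in range(start, max_n + 1, step):
        z = csum[i, n - 1] / np.sqrt(n)   # one-sample z statistic at peek n
        p = 2 * norm.sf(abs(z))
        if p < .05:                       # stop at the first significant peek
            sig_p.append(p)
            break

sig_p = np.array(sig_p)
print(f"share of sims ever significant: {len(sig_p) / n_sims:.2f}")
print(f"significant p in (.04, .05):    {np.mean(sig_p > .04):.2f}")
print(f"significant p in (.00, .01):    {np.mean(sig_p < .01):.2f}")
```

The simulation also shows the alpha inflation that peeking produces: with dozens of looks, far more than 5% of null studies eventually reach significance, yet most of the resulting p-values cluster just under .05.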
I then examined p-curves for individual social psychologists (k ~ 30). The worst p-curve was flat, but not left-skewed.
The most plausible explanation for this finding is that no social psychologist tested only false hypotheses. Because studies of true effects produce right skew, p-hacking of a subset of studies cannot be detected by means of left-skewed p-curves.
I therefore examined the use of dumb p-hacking strategies in other ways. The use of covariates is easy to detect by coding whether studies used covariates or not. If researchers try multiple covariates, the chances that a result becomes significant with a covariate are higher than the chances of getting a significant result without one. Thus, we should see more results with covariates, and the frequency of studies with covariates provides some information about the prevalence of covariate hacking. I also distinguished between strictly experimental studies and correlational studies, because covariates are more likely to be used in correlational studies for valid reasons. Figure 3 shows that the use of covariates in experimental studies is fairly rare (8.6%). If researchers tried only one covariate, this would limit the number of studies that were p-hacked with covariates to 17.2%, but the true frequency is likely to be much lower because p-hacking with a single covariate barely increases the chances of a significant result.
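A small simulation illustrates why a single covariate is such an inefficient p-hacking tool: under the null hypothesis, the p-values for the group effect with and without a random covariate are highly correlated, so reporting whichever test is significant raises the success rate only slightly above the nominal 5%. This is my own sketch; sample size and number of simulations are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def p_last(y, X):
    """Two-sided OLS p-value for the last predictor in design matrix X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    df = len(y) - X.shape[1]
    s2 = resid @ resid / df
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[-1, -1])
    return 2 * stats.t.sf(abs(beta_hat[-1] / se), df)

n, sims = 40, 2000
group = np.repeat([0.0, 1.0], n // 2)
sig_plain = sig_either = 0
for _ in range(sims):
    y = rng.standard_normal(n)        # null: no true group effect
    cov = rng.standard_normal(n)      # one irrelevant covariate
    X0 = np.column_stack([np.ones(n), group])
    X1 = np.column_stack([np.ones(n), cov, group])
    p0, p1 = p_last(y, X0), p_last(y, X1)
    sig_plain += p0 < .05
    sig_either += (p0 < .05) or (p1 < .05)

print(f"significant without covariate: {sig_plain / sims:.3f}")
print(f"significant with either test:  {sig_either / sims:.3f}")
```

Adding a second dependent variable, by contrast, gives the researcher a nearly independent second draw, which is why multiple dependent measures are the more efficient questionable practice.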
To examine the use of peeking, I first plotted the histogram of sample sizes. I limited the plot to studies with N < 200 to make the distribution of small sample sizes more visible.
There is no evidence that researchers start with very small sample sizes (n = 5) and publish as soon as they get significance (cf. the simulation by Simonsohn et al., 2014), which would have produced a high frequency of studies with N = 10. The peak around N = 40 suggests that many researchers use n = 20 as a rule of thumb for the allocation of participants to cells in two-group designs. Another bump around N = 80 is explained by the same rule for the 2 x 2 designs that are popular among social psychologists. N = 100 seems to be another rule of thumb. Apart from these peaks, the distribution does show a decreasing trend, suggesting that peeking may have been used. However, there is also no evidence that researchers simply stop after n = 15 when results are not significant (cf. Simonsohn et al., 2014).
If the decreasing trend were due to peeking, sample sizes would be uncorrelated with the strength of the evidence. If not, studies with larger samples should have less sampling error and stronger evidence against the null hypothesis. To test this prediction, I regressed p-values transformed into z-scores onto sampling error (1 / sqrt(N)). I included the use of covariates and the nature of the study (experimental vs. correlational) as additional predictors.
The strength of evidence increased with decreasing sampling error both without covariates, z = 3.73, and with covariates, z = 3.30. These results suggest that many studies tested a true effect, because a true effect is necessary to increase the strength of evidence against the null hypothesis. To conclude, peeking may have been used, but not at excessive levels that would produce many low z-scores in large samples.
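The logic of this regression test can be sketched with simulated data: when a true effect is present, the expected z-statistic for a two-group comparison is roughly d / (2 × sampling error), so z falls as sampling error rises and the regression slope is clearly negative; under pure peeking of null effects, no such relation is expected. The effect size and sample-size range below are arbitrary choices for illustration, not values from the reported analysis.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 0.4                                  # assumed true standardized effect
N = rng.integers(20, 201, size=500)      # total N of simulated two-group studies
se = 1 / np.sqrt(N)                      # sampling error
# Expected z for a two-group comparison is approximately d / (2 * se);
# add unit noise to mimic sampling variability in observed z-scores.
z = d / (2 * se) + rng.standard_normal(500)

slope, intercept = np.polyfit(se, z, 1)
print(f"slope of z on sampling error: {slope:.1f}")
```

A significantly negative slope in the real data therefore indicates that a substantial share of studies tested true effects, which is what the z = 3.73 and z = 3.30 results show.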
The last analysis examined whether social psychologists used questionable research practices more broadly. The difference between p-hacking and questionable research practices is that p-hacking excludes publication bias (the failure to report entire studies). Focusing on questionable research practices has the advantage that it is no longer necessary to distinguish between selective reporting of analyses and selective reporting of entire studies. Most researchers are likely to use both p-hacking and publication bias, and both practices inflate effect sizes and lower replicability, so it is not important to distinguish between them.
The results show clear evidence that social psychologists used questionable research practices to produce an abundance of significant results. Even without counting marginally significant results, the success rate is 89%, whereas the actual power to produce these significant results is estimated to be just 26%. This shows that a right-skewed p-curve does not tell us how much questionable research practices contributed to significant results. A low discovery rate of 26% translates into a maximum false discovery rate of 15%, although the 95% confidence interval around this estimate reaches all the way to 47%. The point estimate would suggest that one reason for the lack of left-skewed p-curves is that p-hacking of true null hypotheses is fairly rare; the bigger problem is that p-hacking of real effects in small samples produces vastly inflated effect size estimates. Given the wide confidence interval, however, it cannot be ruled out that a substantial number of results was obtained with true null hypotheses by combining publication bias with p-hacking methods that do not produce a marked left skew.
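The arithmetic behind these false-discovery-rate figures is consistent with Sorić's bound: if only a fraction EDR of all tests is significant at alpha = .05, the maximum proportion of false positives among the significant results is (1/EDR − 1) × alpha/(1 − alpha). An estimated discovery rate of 26% caps the FDR at about 15%, and an EDR near 10% (roughly the lower end of the confidence interval) would yield the 47% upper limit. The helper below is my own restatement of that formula.

```python
def max_fdr(edr, alpha=0.05):
    """Soric's upper bound on the false discovery rate implied by an
    estimated discovery rate (EDR) at significance threshold alpha."""
    return (1 / edr - 1) * alpha / (1 - alpha)

print(f"EDR 26% -> max FDR {max_fdr(0.26):.0%}")   # about 15%
print(f"EDR 10% -> max FDR {max_fdr(0.10):.0%}")   # about 47%
```

The bound is a worst case: it assumes every true hypothesis was tested with 100% power, so the actual false discovery rate is likely lower.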