“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).
DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability that a study with a significant result produces a significant result again in an exact replication study that uses the same sample size and significance criterion (Schimmack, 2017).
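To make the definition concrete, here is a minimal sketch in Python. It assumes that each original study has a known true power, which real data never provide, and the function name and the example values are hypothetical. A study with power p is significant with probability p, so significant results over-represent high-powered studies, and an exact replication of such a study succeeds again with probability p.

```python
def replicability_after_selection(powers):
    """Expected success rate of exact replications of significant results.

    `powers` holds the (hypothetical, assumed known) true power of each
    original study. Selection for significance weights each study by its
    power, and the replication succeeds with that same power, so the
    expected rate is sum(p^2) / sum(p).
    """
    total = sum(powers)
    if total <= 0:
        raise ValueError("at least one study must have power above zero")
    return sum(p * p for p in powers) / total


# Illustrative values only: one weak study and one strong study.
mixed = [0.2, 0.8]
unconditional_mean_power = sum(mixed) / len(mixed)
print(unconditional_mean_power)
print(replicability_after_selection(mixed))
```

With homogeneous power the two numbers coincide. With heterogeneous power the value after selection is higher than the unconditional mean, which is why the definition conditions on a significant original result.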
See Reference List at the end for peer-reviewed publications.
The purpose of the R-Index blog is to increase the replicability of published results in psychological science and to alert consumers of psychological research about problems in published articles.
I have used these tools to demonstrate that several claims in psychological articles are incredible (a.k.a., untrustworthy), starting with Bem’s (2011) outlandish claims of time-reversed causal pre-cognition (Schimmack, 2012). This article triggered a crisis of confidence in the credibility of psychology as a science.
Over the past decade it has become clear that many other seemingly robust findings are also highly questionable. For example, I showed that many claims in Nobel Laureate Daniel Kahneman’s book “Thinking, Fast and Slow” are based on shaky foundations (Schimmack, 2020). An entire book on unconscious priming effects, by John Bargh, also ignores replication failures and lacks credible evidence (Schimmack, 2017). The hypothesis that willpower is fueled by blood glucose and easily depleted is also not supported by empirical evidence (Schimmack, 2016). In general, many claims in social psychology are questionable and require new evidence to be considered scientific (Schimmack, 2020).
Each year I post new information about the replicability of research in 120 Psychology Journals (Schimmack, 2021). I have also started providing information about the replicability of individual researchers, along with guidelines on how to evaluate their published findings (Schimmack, 2021).
Replication is essential for an empirical science, but it is not sufficient. Psychology also has a validation crisis (Schimmack, 2021). That is, measures are often used before it has been demonstrated how well they measure something. For example, psychologists have claimed that they can measure individuals’ unconscious evaluations, but there is no evidence that unconscious evaluations even exist (Schimmack, 2021a, 2021b).
If you are interested in the story of how I ended up becoming a meta-critic of psychological science, you can read it here (my journey).
Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, MP.2018.874, 1–22. https://doi.org/10.15626/MP.2018.874
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566. http://dx.doi.org/10.1037/a0029487
Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246
Warning. The content may be graphic and is not suitable for all audiences.
Back in July 2021, the authors sent me a draft of the present paper. I am glad that they did so, because it gave us an opportunity to exchange our opinions and interpretations and to try to correct any misunderstanding or misinterpretations. Unfortunately, however, I see that in the present submission many of those misinterpretations (including false and misleading statements) remain. Thus, I am forced to conclude, reluctantly, that we are not dealing with misunderstandings here but with strategic misrepresentations that seem willful. To be honest, this saddens me, because I thought we could make progress through mutual dialogue. But I don’t see how it serves the goals of science to engage in hyperbole and dismissiveness and to misrepresent so egregiously the views of professional colleagues.
For all of these reasons, and those enumerated below, I am afraid that I cannot support publication of this paper in PSPB. John Jost
(1) On p. 3 the authors write: “IAT scores close to zero for African Americans have been interpreted as evidence that “sizable proportions of members of disadvantaged groups – often 40% to 50% or even more exhibit implicit (or indirect) biases against their own group and in favor of more advantaged groups” (Jost, 2019, p. 277). This is not true. We did not “interpret” the mean-level scores in terms of frequency distributions (or vice versa). We looked at both. So these are two separate observations; one observation was not used to explain the other. For African Americans the mean-level scores were close to zero (no preference) and, using a procedure described in the note to Figure 1 for Jost et al. (2004 p. 898), we concluded that 39.3% exhibited a pro-White/anti-Black bias. (The 40-50% figure comes from other intergroup comparisons included in the original article).
It is not important how you arrived at a precise percentage of unconsciously self-hating African Americans. We used this quote to make clear that you treated the race IAT as a perfectly valid measure of unconscious bias to arrive at the conclusion that a large percentage of African Americans (and a much larger percentage than White Americans) have a preference for the White out-group over the Black in-group. This is the key claim of your article and this is the claim that we challenge. At issue is the validity of the race IAT, which is required to make valid claims about African Americans, not the statistical procedure used to estimate a percentage.
(2) On p. 4, the authors write: “Jost et al.’s (2004) claims about African Americans follow a long tradition of psychological research on African Americans by mostly White psychologists. Often this research ignores the lived experience of African Americans, which often leads to false claims…” There are two very big problems with this section of the paper, which I have already pointed out to the authors (and they have apparently chosen to ignore them). (a) The first is that this is an ad hominem critique, directed at me because of a personal characteristic, namely my race. For centuries philosophers have rejected this as a fallacious form of reasoning: whether something is true or false has nothing to do with the personal characteristics of the person making this claim. Furthermore, the senior author (Uli Schimmack) is obviously wielding this critique in bad faith; he, too, is White, so if he took his own objection seriously he would refrain from making any claims about the psychology of African Americans, but he obviously has not refrained from doing so in this submission or in other forums.
It is a general observation that White researchers have often speculated about African Americans’ self-esteem and mental states without consulting African Americans (see our quote of Adams). And I, Ulrich Schimmack, collaborated with my African American wife on this paper to avoid this very mistake.
(b) The second problem with this claim, which I have also already pointed out to the authors, is that the very same hypotheses about internalization of inferiority advanced by Jost et al. (2004) in the article in question were, in fact, made by a number of Black scholars, including W.E.B. DuBois, Frantz Fanon, Steven Biko, and Kenneth and Mamie Clark. These influences are discussed in considerable detail in my 2020 book, A Theory of System Justification.
Kenneth and Mamie Clark are the authors of the famous doll studies from the 1940s. Are we supposed to believe that nothing has changed over the past 80 years and that we can just use a study with children in the 1940s to make claims about adult African Americans’ attitudes in 2014? What kind of social psychologist would ignore the influence of situations and culture on attitudes?
(3) On the next page the authors write: “Just like White theorists’ claims about self-esteem, Jost et al.’s claims about African Americans’ unconscious are removed from African Americans’ own understanding of their culture and identity and disconnected from other findings that are in conflict with the theory’s predictions. The only empirical support for the theory is the neutral score of African Americans on the race IAT.” Now, this claim is absurd. The book cited above describes hundreds of studies providing empirical support for the theory that have nothing to do with the IAT.
Over the past 10 years, we have seen this gaslighting again and again. When one study is criticized, it is defended by pointing to the vast literature of other studies that also support the claim. There may be other evidence, but it is not clear how this other evidence could reveal something about the unconscious. The whole appeal of the IAT was that it shows something that explicit measures cannot show. In fact, explicit ratings often show stronger in-group favoritism among African Americans. To dismiss this finding, Jost has to appeal to the unconscious, which supposedly reveals a hidden preference for Whites.
(4) They go on: “We are skeptical about the claim that most African-Americans secretly favor the outgroup based on the lived experience of the second author” (p. 5). But this was not our claim. As noted above, we found that 39.3% of African Americans (not “most”) exhibited a pro-White/anti-Black bias on the IAT. But, of course, the theory is about relations among variables, not about the specific percentage of Black people who do X, Y, or Z (which is, of course, affected by historical factors, among many other things).
Back to the game with percentages. We do not care whether you wrote 40% or 50%. We care about the fact that you make claims about African Americans’ unconscious based on an invalid measure.
(5) On p. 6 the authors write: “the mean score of African Americans on the race IAT may be shifted towards a pro-White bias because negative cultural stereotypes persist in US American culture. The same influence of cultural stereotypes would also enhance the pro-White bias for White Americans. Thus, an alternative explanation for the greater in-group bias for White Americans than for African Americans on the race IAT is that attitudes and cultural stereotypes act together for White Americans, whereas they act in opposite directions for African Americans” (p. 6).
As noted above, in July 2021 I wrote to the authors in an attempt to clarify that, from the perspective of SJT, the effects of “cultural stereotypes” in no way support “an alternative explanation” for out-group favoritism, because stereotypes (since the very first article by Jost & Banaji, 1994) have been considered to be system-justifying devices. Here is what I wrote to them: You describe the influence of “cultural stereotypes” as some kind of an alternative to system justification processes, but they are not. The theory started as a way of understanding the origins and consequences of cultural stereotypes. None of this contradicts SJT at all: “The nature of the task may activate cultural stereotypes that are normally not activated when African Americans interact with each other. As a result, the mean score of African Americans on the race IAT may be shifted towards a pro-White bias because negative cultural stereotypes persist in US American culture. The same influence of cultural stereotypes would also enhance the pro-White bias for White Americans.” Yes, this is perfectly consistent with SJT. In fact, it is part of our point. And the purpose of SJT is not to explain what happens “when African Americans interact with each other,” although it may shed some light on intragroup dynamics. I think of the scene in Spike Lee’s (a Black film director, as you well know) movie, School Daze, when the light-skinned and dark-skinned African Americans are fighting/dancing with each other. There is plenty of system justification going on there, it seems to me. We may (or may not disagree) in our interpretation of the social dynamics in School Daze, but I feel that the authors are now willfully misrepresenting system justification theory on the issue of “cultural stereotypes,” even after I explicitly sought to clarify their misrepresentation months ago: The activation of cultural stereotypes IS part of what we are trying to understand in terms of SJT.
Jost ignores that many other social psychologists have raised concerns about the validity of the race IAT because it may conflate knowledge of negative stereotypes with endorsement of these stereotypes and attitudes (Olson & Fazio, DOI: 10.1037/0022-3514.86.5.653). For anybody who cares, please ask yourself why Jost does not address the key point of our criticism, namely the use of race IAT scores to make inferences about African Americans’ unconscious without evidence that it can measure conscious or unconscious preferences of African Americans.
(6) It has been a while since I read the Bar-Anan and Nosek (2014) article, but my memory for it is incompatible with the claim that those authors were foolish enough to simply assume that the most valid implicit measures was the one that produced the biggest difference between Whites and Blacks in terms of in-group bias, as the present authors claim (pp. 7-8). As I recall, Bar-Anan and Nosek made a series of serious and comprehensive comparisons between the IAT and other tasks and concluded on the basis of those comparisons, not the one graphed in Figure 1 here, that the validity of the IAT was superior. I feel that, in addition to seriously representing my own work, they are also seriously misrepresenting the work of Bar-Anan and Nosek. Those authors should also have the opportunity to review and/or respond to the present claims being made about the (in) validity of the IAT.
So, the reviewer relies on his foggy memory to question our claim instead of retrieving a pdf file and checking for himself. New York University should be proud of this display of scholarship. I hope Jost made sure to get his Publons credit. Here is the relevant section from Bar-Anan and Nosek (2014, p. 675; https://link.springer.com/article/10.3758/s13428-013-0410-6).
(7) One methodological improvement of this paper over the previous draft that I saw is that this version now includes other implicit measures, including the single category IAT. However, the hypothesis stated on p. 9, allegedly on behalf of SJT, is incorrect: “System justification theory predicts a score close to zero that would reflect an overall neutral attitude and at least 50% of participants who may hold negative views of the in-group.” This is wrong on several counts and indicates a real lack of familiarity with SJT, which predicts that (to varying degrees) people are motivated to hold favorable attitudes toward themselves (ego justification), their in-group (group justification), and toward the overarching social system (system justification). This last motive—in a departure from the first two—implies that, based on the strength of system justification tendencies, advantaged group members’ attitudes toward the ingroup will become more favorable and disadvantaged group members’ attitudes toward the in-group will become less favorable. As noted above, SJT is not about making predictions about absolute scores or frequency counts—these are all subject to historical and many other contextual factors. It would be foolish to predict that African Americans have a neutral (near zero) attitude toward their own group or that 50% have a negative attitude. This is not what the theory says at all. Unless you have separate individual-level estimates of ego, group, and system justification scores, the most one could hypothesize is that on the single category IAT is that African Americans would have a more favorable evaluation of the out-group than European Americans would, and European Americans would have a more favorable evaluation of the in-group than African Americans would. Note that I am writing this before looking at the results.
We are interested in African Americans’ and White Americans’ attitudes towards their in-groups and out-groups. If System Justification Theory (SJT) makes no clear predictions about these attitudes, we do not care about SJT. However, we do care about an article that has been cited over 1,000 times, that claims that many African Americans have unconscious negative attitudes towards their in-group, and that supports this claim by computing the percentage of African Americans who scored above zero on a White-Black IAT (i.e., slower responses when African American is paired with good than when African American is paired with bad). We show that the race IAT lacks convergent validity with other implicit measures and that other implicit measures show different results. Thus, Jost has to justify why we should focus on the IAT results and ignore the results from other implicit tasks. So far, he has avoided talking about our actual empirical results.
(8) On pp. 10-11 the authors concede: “The model was developed iteratively using the data. Thus, all results are exploratory and require validation in a separate sample. Due to the small number of Black participants, it was not possible to cross-validate the model with half of the sample. Moreover, tests of group differences have low power and a study with a larger sample of African Americans is needed to test equivalence of parameters… models with low coverage (many missing data) may overestimate model fit. A follow-up study that administers all tasks to all participants should be conducted to provide a stronger test of the model.” These seem like serious limitations that, in the absence of replication with much larger samples, undermine the very strong conclusions the authors wish to draw.
So Jost can make strong claims (40% of African Americans have unconscious negative attitudes towards their group) based on an unvalidated measure, but when we actually show that the measure lacks validity, we need to replicate our findings first? This is not how science works. Rather, Jost needs to explain why other implicit measures, including the single category IAT, do not show the same pattern as the race IAT that was used in the 2004 article.
(9) There is a peculiar paragraph on p. 13 in the “Results” sections, even though it goes well beyond the reporting of results: “Most important is the finding that race IAT scores for African Americans were unrelated to the attitudes towards the in-group and out-group factors. Thus, scores on the race IAT do not appear to be valid measures of African Americans’ attitudes. This finding has important implications for Jost et al.’s (2004) reliance on race IAT scores to make inferences about African Americans’ unconscious attitudes towards their in-group. This interpretation assumed that race IAT scores do provide valid information about African American’s attitudes towards the ingroup, but no evidence for this assumption was provided. The present results show 20 years later that this fundamental assumption is wrong. The race-IAT does not provide information about African Americans’ attitudes towards the in-group as reflected in other implicit measures.”
First of all, I don’t know if one can conclude, even in principle, that the race IAT is invalid for African Americans on the basis of a single study carried out with approximately 200 African American participants. There have been dozens, if not more, studies conducted (see Essien et al., 2020, JPSP), so it seems that any attempt to claim invalidity across the board should be based on a far more comprehensive analysis of larger data sets. Second, if I understand the specific methodological claim here it is that African Americans’ race IAT scores are not correlated with whatever the common factor is that is shared by the other implicit attitude measures (AMP, evaluative priming, and SC-IAT) and one explicit attitude measure (feeling thermometer). At most, it seems to me that one could conclude, on the basis of this, that the race IAT is measuring something different than the other things. This is not all that surprising; indeed, the IAT was supposed to measure something different from feeling thermometers. It seems like a stretch to conclude that the IAT is invalid and the other measures are valid simply because they appear to be measuring somewhat or even completely different things.
Third, the hyperbolic and misleading language implies that something about the IAT is a “fundamental assumption” of SJT, but this is false. The IAT was simply considered to be the best implicit measure at that time (20 years ago), so that is what we used. But it is silly to assume that hypotheses, especially “fundamental” ones, should be forever tied to specific operationalizations. Fourth, the attacking, debunking nature of this paragraph—against the IAT as a methodological instrument and against SJT as a theoretical framework—makes it clear that the authors are not really very interested in the dynamics of ingroup and outgroup favoritism among members of advantaged and disadvantaged groups (measured in different ways). It’s as if the real issue doesn’t even come up here.
Finally, we get to the substantive issue. First, let’s get the gaslighting out of the way. There have not been dozens of studies trying to validate the race IAT for African Americans. There have been zero. This is not surprising because there have also been no serious attempts to validate the race IAT for White respondents or IATs in general (Schimmack, 2021; https://journals.sagepub.com/doi/abs/10.1177/1745691619863798). The key problem is that social psychologists are poorly trained in psychometrics (i.e. the science of psychological measurement and construct validation; Schimmack, 2021, https://open.lnu.se/index.php/metapsychology/article/view/1645).
Now on to the substantive issue. We are the first to show that among African Americans, several implicit measures (e.g., evaluative priming, AMP, single category IAT) show some (modest) convergent validity with each other. Not surprisingly, they also show convergent validity with explicit measures because all measures mostly reflect a common attitude (rather than one conscious and one unconscious attitude) (Schimmack, 2021; https://journals.sagepub.com/doi/abs/10.1177/1745691619863798). All of these measures show as much (or more) positivity in in-group attitudes for African Americans as for White Americans. This is an interesting finding because positive attitudes on explicit measures were dismissed by Jost. But now several implicit measures show the same result. Thus, it is not a simple rating bias. Now the race IAT and its variants are the odd ones with a different pattern. Why? That remains to be examined, but to make claims about African Americans’ attitudes we would need to know the answer to this question. Maybe it is just a method artifact? Just raising this possibility is a noteworthy contribution to science.
(10) Eventually, a few pages later, the authors get around to telling us what they really found with respect to the actual research question: “Also expected was the finding that out-group attitudes of African Americans, d = .42, 95%CI , are more favorable than out-group attitudes of White Americans, d = .20, 95%CI.” So, um, African Americans exhibited more favorable attitudes toward Whites than Whites exhibited toward African Americans. This is precisely what system justification theory would have predicted, as I noted above (before looking at the results). It is, perhaps, an interesting discovery — if it is replicated with larger samples — that out-group attitudes are unrelated to in-group attitudes for both groups and that in-group attitudes were equally positive for both groups. But, with respect to the key question of out-group favoritism, the authors actually obtained support for SJT but refuse to even acknowledge it. Is this really what science is about? On the contrary, they draw this outrageous conclusion: “Thus, support for the system justification theory rests on a measurement artifact.” In point of fact, when the authors return to the comparative ingroup vs. outgroup measure they arrive at a conclusion that is virtually the same as Jost et al. (2004): “White Americans’ scores on the race IAT are systematically biased towards a pro-White score, d = .78, whereas African Americans’ scores are only slightly biased towards a pro-Black score, d = -.19.” Yes, advantaged groups tend to show reasonably strong in-group favoritism, whereas disadvantaged groups tend to show weak in-group favoritism, with substantial proportions showing out-group favoritism. This is precisely what we found 20 years ago. The authors and I already had this exchange back in July, but their paper contains the same misleading statements as before. Here is our exchange:
You write: “Proponents of system justification theory might argue that attitudes towards the in-group have to be evaluated in relative terms. Viewed from this perspective, the results still show relatively more in-group favoritism for White Americans, d = .62 – .20 = .42 than African Americans, d = .54 – .40 = .14. However, out-group attitudes contribute more to this difference, d = .40 – .20 = .20, than in-group differences, d = .62 – .54 = .08. Thus, one reason for the difference in relative preferences is that African Americans attitudes towards Whites are more positive than White Americans’ attitudes towards African Americans.” My response: Yes, this is key. We are talking about the ways in which people respond to relative status, power, and wealth, etc. rankings within a given social system (or society). The fact that “African Americans attitudes towards Whites are more positive than White Americans’ attitudes towards African Americans” is supportive of SJT.
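The arithmetic in the quoted passage can be checked with a short Python sketch. The four d values are the ones given in the quote; the variable and function names are my own and only serve the illustration.

```python
# Standardized attitude scores (d) as given in the quoted passage.
white = {"in_group": 0.62, "out_group": 0.20}
black = {"in_group": 0.54, "out_group": 0.40}


def relative_preference(group):
    """In-group attitude minus out-group attitude."""
    return group["in_group"] - group["out_group"]


# Gap in relative in-group favoritism between the two samples.
gap = relative_preference(white) - relative_preference(black)

# Split the gap into an in-group part and an out-group part.
in_group_part = white["in_group"] - black["in_group"]
out_group_part = black["out_group"] - white["out_group"]

print(round(relative_preference(white), 2), round(relative_preference(black), 2))
print(round(gap, 2), round(in_group_part, 2), round(out_group_part, 2))
```

The two parts add up to the total gap, and the out-group part is the larger one, which is the point the quoted passage makes.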
Oh boy, sorry if you had to read all of this. Does it make sense to make a distinction between in-group attitudes and out-group attitudes? I hope we can agree that it does. Would we be surprised if Black girls like White dolls more than White girls like Black dolls? Not really, and it doesn’t tell us anything about internalizing stereotypes. The important and classic doll study did not care about the comparison of out-group attitudes. The issue was whether Black children preferred White dolls over Black dolls, and Jost et al. (2004) claimed that many African Americans internalized negative stereotypes of their group and positive stereotypes of Whites so that they have a relatively greater preference for White over Black. The problem is that the race IAT confounds in-group and out-group attitudes and that measures that avoid this confound, like the single category IAT, don’t show the same result.
(11) Another huge problem with this whole research program is that it ignores completely the strongest piece of evidence for SJT in this context, namely that the degree of out-group favoritism among disadvantaged groups is positively associated with support for the status quo, measured in terms of political conservatism and individual difference measures of system-justifying beliefs (e.g., see Ashburn-Nardo et al., 2003; Essien et al., 2020; Jost et al., 2004). If Blacks’ responses on the IAT were random or meaningless, I see no reason why they would be consistently correlated with other measures of system justification. But the voluminous literature shows that they are (Essien et al., 2020). Although I have pointed this out to the authors before, they have simply ignored the issue once again, even though this is a key piece of evidence that supports the SJT interpretation of implicit attitudes about advantaged and disadvantaged groups
Back to gaslighting. Let’s say there are some studies that show this pattern. How does Jost explain the pattern of results in the present study? He doesn’t. That is the point.
(12) All of the above problems are repeated in the General Discussion, so there is no need to address them again point by point. But I will say that other key issues that the authors and I discussed in July are also ignored in the present submission: I wrote: This statement is interesting but far too categorical, in my opinion: “It would be a mistake to interpret this difference in evaluations of the out-group as evidence that African Americans have internalized negative stereotypes about their in-group.” First, it is not an either/or situation, as if people either love their group or hate it. This is not how people are. There are multiple, conflicting motives involving ego, group, and system justification, and ambivalence is part of what interests us as system justification theorists. Second, there is plenty of other evidence suggesting that—again, to some degree—African Americans and other groups “internalize” negative stereotypes. Are you really suggesting that there are NO psychological consequences for African Americans living in a society in which they are systematically devalued? I’m still waiting for an answer to that last question. The purpose of this submission, it seems to me, is not to illuminate anything, really, and indeed very little, if anything, is illuminated. The purpose of the paper, it seems, is to create the appearance of something scandalous and awful and perhaps even racist in the research literature when, in fact, the substantive results obtained here are very similar to what has been found before. And if the authors really want to declare that the race-based IAT is a completely useless measure, they have a lot more work to do than re-analyzing previously published data from one relatively small study.
With the confidence of a peer-reviewer in the role of an expert, Jost feels confident enough to lie when he writes “In fact, the substantive results obtained here are very similar to what has been found before.” Really? Nobody has examined convergent validity of various implicit measures among African Americans before. Bar-Anan and Nosek collected the data, but they didn’t analyze them. Instead, they simply concluded that the race IAT is the best measure because it shows the strongest differences between groups. Here we show that implicit measures that can be scored to distinguish in-group and out-group attitudes do not show that African Americans hold negative views of their in-group. Does it matter? Yes it does. Where do African Americans want to live? Who do they want to marry? Would they want other African Americans as colleagues? The answers to these questions depend on their in-group attitudes. So, if Jost cared about African Americans rather than about his theory that made him famous, he might be a bit more interested in our results. However, Jost just displays the same level of curiosity about disconfirming and distressing evidence as many of his colleagues; that is, none. Instead, he fights like a cornered animal to defend his system of ideas against criticism. You might even call this behavior system justification.
I always wanted to be James Bond, but being 55 now it is clear that I will never get a license to kill or work for a government intelligence agency. However, the world has changed and there are other ways to spy on dirty secrets of evil villains.
I have started to focus on the world of psychological science, which I know fairly well because I was a psychological scientist for many years. During my time as a psychologist, I learned about many of the dirty tricks that psychologists use to publish articles to further their careers without advancing understanding of human behavior, thoughts, and feelings.
However, so far the general public, government agencies, and government funding agencies that hand out taxpayers’ money to psychological scientists have not bothered to monitor the practices of psychological scientists. They still believe that psychological scientists can control themselves (e.g., peer review). As a result, bad practices persist because the incentives favor behaviors that lead to publication of many articles even if these articles make no real contribution to science. I therefore decided to create my own Psychological Intelligence Agency (PIA). Of course, I cannot give myself a license to kill, and I have no legal authority to enforce laws that do not exist. However, I can gather intelligence (information) and share this information with the general public. This is less like James Bond and more like the CIA, which also shares some of its intelligence with the public (CIA factbook), or the website Retraction Watch, which keeps track of article retractions.
Some of the projects that I have started are:
Replicability Rankings of Psychology Journals: Keeping track of the power (expected discovery rate, expected replication rate) and the false discovery risk of test results published in over 100 psychology journals from 2010 to 2020.
Personalized Criteria of Statistical Significance: It is problematic to use the standard criterion of significance (alpha = .05) when this criterion leads to few discoveries because researchers test many false hypotheses or test true hypotheses with low power. When discovery rates are low, alpha should be set to a lower value (e.g., .01, .005, .001). Here I used estimates of authors’ discovery rate to recommend an appropriate alpha level to interpret their results.
Quantitative Book Reviews: Popular psychology books written by psychological scientists (e.g., Nobel Laureate Daniel Kahneman) reach a wide audience and are assumed to be based on solid scientific evidence. Using statistical examinations of the sources cited in these books, I provide information about the robustness of the scientific evidence to the general public (see also “Before you know it”).
Citation Watch: Science is supposed to be self-correcting. However, psychological scientists often cite outdated references that fit their theory without citing newer evidence that their claims may be false (a practice known as cherry-picking citations). Citation Watch reveals these bad practices by linking articles with misleading citations to articles that question the claims supported by cherry-picked citations.
Whether all of this intelligence gathering will have a positive effect depends on how many people actually care about the scientific integrity of psychological science and the credibility of empirical claims. Fortunately, some psychologists are willing to learn from past mistakes and are improving their research practices (Bill von Hippel).
I was fortunate enough to read Jacob Cohen’s articles early on in my career to avoid many of the issues that plague psychological science. One of his important lessons was that it is better to test a few (or better one) hypothesis in one large sample (Cohen, 1990) than to conduct many tests in small samples.
The reason is simple. Even if a theory makes a correct prediction, sampling error may produce a non-significant result, especially in small samples where sampling error is large. This type of error is known as type-II error, beta, or a false negative. The probability of obtaining the desired and correct outcome of a significant result when a hypothesis is true is called power. The problem of testing multiple hypotheses is that the cumulative or total power of finding evidence for all correct hypotheses decreases with the number of tests. Even if a single test has 80% power (i.e., the probability of a significant result for a correct hypothesis is 80 percent), the probability of providing evidence for all 10 correct hypotheses is only .8^10 = .11 (11%). The expected value is that 2 of the 10 tests produce a type-II error (Schimmack, 2012).
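The arithmetic can be checked in a few lines. This is a minimal sketch that assumes the 10 tests are independent and share the same power; the function names are made up for illustration.

```python
def prob_all_significant(power: float, k: int) -> float:
    """Probability that k independent tests with equal power are ALL significant."""
    return power ** k


def expected_type2_errors(power: float, k: int) -> float:
    """Expected number of non-significant results among k tests of true hypotheses."""
    return (1 - power) * k


# Ten true hypotheses, each tested with 80% power
print(round(prob_all_significant(0.80, 10), 3))   # 0.107, i.e. about 11%
print(round(expected_type2_errors(0.80, 10), 2))  # 2.0
```

The same functions with power = 0.50 show how quickly things get worse: the chance that all 10 tests succeed drops below one in a thousand.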
Cohen (1961) also noted that the average power of statistical tests is well below 80%. For a medium/average effect size, power was around 50%. Now imagine that a researcher tests 10 true hypotheses with 50% power. The expected value is that 5 tests produce a significant result (p < .05) and 5 studies produce a type-II error (p > .05). The interpretation of the article will focus on the significant results, but they were selected basically by a coin flip. The next study will produce a different set of 5 significant studies.
To avoid type-II errors researchers could conduct a priori power analysis to ensure that they have enough power. However, this is rarely done with the explanation that a priori power analysis requires knowledge about the population effect size, which is unknown. However, it is possible to estimate the typical power of studies by keeping track of the percentage of significant results. Because power determines the rate of significant results, the rate of significant results is an estimate of average power. The main problem with this simple method of estimating power is that researchers often do not report all of their results. Especially before the replication crisis became apparent, psychologists tended to publish only significant results. As a result, it is largely unknown how much power actual studies in psychology have and whether power increased since Cohen (1961) estimated power to be around 50%.
Here I illustrate a simple way to estimate actual power of studies with a recent multi-study article that reported a total of 184 significance tests (more were reported in a supplement, but were not coded)! Evidently, Cohen’s important insights remain neglected, especially in journals that pride themselves on rigorous examination of hypotheses (Kardas, Kumar, & Epley, 2021).
Figure 2 shows the first rows of the coding spreadsheet (Spreadsheet).
Each row shows one specific statistical test. The column “HO rejected” reflects how authors interpreted a result. Broadly this decision is based on the p < .05 rule, but sometimes authors are willing to treat values just above .05 as sufficient evidence, which is often called marginal significance. The column p < .05 strictly follows the p < .05 rule. The averages in the top row show that there are 77% significant results using authors’ rules and 71% using the p < .05 rule. This shows that 6% of the p-values were interpreted as marginally significant.
All test-values or point estimates with confidence intervals are converted into exact two-sided p-values. The two-sided p-values are then converted into z-scores using the inverse normal formula; z = -qnorm(p/2). Observed power is then estimated for the standard criterion of significance; alpha = .05, which corresponds to a z-score of 1.96. The formula for observed power is pnorm(z, 1.96). The top row shows that mean observed power is 69%. This is close to the 71% obtained with the strict p < .05 rule, but a bit lower than the 77% when marginally significant results are included. This simple comparison shows that marginally significant results inflate the percentage of significant results.
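For readers who do not use R, the two conversions can be sketched with Python's standard library. The function names are illustrative; the calculation mirrors the R expressions in the text, so, like pnorm(z, 1.96), it ignores the negligible chance of a significant result in the opposite direction.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, sd 1


def p_to_z(p_two_sided: float) -> float:
    """Absolute z-score for a two-sided p-value (R: -qnorm(p / 2))."""
    return -STD_NORMAL.inv_cdf(p_two_sided / 2)


def observed_power(z: float, alpha: float = 0.05) -> float:
    """Observed power for a z-score (R: pnorm(z, 1.96) when alpha = .05)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    return STD_NORMAL.cdf(z - z_crit)


for p in (0.05, 0.01, 0.001):
    z = p_to_z(p)
    print(p, round(z, 2), round(observed_power(z), 2))
```

A result with p = .05 exactly has a z-score of 1.96 and an observed power of 50%, which is why just-significant results contribute little to mean observed power.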
The inflation column keeps track of the consistency between the outcome of a significance test and the power estimate. When power is practically 1, a significant result is expected and inflation is zero. However, when power is only 60%, there is a 40% chance of a type-II error and authors were lucky if they got a significant result. This can happen in a single test, but not in the long run. Average inflation is a measure of how lucky authors were if they got more significant results than the power of their studies allows. Using the authors’ 77% success rate and estimated power of 69%, we have an inflation of 8%. This is a small bias, and we already saw that interpretation of marginal results accounts for most of it.
The last column is called the Replication Index (R-Index). It simply subtracts the inflation from the observed power estimate. The reason is that observed power is an inflated estimate of power when there are too many significant results. The R-Index is called an index because the formula is just an approximate correction for selection for significance. Later I show the results with a better method. However, the Index can clearly distinguish between junk science (R-Index below 50) and credible evidence. Based on the present results, the R-Index of 62 shows that the article reported some credible findings. Moreover, the R-Index now underestimates power because the rate of p-values below .05 is consistent with observed power. The inflation is just due to the interpretation of marginal results as significant. In short, the main conclusion from this simple analysis of test statistics in a single article is that the authors conducted studies with an average power of about 70%. This is expected to produce type-II errors, sometimes with p-values close to .05 and sometimes with p-values well above .1. This could mean that nearly a quarter of the published results are type-II errors.
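The two spreadsheet columns reduce to simple subtraction. The sketch below applies them to the rounded averages quoted above (77% success rate, 69% mean observed power); the function names are illustrative. With these rounded inputs the result is 61, one point below the 62 in the spreadsheet, which presumably works with unrounded values.

```python
def inflation(success_rate: float, mean_observed_power: float) -> float:
    """Excess of significant results over what mean observed power predicts."""
    return success_rate - mean_observed_power


def r_index(success_rate: float, mean_observed_power: float) -> float:
    """R-Index: mean observed power minus inflation."""
    return mean_observed_power - inflation(success_rate, mean_observed_power)


print(round(inflation(0.77, 0.69), 2))  # 0.08
print(round(r_index(0.77, 0.69), 2))    # 0.61
```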
But what about type-I errors?
Cohen was concerned about the problem that many underpowered studies fail to provide evidence for true hypotheses (i.e., they fail to reject false null hypotheses). However, the replication crisis shifted the focus from false negative results to false-positive results. An influential article by Simmons et al. (2011) suggested that many if not most published results might be false positive results. The authors also developed a statistical tool, called p-curve, that examines whether a set of significant results is entirely based on false positive results. The next figure shows the output of the p-curve app for the 130 significant results (only significant results are considered because p-values greater than .05 cannot be false positives).
The graph shows that there are a lot more p-values below .01 (78%) than p-values between .04 and .05 (2%). This distribution of p-values is inconsistent with the hypothesis that all significant results are false positives. In addition, the program estimates that the average power of the 130 studies with significant results is 99%! As a result, there can be no false positives that would produce an estimate of 5% power. It is noteworthy that the p-curve analysis did not spot the inflation of significant results by interpreting marginally significant results because these results are omitted from the p-curve analysis. It is rather unlikely that the average power of studies is 99%. In fact, simulation studies have shown that the power estimates of p-curve are often inflated when studies are heterogeneous (Brunner, 2018; Brunner & Schimmack, 2020). The p-curve authors are aware of this bug, but have done nothing to fix it (Datacolada, 2018).
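The logic of the first step can be made concrete. If every significant result were a false positive, p-values would be uniformly distributed, so 20% of the significant p-values would fall below .01 and 20% between .04 and .05. The sketch below computes these shares for a two-sided z-test; it is not the p-curve app's algorithm, and the expected z-score of 2.8 (roughly 80% power) is chosen only for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def prob_p_below(threshold: float, expected_z: float) -> float:
    """P(two-sided p < threshold) for a z-test with the given expected z-score."""
    z_crit = STD_NORMAL.inv_cdf(1 - threshold / 2)
    return STD_NORMAL.cdf(expected_z - z_crit) + STD_NORMAL.cdf(-expected_z - z_crit)


def share_below_01(expected_z: float) -> float:
    """Share of significant p-values (p < .05) that are below .01."""
    return prob_p_below(0.01, expected_z) / prob_p_below(0.05, expected_z)


def share_04_to_05(expected_z: float) -> float:
    """Share of significant p-values (p < .05) that fall between .04 and .05."""
    sig = prob_p_below(0.05, expected_z)
    return (sig - prob_p_below(0.04, expected_z)) / sig


for label, ez in (("all nulls true", 0.0), ("about 80% power", 2.8)):
    print(label, round(share_below_01(ez), 2), round(share_04_to_05(ez), 2))
```

The observed shares of 78% and 2% look like the high-power pattern and nothing like the flat 20%/20% pattern expected if all nulls were true.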
A better statistical method to analyze p-values is z-curve, which relies on the z-scores that were obtained from the p-values in the spreadsheet. However, the z-curve package for R can also read p-values. The next Figure shows a histogram of all 184 (significant and non-significant) values up to a value of 6. Values over 6 are not shown and are all treated as studies with perfect power.
The expected replication rate corresponds to the power estimate in p-curve. It is notably lower than 99% and the 95%CI excludes a value of 99%. This finding simply shows once again that p-curve estimates are inflated.
The observed discovery rate is simply the same percentage that was computed on the spreadsheet using a strict p < .05 rule. The expected discovery rate is an estimate of the average power for all studies, including those with non-significant results, that is corrected for any potential inflation. It is 62%, which matches the R-Index in the spreadsheet.
The comparison of the observed discovery rate of 71% and the expected discovery rate of 62% suggests that there is some overreporting of significant results. However, the 95%CI around the EDR estimate ranges from 27% to 88%. Thus, sampling error alone may explain this discrepancy.
An EDR of 62% implies that only a small number of significant results can be false positives. The point estimate is just 2%, but the 95%CI allows for up to 14% false positives. Thus, the reported results are unlikely to be false positives, but effect sizes could be inflated because selection for significance with modest power inflates effect size estimates.
There is also notable evidence of heterogeneity. The distribution of z-scores is much flatter than a standard normal distribution that is expected if all studies had the same power. This means that some results might be more credible than others. Therefore I conducted some moderator analyses.
One key hypothesis in the article was that shallow and deep conversations differ in important ways. Several studies tested this by comparing shallow and deep conversations. Fifty-four analyses included a contrast between shallow and deep conversations as a main effect or in an interaction. The expected replication rate is unchanged. The expected discovery rate is a bit higher, but surprisingly, the observed discovery rate is lower. Visual inspection of the z-curve plot shows an unusually high number of marginally significant results. This is further evidence to distrust marginally significant results. However, overall these results suggest that shallow and deep conversations differ.
Several analyses tested mediation, which can require large samples to have adequate power. Not surprisingly, the 39 mediation tests have only a replication rate of 53%. There is also some suggestion of bias, with an observed discovery rate of 51% and an expected discovery rate of only 25%, but the 95%CI around the point estimate is wide and includes 51%. The low expected discovery rate implies that the false discovery risk is 16%, which is unacceptably high.
One solution to the high false discovery risk is to lower the criterion for significance. The next conventional level is alpha = .01. The next figure shows the results for this criterion value (the red solid line has moved to z = 2.58).
Now the observed discovery rate is in line with the expected discovery rate (28% vs. 27%) and the false discovery risk has been lowered to 3%. However, the expected replication rate (for alpha = .01) is only 36%. Thus, follow-up studies need to increase sample sizes to replicate these mediation effects.
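The text does not spell out how a discovery rate is turned into a false discovery risk. The sketch below assumes Sorić's upper bound on the false discovery rate, which, to my understanding, is the formula z-curve uses; treat that as an assumption. The function name is illustrative.

```python
def false_discovery_risk(edr: float, alpha: float = 0.05) -> float:
    """Soric's upper bound on the share of significant results that are false positives.

    edr   -- expected discovery rate (proportion of all tests that are significant)
    alpha -- significance criterion
    """
    return (1 / edr - 1) * alpha / (1 - alpha)


# Mediation tests: EDR of 25% with alpha = .05
print(round(false_discovery_risk(0.25, 0.05), 2))  # 0.16
# Same tests with alpha = .01, where the EDR is 27%
print(round(false_discovery_risk(0.27, 0.01), 2))  # 0.03
```

Under this assumption the formula reproduces the 16% and 3% reported for the mediation tests, and it shows why lowering alpha helps: the bound shrinks roughly in proportion to alpha as long as the discovery rate does not collapse.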
A post-hoc power-analysis of this recent article shows that psychologists still have not learned Cohen’s lesson that he shared in 1990 (more than 30 years ago). Conducting many significance tests with modest statistical power produces a confusing pattern of significant and non-significant results that is strongly influenced by sampling error. Rather than reporting results of individual studies, the authors should have reported meta-analytic results for tests of the same hypothesis. However, to end on a positive note, the studies are not p-hacked and the risk of false positives is low. Thus, the results provide some credible findings that can be used to conduct confirmatory tests of the hypothesis that deeper conversations are more awkward, but also more rewarding. I hope these analyses show that a deep dive into the statistical results reported in an article can also be rewarding.
Good science requires not only open and objective reporting of new data; it also requires unbiased review of the literature. However, there are no rules and regulations regarding citations, and many authors cherry-pick citations that are consistent with their claims. Even when studies have failed to replicate, original studies are cited without citing the replication failures. In some cases, authors even cite original articles that have been retracted. Fortunately, it is easy to spot these acts of unscientific behavior. Here I am starting a project to list examples of bad scientific behaviors. Hopefully, more scientists will take the time to hold their colleagues accountable for ethical behavior in citations. They can even do so by posting anonymously on the PubPeer comment site.
Authors: B. Keith Payne, Jason W. Hannay
Citation: One of the most important contributions from psychological science is the concept of implicit bias. Implicit bias refers to positive or negative mental associations cued spontaneously by social groups. It is measured using cognitive tasks that test how those associations facilitate, interfere with, or otherwise bias task performance [5,6]. Many studies suggest that implicit bias is widespread, even among people who explicitly endorse egalitarian attitudes [7,8].
Others argue that implicit bias is a stable trait-like construct, and that context effects or temporal fluctuations reflect only measurement error [50,51].
Correction: This quote and many other citations in this article fail to mention that the concept of implicit bias is controversial and lacks strong empirical support. There are many critical articles to cite, but my own criticism of the construct validity of implicit measures references most of them (https://doi.org/10.1177/1745691619863798). Another article directly criticizes Payne and is not cited (https://journals.sagepub.com/doi/abs/10.1177/1745691620931492). The authors cite my article, but fail to mention that it also contains evidence to support the claim that implicit racial bias measures have only modest convergent validity with explicit racism measures and very little discriminant validity.
Authors: Cassandra Baldwin, Katie E. Garrison, Roy F. Baumeister & Brandon J. Schmeichel
Citation: Research has found that the capacity for executive control may work as if it depended on a limited resource. Effortful acts of control consume some of this resource, resulting in a state known as ego depletion (Baumeister et al., 1998; Muraven & Baumeister, 2000).
Correction: does not cite meta-analysis that shows publication bias and no evidence for the effect (https://doi.org/10.3389/fpsyg.2014.00823). Also does not cite two failed replication attempts in major RRR (https://doi.org/10.1177/1745691616652873, https://doi.org/10.1177/0956797621989733)
Authors: Mónika Gergelyfi, Ernesto J. Sanz-Arigita, Oleg Solopchuk, Laurence Dricot, Benvenuto Jacob, Alexandre Zénon
Citation: Theories of MF can be classified in two major groups that assume either: (a) alterations of motivational processes leading to restrictions on the recruitment of cognitive resources for the task at hand (…) or b) progressive functional alteration of cognitive processes through metabolic mechanisms (Gailliot and Baumeister, 2007; Christie and Schrater, 2015; Holroyd, 2015; Hopstaken et al., 2015; Blain et al., 2016; Gergelyfi et al., 2015).
Correction: do not cite meta-analysis that shows publication bias and no evidence for glucose effects on willpower (https://doi.org/10.1177/0956797616654911)
Authors: Alexandra Touroutoglou, Joseph Andreano, Bradford C. Dickerson, Lisa Feldman Barrett
Citation: Some accounts hold that effort serves to manage intrinsic costs to finite resources such as metabolic resources (Gailliot and Baumeister, 2007; Gailliot et al., 2007; Holroyd, 2016),
Correction: do not cite meta-analysis that shows publication bias and no evidence for glucose effects on willpower (https://doi.org/10.1177/0956797616654911)
Authors: P. A. Hancock; John D. Lee; John W. Senders
Citation: Misattributions involved in such processes of assessment can, as we have seen, lead to adverse consequences (e.g., Johnson et al., 2019).
DOI: 10.1177/00187208211036323
Correction: Retraction (https://www.pnas.org/content/117/30/18130)
Authors: Desmond Ang
Citation: While empirical evidence of racial bias is mixed (Nix et al. 2017; Fryer 2019; Johnson et al. 2019; Knox, Lowe, and Mummolo 2020; Knox and Mummolo 2020)
Correction: Retraction of Johnson et al. (https://www.pnas.org/content/117/30/18130)
Authors: Jordan R. Riddell; John L. Worrall
Citation: Recent years have also seen improvements in benchmarking-related research, that is, in formulating methods to more accurately analyze whether bias (implicit or explicit) or racial disparities exist in both UoF and OIS. Recent examples include Cesario, Johnson, and Terrill (2019), Johnson, Tress, Burkel, Taylor, and Cesario (2019), Shjarback and Nix (2020), and Tregle, Nix, and Alpert (2019).
Correction: Retraction of Johnson et al. (https://www.pnas.org/content/117/30/18130)
Authors: Dean Knox, Will Lowe, Jonathan Mummolo
Citation: A related study, Johnson et al. (2019), attempts to estimate racial bias in police shootings. Examining only positive cases in which fatal shootings occurred, they find that the majority of shooting victims are white and conclude from this that no antiminority bias exists
Correction: Retraction of Johnson et al. (https://www.pnas.org/content/117/30/18130)
Authors: Chew Wei Ong, Kenichi Ito
Citation: This penalty treatment of error trials has been shown to improve the correlations between the IAT and explicit measures, indicating a greater construct validity of the IAT.
Correction: higher correlations do not imply higher construct validity of IATs as measures of implicit attitudes (https://doi.org/10.1177/1745691619863798)
Authors: Sara Costa, Viviana Langher, Sabine Pirchio
Citation: The most used method to assess implicit attitudes is the “Implicit Association Test” (IAT; Greenwald et al., 1998), which presents a good reliability (Schnabel et al., 2008) and validity (Nosek et al., 2005; Greenwald et al., 2009).
DOI: doi: 10.3389/fpsyg.2021.712356
Correction: does not cite critique of the construct validity of IATs (https://doi.org/10.1177/1745691619863798)
Authors: Yang, Gengfeng, Zhenzhen, Dongjing
Citation: Studies have found that merely activating the concept of money can increase egocentrism, which can further redirect people's attention toward their inner motivations and needs (Zaleskiewicz et al., 2018) and reduce their sense of connectedness with others (Caruso et al., 2013).
Correction: cite Caruso et al. (2013), but do not cite replication failure by Caruso et al. (2017) (https://journals.sagepub.com/doi/abs/10.1177/0956797617706161)
Authors: Garriy Shteynberg, Theresa A. Kwon, Seong-Jae Yoo, Heather Smith, Jessica Apostle, Dipal Mistry, Kristin Houser
Citation: Money is often described as profane, vulgar, and filthy (Belk & Wallendorf, 1990), yet incidental exposure to money increases the endorsement of the very social systems that render such money meaningful (Caruso et al., 2013).
Correction: cite Caruso et al. (2013), but do not cite replication failure by Caruso et al. (2017) (https://journals.sagepub.com/doi/abs/10.1177/0956797617706161)
Author: Arden Rowell
Citation: In particular, some studies show that encouraging people to think about things in terms of money may measurably change people's thoughts, feelings, motivations, and behaviors. See Eugene M. Caruso, Kathleen D. Vohs, Brittani Baxter & Adam Waytz, Exposure to Money Increases Endorsement of Free-Market Systems and Social Inequality, 142 J. EXPERIMENTAL PSYCH. 301, 301-02, 305 (2013) DOI: https://scholarship.law.nd.edu/ndlr/vol96/iss4/9
Correction: cite Caruso et al. (2013), but do not cite replication failure by Caruso et al. (2017) (https://journals.sagepub.com/doi/abs/10.1177/0956797617706161)
Authors: Anna Jasinenko, Fabian Christandl, Timo Meynhardt
Citation: Caruso et al. (2013) find that exposure to money (which is prevalent in most shopping situations) activates personal tendencies to justify the market system. Furthermore, they find that money exposure also activates general system justification; however, the effect was far smaller than for the activation of MSJ.
Correction: cite Caruso et al. (2013), but do not cite replication failure by Caruso et al. (2017) (https://journals.sagepub.com/doi/abs/10.1177/0956797617706161)
It is well known that many psychology articles report too many significant results because researchers selectively publish results that support their predictions (Francis, 2014; Sterling, 1959; Sterling et al., 1995; Schimmack, 2021). This often leads to replication failures (Open Science Collaboration, 2015).
One way to examine whether a set of studies reported too many significant results is to compare the success rate (i.e., the percentage of significant results) with the mean observed power in studies (Schimmack, 2012). In this video, I illustrate this bias detection method using Vohs et al.’s (2006) Science article “The Psychological Consequences of Money.”
I use this article for training purposes because it reports 9 studies, and a reasonably large number of studies is needed to have good power to detect selection bias. Also, the article is short and the results are straightforward. Thus, students have no problem filling out the coding sheet that is needed to compute observed power (Coding Sheet).
The results show clear evidence of selection bias that undermines the credibility of the reported results (see also TIVA). Although bias tests are available, few researchers use them to protect themselves from junk science, and articles like this one continue to be cited at high rates (683 total, 67 in 2019). A simple way to protect yourself from junk science is to adjust the alpha level to .005 because many questionable practices produce p-values that are just below .05. For example, the lowest p-value in these 9 studies was p = .006. Thus, not a single study was statistically significant with alpha = .005.
Last week I posted a video that provided an introduction to the basic concepts of statistics, namely effect sizes and sampling error. A test statistic, like a t-value, is simply the ratio of the effect size over sampling error. This ratio is also known as a signal to noise ratio. The bigger the signal (effect size), the more likely it is that we will notice it in our study. Similarly, the less noise we have (sampling error), the easier it is to observe even small signals.
In this video, I use the basic concepts of effect sizes and sampling error to introduce the concept of statistical power. Statistical power is defined as the percentage of studies that produce a statistically significant result. When alpha is set to .05, it is the expected percentage of p-values with values below .05.
Statistical power is important to avoid type-II errors; that is, there is a meaningful effect, but the study fails to provide evidence for it. While researchers cannot control the magnitude of effects, they can increase power by lowering sampling error. Thus, researchers should carefully think about the magnitude of the expected effect to plan how large their sample has to be to have a good chance to obtain a significant result. Cohen proposed that a study should have at least 80% power. The planning of sample sizes using power calculation is known as a priori power analysis.
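An a priori power analysis for a two-group comparison can be sketched as follows. This uses the normal approximation for a two-sided test rather than the exact t-distribution, so dedicated power software will give slightly larger numbers (for example, 64 rather than 63 per group for d = .5); the function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate sample size per group for a two-sided, two-group comparison.

    d     -- assumed standardized effect size (Cohen's d)
    power -- desired probability of a significant result
    alpha -- significance criterion
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    z_power = std_normal.inv_cdf(power)          # 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))  # 393, 63, and 25 per group
```

The required sample size grows with the inverse square of the effect size, which is why overly optimistic effect size guesses lead to badly underpowered studies.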
The problem with a priori power analysis is that researchers may fool themselves about effect sizes and conduct studies with insufficient sample sizes. In this case, power will be less than 80%. It is therefore useful to estimate the actual power of studies that are being published. In this video, I show that actual power could be estimated by simply computing the percentage of significant results. However, in reality this approach would be misleading because psychology journals discriminate against non-significant results. This is known as publication bias. Empirical studies show that the percentage of significant results for theoretically important tests is over 90% (Sterling, 1959). This does not mean that mean power of psychological studies is over 90%. It merely suggests that publication bias is present. In a follow-up video, I will show how it is possible to estimate power when publication bias is present. This video is important to understand what statistical power is.
Each year, I am working with undergraduate students on the coding of research articles to examine the replicability and credibility of psychological science (ROP2020). Before students code test-statistics from t-tests or F-tests in results sections, I provide a crash course on inferential statistics (null-hypothesis significance testing). Although some students have taken a basic stats course, the courses often fail to teach a conceptual understanding of statistics and distract students with complex formulas that are treated like a black box that converts data into p-values (or worse, stars that reflect whether p < .05*, p < .01**, or p < .001***).
In this one-hour lecture, I introduce the basic principles of null-hypothesis significance testing using the example of the t-test for independent samples.
I explain that a t-value is conceptually made up of three components, namely the effect size (D = x1 – x2), a measure of the natural variation of the dependent variable (the standard deviation, s), and a measure of the amount of sampling error (simplified, se = 2/sqrt(n1 + n2)).
Moreover, dividing the effect size D by the standard deviation provides the familiar standardized effect size, Cohen’s d = D/s. This means that a t-value corresponds to the ratio of the standardized effect size (d) over the amount of sampling error (se), t = d/se
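The three components can be put together in a few lines. The numbers are invented for illustration, and the simplified formula for sampling error assumes equal group sizes, as in the lecture.

```python
from math import sqrt


def t_value(mean1: float, mean2: float, sd: float, n1: int, n2: int) -> float:
    """t-value as signal-to-noise ratio, using the simplified se = 2 / sqrt(n1 + n2)."""
    D = mean1 - mean2        # unstandardized effect size
    d = D / sd               # standardized effect size (Cohen's d), the signal
    se = 2 / sqrt(n1 + n2)   # sampling error, the noise (assumes n1 == n2)
    return d / se


# Hypothetical data: 5-point mean difference, sd = 10, 50 participants per group
print(t_value(105, 100, 10, 50, 50))    # 2.5
# Same effect with four times the sample size: half the noise, twice the t-value
print(t_value(105, 100, 10, 200, 200))  # 5.0
```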
It follows that a t-value is influenced by two quantities. T-values increase as the standardized (unit-free) effect sizes increase and as the sampling error decreases. The two quantities are sometimes called signal (effect size) and noise (sampling error). Accordingly, the t-value is the signal to noise ratio. I compare the signal and noise to an experiment where somebody is throwing rocks into a lake and somebody has to tell whether a rock was thrown based on the observation of a splash. A study with a small effect and a lot of noise is like trying to detect the splash of a small pebble on a very windy, stormy day where waves are creating a lot of splashes that make it hard to see the small splash made by a pebble. However, if you throw a big rock into the lake, you can see the big splash from the rock even when the wind creates a lot of splashing. If you want to see the splash of a pebble, you need to wait for a calm day without wind. These conditions correspond to a study with a large sample and very little sampling error.
Have a listen and let me know how I am doing. Feel free to ask questions that help me to understand how I can make the introduction to statistics even easier. Too many statistics books and lecturers intimidate students with complex formulas and Greek symbols that make statistics look hard, but in reality it is very simple. Data always have two components. The signal you are looking for and noise that makes it hard to see the signal. The bigger the signal to noise ratio is, the more likely it is that you saw a true signal. Of course, it can be hard to quantify signals and noise and statisticians work hard in getting good estimates of noise, but that does not have to concern users of statistics. As users of statistics we just trust statisticians that they have good (the best) estimates to see how good our data are.
Articles published in peer-reviewed journals are only the tip of the scientific iceberg. Professional organizations want you to believe that these published articles are carefully selected to be the most important and scientifically credible articles. In reality, peer-review is unreliable and invalid, and editorial decisions are based on personal preferences. For this reason, the censoring mechanism is often hidden. Part of the movement towards open science is to make the censoring process transparent.
I therefore post the decision letter and the reviews from JEP:General. I sent my ms “z-curve: an even better p-curve” to this journal because it published two articles on the p-curve method that are highly cited. The key point of my ms. is to point out that the p-curve app produces a “power” estimate of 97% for hand-coded articles by Leif Nelson, while z-curve produces an estimate of 52%. If you are a quantitative scientist, you will agree that this is a non-trivial difference and you are right to ask which of these estimates is more credible. The answer is provided by simulation studies that compare p-curve and z-curve and show that p-curve can dramatically overestimate “power” when the data are heterogeneous (Brunner & Schimmack, 2020). In short, the p-curve app sucks. Let the record show that JEP-General is happy to get more citations for a flawed method. The reason might be that z-curve is able to show publication bias in the original articles published in JEP-General (Replicability Rankings). Maybe Timothy J. Pleskac is afraid that somebody looks at his z-curve, which shows a few too many p-values that are just significant (ODR = 73% vs. EDR = 45%).
Unfortunately for psychologists, statistics is an objective science that can be evaluated using both mathematical proofs (Brunner & Schimmack, 2020) and simulation studies (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). It is just hard for psychologists to follow the science if the science doesn’t agree with their positive illusions and inflated egos.
XGE-2021-3638 Z-curve 2.0: An Even Better P-Curve Journal of Experimental Psychology: General
Dear Dr. Schimmack,
I have received reviews of the manuscript entitled Z-curve 2.0: An Even Better P-Curve (XGE-2021-3638) that you recently submitted to Journal of Experimental Psychology: General. Upon receiving the paper I read the paper. I agree that Simonsohn, Nelson, & Simmons’ (2014) P-Curve paper has been quite impactful. As I read over the manuscript you submitted, I saw there was some potential issues raised that might help help advance our understanding of how to evaluate scientific work. Thus, I asked two experts to read and comment on the paper. The experts are very knowledgeable and highly respected experts in the topical area you are investigating.
Before reading their reviews, I reread the manuscript, and then again with the reviews in hand. In the end, both reviewers expressed some concerns that prevented them from recommending publication in Journal of Experimental Psychology: General. Unfortunately, I share many of these concerns. Perhaps the largest issue is that both reviewers identified a number formal issues that need more development before claims can be made about the z-curve such as the normality assumptions in the paper. I agree with Reviewer 2 that more thought and work is needed here to establish the validity of these assumptions and where and how these assumptions break down. I also agree with Reviewer 1 that more care is needed when defining and working with the idea of unconditional power. It would help to have the code, but that wouldn’t be sufficient as one should be able to read the description of the concept in the paper and be able to implement it computationally. I haven’t been able to do this. Finally, I also agree with Reviewer 1 that any use of the p-curve should have a p-curve disclosure table. I would also suggest ways to be more constructive in this critique. In many places, the writing and approach comes across as attacking people. That may not be the intention. But, that is how it reads.
Given these concerns, I regret to report that that I am declining this paper for publication in Journal of Experimental Psychology: General. As you probably know, we can accept only small fraction of the papers that are submitted each year. Accordingly, we must make decisions based not only on the scientific merit of the work but also with an eye to the potential level of impact for the findings for our broad and diverse readership. If you decide to pursue publication in another journal at some point (which I hope you will consider), I hope that the suggestions and comments offered in these reviews will be helpful.
Thank you for submitting your work to the Journal. I wish you the best in your continued research, and please try us again in the future if you think you have a manuscript that is a good fit for Journal of Experimental Psychology: General.
Timothy J. Pleskac, Ph.D. Associate Editor Journal of Experimental Psychology: General
Reviewer #1: 1. This commentary submitted to JEPG begins presenting a p-curve analysis of early work by Leif Nelson. Because it does not provide a p-curve disclosure table, this part of the paper cannot be evaluated. The first p-curve paper (Simonsohn et al, 2014) reads: “P-curve disclosure table makes p-curvers accountable for decisions involved in creating a reported p-curve and facilitates discussion of such decisions. We strongly urge journals publishing p-curve analyses to require the inclusion of a p-curve disclosure table.” (p.540). As a reviewer I am aligning with these recommendation and am *requiring* a p-curve disclosure table, as in, I will not evaluate that portion of the paper, and moreover I will recommend the paper be rejected unless that analysis is removed, or a p-curve disclosure table is included, and is then evaluated as correctly conducted by the review team in an subsequent round of evaluation. The p-curve disclosure table for the Russ et al p-curve, even if not originally conducted by these authors, should be included as well, with a statement that the authors of this paper have examined the earlier p-curve disclosure table and deemed it correct. If an error exists in the literature we have to fix it, not duplicate it (I don’t know if there is an error, my point is, neither do the authors who are using it as evidence).
2. The commentary then makes arguments about estimating conditional vs unconditional power. While not exactly defined in the article, the authors come pretty close to defining conditional power, I think they mean by it the average power conditional on being included in p-curve (ironically, if I am wrong about the definition, the point is reinforced). I am less sure about what they mean by unconditional power. I think they mean that they include in the population parameter of interest not only the power of the studies included in p-curve, but also the power of studies excluded from it, so ALL studies. OK, this is an old argument, dating back to at least 2015, it is not new to this commentary, so I have a lot to say about it.
First, when described abstractly, there is some undeniable ‘system 1’ appeal to the notion of unconditional power. Why should we restrict our estimation to the studies we see? Isn’t the whole point to correct for publication bias and thus make inferences about ALL studies, whether we see them or not? That’s compelling. At least in the abstract. It’s only when one continues thinking about it that it becomes less appealing. More concretely, what does this set include exactly? Does ‘unconditional power’ include all studies ever attempted by the researcher, does it include those that could have been run but for practical purposes weren’t? does it include studies run on projects that were never published, does it include studies run, found to be significant, but eventually dropped because they were flawed? Does it include studies for which only pilots were run but not with the intention of conducting confirmatory analysis? Does it include studies which were dropped because the authors lost interest in the hypothesis? Does it include studies that were run but not published because upon seeing the results the authors came up with a modification of the research question for which the previous study was no longer relevant? Etc etc). The unconditional set of studies is not a defined set, without a definition of the population of studies we cannot define a population parameter for it, and we can hardly estimate a non-existing parameter. Now. I don’t want to trivialize this point. This issue of the population parameter we are estimating is an interesting issue, and reasonable people can disagree with the arguments I have outlined above (many have), but it is important to present the disagreement in a way that readers understand what it actually entails. An argument about changing the population parameter we estimate with p-curve is not about a “better p-curve”, it is about a non-p-curve. 
A non-p-curve which is better for the subset of people who are interested in the unconditional power, but a WORSE p-curve for those who want the conditional power (for example, it is worse for the goals of the original p-curve paper). For example, the first paper using p-curve for power estimation reads “Here is an intuitive way to think of p-curve’s estimate: It is the average effect size one expects to get if one were to rerun all studies included in p-curve”. So a tool which does not estimate that value, but a different value, it is not better, it is different. The standard deviation is neither better nor worse than the mean. They are different. It would be silly to say “Standard Deviation, a better Mean (because it captures dispersion and the mean does not)”. The standard deviation is better for someone interested in dispersion, and the standard deviation is worse for someone interested in the central tendency. Exactly the same holds for conditional vs unconditional power. (well, the same if z-curve indeed estimated unconditional power, i don’t know if that is true or not. Am skeptical but open minded).
Second, as mentioned above, this distinction of estimating the parameter of the subset of studies included in p-curve vs the parameter of “all studies” is old. I think that argument is seen as the core contribution of this commentary, and that contribution is not close to novel. As the quote above shows, it is a distinction made already in the original p-curve paper for estimating power. And, it is also not new to see it as a shortcoming of p-curve analysis. Multiple papers by Van Assen and colleagues, and by McShane and colleagues, have made this argument. They have all critiqued p-curve on those same grounds.
I therefore think this discussion should improve in the following ways: (i) give credit, and give voice, to earlier discussions of this issue (how is the argument put forward here different from the argument put forward in about a handful of previous papers making it, some already 5 years ago), (ii) properly define the universe of studies one is attempting to estimate power for (i.e., what counts in the set of unconditional power), and (iii) convey more transparently that this is a debate about what is the research question of interest, not of which tool provides the better answer to the same question. Deciding whether one wants to estimate the average power of one or another set of studies is completely fair game of an issue to discuss, and if indeed most readers don’t think they care about conditional power, and those readers use p-curve not realizing that’s what they are estimating, it is valuable to disabuse them of their confusion. But it is not accurate, and therefore productive, to describe this as a statistical discussion, it is a conceptual discussion.
3. In various places the paper reports results from calculations, but the authors have not shared neither the code nor data for those calculations, so these results cannot be adequately evaluated in peer-review, and that is the very purpose of peer-review. This shortcoming is particularly salient when the paper relies so heavily on code and data shared in earlier published work.
Finally, it should be clearer what is new in this paper. What is said here that is not said in the already published z-curve paper and p-curve critique papers?
Reviewer #2: The paper reports a comparison between p-curve and z-curve procedures proposed in the literature. I found the paper to be unsatisfactory, and therefore cannot recommend publication in JEP:G. It reads more like a cropped section from the author’s recent piece in meta-psychology than a standalone piece that elaborates on the different procedures in detail. Because a lot is completely left out, it is very difficult to evaluate the results. For example, let us consider a couple of issues (this is not an exhaustive list):
– The z-curve procedure assumes that z-transformed p-values under the null hypothesis follow a standard Normal distribution. This follows from the general idea that the distribution of p-values under the null-hypothesis is uniform. However, this general idea is not necessarily true when p-values are computed for discrete distributions and/or composite hypotheses are involved. This seems like a point worth thinking about more carefully, when proposing a procedure that is intended to be applied to indiscriminate bodies of p-values. But nothing is said about this, which strikes me as odd. Perhaps I am missing something here.
– The z-curve procedure also assumes that the distribution of z-transformed p-values follows a Normal distribution or a mixture of homoskedastic Normals (distributions that can be truncated depending on the data being considered/omitted). But how reasonable is this parametric assumption? In their recently published paper, the authors state that this is as **a fact**, but provide no formal proof or reference to one. Perhaps I am missing something here. If anything, a quick look at classic papers on the matter, such as Hung et al. (1997, Biometrics), show that the cumulative distributions of p-values under different alternatives cross-over, which speaks against the equal-variance assumption. I don’t think that these questions about parametric assumptions are of secondary importance, given that they will play a major in the parameter estimates obtained with the mixture model.
Also, when comparing the different procedures, it is unclear whether the reported disagreements are mostly due to pedestrian technical choices when setting up an “app” rather than irreconcilable theoretical commitments. For example, there is nothing stopping one from conducting a p-curve analysis on a more fine-grained scale. The same can be said about engaging in mixture modeling. Who is/are the culprit/s here?
Finally, I found that the writing and overall tone could be much improved.
The hallmark of a science is progress. To demonstrate that psychology is a science therefore requires evidence that current evidence, research methods, and theories are better than those in the past. Historic reviews are also needed because it is impossible to make progress without looking back once in a while.
Research on the stability or consistency of personality has a long history that started with the first empirical investigations in the 1930s, but a historic review of this literature is lacking. Few young psychologists interested in personality development may be familiar with Kelly, his work, or his American Psychologist article on “Consistency of the Adult Personality” (Kelly, 1955). Kelly starts his article with some personal observations about stability and change in traits that he observed in colleagues over the years.
Today, traits that are neither physical characteristics nor cognitive abilities are called personality traits, and they are commonly represented in the Big Five model. What have we learned about the stability of personality traits in adulthood from nearly a century of research?
Kelly (1955) reported some preliminary results from his own longitudinal study of personality that he started in the 1930s with engaged couples. Twenty years later, they completed follow-up questionnaires. Figure 6 reported the results for the Allport-Vernon value scales. I focus on these results because they make it possible to compare the 20-year retest correlations to retest correlations over a one-year period.
Figure 6 shows that personality, or at least values, are not perfectly stable. This is easily seen by a comparison of the one-year retest correlations with the 20-year retest correlations. The 20-year retest correlations are always lower than the one-year retest correlations. Individual differences in values change over time. Some individuals become more religious and others become less religious, for example. The important question is how much individuals change over time. To quantify change and stability it is important to specify a time interval because change implies lower retest correlations over longer retest intervals. Although the interval is arbitrary, a period of 1 year or 10 years can be used to quantify and compare stability and change of different personality traits. To do so, we need a model of change over time. A simple model is Heise’s (1969) autoregressive model that assumes a constant rate of change.
Take religious values as an example. Here we have two observed retest correlations, r(y1) = .75 and r(y20) = .60. Both correlations are attenuated by random measurement error. To correct for unreliability, we need to solve two equations with two unknowns, the rate of change and reliability.
.75 = rate^1 * rel
.60 = rate^20 * rel
With some rusty high-school math, I was able to solve these equations for the rate.
rate = (.60/.75)^(1/(20-1)) = .988
The implied 10-year stability is .988^10 = .886. The estimated reliability is .75 / .988 = .759.
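The same calculation can be sketched in a few lines of Python. This is only an illustration of the constant-rate model described above; the function name solve_heise and its parameters are hypothetical and do not come from Heise (1969) or Kelly (1955).

```python
def solve_heise(r_short, r_long, t_short, t_long):
    """Solve r_t = rel * rate**t for (rate, rel) from two retest correlations.

    Assumes a constant annual rate of change and the same reliability at
    both retests (hypothetical helper, for illustration only).
    """
    if t_long <= t_short:
        raise ValueError("t_long must be greater than t_short")
    if not (0 < r_long <= r_short <= 1):
        raise ValueError("expected 0 < r_long <= r_short <= 1")
    # Dividing the two equations cancels reliability:
    # r_long / r_short = rate ** (t_long - t_short)
    rate = (r_long / r_short) ** (1.0 / (t_long - t_short))
    # Plug the rate back into the short-interval equation to get reliability.
    rel = r_short / rate ** t_short
    return rate, rel


# Religious values: 1-year retest r = .75, 20-year retest r = .60
rate, rel = solve_heise(r_short=0.75, r_long=0.60, t_short=1, t_long=20)
print(f"1-year stability:  {rate:.3f}")
print(f"10-year stability: {rate ** 10:.3f}")
print(f"reliability:       {rel:.3f}")
```

Carrying the unrounded rate forward gives a 10-year stability of about .889, slightly above the .886 obtained from the rounded rate of .988.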
Table 1 shows the results for all six values.
Table 1 Stability and Change of Allport-Vernon Values
The results show that the 1-year retest correlations are very similar to the reliability estimates of the value measure. After correcting for unreliability, the 1-year stability is extremely high, with stability estimates ranging from .96 for social values to .99 for religious values. The small differences in 1-year stabilities become notable only over longer time periods. The 10-year stability estimates range from .68 for social values to .90 for religious values.
Kelly reported results for two personality constructs that were measured with the Bernreuter personality questionnaire, namely self-confidence and sociability.
The implied stability of these personality traits is similar to the stability of values.
Kelly’s results published in 1955 are based on a selective sample during a specific period of time that included the Second World War. It is therefore possible that studies with other populations during other time periods produce different results. However, the results are more consistent than different across different studies.
The first article with retest correlations for different time intervals of reasonable length was published in 1941 by Mason N. Crook. The longest retest interval was six years and six months. Figure 1a in the article plotted the retest correlations as a function of the retest interval.
Table 2 shows the retest correlations and reveals that some of them are based on extremely small sample sizes. The 5-month retest is based on only 30 participants whereas the 8-month retest is based on 200 participants. Using the latter estimate for the short-term stability, it is possible to estimate the 1-year and 10-year rates using the formula given above.
The 1-year stability estimates are all above .9, except for the retest correlation that is based on only N = 18 participants. Given the small sample sizes, variability in estimates is mostly random noise. I computed a weighted average that takes both sample size and retest interval into account because longer time intervals provide better information about the actual rate of change. The estimated 1-year stability is r = .96, which implies a 10-year stability of .65. This is a bit lower than Kelly’s estimates, but this might just be sampling error. It is also possible that Crook’s results underestimate long-term stability because the model assumes a constant rate of change. It is possible that this assumption is false, as we will see later.
Crook also provided a meta-analysis that included other studies and suggested a hierarchy of consistency.
Accordingly, personality traits like neuroticism are less stable than cognitive abilities, but more stable than attitudes. As the Figure shows, empirical support for this hierarchy was limited, especially for estimates of the stability of attitudes.
Several decades later, Conley (1984) reexamined this hierarchy of consistency with more data. He was also the first to provide quantitative stability estimates that correct for unreliability. The meta-analysis included more studies and, more importantly, studies with long retest intervals. The longest retest interval was 45 years (Conley, 1983). After correcting for unreliability, the one-year stability was estimated to be r = .98, which implies a stability of r = .81 over a period of 10 years and r = .36 over 50 years.
Using the published retest correlations for studies with sample sizes greater than 100, I obtained a one-year stability estimate of r = .969 for neuroticism and r = .986 for extraversion. These differences may reflect differences in stability or could just be sampling error. The average reproduces Conley’s (1984) estimate of r = .98 (r = .978).
To summarize, decades of research had produced largely consistent findings that the short-term (1-year) stability of personality traits is well above r = .9 and that it takes long time-periods to observe substantial changes in personality.
The next milestone in the history of research on personality stability and change was Roberts and DelVecchio’s (2000) influential meta-analysis that is featured in many textbooks and review articles (e.g., Caspi, Roberts, & Shiner, 2005; McAdams & Olson, 2010).
Roberts and DelVecchio’s literature review mentions Conley’s (1984) key findings. “When dissattenuated, measures of extraversion were quite consistent, averaging .98 over a 1-year period, approximately .70 over a 10-year period, and approximately .50 over a 40-year period” (p. 7).
The key finding of Roberts and DelVecchio’s meta-analysis was that age moderates stability of personality. As shown in Figure 1, stability increases with age. The main limitation of Figure 1 is that it shows average retest correlations that are neither tied to a specific time interval nor corrected for measurement error. Thus, the finding that retest correlations in early and middle adulthood (22-49) average around .6 provides no information about the stability of personality in this age group.
Most readers of Roberts and DelVecchio (2000) fail to notice a short section that examines the influence of time interval on retest correlations.
“On the basis of the present data, the average trait consistency over a 1-year period would be .55; at 5 years, it would be .52; at 10 years, it would be .49; at 20 years, it would be .41; and at 40 years, it would be .25” (Roberts & DelVecchio, 2000, p. 16).
Using the aforementioned formula to correct for measurement error shows that Roberts and DelVecchio’s meta-analysis replicates Conley’s results, with a 1-year stability of r = .983.
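With five intervals instead of two, the constant-rate model is overdetermined, and how the five values are combined is a choice. The sketch below assumes an unweighted least-squares fit on the log scale, where log(r) = log(rel) + t * log(rate); this is one possible approach and not necessarily the one behind the estimate above, so it need not reproduce .983 exactly. The function name fit_constant_rate is hypothetical.

```python
import math


def fit_constant_rate(intervals, correlations):
    """Least-squares fit of log(r) = log(rel) + t * log(rate).

    Returns (rate, rel). Unweighted fit, assumed here for illustration.
    """
    logs = [math.log(r) for r in correlations]
    n = len(intervals)
    mean_t = sum(intervals) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in intervals)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(intervals, logs))
    slope = sxy / sxx                      # log of the annual rate
    intercept = mean_y - slope * mean_t    # log of the reliability
    return math.exp(slope), math.exp(intercept)


# Observed (uncorrected) consistencies quoted from Roberts & DelVecchio (2000, p. 16)
years = [1, 5, 10, 20, 40]
observed = [0.55, 0.52, 0.49, 0.41, 0.25]

rate_fit, rel_fit = fit_constant_rate(years, observed)
print(f"fitted 1-year stability: {rate_fit:.3f}")
print(f"fitted reliability:      {rel_fit:.3f}")
```

Under this assumption the fitted 1-year stability comes out at roughly .98, in the same range as the estimates from Conley (1984).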
Unfortunately, review articles often mistake these observed retest correlations as estimates of stability. For example, McAdams and Olson (2010) write “Roberts & DelVecchio (2000) determined that stability coefficients for dispositional traits were lowest in studies of children (averaging 0.41), rose to higher levels among young adults (around 0.55), and then reached a plateau for adults between the ages of 50 and 70 (averaging 0.70)” (p. 521) and fail to mention that these stability coefficients are not corrected for measurement error, which is a common mistake (Schmidt, 1996).
Roberts and DelVecchio’s (2000) article has shaped contemporary views that personality is much more malleable than the data suggest. A Twitter poll showed that only 11% of respondents guessed the right answer that the one-year stability is above .9, whereas 43% assumed the upper limit is r = .7. With r = .7 over a 1-year period, the stability over a 10-year period would be only r = .03. Thus, these respondents essentially assumed that personality has no stability over a 10-year period. More likely, respondents simply failed to take into account how high short-term stability has to be to allow for moderately high long-term stability.
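The point about short-term versus long-term stability can be made concrete by inverting the constant-rate model: given a target stability over some number of years, what 1-year stability is required? The following is a minimal sketch under that model; the function name required_annual_stability and the target values are chosen for illustration only.

```python
def required_annual_stability(target, years):
    """1-year stability needed to reach `target` stability after `years`,
    assuming a constant annual rate of change (no measurement error)."""
    return target ** (1.0 / years)


# Forward direction: a 1-year stability of .7 compounds to almost nothing.
ten_year_from_point_seven = 0.7 ** 10
print(f"10-year stability implied by r = .7: {ten_year_from_point_seven:.3f}")

# Inverse direction: illustrative 10-year targets and the 1-year
# stability each one requires.
needed = {t: required_annual_stability(t, 10) for t in (0.5, 0.7, 0.8)}
for target, annual in needed.items():
    print(f"10-year r = {target:.1f} requires 1-year r = {annual:.3f}")
```

Even a 10-year stability of only .5 requires a 1-year stability above .9 under this model.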
The misinformation about personality stability is likely due to vague, verbal statements and the use of effect sizes that ignore the length of the retest interval. For example, Atherton, Grijalva, Roberts, and Robins (2021) published an article with a retest interval of 18 years. The abstract describes the results as “moderately-to-high stability over a 20-year period” (p. 841). Table 1 reports the observed correlations that control for random measurement error using a latent variable model with item-parcels as indicators.
The next table shows the results for the 4-year retest interval in adolescence and the 20-year retest interval in adulthood along with the implied 1-year rates. Consistent with Roberts and DelVecchio’s meta-analysis, the 1-year stability in adolescence is lower, r = .908, than in adulthood, r = .976.
However, even in adolescence the 1-year stability is high. Most important, the 1-year rate for adults is consistent with estimates in Conley’s (1984) meta-analysis and the first study in 1941 by Crook, and even Roberts and DelVecchio’s meta-analysis when measurement error is taken into account. However, Atherton et al. (2021) fail to cite historic articles and fail to mention that their results replicate nearly a century of research on personality stability in adulthood.
Stable Variance in Personality
So far, I have used a model that assumes a fixed rate of change. The model also assumes that there are no stable influences on personality. That is, all causes of variation in personality can change and given enough time will change. This model implies that retest correlations eventually approach zero. The only reason why this may not happen is that human lives are too short to observe retest correlations of zero. For example, with r = .98 over a 1-year period, the 100-year retest correlation is still r = .13, but the 200-year retest correlation is r = .02.
With more than two retest intervals, it is possible to see that this model may not fit the data. If there is no measurement error, the correlation from t1 to t3 should equal the product of the correlations for the two shorter lags, t1 to t2 and t2 to t3. If the t1-t3 correlation is larger than this model predicts, the data suggest the presence of some stable causes that do not change over time (Anusic & Schimmack, 2016; Kenny & Zautra, 1995).
Take the data from Atherton et al. (2021) as an example. The average retest correlation from t1 (beginning of college) to t3 (age 40) was r = .55. The correlation from beginning to end of college was r = .68, and the correlation from end of college to age 40 was r = .62. We see that .55 > .68 * .62 = .42.
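This comparison is easy to script. The sketch below only checks the product rule implied by a pure autoregressive model against the observed long-interval correlation; it does not estimate the amount of stable variance, which requires a full model such as the ones cited above. The function name predicted_long_lag is hypothetical.

```python
def predicted_long_lag(r12, r23):
    """t1-t3 correlation predicted by a pure autoregressive model
    without measurement error: the product of the two shorter lags."""
    return r12 * r23


# Average latent retest correlations reported for Atherton et al. (2021)
r12 = 0.68  # beginning to end of college
r23 = 0.62  # end of college to age 40
r13 = 0.55  # beginning of college to age 40

predicted = predicted_long_lag(r12, r23)
excess = r13 - predicted

print(f"predicted t1-t3 correlation: {predicted:.2f}")
print(f"observed t1-t3 correlation:  {r13:.2f}")
print(f"excess over prediction:      {excess:.2f}")
```

A positive excess is what the text describes as a sign of stable causes that the constant-rate model cannot account for.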
Anusic and Schimmack (2016) estimated the amount of stable variance in personality traits to be over 50%. This estimate may be revised in the future when better data become available. However, models with and without stable causes differ mainly in predictions over long time intervals where few data are currently available. The modeling has little influence on estimates of stability over time periods of less than 10 years.
This historic review of research on personality change and stability demonstrated that nearly a century of research has produced consistent findings. Unfortunately, many textbooks misrepresent this literature and cite evidence that does not correct for measurement error.
In their misleading but influential meta-analysis, Roberts and DelVecchio concluded that “the average trait consistency over a 1-year period would be .55; at 5 years, it would be .52; at 10 years, it would be .49; at 20 years, it would be .41; and at 40 years, it would be .25” (p. 16).
The correct(ed for measurement error) estimates are much higher. The present results suggest that consistency over a 1-year period would be .98, at 5 years it would be .90, at 10 years it would be .82, at 20 years it would be .67, and at 40 years it would be .45. Long-term stability might even be higher if stable causes contribute substantially to variance in personality (Anusic & Schimmack, 2016).
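The corrected values listed above follow from compounding a 1-year stability of .98 under the constant-rate model. A minimal sketch, with the hypothetical helper project_stability:

```python
def project_stability(annual_rate, years):
    """Stability after `years` under a constant annual rate of change."""
    return annual_rate ** years


annual_rate = 0.98
projection = {y: project_stability(annual_rate, y) for y in (1, 5, 10, 20, 40)}

for years, r in projection.items():
    print(f"{years:>2} years: r = {r:.2f}")
```

The same function also gives the 100-year and 200-year values mentioned earlier (about .13 and .02), which shows how slowly correlations decay at this rate.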
The evidence of high stability in personality (yes, I think r = .8 over 10 years warrants the label high) has important practical and theoretical implications. First of all, stability of personality in adulthood is one of the few facts that students at the beginning of adulthood may find surprising. It may stimulate self-discovery and taking personality into account in major life decisions. Stability of personality also means that personality psychologists need to focus on the factors that cause stability in personality, but psychologists have traditionally focused on change because statistical tools are designed to focus on differences and deviations rather than invariances. However, just because the Earth is round or the speed of light is constant, natural sciences do not ignore these fixtures of life. It is time for personality psychologists to do the same. The results also have a (sobering) message for researchers interested in personality change. Real change takes time. Even a decade is a relatively short period to observe notable changes, which are needed to find predictors of change. This may explain why there are currently no replicable findings of predictors of personality change.
So, what is the stability of personality over a one-year period in adulthood after taking measurement error into account? The correct answer is that it is greater than .9. You probably didn’t know this before reading this blog post. This does of course not mean that we are still the same person after one year or 10 years. However, the broader dispositions that are measured with the Big Five are unlikely to change in the near future for you, your spouse, or co-workers. Whether this is good or bad news depends on you.
Many models of science postulate a feedback loop between theories and data. Theories stimulate research that tests theoretical models. When the data contradict the theory and nobody can find flaws with the data, theories are revised to accommodate the new evidence. In reality, many sciences do not follow this idealistic model. Instead of testing theories, researchers try to accumulate evidence that supports their theories. In addition, evidence that contradicts the theory is ignored. As a result, theories never develop. These degenerative theories have been called paradigms. Psychology is filled with paradigms. One paradigm is the personality development paradigm. According to this paradigm, personality changes throughout adulthood towards the personality of a mature adult (emotionally stable, agreeable, and conscientious; Caspi, Roberts, & Shiner, 2005).
Many findings contradict this paradigm, but these findings are often ignored by personality development researchers. For example, a recent article on personality development (Zimmermann et al., 2021) claims that there is broad evidence for substantial rank-order and mean-level changes, citing outdated references from 2000 (Roberts & DelVecchio, 2000) and 2006 (Roberts et al., 2006). It is not difficult to find more recent studies that challenge these claims based on newer evidence and better statistical analyses (Anusic & Schimmack, 2016; Costa et al., 2019). It is symptomatic of a paradigm that findings that do not fit the personality development paradigm are ignored.
Another symptom of paradigmatic research is that interpretations of research findings do not fit the data. Zimmermann et al. (2021) conducted an impressive study of N = 3,070 students’ personality over the course of a semester. Some of these students stayed at their university and others went abroad. The focus of the article was to examine the potential influence of spending time abroad on personality. The findings are summarized in Table 1.
The key prediction of the personality development paradigm is that neuroticism decreases with age and that agreeableness and conscientiousness increase with age. This trend might be accelerated by spending time abroad, but it is also predicted for students who stay at their university (Robins et al., 2001).
The data do not support this prediction. In the two control groups, neither conscientiousness (d = -.11, d = -.02) nor agreeableness (d = -.02, .00) increased, and neuroticism increased (d = .08, .02). The group of students who were waiting to go abroad, but stayed during the study period, also showed no increase in conscientiousness (d = -.22, -.02) or agreeableness (d = -.16, .00), but showed a small decrease in neuroticism (d = -.08, -.01). The group that went abroad showed small increases in conscientiousness (d = .03, .09) and agreeableness (d = .14, .00), and a small decrease in neuroticism (d = -.14, d = .00). All of these effect sizes are very small, which may be due to the short time period. A semester is simply too short to see notable changes in personality.
These results are then interpreted as being fully consistent with the personality development paradigm.
A more accurate interpretation of these findings is that the effects of spending a semester abroad on personality are very small (d ~ .1) and that a semester is too short to discover changes in personality traits. The small effect sizes in this study are not surprising given the finding that even changes over a decade are no larger than d = .1 (Graham et al., 2020; also not cited by Zimmermann et al., 2021).
In short, the personality development paradigm is based on the assumption that personality changes substantially. However, empirical studies show much stronger evidence of stability, and this evidence is often not cited by prisoners of the personality development paradigm. It is therefore necessary to fact check articles on personality development because the abstracts and discussion sections often do not match the data.