All posts by Ulrich Schimmack

About Ulrich Schimmack

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

What would Cohen say to 184 Significance Tests in 1 Article

I was fortunate enough to read Jacob Cohen’s articles early in my career, which helped me avoid many of the issues that plague psychological science. One of his important lessons was that it is better to test a few hypotheses (or, better, one) in one large sample (Cohen, 1990) than to conduct many tests in small samples.

The reason is simple. Even if a theory makes a correct prediction, sampling error may produce a non-significant result, especially in small samples where sampling error is large. This type of error is known as a type-II error, beta, or a false negative. The probability of obtaining the desired and correct outcome of a significant result when a hypothesis is true is called power. The problem with testing multiple hypotheses is that the cumulative or total power of finding evidence for all correct hypotheses decreases with the number of tests. Even if a single test has 80% power (i.e., the probability of a significant result for a correct hypothesis is 80 percent), the probability of providing evidence for 10 correct hypotheses is only .8^10 = .11 (11%). The expected value is that 2 of the 10 tests produce a type-II error (Schimmack, 2012).
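The arithmetic in this paragraph can be checked with a few lines of code. This is only a minimal sketch that assumes the tests are independent and all have the same power; the function names are illustrative and do not come from the article or the spreadsheet.

```python
def prob_all_significant(power: float, n_tests: int) -> float:
    """Probability that every one of n independent tests is significant."""
    return power ** n_tests


def expected_type2_errors(power: float, n_tests: int) -> float:
    """Expected number of non-significant results among n true hypotheses."""
    return n_tests * (1 - power)


if __name__ == "__main__":
    # 10 true hypotheses, each tested with 80% power
    print(round(prob_all_significant(0.80, 10), 2))   # about .11, i.e. 11%
    print(round(expected_type2_errors(0.80, 10), 2))  # about 2 type-II errors
```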

Cohen (1962) also noted that the average power of statistical tests is well below 80%. For a medium/average effect size, power was around 50%. Now imagine that a researcher tests 10 true hypotheses with 50% power. The expected value is that 5 tests produce a significant result (p < .05) and 5 tests produce a type-II error (p > .05). The interpretation of the article will focus on the significant results, but they were selected basically by a coin flip. The next study will produce a different set of 5 significant results.
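To see how much the number of significant results can vary from one set of 10 tests to the next, the binomial distribution is a convenient sketch. It assumes independent tests with identical power, which is a simplification, and the function name is illustrative.

```python
from math import comb


def prob_k_significant(n_tests: int, k: int, power: float) -> float:
    """Binomial probability of exactly k significant results in n tests."""
    return comb(n_tests, k) * power ** k * (1 - power) ** (n_tests - k)


if __name__ == "__main__":
    # 10 true hypotheses, each tested with 50% power
    for k in range(11):
        print(k, round(prob_k_significant(10, k, 0.5), 3))
```

Under these assumptions, exactly 5 significant results is the single most likely outcome, but it occurs in only about a quarter of such articles; most of the time the count is something else.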

To avoid type-II errors, researchers could conduct a priori power analyses to ensure that they have enough power. This is rarely done, with the explanation that a priori power analysis requires knowledge of the population effect size, which is unknown. It is nevertheless possible to estimate the typical power of studies by keeping track of the percentage of significant results. Because power determines the rate of significant results, the rate of significant results is an estimate of average power. The main problem with this simple method of estimating power is that researchers often do not report all of their results. Especially before the replication crisis became apparent, psychologists tended to publish only significant results. As a result, it is largely unknown how much power actual studies in psychology have and whether power has increased since Cohen (1962) estimated it to be around 50%.

Here I illustrate a simple way to estimate actual power of studies with a recent multi-study article that reported a total of 184 significance tests (more were reported in a supplement, but were not coded)! Evidently, Cohen’s important insights remain neglected, especially in journals that pride themselves on rigorous examination of hypotheses (Kardas, Kumar, & Epley, 2021).

Figure 2 shows the first rows of the coding spreadsheet (Spreadsheet).

Each row shows one specific statistical test. The column “H0 rejected” reflects how the authors interpreted a result. Broadly, this decision is based on the p < .05 rule, but sometimes authors are willing to treat values just above .05 as sufficient evidence, which is often called marginal significance. The column p < .05 strictly follows the p < .05 rule. The averages in the top row show that there are 77% significant results using the authors’ rules and 71% using the p < .05 rule. This shows that 6% of the p-values were interpreted as marginally significant.

All test-values or point estimates with confidence intervals are converted into exact two-sided p-values. The two-sided p-values are then converted into z-scores using the inverse normal formula; z = -qnorm(p/2). Observed power is then estimated for the standard criterion of significance; alpha = .05, which corresponds to a z-score of 1.96. The formula for observed power is pnorm(z, 1.96). The top row shows that mean observed power is 69%. This is close to the 71% of significant results with the strict p < .05 rule, but a bit lower than the 77% when marginally significant results are included. This simple comparison shows that marginally significant results inflate the percentage of significant results.
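The two conversions can be sketched outside of R as well. The snippet below uses Python's standard library as a stand-in for the R calls: -qnorm(p/2) for the z-score and pnorm(z, 1.96) for observed power. The function names are illustrative, and the spreadsheet itself may be set up differently.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()                      # mean 0, sd 1
Z_CRIT = STD_NORMAL.inv_cdf(1 - 0.05 / 2)      # about 1.96 for alpha = .05


def p_to_z(p_two_sided: float) -> float:
    """Absolute z-score for a two-sided p-value (R: -qnorm(p / 2))."""
    return -STD_NORMAL.inv_cdf(p_two_sided / 2)


def observed_power(z: float, z_crit: float = Z_CRIT) -> float:
    """Observed power for a z-score (R: pnorm(z, z_crit))."""
    return NormalDist(mu=z_crit).cdf(z)


if __name__ == "__main__":
    for p in (0.20, 0.05, 0.01, 0.001):
        z = p_to_z(p)
        print(p, round(z, 2), round(observed_power(z), 2))
```

A result with p exactly equal to .05 has a z-score of 1.96 and therefore an observed power of 50%, which is why just-significant results contribute so little to the mean observed power.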

The inflation column keeps track of the consistency between the outcome of a significance test and the power estimate. When power is practically 1, a significant result is expected and inflation is zero. However, when power is only 60%, there is a 40% chance of a type-II error and authors were lucky if they got a significant result. This can happen in a single test, but not in the long run. Average inflation is a measure of how lucky authors were if they got more significant results than the power of their studies allows. Using the authors’ 77% success rate and estimated power of 69%, we have an inflation of 8%. This is a small bias, and we already saw that the interpretation of marginal results accounts for most of it.

The last column is called the Replication Index (R-Index). It simply subtracts the inflation from the observed power estimate. The reason is that observed power is an inflated estimate of power when there are too many significant results. The R-Index is called an index because the formula is just an approximate correction for selection for significance. Later I show the results with a better method. However, the Index can clearly distinguish between junk science (R-Index below 50) and credible evidence. Based on the present results, the R-Index of 62 shows that the article reported some credible findings. Moreover, the R-Index now underestimates power because the rate of p-values below .05 is consistent with observed power. The inflation is just due to the interpretation of marginal results as significant. In short, the main conclusion from this simple analysis of test statistics in a single article is that the authors conducted studies with an average power of about 70%. This is expected to produce type-II errors, sometimes with p-values close to .05 and sometimes with p-values well above .1. This could mean that nearly a quarter of the published results are type-II errors.
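The computation described in the last two paragraphs is short enough to write out. This is a sketch with illustrative function names, fed with the rounded percentages quoted in the text; with those rounded inputs it returns 61 rather than the reported 62, a gap that presumably reflects rounding of the inputs (an assumption, since the unrounded spreadsheet values are not shown here).

```python
def inflation(success_rate: float, mean_observed_power: float) -> float:
    """Share of significant results beyond what observed power predicts."""
    return success_rate - mean_observed_power


def r_index(success_rate: float, mean_observed_power: float) -> float:
    """R-Index: observed power minus inflation."""
    return mean_observed_power - inflation(success_rate, mean_observed_power)


if __name__ == "__main__":
    # Rounded values from the text: 77% success rate, 69% mean observed power
    print(round(inflation(0.77, 0.69), 2))
    print(round(r_index(0.77, 0.69), 2))
    # With the strict p < .05 rule (71%), inflation almost disappears
    print(round(r_index(0.71, 0.69), 2))
```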

But what about type-I errors?

Cohen was concerned that many underpowered studies fail to reject false null hypotheses. However, the replication crisis shifted the focus from false-negative results to false-positive results. An influential article by Simmons et al. (2011) suggested that many if not most published results might be false positives. The authors also developed a statistical tool called p-curve that examines whether a set of significant results is entirely based on false positive results. The next figure shows the output of the p-curve app for the 130 significant results (only significant results are considered because p-values greater than .05 cannot be false positives).

The graph shows that there are a lot more p-values below .01 (78%) than p-values between .04 and .05 (2%). This distribution of p-values is inconsistent with the hypothesis that all significant results are false positives. In addition, the program estimates that the average power of the 130 studies with significant results is 99%! As a result, there can be no false positives that would produce an estimate of 5% power. It is noteworthy that the p-curve analysis did not spot the inflation of significant results by interpreting marginally significant results because these results are omitted from the p-curve analysis. It is rather unlikely that the average power of studies is 99%. In fact, simulation studies have shown that the power estimates of p-curve are often inflated when studies are heterogeneous (Brunner, 2018; Brunner & Schimmack, 2020). The p-curve authors are aware of this bug, but have done nothing to fix it (Datacolada, 2018).

A better statistical method to analyze p-values is z-curve, which relies on the z-scores that were obtained from the p-values in the spreadsheet. However, the z-curve package for R can also read p-values. The next Figure shows a histogram of all 184 (significant and non-significant) values up to a value of 6. Values over 6 are not shown and are all treated as studies with perfect power.

The expected replication rate corresponds to the power estimate in p-curve. It is notably lower than 99% and the 95%CI excludes a value of 99%. This finding simply shows once again that p-curve estimates are inflated.

The observed discovery rate is simply the same percentage that was computed on the spreadsheet using a strict p < .05 rule. The expected discovery rate is an estimate of the average power of all studies, including those with non-significant results, corrected for any potential inflation. It is 62%, which matches the R-Index in the spreadsheet.

The comparison of the observed discovery rate of 71% and the expected discovery rate of 62% suggests that there is some overreporting of significant results. However, the 95%CI around the EDR estimate ranges from 27% to 88%. Thus, sampling error alone may explain this discrepancy.

An EDR of 62% implies that only a small number of significant results can be false positives. The point estimate is just 2%, but the 95%CI allows for up to 14% false positives. Thus, the reported results are unlikely to be false positives, but effect sizes could be inflated because selection for significance with modest power inflates effect size estimates.

There is also notable evidence of heterogeneity. The distribution of z-scores is much flatter than a standard normal distribution that is expected if all studies had the same power. This means that some results might be more credible than others. Therefore I conducted some moderator analyses.

One key hypothesis in the article was that shallow and deep conversations differ in important ways. Several studies tested this by comparing shallow and deep conversations. Fifty-four analyses included a contrast between shallow and deep conversations as a main effect or in an interaction. The expected replication rate is unchanged. The expected discovery rate is a bit higher, but surprisingly, the observed discovery rate is lower. Visual inspection of the z-curve plot shows an unusually high number of marginally significant results. This is further evidence to distrust marginally significant results. However, overall these results suggest that shallow and deep conversations differ.

Several analyses tested mediation, which can require large samples to have adequate power. Not surprisingly, the 39 mediation tests have only a replication rate of 53%. There is also some suggestion of bias, with an observed discovery rate of 51% and an expected discovery rate of only 25%, but the 95%CI around the point estimate is wide and includes 51%. The low expected discovery rate implies that the false discovery risk is 16%, which is unacceptably high.

One solution to the high false discovery risk is to lower the criterion for significance. The next conventional level is alpha = .01. The next figure shows the results for this criterion value (the red solid line has moved to z = 2.58).

Now the observed discovery rate is in line with the expected discovery rate (28% vs. 27%) and the false discovery risk has been lowered to 3%. However, the expected replication rate (for alpha = .01) is only 36%. Thus, follow-up studies need to increase sample sizes to replicate these mediation effects.
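The false discovery risk figures quoted for the mediation tests are consistent with Soric's (1989) upper bound on the false discovery rate, computed from the expected discovery rate and alpha. Treating that bound as the formula behind these numbers is an assumption, and the function name below is illustrative.

```python
def max_false_discovery_risk(edr: float, alpha: float = 0.05) -> float:
    """Soric's upper bound on the false discovery rate.

    edr   : expected discovery rate (average power of all tests), 0 < edr <= 1
    alpha : significance criterion
    """
    return (1 / edr - 1) * (alpha / (1 - alpha))


if __name__ == "__main__":
    # Mediation tests: EDR = 25% at alpha = .05
    print(round(max_false_discovery_risk(0.25, 0.05), 2))
    # Same tests with alpha = .01, where the EDR is 27%
    print(round(max_false_discovery_risk(0.27, 0.01), 2))
```

The bound grows quickly as the discovery rate falls, which is why the low expected discovery rate of the mediation tests translates into a high false discovery risk, and why a stricter alpha brings it back down.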


A post-hoc power-analysis of this recent article shows that psychologists still have not learned Cohen’s lesson that he shared in 1990 (more than 30 years ago). Conducting many significance tests with modest statistical power produces a confusing pattern of significant and non-significant results that is strongly influenced by sampling error. Rather than reporting results of individual studies, the authors should have reported meta-analytic results for tests of the same hypothesis. However, to end on a positive note, the studies are not p-hacked and the risk of false positives is low. Thus, the results provide some credible findings that can be used to conduct confirmatory tests of the hypothesis that deeper conversations are more awkward, but also more rewarding. I hope these analyses show that a deep dive into the statistical results reported in an article can also be rewarding.

Citation Watch

Good science requires not only open and objective reporting of new data; it also requires unbiased review of the literature. However, there are no rules and regulations regarding citations, and many authors cherry-pick citations that are consistent with their claims. Even when studies have failed to replicate, original studies are cited without citing the replication failures. In some cases, authors even cite original articles that have been retracted. Fortunately, it is easy to spot these acts of unscientific behavior. Here I am starting a project to list examples of bad scientific behaviors. Hopefully, more scientists will take the time to hold their colleagues accountable for ethical behavior in citations. They can even do so by posting anonymously on the PubPeer comment site.

Table of Incorrect Citations, by Entry Date
Entry Date: 21/10/27
Authors: Jürgen Kornmeier, Kriti Bhatia, Ellen Joos
Year: 2021
Citation: In the present paradigm, it is of course not difficult to comprehend the direct influence from past percepts of a disambiguated lattice figure on the perception of a highly similar but ambiguous lattice variant. In other precognition paradigms, such as some of those used in the experiments of the seminal Bem paper [85], the potential role of the perceptual history is not as directly comprehensible as in the present study – which does not necessarily rule it out.
Correction: Does not cite evidence that Bem used questionable research practices to produce these results, and that the results failed to replicate.
Entry Date: 21/10/26
Authors: B. Keith Payne, Jason W. Hannay
Year: 2021
Citation: One of the most important contributions from psychological science is the concept of implicit bias. Implicit bias refers to positive or negative mental associations cued spontaneously by social groups. It is measured using cognitive tasks that test how those associations facilitate, interfere with, or otherwise bias task performance [5,6]. Many studies suggest that implicit bias is widespread, even among people who explicitly endorse egalitarian attitudes [7,8].
Others argue that implicit bias is a stable trait-like construct, and that context effects or temporal fluctuations reflect only measurement error [50,51].
Correction: This quote and many other citations in this article fail to mention that the concept of implicit bias is controversial and lacks strong empirical support. There are many critical articles to cite, but my own criticism of the construct validity of implicit measures references most of them. Another article directly criticizes Payne and is not cited. The authors cite my article [51], but fail to mention that it also contains evidence to support the claim that implicit racial bias measures have only modest convergent validity with explicit racism measures and very little discriminant validity.
Entry Date: 21/10/25
Authors: Cassandra Baldwin, Katie E. Garrison, Roy F. Baumeister & Brandon J. Schmeichel
Year: 2021
Citation: Research has found that the capacity for executive control may work as if it depended on a limited resource. Effortful acts of control consume some of this resource, resulting in a state known as ego depletion (Baumeister et al., 1998; Muraven & Baumeister, 2000).
DOI: 10.1080/15298868.2021.1888787
Correction: Does not cite the meta-analysis that shows publication bias and no evidence for the effect. Also does not cite two failed replication attempts in major RRR.
Entry Date: 21/10/25
Authors: Liad Uziel, Roy F. Baumeister, and Jessica L. Alquist
Year: 2021
Citation: Furthermore, temporary reduction in self-control (following laboratory manipulations or activities such as alcohol consumption) often causes an increase in careless and impulsive acts (Baumeister et al., 2007; Hagger et al., 2010).
Correction: Cite an outdated meta-analysis that did not control for publication bias and fail to cite an updated meta-analysis that shows clear evidence of publication bias and no evidence for an effect.
Entry Date: 21/10/20
Authors: Nicole C. Nelson, Julie Chung, Kelsey Ichikawa, and Momin M. Malik
Year: 2021
Citation: The second event Earp and Trafimow point to is the publication of psychologist Daryl Bem’s (2011) paper “Feeling the future,” which presented evidence suggesting that people could anticipate evocative stimuli before they actually happened (such as the appearance of an erotic image).
DOI: 10.1177/10892680211046508
Correction: Does not cite evidence that Bem used questionable research practices to produce these results, and that the results failed to replicate.
Entry Date: 21/10/20
Authors: Gregory D. Webster, Val Wongsomboon, Elizabeth A. Mahar
Year: 2021
Citation: To be sure, quantity need not reflect quality in published articles (e.g., see Bem’s [2011] nine-study article purporting experimental evidence of precognition).
Correction: Does not cite evidence that Bem used questionable research practices to produce these results, and that the results failed to replicate.
Entry Date: 21/10/20
Authors: Jason Chin, Justin T. Pickett, Simine Vazire, Alex O. Holcombe
Year: 2021
Citation: The threat posed by QRPs has been discussed most extensively in the field of psychology, arguably the eye of the storm of the “replication crisis.” In the wake of the “False Positive Psychology” paper (Simmons et al. 2011), Daryl Bem’s paper claiming to find evidence of Extra Sensory Perception (ESP; Bem 2011), and several cases of fraud, the field of psychology entered a period of intense self-examination.
Correction: Does not cite evidence that Bem used questionable research practices to produce these results, and that the results failed to replicate.
Entry Date: 21/10/20
Citation: Daryl Bem became notorious for publication of two articles in high-quality journals claiming the existence of ESP (Bem, 2011; Bem & Honorton, 1994). The experimental design and the statistical power looked persuasive enough to lead the editors and reviewers to a decision to publish despite the lack of a theory to explain the results.
Correction: Does not cite evidence that Bem used questionable research practices to produce these results, and that the results failed to replicate.
Entry Date: 21/10/20
Authors: T. D. Stanley, Hristos Doucouliagos, John P. A. Ioannidis, Evan C. Carter
Year: 2021
Citation: Bem conducted some dozen(s) of experiments that asked students to “feel the future” by responding in the present to random future stimulus that was unknown to both subjects and experimenters at the time [46–49]. Even though Bem seemed to employ state-of-the-art methods, his findings that students could “feel the future” were implausible to most psychologists.
DOI: 10.1002/jrsm.1512
Correction: Does not cite evidence that Bem used questionable research practices to produce these results, and that the results failed to replicate.
Entry Date: 21/10/20
Author: Guido W. Imbens
Year: 2021
Citation: In the Journal of Personality and Social Psychology, Bem (2011) studies whether precognition exists: that is, whether future events retroactively affect people’s responses. Reviewing nine experiments, he finds (from the abstract): “The mean effect size (d) in psi performance across all nine experiments was 0.22, and all but one of the experiments yielded statistically significant results.” This finding sparked considerable controversy, some of it methodological.
Correction: Does not cite evidence that Bem used questionable research practices to produce these results, and that the results failed to replicate.
Entry Date: 21/10/20
Authors: Mariella Paul, Gisela H. Govaart, Antonio Schettino
Year: 2021
Citation: Over the last decade, findings from a number of research disciplines have been under careful scrutiny. Prominent examples of research supporting incredible conclusions (Bem, 2011),
Correction: Does not cite evidence that Bem used questionable research practices to produce these results, and that the results failed to replicate.
Entry Date: 21/10/20
Authors: Andrew T. Little, Thomas B. Pepinsky
Year: 2021
Citation: A prominent example here is Bem (2011) on extrasensory perception, which played a central role in uncovering the problems of p-hacking in psychology.
Correction: Does not cite evidence that Bem used questionable research practices to produce these results, and that the results failed to replicate.
Entry Date: 21/10/20
Authors: Bruno Verschuere, Franziska M. Yasrebi-de Kom, Iza van Zelm, MSc, Scott O. Lilienfeld
Year: 2021
Citation: and the spurious “discovery” of precognition (Bem, 2011)
Correction: Does not cite evidence that Bem used questionable research practices to produce these results, and that the results failed to replicate.

Entry Date: 21/10/20
Authors: Lincoln J. Colling, Dénes Szűcs
Year: 2021
Citation: A series of events in the early 2010s, including the publication of Bem’s (2011) infamous study on extrasensory perception (or PSI), and data fabrication by Diederik Stapel and others (Stroebe et al. 2012), led some prominent researchers to claim that psychological science was suffering a crisis of confidence (Pashler and Wagenmakers 2012).
Correction: Does not cite evidence that Bem used questionable research practices to produce these results, and that the results failed to replicate.
Entry Date: 21/10/20
Authors: Mónika Gergelyfi, Ernesto J. Sanz-Arigita, Oleg Solopchuk, Laurence Dricot, Benvenuto Jacob, Alexandre Zénon
Year: 2021
Citation: Theories of MF can be classified in two major groups that assume either: (a) alterations of motivational processes leading to restrictions on the recruitment of cognitive resources for the task at hand (…) or (b) progressive functional alteration of cognitive processes through metabolic mechanisms (Gailliot and Baumeister, 2007; Christie and Schrater, 2015; Holroyd, 2015; Hopstaken et al., 2015; Blain et al., 2016; Gergelyfi et al., 2015).
Correction: Do not cite the meta-analysis that shows publication bias and no evidence for glucose effects on willpower.
Entry Date: 21/10/20
Authors: Alexandra Touroutoglou, Joseph Andreano, Bradford C. Dickerson, Lisa Feldman Barrett
Year: 2020
Citation: Some accounts hold that effort serves to manage intrinsic costs to finite resources such as metabolic resources (Gailliot and Baumeister, 2007; Gailliot et al., 2007; Holroyd, 2016),
Correction: Do not cite the meta-analysis that shows publication bias and no evidence for glucose effects on willpower.

Entry Date: 21/10/10
Authors: Scott W. Phillips; Dae-Young Kim
Year: 2021
Citation: Johnson et al. (2019) found no evidence for disparity in the shooting deaths of Black or Hispanic people. Rather, their data indicated an anti-White disparity in OIS deaths.
Correction: Retraction of Johnson et al. (2019).
Entry Date: 21/10/10
Authors: Richard Stansfield, Ethan Aaronson, Adam Okulicz-Kozaryn
Year: 2021
Citation: While recent studies increasingly control for officer and incident characteristics (e.g., Fridell & Lim, 2016; Johnson et al., 2019; Ridgeway et al., 2020)
Correction: Retraction of Johnson et al. (2019).
Entry Date: 21/10/10
Authors: P. A. Hancock; John D. Lee; John W. Senders
Citation: Misattributions involved in such processes of assessment can, as we have seen, lead to adverse consequences (e.g., Johnson et al., 2019).
DOI: 10.1177/00187208211036323
Correction: Retraction of Johnson et al. (2019).
Entry Date: 21/10/10
Authors: Desmond Ang
Citation: While empirical evidence of racial bias is mixed (Nix et al. 2017; Fryer 2019; Johnson et al. 2019; Knox, Lowe, and Mummolo 2020; Knox and Mummolo 2020)
DOI: 10.1093/qje/qjaa027
Correction: Retraction of Johnson et al. (2019).
Entry Date: 21/10/10
Authors: Lara Vomfell; Neil Stewart
Year: 2021
Citation: Some studies have argued that the general population in an area is not the appropriate comparison: instead one should compare rates of use of force to how often Black and White people come into contact with police [59–61]
Correction: [60] Johnson et al. has been retracted.
Entry Date: 21/10/10
Authors: Jordan R. Riddell; John L. Worrall
Year: 2021
Citation: Recent years have also seen improvements in benchmarking-related research, that is, in formulating methods to more accurately analyze whether bias (implicit or explicit) or racial disparities exist in both UoF and OIS. Recent examples include Cesario, Johnson, and Terrill (2019), Johnson, Tress, Burkel, Taylor, and Cesario (2019), Shjarback and Nix (2020), and Tregle, Nix, and Alpert (2019).
Correction: Retraction of Johnson et al. (2019).
Entry Date: 21/10/10
Authors: Dean Knox, Will Lowe, Jonathan Mummolo
Year: 2021
Citation: A related study, Johnson et al. (2019), attempts to estimate racial bias in police shootings. Examining only positive cases in which fatal shootings occurred, they find that the majority of shooting victims are white and conclude from this that no antiminority bias exists
Correction: Retraction of Johnson et al. (2019).
Entry Date: 21/10/10
Authors: Ming-Hui Li, Pei-Wei Li, Li-Lin Rao
Year: 2021
Citation: The IAT has been utilized in diverse areas and has proven to have good construct validity and reliability (Gawronski et al., 2020).
Correction: Does not cite critique of the construct validity of IATs.

Entry Date: 21/10/10
Authors: Chew Wei Ong, Kenichi Ito
Year: 2021
Citation: This penalty treatment of error trials has been shown to improve the correlations between the IAT and explicit measures, indicating a greater construct validity of the IAT.
DOI: 10.1111/bjso.12503
Correction: Higher correlations do not imply higher construct validity of IATs as measures of implicit attitudes.
Entry Date: 21/10/10
Authors: Sara Costa, Viviana Langher, Sabine Pirchio
Year: 2021
Citation: The most used method to assess implicit attitudes is the “Implicit Association Test” (IAT; Greenwald et al., 1998), which presents a good reliability (Schnabel et al., 2008) and validity (Nosek et al., 2005; Greenwald et al., 2009).
DOI: 10.3389/fpsyg.2021.712356
Correction: Does not cite critique of the construct validity of IATs.
Entry Date: 21/10/10
Authors: Christoph Bühren, Julija Michailova
Year: 2021
Citation: not available, behind paywall
DOI: 10.4018/IJABE.2021100105
Correction: Cite Caruso et al. (2013), but do not cite the replication failure by Caruso et al. (2017).
Entry Date: 21/10/10
Authors: Yang, Gengfeng, Zhenzhen, Dongjing
Year: 2021
Citation: Studies have found that merely activating the concept of money can increase egocentrism, which can further redirect people's attention toward their inner motivations and needs (Zaleskiewicz et al., 2018) and reduce their sense of connectedness with others (Caruso et al., 2013).
DOI: 10.1002/cb.1973
Correction: Cite Caruso et al. (2013), but do not cite the replication failure by Caruso et al. (2017).
Entry Date: 21/10/10
Authors: Garriy Shteynberg, Theresa A. Kwon, Seong-Jae Yoo, Heather Smith, Jessica Apostle, Dipal Mistry, Kristin Houser
Year: 2021
Citation: Money is often described as profane, vulgar, and filthy (Belk & Wallendorf, 1990), yet incidental exposure to money increases the endorsement of the very social systems that render such money meaningful (Caruso et al., 2013).
DOI: 10.1002/jts5.95
Correction: Cite Caruso et al. (2013), but do not cite the replication failure by Caruso et al. (2017).
Entry Date: 21/10/10
Author: Arden Rowell
Year: 2021
Citation: In particular, some studies show that encouraging people to think about things in terms of money may measurably change people's thoughts, feelings, motivations, and behaviors. See Eugene M. Caruso, Kathleen D. Vohs, Brittani Baxter & Adam Waytz, Exposure to Money Increases Endorsement of Free-Market Systems and Social Inequality, 142 J. EXPERIMENTAL PSYCH. 301, 301-02, 305 (2013) DOI:
Correction: Cites Caruso et al. (2013), but does not cite the replication failure by Caruso et al. (2017).
Entry Date: 21/10/10
Authors: Anna Jasinenko, Fabian Christandl, Timo Meynhardt
Year: 2020
Citation: Caruso et al. (2013) find that exposure to money (which is prevalent in most shopping situations) activates personal tendencies to justify the market system. Furthermore, they find that money exposure also activates general system justification; however, the effect was far smaller than for the activation of MSJ.
Correction: Cite Caruso et al. (2013), but do not cite the replication failure by Caruso et al. (2017).

A one hour introduction to bias detection

This introduction to bias detection builds on introductions to test-statistics and statistical power (Intro to Statistics, Intro to Power).

It is well known that many psychology articles report too many significant results because researchers selectively publish results that support their predictions (Francis, 2014; Sterling, 1959; Sterling et al., 1995; Schimmack, 2021). This often leads to replication failures (Open Science Collaboration, 2015).

One way to examine whether a set of studies reported too many significant results is to compare the success rate (i.e., the percentage of significant results) with the mean observed power in studies (Schimmack, 2012). In this video, I illustrate this bias detection method using Vohs et al.’s (2006) Science article “The Psychological Consequences of Money.”

I use this article for training purposes because it reports 9 studies, and a reasonably large number of studies is needed to have good power to detect selection bias. Also, the article is short and the results are straightforward. Thus, students have no problem filling out the coding sheet that is needed to compute observed power (Coding Sheet).

The results show clear evidence of selection bias that undermines the credibility of the reported results (see also TIVA). Although bias tests are available, few researchers use them to protect themselves from junk science, and articles like this one continue to be cited at high rates (683 total, 67 in 2019). A simple way to protect yourself from junk science is to adjust the alpha level to .005 because many questionable practices produce p-values that are just below .05. For example, the lowest p-value in these 9 studies was p = .006. Thus, not a single study was statistically significant with alpha = .005.
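The logic of the bias check and of the stricter alpha can be sketched in code. The p-values below are HYPOTHETICAL placeholders chosen only so that the smallest one is .006; they are not the values coded from Vohs et al. (2006), and the function names are illustrative.

```python
from statistics import NormalDist

# HYPOTHETICAL p-values for nine studies (not the article's actual results)
P_VALUES = [0.006, 0.010, 0.020, 0.030, 0.030, 0.040, 0.040, 0.045, 0.049]


def observed_power(p_two_sided: float, alpha: float = 0.05) -> float:
    """Observed power implied by a two-sided p-value at the given alpha."""
    std = NormalDist()
    z = -std.inv_cdf(p_two_sided / 2)
    z_crit = std.inv_cdf(1 - alpha / 2)
    return NormalDist(mu=z_crit).cdf(z)


def success_rate(p_values: list[float], alpha: float = 0.05) -> float:
    """Share of p-values below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)


def mean_observed_power(p_values: list[float], alpha: float = 0.05) -> float:
    """Average observed power across all tests."""
    return sum(observed_power(p, alpha) for p in p_values) / len(p_values)


if __name__ == "__main__":
    print("success rate, alpha = .05:", success_rate(P_VALUES))
    print("mean observed power:", round(mean_observed_power(P_VALUES), 2))
    print("success rate, alpha = .005:", success_rate(P_VALUES, alpha=0.005))
```

A 100% success rate paired with a mean observed power well below 100% is the signature of selection for significance, and none of these placeholder results would survive alpha = .005.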

Intro to Statistical Power in One Hour

Last week I posted a video that provided an introduction to the basic concepts of statistics, namely effect sizes and sampling error. A test statistic like a t-value is simply the ratio of the effect size over sampling error. This ratio is also known as a signal-to-noise ratio. The bigger the signal (effect size), the more likely it is that we will notice it in our study. Similarly, the less noise we have (sampling error), the easier it is to observe even small signals.

In this video, I use the basic concepts of effect sizes and sampling error to introduce the concept of statistical power. Statistical power is defined as the percentage of studies that produce a statistically significant result. When alpha is set to .05, it is the expected percentage of p-values with values below .05.

Statistical power is important to avoid type-II errors; that is, there is a meaningful effect, but the study fails to provide evidence for it. While researchers cannot control the magnitude of effects, they can increase power by lowering sampling error. Thus, researchers should carefully think about the magnitude of the expected effect to plan how large their sample has to be to have a good chance to obtain a significant result. Cohen proposed that a study should have at least 80% power. The planning of sample sizes using power calculation is known as a priori power analysis.
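As a rough illustration of an a priori power analysis, the sketch below computes the sample size per group for a two-sided, two-group comparison of means. It uses the normal approximation rather than the exact t-distribution, so it can come out slightly smaller than dedicated power software; the function name and defaults are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate n per group for two independent groups.

    d     : expected standardized mean difference (Cohen's d)
    power : desired probability of a significant result
    alpha : two-sided significance criterion
    """
    std = NormalDist()
    z_alpha = std.inv_cdf(1 - alpha / 2)   # criterion value, about 1.96
    z_power = std.inv_cdf(power)           # about 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


if __name__ == "__main__":
    for d in (0.2, 0.5, 0.8):   # Cohen's small, medium, large benchmarks
        print(d, n_per_group(d))
```

The required sample size grows with the inverse square of the effect size, which is why overestimating the expected effect leads so easily to underpowered studies.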

The problem with a priori power analysis is that researchers may fool themselves about effect sizes and conduct studies with insufficient sample sizes. In this case, power will be less than 80%. It is therefore useful to estimate the actual power of studies that are being published. In this video, I show that actual power could be estimated by simply computing the percentage of significant results. However, in reality this approach would be misleading because psychology journals discriminate against non-significant results. This is known as publication bias. Empirical studies show that the percentage of significant results for theoretically important tests is over 90% (Sterling, 1959). This does not mean that the mean power of psychological studies is over 90%. It merely suggests that publication bias is present. In a follow-up video, I will show how it is possible to estimate power when publication bias is present. This video is important for understanding what statistical power is.
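The gap between true power and the published success rate can be sketched with simple arithmetic. The selection model here is an assumption chosen for illustration: every significant result is published, and a non-significant result is published with some fixed probability. The function name `observed_success_rate` and the 10% publication probability are hypothetical, not estimates from any dataset.

```python
def observed_success_rate(mean_power, publish_prob_nonsig):
    """Share of significant results among *published* studies.

    mean_power: true proportion of studies that reach p < .05.
    publish_prob_nonsig: assumed probability that a non-significant
        study is published (significant studies are always published
        in this toy model).
    """
    published_sig = mean_power
    published_nonsig = (1 - mean_power) * publish_prob_nonsig
    return published_sig / (published_sig + published_nonsig)


if __name__ == "__main__":
    # Without selection, the success rate equals true power.
    print(observed_success_rate(0.5, 1.0))
    # With strong selection, 50% power looks like a success rate above 90%.
    print(round(observed_success_rate(0.5, 0.1), 3))
```

In this toy model, studies with only 50% power produce a published success rate above 90% once nine out of ten non-significant results stay in the file drawer.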

Intro to Statistics In One Hour

Each year, I work with undergraduate students on the coding of research articles to examine the replicability and credibility of psychological science (ROP2020). Before students code test-statistics from t-tests or F-tests in results sections, I provide a crash course on inferential statistics (null-hypothesis significance testing). Although some students have taken a basic stats course, the courses often fail to teach a conceptual understanding of statistics and distract students with complex formulas that are treated like a black box that converts data into p-values (or worse, stars that reflect whether p < .05*, p < .01**, or p < .001***).

In this one-hour lecture, I introduce the basic principles of null-hypothesis significance testing using the example of the t-test for independent samples.

An Introduction to T-Tests | Definitions, Formula and Examples

I explain that a t-value is conceptually made up of three components, namely the effect size (D = x1 – x2), a measure of the natural variation of the dependent variable (the standard deviation, s), and a measure of the amount of sampling error (simplified, se = 2/sqrt(n1 + n2)).

Moreover, dividing the effect size D by the standard deviation provides the familiar standardized effect size, Cohen’s d = D/s. This means that a t-value corresponds to the ratio of the standardized effect size (d) over the amount of sampling error (se), t = d/se.
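The decomposition can be checked numerically. The sketch below uses made-up data for two groups of equal size and compares t = d/se with the textbook pooled-variance formula. Using the pooled standard deviation for s is an assumption of this example, and the simplified se = 2/sqrt(n1 + n2) matches the textbook formula exactly only when the groups are the same size.

```python
from math import sqrt
from statistics import mean, variance


def pooled_sd(group1, group2):
    """Pooled standard deviation of two independent samples."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return sqrt(pooled_var)


def t_from_signal_and_noise(group1, group2):
    """Return (d, se, t) with t = d / se, as described in the text."""
    n1, n2 = len(group1), len(group2)
    D = mean(group1) - mean(group2)   # unstandardized effect size
    d = D / pooled_sd(group1, group2)  # standardized effect size (signal)
    se = 2 / sqrt(n1 + n2)             # simplified sampling error (noise)
    return d, se, d / se


def textbook_t(group1, group2):
    """Independent-samples t-value from the usual pooled-variance formula."""
    n1, n2 = len(group1), len(group2)
    D = mean(group1) - mean(group2)
    return D / (pooled_sd(group1, group2) * sqrt(1 / n1 + 1 / n2))


if __name__ == "__main__":
    # Hypothetical scores, invented for this illustration.
    treatment = [5, 6, 7, 8, 9]
    control = [3, 4, 5, 6, 7]
    d, se, t = t_from_signal_and_noise(treatment, control)
    print(round(d, 3), round(se, 3), round(t, 3))
    print(round(textbook_t(treatment, control), 3))
```

With equal group sizes, both routes give the same t-value, which is the point of the signal-to-noise framing: the formula is just d divided by se.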

It follows that a t-value is influenced by two quantities. T-values increase as the standardized (unit-free) effect sizes increase and as the sampling error decreases. The two quantities are sometimes called signal (effect size) and noise (sampling error). Accordingly, the t-value is the signal-to-noise ratio. I compare the signal and noise to an experiment where somebody is throwing rocks into a lake and somebody else has to tell whether a rock was thrown based on the observation of a splash. A study with a small effect and a lot of noise is like trying to detect the splash of a small pebble on a very windy, stormy day where waves are creating a lot of splashes that make it hard to see the small splash made by a pebble. However, if you throw a big rock into the lake, you can see the big splash from the rock even when the wind creates a lot of splashing. If you want to see the splash of a pebble, you need to wait for a calm day without wind. These conditions correspond to a study with a large sample and very little sampling error.

Have a listen and let me know how I am doing. Feel free to ask questions that help me to understand how I can make the introduction to statistics even easier. Too many statistics books and lecturers intimidate students with complex formulas and Greek symbols that make statistics look hard, but in reality it is very simple. Data always have two components. The signal you are looking for and noise that makes it hard to see the signal. The bigger the signal to noise ratio is, the more likely it is that you saw a true signal. Of course, it can be hard to quantify signals and noise and statisticians work hard in getting good estimates of noise, but that does not have to concern users of statistics. As users of statistics we just trust statisticians that they have good (the best) estimates to see how good our data are.

Rejection Watch: Censorship at JEP-General

Articles published in peer-reviewed journals are only the tip of the scientific iceberg. Professional organizations want you to believe that these published articles are carefully selected to be the most important and scientifically credible articles. In reality, peer-review is unreliable and invalid, and editorial decisions are based on personal preferences. For this reason, the censoring mechanism is often hidden. Part of the movement towards open science is to make the censoring process transparent.

I therefore post the decision letter and the reviews from JEP:General. I sent my ms “z-curve: an even better p-curve” to this journal because it published two articles on the p-curve method that are highly cited. The key point of my ms. is to point out that the p-curve app produces a “power” estimate of 97% for hand-coded articles by Leif Nelson, while z-curve produces an estimate of 52%. If you are a quantitative scientist, you will agree that this is a non-trivial difference and you are right to ask which of these estimates is more credible. The answer is provided by simulation studies that compare p-curve and z-curve and show that p-curve can dramatically overestimate “power” when the data are heterogeneous (Brunner & Schimmack, 2020). In short, the p-curve app sucks. Let the record show that JEP-General is happy to get more citations for a flawed method. The reason might be that z-curve is able to show publication bias in the original articles published in JEP-General (Replicability Rankings). Maybe Timothy J. Pleskac is afraid that somebody looks at his z-curve, which shows a few too many p-values that are just significant (ODR = 73% vs. EDR = 45%).

Unfortunately for psychologists, statistics is an objective science that can be evaluated using either mathematical proofs (Brunner & Schimmack, 2020) or simulation studies (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). It is just hard for psychologists to follow the science if the science doesn’t agree with their positive illusions and inflated egos.


Z-curve 2.0: An Even Better P-Curve
Journal of Experimental Psychology: General

Dear Dr. Schimmack,

I have received reviews of the manuscript entitled Z-curve 2.0: An Even Better P-Curve (XGE-2021-3638) that you recently submitted to Journal of Experimental Psychology: General. Upon receiving the paper I read the paper. I agree that Simonsohn, Nelson, & Simmons’ (2014) P-Curve paper has been quite impactful. As I read over the manuscript you submitted, I saw there was some potential issues raised that might help help advance our understanding of how to evaluate scientific work. Thus, I asked two experts to read and comment on the paper. The experts are very knowledgeable and highly respected experts in the topical area you are investigating.

Before reading their reviews, I reread the manuscript, and then again with the reviews in hand. In the end, both reviewers expressed some concerns that prevented them from recommending publication in Journal of Experimental Psychology: General. Unfortunately, I share many of these concerns. Perhaps the largest issue is that both reviewers identified a number formal issues that need more development before claims can be made about the z-curve such as the normality assumptions in the paper. I agree with Reviewer 2 that more thought and work is needed here to establish the validity of these assumptions and where and how these assumptions break down. I also agree with Reviewer 1 that more care is needed when defining and working with the idea of unconditional power. It would help to have the code, but that wouldn’t be sufficient as one should be able to read the description of the concept in the paper and be able to implement it computationally. I haven’t been able to do this. Finally, I also agree with Reviewer 1 that any use of the p-curve should have a p-curve disclosure table. I would also suggest ways to be more constructive in this critique. In many places, the writing and approach comes across as attacking people. That may not be the intention. But, that is how it reads.

Given these concerns, I regret to report that that I am declining this paper for publication in Journal of Experimental Psychology: General. As you probably know, we can accept only small fraction of the papers that are submitted each year. Accordingly, we must make decisions based not only on the scientific merit of the work but also with an eye to the potential level of impact for the findings for our broad and diverse readership. If you decide to pursue publication in another journal at some point (which I hope you will consider), I hope that the suggestions and comments offered in these reviews will be helpful.

Thank you for submitting your work to the Journal. I wish you the best in your continued research, and please try us again in the future if you think you have a manuscript that is a good fit for Journal of Experimental Psychology: General.


Timothy J. Pleskac, Ph.D.
Associate Editor
Journal of Experimental Psychology: General

Reviewers’ comments:

Reviewer #1: 1. This commentary submitted to JEPG begins presenting a p-curve analysis of early work by Leif Nelson.
Because it does not provide a p-curve disclosure table, this part of the paper cannot be evaluated.
The first p-curve paper (Simonsohn et al, 2014) reads: “P-curve disclosure table makes p-curvers accountable for decisions involved in creating a reported p-curve and facilitates discussion of such decisions. We strongly urge journals publishing p-curve analyses to require the inclusion of a p-curve disclosure table.” (p.540). As a reviewer I am aligning with these recommendation and am *requiring* a p-curve disclosure table, as in, I will not evaluate that portion of the paper, and moreover I will recommend the paper be rejected unless that analysis is removed, or a p-curve disclosure table is included, and is then evaluated as correctly conducted by the review team in an subsequent round of evaluation. The p-curve disclosure table for the Russ et al p-curve, even if not originally conducted by these authors, should be included as well, with a statement that the authors of this paper have examined the earlier p-curve disclosure table and deemed it correct. If an error exists in the literature we have to fix it, not duplicate it (I don’t know if there is an error, my point is, neither do the authors who are using it as evidence).

2. The commentary then makes arguments about estimating conditional vs unconditional power. While not exactly defined in the article, the authors come pretty close to defining conditional power, I think they mean by it the average power conditional on being included in p-curve (ironically, if I am wrong about the definition, the point is reinforced). I am less sure about what they mean by unconditional power. I think they mean that they include in the population parameter of interest not only the power of the studies included in p-curve, but also the power of studies excluded from it, so ALL studies. OK, this is an old argument, dating back to at least 2015, it is not new to this commentary, so I have a lot to say about it.

First, when described abstractly, there is some undeniable ‘system 1’ appeal to the notion of unconditional power. Why should we restrict our estimation to the studies we see? Isn’t the whole point to correct for publication bias and thus make inferences about ALL studies, whether we see them or not? That’s compelling. At least in the abstract. It’s only when one continues thinking about it that it becomes less appealing. More concretely, what does this set include exactly? Does ‘unconditional power’ include all studies ever attempted by the researcher, does it include those that could have been run but for practical purposes weren’t? does it include studies run on projects that were never published, does it include studies run, found to be significant, but eventually dropped because they were flawed? Does it include studies for which only pilots were run but not with the intention of conducting confirmatory analysis? Does it include studies which were dropped because the authors lost interest in the hypothesis? Does it include studies that were run but not published because upon seeing the results the authors came up with a modification of the research question for which the previous study was no longer relevant? Etc etc). The unconditional set of studies is not a defined set, without a definition of the population of studies we cannot define a population parameter for it, and we can hardly estimate a non-existing parameter. Now. I don’t want to trivialize this point. This issue of the population parameter we are estimating is an interesting issue, and reasonable people can disagree with the arguments I have outlined above (many have), but it is important to present the disagreement in a way that readers understand what it actually entails. An argument about changing the population parameter we estimate with p-curve is not about a “better p-curve”, it is about a non-p-curve. 
A non-p-curve which is better for the subset of people who are interested in the unconditional power, but a WORSE p-curve for those who want the conditional power (for example, it is worse for the goals of the original p-curve paper). For example, the first paper using p-curve for power estimation reads “Here is an intuitive way to think of p-curve’s estimate: It is the average effect size one expects to get if one were to rerun all studies included in p-curve”. So a tool which does not estimate that value, but a different value, it is not better, it is different. The standard deviation is neither better nor worse than the mean. They are different. It would be silly to say “Standard Deviation, a better Mean (because it captures dispersion and the mean does not)”. The standard deviation is better for someone interested in dispersion, and the standard deviation is worse for someone interested in the central tendency. Exactly the same holds for conditional vs unconditional power. (well, the same if z-curve indeed estimated unconditional power, i don’t know if that is true or not. Am skeptical but open minded).

Second, as mentioned above, this distinction of estimating the parameter of the subset of studies included in p-curve vs the parameter of “all studies” is old. I think that argument is seen as the core contribution of this commentary, and that contribution is not close to novel. As the quote above shows, it is a distinction made already in the original p-curve paper for estimating power. And, it is also not new to see it as a shortcoming of p-curve analysis. Multiple papers by Van Assen and colleagues, and by McShane and colleagues, have made this argument. They have all critiqued p-curve on those same grounds.

I therefore think this discussion should improve in the following ways: (i) give credit, and give voice, to earlier discussions of this issue (how is the argument put forward here different from the argument put forward in about a handful of previous papers making it, some already 5 years ago), (ii) properly define the universe of studies one is attempting to estimate power for (i.e., what counts in the set of unconditional power), and (iii) convey more transparently that this is a debate about what is the research question of interest, not of which tool provides the better answer to the same question. Deciding whether one wants to estimate the average power of one or another set of studies is completely fair game of an issue to discuss, and if indeed most readers don’t think they care about conditional power, and those readers use p-curve not realizing that’s what they are estimating, it is valuable to disabuse them of their confusion. But it is not accurate, and therefore productive, to describe this as a statistical discussion, it is a conceptual discussion.

3. In various places the paper reports results from calculations, but the authors have not shared neither the code nor data for those calculations, so these results cannot be adequately evaluated in peer-review, and that is the very purpose of peer-review. This shortcoming is particularly salient when the paper relies so heavily on code and data shared in earlier published work.

Finally, it should be clearer what is new in this paper. What is said here that is not said in the already published z-curve paper and p-curve critique papers?

Reviewer #2:
The paper reports a comparison between p-curve and z-curve procedures proposed in the literature. I found the paper to be unsatisfactory, and therefore cannot recommend publication in JEP:G. It reads more like a cropped section from the author’s recent piece in meta-psychology than a standalone piece that elaborates on the different procedures in detail. Because a lot is completely left out, it is very difficult to evaluate the results. For example, let us consider a couple of issues (this is not an exhaustive list):

– The z-curve procedure assumes that z-transformed p-values under the null hypothesis follow a standard Normal distribution. This follows from the general idea that the distribution of p-values under the null-hypothesis is uniform. However, this general idea is not necessarily true when p-values are computed for discrete distributions and/or composite hypotheses are involved. This seems like a point worth thinking about more carefully, when proposing a procedure that is intended to be applied to indiscriminate bodies of p-values. But nothing is said about this, which strikes me as odd. Perhaps I am missing something here.

– The z-curve procedure also assumes that the distribution of z-transformed p-values follows a Normal distribution or a mixture of homoskedastic Normals (distributions that can be truncated depending on the data being considered/omitted). But how reasonable is this parametric assumption? In their recently published paper, the authors state that this is as **a fact**, but provide no formal proof or reference to one. Perhaps I am missing something here. If anything, a quick look at classic papers on the matter, such as Hung et al. (1997, Biometrics), show that the cumulative distributions of p-values under different alternatives cross-over, which speaks against the equal-variance assumption. I don’t think that these questions about parametric assumptions are of secondary importance, given that they will play a major in the parameter estimates obtained with the mixture model.

Also, when comparing the different procedures, it is unclear whether the reported disagreements are mostly due to pedestrian technical choices when setting up an “app” rather than irreconcilable theoretical commitments. For example, there is nothing stopping one from conducting a p-curve analysis on a more fine-grained scale. The same can be said about engaging in mixture modeling. Who is/are the culprit/s here?

Finally, I found that the writing and overall tone could be much improved.

Personality Over Time: A Historic Review

The hallmark of a science is progress. To demonstrate that psychology is a science therefore requires evidence that current evidence, research methods, and theories are better than those in the past. Historic reviews are also needed because it is impossible to make progress without looking back once in a while.

Research on the stability or consistency of personality has a long history that started with the first empirical investigations in the 1930s, but a historic review of this literature is lacking. Few young psychologists interested in personality development may be familiar with Kelly, his work, or his American Psychologist article on “Consistency of the Adult Personality” (Kelly, 1955). Kelly starts his article with some personal observations about stability and change in traits that he observed in colleagues over the years.

Today, we call traits that are neither physical characteristics nor cognitive abilities personality traits, and they are commonly represented in the Big Five model. What have we learned about the stability of personality traits in adulthood from nearly a century of research?

Kelly (1955) reported some preliminary results from his own longitudinal study of personality that he started in the 1930s with engaged couples. Twenty years later, they completed follow-up questionnaires. Figure 6 reported the results for the Allport-Vernon value scales. I focus on these results because they make it possible to compare the 20-year retest correlations to retest correlations over a one-year period.

Figure 6 shows that personality, or at least values, are not perfectly stable. This is easily seen by a comparison of the one-year retest correlations with the 20-year retest correlations. The 20-year retest correlations are always lower than the one-year retest correlations. Individual differences in values change over time. Some individuals become more religious and others become less religious, for example. The important question is how much individuals change over time. To quantify change and stability it is important to specify a time interval because change implies lower retest correlations over longer retest intervals. Although the interval is arbitrary, a period of 1 year or 10 years can be used to quantify and compare stability and change of different personality traits. To do so, we need a model of change over time. A simple model is Heise’s (1969) autoregressive model that assumes a constant rate of change.

Take religious values as an example. Here we have two observed retest correlations, r(y1) = .75 and r(y20) = .60. Both correlations are attenuated by random measurement error. To correct for unreliability, we need to solve two equations with two unknowns, the rate of change and reliability.
.75 = rate^1 * rel
.60 = rate^20 * rel
With some rusty high-school math, I was able to solve these equations for rate. Dividing the second equation by the first cancels rel and leaves rate^19 = .60/.75.
rate = (.60/.75)^(1/(20-1)) = .988
The implied 10-year stability is .988^10 = .886.
The estimated reliability is .75 / .988 = .759.
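The same two-equation solution can be written as a small function. This is only a sketch of the arithmetic above, assuming a constant annual rate of change and a constant reliability. The function name `solve_rate_and_reliability` is made up for this example.

```python
def solve_rate_and_reliability(r_short, years_short, r_long, years_long):
    """Solve r = rate**years * rel for two observed retest correlations.

    Assumes a constant annual rate of change and the same reliability
    at both retest intervals.
    """
    # Dividing the two equations cancels reliability.
    rate = (r_long / r_short) ** (1 / (years_long - years_short))
    # Plug the rate back into the short-interval equation.
    reliability = r_short / rate ** years_short
    return rate, reliability


if __name__ == "__main__":
    # Religious values: 1-year retest = .75, 20-year retest = .60.
    rate, rel = solve_rate_and_reliability(0.75, 1, 0.60, 20)
    print(round(rate, 3), round(rel, 3), round(rate ** 10, 3))
```

Carrying the unrounded rate forward gives a 10-year stability of about .889. The .886 reported above results from rounding the rate to .988 before raising it to the tenth power.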

Table 1 shows the results for all six values.

Value | 1-year | 20-year | Reliability | 1-Year Rate | 10-Year Rate
Table 1
Stability and Change of Allport-Vernon Values

The results show that the 1-year retest correlations are very similar to the reliability estimates of the value measure. After correcting for unreliability the 1-year stability is extremely high with stability estimates ranging from .96 for social values to .99 for religious values. The small differences in 1-year stabilities become only notable over longer time periods. The estimated 10-year stability estimates range from .68 for social values to .90 for religious values.

Kelly reported results for two personality constructs that were measured with the Bernreuter personality questionnaire, namely self-confidence and sociability.

The implied stability of these personality traits is similar to the stability of values.

Personality | 1-year | 20-year | Reliability | 1-Year Rate | 10-Year Rate

Kelly’s results published in 1955 are based on a selective sample during a specific period of time that included the second world war. It is therefore possible that studies with other populations during other time periods produce different results. However, the results are more consistent than different across different studies.

The first article with retest correlations for different time intervals of reasonable length was published in 1941 by Mason N. Crook. The longest retest interval was 6-years and six months. Figure 1a in the article plotted the retest correlations as a function of the retest interval.

Table 2 shows the retest correlations and reveals that some of them are based on extremely small sample sizes. The 5-month retest is based on only 30 participants, whereas the 8-month retest is based on 200 participants. Using the latter estimate for the short-term stability, it is possible to estimate the 1-year and 10-year rates using the formula given above.

Sample Size | Months | Retest | Reliability | 1-Year Rate | 10-Year Rate
Weighted Average | | | 0.75 | 0.958 | 0.651

The 1-year stability estimates are all above .9, except for the retest correlation that is based on only N = 18 participants. Given the small sample sizes, variability in estimates is mostly random noise. I computed a weighted average that takes both sample size and retest interval into account because longer time intervals provide better information about the actual rate of change. The estimated 1-year stability is r = .96, which implies a 10-year stability of .65. This is a bit lower than Kelly’s estimates, but this might just be sampling error. It is also possible that Crook’s results underestimate long-term stability because the model assumes a constant rate of change. It is possible that this assumption is false, as we will see later.

Crook also provided a meta-analysis that included other studies and suggested a hierarchy of consistency.

Accordingly, personality traits like neuroticism are less stable than cognitive abilities, but more stable than attitudes. As the Figure shows, empirical support for this hierarchy was limited, especially for estimates of the stability of attitudes.

Several decades later, Conley (1984) reexamined this hierarchy of consistency with more data. He was also the first to provide quantitative stability estimates that correct for unreliability. The meta-analysis included more studies and, more importantly, studies with long retest intervals. The longest retest interval was 45 years (Conley, 1983). After correcting for unreliability, the one-year stability was estimated to be r = .98, which implies a stability of r = .81 over a period of 10 years and r = .36 over 50 years.

Using the published retest correlations for studies with sample sizes greater than 100, I obtained a one-year stability estimate of r = .969 for neuroticism and r = .986 for extraversion. These differences may reflect differences in stability or could just be sampling error. The average (r = .978) reproduces Conley’s (1984) estimate of r = .98.

Neuroticism
Sample Size | Years | Retest | Reliability | 1-Year Rate | 10-Year Rate
Weighted Average | | | 0.74 | 0.969 | 0.730

Extraversion
Sample Size | Years | Retest | Reliability | 1-Year Rate | 10-Year Rate
Weighted Average | | | 0.73 | 0.986 | 0.868

To summarize, decades of research had produced largely consistent findings that the short-term (1-year) stability of personality traits is well above r = .9 and that it takes long time-periods to observe substantial changes in personality.

The next milestone in the history of research on personality stability and change was Roberts and DelVecchio’s (2000) influential meta-analysis that is featured in many textbooks and review articles (e.g., Caspi, Roberts, & Shiner, 2005; McAdams & Olson, 2010).

Roberts and DelVecchio’s literature review mentions Conley’s (1984) key findings. “When dissattenuated, measures of extraversion were quite consistent, averaging .98 over a 1-year period, approximately .70 over a 10-year period, and approximately .50 over a 40-year period” (p. 7).

The key finding of Roberts and DelVecchio’s meta-analysis was that age moderates the stability of personality. As shown in Figure 1, stability increases with age. The main limitation of Figure 1 is that it shows average retest correlations that are not tied to a specific time interval and are not corrected for measurement error. Thus, the finding that retest correlations in early and middle adulthood (22-49) average around .6 provides no information about the stability of personality in this age group.

Most readers of Roberts and DelVecchio (2000) fail to notice a short section that examines the influence of time interval on retest correlations.

“On the basis of the present data, the average trait consistency over a 1-year period would be .55; at 5 years, it would be .52; at 10 years, it would be .49; at 20 years, it would be .41; and at 40 years, it would be .25” (Roberts & DelVecchio, 2000, p. 16).

Using the aforementioned formula to correct for measurement error shows that Roberts and DelVecchio’s meta-analysis replicates Conley’s results, 1-year r = .983.

Years | Retest | Reliability | 1-Year Rate | 10-Year Rate
Weighted Average | | 0.73 | 0.983 | 0.842

Unfortunately, review articles often mistake these observed retest correlations for estimates of stability. For example, McAdams and Olson (2010) write “Roberts & DelVecchio (2000) determined that stability coefficients for dispositional traits were lowest in studies of children (averaging 0.41), rose to higher levels among young adults (around 0.55), and then reached a plateau for adults between the ages of 50 and 70 (averaging 0.70)” (p. 521) and fail to mention that these stability coefficients are not corrected for measurement error, which is a common mistake (Schmidt, 1996).

Roberts and DelVecchio’s (2000) article has shaped contemporary views that personality is much more malleable than the data suggest. A Twitter poll showed that only 11% of respondents guessed the right answer that the one-year stability is above .9, whereas 43% assumed the upper limit is r = .7. With r = .7 over a 1-year period, the stability over a 10-year period would be only r = .03. Thus, these respondents essentially assumed that personality has no stability over a 10-year period. More likely, respondents simply failed to take into account how high short-term stability has to be to allow for moderately high long-term stability.
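To see how high short-term stability has to be, the compounding can be run in both directions. The sketch assumes the constant-rate model used throughout this post. The function names `long_term_stability` and `required_one_year_rate` are made up for this example.

```python
def long_term_stability(one_year_rate, years):
    """Stability implied after a number of years by a constant annual rate."""
    return one_year_rate ** years


def required_one_year_rate(target_stability, years):
    """Annual rate needed to end up at a target stability after `years`."""
    return target_stability ** (1 / years)


if __name__ == "__main__":
    # A 1-year stability of .7 leaves almost nothing after a decade.
    print(round(long_term_stability(0.7, 10), 3))
    # Even a modest 10-year stability of .5 requires a 1-year rate above .9.
    print(round(required_one_year_rate(0.5, 10), 3))
    # A 10-year stability of .8 requires a 1-year rate close to .98.
    print(round(required_one_year_rate(0.8, 10), 3))
```

Under this model, any 10-year stability of .5 or higher already requires a 1-year stability above .93.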

The misinformation about personality stability is likely due to vague, verbal statements and the use of effect sizes that ignore the length of the retest interval. For example, Atherton, Grijalva, Roberts, and Robins (2021) published an article with a retest interval of 18-years. The abstract describes the results as “moderately-to-high stability over a 20-year period” (p. 841). Table 1 reports the observed correlations that control for random measurement error using a latent variable model with item-parcels as indicators.

The next table shows the results for the 4-year retest interval in adolescence and the 20-year retest interval in adulthood along with the implied 1-year rates. Consistent with Roberts and DelVecchio’s meta-analysis, the 1-year stability in adolescence is lower, r = .908, than in adulthood, r = .976.

Trait | Years | Retest | 1-Year Rate | Retest | Retest | 1-Year Rate

However, even in adolescence the 1-year stability is high. Most important, the 1-year rate for adults is consistent with estimates in Conley’s (1984) meta-analysis and the first study in 1941 by Crook, and even Roberts and DelVecchio’s meta-analysis when measurement error is taken into account. However, Atherton et al. (2021) fail to cite historic articles and fail to mention that their results replicate nearly a century of research on personality stability in adulthood.

Stable Variance in Personality

So far, I have used a model that assumes a fixed rate of change. The model also assumes that there are no stable influences on personality. That is, all causes of variation in personality can change and given enough time will change. This model implies that retest correlations eventually approach zero. The only reason why this may not happen is that human lives are too short to observe retest correlations of zero. For example, with r = .98 over a 1-year period, the 100-year retest correlation is still r = .13, but the 200-year retest correlation is r = .02.

With more than two retest intervals, it is possible to see that this model may not fit the data. If there is no measurement error, the correlation from t1 to t3 should equal the product of the two lags from t1 to t2 and from t2 to t3. If the t1-t3 correlation is larger than this model predicts, the data suggest the presence of some stable causes that do not change over time (Anusic & Schimmack, 2016; Kenny & Zautra, 1995).

Take the data from Atherton et al. (2021) as an example. The average retest correlation from t1 (beginning of college) to t3 (age 40) was r = .55. The correlation from beginning to end of college was r = .68, and the correlation from end of college to age 40 was r = .62. We see that .55 > .68 * .62 = .42.
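The comparison can be written out as a short check. This sketch only restates the arithmetic above; it does not estimate the amount of stable variance, which requires a model like the one in Anusic and Schimmack (2016).

```python
# Average retest correlations quoted in the text (Atherton et al., 2021).
r_t1_t2 = 0.68           # beginning to end of college
r_t2_t3 = 0.62           # end of college to age 40
r_t1_t3_observed = 0.55  # beginning of college to age 40

# Under a pure change model without measurement error, adjacent lags multiply.
r_t1_t3_predicted = r_t1_t2 * r_t2_t3
excess = r_t1_t3_observed - r_t1_t3_predicted

print(f"predicted t1-t3: {r_t1_t3_predicted:.2f}")
print(f"observed  t1-t3: {r_t1_t3_observed:.2f}")
print(f"excess over the pure change model: {excess:.2f}")
```

A positive excess is what the presence of stable causes would produce.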

Anusic and Schimmack (2016)

Anusic and Schimmack (2016) estimated the amount of stable variance in personality traits to be over 50%. This estimate may be revised in the future when better data become available. However, models with and without stable causes differ mainly in their predictions over long time intervals, where few data are currently available. The modeling has little influence on estimates of stability over time periods of less than 10 years.


This historic review of research on personality change and stability demonstrated that nearly a century of research has produced consistent findings. Unfortunately, many textbooks misrepresent this literature and cite evidence that does not correct for measurement error.

In their misleading, but influential meta-analysis, Roberts and DelVecchio concluded that “the average trait consistency over a 1-year period would be .55; at 5 years, it would be .52; at 10 years, it would be .49; at 20 years, it would be .41; and at 40 years, it would be .25” (p. 16).

The correct estimates, that is, the estimates corrected for measurement error, are much higher. The present results suggest that consistency over 1 year would be .98, at 5 years it would be .90, at 10 years it would be .82, at 20 years it would be .67, and at 40 years it would be .45. Long-term stability might even be higher if stable causes contribute substantially to variance in personality (Anusic & Schimmack, 2016).

The evidence of high stability in personality (yes, I think r = .8 over 10-years warrants the label high) has important practical and theoretical implications. First of all, stability of personality in adulthood is one of the few facts that students at the beginning of adulthood may find surprising. It may stimulate self-discovery and taking personality into account in major life decisions. Stability of personality also means that personality psychologists need to focus on the factors that cause stability in personality, but psychologists have traditionally focused on change because statistical tools are designed to focus on differences and deviations rather than invariances. However, just because the Earth is round or the speed of light is constant, natural sciences do not ignore these fixtures of life. It is time for personality psychologists to do the same. The results also have a (sobering) message for researchers interested in personality change. Real change takes time. Even a decade is a relatively short period to observe notable changes which is needed to find predictors of change. This may explain why there are currently no replicable findings of predictors of personality change.

So, what is the stability of personality over a one-year period in adulthood after taking measurement error into account? The correct answer is that it is greater than .9. You probably didn’t know this before reading this blog post. This does of course not mean that we are still the same person after one year or 10 years. However, the broader dispositions that are measured with the Big Five are unlikely to change in the near future for you, your spouse, or co-workers. Whether this is good or bad news depends on you.

Fact Checking Personality Development Research

Many models of science postulate a feedback loop between theories and data. Theories stimulate research that tests theoretical models. When the data contradict the theory and nobody can find flaws with the data, theories are revised to accommodate the new evidence. In reality, many sciences do not follow this idealistic model. Instead of testing theories, researchers try to accumulate evidence that supports their theories. In addition, evidence that contradicts the theory is ignored. As a result, theories never develop. These degenerative theories have been called paradigms. Psychology is filled with paradigms. One paradigm is the personality development paradigm. According to this paradigm, personality changes throughout adulthood towards the personality of a mature adult (emotionally stable, agreeable, and conscientious; Caspi, Roberts, & Shiner, 2005).

Many findings contradict this paradigm, but these findings are often ignored by personality development researchers. For example, a recent article on personality development (Zimmermann et al., 2021) claims that there is broad evidence for substantial rank-order and mean-level changes, citing outdated references from 2000 (Roberts & DelVecchio, 2000) and 2006 (Roberts et al., 2006). It is not difficult to find more recent studies that challenge these claims based on newer evidence and better statistical analyses (Anusic & Schimmack, 2016; Costa et al., 2019). It is symptomatic of a paradigm that findings that do not fit the personality development paradigm are ignored.

Another symptom of paradigmatic research is that interpretations of research findings do not fit the data. Zimmermann et al. (2021) conducted an impressive study of N = 3,070 students’ personality over the course of a semester. Some of these students stayed at their university and others went abroad. The focus of the article was to examine the potential influence of spending time abroad on personality. The findings are summarized in Table 1.

The key prediction of the personality development paradigm is that neuroticism decreases with age and that agreeableness and conscientiousness increase with age. This trend might be accelerated by spending time abroad, but it is also predicted for students who stay at their university (Robins et al., 2001).

The data do not support this prediction. In the two control groups, neither conscientiousness (d = -.11, -.02) nor agreeableness (d = -.02, .00) increased, and neuroticism increased (d = .08, .02). The group of students who were waiting to go abroad but stayed during the study period also showed no increase in conscientiousness (d = -.22, -.02) or agreeableness (d = -.16, .00), but showed a small decrease in neuroticism (d = -.08, -.01). The group that went abroad showed small increases in conscientiousness (d = .03, .09) and agreeableness (d = .14, .00), and a small decrease in neuroticism (d = -.14, .00). All of these effect sizes are very small, which may be due to the short time period. A semester is simply too short to see notable changes in personality.

These results are then interpreted as being fully consistent with the personality development paradigm.

A more accurate interpretation of these findings is that the effects of spending a semester abroad on personality are very small (d ~ .1) and that a semester is too short to discover changes in personality traits. The small effect sizes in this study are not surprising given the finding that even changes over a decade are no larger than d = .1 (Graham et al., 2020; also not cited by Zimmermann et al., 2021).

In short, the personality development paradigm is based on the assumption that personality changes substantially. However, empirical studies show much stronger evidence of stability, and this evidence is often not cited by prisoners of the personality development paradigm. It is therefore necessary to fact check articles on personality development because the abstracts and discussion sections often do not match the data.

Dan Ariely and the Credibility of (Social) Psychological Science

It was relatively quiet on academic twitter when most academics were enjoying the last weeks of summer before the start of a new, new-normal semester. This changed on August 17, when the datacolada crew published a new blog post that revealed fraud in a study of dishonesty. Suddenly, the integrity of social psychology was once again discussed on twitter, in several newspaper articles, and an article in Science magazine (O’Grady, 2021). The discovery of fraud in one dataset raises questions about other studies in articles published by the same researcher as well as in social psychology in general (“some researchers are calling Ariely’s large body of work into question”; O’Grady, 2021).

The brouhaha about the discovery of fraud is understandable because fraud is widely considered an unethical behavior that violates standards of academic integrity and may end a career (e.g., Stapel). However, there are many other reasons to be suspicious of the credibility of Dan Ariely’s published results and those by many other social psychologists. Over the past decade, strong scientific evidence has accumulated that social psychologists’ research practices were inadequate and often failed to produce solid empirical findings that can inform theories of human behavior, including dishonest behavior.

Arguably, the most damaging finding for social psychology was the finding that only 25% of published results could be replicated in a direct attempt to reproduce original findings (Open Science Collaboration, 2015). With such a low base-rate of successful replications, all published results in social psychology journals are likely to fail to replicate. The rational response to this discovery is to not trust anything that is published in social psychology journals unless there is evidence that a finding is replicable. Based on this logic, the discovery of fraud in a study published in 2012 is of little significance. Even without fraud, many findings are questionable.

Questionable Research Practices

The idealistic model of a scientist assumes that scientists test predictions by collecting data and then let the data decide whether the prediction was true or false. Articles are written to follow this script with an introduction that makes predictions, a results section that tests these predictions, and a conclusion that takes the results into account. This format makes articles look like they follow the ideal model of science, but it only covers up the fact that actual science is produced in a very different way; at least in social psychology before 2012. Either predictions are made after the results are known (Kerr, 1998) or the results are selected to fit the predictions (Simmons, Nelson, & Simonsohn, 2011).

This explains why most articles in social psychology support authors’ predictions (Sterling, 1959; Sterling et al., 1995; Motyl et al., 2017). This high success rate is not the result of brilliant scientists and deep insights into human behaviors. Instead, it is explained by selection for (statistical) significance. That is, when a study produces a statistically significant result that can be used to claim support for a prediction, researchers write a manuscript and submit it for publication. However, when the result is not significant, they do not write a manuscript. In addition, researchers will analyze their data in multiple ways. If they find one way that supports their predictions, they will report this analysis, and not mention that other ways failed to show the effect. Selection for significance has many names such as publication bias, questionable research practices, or p-hacking. Excessive use of these practices makes it easy to provide evidence for false predictions (Simmons, Nelson, & Simonsohn, 2011). Thus, the end-result of using questionable practices and fraud can be the same; published results are falsely used to support claims as scientifically proven or validated, when they actually have not been subjected to a real empirical test.

Although questionable practices and fraud have the same effect, scientists make a hard distinction between fraud and QRPs. While fraud is generally considered to be dishonest and punished with retractions of articles or even job losses, QRPs are tolerated. This leads to the false impression that articles that have not been retracted provide credible evidence and can be used to make scientific arguments (studies show ….). However, QRPs are much more prevalent than outright fraud and account for the majority of replication failures, but do not result in retractions (John, Loewenstein, & Prelec, 2012; Schimmack, 2021).

The good news is that the use of QRPs is detectable even when original data are not available, whereas fraud typically requires access to the original data to reveal unusual patterns. Over the past decade, my collaborators and I have worked on developing statistical tools that can reveal selection for significance (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020; Schimmack, 2012). I used the most advanced version of these methods, z-curve 2.0, to examine the credibility of results published in Dan Ariely’s articles.


To examine the credibility of results published in Dan Ariely’s articles I followed the same approach that I used for other social psychologists (Replicability Audits). I selected articles based on authors’ H-Index in WebOfKnowledge. At the time of coding, Dan Ariely had an H-Index of 47; that is, he published 47 articles that were cited at least 47 times. I also included the 48th article that was cited 47 times. I focus on the highly cited articles because dishonest reporting of results is more harmful, if the work is highly cited. Just like a falling tree may not make a sound if nobody is around, untrustworthy results in an article that is not cited have no real effect.

For all empirical articles, I picked the most important statistical test per study. The coding of focal results is important because authors may publish non-significant results when they made no prediction. They may also publish a non-significant result when they predict no effect. However, most claims are based on demonstrating a statistically significant result. The focus on a single result is needed to ensure statistical independence which is an assumption made by the statistical model. When multiple focal tests are available, I pick the first one unless another one is theoretically more important (e.g., featured in the abstract). Although this coding is subjective, other researchers including Dan Ariely can do their own coding and verify my results.

Thirty-one of the 48 articles reported at least one empirical study. As some articles reported more than one study, the total number of studies was k = 97. Most of the results were reported with test-statistics like t, F, or chi-square values. These values were first converted into two-sided p-values and then into absolute z-scores. 92 of these z-scores were statistically significant and used for a z-curve analysis.
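The second step of this conversion, from two-sided p-values to absolute z-scores, can be sketched with the Python standard library. The first step (from t, F, or chi-square values to p-values) needs the corresponding distribution functions from a statistics package and is not shown. The function name and the p-values below are hypothetical and are not taken from the coded articles.

```python
from statistics import NormalDist


def p_to_abs_z(p_two_sided):
    """Convert a two-sided p-value into an absolute z-score."""
    if not 0 < p_two_sided < 1:
        raise ValueError("p-value must lie strictly between 0 and 1")
    # Half of the two-sided p-value sits in each tail of the standard normal.
    return -NormalDist().inv_cdf(p_two_sided / 2)


# Hypothetical p-values for illustration only.
p_values = [0.049, 0.01, 0.20, 0.0004]
z_scores = [p_to_abs_z(p) for p in p_values]

# Only z-scores above 1.96 (p < .05, two-sided) enter the z-curve analysis.
significant = [z for z in z_scores if z > 1.96]

for p, z in zip(p_values, z_scores):
    print(f"p = {p:<6} -> |z| = {z:.2f}")
print(f"{len(significant)} of {len(z_scores)} are significant")
```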

Z-Curve Results

The key results of the z-curve analysis are captured in Figure 1.

Figure 1

Visual inspection of the z-curve plot shows clear evidence of selection for significance. While a large number of z-scores are just statistically significant (z > 1.96 equals p < .05), there are very few z-scores that are just shy of significance (z < 1.96). Moreover, the few z-scores that do not meet the standard of significance were all interpreted as sufficient evidence for a prediction. Thus, Dan Ariely’s observed success rate is 100% or 95% if only p-values below .05 are counted. As pointed out in the introduction, this is not a unique feature of Dan Ariely’s articles, but a general finding in social psychology.

A formal test of selection for significance compares the observed discovery rate (95% z-scores greater than 1.96) to the expected discovery rate that is predicted by the statistical model. The prediction of the z-curve model is illustrated by the blue curve. Based on the distribution of significant z-scores, the model expected a lot more non-significant results. The estimated expected discovery rate is only 15%. Even though this is just an estimate, the 95% confidence interval around this estimate ranges from 5% to only 31%. Thus, the observed discovery rate is clearly much much higher than one could expect. In short, we have strong evidence that Dan Ariely and his co-authors used questionable practices to report more successes than their actual studies produced.

Although these results cast a shadow over Dan Ariely’s articles, there is a silver lining. It is unlikely that the large pile of just significant results was obtained by outright fraud; not impossible, but unlikely. The reason is that QRPs are bound to produce just significant results, but fraud can produce extremely high z-scores. The fraudulent study that was flagged by datacolada has a z-score of 11, which is virtually impossible to produce with QRPs (Simmons et al., 2011). Thus, while we can disregard many of the results in Ariely’s articles, he does not have to fear losing his job (unless more fraud is uncovered by data detectives). Ariely is also in good company. The expected discovery rate for John A. Bargh is 15% (Bargh Audit) and the one for Roy F. Baumeister is 11% (Baumeister Audit).

The z-curve plot also shows some z-scores greater than 3 or even greater than 4. These z-scores are more likely to reveal true findings (unless they were obtained with fraud) because (a) it gets harder to produce high z-scores with QRPs and (b) replication studies show higher success rates for original studies with strong evidence (Schimmack, 2021). The problem is to find a reasonable criterion to distinguish between questionable results and credible results.

Z-curve makes it possible to do so because the EDR estimates can be used to estimate the false discovery risk (Schimmack & Bartos, 2021). As shown in Figure 1, with an EDR of 15% and a significance criterion of alpha = .05, the false discovery risk is 30%. That is, up to 30% of results with p-values below .05 could be false positive results. The false discovery risk can be reduced by lowering alpha. Figure 2 shows the results for alpha = .01. The estimated false discovery risk is now below 5%. This large reduction in the FDR was achieved by treating the pile of just significant results as no longer significant (i.e., it is now on the left side of the vertical red line that reflects significance with alpha = .01, z = 2.58).
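The step from an EDR of 15% to a false discovery risk of 30% can be reproduced with Sorić's (1989) upper bound on the false discovery rate. The text above does not spell out the formula, so treating it as the one behind the figure is an assumption of this sketch; the function name is also made up for illustration.

```python
def max_false_discovery_risk(discovery_rate, alpha=0.05):
    """Upper bound on the share of significant results that are false positives.

    Uses Soric's bound: (1 / discovery_rate - 1) * alpha / (1 - alpha),
    capped at 1 because a proportion cannot exceed 100%.
    """
    if not 0 < discovery_rate <= 1:
        raise ValueError("discovery_rate must be in (0, 1]")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be in (0, 1)")
    bound = (1 / discovery_rate - 1) * alpha / (1 - alpha)
    return min(bound, 1.0)


# EDR of 15% with alpha = .05, as reported for Figure 1.
print(f"{max_false_discovery_risk(0.15, alpha=0.05):.0%}")  # 30%
```

Note that the bound for alpha = .01 cannot be obtained by plugging .01 into the same call, because the discovery rate itself has to be re-estimated under the stricter criterion.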

With the new significance criterion only 51 of the 97 tests are significant (53%). Thus, it is not necessary to throw away all of Ariely’s published results. About half of his published results might have produced some real evidence. Of course, this assumes that z-scores greater than 2.58 are based on real data. Any investigation should therefore focus on results with p-values below .01.

The final information that is provided by a z-curve analysis is the probability that a replication study with the same sample size produces a statistically significant result. This probability is called the expected replication rate (ERR). Figure 1 shows an ERR of 52% with alpha = 5%, but it includes all of the just significant results. Figure 2 excludes these studies, but uses alpha = 1%. Figure 3 estimates the ERR only for studies that had a p-value below .01 but using alpha = .05 to evaluate the outcome of a replication study.

Figure 3

In Figure 3 only z-scores greater than 2.58 (p = .01; on the right side of the dotted blue line) are used to fit the model using alpha = .05 (the red vertical line at 1.96) as criterion for significance. The estimated replication rate is 85%. Thus, we would predict mostly successful replication outcomes with alpha = .05, if these original studies were replicated and if the original studies were based on real data.


The discovery of a fraudulent dataset in a study on dishonesty has raised new questions about the credibility of social psychology. Meanwhile, the much bigger problem of selection for significance is neglected. Rather than treating studies as credible unless they are retracted, it is time to distrust studies unless there is evidence to trust them. Z-curve provides one way to assure readers that findings can be trusted by keeping the false discovery risk at a reasonably low level, say below 5%. Applying this method to Ariely’s most cited articles showed that nearly half of Ariely’s published results can be discarded because they entail a high false positive risk. This is also true for many other findings in social psychology, but social psychologists try to pretend that the use of questionable practices was harmless and can be ignored. Instead, undergraduate students, readers of popular psychology books, and policy makers may be better off by ignoring social psychology until social psychologists report all of their results honestly and subject their theories to real empirical tests that may fail. That is, if social psychology wants to be a science, social psychologists have to act like scientists.

The Myth of Lifelong Personality Development

The German term for development is Entwicklung and evokes the image of a blossom slowly unwrapping its petals. This process has a start and a finish. At some point the blossom is fully open. Similarly, human development has a clear start with conception and usually an end when an individual becomes an adult. Not surprisingly, developmental psychology initially focused on the first two decades of a human life.

At some point, developmental psychologists also started to examine the influence of age at the end of life. Here, the focus was on successful aging in the face of biological decline. The idea of development at the beginning of life and decline at the end of life is consistent with the circle of life that is observed in nature.

In contrast to the circular conception of life, some developmental psychologists propose that some psychological processes continue to develop throughout adulthood. The idea of life-long development or growth makes the most sense for psychological processes that depend on learning. Over the life course, individuals acquire knowledge and skills. Although practice or the lack thereof may influence performance, individuals with a lot of experience are able to build on their past experiences.

Personality psychologists have divergent views about the development of personality. Some assume that personality is like many other biological traits. They develop during childhood when the brain establishes connections. However, when this process is completed, personality remains fairly stable. Moreover, new experiences may still change neural patterns and personality, but these changes will be idiosyncratic and differ from person to person. These theories do not predict a uniform increase in some personality traits during adulthood.

An alternative view is that we can distinguish between immature and mature personalities and that personality changes towards a goal of the completely mature personality, akin to the completely unfolded blossom. Moreover, this process of personality development or maturation does not end at the end of childhood. Rather, it is a lifelong process that continues over the adult life-span. Accordingly, personality becomes more mature as individuals are getting older.

What is a Mature Personality?

The notion of personality development during adulthood implies that some personality traits are more mature than others. After all, developmental processes have an end goal and the end goal is the mature state of being.

However, it is difficult to combine the concepts of personality and development because personality implies variation across individuals, just like there is variation across different types of flowers in terms of the number, shape, and color of petals. Should we say that a blossom with more petals is a better blossom? Which shape or color would reflect a better blossom? The answer is that there is no optimal blossom. All blossoms are mature when they are completely unfolded, but this mature state can look very different for different flowers.

Some personality psychologists have not really solved this problem, but rather used the notion of personality development as a label for any personality changes irrespective of direction. “The term ‘personality development’, as used in this paper, is mute with regard to direction of change. This means that personality development is not necessarily positive change due to functional adjustment, growth or maturation” (Specht et al., 2014, p. 217). While it is annoying that researchers may falsely use the term development when they mean change, it does absolve the researchers from specifying a developmental theory of personality development.

However, others take the notion of a mature personality more seriously (e.g., Hogan & Roberts, 2004, see also Specht et al., 2014). Accordingly, “a mature person from the observer’s viewpoint would be agreeable (supportive and warm), emotionally stable (consistent and positive), and conscientious (honoring commitments and playing by the rules)” (Hogan & Roberts, 2008, p. 9). According to this conception of a mature personality, the goal of personality development is to achieve a low level of neuroticism and high levels of agreeableness and conscientiousness.

Another problem for personality development theories is the existence of variation in mature traits in adulthood. If agreeableness, conscientiousness, and emotional stability are so useful in adult life, it is not clear why some individuals are biologically disposed to have low levels of these traits. The main explanation for variability in traits is that there are trade-offs and that neither extreme is optimal. For example, too much conscientiousness may lead to over-regulated behaviors that are not adaptive when life changes, and being too agreeable makes individuals vulnerable to exploitation. In contrast, developmental theories imply that individuals with high levels of neuroticism and low levels of agreeableness or conscientiousness are not fully developed and would have to explain why some individuals fail to achieve maturity.

Developmental processes also tend to have a specified time for the process to be completed. For example, flowers blossom at a specified time of year that is optimal for pollination. In humans, sexual development is completed by the end of adolescence to enable reproduction. So, it is reasonable to ask why development of personality should not also have a normal time of completion. If maturity is required to take on the tasks of an adult, including having children and taking care of them, the process should be completed during early adulthood, so that these traits are fully developed when they are needed. It would therefore make sense to assume that most of the development is completed by age 20 or at least age 30, as proposed by Costa and McCrae (cf. Specht et al., 2014). It is not clear why maturation would still occur in middle age or old age.

One possible explanation for late development could be that some individuals have a delayed or “arrested” development. Maybe some environmental factors impede the normal process of development, but the causal forces persist and can still produce the normative change later in adulthood. Another possibility is that personality development is triggered by environmental events. Maybe having children or getting married are life events that trigger personality development in the same way men’s testosterone levels appear to decrease when they enter long-term relationships and have children.

In short, a theory of lifelong development faces some theoretical challenges and alternative predictions about personality in adulthood are possible.

Empirical Claims

Wrzus and Roberts (2017) claim that agreeableness, conscientiousness, and emotional stability increase from young to middle adulthood, citing Roberts et al. (2006), Roberts & Mroczek (2008), and Lucas and Donnellan (2011). They also propose that these changes co-occur with life transitions, citing Bleidorn (2012, 2015), Le, Donnellan, & Conger (2014), Lodi-Smith & Roberts (2012), Specht, Egloff, and Schmukle (2011), and Zimmermann and Neyer (2013). A causal role of life events is implied by the claim that mean levels of the traits decrease in old age (Berg & Johansson, 2014; Kandler, Kornadt, Hagemeyer, & Neyer, 2015; Lucas & Donnellan, 2011; Mottus, Johnson, Starr, & Neyer, 2012). Focusing on work experiences, Asselmann and Specht (2020) propose that conscientiousness increases when people enter the workforce and decreases again at the time of retirement.

A recent review article by Costa, McCrae, and Lockenhoff (2019) also suggests that neuroticism decreases and agreeableness and conscientiousness increase over the adult life-span. However, they also point out that these age-trends are “modest.” They suggest that traits change by about one T-score per decade, which is a standardized mean difference of less than .2 standard deviations per decade. However, this effect size implies that changes may be as large as 1 standard deviation from age 20 to age 70.

More recently, Graham et al. (2020) summarized the literature with the claim that “during the emerging adult and midlife years, agreeableness, conscientiousness, openness, and extraversion tend to increase and neuroticism tends to decrease” (p. 303). However, when they conducted an integrated analysis of 16 longitudinal studies, the results were rather different. Most importantly, agreeableness did not increase. The combined effect was b = .02, with a 95%CI that included zero, b = -.02 to .07. Despite the lack of evidence that agreeableness increases with age during adulthood, the authors “tentatively suggest that agreeableness may increase over time” (p. 312).

The results for conscientiousness are even more damaging for the maturation theory. Here most datasets show a decrease in conscientiousness and the average effect size is statistically significant, b = -.05, 95%CI = -.09 to -.02. However, the effect size is small, suggesting that there is no notable age trend in conscientiousness.

The only trait that showed the predicted age-trend was neuroticism, but the effect size was again small and the upper bound of the 95%CI was close to zero, b = -.05, 95%CI = -.09 to -.01.

In sum, recent evidence from several longitudinal studies challenges the claim that personality develops during adulthood. However, longitudinal studies are often limited by rather short time-intervals of a few years up to one decade. If effect sizes over one decade are small, they can be easily masked by method artifacts (Costa et al., 2019). Although cross-sectional studies have their own problems, they have the advantage that it is much easier to cover the full age-range of adulthood. The key problem in cross-sectional studies is that age-effects can be confounded with cohort effects. However, when multiple cross-sectional studies from different survey years are available, it is possible to separate cohort effects and age-effects (Fosse & Winship, 2019).

Model Predictions

The maturity model also makes some predictions about age-trends for other constructs. One prediction is that well-being should increase as personality becomes more mature because numerous meta-analyses suggest that emotional stability, agreeableness, and conscientiousness predict higher well-being (Anglim et al., 2020). That being said, falsification of this prediction does not invalidate the maturity model. It is possible that other factors lower well-being in middle age or that higher maturity does not cause higher well-being. However, if the maturity model correctly predicts age effects on well-being, it would strengthen the model. I therefore tested age-effects on well-being and examined whether they are explained by personality development.

Statistical Analysis

Fosse and Winship (2019) noted that “despite the existence of hundreds, if not thousands, of articles and dozens of books, there is little agreement on how to adequately analyze age, period, and cohort data” (p. 468). This is also true for studies of personality development. Many of these studies fail to take cohort effects into account or ignore inconsistencies between cross-sectional and longitudinal results.

Fosse and Winship point out that there is an identification problem when cohort, period, and age effects are linear, but not if the trends have different distributions. For example, if age effects are non-linear, it is possible to distinguish between linear cohort effects, linear period effects, and non-linear age effects. As maturation is expected to produce stronger effects during early adulthood than in middle adulthood, and levels may actually decline in older age, it is plausible to expect a non-linear age effect. Thus, I examined age-effects in the German Socio-Economic Panel using a statistical model that examines non-linear age effects, while controlling for linear cohort and linear period effects.
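The identification problem can be illustrated with a small sketch. Because age equals survey year minus birth year for every observation, a constant can be shifted among the three linear slopes without changing any prediction, whereas the coefficient of a non-linear age term cannot be absorbed this way. The function name, the coefficients, the observations, and the quadratic form of the age effect are all hypothetical; this is not the model that was fitted to the SOEP data.

```python
import math

# Illustrative sketch of the age-period-cohort identification problem.
# All coefficients and observations are hypothetical.

def predict(age, period, cohort, b_age, b_period, b_cohort, b_age_sq=0.0):
    """Linear APC prediction with an optional quadratic (non-linear) age term."""
    return (b_age * age + b_period * period + b_cohort * cohort
            + b_age_sq * age ** 2)

# Hypothetical (survey year, birth year) pairs; age is fully determined by both.
observations = [(2005, 1980), (2009, 1950), (2013, 1990), (2017, 1965)]
rows = [(period - cohort, period, cohort) for period, cohort in observations]

k = 0.5  # arbitrary constant moved among the three linear slopes
original = [predict(a, p, c, 0.25, 0.5, -0.25) for a, p, c in rows]
shifted = [predict(a, p, c, 0.25 + k, 0.5 - k, -0.25 + k) for a, p, c in rows]

# Two different sets of linear slopes give identical predictions,
# so the data cannot tell them apart.
same_predictions = all(math.isclose(x, y) for x, y in zip(original, shifted))

# A change in the quadratic age coefficient does alter the predictions,
# so a non-linear age effect leaves a distinct trace in the data.
curved = [predict(a, p, c, 0.25, 0.5, -0.25, b_age_sq=-0.01) for a, p, c in rows]
nonlinear_changes_fit = any(not math.isclose(x, y) for x, y in zip(original, curved))

print(same_predictions, nonlinear_changes_fit)
```

The shift works because k * (age - period + cohort) is zero for every row. No such identity exists for the squared age term.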

Moreover, I included measures of marital status and work status to examine whether age effects are at least partially explained by these life experiences. The inclusion of these measures can also help with model identification (Fosse & Winship, 2019). For example, work and marriage have well-known age-effects. Thus, any age-effects on personality that are mediated by work or marriage are easily distinguished from cohort or period effects.

Measurement of Personality

Another limitation of many previous studies is the use of sum scores as measures of personality traits. It is well-known that these sum scores are biased by response styles (Anusic et al., 2009). Moreover, sum scores are influenced by the specific items that were selected to measure the Big Five traits and specific items can have their own age effects (Costa et al., 2019; Terracciano, McCrae, Brant, & Costa, 2005). Using a latent variable approach, it is possible to correct for random and systematic measurement errors and age effects on individual items. I therefore used a measurement model of personality that corrects for acquiescence and halo biases (Anusic et al., 2009). The specification of the model and detailed results can be found on OSF (

A model that assumed only age effects did not fit the data as well as a model that also allowed for cohort and period effects, chi2(df = 211) = 6651, CFI = .974, RMSEA = .021 vs. chi2(df = 201) = 5866, CFI = .977, RMSEA = .020, respectively. This finding shows that age-effects are confounded with other effects in models that do not specify cohort or period effects.
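The comparison of the two models can be sketched as a chi-square difference test using the fit statistics reported above. The sketch assumes that the age-only model is nested in the model with cohort and period effects and that the reported values are ordinary chi-square statistics whose difference is itself chi-square distributed; robust or scaled statistics would need a correction that is not shown here. The helper function is an illustrative stand-in that relies on a closed form valid only for even degrees of freedom.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum (x/2)^i / i! for i = 0 .. df/2 - 1, building each term from the last.
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Fit statistics as reported in the text.
chi2_age_only, df_age_only = 6651, 211
chi2_full, df_full = 5866, 201

delta_chi2 = chi2_age_only - chi2_full  # 785
delta_df = df_age_only - df_full        # 10
p_value = chi2_sf_even_df(delta_chi2, delta_df)

print(delta_chi2, delta_df, p_value)
```

With a difference of 785 on 10 degrees of freedom, the p-value is far below any conventional threshold. In a sample this large, the small differences in CFI and RMSEA are a useful reminder that even modest misfit is statistically detectable.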

Figure 1 shows the age effects for the Big Five traits.

The results do not support the maturation model. The most inconsistent finding is a strong negative effect of age on agreeableness. However, other traits also did not show a continuous trend throughout adulthood. Conscientiousness increased from age 17 to 35, but remained unchanged afterwards, whereas Openness decreased slightly until age 30 and then increased continuously.

To examine the robustness of these results, I conducted sensitivity analyses with varying controls. The results for agreeableness are shown in Figure 2.

All models show a decreasing trend, but the effect sizes vary. Models with no controls, or with controls for either cohort effects or time effects alone, produce a decreasing age trend, but the effect size is small as most scores deviate less than .2 standard deviations from the mean (i.e., zero). However, controlling for both time and cohort effects results in the strong decrease observed in Figure 1. Controlling for halo bias makes only a small difference. It is possible that the model that corrects for cohort and time effects overcorrects because it is difficult to distinguish age and time effects. However, none of these results are consistent with the prediction of the maturation model that agreeableness increases throughout adulthood.

Figure 3 takes a closer look at Neuroticism. Inconsistent with the maturation model, most models show a weak increase in neuroticism. The only model that shows a weak decrease controls for cohort effects only. One possible explanation for this finding is that it is difficult to distinguish between non-linear and linear age effects and that the negative time effect is actually an age effect. Even if this were true, the effect size of age is small.

The results for conscientiousness are most consistent with the maturation hypothesis. All models show a big increase from age 17 to age 20, and still a substantial increase from age 20 to age 35. At this point, conscientiousness levels remain fairly stable or decrease in the model that controls only for cohort effects. Although these results are most consistent with the maturation model, they do not support the prediction of a continuous process throughout adulthood. The increase is limited to early adulthood and is stronger at the beginning of adulthood, which is consistent with biological models of development (Costa et al., 2019).

Although not central to the maturation model, I also examined the influence of controls on age-effects for Extraversion and Openness.

Extraversion shows a very small increase over time in the model without controls and the model that controls only for period (time) effects. This trend turns negative in models that control for cohort effects, but all effect sizes are small.

Openness shows different results depending on whether models control for cohort effects. Without taking cohort effects into account, openness appears to decrease. However, after taking cohort effects into account, openness stays relatively unchanged until age 30 and then increases gradually. These results suggest that previous cross-sectional studies may have falsely interpreted cohort effects as age-effects and that openness does not decrease with age.

Work and Marriage as Mediators

Personality psychologists have focused on two theories to explain increases in conscientiousness during early adulthood. Some personality psychologists assume that it reflects the end stage of a biological process that increases self-regulation throughout childhood and adolescence (Costa & McCrae, 2006; Costa et al., 2019). The process is assumed to be complete by age 30. The present results suggest that it may be a bit later at age 35. The alternative theory is that social roles influence personality (Roberts, Wood, & Smith, 2005). A key prediction of the social investment theory is that personality development occurs when adults take on important social roles such as working full time, entering long-term romantic relationships (marriage), or parenting.

The SOEP makes it possible to test the social investment theory because it included questions about work and marital status. Most young adults start working full-time during their 20s, suggesting that work experiences may produce the increase in conscientiousness during this period. In Germany, marriage occurs later when individuals are in their 30s. Therefore marriage provides a particularly interesting test of the social investment theory because marriage occurs when biological maturation is mostly complete.

Figure 7 shows the age effect for work status. The age effect is clearly visible for all models and only slightly influenced by controlling for cohort or time effects.

Figure 8 shows the figure for marital status with cohabitating participants counted as married. The figure confirms that most Germans enter long-term relationships in their 30s.

To examine the contribution of work and marriage to the development of conscientiousness, I included marriage and work as predictors of conscientiousness. In this model the age-effects on conscientiousness can be decomposed into (a) an effect mediated by work (age -> work -> C), (b) an effect mediated by marriage (age -> married -> C), and (c) an effect of age that is mediated by unmeasured variables (e.g., biological processes). Results are similar for the various models and I present the results for the model that controls for cohort and time effects.
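This kind of decomposition can be sketched with the product-of-coefficients rule for linear path models: each indirect effect is the path from age to the mediator multiplied by the path from the mediator to conscientiousness, and the total effect is the sum of the indirect effects and the remaining effect of age. All path values below are hypothetical placeholders rather than estimates from the SOEP model, and the sketch treats age as a single linear predictor instead of the non-linear age effect used in the analysis.

```python
def decompose_age_effect(a_work, b_work, a_married, b_married, unmeasured):
    """Split a total age effect on conscientiousness (C) into three parts.

    a_*        : hypothetical path from age to the mediator
    b_*        : hypothetical path from the mediator to C
    unmeasured : effect of age on C not carried by work or marriage
    """
    via_work = a_work * b_work            # age -> work -> C
    via_marriage = a_married * b_married  # age -> married -> C
    return {
        "via_work": via_work,
        "via_marriage": via_marriage,
        "unmeasured": unmeasured,
        "total": via_work + via_marriage + unmeasured,
    }

# Hypothetical values: marriage depends on age but has no effect on C,
# so its indirect effect is zero even though the age -> married path is not.
effects = decompose_age_effect(
    a_work=0.5, b_work=0.25,
    a_married=0.5, b_married=0.0,
    unmeasured=0.125,
)
print(effects)
```

The example shows why a mediator can have a strong age trend and still contribute nothing to the age effect on the outcome: the indirect effect vanishes whenever either path is zero.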

The results show no effect of marriage; that is, the effect size for the indirect effect is close to zero, but both work and unmeasured mediators contribute to the total age effect. The unmeasured mediators produce a steep increase in the early 20s. This finding is consistent with a biological maturation hypothesis. Moreover, the unmeasured mediators produce a gradual decline over the life span with a surprising uptick at the end. This trajectory may be a sign of cognitive decline. The work effect increases much more gradually and is consistent with the social-role theory. Accordingly, the decrease in conscientiousness after age 55 is related to retirement. The negative effect of retirement on conscientiousness raises some interesting theoretical questions about the definition of personality. Does retirement really alter personality or does it merely alter situational factors that influence conscientious behaviors? To separate these hypotheses, it would be important to examine behaviors outside of work, but the trait measure that was used in this study does not provide information about the consistency of behaviors across different situations.

The key finding is that the data are consistent with two theories that are often treated as mutually exclusive and competing hypotheses. The present results suggest that both biological processes and social roles contribute to the development of conscientiousness during early adulthood. However, there is no evidence that this process continues in middle or late adulthood, and role effects tend to disappear as soon as individuals retire.

Personality Development and Well-Being

One view of personality assumes that variation in personality is normal and that no personality trait is better than another. In contrast, the maturation model implies that some traits are more desirable, if only because they are instrumental for fulfilling roles of adult life like working or maintaining relationships (McCrae & Costa, 1991). Accordingly, more mature individuals should have higher well-being. While meta-analyses suggest that this is the case, they often do not control for rating biases. When rating biases are taken into account, the positive effects of agreeableness and conscientiousness are not always found and are small (Schimmack, Schupp, & Wagner, 2008; Schimmack & Kim, 2020).

Another problem for the maturation theory is that well-being tends to decrease from early to middle adulthood when maturation should produce benefits. However, it is possible that other factors explain this decrease in well-being and maturation buffers these negative effects. To test this hypothesis, I added life-satisfaction to the model and examined mediators of age-effects on life-satisfaction.

An inspection of the direct relationships of personality traits and life-satisfaction confirmed that life-satisfaction ratings are most strongly influenced by neuroticism, b = -.37, se = .01. Response styles also had notable effects: halo, b = .15, se = .01; acquiescence, b = .19, se = .01. The effects of the remaining Big Five traits were weak: E, b = .078, se = .01; A, b = .07, se = .01; C, b = .02, se = .005; O, b = .07, se = .01. The weak effect of conscientiousness makes it unlikely that age-effects on conscientiousness contribute to age-effects on life-satisfaction.
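To put these coefficients side by side, each estimate can be divided by its standard error to obtain an approximate z-statistic and a 95% confidence interval. The sketch below uses the values reported above and assumes normally distributed estimates with a critical value of 1.96; the function and variable names are illustrative.

```python
# (b, se) pairs as reported in the text.
estimates = {
    "neuroticism": (-0.37, 0.01),
    "halo": (0.15, 0.01),
    "acquiescence": (0.19, 0.01),
    "extraversion": (0.078, 0.01),
    "agreeableness": (0.07, 0.01),
    "conscientiousness": (0.02, 0.005),
    "openness": (0.07, 0.01),
}

def wald_summary(b, se, z_crit=1.96):
    """Approximate z-statistic and 95% CI, assuming a normal sampling distribution."""
    return {"z": b / se, "ci_low": b - z_crit * se, "ci_high": b + z_crit * se}

summaries = {name: wald_summary(b, se) for name, (b, se) in estimates.items()}

# Rank predictors by the absolute size of the effect, not by significance.
ranked = sorted(estimates, key=lambda name: abs(estimates[name][0]), reverse=True)

for name in ranked:
    s = summaries[name]
    print(f"{name:18s} z = {s['z']:6.1f}  95% CI = [{s['ci_low']:.3f}, {s['ci_high']:.3f}]")
```

The conscientiousness effect is statistically distinguishable from zero (z of about 4), yet it is the smallest effect in the list. This is the distinction the argument relies on: an effect can be reliable and still too small to carry a meaningful share of the age trend in life-satisfaction.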

The next figure shows the age-effect for life-satisfaction. The total effect is rather flat and shows only an increase in the 60s.

The mostly stable level of life-satisfaction masks two opposing trends. As individuals enter the workforce and get married, life-satisfaction actually increases. The positive trajectory for work reverses when individuals retire, while the positive effect of marriage remains. However, the positive effects of work and marriage are undone by unexplained factors that decrease well-being until age 50, when a rebound is observed. Neuroticism is not a substantial mediator because there are no notable age-effects on neuroticism. Conscientiousness is not a notable mediator because it does not predict life-satisfaction.

The main insight from these findings is that achieving major milestones of adult life is associated with increased well-being, but that these positive effects are not explained by personality development.


Narrative reviews claim that personality develops steadily through adulthood. For example, in a recently published review of the literature, Roberts and Yoon claim that “agreeableness, conscientiousness, and emotional stability show increases steadily through midlife” (p. 10). Roberts and Yoon also claim that “forming serious partnerships is associated with decreases in neuroticism and increases in conscientiousness” (p. 11). The problem with these broad and vague statements is that they ignore inconsistencies across cross-sectional and longitudinal analyses (Lucas & Donnellan, 2011), inconsistencies across populations (Graham et al., 2020), and effect sizes (Costa et al., 2019).

The present results challenge this simplistic story of personality development. First, only conscientiousness shows a notable increase from late adolescence to middle age and most of the change occurs during early adulthood before the age of 35. Second, formation of long-term relationships had no effect on neuroticism or conscientiousness. Participation in the labor force did increase conscientiousness, but these gains were lost when older individuals retired. If conscientiousness were a sign of maturity, it is not clear why it would decrease after it was acquired. In short, the story of life-long development is not based on scientific facts.

The notion of personality development is also problematic from a theoretical perspective. It implies that some personality traits are better, more mature, than others. This has led to calls for interventions to help people to become more mature (Bleidorn et al., 2019). However, this proposal imposes values and implicitly devalues individuals with the wrong traits. An alternative view treats personality as variation without value judgment. Accordingly, it may be justified to help individuals to change their personality if they want to change their personality, just like gender changes are now considered a personal choice without imposing gender norms on individuals. However, it would be wrong to subject individuals to programs that aim to change their personality, just like it is now considered wrong to subject individuals to interventions that target their sexual orientation. Even if individuals want to change, it is not clear how much personality can be changed. Thus, another goal should be to help individuals with different personality traits to feel good about themselves and to live fulfilling lives that allow them to express their authentic personality. The rather weak relationships between many personality traits and well-being suggest that it is possible to have high well-being with a variety of personalities. The main exception is neuroticism, which has a strong negative effect on well-being. However, the question here is how much of this relationship is driven by mood disorders rather than normal variation in personality. The effect may also be moderated by social factors that create stress and anxiety.

In conclusion, the notion of personality development lacks clear theoretical foundations and empirical support. While there are some relatively small mean level changes in personality over the life span, they are relatively trivial compared to the large stable variance in personality traits across individuals. Rather than considering this variation as arrested forms of development, it should be celebrated as diversity that enriches everybody’s life.

Conflict of Interest: My views may be biased by my (immature) personality (high N, low A, low C).

P.S. I asked Brent W. Roberts for comments, but he declined the opportunity. Please share your comments in the comment section.