Dr. Ulrich Schimmack Blogs about Replicability

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY

In empirical studies with sampling error, replicability refers to the probability that a study with a significant result produces a significant result again in an exact replication of the first study using the same sample size and significance criterion (Schimmack, 2017).
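The definition can be made concrete with a small calculation. The sketch below assumes a hypothetical set of studies whose true power is known (in practice it is not) and assumes that every significant result is published. In an exact replication, the chance of a significant result equals the power of the original study, and studies with higher power are more likely to produce the original significant result in the first place. Under these assumptions, the expected replicability is the power-weighted mean of power. The function names and the example values are illustrative and are not part of the z-curve software.

```python
def mean_power(powers):
    """Unconditional mean power of all studies, significant or not."""
    return sum(powers) / len(powers)


def expected_replicability(powers):
    """Mean power of the studies that produced a significant result.

    A study with power p is selected with probability p and then
    replicates with probability p, so the expected success rate of
    exact replications is sum(p^2) / sum(p).
    """
    return sum(p * p for p in powers) / sum(powers)


# Hypothetical example: half of the studies have 20% power, half have 80%.
powers = [0.2, 0.8]
print(round(mean_power(powers), 2))              # mean power before selection
print(round(expected_replicability(powers), 2))  # mean power after selection
```

With heterogeneous power, the two numbers differ (.50 before selection, .68 after selection in this example); with homogeneous power they are identical.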

See Reference List at the end for peer-reviewed publications.

Mission Statement

The purpose of the R-Index blog is to increase the replicability of published results in psychological science and to alert consumers of psychological research about problems in published articles.

To evaluate the credibility or “incredibility” of published research, my colleagues and I developed several statistical tools such as the Incredibility Test (Schimmack, 2012), the Test of Insufficient Variance (Schimmack, 2014), and z-curve (Version 1.0: Brunner & Schimmack, 2020; Version 2.0: Bartos & Schimmack, 2021).

I have used these tools to demonstrate that several claims in psychological articles are incredible (a.k.a., untrustworthy), starting with Bem’s (2011) outlandish claims of time-reversed causal pre-cognition (Schimmack, 2012). This article triggered a crisis of confidence in the credibility of psychology as a science. 

Over the past decade it has become clear that many other seemingly robust findings are also highly questionable. For example, I showed that many claims in Nobel Laureate Daniel Kahneman’s book “Thinking, Fast and Slow” are based on shaky foundations (Schimmack, 2020). An entire book on unconscious priming effects, by John Bargh, also ignores replication failures and lacks credible evidence (Schimmack, 2017). The hypothesis that willpower is fueled by blood glucose and easily depleted is also not supported by empirical evidence (Schimmack, 2016). In general, many claims in social psychology are questionable and require new evidence to be considered scientific (Schimmack, 2020).

Each year I post new information about the replicability of research in 120 psychology journals (Schimmack, 2021). I have also started to provide information about the replicability of individual researchers, along with guidelines on how to evaluate their published findings (Schimmack, 2021).

Replication is essential for an empirical science, but it is not sufficient. Psychology also has a validation crisis (Schimmack, 2021). That is, measures are often used before it has been demonstrated how well they measure something. For example, psychologists have claimed that they can measure individuals’ unconscious evaluations, but there is no evidence that unconscious evaluations even exist (Schimmack, 2021a, 2021b).

If you are interested in the story of how I ended up becoming a meta-critic of psychological science, you can read it here (my journey).


Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, MP.2018.874, 1–22.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566.

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376.

Rejection Watch: Censorship at JEP-General

Articles published in peer-reviewed journals are only the tip of the scientific iceberg. Professional organizations want you to believe that these published articles are carefully selected to be the most important and scientifically credible articles. In reality, peer review is unreliable and invalid, and editorial decisions are based on personal preferences. For this reason, the censoring mechanism is often hidden. Part of the movement towards open science is to make the censoring process transparent.

I therefore post the decision letter and the reviews from JEP:General. I sent my ms “z-curve: an even better p-curve” to this journal because it published two articles on the p-curve method that are highly cited. The key point of my ms. is to point out that the p-curve app produces a “power” estimate of 97% for hand-coded articles by Leif Nelson, while z-curve produces an estimate of 52%. If you are a quantitative scientist, you will agree that this is a non-trivial difference and you are right to ask which of these estimates is more credible. The answer is provided by simulation studies that compare p-curve and z-curve and show that p-curve can dramatically overestimate “power” when the data are heterogeneous (Brunner & Schimmack, 2020). In short, the p-curve app sucks. Let the record show that JEP-General is happy to get more citations for a flawed method. The reason might be that z-curve is able to show publication bias in the original articles published in JEP-General (Replicability Rankings). Maybe Timothy J. Pleskac is afraid that somebody looks at his z-curve, which shows a few too many p-values that are just significant (ODR = 73% vs. EDR = 45%).

Unfortunately for psychologists, statistics is an objective science that can be evaluated using either mathematical proofs (Brunner & Schimmack, 2020) or simulation studies (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). It is just hard for psychologists to follow the science if the science doesn’t agree with their positive illusions and inflated egos.


Z-curve 2.0: An Even Better P-Curve
Journal of Experimental Psychology: General

Dear Dr. Schimmack,

I have received reviews of the manuscript entitled Z-curve 2.0: An Even Better P-Curve (XGE-2021-3638) that you recently submitted to Journal of Experimental Psychology: General. Upon receiving the paper I read the paper. I agree that Simonsohn, Nelson, & Simmons’ (2014) P-Curve paper has been quite impactful. As I read over the manuscript you submitted, I saw there was some potential issues raised that might help help advance our understanding of how to evaluate scientific work. Thus, I asked two experts to read and comment on the paper. The experts are very knowledgeable and highly respected experts in the topical area you are investigating.

Before reading their reviews, I reread the manuscript, and then again with the reviews in hand. In the end, both reviewers expressed some concerns that prevented them from recommending publication in Journal of Experimental Psychology: General. Unfortunately, I share many of these concerns. Perhaps the largest issue is that both reviewers identified a number formal issues that need more development before claims can be made about the z-curve such as the normality assumptions in the paper. I agree with Reviewer 2 that more thought and work is needed here to establish the validity of these assumptions and where and how these assumptions break down. I also agree with Reviewer 1 that more care is needed when defining and working with the idea of unconditional power. It would help to have the code, but that wouldn’t be sufficient as one should be able to read the description of the concept in the paper and be able to implement it computationally. I haven’t been able to do this. Finally, I also agree with Reviewer 1 that any use of the p-curve should have a p-curve disclosure table. I would also suggest ways to be more constructive in this critique. In many places, the writing and approach comes across as attacking people. That may not be the intention. But, that is how it reads.

Given these concerns, I regret to report that that I am declining this paper for publication in Journal of Experimental Psychology: General. As you probably know, we can accept only small fraction of the papers that are submitted each year. Accordingly, we must make decisions based not only on the scientific merit of the work but also with an eye to the potential level of impact for the findings for our broad and diverse readership. If you decide to pursue publication in another journal at some point (which I hope you will consider), I hope that the suggestions and comments offered in these reviews will be helpful.

Thank you for submitting your work to the Journal. I wish you the best in your continued research, and please try us again in the future if you think you have a manuscript that is a good fit for Journal of Experimental Psychology: General.


Timothy J. Pleskac, Ph.D.
Associate Editor
Journal of Experimental Psychology: General

Reviewers’ comments:

Reviewer #1: 1. This commentary submitted to JEPG begins presenting a p-curve analysis of early work by Leif Nelson.
Because it does not provide a p-curve disclosure table, this part of the paper cannot be evaluated.
The first p-curve paper (Simonsohn et al, 2014) reads: “P-curve disclosure table makes p-curvers accountable for decisions involved in creating a reported p-curve and facilitates discussion of such decisions. We strongly urge journals publishing p-curve analyses to require the inclusion of a p-curve disclosure table.” (p.540). As a reviewer I am aligning with these recommendation and am *requiring* a p-curve disclosure table, as in, I will not evaluate that portion of the paper, and moreover I will recommend the paper be rejected unless that analysis is removed, or a p-curve disclosure table is included, and is then evaluated as correctly conducted by the review team in an subsequent round of evaluation. The p-curve disclosure table for the Russ et al p-curve, even if not originally conducted by these authors, should be included as well, with a statement that the authors of this paper have examined the earlier p-curve disclosure table and deemed it correct. If an error exists in the literature we have to fix it, not duplicate it (I don’t know if there is an error, my point is, neither do the authors who are using it as evidence).

2. The commentary then makes arguments about estimating conditional vs unconditional power. While not exactly defined in the article, the authors come pretty close to defining conditional power, I think they mean by it the average power conditional on being included in p-curve (ironically, if I am wrong about the definition, the point is reinforced). I am less sure about what they mean by unconditional power. I think they mean that they include in the population parameter of interest not only the power of the studies included in p-curve, but also the power of studies excluded from it, so ALL studies. OK, this is an old argument, dating back to at least 2015, it is not new to this commentary, so I have a lot to say about it.

First, when described abstractly, there is some undeniable ‘system 1’ appeal to the notion of unconditional power. Why should we restrict our estimation to the studies we see? Isn’t the whole point to correct for publication bias and thus make inferences about ALL studies, whether we see them or not? That’s compelling. At least in the abstract. It’s only when one continues thinking about it that it becomes less appealing. More concretely, what does this set include exactly? Does ‘unconditional power’ include all studies ever attempted by the researcher, does it include those that could have been run but for practical purposes weren’t? does it include studies run on projects that were never published, does it include studies run, found to be significant, but eventually dropped because they were flawed? Does it include studies for which only pilots were run but not with the intention of conducting confirmatory analysis? Does it include studies which were dropped because the authors lost interest in the hypothesis? Does it include studies that were run but not published because upon seeing the results the authors came up with a modification of the research question for which the previous study was no longer relevant? Etc etc). The unconditional set of studies is not a defined set, without a definition of the population of studies we cannot define a population parameter for it, and we can hardly estimate a non-existing parameter. Now. I don’t want to trivialize this point. This issue of the population parameter we are estimating is an interesting issue, and reasonable people can disagree with the arguments I have outlined above (many have), but it is important to present the disagreement in a way that readers understand what it actually entails. An argument about changing the population parameter we estimate with p-curve is not about a “better p-curve”, it is about a non-p-curve. 
A non-p-curve which is better for the subset of people who are interested in the unconditional power, but a WORSE p-curve for those who want the conditional power (for example, it is worse for the goals of the original p-curve paper). For example, the first paper using p-curve for power estimation reads “Here is an intuitive way to think of p-curve’s estimate: It is the average effect size one expects to get if one were to rerun all studies included in p-curve”. So a tool which does not estimate that value, but a different value, it is not better, it is different. The standard deviation is neither better nor worse than the mean. They are different. It would be silly to say “Standard Deviation, a better Mean (because it captures dispersion and the mean does not)”. The standard deviation is better for someone interested in dispersion, and the standard deviation is worse for someone interested in the central tendency. Exactly the same holds for conditional vs unconditional power. (well, the same if z-curve indeed estimated unconditional power, i don’t know if that is true or not. Am skeptical but open minded).

Second, as mentioned above, this distinction of estimating the parameter of the subset of studies included in p-curve vs the parameter of “all studies” is old. I think that argument is seen as the core contribution of this commentary, and that contribution is not close to novel. As the quote above shows, it is a distinction made already in the original p-curve paper for estimating power. And, it is also not new to see it as a shortcoming of p-curve analysis. Multiple papers by Van Assen and colleagues, and by McShane and colleagues, have made this argument. They have all critiqued p-curve on those same grounds.

I therefore think this discussion should improve in the following ways: (i) give credit, and give voice, to earlier discussions of this issue (how is the argument put forward here different from the argument put forward in about a handful of previous papers making it, some already 5 years ago), (ii) properly define the universe of studies one is attempting to estimate power for (i.e., what counts in the set of unconditional power), and (iii) convey more transparently that this is a debate about what is the research question of interest, not of which tool provides the better answer to the same question. Deciding whether one wants to estimate the average power of one or another set of studies is completely fair game of an issue to discuss, and if indeed most readers don’t think they care about conditional power, and those readers use p-curve not realizing that’s what they are estimating, it is valuable to disabuse them of their confusion. But it is not accurate, and therefore productive, to describe this as a statistical discussion, it is a conceptual discussion.

3. In various places the paper reports results from calculations, but the authors have not shared neither the code nor data for those calculations, so these results cannot be adequately evaluated in peer-review, and that is the very purpose of peer-review. This shortcoming is particularly salient when the paper relies so heavily on code and data shared in earlier published work.

Finally, it should be clearer what is new in this paper. What is said here that is not said in the already published z-curve paper and p-curve critique papers?

Reviewer #2:
The paper reports a comparison between p-curve and z-curve procedures proposed in the literature. I found the paper to be unsatisfactory, and therefore cannot recommend publication in JEP:G. It reads more like a cropped section from the author’s recent piece in meta-psychology than a standalone piece that elaborates on the different procedures in detail. Because a lot is completely left out, it is very difficult to evaluate the results. For example, let us consider a couple of issues (this is not an exhaustive list):

– The z-curve procedure assumes that z-transformed p-values under the null hypothesis follow a standard Normal distribution. This follows from the general idea that the distribution of p-values under the null-hypothesis is uniform. However, this general idea is not necessarily true when p-values are computed for discrete distributions and/or composite hypotheses are involved. This seems like a point worth thinking about more carefully, when proposing a procedure that is intended to be applied to indiscriminate bodies of p-values. But nothing is said about this, which strikes me as odd. Perhaps I am missing something here.

– The z-curve procedure also assumes that the distribution of z-transformed p-values follows a Normal distribution or a mixture of homoskedastic Normals (distributions that can be truncated depending on the data being considered/omitted). But how reasonable is this parametric assumption? In their recently published paper, the authors state that this is as **a fact**, but provide no formal proof or reference to one. Perhaps I am missing something here. If anything, a quick look at classic papers on the matter, such as Hung et al. (1997, Biometrics), show that the cumulative distributions of p-values under different alternatives cross-over, which speaks against the equal-variance assumption. I don’t think that these questions about parametric assumptions are of secondary importance, given that they will play a major in the parameter estimates obtained with the mixture model.

Also, when comparing the different procedures, it is unclear whether the reported disagreements are mostly due to pedestrian technical choices when setting up an “app” rather than irreconcilable theoretical commitments. For example, there is nothing stopping one from conducting a p-curve analysis on a more fine-grained scale. The same can be said about engaging in mixture modeling. Who is/are the culprit/s here?

Finally, I found that the writing and overall tone could be much improved.

Personality Over Time: A Historic Review

The hallmark of a science is progress. To demonstrate that psychology is a science therefore requires evidence that current evidence, research methods, and theories are better than those in the past. Historic reviews are also needed because it is impossible to make progress without looking back once in a while.

Research on the stability or consistency of personality has a long history that started with the first empirical investigations in the 1930s, but a historic review of this literature is lacking. Few young psychologists interested in personality development may be familiar with Kelly, his work, or his American Psychologist article on “Consistency of the Adult Personality” (Kelly, 1955). Kelly starts his article with some personal observations about stability and change in traits that he observed in colleagues over the years.

Today, traits that are neither physical characteristics nor cognitive abilities are called personality traits, and they are commonly represented in the Big Five model. What have we learned about the stability of personality traits in adulthood from nearly a century of research?

Kelly (1955) reported some preliminary results from his own longitudinal study of personality that he started in the 1930s with engaged couples. Twenty years later, they completed follow-up questionnaires. Figure 6 reported the results for the Allport-Vernon value scales. I focus on these results because they make it possible to compare the 20-year retest correlations to retest correlations over a one-year period.

Figure 6 shows that personality, or at least values, are not perfectly stable. This is easily seen by a comparison of the one-year retest correlations with the 20-year retest correlations. The 20-year retest correlations are always lower than the one-year retest correlations. Individual differences in values change over time. Some individuals become more religious and others become less religious, for example. The important question is how much individuals change over time. To quantify change and stability it is important to specify a time interval because change implies lower retest correlations over longer retest intervals. Although the interval is arbitrary, a period of 1 year or 10 years can be used to quantify and compare stability and change of different personality traits. To do so, we need a model of change over time. A simple model is Heise’s (1969) autoregressive model that assumes a constant rate of change.

Take religious values as an example. Here we have two observed retest correlations, r(y1) = .75 and r(y20) = .60. Both correlations are attenuated by random measurement error. To correct for unreliability, we need to solve two equations with two unknowns, the rate of change and reliability.
.75 = rate^1 * rel
.60 = rate^20 * rel
With some rusty high-school math, I was able to solve these equations for rate
rate = (.60/.75)^(1/(20-1)) = .988
The implied 10-year stability is .988^10 = .886.
The estimated reliability is .75 / .988 = .759.
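
The same steps can be written as a short function. This is a minimal sketch of the two-equation solution above; the function name and argument names are my own, and the model assumes a constant annual rate of change and the same reliability at both retests.

```python
def solve_rate_and_reliability(r_short, t_short, r_long, t_long):
    """Solve r = rate**t * rel for two observed retest correlations.

    Dividing the two equations removes the reliability term:
    r_long / r_short = rate ** (t_long - t_short).
    """
    rate = (r_long / r_short) ** (1.0 / (t_long - t_short))
    reliability = r_short / rate ** t_short
    return rate, reliability


# Religious values: 1-year retest r = .75, 20-year retest r = .60
rate, rel = solve_rate_and_reliability(0.75, 1, 0.60, 20)
print(round(rate, 3))        # 1-year stability
print(round(rel, 3))         # reliability
print(round(rate ** 10, 3))  # implied 10-year stability
```

Carrying the unrounded rate forward gives a 10-year stability of about .889, marginally higher than the .886 obtained from the rounded rate of .988.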

Table 1 shows the results for all six values.

Value | 1-Year | 20-Year | Reliability | 1-Year Rate | 10-Year Rate
Table 1
Stability and Change of Allport-Vernon Values

The results show that the 1-year retest correlations are very similar to the reliability estimates of the value measure. After correcting for unreliability, the 1-year stability is extremely high, with stability estimates ranging from .96 for social values to .99 for religious values. The small differences in 1-year stabilities become notable only over longer time periods. The estimated 10-year stabilities range from .68 for social values to .90 for religious values.

Kelly reported results for two personality constructs that were measured with the Bernreuter personality questionnaire, namely self-confidence and sociability.

The implied stability of these personality traits is similar to the stability of values.

Personality | 1-Year | 20-Year | Reliability | 1-Year Rate | 10-Year Rate

Kelly’s results published in 1955 are based on a selective sample during a specific period of time that included the Second World War. It is therefore possible that studies with other populations during other time periods produce different results. However, the results of different studies are more similar than different.

The first article with retest correlations for different time intervals of reasonable length was published in 1941 by Mason N. Crook. The longest retest interval was 6-years and six months. Figure 1a in the article plotted the retest correlations as a function of the retest interval.

Table 2 shows the retest correlations and reveals that some of them are based on extremely small sample sizes. The 5-month retest is based on only 30 participants whereas the 8-month retest is based on 200 participants. Using this estimate for the short-term stability, it is possible to estimate the 1-year and 10-year rates using the formula given above.

Sample Size | Months | Retest | Reliability | 1-Year Rate | 10-Year Rate
Weighted Average | | | 0.75 | 0.958 | 0.651

The 1-year stability estimates are all above .9, except for the retest correlation that is based on only N = 18 participants. Given the small sample sizes, variability in estimates is mostly random noise. I computed a weighted average that takes both sample size and retest interval into account because longer time intervals provide better information about the actual rate of change. The estimated 1-year stability is r = .96, which implies a 10-year stability of .65. This is a bit lower than Kelly’s estimates, but this might just be sampling error. It is also possible that Crook’s results underestimate long-term stability because the model assumes a constant rate of change. It is possible that this assumption is false, as we will see later.

Crook also provided a meta-analysis that included other studies and suggested a hierarchy of consistency.

Accordingly, personality traits like neuroticism are less stable than cognitive abilities, but more stable than attitudes. As the Figure shows, empirical support for this hierarchy was limited, especially for estimates of the stability of attitudes.

Several decades later, Conley (1984) reexamined this hierarchy of consistency with more data. He was also the first to provide quantitative stability estimates that correct for unreliability. The meta-analysis included more studies and, more importantly, studies with long retest intervals. The longest retest interval was 45 years (Conley, 1983). After correcting for unreliability, the one-year stability was estimated to be r = .98, which implies a stability of r = .81 over a period of 10 years and r = .36 over 50 years.

Using the published retest correlations for studies with sample sizes greater than 100, I obtained a one-year stability estimate of r = .969 for neuroticism and r = .986 for extraversion. These differences may reflect differences in stability or could just be sampling error. The average reproduces Conley’s (1984) estimate of r = .98 (r = .978).

Neuroticism
Sample Size | Years | Retest | Reliability | 1-Year Rate | 10-Year Rate
Weighted Average | | | 0.74 | 0.969 | 0.730

Extraversion
Sample Size | Years | Retest | Reliability | 1-Year Rate | 10-Year Rate
Weighted Average | | | 0.73 | 0.986 | 0.868

To summarize, decades of research had produced largely consistent findings that the short-term (1-year) stability of personality traits is well above r = .9 and that it takes long time-periods to observe substantial changes in personality.

The next milestone in the history of research on personality stability and change was Roberts and DelVecchio’s (2000) influential meta-analysis that is featured in many textbooks and review articles (e.g., Caspi, Roberts, & Shiner, 2005; McAdams & Olson, 2010).

Roberts and DelVecchio’s literature review mentions Conley’s (1984) key findings. “When disattenuated, measures of extraversion were quite consistent, averaging .98 over a 1-year period, approximately .70 over a 10-year period, and approximately .50 over a 40-year period” (p. 7).

The key finding of Roberts and DelVecchio’s meta-analysis was that age moderates the stability of personality. As shown in Figure 1, stability increases with age. The main limitation of Figure 1 is that it shows average retest correlations that are neither tied to a specific time interval nor corrected for measurement error. Thus, the finding that retest correlations in early and middle adulthood (22-49) average around .6 provides no information about the stability of personality in this age group.

Most readers of Roberts and DelVecchio (2000) fail to notice a short section that examines the influence of time interval on retest correlations.

“On the basis of the present data, the average trait consistency over a 1-year period would be .55; at 5 years, it would be .52; at 10 years, it would be .49; at 20 years, it would be .41; and at 40 years, it would be .25” (Roberts & DelVecchio, 2000, p. 16).

Using the aforementioned formula to correct for measurement error shows that Roberts and DelVecchio’s meta-analysis replicates Conley’s results, 1-year r = .983.
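
One way to arrive at a 1-year rate of this size is sketched below. It pairs each longer interval with the 1-year retest correlation, applies the two-point formula, and averages the resulting rates with weights proportional to the length of the interval. The weighting scheme is an assumption for illustration; it follows the idea that longer intervals are more informative about the rate of change, but it is not necessarily the exact procedure behind the reported estimate, and it does not attempt to reproduce the reliability estimate.

```python
# (years, observed retest correlation) quoted from Roberts and DelVecchio (2000, p. 16)
retests = [(1, 0.55), (5, 0.52), (10, 0.49), (20, 0.41), (40, 0.25)]


def interval_weighted_rate(retests):
    """Average the pairwise 1-year rates, weighting by retest interval.

    The first entry serves as the short-interval anchor. Each longer
    interval yields rate = (r_long / r_short) ** (1 / (t_long - t_short)).
    """
    (t0, r0), longer = retests[0], retests[1:]
    rates = [(r / r0) ** (1.0 / (t - t0)) for t, r in longer]
    weights = [t for t, _ in longer]  # assumption: weight by interval length
    return sum(w * x for w, x in zip(weights, rates)) / sum(weights)


rate_rd = interval_weighted_rate(retests)
print(round(rate_rd, 3))        # 1-year rate, close to .98
print(round(rate_rd ** 10, 2))  # implied 10-year stability
```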

Years | Retest | Reliability | 1-Year Rate | 10-Year Rate
Weighted Average | | 0.73 | 0.983 | 0.842

Unfortunately, review articles often mistake these observed retest correlations for estimates of stability. For example, McAdams and Olson (2010) write “Roberts & DelVecchio (2000) determined that stability coefficients for dispositional traits were lowest in studies of children (averaging 0.41), rose to higher levels among young adults (around 0.55), and then reached a plateau for adults between the ages of 50 and 70 (averaging 0.70)” (p. 521) and fail to mention that these stability coefficients are not corrected for measurement error, which is a common mistake (Schmidt, 1996).

Roberts and DelVecchio’s (2000) article has shaped the contemporary view that personality is much more malleable than the data suggest. A Twitter poll showed that only 11% of respondents guessed the right answer that the one-year stability is above .9, whereas 43% assumed the upper limit is r = .7. With r = .7 over a 1-year period, the stability over a 10-year period would be only r = .03 (.7^10 = .03). Thus, these respondents essentially assumed that personality has no stability over a 10-year period. More likely, respondents simply failed to take into account how high short-term stability has to be to allow for moderately high long-term stability.

The misinformation about personality stability is likely due to vague, verbal statements and the use of effect sizes that ignore the length of the retest interval. For example, Atherton, Grijalva, Roberts, and Robins (2021) published an article with a retest interval of 18 years. The abstract describes the results as “moderately-to-high stability over a 20-year period” (p. 841). Table 1 reports the observed correlations that control for random measurement error using a latent variable model with item-parcels as indicators.

The next table shows the results for the 4-year retest interval in adolescence and the 20-year retest interval in adulthood along with the implied 1-year rates. Consistent with Roberts and DelVecchio’s meta-analysis, the 1-year stability in adolescence is lower, r = .908, than in adulthood, r = .976.

Trait | Years | Retest | 1-Year Rate | Retest | Retest | 1-Year Rate

However, even in adolescence the 1-year stability is high. Most important, the 1-year rate for adults is consistent with estimates in Conley’s (1984) meta-analysis and the first study in 1941 by Crook, and even Roberts and DelVecchio’s meta-analysis when measurement error is taken into account. However, Atherton et al. (2021) fail to cite historic articles and fail to mention that their results replicate nearly a century of research on personality stability in adulthood.

Stable Variance in Personality

So far, I have used a model that assumes a fixed rate of change. The model also assumes that there are no stable influences on personality. That is, all causes of variation in personality can change and given enough time will change. This model implies that retest correlations eventually approach zero. The only reason why this may not happen is that human lives are too short to observe retest correlations of zero. For example, with r = .98 over a 1-year period, the 100-year retest correlation is still r = .13, but the 200-year retest correlation is r = .02.
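
The decay implied by a fixed rate of change is easy to tabulate. The following sketch assumes the simple autoregressive model without stable variance and without measurement error; the function name is my own.

```python
def implied_retest(rate_1yr, years):
    """Retest correlation implied by a constant 1-year stability."""
    return rate_1yr ** years


# With a 1-year stability of .98, correlations decay slowly toward zero.
for years in (1, 5, 10, 20, 40, 100, 200):
    print(years, round(implied_retest(0.98, years), 2))

# By contrast, a 1-year stability of .70 leaves almost nothing after a decade.
print(round(implied_retest(0.70, 10), 2))
```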

With more than two retest intervals, it is possible to see that this model may not fit the data. If there is no measurement error, the correlation from t1 to t3 should equal the product of the two lags from t1 to t2 and from t2 to t3. If the t1-t3 correlation is larger than this model predicts, the data suggest the presence of some stable causes that do not change over time (Anusic & Schimmack, 2016; Kenny & Zautra, 1995).

Take the data from Atherton et al. (2021) as an example. The average retest correlation from t1 (beginning of college) to t3 (age 40) was r = .55. The correlation from beginning to end of college was r = .68, and the correlation from end of college to age 40 was r = .62. We see that .55 > .68 * .62 = .42.
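
The comparison can be expressed as a simple check. This is a minimal sketch with my own function and variable names; it only tests whether the observed t1-t3 correlation exceeds the product of the two adjacent lags and does not estimate the amount of stable variance.

```python
def autoregressive_prediction(r12, r23):
    """t1-t3 correlation predicted by a pure autoregressive model
    without measurement error: the product of the two adjacent lags."""
    return r12 * r23


# Average latent retest correlations reported for Atherton et al. (2021)
r12, r23, r13 = 0.68, 0.62, 0.55

predicted = autoregressive_prediction(r12, r23)
excess = r13 - predicted  # positive values point to stable causes

print(round(predicted, 2))
print(round(excess, 2))
```

A positive excess is what a model with some stable variance predicts; an excess near zero would be consistent with the simple model used so far.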

Anusic and Schimmack (2016)

Anusic and Schimmack (2016) estimated the amount of stable variance in personality traits to be over 50%. This estimate may be revised in the future when better data become available. However, models with and without stable causes differ mainly in their predictions over long time intervals, for which few data are currently available. The choice of model has little influence on estimates of stability over time periods of less than 10 years.


This historic review of research on personality change and stability demonstrated that nearly a century of research has produced consistent findings. Unfortunately, many textbooks misrepresent this literature and cite evidence that does not correct for measurement error.

In their misleading, but influential meta-analysis, Roberts and DelVeccio concluded that “the average trait consistency over a 1-year period would be .55; at 5 years, it would be .52; at 10 years, it would be .49; at 20 years, it would be .41; and at 40 years, it would be .25” (p. 16).

The estimates corrected for measurement error are much higher. The present results suggest that consistency over a 1-year period would be .98, at 5 years it would be .90, at 10 years it would be .82, at 20 years it would be .67, and at 40 years it would be .45. Long-term stability might even be higher if stable causes contribute substantially to variance in personality (Anusic & Schimmack, 2016).

The evidence of high stability in personality (yes, I think r = .8 over 10 years warrants the label high) has important practical and theoretical implications. First of all, stability of personality in adulthood is one of the few facts that students at the beginning of adulthood may find surprising. It may stimulate self-discovery and taking personality into account in major life decisions. Stability of personality also means that personality psychologists need to focus on the factors that cause stability in personality, but psychologists have traditionally focused on change because statistical tools are designed to focus on differences and deviations rather than invariances. However, natural sciences do not ignore fixtures of life just because they are constant, such as the roundness of the Earth or the speed of light. It is time for personality psychologists to do the same. The results also have a (sobering) message for researchers interested in personality change. Real change takes time. Even a decade is a relatively short period to observe notable changes, which are needed to find predictors of change. This may explain why there are currently no replicable findings of predictors of personality change.

So, what is the stability of personality over a one-year period in adulthood after taking measurement error into account? The correct answer is that it is greater than .9. You probably didn’t know this before reading this blog post. This does of course not mean that we are still the same person after one year or 10 years. However, the broader dispositions that are measured with the Big Five are unlikely to change in the near future for you, your spouse, or co-workers. Whether this is good or bad news depends on you.

Fact Checking Personality Development Research

Many models of science postulate a feedback loop between theories and data. Theories stimulate research that tests theoretical models. When the data contradict the theory and nobody can find flaws with the data, theories are revised to accommodate the new evidence. In reality, many sciences do not follow this idealistic model. Instead of testing theories, researchers try to accumulate evidence that supports their theories. In addition, evidence that contradicts the theory is ignored. As a result, theories never develop. These degenerative theories have been called paradigms. Psychology is filled with paradigms. One paradigm is the personality development paradigm. According to this paradigm, personality changes throughout adulthood towards the personality of a mature adult (emotionally stable, agreeable, and conscientious; Caspi, Roberts, & Shiner, 2005).

Many findings contradict this paradigm, but these findings are often ignored by personality development researchers. For example, a recent article on personality development (Zimmermann et al., 2021) claims that there is broad evidence for substantial rank-order and mean-level changes citing outdated references from 2000 (Roberts & DelVeccio, 2000) and 2006 (Roberts et al., 2006). It is not difficult to find more recent studies that challenge these claims based on newer evidence and better statistical analyses (Anusic & Schimmack, 2016; Costa et al., 2019). It is symptomatic of a paradigm that these findings that do not fit the personality development paradigm are ignored.

Another symptom of paradigmatic research is that interpretations of research findings do not fit the data. Zimmermann et al. (2021) conducted an impressive study of N = 3,070 students’ personality over the course of a semester. Some of these students stayed at their university and others went abroad. The focus of the article was to examine the potential influence of spending time abroad on personality. The findings are summarized in Table 1.

The key prediction of the personality development paradigm is that neuroticism decreases with age and that agreeableness and conscientiousness increase with age. This trend might be accelerated by spending time abroad, but it is also predicted for students who stay at their university (Robins et al., 2001).

The data do not support this prediction. In the two control groups, neither conscientiousness (d = -.11, d = -.02) nor agreeableness increased (d = -.02, .00) and neuroticism increased (d = .08, .02). The group of students who were waiting to go abroad, but also stayed during the study period also showed no increase in conscientiousness (d = -.22, -.02) or agreeableness (d = -.16, .00), but showed a small decrease in neuroticism (d = -.08, -.01). The group that went abroad showed small increases in conscientiousness (d = .03, .09) and agreeableness (d = .14, .00), and a small decrease in neuroticism (d = -.14, d = .00). All of these effect sizes are very small, which may be due to the short time period. A semester is simply too short to see notable changes in personality.

These results are then interpreted as being fully consistent with the personality development paradigm.

A more accurate interpretation of these findings is that the effects of spending a semester abroad on personality are very small (d ~ .1) and that a semester is too short to discover changes in personality traits. The small effect sizes in this study are not surprising given the finding that even changes over a decade are no larger than d = .1 (Graham et al., 2020; also not cited by Zimmermann et al., 2021).

In short, the personality development paradigm is based on the assumption that personality changes substantially. However, empirical studies show much stronger evidence of stability, and this evidence is often not cited by prisoners of the personality development paradigm. It is therefore necessary to fact check articles on personality development because the abstracts and discussion sections often do not match the data.

Dan Ariely and the Credibility of (Social) Psychological Science

It was relatively quiet on academic twitter when most academics were enjoying the last weeks of summer before the start of a new, new-normal semester. This changed on August 17, when the datacolada crew published a new blog post that revealed fraud in a study of dishonesty (http://datacolada.org/98). Suddenly, the integrity of social psychology was once again discussed on twitter, in several newspaper articles, and an article in Science magazine (O’Grady, 2021). The discovery of fraud in one dataset raises questions about other studies in articles published by the same researcher as well as in social psychology in general (“some researchers are calling Ariely’s large body of work into question”; O’Grady, 2021).

The brouhaha about the discovery of fraud is understandable because fraud is widely considered an unethical behavior that violates standards of academic integrity and may end a career (e.g., Stapel). However, there are many other reasons to be suspicious of the credibility of Dan Ariely’s published results and those by many other social psychologists. Over the past decade, strong scientific evidence has accumulated that social psychologists’ research practices were inadequate and often failed to produce solid empirical findings that can inform theories of human behavior, including dishonest behavior.

Arguably, the most damaging finding for social psychology was the finding that only 25% of published results could be replicated in a direct attempt to reproduce original findings (Open Science Collaboration, 2015). With such a low base-rate of successful replications, any given result published in a social psychology journal is more likely to fail than to succeed in a replication attempt. The rational response to this discovery is to not trust anything that is published in social psychology journals unless there is evidence that a finding is replicable. Based on this logic, the discovery of fraud in a study published in 2012 is of little significance. Even without fraud, many findings are questionable.

Questionable Research Practices

The idealistic model of a scientist assumes that scientists test predictions by collecting data and then let the data decide whether the prediction was true or false. Articles are written to follow this script with an introduction that makes predictions, a results section that tests these predictions, and a conclusion that takes the results into account. This format makes articles look like they follow the ideal model of science, but it only covers up the fact that actual science is produced in a very different way; at least in social psychology before 2012. Either predictions are made after the results are known (Kerr, 1998) or the results are selected to fit the predictions (Simmons, Nelson, & Simonsohn, 2011).

This explains why most articles in social psychology support authors’ predictions (Sterling, 1959; Sterling et al., 1995; Motyl et al., 2017). This high success rate is not the result of brilliant scientists and deep insights into human behaviors. Instead, it is explained by selection for (statistical) significance. That is, when a study produces a statistically significant result that can be used to claim support for a prediction, researchers write a manuscript and submit it for publication. However, when the result is not significant, they do not write a manuscript. In addition, researchers will analyze their data in multiple ways. If they find one way that supports their predictions, they will report this analysis, and not mention that other ways failed to show the effect. Selection for significance has many names such as publication bias, questionable research practices, or p-hacking. Excessive use of these practices makes it easy to provide evidence for false predictions (Simmons, Nelson, & Simonsohn, 2011). Thus, the end-result of using questionable practices and fraud can be the same; published results are falsely used to support claims as scientifically proven or validated, when they actually have not been subjected to a real empirical test.

Although questionable practices and fraud have the same effect, scientists make a hard distinction between fraud and QRPs. While fraud is generally considered to be dishonest and punished with retractions of articles or even job losses, QRPs are tolerated. This leads to the false impression that articles that have not been retracted provide credible evidence and can be used to make scientific arguments (studies show ….). However, QRPs are much more prevalent than outright fraud and account for the majority of replication failures, but do not result in retractions (John, Loewenstein, & Prelec, 2012; Schimmack, 2021).

The good news is that the use of QRPs is detectable even when original data are not available, whereas fraud typically requires access to the original data to reveal unusual patterns. Over the past decade, my collaborators and I have worked on developing statistical tools that can reveal selection for significance (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020; Schimmack, 2012). I used the most advanced version of these methods, z-curve 2.0, to examine the credibility of results published in Dan Ariely’s articles.


To examine the credibility of results published in Dan Ariely’s articles I followed the same approach that I used for other social psychologists (Replicability Audits). I selected articles based on authors’ H-Index in WebOfKnowledge. At the time of coding, Dan Ariely had an H-Index of 47; that is, he published 47 articles that were cited at least 47 times. I also included the 48th article that was cited 47 times. I focus on the highly cited articles because dishonest reporting of results is more harmful if the work is highly cited. Just like a falling tree may not make a sound if nobody is around, untrustworthy results in an article that is not cited have no real effect.

For all empirical articles, I picked the most important statistical test per study. The coding of focal results is important because authors may publish non-significant results when they made no prediction. They may also publish a non-significant result when they predict no effect. However, most claims are based on demonstrating a statistically significant result. The focus on a single result is needed to ensure statistical independence which is an assumption made by the statistical model. When multiple focal tests are available, I pick the first one unless another one is theoretically more important (e.g., featured in the abstract). Although this coding is subjective, other researchers including Dan Ariely can do their own coding and verify my results.

Thirty-one of the 48 articles reported at least one empirical study. As some articles reported more than one study, the total number of studies was k = 97. Most of the results were reported with test-statistics like t, F, or chi-square values. These values were first converted into two-sided p-values and then into absolute z-scores. 92 of these z-scores were statistically significant and used for a z-curve analysis.
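
The second step of this conversion, from a two-sided p-value to an absolute z-score, can be sketched with the Python standard library. This is an illustration and not the code used for the audit; the function name `p_to_abs_z` is hypothetical, and the first step (test statistic to p-value) is left out because it depends on the test and its degrees of freedom.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_to_abs_z(p_two_sided: float) -> float:
    """Convert a two-sided p-value (0 < p <= 1) into an absolute z-score.

    Half of the p-value lies in each tail, so the z-score is the
    standard normal quantile that cuts off p / 2 in the lower tail,
    with the sign flipped to make it positive.
    """
    return abs(STANDARD_NORMAL.inv_cdf(p_two_sided / 2))


for p in (0.05, 0.01, 0.001):
    print(f"p = {p:<5} -> |z| = {p_to_abs_z(p):.2f}")
```

Working with the lower tail (p / 2) rather than 1 - p / 2 avoids rounding problems for very small p-values.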

Z-Curve Results

The key results of the z-curve analysis are captured in Figure 1.

Figure 1

Visual inspection of the z-curve plot shows clear evidence of selection for significance. While a large number of z-scores are just statistically significant (z > 1.96 equals p < .05), there are very few z-scores that are just shy of significance (z < 1.96). Moreover, the few z-scores that do not meet the standard of significance were all interpreted as sufficient evidence for a prediction. Thus, Dan Ariely’s observed success rate is 100% or 95% if only p-values below .05 are counted. As pointed out in the introduction, this is not a unique feature of Dan Ariely’s articles, but a general finding in social psychology.

A formal test of selection for significance compares the observed discovery rate (95% of z-scores are greater than 1.96) to the expected discovery rate that is predicted by the statistical model. The prediction of the z-curve model is illustrated by the blue curve. Based on the distribution of significant z-scores, the model expected a lot more non-significant results. The estimated expected discovery rate is only 15%. Even though this is just an estimate, the 95% confidence interval around this estimate ranges from 5% to only 31%. Thus, the observed discovery rate is clearly much higher than one could expect. In short, we have strong evidence that Dan Ariely and his co-authors used questionable practices to report more successes than their actual studies produced.

Although these results cast a shadow over Dan Ariely’s articles, there is a silver lining. It is unlikely that the large pile of just significant results was obtained by outright fraud; not impossible, but unlikely. The reason is that QRPs are bound to produce just significant results, but fraud can produce extremely high z-scores. The fraudulent study that was flagged by datacolada has a z-score of 11, which is virtually impossible to produce with QRPs (Simmons et al., 2011). Thus, while we can disregard many of the results in Ariely’s articles, he does not have to fear losing his job (unless more fraud is uncovered by data detectives). Ariely is also in good company. The expected discovery rate for John A. Bargh is 15% (Bargh Audit) and the one for Roy F. Baumeister is 11% (Baumeister Audit).

The z-curve plot also shows some z-scores greater than 3 or even greater than 4. These z-scores are more likely to reveal true findings (unless they were obtained with fraud) because (a) it gets harder to produce high z-scores with QRPs and (b) replication studies show higher success rates for original studies with strong evidence (Schimmack, 2021). The problem is to find a reasonable criterion to distinguish between questionable results and credible results.

Z-curve makes it possible to do so because the EDR estimates can be used to estimate the false discovery risk (Schimmack & Bartos, 2021). As shown in Figure 1, with an EDR of 15% and a significance criterion of alpha = .05, the false discovery risk is 30%. That is, up to 30% of results with p-values below .05 could be false positive results. The false discovery risk can be reduced by lowering alpha. Figure 2 shows the results for alpha = .01. The estimated false discovery risk is now below 5%. This large reduction in the FDR was achieved by treating the pile of just significant results as no longer significant (i.e., it is now on the left side of the vertical red line that reflects significance with alpha = .01, z = 2.58).
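
The step from an EDR of 15% to a false discovery risk of 30% can be reproduced with a short calculation. The sketch below assumes that the false discovery risk is the maximum false discovery rate given by Sorić's (1989) formula, which is my reading of how z-curve 2.0 derives it; the function name `false_discovery_risk` is made up for this example.

```python
def false_discovery_risk(edr: float, alpha: float = 0.05) -> float:
    """Maximum false discovery rate implied by a discovery rate.

    Assumption: Soric's upper bound, FDR_max = (1/EDR - 1) * alpha / (1 - alpha),
    capped at 1 because a rate cannot exceed 100%.
    """
    bound = (1 / edr - 1) * (alpha / (1 - alpha))
    return min(1.0, bound)


# Point estimate and the limits of the 95% confidence interval of the EDR.
for edr in (0.15, 0.05, 0.31):
    risk = false_discovery_risk(edr, alpha=0.05)
    print(f"EDR = {edr:.0%} -> false discovery risk = {risk:.0%}")
```

The result for alpha = .01 cannot be obtained by changing only `alpha` here, because the EDR has to be re-estimated for the stricter criterion.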

With the new significance criterion only 51 of the 97 tests are significant (53%). Thus, it is not necessary to throw away all of Ariely’s published results. About half of his published results might have produced some real evidence. Of course, this assumes that z-scores greater than 2.58 are based on real data. Any investigation should therefore focus on results with p-values below .01.

The final information that is provided by a z-curve analysis is the probability that a replication study with the same sample size produces a statistically significant result. This probability is called the expected replication rate (ERR). Figure 1 shows an ERR of 52% with alpha = 5%, but it includes all of the just significant results. Figure 2 excludes these studies, but uses alpha = 1%. Figure 3 estimates the ERR only for studies that had a p-value below .01 but using alpha = .05 to evaluate the outcome of a replication study.

Figure 3

In Figure 3 only z-scores greater than 2.58 (p = .01; on the right side of the dotted blue line) are used to fit the model using alpha = .05 (the red vertical line at 1.96) as criterion for significance. The estimated replication rate is 85%. Thus, we would predict mostly successful replication outcomes with alpha = .05, if these original studies were replicated and if the original studies were based on real data.


The discovery of a fraudulent dataset in a study on dishonesty has raised new questions about the credibility of social psychology. Meanwhile, the much bigger problem of selection for significance is neglected. Rather than treating studies as credible unless they are retracted, it is time to distrust studies unless there is evidence to trust them. Z-curve provides one way to assure readers that findings can be trusted by keeping the false discovery risk at a reasonably low level, say below 5%. Applying this method to Ariely’s most cited articles showed that nearly half of Ariely’s published results can be discarded because they entail a high false positive risk. This is also true for many other findings in social psychology, but social psychologists try to pretend that the use of questionable practices was harmless and can be ignored. Instead, undergraduate students, readers of popular psychology books, and policy makers may be better off by ignoring social psychology until social psychologists report all of their results honestly and subject their theories to real empirical tests that may fail. That is, if social psychology wants to be a science, social psychologists have to act like scientists.

The Myth of Lifelong Personality Development

The German term for development is Entwicklung and evokes the image of a blossom slowly unwrapping its petals. This process has a start and a finish. At some point the blossom is fully open. Similarly, human development has a clear start with conception and usually an end when an individual becomes an adult. Not surprisingly, developmental psychology initially focused on the first two decades of a human life.

At some point, developmental psychologists also started to examine the influence of age at the end of life. Here, the focus was on successful aging in the face of biological decline. The idea of development at the beginning of life and decline at the end of life is consistent with the circle of life that is observed in nature.

In contrast to the circular conception of life, some developmental psychologists propose that some psychological processes continue to develop throughout adulthood. The idea of life-long development or growth makes the most sense for psychological processes that depend on learning. Over the life course, individuals acquire knowledge and skills. Although practice or the lack thereof may influence performance, individuals with a lot of experience are able to build on their past experiences.

Personality psychologists have divergent views about the development of personality. Some assume that personality is like many other biological traits. They develop during childhood when the brain establishes connections. However, when this process is completed, personality remains fairly stable. Moreover, new experiences may still change neural patterns and personality, but these changes will be idiosyncratic and differ from person to person. These theories do not predict a uniform increase in some personality traits during adulthood.

An alternative view is that we can distinguish between immature and mature personalities and that personality changes towards a goal of the completely mature personality, akin to the completely unfolded blossom. Moreover, this process of personality development or maturation does not end at the end of childhood. Rather, it is a lifelong process that continues over the adult life-span. Accordingly, personality becomes more mature as individuals are getting older.

What is a Mature Personality?

The notion of personality development during adulthood implies that some personality traits are more mature than others. After all, developmental processes have an end goal and the end goal is the mature state of being.

However, it is difficult to combine the concepts of personality and development because personality implies variation across individuals, just like there is variation across different types of flowers in terms of the number, shape, and color of petals. Should we say that a blossom with more petals is a better blossom? Which shape or color would reflect a better blossom? The answer is that there is no optimal blossom. All blossoms are mature when they are completely unfolded, but this mature state can look very different for different flowers.

Some personality psychologists have not really solved this problem, but rather used the notion of personality development as a label for any personality changes irrespective of direction. “The term ‘personality development’, as used in this paper, is mute with regard to direction of change. This means that personality development is not necessarily positive change due to functional adjustment, growth or maturation” (Specht et al., 2014, p. 217). While it is annoying that researchers may falsely use the term development when they mean change, it does absolve the researchers from specifying a developmental theory of personality development.

However, others take the notion of a mature personality more seriously (e.g., Hogan & Roberts, 2004, see also Specht et al., 2014). Accordingly, “a mature person from the observer’s viewpoint would be agreeable (supportive and warm), emotionally stable (consistent and positive), and conscientious (honoring commitments and playing by the rules)” (Hogan & Roberts, 2008, p. 9). According to this conception of a mature personality, the goal of personality development is to achieve a low level of neuroticism and high levels of agreeableness and conscientiousness.

Another problem for personality development theories is the existence of variation in mature traits in adulthood. If agreeableness, conscientiousness, and emotional stability are so useful in adult life, it is not clear why some individuals are biologically disposed to have low levels of these traits. The main explanation for variability in traits is that there are trade-offs and that neither extreme is optimal. For example, too much conscientiousness may lead to over-regulated behaviors that are not adaptive when life changes and being too agreeable makes individuals vulnerable to exploitation. In contrast, developmental theories imply that individuals with high levels of neuroticism and low levels of agreeableness or conscientiousness are not fully developed and would have to explain why some individuals do not achieve maturity.

Developmental processes also tend to have a specified time for the process to be completed. For example, flowers blossom at a specified time of year that is optimal for pollination. In humans, sexual development is completed by the end of adolescence to enable reproduction. So, it is reasonable to ask why development of personality should not also have a normal time of completion. If maturity is required to take on the tasks of an adult, including having children and taking care of them, the process should be completed during early adulthood, so that these traits are fully developed when they are needed. It would therefore make sense to assume that most of the development is completed by age 20 or at least age 30, as proposed by Costa and McCrae (cf. Specht et al., 2014). It is not clear why maturation would still occur in middle age or old age.

One possible explanation for late development could be that some individuals have a delayed or “arrested” development. Maybe some environmental factors impede the normal process of development, but the causal forces persist and can still produce the normative change later in adulthood. Another possibility is that personality development is triggered by environmental events. Maybe having children or getting married are life events that trigger personality development in the same way men’s testosterone levels appear to decrease when they enter long-term relationships and have children.

In short, a theory of lifelong development faces some theoretical challenges and alternative predictions about personality in adulthood are possible.

Empirical Claims

Wrzus and Roberts (2017) claim that agreeableness, conscientiousness, and emotional stability increase from young to middle adulthood citing Roberts et al. (2006), Roberts & Mroczek (2008), and Lucas and Donnellan (2011). They also propose that these changes co-occur with life transitions citing Bleidorn (2012, 2015), Le, Donnellan, & Conger (2014), Lodi-Smith & Roberts (2012), Specht, Egloff, and Schmukle (2011) and Zimmermann and Neyer (2013). A causal role of life events is implied by the claim that mean levels of the traits decrease in old age (Berg & Johansson, 2014; Kandler, Kornadt, Hagemeyer, & Neyer, 2015; Lucas & Donnellan, 2011; Mottus, Johnson, Starr, & Neyer, 2012). Focusing on work experiences, Asselmann and Specht (2020) propose that conscientiousness increases when people enter the workforce and decreases again at the time of retirement.

A recent review article by Costa, McCrae, and Lockenhoff (2019) also suggests that neuroticism decreases and agreeableness and conscientiousness increase over the adult life-span. However, they also point out that these age-trends are “modest.” They suggest that traits change by about one T-score per decade, which is a standardized mean difference of less than .2 standard deviations per decade. However, this effect size implies that changes may be as large as 1 standard deviation from age 20 to age 70.

More recently, Graham et al. (2020) summarized the literature with the claim that “during the emerging adult and midlife years, agreeableness, conscientiousness, openness, and extraversion tend to increase and neuroticism tends to decrease” (p. 303). However, when they conducted an integrated analysis of 16 longitudinal studies, the results were rather different. Most importantly, agreeableness did not increase. The combined effect was b = .02, with a 95%CI that included zero, b = -.02 to .07. Despite the lack of evidence that agreeableness increases with age during adulthood, the authors “tentatively suggest that agreeableness may increase over time” (p. 312).

The results for conscientiousness are even more damaging for the maturation theory. Here most datasets show a decrease in conscientiousness and the average effect size is statistically significant, b = -.05, 95%CI = -.09 to -.02. However, the effect size is small, suggesting that there is no notable age trend in conscientiousness.

The only trait that showed the predicted age-trend was neuroticism, but the effect size was again small and the upper bound of the 95%CI was close to zero, b = -.05, 95%CI = -.09 to -.01.

In sum, recent evidence from several longitudinal studies challenges the claim that personality develops during adulthood. However, longitudinal studies are often limited by rather short time-intervals of a few years up to one decade. If effect sizes over one decade are small, they can be easily masked by method artifacts (Costa et al., 2019). Although cross-sectional studies have their own problems, they have the advantage that it is much easier to cover the full age-range of adulthood. The key problem in cross-sectional studies is that age-effects can be confounded with cohort effects. However, when multiple cross-sectional studies from different survey years are available, it is possible to separate cohort effects and age-effects (Fosse & Winship, 2019).

Model Predictions

The maturity model also makes some predictions about age-trends for other constructs. One prediction is that well-being should increase as personality becomes more mature because numerous meta-analyses suggest that emotional stability, agreeableness, and conscientiousness predict higher well-being (Anglim et al., 2020). That being said, falsification of this prediction does not invalidate the maturity model. It is possible that other factors lower well-being in middle age or that higher maturity does not cause higher well-being. However, if the maturity model correctly predicts age effects on well-being, it would strengthen the model. I therefore tested age-effects on well-being and examined whether they are explained by personality development.

Statistical Analysis

Fosse and Winship (2019) noted that “despite the existence of hundreds, if not thousands, of articles and dozens of books, there is little agreement on how to adequately analyze age, period, and cohort data” (p. 468). This is also true for studies of personality development. Many of these studies fail to take cohort effects into account or ignore inconsistencies between cross-sectional and longitudinal results.

Fosse and Winship point out that there is an identification problem when cohort, period, and age effects are linear, but not if the trends have different distributions. For example, if age effects are non-linear, it is possible to distinguish between linear cohort effects, linear period effects, and non-linear age effects. As maturation is expected to produce stronger effects during early adulthood than in middle adulthood and may actually show a decline in older age, it is plausible to expect a non-linear age effect. Thus, I examined age-effects in the German Socio-Economic Panel using a statistical model that examines non-linear age effects, while controlling for linear cohort and linear period effects.

Moreover, I included measures of marital status and work status to examine whether age effects are at least partially explained by these life experiences. The inclusion of these measures can also help with model identification (Fosse & Winship, 2019). For example, work and marriage have well-known age-effects. Thus, any age-effects on personality that are mediated by work or marriage are easily distinguished from cohort or period effects.

Measurement of Personality

Another limitation of many previous studies is the use of sum scores as measures of personality traits. It is well-known that these sum scores are biased by response styles (Anusic et al., 2009). Moreover, sum scores are influenced by the specific items that were selected to measure the Big Five traits and specific items can have their own age effects (Costa et al., 2019; Terracciano, McCrae, Brant, & Costa, 2005). Using a latent variable approach, it is possible to correct for random and systematic measurement errors and age effects on individual items. I therefore used a measurement model of personality that corrects for acquiescence and halo biases (Anusic et al., 2009). The specification of the model and detailed results can be found on OSF (https://osf.io/vpcfd/).

A model that assumed only age effects did not fit the data as well as a model that also allowed for cohort and period effects, chi2(df = 211) = 6651, CFI = .974, RMSEA = .021 vs. chi2(df = 201) = 5866, CFI = .977, RMSEA = .020, respectively. This finding shows that age-effects are confounded with other effects in models that do not specify cohort or period effects.
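The comparison of the two nested models can be expressed as a chi-square difference test. The following sketch is only illustrative. It assumes that the reported values are ordinary maximum-likelihood chi-square statistics, so that their difference is itself chi-square distributed; the post does not state the estimator, and a robust estimator would require a scaled difference test. The helper name `chi2_sf_even_df` is made up for this example and relies on the closed-form tail probability that holds for even degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires positive, even df")
    half = x / 2.0
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Values reported in the text (age-only model vs. age + cohort + period model).
delta_chi2 = 6651 - 5866   # difference in chi-square
delta_df = 211 - 201       # difference in degrees of freedom
p_value = chi2_sf_even_df(delta_chi2, delta_df)

print(delta_chi2, delta_df, p_value)
```

Under these assumptions, a difference of 785 with 10 degrees of freedom gives a p-value that is practically zero, in line with the conclusion that the age-only model fits worse.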

Figure 1 shows the age effects for the Big Five traits.

The results do not support the maturation model. The most inconsistent finding is a strong negative effect of age on agreeableness. However, other traits also did not show a continuous trend throughout adulthood. Conscientiousness increased from age 17 to 35, but remained unchanged afterwards, whereas Openness decreased slightly until age 30 and then increased continuously.

To examine the robustness of these results, I conducted sensitivity analyses with varying controls. The results for agreeableness are shown in Figure 2.

All models show a decreasing trend, but the effect sizes vary. Models with no controls, or with controls for only cohort effects or only time effects, produce a decreasing age trend, but the effect size is small as most scores deviate less than .2 standard deviations from the mean (i.e., zero). However, controlling for time and cohort effects results in the strong decrease observed in Figure 1. Controlling for halo bias makes only a small difference. It is possible that the model that corrects for cohort and time effects overcorrects because it is difficult to distinguish age and time effects. However, none of these results are consistent with the predictions of the maturation model that agreeableness increases throughout adulthood.

Figure 3 takes a closer look at Neuroticism. Inconsistent with the maturation model, most models show a weak increase in neuroticism. The only model that shows a weak decrease controls for cohort effects only. One possible explanation for this finding is that it is difficult to distinguish between non-linear and linear age effects and that the negative time effect is actually an age effect. Even if this were true, the effect size of age is small.

The results for conscientiousness are most consistent with the maturation hypothesis. All models show a big increase from age 17 to age 20, and still a substantial increase from age 20 to age 35. At this point, conscientiousness levels remain fairly stable or decrease in the model that controls only for cohort effects. Although these results are most consistent with the maturation model, they do not support the prediction of a continuous process throughout adulthood. The increase is limited to early adulthood and is stronger at the beginning of adulthood, which is consistent with biological models of development (Costa et al., 2019).

Although not central to the maturation model, I also examined the influence of controls on age-effects for Extraversion and Openness.

Extraversion shows a very small increase over time in the model without controls and the model that controls only for period (time) effects. This trend turns negative in models that control for cohort effects. However, all effect sizes are small.

Openness shows different results for models that control for cohort effects or not. Without taking cohort effects into account, openness appears to decrease. However, after taking cohort effects into account, openness stays relatively unchanged until age 30 and then increases gradually. These results suggest that previous cross-sectional studies may have falsely interpreted cohort effects as age-effects and that openness does not decrease with age.

Work and Marriage as Mediators

Personality psychologists have focussed on two theories to explain increases in conscientiousness during early adulthood. Some personality psychologists assume that it reflects the end stage of a biological process that increases self-regulation throughout childhood and adolescence (Costa & McCrae, 2006; Costa et al., 2019). The process is assumed to be complete by age 30. The present results suggest that it may be a bit later at age 35. The alternative theory is that social roles influence personality (Roberts, Wood, & Smith, 2005). A key prediction of the social investment theory is that personality development occurs when adults take on important social roles such as working full time, entering long-term romantic relationships (marriage), or parenting.

The SOEP makes it possible to test the social investment theory because it included questions about work and marital status. Most young adults start working full-time during their 20s, suggesting that work experiences may produce the increase in conscientiousness during this period. In Germany, marriage occurs later when individuals are in their 30s. Therefore marriage provides a particularly interesting test of the social investment theory because marriage occurs when biological maturation is mostly complete.

Figure 7 shows the age effect for work status. The age effect is clearly visible for all models and only slightly influenced by controlling for cohort or time effects.

Figure 8 shows the figure for marital status with cohabitating participants counted as married. The figure confirms that most Germans enter long-term relationships in their 30s.

To examine the contribution of work and marriage to the development of conscientiousness, I included marriage and work as predictors of conscientiousness. In this model the age-effects on conscientiousness can be decomposed into (a) an effect mediated by work (age -> work -> C), (b) an effect mediated by marriage (age -> married -> C), and (c) an effect of age that is mediated by unmeasured variables (e.g., biological processes). Results are similar for the various models and I present the results for the model that controls for cohort and time effects.

The results show no effect of marriage; that is the effect size for the indirect effect is close to zero, but both work and unmeasured mediators contribute to the total age effect. The unmeasured mediators produce a step increase in the early 20s. This finding is consistent with a biological maturation hypothesis. Moreover, the unmeasured mediators produce a gradual decline over the life span with a surprising uptick at the end. This trajectory may be a sign of cognitive decline. The work effect increases much more gradually and is consistent with the social-role theory. Accordingly, the decrease in conscientiousness after age 55 is related to retirement. The negative effect of retirement on conscientiousness raises some interesting theoretical questions about the definition of personality. Does retirement really alter personality or does it merely alter situational factors that influence conscientious behaviors? To separate these hypotheses, it would be important to examine behaviors outside of work, but the trait measure that was used in this study does not provide information about the consistency of behaviors across different situations.

The key finding is that the data are consistent with two theories that are often treated as mutually exclusive and competing hypotheses. The present results suggest that biological processes and social roles contribute to the development of conscientiousness during early adulthood. However, there is no evidence that this process continues in middle or late adulthood and role effects tend to disappear as soon as individuals retire.

Personality Development and Well-Being

One view of personality assumes that variation in personality is normal and that no personality trait is better than another. In contrast, the maturation model implies that some traits are more desirable, if only because they are instrumental to fulfill roles of adult life like working or maintaining relationships (McCrae & Costa, 1991). Accordingly, more mature individuals should have higher well-being. While meta-analyses suggest that this is the case, they often do not control for rating biases. When rating biases are taken into account, the positive effects of agreeableness and conscientiousness are not always found and are small (Schimmack, Schupp, & Wagner, 2008; Schimmack & Kim, 2020).

Another problem for the maturation theory is that well-being tends to decrease from early to middle adulthood when maturation should produce benefits. However, it is possible that other factors explain this decrease in well-being and maturation buffers these negative effects. To test this hypothesis, I added life-satisfaction to the model and examined mediators of age-effects on life-satisfaction.

An inspection of the direct relationships of personality traits and life-satisfaction confirmed that life-satisfaction ratings are most strongly influenced by neuroticism, b = -.37, se = .01. Response styles also had notable effects: halo, b = .15, se = .01; acquiescence, b = .19, se = .01. The effects of the remaining Big Five traits were weak: E, b = .078, se = .01; A, b = .07, se = .01; C, b = .02, se = .005; O, b = .07, se = .01. The weak effect of conscientiousness makes it unlikely that age-effects on conscientiousness contribute to age-effects on life-satisfaction.

The next figure shows the age-effect for life-satisfaction. The total effect is rather flat and shows only an increase in the 60s.

The mostly stable level of life-satisfaction masks two opposing trends. As individuals enter the workforce and get married, life-satisfaction actually increases. The positive trajectory for work reverses when individuals retire, while the positive effect of marriage remains. However, the positive effects of work and marriage are undone by unexplained factors that decrease well-being until age 50, when a rebound is observed. Neuroticism is not a substantial mediator because there are no notable age-effects on neuroticism. Conscientiousness is not a notable mediator because it does not predict life-satisfaction.

The main insight from these findings is that achieving major milestones of adult life is associated with increased well-being, but that these positive effects are not explained by personality development.


Narrative reviews claim that personality develops steadily through adulthood. For example, in a just published review of the literature Roberts and Yoon claim that “agreeableness, conscientiousness, and emotional stability show increases steadily through midlife” (p. 10). Roberts and Yoon also claim that “forming serious partnerships is associated with decreases in neuroticism and increases in conscientiousness” (p. 11). The problem with these broad and vague statements is that they ignore inconsistencies across cross-sectional and longitudinal analyses (Lucas & Donnellan, 2011), inconsistencies across populations (Graham et al., 2020), and effect sizes (Costa et al., 2019).

The present results challenge this simplistic story of personality development. First, only conscientiousness shows a notable increase from late adolescence to middle age and most of the change occurs during early adulthood before the age of 35. Second, formation of long-term relationships had no effect on neuroticism or conscientiousness. Participation in the labor force did increase conscientiousness, but these gains were lost when older individuals retired. If conscientiousness were a sign of maturity, it is not clear why it would decrease after it was acquired. In short, the story of life-long development is not based on scientific facts.

The notion of personality development is also problematic from a theoretical perspective. It implies that some personality traits are better, more mature, than others. This has led to calls for interventions to help people to become more mature (Bleidorn et al., 2019). However, this proposal imposes values and implicitly devalues individuals with the wrong traits. An alternative view treats personality as variation without value judgment. Accordingly, it may be justified to help individuals to change their personality if they want to change their personality, just like gender changes are now considered a personal choice without imposing gender norms on individuals. However, it would be wrong to subject individuals to programs that aim to change their personality, just like it is now considered wrong to subject individuals to interventions that target their sexual orientation. Even if individuals want to change, it is not clear how much personality can be changed. Thus, another goal should be to help individuals with different personality traits to feel good about themselves and to live fulfilling lives that allow them to express their authentic personality. The rather weak relationships between many personality traits and well-being suggest that it is possible to have high well-being with a variety of personalities. The main exception is neuroticism, which has a strong negative effect on well-being. However, the question here is how much of this relationship is driven by mood disorders rather than normal variation in personality. The effect may also be moderated by social factors that create stress and anxiety.

In conclusion, the notion of personality development lacks clear theoretical foundations and empirical support. While there are some relatively small mean level changes in personality over the life span, they are relatively trivial compared to the large stable variance in personality traits across individuals. Rather than considering this variation as arrested forms of development, it should be celebrated as diversity that enriches everybody’s life.

Conflict of Interest: My views may be biased by my (immature) personality (high N, low A, low C).

P.S. I asked Brent W. Roberts for comments, but he declined the opportunity. Please share your comments in the comment section.

Most published results in medical journals are not false

Peer Reviewed by Editors of Biostatistics “You have produced a nicely written paper that seems to be mathematically correct and I enjoyed reading” (Professor Dimitris Rizopoulos & Professor Sherri Rose)

Estimating the false discovery risk in medical journals

Ulrich Schimmack
Department of Psychology, University of Toronto Mississauga 3359 Mississauga Road N. Mississauga, Ontario Canada ulrich.schimmack@utoronto.ca

Frantisek Bartos
Department of Psychology, University of Amsterdam;
Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic


Jager and Leek (2014) proposed an empirical method to estimate the false discovery rate in top medical journals and found a false discovery rate of 14%. Their work received several critical comments and has had relatively little impact on meta-scientific discussions about medical research. We build on Jager and Leek’s work and present a new way to estimate the false discovery risk. Our results closely reproduce their original finding with a false discovery rate of 13%. In addition, our method shows clear evidence of selection bias in medical journals, but the expected discovery rate is 30%, much higher than we would expect if most published results were false. Our results provide further evidence that meta-science needs to be built on solid empirical foundations.

Keywords: False discovery rate; Meta-analysis; Science-wise false discovery rate; Significance testing; Statistical power


The successful development of vaccines against Covid-19 provides a vivid example of a scientific success story. At the same time, many sciences are facing a crisis of confidence in published results. The influential article “Why most published research findings are false” suggested that many published significant results are false discoveries (Ioannidis, 2005). One limitation of Ioannidis’s article was the reliance on a variety of unproven assumptions. For example, Ioannidis assumed that only 1 out of 11 exploratory epidemiological studies tests a true hypothesis. To address this limitation, Jager and Leek (2014) developed a statistical model to estimate the percentage of false-positive results in a set of significant p-values. They applied their model to 5,322 p-values from medical journals and found that only 14% of the significant results may be false-positives. This is a sizeable percentage, but it is much lower than the false-positive rates predicted by Ioannidis. Although Jager and Leek’s article was based on actual data, the article had a relatively weak impact on discussions about false-positive risks. So far, the article has received only 73 citations in WebOfScience. In comparison, Ioannidis’s purely theoretical article has been cited 518 times in 2020 alone. We believe that Jager and Leek’s article deserves a second look and that discussions about the credibility of published results benefit from empirical investigations.

Estimating the False Discovery Risk

To estimate the false discovery rate, Jager and Leek developed a model with two populations of studies. One population includes studies in which the null-hypothesis is true (H0). The other population includes studies in which the null-hypothesis is false; that is, the alternative hypothesis is true (H1). The model assumes that the observed distribution of significant p-values is a mixture of these two populations.

One problem for this model is that it can be difficult to distinguish between studies in which H0 is true and studies in which H1 is true, but it was tested with low statistical power. Furthermore, the distinction between the point-zero null-hypothesis, the nil-hypothesis (Cohen, 1994), and alternative hypotheses with very small effect sizes is rather arbitrary. Many effect sizes may not be exactly zero but too small to have practical significance. This makes it difficult to distinguish clearly between the two populations of studies and estimates based on models that assume distinct populations may be unreliable.

To avoid the distinction between two populations of p-values, we distinguish between the false discovery rate and the false discovery risk. The false discovery risk does not aim to estimate the actual rate of H0 among significant p-values. Rather, it provides an estimate of the worst-case scenario with the highest possible amount of false-positive results. To estimate the false discovery risk, we take advantage of Soric’s (1989) insight that the maximum false discovery rate is limited by statistical power to detect true effects. When power is 100%, all non-significant results are produced by tests in which the null-hypothesis is true (H0). As this scenario maximizes the number of non-significant H0, it also maximizes the number of significant H0 tests and the false discovery rate. Soric showed that the maximum false discovery rate is a direct function of the discovery rate. For example, if 100 studies produce 30 significant results, the discovery rate is 30%. And when the discovery rate is 30%, the maximum false discovery risk with α = 5% is 0.12. In general, the false discovery risk is a simple transformation of the discovery rate, such as

false discovery risk = (1/discovery rate − 1) × (α/(1 − α)).
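As a quick check of the worked example, the transformation can be written as a small function. This is a minimal sketch; the function name `false_discovery_risk` and the cap at 1 are choices made for this illustration and are not taken from the original analysis code.

```python
def false_discovery_risk(discovery_rate, alpha=0.05):
    """Soric's (1989) upper bound on the false discovery rate.

    discovery_rate: proportion of all tests that are significant.
    alpha: significance criterion.
    """
    if not 0 < discovery_rate <= 1:
        raise ValueError("discovery_rate must be in (0, 1]")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be in (0, 1)")
    risk = (1 / discovery_rate - 1) * (alpha / (1 - alpha))
    # Illustration-only choice: a proportion cannot exceed 1.
    return min(risk, 1.0)


# Example from the text: a 30% discovery rate with alpha = 5%.
print(round(false_discovery_risk(0.30), 2))  # 0.12
```

Lower discovery rates give higher bounds: a discovery rate of 20% gives about .21 and a discovery rate of 41% gives about .08.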

Our suggestion to estimate the false discovery risk rather than the actual false discovery rate addresses concerns about Jager and Leek’s two-population model that were raised in several commentaries (Gelman and O’Rourke, 2014; Benjamini and Hechtlinger, 2014; Ioannidis, 2014; Goodman, 2014).

If all conducted hypothesis tests were reported, the false discovery risk could be determined simply by computing the percentage of significant results. However, it is well-known that journals are more likely to publish significant results than non-significant results. This selection bias renders the observed discovery rate in journals uninformative (Bartoš and Schimmack, 2021; Brunner and Schimmack, 2020). Thus, a major challenge for any empirical estimates of the false discovery risk is to take selection bias into account.

Biostatistics published several commentaries on Jager and Leek’s article. A commentary by Ioannidis (2014) may have contributed to the low impact of Jager and Leek’s article. Ioannidis claims that Jager and Leek’s results can be ignored because they used automatic extraction of p-values, a wrong method, and unreliable data. We address these concerns by means of a new extraction method, a new estimation method, and new simulation studies that evaluate the performance of Jager and Leek’s original method and a new method. To foreshadow the main results, we find that Jager and Leek’s method can sometimes produce biased estimates of the false discovery risk. However, our improved method produces even lower estimates of the false discovery risk. When we applied this method to p-values from medical journals, we obtained an estimate of 13% that closely matches Jager and Leek’s original results. Thus, although Ioannidis (2014) raised some valid objections, our results provide further evidence that false discovery rates in medical research are much lower than Ioannidis (2005) predicted.


Jager and Leek (2014) proposed a selection model that could be fitted to the observed distribution of significant p-values. This model assumed a flat distribution for p-values from the H0 population and a beta distribution for p-values from the H1 population. A single beta distribution can only approximate the actual distribution of p-values; a better solution is to use a mixture of several beta-distributions or, alternatively, convert the p-values into z-scores and model the z-scores with several truncated normal distributions (similar to the suggestion by Cox, 2014). Since reported p-values often come from two-sided tests, the resulting z-scores need to be converted into absolute z-scores that can be modeled as a mixture of truncated folded normal distributions (Bartoš and Schimmack, 2021). The weights of the mixture components can then be used to compute the average power of studies that produced a significant result. As this estimate is limited to the set of studies that were significant, we refer to it as the average power after selection for statistical significance. As power determines the outcomes of replication studies, the average power after selection for statistical significance is an estimate of the expected replication rate.
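The conversion of two-sided p-values into absolute z-scores can be illustrated with a few lines of Python. This is only a sketch of the transformation itself, not of the mixture model; the function name `p_to_abs_z` is hypothetical.

```python
from statistics import NormalDist


def p_to_abs_z(p_two_sided):
    """Absolute z-score that corresponds to a two-sided p-value."""
    if not 0 < p_two_sided <= 1:
        raise ValueError("p-value must be in (0, 1]")
    # Half of the two-sided p-value lies in each tail; using the lower
    # tail avoids the loss of precision in 1 - p/2 for very small p.
    return -NormalDist().inv_cdf(p_two_sided / 2)


for p in (0.05, 0.01, 0.001):
    print(p, round(p_to_abs_z(p), 2))
```

A p-value of .05 maps to the familiar criterion value of about 1.96, so significant results are those with absolute z-scores above that value.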

Although an estimate of the expected replication rate is valuable in its own right, it does not provide an estimate of the false discovery risk because it is based on the population of studies after selection for statistical significance. To estimate the expected discovery rate, z-curve models the selection process operating on the significance level and assumes that studies produce a statistically significant result proportionally to their power. For example, studies with 50% power produce one non-significant result for every significant result and studies with 20% power produce four statistically non-significant results for every significant result. It is therefore possible to estimate the average power before selection of statistical significance based on the weights of the mixture components that are obtained by fitting the model to only significant results. As power determines the percentage of significant results, we refer to average power before selection for statistical significance as the expected discovery rate.

Extensive simulation studies have demonstrated that z-curve produces good large-sample estimates of the expected discovery rate with exact p-values (Bartoš and Schimmack, 2021). Moreover, these simulation studies showed that z-curve produces robust confidence intervals with good coverage. As the false discovery risk is a simple transformation of the EDR, these confidence intervals also provide confidence intervals for estimates of the false discovery risk. To use z-curve for p-values from medical abstracts, we extended z-curve’s expectation-maximization (EM) algorithm (Dempster and others, 1977) to incorporate rounding and censoring similarly to Jager and Leek’s model. To demonstrate that z-curve can obtain valid estimates of the false discovery risk for medical journals, we conducted a simulation study that compared Jager and Leek’s method with z-curve.
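The step from mixture weights after selection to average power before selection can be made concrete with a toy calculation. The sketch below assumes a small set of components with known power and known weights among the significant results; the weights and function names are hypothetical, and the published z-curve method involves further details (fitting the mixture, bootstrapped confidence intervals) that are not shown here.

```python
def average_power_after_selection(weights, powers):
    """Weighted mean power among studies that produced significant results."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, powers)) / total


def average_power_before_selection(weights, powers):
    """Average power of all conducted studies, recovered from weights
    that were observed only among the significant results.

    A component with power p needs 1/p attempts per significant result,
    so its share of all attempts is proportional to weight / power.
    """
    total_significant = sum(weights)
    total_attempts = sum(w / p for w, p in zip(weights, powers))
    return total_significant / total_attempts


# Hypothetical example: half of the significant results come from studies
# with 50% power and half from studies with 20% power.
weights = [0.5, 0.5]
powers = [0.5, 0.2]
print(average_power_after_selection(weights, powers))   # about 0.35
print(average_power_before_selection(weights, powers))  # about 0.29
```

In this toy example the low-powered studies are much more numerous before selection (five attempts per significant result rather than two), which is why the average power before selection is lower than the average power after selection.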

Simulation Study

We extended the simulation performed by Jager and Leek in several ways. Instead of simulating H1 p-values directly from a beta distribution, we used power estimates from individual studies based on meta-analyses (Lamberink and others, 2018) and simulated p-values of two-sided z-tests with corresponding power (excluding all power estimates based on meta-analyses with non-significant results). This allows us to assess the performance of the methods under heterogeneity of power to detect H1 corresponding to the actual literature. To simulate H0 p-values, we used a uniform distribution.
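One way to simulate significant p-values of two-sided z-tests with a given power is sketched below. The sketch assumes α = .05, finds the non-centrality parameter by bisection, and uses rejection sampling to keep only significant results; all function names are hypothetical and the actual simulation code may differ.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.05
Z_CRIT = STD_NORMAL.inv_cdf(1 - ALPHA / 2)  # about 1.96


def power_of_two_sided_z(ncp):
    """Power of a two-sided z-test with non-centrality parameter ncp."""
    return STD_NORMAL.cdf(ncp - Z_CRIT) + STD_NORMAL.cdf(-ncp - Z_CRIT)


def ncp_for_power(power, tol=1e-10):
    """Find the non-centrality parameter for a target power by bisection."""
    if not ALPHA < power < 1:
        raise ValueError("power must be between alpha and 1")
    lo, hi = 0.0, 10.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if power_of_two_sided_z(mid) < power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def simulate_significant_h1_p(power, rng):
    """Draw z-tests with the given power until one is significant."""
    ncp = ncp_for_power(power)
    while True:
        z = rng.gauss(ncp, 1.0)
        p = 2 * STD_NORMAL.cdf(-abs(z))
        if p < ALPHA:
            return p


def simulate_significant_h0_p(rng):
    """Under H0 p-values are uniform, so significant ones are uniform below alpha."""
    return rng.uniform(0.0, ALPHA)


rng = random.Random(2021)
print(round(ncp_for_power(0.80), 2))  # 2.8
print(simulate_significant_h1_p(0.80, rng), simulate_significant_h0_p(rng))
```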

We manipulated the true false discovery rate from 0 to 1 with a step size of 0.01 and simulated 10,000 observed significant p-values. Similarly to Jager and Leek, we performed four simulation scenarios with an increasing percentage of imprecisely reported p-values. Scenario A used exact p-values, scenario B rounded p-values to three decimal places (with p-values lower than 0.001 censored at 0.001), scenario C rounded 20% of p-values to two decimal places (with p-values rounded to 0 censored at 0.01), and scenario D first rounded 20% of p-values to two decimal places and further censored 20% of p-values at one of the closest ceilings (0.05, 0.01, or 0.001).
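The reporting scenarios can be sketched as small functions that turn an exact significant p-value into a reported value. The representation as a pair of a relation ("=" or "<") and a number, the function names, and the reading of scenario D as two non-overlapping 20% shares are assumptions made for this illustration.

```python
import random


def report_three_decimals(p):
    """Scenario B: three decimals; values below .001 become '< .001'."""
    if p < 0.001:
        return ("<", 0.001)
    return ("=", round(p, 3))


def report_two_decimals(p):
    """Two decimals; values that would round to 0 become '< .01'."""
    rounded = round(p, 2)
    if rounded == 0:
        return ("<", 0.01)
    return ("=", rounded)


def report_ceiling(p):
    """Censor at the closest ceiling above p: .001, .01, or .05."""
    for ceiling in (0.001, 0.01, 0.05):
        if p < ceiling:
            return ("<", ceiling)
    raise ValueError("p must be below .05")


def report_scenario_d(p, rng, share_rounded=0.20, share_censored=0.20):
    """Scenario D (as read here): 20% rounded, another 20% censored, rest exact."""
    u = rng.random()
    if u < share_rounded:
        return report_two_decimals(p)
    if u < share_rounded + share_censored:
        return report_ceiling(p)
    return ("=", p)


rng = random.Random(7)
for p in (0.034, 0.004, 0.0002):
    print(p, report_three_decimals(p), report_ceiling(p), report_scenario_d(p, rng))
```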

Figure 1 displays the true (x-axis) vs. estimated (y-axis) false discovery rate (FDR) for Jager and Leek’s method and the false discovery risk for z-curve across the different scenarios (panels). We see that when precise p-values are reported (panel A in the upper left corner), z-curve can handle the heterogeneity in power very well across the whole range of false discovery rates and produces accurate estimates of false discovery risks. Higher estimates than the actual false discovery rates are expected because the false discovery risk is an estimate of the maximum false discovery rate. Discrepancies are especially expected when power of true hypothesis tests is low. For the simulated scenarios, the discrepancies are less than 20 percentage points and decrease as the true false discovery rate increases. Even though Jager and Leek’s method aims to estimate the true false discovery rates, it produces higher estimates than z-curve. This is problematic because the method produces inflated estimates of the true false discovery rate. Even if the estimates were interpreted as maximum estimates, the method is less sensitive to the actual variation in the false discovery rate than the z-curve method.

Panel B shows that the z-curve method produces similar results when p-values are rounded to three decimals. Jager and Leek’s method, however, experiences estimation issues, especially in the lower range of the true false discovery rate, since the current implementation can only deal with rounding to two decimal places (we also tried specifying the p-values as a rounded input; however, the optimizing routine failed with several errors).

Panel C shows a surprisingly similar performance of the two methods when 20% of p-values are rounded to two decimals, except for very high levels of true false discovery rates, where Jager and Leek’s method starts to underestimate the false discovery rate. Despite the similar performance, the results have to be interpreted as estimates of the false discovery risk (maximum false discovery rate) because both methods overestimate the true false discovery rate for low false discovery rates.

Panel D shows that both methods have problems when 20% of p-values are at the closest ceiling of .05, .01, or .001 without providing clear information about the exact p-value. Z-curve does a little bit better than Jager and Leek’s method. Underestimation of true false discovery rates over 40% is not a serious problem because any actual false discovery rate over 40% is unacceptably high. One solution to the underestimation problem is to exclude p-values that are reported in this way from analyses.

Root mean square error and bias of the false discovery rate estimates for each scenario, summarized in Table 1, show that z-curve produces estimates with considerably lower root mean square error. The results for bias show that both methods tend to produce higher estimates than the true false discovery rate. For z-curve this is expected because it aims to estimate the maximum false discovery rate. It would only be a problem if estimates of the false discovery risk were lower than the actual false discovery rate. This is only the case in Scenario D, but as shown previously, underestimation only occurs when the true false discovery rate is high.
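For readers who want to reproduce this kind of summary, bias and root mean square error can be computed as follows. This is a generic sketch with made-up numbers; it does not reproduce the values in Table 1.

```python
import math


def bias(estimates, truths):
    """Mean signed error; positive values indicate overestimation."""
    errors = [e - t for e, t in zip(estimates, truths)]
    return sum(errors) / len(errors)


def rmse(estimates, truths):
    """Root mean square error of the estimates."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, truths)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical estimates for three true false discovery rates.
true_fdr = [0.10, 0.20, 0.30]
estimated = [0.15, 0.22, 0.31]
print(bias(estimated, true_fdr), rmse(estimated, true_fdr))
```

A positive bias is not a concern for an estimate of the false discovery risk, which is meant to be an upper bound; a negative bias would be.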

To summarize, our simulation confirms that Jager and Leek’s method provides meaningful estimates of the false discovery risk and that the method is likely to overestimate the true false discovery rate. Thus, it is likely that the reported estimate of 14% for top medical journals overestimates the actual false discovery rate. Our results also show that z-curve improves over the original method and that the modifications can handle rounding and imprecise reporting when the false discovery rates are below 40%.

Application to Medical Journals

Commentators raised more concerns about Jager and Leek’s mining of p-values than about their estimation method. To address these concerns, we extended Jager and Leek’s data mining approach in the following ways: (1) we extracted p-values only from abstracts labeled as “randomized controlled trial” or “clinical trial” as suggested by Goodman (2014), Ioannidis (2014), and Gelman and O’Rourke (2014); (2) we improved the regex script for extracting p-values to cover more possible notations as suggested by Ioannidis (2014); and (3) we extracted confidence intervals from abstracts not reporting p-values as suggested by Ioannidis (2014) and Benjamini and Hechtlinger (2014). We further scraped p-values from abstracts in “PLoS Medicine” to compare the false discovery rate estimates to a less-selective journal as suggested by Goodman (2014). Finally, we randomly subset the scraped p-values to include only a single p-value per abstract in all analyses, thus breaking the correlation between the estimates as suggested by Goodman (2014). Although there are additional limitations inherent to the chosen approach, these improvements, along with our improved estimation method, make it possible to test the prediction by several commentators that the false discovery rate is well above 14%.
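To give a sense of what such an extraction step involves, the sketch below shows a deliberately simple regular expression for p-values, a conversion of a ratio estimate with its 95% confidence interval into a two-sided p-value, and the selection of one p-value per abstract. None of this is the actual scraping script: the pattern covers only a few notations, the conversion assumes a normal approximation on the log scale, and all names are hypothetical.

```python
import math
import random
import re
from statistics import NormalDist

# Matches notations such as "P = .002", "p<0.001", "P ≤ 0.05" (illustrative only).
P_VALUE_PATTERN = re.compile(r"\b[Pp]\s*(<|>|=|≤)\s*(0?\.\d+)")


def extract_p_values(abstract):
    """Return (relation, value) pairs for p-values found in the text."""
    return [(rel, float(num)) for rel, num in P_VALUE_PATTERN.findall(abstract)]


def ratio_ci_to_p(estimate, lower, upper, level=0.95):
    """Two-sided p-value for a ratio (e.g., hazard ratio) from its CI.

    Assumes the interval is symmetric on the log scale.
    """
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(estimate) / se
    return 2 * NormalDist().cdf(-abs(z))


def one_p_value_per_abstract(p_values_by_abstract, rng):
    """Randomly keep a single p-value for each abstract that has any."""
    return {key: rng.choice(values)
            for key, values in p_values_by_abstract.items() if values}


text = "HR 0.80 (95% CI 0.70-0.92; P = .002); secondary outcome P<0.001"
found = extract_p_values(text)
print(found)
print(ratio_ci_to_p(0.80, 0.70, 0.92))
print(one_p_value_per_abstract({"abstract-1": found}, random.Random(0)))
```

Keeping one randomly chosen value per abstract is what removes the dependence among p-values that come from the same study.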

We executed the scraping protocol in July 2021 and scraped abstracts published since 2000 (see Table 2 for a summary of the scraped data). Interactive visualization of the individual abstracts and scraped values can be accessed at https://tinyurl.com/zcurve-FDR.

Figure 2 visualizes the false discovery rates estimated with z-curve and with Jager and Leek’s method based on scraped abstracts from clinical trials and randomized controlled trials, further divided by journal and by whether the article was published before (and including) 2010 (left) or after 2010 (right). We see that, in line with the simulation results, Jager and Leek’s method produces slightly higher false discovery rate estimates. Furthermore, z-curve produced considerably wider bootstrapped confidence intervals, suggesting that the confidence interval reported by Jager and Leek (± 1 percentage point) was too narrow.

A comparison of the false discovery estimates based on data before (and including) 2010 and after 2010 shows that confidence intervals overlap, suggesting that false discovery rates have not changed. Separate analyses based on clinical trials and randomized controlled trials also showed no significant differences (see Figure 3). Therefore, to reduce the uncertainty about the false discovery rate, we estimate the false discovery rate for each journal irrespective of publication year. The resulting false discovery rate estimates based on z-curve and Jager and Leek’s method are summarized in Table 3. We find that all false discovery rate estimates fall within a .05 to .30 interval. Finally, further aggregating data across the journals provides a false discovery rate estimate of 0.13, 95% CI [0.08, 0.21], based on z-curve and 0.19, 95% CI [0.17, 0.20], based on Jager and Leek’s method. This finding suggests that Jager and Leek’s extraction method slightly underestimated the false discovery rate, whereas their model overestimated it.

Additional Z-Curve Results

So far, we used the expected discovery rate only to estimate the false discovery risk, but the expected discovery rate provides valuable information in itself. Ioannidis’s predictions of the false discovery rate were based on scenarios that assumed that less than 10% of all tested hypotheses are true. The same assumption was made to recommend lowering α from .05 to .005 (Benjamin and others, 2018). If all true hypotheses were tested with 100% power, the discovery rate would match the percentage of true hypotheses plus the false-positive results: 10% + 90% × .05 = 14.5%. Because the actual power is less than 100%, the discovery rate would be even lower, but the estimated expected discovery rate for top medical journals is 30% with a confidence interval ranging from 20% to 41%. Thus, our results suggest that previous speculations about discovery rates were overly pessimistic.
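The arithmetic behind the 14.5% figure can be written as a small function. This is only a sketch of the scenario described above (a share of true hypotheses tested with a common power, and all remaining hypotheses being true nulls tested at α); the function and argument names are illustrative.

```python
def discovery_rate(prop_true, power, alpha=0.05):
    """Share of all tests expected to be significant.

    prop_true: proportion of tested hypotheses that are true
    power:     probability that a true hypothesis yields a significant result
    alpha:     probability that a true null yields a significant result
    """
    return prop_true * power + (1 - prop_true) * alpha


# Scenario from the text: 10% true hypotheses, all tested with 100% power.
print(round(discovery_rate(0.10, 1.0), 3))  # 0.145

# With lower power the discovery rate is lower still (50% power is an
# arbitrary illustrative value, not an estimate from the data).
print(round(discovery_rate(0.10, 0.5), 3))  # 0.095
```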

The expected discovery rate also provides valuable information about the extent of selection bias in medical journals. While the expected discovery rate is only 30%, the observed discovery rate (i.e., the percentage of significant results in abstracts) is more than double (69.7%). This discrepancy is visible in Figure 4. The histogram of observed non-significant z-scores does not match the predicted distribution (blue curve). This evidence of selection bias implies that reported effect sizes are inflated by selection bias. Thus, follow-up studies need to adjust effect sizes when planning the sample sizes via power analyses.

Z-curve also provides information about the replicability of significant results in medical abstracts. The expected replication rate is 65% with a confidence interval ranging from 61% to 69%. This result suggests that sample sizes should be increased to meet the recommended level of 80% power. Furthermore, this estimate may be overly optimistic because comparisons of actual replication rates and z-curve predictions show lower success rates for actual replication studies (Bartoš and Schimmack, 2021). One reason could be that exact replication studies are impossible and changes in population will result in lower power due to selection bias and regression to the mean. In the worst case, the actual replication rate might be as low as the expected discovery rate. Thus, our results predict that the success rate of actual replication studies in medicine will be somewhere between 30% and 65%.

Finally, z-curve can be used to adjust the significance level α retrospectively to maintain a false discovery risk of less than 5% (Goodman, 2014). To do so, it is only necessary to compute the expected discovery rate for different levels of α. With α = .01, the expected discovery rate decreases to 20% and the false discovery risk decreases to 4%. Adjusting α to the recommended level of .005 reduces the expected discovery rate to 17% and the false discovery risk to 2%. Based on these results, it is possible to use α = .01 as a criterion to reject the null-hypothesis while maintaining a false discovery risk below 5%.
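These numbers can be checked with a few lines of Python. The sketch assumes that the false discovery risk is obtained from the expected discovery rate (EDR) with Sorić’s (1989) upper bound, (1/EDR − 1) × α/(1 − α); the function name is illustrative and is not part of the zcurve package.

```python
def soric_fdr_bound(edr, alpha):
    """Maximum false discovery rate implied by a discovery rate and alpha.

    Assumes Soric's (1989) bound: (1 / EDR - 1) * alpha / (1 - alpha).
    """
    if not 0 < edr <= 1:
        raise ValueError("edr must be in (0, 1]")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be in (0, 1)")
    return (1 / edr - 1) * alpha / (1 - alpha)


# Rounded expected discovery rates reported in the text for two alpha levels.
print(round(soric_fdr_bound(0.20, 0.01), 2))   # 0.04
print(round(soric_fdr_bound(0.17, 0.005), 2))  # 0.02
```

Because the inputs are the rounded percentages from the text, the output only approximates estimates that were computed from unrounded values.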


Like many other human activities, science relies on trust. Over the past decade, it has become clear that some aspects of modern science undermine trust. The biggest problem remains the prioritization of new discoveries that meet the traditional threshold of statistical significance. The selection for significance has many undesirable consequences. Although medicine has responded to this problem by demanding preregistration of clinical trials, our results suggest that selection for significance remains a pervasive problem in medical research. As a result, the observed discovery rate and reported effect sizes provide misleading information about the robustness of published results. To maintain trust in medical research, it is important to take selection bias into account.

Concerns about the replicability of published results have led to the emergence of meta-science as an active field of research over the past decade. Unlike meta-physics, meta-science is an empirical enterprise that uses data to investigate science. Data can range from survey studies of research practices to actual replication studies. Jager and Leek made a valuable contribution to meta-science by developing a method to estimate the false discovery rate based on published p-values using a statistical model that takes selection bias into account. Their work stimulated discussion, but their key finding that false discovery rates in medicine are not at an alarmingly high rate was ignored. We followed up on Jager and Leek’s seminal contribution with a different estimation model and an improved extraction method to harvest results from medical abstracts. Despite these methodological improvements, our results firmly replicated Jager and Leek’s key finding that false discovery rates in top medical journals are between 10% and 20%.

We also extended the meta-scientific investigation of medical research in several ways. First, we demonstrated that the false discovery risk can be reduced to less than 5% by lowering the criterion for statistical significance to .01. This recommendation is similar to other proposals to lower α to .005, but our proposal is based on empirical data. Moreover, the α level can be modified for different fields of study, or it can be changed in the future in response to changes in research practices. Thus, rather than recommending one fixed α, we recommend justifying α (Lakens and others, 2018). Fields with low discovery rates should use a lower α than fields with high discovery rates to maintain a false discovery risk below 5%.

We also demonstrated that medical journals have substantial selection bias. Whereas the percentage of significant results in abstracts is over 60%, the expected discovery rate is only 30%. This test for selection bias is important because it would be unnecessary to use selection models if selection bias were negligible. Evidence of substantial selection bias may also help to change publication practices in order to reduce selection bias. For example, journals could be evaluated on the basis of the amount of selection bias just like they are being evaluated in terms of impact factors.

Finally, we provided evidence that the average power of studies with significant results is 65%. As power increases for studies with lower p-values, this estimate implies that power for studies that are significant at p < .01 to produce a p-value below .05 in a replication study would be even higher. Based on these findings, we would predict that at least 50% of results that achieved p < .01 can be successfully replicated. This is comparable to cognitive psychology, where 50% of significant results at p < .05 could be successfully replicated (Open Science Collaboration, 2015).

Limitations and Future Directions

Even though we were able to address several of the criticisms of Jager and Leek’s seminal article, we were unable to address all of them. The question is whether the remaining concerns are sufficient to invalidate our results. We think this is rather unlikely because our results are in line with findings in other fields. The main remaining concern is that mining p-values and confidence intervals from abstracts creates a biased sample of results. The only way to address this concern is to read the actual articles and to pick the focal hypothesis test for the z-curve analysis. Unfortunately, nobody seems to have taken on this daunting task for medical journals. However, social psychologists have hand-coded a large, representative sample of test statistics (Motyl and others, 2017). The coding used the actual test statistics rather than p-values. Thus, exact p-values were computed and no rounding or truncation problems are present in these data. A z-curve analysis of these data estimated an expected discovery rate of 19%, 95% CI = 6% to 36% (Schimmack, 2020). Given the low replication rate of social psychology, it is not surprising that the expected discovery rate is lower than for medical studies (Open Science Collaboration, 2015). However, even a low expected discovery rate of 19% limits the false discovery risk to 22%, which is not much higher than the false discovery risk in medicine and does not justify the claim that most published results are false. To provide more conclusive evidence for medicine, we strongly encourage hand-coding of medical journals and high-powered replication studies. Based on the present results, we predict false positive rates well below 50%.


The zcurve R package is available from https://github.com/FBartos/zcurve/tree/censored.

Supplementary Material

Supplementary Materials including data and R scripts for reproducing the simulations, data scraping, and analyses are available from https://osf.io/y3gae/.


Conflict of Interest: None declared.


Bartoš, František and Schimmack, Ulrich. (2021). Z-curve 2.0: Estimating replication rates and discovery rates. Meta-Psychology.

Benjamin, Daniel J, Berger, James O, Johannesson, Magnus, Nosek, Brian A, Wagenmakers, E-J, Berk, Richard, Bollen, Kenneth A, Brembs, Bjorn, Brown, Lawrence, Camerer, Colin and others. (2018). Redefine statistical significance. Nature Human Behaviour 2(1), 6–10.

Benjamini, Yoav and Hechtlinger, Yotam. (2014). Discussion: An estimate of the science-wise false discovery rate and applications to top medical journals by Jager and Leek. Biostatistics 15(1), 13–16.

Brunner, Jerry and Schimmack, Ulrich. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology 4.

Cohen, Jacob. (1994). The earth is round (p < .05). American Psychologist 49(12), 997.

Cox, David R. (2014). Discussion: Comment on a paper by Jager and Leek. Biostatistics 15(1), 16–18.

Dempster, Arthur P, Laird, Nan M and Rubin, Donald B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39(1), 1–22.

Gelman, Andrew and O’Rourke, Keith. (2014). Difficulties in making inferences about scientific truth from distributions of published p-values. Biostatistics 15(1), 18–23.

Goodman, Steven N. (2014). Discussion: An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics 15(1), 23–27.

Ioannidis, John PA. (2005). Why most published research findings are false. PLoS Medicine 2(8), e124.

Ioannidis, John PA. (2014). Discussion: Why “an estimate of the science-wise false discovery rate and application to the top medical literature” is false. Biostatistics 15(1), 28–36.

Jager, Leah R and Leek, Jeffrey T. (2014). An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics 15(1), 1–12.

Lakens, Daniel, Adolfi, Federico G, Albers, Casper J, Anvari, Farid, Apps, Matthew AJ, Argamon, Shlomo E, Baguley, Thom, Becker, Raymond B, Benning, Stephen D, Bradford, Daniel E and others. (2018). Justify your alpha. Nature Human Behaviour 2(3), 168–171.

Lamberink, Herm J, Otte, Willem M, Sinke, Michel RT, Lakens, Daniel, Glasziou, Paul P, Tijdink, Joeri K and Vinkers, Christiaan H. (2018). Statistical power of clinical trials increased while effect size remained stable: an empirical analysis of 136,212 clinical trials between 1975 and 2014. Journal of Clinical Epidemiology 102, 123–128.

Motyl, Matt, Demos, Alexander P, Carsel, Timothy S, Hanson, Brittany E, Melton, Zachary J, Mueller, Allison B, Prims, JP, Sun, Jiaqing, Washburn, Anthony N, Wong, Kendal M and others. (2017). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology 113(1), 34.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science 349(6251).

Schimmack, Ulrich. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne.

Soric, Branko. (1989). Statistical “discoveries” and effect-size estimation. Journal of the American Statistical Association 84(406), 608–610.

How to Build a Monster Model of Well-Being: Part 8

So far, I have built a model that relates the Big Five personality traits to well-being. In this model, well-being is defined as the weighted average of satisfaction with life domains, positive affect (happy), and negative affect (sad). I showed that most of the personality effects of the Big Five were mediated by the cheerfulness facet of extraversion and the depressiveness facet of neuroticism. I then showed that there were no gender differences in well-being because women score higher on both depressiveness and cheerfulness. Finally, I showed that middle-aged parents of students have lower well-being than students and that these age effects were mediated by lower cheerfulness and lower satisfaction with several life domains. The only exception was romantic satisfaction, which was higher among parents than among students (Part 1, Part 2, Part 3, Part 4, Part 5, Part 6, Part 7). Part 8 examines the relationship between positive illusions and well-being.

Positive Illusions and Well-Being

Philosophers have debated whether positive illusions should be allowed to contribute to individuals’ well-being (Sumner, 1996). Some philosophers demand true happiness, where illusions can produce experiences of happiness, but these experiences do not count towards an individual’s well-being. Other theories, most prominently hedonism, have no problem with illusory happiness. Ideally, we would just live a perfect simulated life (think The Matrix) and not care one bit about the fact that our experiences are not real. A third version allows for illusions to contribute to our well-being, if we choose a sweet lie over a bitter truth.

Psychologists tried to settle these questions empirically. An influential article by Taylor and Brown (1988) declared that positive illusions are good for us (our well-being and mental health) and that realistic perceptions of our lives may be maladaptive and may cause depression. In a world of Covid-19, massive forest fires and flooding, this view rings true. However, positive illusions may also have negative effects that can undermine short-lived benefits of positive illusions.

Diener et al. (1999) list a few studies that seemed to support the view that individuals with positive illusions have higher levels of well-being. However, a key problem in research on positive illusions and well-being is that positive illusions and well-being are often measured with self-ratings. It is therefore unclear whether a positive correlation between these two measures reveals a substantial relationship or simply shared method variance. Relatively few studies have tackled this problem and the results are inconsistent (Dufner et al., 2019). Studies that use informant ratings of well-being are particularly rare and suggest that any effect of positive illusions is at best small (Kim, Schimmack, & Oishi, 2012; Schimmack & Kim, 2020).

The monster-model uses the Mississauga Family Study data that were used by Schimmack and Kim (2020). Thus, no effect of positive illusions on well-being is expected. However, the present model examines a new hypothesis that was not investigated by Schimmack and Kim (2020) because the model focussed on the Big Five and did not include facet measures of cheerfulness and depressiveness. The present study examined whether positive illusions in perceptions of the self are related to cheerfulness and depressiveness. To test this hypothesis, the positive illusion factor of self-ratings was related to cheerfulness, depressiveness as well as to experiences of positive affect, negative affect, and life-satisfaction.

The model is illustrated in Figure 1.

Figure 1 is a bit messy, and it may be helpful to read previous posts for the basic model that connects factors (in black) with each other (black lines). Each factor is based on four indicators (self-ratings, informant ratings by students, informant ratings by mothers, and informant ratings by fathers). The figure shows only the self-ratings, as orange boxes marked with sr next to each factor. It is assumed that all of these self-ratings share method variance due to a general evaluative bias factor. This factor is represented as the bigger orange box marked as SR in capital letters. It is assumed that all self-ratings load on this factor (orange arrows). Furthermore, the model assumes a positive effect of cheerfulness on evaluative biases (green arrow) and that positive experiences (happy) are influenced by evaluative biases (another green arrow). Depressiveness is expected to be a negative predictor of the evaluative bias factor (red arrow) and evaluative biases are assumed to have a negative effect on sadness (also a red arrow).

Fitting this model to the data reduced model fit, chi2(1591) = 2655, CFI = .960, RMSEA = .022. The reason is that the general evaluative factor did not explain all of the residual correlations among self-ratings. To improve model fit, additional correlated residuals were allowed. For example, residual variances in self-ratings of recreation satisfaction and friendship satisfaction were correlated. These residual correlations were freed to maintain good fit. The fit of the final model was close to the fit of a model that allowed all residuals to be correlated, chi2(1542) = 2082, CFI = .980, RMSEA = .016.

The first important finding was that all self-ratings showed a significant loading (p < .001) on the evaluative bias factor in the predicted direction. The lowest loading was observed for extraversion, b = .18, se = .04, Z = 4.8. The highest loading was observed for self-ratings of positive affect, b = .62, se = .04, Z = 15.1. The loading for self-ratings of life-satisfaction was b = .51, se = .04, Z = 13.7. These results confirm that evaluative biases make a substantial contribution to self-ratings of well-being.

Reproducing Schimmack and Kim’s results, evaluative biases did not predict life-satisfaction (i.e., the shared variance by self-ratings and informant ratings), b = .03, se = .04, Z = 0.7. Evaluative biases also predicted neither positive affect (happy), b = .02, se = .04, Z = 0.4, nor negative affect (sadness), b = -.04, se = .05, Z = 0.8.
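For readers who want to check significance claims like these, the sketch below assumes that each reported Z is the estimate divided by its standard error and that it is evaluated against a standard normal distribution. The function name is illustrative. Because the coefficients in the text are rounded to two decimals, recomputed ratios can differ slightly from the reported Z values.

```python
from statistics import NormalDist


def wald_test(b, se):
    """Return (z, two-sided p-value) for an estimate and its standard error."""
    z = b / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Evaluative biases predicting life-satisfaction: b = .03, se = .04
z, p = wald_test(0.03, 0.04)
print(round(z, 2), round(p, 2))

# Loading of self-rated positive affect on the bias factor: b = .62, se = .04
z, p = wald_test(0.62, 0.04)
print(round(z, 1), p < 0.001)
```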

The new findings were that cheerfulness was not a significant predictor of evaluative biases, b = .09, se = .06, Z = 1.5, and that depressiveness was a positive rather than negative predictor of evaluative biases, b = .13, se = .06, Z = 2.7. Thus, there is no evidence that individuals with a depressive personality have a negative bias about their personalities or lives. The positive relationship might be a statistical fluke or it might show some deliberate rating bias to overcorrect for negative biases.

As hinted at in Part 7, the evaluative bias factor was significantly correlated with age, b = .37, se = .05, Z = 7.5. At least in this study, parents provided more favorable ratings of themselves than students. Whether this finding shows a general age trend remains to be examined. However, the finding casts a shadow on studies that rely on self-ratings to study personality development. Maybe some of the positive trends such as increased agreeableness or decreased neuroticism are inflated by these biases. It is therefore important to study personality development with measurement models that control for evaluative biases in personality ratings.


The present results challenge the widely held belief that positive illusions are beneficial for well-being and that the absence of positive illusions is associated with depression. At the same time, the present study did replicate previous findings that measures of positive illusions are correlated with self-ratings of well-being. In my opinion, this finding merely reveals that a positive rating bias also influences self-ratings of well-being. Future research needs to ensure that method bias does not produce spurious correlations between measures of positive illusions and measures of well-being. It is sad but true that thirty years of research have been wasted on studies that did not control for method variance even though method variance in personality ratings was demonstrated over 100 years ago (Thorndike, 1920) and is one of the most robust and well-replicated findings in personality research (Campbell & Fiske, 1959).

How to Build a Monster Model of Well-Being: Part 7

The first five parts built a model that related personality traits to well-being. Part 6 added sex (male/female) to the model. It may not come as a surprise that Part 7 adds age to the model because sex and age are two commonly measured demographic variables.

Age and Wellbeing

Diener et al.’s (1999) review article pointed out that early views of old age as a period of poor health and misery were not supported by empirical studies. Since then, some studies with nationally representative samples have found a U-shaped relationship between age and well-being. Accordingly, well-being decreases from young adulthood to middle age and then increases again into old age before well-being declines at the end of life. Thus, there is some evidence for a mid-life crisis (Blanchflower, 2021).

The present dataset cannot examine this U-shaped pattern because data are based on students and their parents, but the U-shaped pattern would predict that students have higher well-being than their middle-aged parents.

McAdams, Lucas, and Donnellan (2012) found that the relationship between age and life-satisfaction was explained by effects of age on life-domains. According to their findings in a British sample, health satisfaction decreased with age, but housing satisfaction increased with age. The average trend across domains mirrored the pattern for life-satisfaction judgments.

Based on these findings, I expected that age was a negative predictor of life-satisfaction and that this negative relationship is mediated by domain satisfaction. To test this prediction, I added age as a predictor variable. As for sex, age is an exogenous variable because age can influence personality and well-being, but personality cannot influence (biological) age. Although age was added as a predictor for all factors in the model, overall model fit decreased, chi2(1478) = 2198, CFI = .973, RMSEA = .019. This can happen when a new variable is also related to the unique variances of indicators. Inspection of the modification indices showed some additional relationships with self-ratings that suggested older respondents have a positive bias in their self-ratings. To allow for this possibility, I allowed all self-ratings to be influenced by age. This modification substantially increased model fit, chi2(1462) = 1970, CFI = .981, RMSEA = .016. I will further examine this positivity bias in the next model. Here I focus on the findings for age and well-being.

As expected, age was a negative predictor of life-satisfaction, b = -.21, se = .04, Z = 5.5. This effect was fully mediated. The direct effect of age on life-satisfaction was close to zero and not significant, b = -.01, se = .04, Z = 0.34. Age also had no direct effect on positive affect (happy), b = .00, se = .00, Z = 0.44, and only a small effect on negative affect (sadness), b = -.03, se = .01, Z = 2.5. Yet, the sign of this relationship shows lower levels of sadness in middle age, which does not explain the lower level of life-satisfaction. In contrast, age was a negative predictor of average domain satisfaction (DSX) and the effect size was close to the effect size for life-satisfaction, b = -.20, se = .05, Z = 4.1. This result replicates McAdams et al.’s (2012) finding that domain satisfaction mediates the effect of age on life-satisfaction.

However, the monster model shows that domain satisfaction is influenced by personality traits. Thus, it is possible that some of the age effects on domain satisfaction are not only influenced by objective domain aspects, but also by top-down effects of personality traits. To examine this, I traced the indirect effects of age on average domain satisfaction.

Age was a notable negative predictor of cheerfulness, b = -.29, se = .04, Z = 7.5. This effect was partially mediated by extraversion, b = -.07, se = .02, Z = 3.5, and agreeableness, b = -.08, se = .02, Z = 4.5, while some of the effect was direct, b = -.14, se = .03, Z = 4.4. There was no statistically significant effect of age on depressiveness, b = .07, se = .04, Z = 1.9.
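The decomposition of the age effect on cheerfulness can be verified by adding up its parts. The sketch below assumes the usual path-tracing rules for linear models: an indirect effect is the product of the paths along a route, and the total effect is the direct effect plus the sum of all indirect effects. The function names are illustrative, and the individual path coefficients behind each indirect effect are not reported in this post, so only the reported indirect effects are used.

```python
def indirect_effect(*paths):
    """Product of the path coefficients along one mediated route."""
    effect = 1.0
    for coefficient in paths:
        effect *= coefficient
    return effect


def total_effect(direct, indirect):
    """Direct effect plus the sum of all indirect effects."""
    return direct + sum(indirect.values())


# Reported effects of age on cheerfulness.
age_on_cheerfulness = total_effect(
    direct=-0.14,
    indirect={"extraversion": -0.07, "agreeableness": -0.08},
)
print(round(age_on_cheerfulness, 2))  # -0.29
```

The sum reproduces the reported total effect of b = -.29.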

Age also had direct relationships with some life domains. Age was a positive predictor of romantic satisfaction, b = .36, se = .04, Z = 8.2. Another strong relationship emerged for health satisfaction, b = -.36, se = .04, Z = 8.4. Another negative relationship was observed for work, b = -.26, se = .04, Z = 6.4, reflecting the difference between studying and working. Age was also a negative predictor of housing satisfaction, b = -.10, se = .04, Z = 2.8, recreation satisfaction, b = -.15, se = .05, Z = 3.4, financial satisfaction, b = -.10, se = .05, Z = 2.1, and friendship satisfaction, b = -.09, se = .04, Z = 2.1. In short, age was a negative predictor of satisfaction with all life domains except romantic relationships, even after controlling for the effects of age on cheerfulness.

The only positive effect of age on personality was an increase in conscientiousness, b = .15, se = .04, Z = 3.7, which is consistent with the personality literature (Roberts, Walton, & Viechtbauer, 2006). However, the indirect positive effect on life-satisfaction is small, b = .04.

In conclusion, the present results replicate that well-being decreases from young adulthood to middle age. The effect is mainly explained by a decrease in cheerfulness and decreasing satisfaction with a broad range of life domains. The only exception was a positive effect on romantic satisfaction. These results have to be interpreted in the context of the specific sample. Younger participants were students. It is possible that young adults who have already joined the workforce have lower well-being than students. The higher romantic satisfaction for parents may also be due to the recruitment of parents who remained married with children. Single and divorced middle-aged individuals show lower life-satisfaction. The fact that age effects were fully mediated shows that studies of age and well-being can benefit from the inclusion of personality measures and the measurement of domain satisfaction (McAdams et al., 2012).

How to Build a Monster Model of Well-Being: Part 6

The first five parts of this series built a model that related the Big Five personality traits as well as the depressiveness facet of neuroticism and the cheerfulness facet of extraversion to well-being. In this model, well-being is conceptualized as a weighted average of satisfaction with life domains and experiences of happiness and sadness (Part 5).

Part 6 adds sex/gender to the model. Although gender is a complex construct, most individuals identify as either male or female. As sex is frequently assessed as a demographic characteristic, the simple correlations of sex with personality and well-being are fairly well known and were reviewed by Diener et al. (1999).

A somewhat surprising finding is that life-satisfaction judgments show hardly any sex differences. Diener et al. (1999) point out that this finding seems to be inconsistent with findings that women report higher levels of neuroticism (neuroticism is a technical term for a disposition to experience more negative affects and does not imply a mental illness), negative affect, and depression. Accordingly, gender could have a negative effect on well-being that is mediated by neuroticism and depressiveness. To explain the lack of a sex difference in well-being, Diener et al. proposed that women also experience more positive emotions. Another possible mediator is agreeableness. Women consistently score higher in agreeableness and agreeableness is a positive predictor of well-being. Part 5 showed that most of the positive effect of agreeableness was mediated by cheerfulness. Thus, agreeableness may partially explain higher levels of cheerfulness for women. To my knowledge, these mediation hypotheses have never been formally tested in a causal model.

Adding sex to the monster model is relatively straightforward because sex is an exogenous variable. That is, causal paths can originate from sex, but no causal path can be pointed at sex. After all, we know that sex is determined by the genetic lottery at the moment of conception. It is therefore possible to add sex as a cause to all factors in the model. Despite adding all causal pathways, model fit decreased a bit, chi2(1432) = 2068, CFI = .976, RMSEA = .018. The main reason for reduced fit would be that sex predicts some of the unique variances in individual indicators. Inspection of modification indices showed that sex was related to higher student ratings of neuroticism and lower ratings of neuroticism by mothers as informants. While freeing either of these parameters improved model fit, the effects on the sex difference in neuroticism were in opposite directions. Assuming (!) that mothers underestimate neuroticism increased the sex difference in neuroticism from d = .69, se = .07 to d = .81, se = .07. Assuming that students overestimate neuroticism resulted in a smaller sex difference of d = .54, se = .08. Thus, the results suggest that sex differences in neuroticism are moderate to large (d = .5 to .8), but there is uncertainty due to some rating biases in ratings of neuroticism. A model that allowed for both biases had even better fit and produced the compromise effect size estimate of d = .67, se = .08. Overall fit was now only slightly lower than for the model without sex, chi2(1430) = 2024, CFI = .978, RMSEA = .017. Figure 2 shows the theoretically significant direct effects of sex with effect sizes in units of standard deviations (Cohen’s d).

The model not only replicated sex differences in neuroticism. It also replicated sex differences in agreeableness, although the effect size was small, d = .29, se = .08, Z = 3.7. Not expected was the finding that women also scored higher in extraversion, d = .38, se = .07, Z = 5.6, and conscientiousness, d = .36, se = .07, Z = 5.0. The only life domain with a notable sex difference was romantic relationships, d = -.41, se = .08, Z = 5.4. The only other statistically significant difference was found for recreation, d = -.19, se = .08, Z = 2.4. Thus, life domains do not contribute substantially to sex differences in well-being. Even the sex difference for romantic satisfaction is not consistently found in studies of marital satisfaction.

The model’s indirect-effect results replicated the finding that there are no notable sex differences in life-satisfaction, total effect d = -.07, se = .06, Z = 1.1. Thus, tracing the paths from sex to life-satisfaction provides valuable insights into the paradox that women tend to have higher levels of neuroticism, but not lower life-satisfaction.

Consistent with prior studies, women had higher levels of depressiveness and the effect size was small, d = .24, se = .08, Z = 3.0. The direct effect was not significant, d = .06, se = .08, Z = 0.8. The only positive effect was mediated by neuroticism, d = .42, se = .06, Z = 7.4. Other indirect effects reduced the effect of sex on depressiveness. Namely, women’s higher conscientiousness (in this sample) reduced depressiveness, d = -.14, as did women’s higher agreeableness, d = -.06, se = .02, Z = 2.7, and women’s higher extraversion, d = -.04, se = .02, Z = 2.4. These results show the problem of focusing on neuroticism as a predictor of well-being. While neuroticism shows a moderate to strong sex difference, it is not a strong predictor of well-being. In contrast, depressiveness is a stronger predictor of well-being, but has a relatively small sex difference. This small sex difference partially explains why women can have higher levels of neuroticism without lower levels of well-being. Men and women are nearly equally disposed to suffer from depression. Consistent with this finding, men are actually more likely to commit suicide than women.

Consistent with Diener et al.’s (1999) hypothesis, cheerfulness also showed a positive relationship with sex. The total effect size was larger than for depressiveness, d = .50, se = .07, Z = 7.2. The total effect was partially explained by a direct effect of sex on cheerfulness, d = .20, se = .06, Z = 3.6. Indirect effects were mediated by extraversion, d = .27, se = .05, Z = 5.8, agreeableness d = .11, se = .03, Z = 3.6, and conscientiousness, d = .05, se = .02, Z = 3.2. However, neuroticism reduced the effect size by d = -.12, se = .03, Z = 4.4.
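In a path model, the total effect is the sum of the direct effect and all indirect effects. The sketch below applies this to the values reported above for depressiveness and cheerfulness. The dictionary labels are only illustrative, and small gaps between the sum and the reported total are expected because the published values are rounded.

```python
# Direct and indirect effects of sex (Cohen's d) as reported in the text.
depressiveness = {
    "direct": 0.06,
    "via neuroticism": 0.42,
    "via conscientiousness": -0.14,
    "via agreeableness": -0.06,
    "via extraversion": -0.04,
}
cheerfulness = {
    "direct": 0.20,
    "via extraversion": 0.27,
    "via agreeableness": 0.11,
    "via conscientiousness": 0.05,
    "via neuroticism": -0.12,
}


def total_effect(paths):
    """Total effect = direct effect + sum of all indirect effects."""
    return sum(paths.values())


for label, paths, reported in [
    ("depressiveness", depressiveness, 0.24),
    ("cheerfulness", cheerfulness, 0.50),
]:
    print(f"{label}: sum of paths = {total_effect(paths):.2f}, "
          f"reported total = {reported:.2f}")
```

The sum reproduces the reported total for depressiveness and comes within .01 of the reported total for cheerfulness.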

The effects of gender on depressiveness and cheerfulness produced corresponding differences in experiences of NA (sadness) and PA (happiness), without additional direct effects of gender on the sadness or happiness factors. The effect on happiness was a bit stronger, d = .35, se = .08, Z = 4.6, than the effect on sadness, d = .28, se = .07, Z = 4.1.


In conclusion, the results provide empirical support for Diener et al.’s hypothesis that sex differences in well-being are small because women have higher levels of positive affect and negative affect. The relatively large difference in neuroticism is also deceptive because neuroticism is not a direct predictor of well-being and gender differences in depressiveness are weaker than gender differences in neuroticism or anxiety. In the present sample, women also benefited from higher levels of agreeableness and conscientiousness that are linked to higher cheerfulness and lower depressiveness.

The present study also addresses concerns that self-report biases may distort gender differences in measures of affect and well-being (Diener et al., 1999). In the present study, well-being of mothers and fathers was not just measured by their self-reports, but also by students’ reports of their parents’ well-being. I have also asked students in my well-being course whether their mother or father has higher life-satisfaction. The answers show pretty much a 50:50 split. Thus, at least subjective well-being does not appear to differ substantially between men and women. This blog post showed a theoretical model that explains why men and women have similar levels of well-being.

Continue here to Part 7.

How to Build a Monster Model of Well-Being: Part 5

This is Part 5 of the blog series on the monster model of well-being. The first parts developed a model of well-being that related life-satisfaction judgments to affect and domain satisfaction. I then added the Big Five personality traits to the model (Part 4). The model confirmed/replicated the key finding that neuroticism has the strongest relationship with life-satisfaction, b ~ .3. It also showed notable relationships with extraversion, agreeableness, and conscientiousness. The relationship with openness was practically zero. The key novel contribution of the monster model is to trace the effects of the Big Five personality traits on well-being. The results showed that neuroticism, extraversion, and agreeableness had broad effects on various life domains (top-down effects) that mediated the effect on global life-satisfaction (bottom-up effect). In contrast, conscientiousness was only instrumental for a few life domains.

The main goal of Part 5 is to examine the influence of personality traits at the level of personality facets. Various models of personality assume a hierarchy of traits. While there is considerable disagreement about the number of levels and the number of traits on each level, most models share a basic level of traits that correspond to traits in everyday language (talkative, helpful, reliable, creative) and a higher-order level that represents covariations among basic traits. In the Five Factor Model, the Big Five traits are five independent higher-order traits. Costa and McCrae’s influential model of the Big Five recognizes six basic-level traits, called facets, for each of the Big Five traits. Relatively few studies have conducted a comprehensive examination of personality and well-being at the facet level (Schimmack, Oishi, Furr, & Funder, 2004). A key finding was that the depressiveness facet of neuroticism was the only neuroticism facet with unique variance in the prediction of life-satisfaction. Similarly, the cheerfulness facet of extraversion was the only extraversion facet that predicted unique variance in life-satisfaction. Thus, the Mississauga family study included measures of these two facets in addition to the Big Five items.

In Part 5, I add these two facets to the monster model of well-being. Consistent with Big Five theory, I allowed for causal effects of Extraversion on Cheerfulness and of Neuroticism on Depressiveness. Strict hierarchical models would assume that each facet is related to only one broad factor. In reality, however, basic-level traits can be related to multiple higher-order factors. Not much attention has been paid to secondary loadings of the depressiveness and cheerfulness facets on the other Big Five factors. In one study that controlled for evaluative bias, I found that depressiveness had a negative loading on conscientiousness (Schimmack, 2019). This relationship was confirmed in this dataset. However, additional relations improved model fit. Namely, cheerfulness was related to lower neuroticism and higher agreeableness, and depressiveness was related to lower extraversion and agreeableness. Some of these relations were weak and might be spurious due to the use of short three-item scales to measure the Big Five.

The monster model combines two previous mediation models that link the Big Five personality traits to well-being. Schimmack, Diener, and Oishi (2002) proposed that affective experiences mediate the effects of extraversion and neuroticism. Schimmack, Oishi, Furr, and Funder (2004) suggested that the Depressiveness and Cheerfulness facets mediate the effects of Extraversion and Neuroticism. The monster model proposes that extraversion’s effect is mediated by trait cheerfulness which influences positive experiences, whereas neuroticism’s effect is mediated by trait depressiveness which in turn influences experiences of sadness.

When this model was fitted to the data, depressiveness and cheerfulness fully mediated the effects of extraversion and neuroticism. However, extraversion became a negative predictor of well-being. While it is possible that the unique aspects of extraversion that are not shared with cheerfulness have a negative effect on well-being, there is little evidence for such a negative relationship in the literature. Another possible explanation for this finding is that cheerfulness and positive affect (happy) share some method variance that inflates the correlation between these two factors. As a result, the indirect effect of extraversion is overestimated. When this shared method variance is fixed to zero and extraversion is allowed to have a direct effect, SEM will use the free parameter to compensate for the overestimation of the indirect path. The ability to model shared method variance is one of the advantages of SEM over mediation tests that rely on manifest variables and assume perfect measurement of constructs. Figure 1 shows the correlation between measures of trait PA (cheerfulness) and experienced PA (happy) as a curved arrow. A similar shared method effect was allowed for depressiveness and experienced sadness (sad), although it turned out not to be significant.
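The compensation mechanism described above can be illustrated with a small population-level calculation. The sketch assumes a simplified three-variable chain X -> M -> Y (think extraversion -> cheerfulness -> happy) with standardized X and M, no true direct effect of X on Y, and a residual covariance r between M and Y that stands in for shared method variance. All numeric values and the function name are hypothetical, and a regression on manifest variables is used as a stand-in for the SEM with the residual covariance fixed to zero.

```python
def regression_ignoring_method_variance(a, b, r):
    """Population regression of Y on M and X when the true model is
    X -> M -> Y (paths a and b) and the residuals of M and Y covary by r.

    X and M are standardized, so cov(X, M) = a.
    Returns (estimated M -> Y path, estimated direct X -> Y path).
    """
    cov_xm = a
    cov_xy = a * b  # X influences Y only through M
    cov_my = b + r  # true effect plus shared method covariance
    denom = 1 - cov_xm ** 2
    b_hat = (cov_my - cov_xm * cov_xy) / denom
    c_hat = (cov_xy - cov_xm * cov_my) / denom
    return b_hat, c_hat


# Hypothetical values: a = .5, b = .4, method covariance r = .15.
b_hat, c_hat = regression_ignoring_method_variance(0.5, 0.4, 0.15)
print(f"with method variance ignored: b_hat = {b_hat:.2f}, direct = {c_hat:.2f}")

b_hat0, c_hat0 = regression_ignoring_method_variance(0.5, 0.4, 0.0)
print(f"without method variance:      b_hat = {b_hat0:.2f}, direct = {c_hat0:.2f}")
```

In this toy case the mediator path is overestimated (.60 instead of .40) and a spurious negative direct effect of -.10 appears, even though the true direct effect is zero. Freeing the residual covariance removes the need for this compensation.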

Exploratory analysis showed that cheerfulness and depressiveness did not fully mediate all effects on well-being. Extraversion, agreeableness, and conscientiousness had additional direct relationships with some life domains that contribute to well-being. The final model retained good overall fit, and modification indices did not show notable additional relationships for the added constructs, chi2(1387) = 1914, CFI = .980, RMSEA = .017.

The standardized model indirect effects were used to quantify the effect of the facets on well-being and to quantify indirect and direct effects of the Big Five on well-being. The total effect of Depressiveness was b = -.47, Z = 8.8. About one-third of this effect was directly mediated by sadness, b = -.19. Follow-up research needs to examine how much of this relationship might be explained by risk factors for mood disorders as compared to normal levels of depressive moods. Valuable new insights can emerge from integrating the extensive literature on depression and life-satisfaction. The remaining effects were mediated by top-down effects of depressiveness on domain satisfactions (Payne & Schimmack, 2020). The present results show that it is important to control for these top-down effects in studies that examine the bottom-up effects of life domains on life-satisfaction.

The total effect of cheerfulness was about as large as the effect of depressiveness, b = .44, Z = 6.6. In contrast to depressiveness, the indirect effect through happiness was weak, b = .02, Z = 0.6, because happy did not make a significant unique contribution to life-satisfaction. Thus, nearly all of the effect was mediated by domain satisfaction.

In sum, the results for depressiveness and cheerfulness are consistent with integrated bottom-up-top-down models that postulate top-down effects of affective dispositions on domain satisfaction and bottom-up effects from domain satisfaction to life-satisfaction. The results are only partially consistent with models that assume affective experiences mediate the effect (Schimmack, Diener, & Oishi, 2002).

The effect of neuroticism on well-being, b = -.36, Z = 10.7, was fully mediated by depressiveness, b = -.28, and cheerfulness, b = -.08. Causality is implied by the assumption, made in hierarchical models of personality traits, that neuroticism is a common cause of specific dispositions for anger, anxiety, depressiveness, and other negative affects. If this assumption were false, neuroticism would only be a correlate of well-being and it would be even more critical to focus on depressiveness as the more important personality trait related to well-being. Thus, future research on personality and well-being needs to pay more attention to the depressiveness facet of neuroticism. Too many short neuroticism measures focus exclusively or predominantly on anxiety.

Following Costa and McCrae (1980), extraversion has often been considered a second important personality trait that influences well-being. However, quantitatively the effect of extraversion on well-being is relatively small, especially in studies that control for shared method variance. The effect size for this sample was b = .12, a statistically small effect, and a much smaller effect than for its cheerfulness facet. The weak effect was a combination of a moderate positive effect mediated by cheerfulness, b = .32, and a negative effect that was mediated by direct effects of extraversion on domain satisfactions, b = -.23. These results show how important it is to examine the relationship between extraversion and well-being at the facet level. Whereas cheerfulness explains why extraversion has positive effects on well-being, the relationship of other facets with well-being requires further investigation. The present results make it clear that a simple reason for positive relationships between extraversion and well-being is the cheerfulness facet. The finding that individuals with a cheerful disposition evaluate their lives more positively may not be surprising or may even appear to be trivial, but it would be a mistake to omit cheerfulness from a causal theory of well-being. Future research needs to uncover the determinants of individual differences in cheerfulness.

Agreeableness had a moderate effect on well-being, b = .21, Z = 5.8. Importantly, the positive effect of agreeableness was fully mediated by cheerfulness, b = .17, and depressiveness, b = .09, with a small negative direct effect on domain satisfactions, b = -.05, which was due to lower work satisfaction for individuals high in agreeableness. These results replicate Schimmack et al.’s (2004) finding that agreeableness was not a predictor of life-satisfaction when cheerfulness and depressiveness were added to the model. This finding has important implications for theories of well-being that see a relationship between morality, empathy, and prosociality and well-being. The present results do not support this interpretation of the relationship between agreeableness and well-being. The results also show the importance of taking second-order relationships more seriously. Hierarchical models consider agreeableness to be unrelated to cheerfulness and depressiveness, but simple hierarchical models do not fit actual data. Finally, it is important to examine the causal relationship between agreeableness and affective facets. It is possible that cheerfulness influences agreeableness rather than agreeableness influencing cheerfulness. In this case, agreeableness would be a predictor but not a cause of higher well-being. However, it is also possible that an agreeable disposition contributes to a cheerful disposition because agreeable people may be more easily satisfied with reality. In any case, future studies of agreeableness and related traits and well-being need to take potential relationships with cheerfulness and depressiveness into account.

Conscientiousness also has a moderate effect on well-being, b = .19, Z = 5.9. A large portion of this effect is mediated by the Depressiveness facet of Neuroticism, b = .15. Although a potential link between Conscientiousness and Depressiveness is often omitted from hierarchical models of personality, neuropsychological research is consistent with the idea that conscientiousness may help to regulate negative affective experiences. Thus, this relationship deserves more attention in future research. If causality were reversed, conscientiousness would have only a trivial causal effect on well-being.
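To make the size of the depressiveness pathway concrete, the following sketch computes the share of each trait's total effect that runs through depressiveness, using the standardized coefficients reported above for neuroticism, agreeableness, and conscientiousness. The shares are rough because the inputs are rounded to two decimals, and the helper name is hypothetical.

```python
# (total effect, effect mediated by depressiveness) as reported in the text.
effects = {
    "neuroticism": (-0.36, -0.28),
    "agreeableness": (0.21, 0.09),
    "conscientiousness": (0.19, 0.15),
}


def share_mediated(total, indirect):
    """Proportion of the total effect carried by one indirect path."""
    return indirect / total


for trait, (total, via_depressiveness) in effects.items():
    share = share_mediated(total, via_depressiveness)
    print(f"{trait:18s} total b = {total:+.2f}, "
          f"via depressiveness = {via_depressiveness:+.2f} ({share:.0%})")
```

With these values, depressiveness carries roughly three-quarters of the effects of neuroticism and conscientiousness and a bit less than half of the effect of agreeableness.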

In short, adding cheerfulness and depressiveness facets to the model provided several new insights. First of all, the results replicated prior findings that these two facets are strong predictors of well-being. Second, the results showed that the Big Five are only weak unique predictors of well-being when their relationship with Cheerfulness and Depressiveness is taken into account. Omitting these important predictors from theories of well-being is a major problem of studies that focus on personality traits at the Big Five level. It also makes theoretical sense that cheerfulness and depressiveness are related to well-being. These traits influence the emotional evaluation of people’s lives. Thus, even when objective life circumstances are the same, cheerful individuals are likely to look at the bright side and see their lives through rose-colored glasses. In contrast, depressiveness is likely to color life evaluations negatively. Longitudinal studies confirm that depressive symptoms, positive affect, and negative affect are influenced by stable traits (Anusic & Schimmack, 2016; Desai et al., 2012). Furthermore, twin studies show that shared genes contribute to the correlation between life-satisfaction judgments and depressive symptoms (Nes et al., 2013). Future research needs to examine the biopsychosocial factors that cause stable variation in dispositional cheerfulness and depressiveness that contribute to individual differences in well-being.

Continue here to Part 6.