“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).
DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability that a study with a significant result produces a significant result again in an exact replication study that uses the same sample size and significance criterion (Schimmack, 2017).
See Reference List at the end for peer-reviewed publications.
The purpose of the R-Index blog is to increase the replicability of published results in psychological science and to alert consumers of psychological research about problems in published articles.
I have used these tools to demonstrate that several claims in psychological articles are incredible (a.k.a., untrustworthy), starting with Bem’s (2011) outlandish claims of time-reversed causal pre-cognition (Schimmack, 2012). This article triggered a crisis of confidence in the credibility of psychology as a science.
Over the past decade it has become clear that many other seemingly robust findings are also highly questionable. For example, I showed that many claims in Nobel Laureate Daniel Kahneman’s book “Thinking, Fast and Slow” are based on shaky foundations (Schimmack, 2020). An entire book on unconscious priming effects, by John Bargh, also ignores replication failures and lacks credible evidence (Schimmack, 2017). The hypothesis that willpower is fueled by blood glucose and easily depleted is also not supported by empirical evidence (Schimmack, 2016). In general, many claims in social psychology are questionable and require new evidence to be considered scientific (Schimmack, 2020).
Each year I post new information about the replicability of research in 120 Psychology Journals (Schimmack, 2021). I have also started to provide information about the replicability of individual researchers, along with guidelines on how to evaluate their published findings (Schimmack, 2021).
Replication is essential for an empirical science, but it is not sufficient. Psychology also has a validation crisis (Schimmack, 2021). That is, measures are often used before it has been demonstrated how well they measure something. For example, psychologists have claimed that they can measure individuals’ unconscious evaluations, but there is no evidence that unconscious evaluations even exist (Schimmack, 2021a, 2021b).
If you are interested in the story of how I ended up becoming a meta-critic of psychological science, you can read it here (my journey).
Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, MP.2018.874, 1-22. https://doi.org/10.15626/MP.2018.874
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566. http://dx.doi.org/10.1037/a0029487
Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246
Good science requires not only open and objective reporting of new data; it also requires unbiased review of the literature. However, there are no rules and regulations regarding citations, and many authors cherry-pick citations that are consistent with their claims. Even when studies have failed to replicate, original studies are cited without citing the replication failures. In some cases, authors even cite original articles that have been retracted. Fortunately, it is easy to spot these acts of unscientific behavior. Here I am starting a project to list examples of bad scientific behaviors. Hopefully, more scientists will take the time to hold their colleagues accountable for ethical behavior in citations. They can even do so by posting anonymously on the PubPeer comment site.
Authors: P. A. Hancock; John D. Lee; John W. Senders
Citation: Misattributions involved in such processes of assessment can, as we have seen, lead to adverse consequences (e.g., Johnson et al., 2019).
DOI: 10.1177/00187208211036323
Correction: Retraction (https://www.pnas.org/content/117/30/18130)
Authors: Desmond Ang
Citation: While empirical evidence of racial bias is mixed (Nix et al. 2017; Fryer 2019; Johnson et al. 2019; Knox, Lowe, and Mummolo 2020; Knox and Mummolo 2020)
Correction: Retraction of Johnson et al. (https://www.pnas.org/content/117/30/18130)
Authors: Jordan R. Riddell; John L. Worrall
Citation: Recent years have also seen improvements in benchmarking-related research, that is, in formulating methods to more accurately analyze whether bias (implicit or explicit) or racial disparities exist in both UoF and OIS. Recent examples include Cesario, Johnson, and Terrill (2019), Johnson, Tress, Burkel, Taylor, and Cesario (2019), Shjarback and Nix (2020), and Tregle, Nix, and Alpert (2019).
Correction: Retraction of Johnson et al. (https://www.pnas.org/content/117/30/18130)
Authors: Dean Knox, Will Lowe, Jonathan Mummolo
Citation: A related study, Johnson et al. (2019), attempts to estimate racial bias in police shootings. Examining only positive cases in which fatal shootings occurred, they find that the majority of shooting victims are white and conclude from this that no antiminority bias exists
Correction: Retraction of Johnson et al. (https://www.pnas.org/content/117/30/18130)
Authors: Chew Wei Ong, Kenichi Ito
Citation: This penalty treatment of error trials has been shown to improve the correlations between the IAT and explicit measures, indicating a greater construct validity of the IAT.
Correction: higher correlations do not imply higher construct validity of IATs as measures of implicit attitudes (https://doi.org/10.1177/1745691619863798)
Authors: Sara Costa, Viviana Langher, Sabine Pirchio
Citation: The most used method to assess implicit attitudes is the “Implicit Association Test” (IAT; Greenwald et al., 1998), which presents a good reliability (Schnabel et al., 2008) and validity (Nosek et al., 2005; Greenwald et al., 2009).
DOI: doi: 10.3389/fpsyg.2021.712356
Correction: does not cite critique of the construct validity of IATs (https://doi.org/10.1177/1745691619863798)
Authors: Yang, Gengfeng, Zhenzhen, Dongjing
Citation: Studies have found that merely activating the concept of money can increase egocentrism, which can further redirect people's attention toward their inner motivations and needs (Zaleskiewicz et al., 2018) and reduce their sense of connectedness with others (Caruso et al., 2013).
Correction: cite Caruso et al. (2013), but do not cite replication failure by Caruso et al. (2017) (https://journals.sagepub.com/doi/abs/10.1177/0956797617706161)
Authors: Garriy Shteynberg, Theresa A. Kwon, Seong-Jae Yoo, Heather Smith, Jessica Apostle, Dipal Mistry, Kristin Houser
Citation: Money is often described as profane, vulgar, and filthy (Belk & Wallendorf, 1990), yet incidental exposure to money increases the endorsement of the very social systems that render such money meaningful (Caruso et al., 2013).
Correction: cite Caruso et al. (2013), but do not cite replication failure by Caruso et al. (2017) (https://journals.sagepub.com/doi/abs/10.1177/0956797617706161)
Author: Arden Rowell
Citation: In particular, some studies show that encouraging people to think about things in terms of money may measurably change people's thoughts, feelings, motivations, and behaviors. See Eugene M. Caruso, Kathleen D. Vohs, Brittani Baxter & Adam Waytz, Exposure to Money Increases Endorsement of Free-Market Systems and Social Inequality, 142 J. EXPERIMENTAL PSYCH. 301, 301-02, 305 (2013)
DOI: https://scholarship.law.nd.edu/ndlr/vol96/iss4/9
Correction: cite Caruso et al. (2013), but do not cite replication failure by Caruso et al. (2017) (https://journals.sagepub.com/doi/abs/10.1177/0956797617706161)
Authors: Anna Jasinenko, Fabian Christandl, Timo Meynhardt
Citation: Caruso et al. (2013) find that exposure to money (which is prevalent in most shopping situations) activates personal tendencies to justify the market system. Furthermore, they find that money exposure also activates general system justification; however, the effect was far smaller than for the activation of MSJ.
Correction: cite Caruso et al. (2013), but do not cite replication failure by Caruso et al. (2017) (https://journals.sagepub.com/doi/abs/10.1177/0956797617706161)
It is well known that many psychology articles report too many significant results because researchers selectively publish results that support their predictions (Francis, 2014; Sterling, 1959; Sterling et al., 1995; Schimmack, 2021). This often leads to replication failures (Open Science Collaboration, 2015).
One way to examine whether a set of studies reported too many significant results is to compare the success rate (i.e., the percentage of significant results) with the mean observed power in studies (Schimmack, 2012). In this video, I illustrate this bias detection method using Vohs et al.’s (2006) Science article “The Psychological Consequences of Money.”
I use this article for training students because it reports 9 studies, and a reasonably large number of studies is needed to have good power to detect selection bias. Also, the article is short and the results are straightforward. Thus, students have no problem filling out the coding sheet that is needed to compute observed power (Coding Sheet).
The results show clear evidence of selection bias that undermines the credibility of the reported results (see also TIVA). Although bias tests are available, few researchers use them to protect themselves from junk science, and articles like this one continue to be cited at high rates (683 total, 67 in 2019). A simple way to protect yourself from junk science is to adjust the alpha level to .005 because many questionable practices produce p-values that are just below .05. For example, the lowest p-value in these 9 studies was p = .006. Thus, not a single study was statistically significant with alpha = .005.
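The comparison of the success rate with mean observed power can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the p-values are hypothetical placeholders (they are not the values from Vohs et al., 2006), the function names are my own, and observed power is computed with a normal approximation that treats each observed z-score as if it were the true signal-to-noise ratio.

```python
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()

def observed_power(p_two_sided, alpha=0.05):
    """Observed power implied by a two-sided p-value (normal approximation)."""
    z_obs = STD_NORMAL.inv_cdf(1 - p_two_sided / 2)   # p-value -> absolute z-score
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)        # 1.96 for alpha = .05
    # Probability that a new z-score centered on z_obs lands beyond either critical value.
    return (1 - STD_NORMAL.cdf(z_crit - z_obs)) + STD_NORMAL.cdf(-z_crit - z_obs)

def bias_check(p_values, alpha=0.05):
    """Return success rate, mean observed power, and their difference (inflation)."""
    success_rate = mean(1.0 if p < alpha else 0.0 for p in p_values)
    mean_power = mean(observed_power(p, alpha) for p in p_values)
    return success_rate, mean_power, success_rate - mean_power

# Hypothetical p-values for nine studies, all just below .05 (NOT real data).
hypothetical_p = [0.04, 0.03, 0.02, 0.045, 0.01, 0.006, 0.035, 0.025, 0.015]
success_rate, mean_power, inflation = bias_check(hypothetical_p)
print(f"success rate = {success_rate:.2f}, mean observed power = {mean_power:.2f}, "
      f"inflation = {inflation:.2f}")
```

A success rate that is much higher than the mean observed power signals that there are more significant results than the studies' power can explain. Running the same check with alpha = .005 shows how many results survive the stricter criterion.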
Last week I posted a video that provided an introduction to the basic concepts of statistics, namely effect sizes and sampling error. A test statistic like a t-value is simply the ratio of the effect size over sampling error. This ratio is also known as a signal-to-noise ratio. The bigger the signal (effect size), the more likely it is that we will notice it in our study. Similarly, the less noise we have (sampling error), the easier it is to observe even small signals.
In this video, I use the basic concepts of effect sizes and sampling error to introduce the concept of statistical power. Statistical power is defined as the percentage of studies that produce a statistically significant result. When alpha is set to .05, it is the expected percentage of p-values with values below .05.
Statistical power is important to avoid type-II errors; that is, there is a meaningful effect, but the study fails to provide evidence for it. While researchers cannot control the magnitude of effects, they can increase power by lowering sampling error. Thus, researchers should carefully think about the magnitude of the expected effect to plan how large their sample has to be to have a good chance to obtain a significant result. Cohen proposed that a study should have at least 80% power. The planning of sample sizes using power calculation is known as a priori power analysis.
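An a priori power analysis can be sketched as follows. The sketch rests on assumptions that are not part of the video: it uses a normal approximation instead of the t-distribution, two equal-sized groups, and the simplified sampling error se = 2/sqrt(N) for a total sample size N. The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def approx_power(d, total_n, alpha=0.05):
    """Approximate two-sided power for standardized effect d and total sample size."""
    se = 2 / total_n ** 0.5            # simplified sampling error, two equal groups
    signal_to_noise = d / se           # expected value of the test statistic
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return ((1 - STD_NORMAL.cdf(z_crit - signal_to_noise))
            + STD_NORMAL.cdf(-z_crit - signal_to_noise))

def required_total_n(d, target_power=0.80, alpha=0.05, max_n=1_000_000):
    """Smallest even total sample size whose approximate power reaches the target."""
    for total_n in range(4, max_n + 1, 2):
        if approx_power(d, total_n, alpha) >= target_power:
            return total_n
    raise ValueError("target power not reached within max_n")

# Planning example: a moderate expected effect (d = .5) and Cohen's 80% target.
print(required_total_n(0.5))
# Power of a small study (20 participants per group) for the same effect.
print(round(approx_power(0.5, 40), 2))
```

Because the normal approximation ignores the extra uncertainty in the estimated standard deviation, an exact calculation based on the t-distribution gives a slightly larger required sample size.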
The problem with a priori power analysis is that researchers may fool themselves about effect sizes and conduct studies with insufficient sample sizes. In this case, power will be less than 80%. It is therefore useful to estimate the actual power of studies that are being published. In this video, I show that actual power could be estimated by simply computing the percentage of significant results. However, in reality this approach would be misleading because psychology journals discriminate against non-significant results. This is known as publication bias. Empirical studies show that the percentage of significant results for theoretically important tests is over 90% (Sterling, 1959). This does not mean that mean power of psychological studies is over 90%. It merely suggests that publication bias is present. In a follow-up video, I will show how it is possible to estimate power when publication bias is present. This video is important for understanding what statistical power is.
Each year, I work with undergraduate students on the coding of research articles to examine the replicability and credibility of psychological science (ROP2020). Before students code test-statistics from t-tests or F-tests in results sections, I provide a crash course on inferential statistics (null-hypothesis significance testing). Although some students have taken a basic stats course, the courses often fail to teach a conceptual understanding of statistics and distract students with complex formulas that are treated like a black box that converts data into p-values (or worse, stars that reflect whether p < .05*, p < .01**, or p < .001***).
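A small simulation illustrates why the percentage of significant results in journals cannot be taken as an estimate of power. All numbers are assumptions chosen for illustration: every simulated study has 50% power, significant results are always published, and non-significant results are published with a probability of only 5%.

```python
import random
from statistics import NormalDist

def simulate_publication_bias(n_studies=20000, true_power=0.50,
                              publish_nonsig_prob=0.05, alpha=0.05, seed=1):
    """Return (success rate of all studies, success rate of published studies)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Signal-to-noise ratio that yields the requested power
    # (ignoring the negligible chance of significance in the wrong direction).
    ncp = z_crit + NormalDist().inv_cdf(true_power)
    all_sig = published = published_sig = 0
    for _ in range(n_studies):
        z = rng.gauss(ncp, 1.0)                 # test statistic with sampling error
        significant = abs(z) > z_crit
        all_sig += significant
        # Assumed selection rule: significant results always get published.
        if significant or rng.random() < publish_nonsig_prob:
            published += 1
            published_sig += significant
    return all_sig / n_studies, published_sig / published

all_rate, published_rate = simulate_publication_bias()
print(f"all studies: {all_rate:.2f} significant; published studies: {published_rate:.2f}")
```

Under these assumptions the success rate of all studies stays close to the true power, whereas the success rate in the published record is far higher, which is the pattern Sterling (1959) observed in journals.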
In this one-hour lecture, I introduce the basic principles of null-hypothesis significance testing using the example of the t-test for independent samples.
I explain that a t-value is conceptually made up of three components, namely the effect size (D = x1 – x2), a measure of the natural variation of the dependent variable (the standard deviation, s), and a measure of the amount of sampling error (simplified, se = 2/sqrt(n1 + n2)).
Moreover, dividing the effect size D by the standard deviation provides the familiar standardized effect size, Cohen’s d = D/s. This means that a t-value corresponds to the ratio of the standardized effect size (d) over the amount of sampling error (se), t = d/se.
It follows that a t-value is influenced by two quantities. T-values increase as the standardized (unit-free) effect sizes increase and as the sampling error decreases. The two quantities are sometimes called signal (effect size) and noise (sampling error). Accordingly, the t-value is the signal-to-noise ratio. I compare the signal and noise to an experiment where somebody is throwing rocks into a lake and somebody has to tell whether a rock was thrown based on the observation of a splash. A study with a small effect and a lot of noise is like trying to detect the splash of a small pebble on a very windy, stormy day where waves are creating a lot of splashes that make it hard to see the small splash made by a pebble. However, if you throw a big rock into the lake, you can see the big splash from the rock even when the wind creates a lot of splashing. If you want to see the splash of a pebble, you need to wait for a calm day without wind. These conditions correspond to a study with a large sample and very little sampling error.
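The three components can be put together in a short Python sketch. It uses the simplified formulas from this lecture (D = x1 – x2, d = D/s, se = 2/sqrt(n1 + n2), t = d/se); the function name and the example numbers are made up for illustration, and the simplified se assumes groups of roughly equal size.

```python
import math

def t_value(mean1, mean2, sd, n1, n2):
    """t-value as a signal-to-noise ratio, using the simplified sampling error."""
    D = mean1 - mean2                 # unstandardized effect size
    d = D / sd                        # standardized effect size (Cohen's d): the signal
    se = 2 / math.sqrt(n1 + n2)       # simplified sampling error: the noise
    return d / se

# Hypothetical example: means of 105 and 100, standard deviation of 10.
print(t_value(105, 100, 10, 50, 50))     # 50 participants per group
print(t_value(105, 100, 10, 200, 200))   # same effect, four times the sample
```

With the same effect size, quadrupling the sample size halves the sampling error and therefore doubles the t-value. This is the calm day on which even a pebble makes a visible splash.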
Have a listen and let me know how I am doing. Feel free to ask questions that help me to understand how I can make the introduction to statistics even easier. Too many statistics books and lecturers intimidate students with complex formulas and Greek symbols that make statistics look hard, but in reality it is very simple. Data always have two components: the signal you are looking for and the noise that makes it hard to see the signal. The bigger the signal-to-noise ratio is, the more likely it is that you saw a true signal. Of course, it can be hard to quantify signals and noise, and statisticians work hard at getting good estimates of noise, but that does not have to concern users of statistics. As users of statistics, we just trust statisticians that they have good (the best) estimates to see how good our data are.
Articles published in peer-reviewed journals are only the tip of the scientific iceberg. Professional organizations want you to believe that these published articles are carefully selected to be the most important and scientifically credible articles. In reality, peer-review is unreliable and invalid, and editorial decisions are based on personal preferences. For this reason, the censoring mechanism is often hidden. Part of the movement towards open science is to make the censoring process transparent.
I therefore post the decision letter and the reviews from JEP:General. I sent my ms “z-curve: an even better p-curve” to this journal because it published two articles on the p-curve method that are highly cited. The key point of my ms. is to point out that the p-curve app produces a “power” estimate of 97% for hand-coded articles by Leif Nelson, while z-curve produces an estimate of 52%. If you are a quantitative scientist, you will agree that this is a non-trivial difference and you are right to ask which of these estimates is more credible. The answer is provided by simulation studies that compare p-curve and z-curve and show that p-curve can dramatically overestimate “power” when the data are heterogeneous (Brunner & Schimmack, 2020). In short, the p-curve app sucks. Let the record show that JEP-General is happy to get more citations for a flawed method. The reason might be that z-curve is able to show publication bias in the original articles published in JEP-General (Replicability Rankings). Maybe Timothy J. Pleskac is afraid that somebody looks at his z-curve, which shows a few too many p-values that are just significant (ODR = 73% vs. EDR = 45%).
Unfortunately for psychologists, statistics is an objective science that can be evaluated using either mathematical proofs (Brunner & Schimmack, 2020) or simulation studies (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). It is just hard for psychologists to follow the science if the science doesn’t agree with their positive illusions and inflated egos.
XGE-2021-3638 Z-curve 2.0: An Even Better P-Curve Journal of Experimental Psychology: General
Dear Dr. Schimmack,
I have received reviews of the manuscript entitled Z-curve 2.0: An Even Better P-Curve (XGE-2021-3638) that you recently submitted to Journal of Experimental Psychology: General. Upon receiving the paper I read the paper. I agree that Simonsohn, Nelson, & Simmons’ (2014) P-Curve paper has been quite impactful. As I read over the manuscript you submitted, I saw there was some potential issues raised that might help advance our understanding of how to evaluate scientific work. Thus, I asked two experts to read and comment on the paper. The experts are very knowledgeable and highly respected experts in the topical area you are investigating.
Before reading their reviews, I reread the manuscript, and then again with the reviews in hand. In the end, both reviewers expressed some concerns that prevented them from recommending publication in Journal of Experimental Psychology: General. Unfortunately, I share many of these concerns. Perhaps the largest issue is that both reviewers identified a number formal issues that need more development before claims can be made about the z-curve such as the normality assumptions in the paper. I agree with Reviewer 2 that more thought and work is needed here to establish the validity of these assumptions and where and how these assumptions break down. I also agree with Reviewer 1 that more care is needed when defining and working with the idea of unconditional power. It would help to have the code, but that wouldn’t be sufficient as one should be able to read the description of the concept in the paper and be able to implement it computationally. I haven’t been able to do this. Finally, I also agree with Reviewer 1 that any use of the p-curve should have a p-curve disclosure table. I would also suggest ways to be more constructive in this critique. In many places, the writing and approach comes across as attacking people. That may not be the intention. But, that is how it reads.
Given these concerns, I regret to report that I am declining this paper for publication in Journal of Experimental Psychology: General. As you probably know, we can accept only small fraction of the papers that are submitted each year. Accordingly, we must make decisions based not only on the scientific merit of the work but also with an eye to the potential level of impact for the findings for our broad and diverse readership. If you decide to pursue publication in another journal at some point (which I hope you will consider), I hope that the suggestions and comments offered in these reviews will be helpful.
Thank you for submitting your work to the Journal. I wish you the best in your continued research, and please try us again in the future if you think you have a manuscript that is a good fit for Journal of Experimental Psychology: General.
Timothy J. Pleskac, Ph.D. Associate Editor Journal of Experimental Psychology: General
Reviewer #1: 1. This commentary submitted to JEPG begins presenting a p-curve analysis of early work by Leif Nelson. Because it does not provide a p-curve disclosure table, this part of the paper cannot be evaluated. The first p-curve paper (Simonsohn et al, 2014) reads: “P-curve disclosure table makes p-curvers accountable for decisions involved in creating a reported p-curve and facilitates discussion of such decisions. We strongly urge journals publishing p-curve analyses to require the inclusion of a p-curve disclosure table.” (p.540). As a reviewer I am aligning with these recommendation and am *requiring* a p-curve disclosure table, as in, I will not evaluate that portion of the paper, and moreover I will recommend the paper be rejected unless that analysis is removed, or a p-curve disclosure table is included, and is then evaluated as correctly conducted by the review team in an subsequent round of evaluation. The p-curve disclosure table for the Russ et al p-curve, even if not originally conducted by these authors, should be included as well, with a statement that the authors of this paper have examined the earlier p-curve disclosure table and deemed it correct. If an error exists in the literature we have to fix it, not duplicate it (I don’t know if there is an error, my point is, neither do the authors who are using it as evidence).
2. The commentary then makes arguments about estimating conditional vs unconditional power. While not exactly defined in the article, the authors come pretty close to defining conditional power, I think they mean by it the average power conditional on being included in p-curve (ironically, if I am wrong about the definition, the point is reinforced). I am less sure about what they mean by unconditional power. I think they mean that they include in the population parameter of interest not only the power of the studies included in p-curve, but also the power of studies excluded from it, so ALL studies. OK, this is an old argument, dating back to at least 2015, it is not new to this commentary, so I have a lot to say about it.
First, when described abstractly, there is some undeniable ‘system 1’ appeal to the notion of unconditional power. Why should we restrict our estimation to the studies we see? Isn’t the whole point to correct for publication bias and thus make inferences about ALL studies, whether we see them or not? That’s compelling. At least in the abstract. It’s only when one continues thinking about it that it becomes less appealing. More concretely, what does this set include exactly? Does ‘unconditional power’ include all studies ever attempted by the researcher, does it include those that could have been run but for practical purposes weren’t? does it include studies run on projects that were never published, does it include studies run, found to be significant, but eventually dropped because they were flawed? Does it include studies for which only pilots were run but not with the intention of conducting confirmatory analysis? Does it include studies which were dropped because the authors lost interest in the hypothesis? Does it include studies that were run but not published because upon seeing the results the authors came up with a modification of the research question for which the previous study was no longer relevant? Etc etc). The unconditional set of studies is not a defined set, without a definition of the population of studies we cannot define a population parameter for it, and we can hardly estimate a non-existing parameter. Now. I don’t want to trivialize this point. This issue of the population parameter we are estimating is an interesting issue, and reasonable people can disagree with the arguments I have outlined above (many have), but it is important to present the disagreement in a way that readers understand what it actually entails. An argument about changing the population parameter we estimate with p-curve is not about a “better p-curve”, it is about a non-p-curve. 
A non-p-curve which is better for the subset of people who are interested in the unconditional power, but a WORSE p-curve for those who want the conditional power (for example, it is worse for the goals of the original p-curve paper). For example, the first paper using p-curve for power estimation reads “Here is an intuitive way to think of p-curve’s estimate: It is the average effect size one expects to get if one were to rerun all studies included in p-curve”. So a tool which does not estimate that value, but a different value, it is not better, it is different. The standard deviation is neither better nor worse than the mean. They are different. It would be silly to say “Standard Deviation, a better Mean (because it captures dispersion and the mean does not)”. The standard deviation is better for someone interested in dispersion, and the standard deviation is worse for someone interested in the central tendency. Exactly the same holds for conditional vs unconditional power. (well, the same if z-curve indeed estimated unconditional power, i don’t know if that is true or not. Am skeptical but open minded).
Second, as mentioned above, this distinction of estimating the parameter of the subset of studies included in p-curve vs the parameter of “all studies” is old. I think that argument is seen as the core contribution of this commentary, and that contribution is not close to novel. As the quote above shows, it is a distinction made already in the original p-curve paper for estimating power. And, it is also not new to see it as a shortcoming of p-curve analysis. Multiple papers by Van Assen and colleagues, and by McShane and colleagues, have made this argument. They have all critiqued p-curve on those same grounds.
I therefore think this discussion should improve in the following ways: (i) give credit, and give voice, to earlier discussions of this issue (how is the argument put forward here different from the argument put forward in about a handful of previous papers making it, some already 5 years ago), (ii) properly define the universe of studies one is attempting to estimate power for (i.e., what counts in the set of unconditional power), and (iii) convey more transparently that this is a debate about what is the research question of interest, not of which tool provides the better answer to the same question. Deciding whether one wants to estimate the average power of one or another set of studies is completely fair game of an issue to discuss, and if indeed most readers don’t think they care about conditional power, and those readers use p-curve not realizing that’s what they are estimating, it is valuable to disabuse them of their confusion. But it is not accurate, and therefore productive, to describe this as a statistical discussion, it is a conceptual discussion.
3. In various places the paper reports results from calculations, but the authors have not shared neither the code nor data for those calculations, so these results cannot be adequately evaluated in peer-review, and that is the very purpose of peer-review. This shortcoming is particularly salient when the paper relies so heavily on code and data shared in earlier published work.
Finally, it should be clearer what is new in this paper. What is said here that is not said in the already published z-curve paper and p-curve critique papers?
Reviewer #2: The paper reports a comparison between p-curve and z-curve procedures proposed in the literature. I found the paper to be unsatisfactory, and therefore cannot recommend publication in JEP:G. It reads more like a cropped section from the author’s recent piece in meta-psychology than a standalone piece that elaborates on the different procedures in detail. Because a lot is completely left out, it is very difficult to evaluate the results. For example, let us consider a couple of issues (this is not an exhaustive list):
– The z-curve procedure assumes that z-transformed p-values under the null hypothesis follow a standard Normal distribution. This follows from the general idea that the distribution of p-values under the null-hypothesis is uniform. However, this general idea is not necessarily true when p-values are computed for discrete distributions and/or composite hypotheses are involved. This seems like a point worth thinking about more carefully, when proposing a procedure that is intended to be applied to indiscriminate bodies of p-values. But nothing is said about this, which strikes me as odd. Perhaps I am missing something here.
– The z-curve procedure also assumes that the distribution of z-transformed p-values follows a Normal distribution or a mixture of homoskedastic Normals (distributions that can be truncated depending on the data being considered/omitted). But how reasonable is this parametric assumption? In their recently published paper, the authors state that this is as **a fact**, but provide no formal proof or reference to one. Perhaps I am missing something here. If anything, a quick look at classic papers on the matter, such as Hung et al. (1997, Biometrics), show that the cumulative distributions of p-values under different alternatives cross-over, which speaks against the equal-variance assumption. I don’t think that these questions about parametric assumptions are of secondary importance, given that they will play a major in the parameter estimates obtained with the mixture model.
Also, when comparing the different procedures, it is unclear whether the reported disagreements are mostly due to pedestrian technical choices when setting up an “app” rather than irreconcilable theoretical commitments. For example, there is nothing stopping one from conducting a p-curve analysis on a more fine-grained scale. The same can be said about engaging in mixture modeling. Who is/are the culprit/s here?
Finally, I found that the writing and overall tone could be much improved.
The hallmark of a science is progress. To demonstrate that psychology is a science therefore requires evidence that current evidence, research methods, and theories are better than those in the past. Historic reviews are also needed because it is impossible to make progress without looking back once in a while.
Research on the stability or consistency of personality has a long history that started with the first empirical investigations in the 1930s, but a historic review of this literature is lacking. Few young psychologists interested in personality development may be familiar with Kelly, his work, or his American Psychologist article on “Consistency of the Adult Personality” (Kelly, 1955). Kelly starts his article with some personal observations about stability and change in traits that he observed in colleagues over the years.
Today, we call traits that are neither physical characteristics, nor cognitive abilities, personality traits that are represented in the Big Five model. What have we learned about the stability of personality traits in adulthood from nearly a century of research?
Kelly (1955) reported some preliminary results from his own longitudinal study of personality that he started in the 1930s with engaged couples. Twenty years-later, they completed follow-up questionnaires. Figure 6 reported the results for the Allport-Vernon value scales. I focus on these results because they make it possible to compare the retest-correlations to retest-correlations over a one-year period.
Figure 6 shows that personality, or at least values, are not perfectly stable. This is easily seen by a comparison of the one-year retest correlations with the 20-year retest correlations. The 20-year retest correlations are always lower than the one-year retest correlations. Individual differences in values change over time. Some individuals become more religious and others become less religious, for example. The important question is how much individuals change over time. To quantify change and stability it is important to specify a time interval because change implies lower retest correlations over longer retest intervals. Although the interval is arbitrary, a period of 1 year or 10 years can be used to quantify and compare stability and change of different personality traits. To do so, we need a model of change over time. A simple model is Heise’s (1969) autoregressive model that assumes a constant rate of change.
Take religious values as an example. Here we have two observed retest correlations, the 1-year retest correlation, r(y1) = .75, and the 20-year retest correlation, r(y20) = .60. Both correlations are attenuated by random measurement error. To correct for unreliability, we need to solve two equations with two unknowns, the rate of change and reliability.
.75 = rate^1 * rel
.60 = rate^20 * rel
With some rusty high-school math, I was able to solve these equations for the rate by dividing the second equation by the first, which cancels the reliability.
rate = (.60/.75)^(1/(20-1)) = .988
The implied 10-year stability is .988^10 = .886. The estimated reliability is .75 / .988 = .759.
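The two-equation solution can be sketched in a few lines of Python. This is only an illustration under the constant-rate assumption described above; the function name and its arguments are hypothetical and do not come from Kelly (1955) or Heise (1969).

```python
def solve_rate_and_reliability(r_short, t_short, r_long, t_long):
    """Solve observed_r = rate**years * reliability for two retest intervals.

    Hypothetical helper for illustration. Assumes a constant annual rate of
    change and the same reliability at both retest intervals.
    """
    if t_long <= t_short:
        raise ValueError("t_long must be greater than t_short")
    if not (0 < r_long <= r_short <= 1):
        raise ValueError("expected 0 < r_long <= r_short <= 1")
    # Dividing the long-interval equation by the short-interval equation
    # cancels the reliability and leaves rate**(t_long - t_short).
    rate = (r_long / r_short) ** (1.0 / (t_long - t_short))
    # Plug the rate back into the short-interval equation.
    reliability = r_short / rate ** t_short
    return rate, reliability


# Religious values: 1-year retest r = .75, 20-year retest r = .60
rate, reliability = solve_rate_and_reliability(0.75, 1, 0.60, 20)
ten_year_stability = rate ** 10
print(f"1-year rate: {rate:.3f}")
print(f"reliability: {reliability:.3f}")
print(f"10-year stability: {ten_year_stability:.3f}")
```

Carrying the unrounded rate forward gives a 10-year stability of about .89 rather than .886, because the text raises the rounded value .988 to the tenth power.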
Table 1 shows the results for all six values.
Table 1 Stability and Change of Allport-Vernon Values
The results show that the 1-year retest correlations are very similar to the reliability estimates of the value measure. After correcting for unreliability the 1-year stability is extremely high with stability estimates ranging from .96 for social values to .99 for religious values. The small differences in 1-year stabilities become only notable over longer time periods. The estimated 10-year stability estimates range from .68 for social values to .90 for religious values.
Kelly reported results for two personality constructs that were measured with the Bernreuter personality questionnaire, namely self-confidence and sociability.
The implied stability of these personality traits is similar to the stability of values.
Kelly’s results published in 1955 are based on a selective sample during a specific period of time that included the Second World War. It is therefore possible that studies with other populations during other time periods produce different results. However, the results of different studies are more similar than different.
The first article with retest correlations for different time intervals of reasonable length was published in 1941 by Mason N. Crook. The longest retest interval was 6-years and six months. Figure 1a in the article plotted the retest correlations as a function of the retest interval.
Table 2 shows the retest correlations and reveals that some of them are based on extremely small sample sizes. The 5-month retest is based on only 30 participants whereas the 8-month retest is based on 200 participants. Using this estimate for the short-term stability, it is possible to estimate the 1-year and 10-year rates using the formula given above.
The 1-year stability estimates are all above .9, except for the retest correlation that is based on only N = 18 participants. Given the small sample sizes, variability in estimates is mostly random noise. I computed a weighted average that takes both sample size and retest interval into account because longer time intervals provide better information about the actual rate of change. The estimated 1-year stability is r = .96, which implies a 10-year stability of .65. This is a bit lower than Kelly’s estimates, but this might just be sampling error. It is also possible that Crook’s results underestimate long-term stability because the model assumes a constant rate of change. It is possible that this assumption is false, as we will see later.
Crook also provided a meta-analysis that included other studies and suggested a hierarchy of consistency.
Accordingly, personality traits like neuroticism are less stable than cognitive abilities, but more stable than attitudes. As the Figure shows, empirical support for this hierarchy was limited, especially for estimates of the stability of attitudes.
Several decades later, Conley (1984) reexamined this hierarchy of consistency with more data. He was also the first to provide quantitative stability estimates that correct for unreliability. The meta-analysis included more studies and, more importantly, studies with long retest intervals. The longest retest interval was 45 years (Conley, 1983). After correcting for unreliability, the one-year stability was estimated to be r = .98, which implies a stability of r = .81 over a period of 10 years and r = .36 over 50 years.
Using the published retest correlations for studies with sample sizes greater than 100, I obtained a one-year stability estimate of r = .969 for neuroticism and r = .986 for extraversion. These differences may reflect differences in stability or could just be sampling error. The average reproduces Conley’s (1984) estimate of r = .98 (r = .978).
To summarize, decades of research had produced largely consistent findings that the short-term (1-year) stability of personality traits is well above r = .9 and that it takes long time-periods to observe substantial changes in personality.
The next milestone in the history of research on personality stability and change was Roberts and DelVecchio’s (2000) influential meta-analysis that is featured in many textbooks and review articles (e.g., Caspi, Roberts, & Shiner, 2005; McAdams & Olson, 2010).
Roberts and DelVecchio’s literature review mentions Conley’s (1984) key findings. “When dissattenuated, measures of extraversion were quite consistent, averaging .98 over a 1-year period, approximately .70 over a 10-year period, and approximately .50 over a 40-year period” (p. 7).
The key finding of Roberts and DelVecchio’s meta-analysis was that age moderates stability of personality. As shown in Figure 1, stability increases with age. The main limitation of Figure 1 is that it shows average retest correlations that are neither tied to a specific time interval nor corrected for measurement error. Thus, the finding that retest correlations in early and middle adulthood (22-49) average around .6 provides no information about the stability of personality in this age group.
Most readers of Roberts and DelVecchio (2000) fail to notice a short section that examines the influence of time interval on retest correlations.
“On the basis of the present data, the average trait consistency over a 1-year period would be .55; at 5 years, it would be .52; at 10 years, it would be .49; at 20 years, it would be .41; and at 40 years, it would be .25” (Roberts & DelVecchio, 2000, p. 16).
Using the aforementioned formula to correct for measurement error shows that Roberts and DelVecchio’s meta-analysis replicates Conley’s results, 1-year r = .983.
Unfortunately, review articles often mistake these observed retest correlations for estimates of stability. For example, McAdams and Olson (2010) write “Roberts & DelVecchio (2000) determined that stability coefficients for dispositional traits were lowest in studies of children (averaging 0.41), rose to higher levels among young adults (around 0.55), and then reached a plateau for adults between the ages of 50 and 70 (averaging 0.70)” (p. 521) and fail to mention that these stability coefficients are not corrected for measurement error, which is a common mistake (Schmidt, 1996).
Roberts and DelVecchio’s (2000) article has shaped contemporary views that personality is much more malleable than the data suggest. A twitter poll showed that only 11% of respondents guessed the right answer that the one-year stability is above .9, whereas 43% assumed the upper limit is r = .7. With r = .7 over a 1-year period, the stability over a 10-year period would only be r = .03. Thus, these respondents essentially assumed that personality has no stability over a 10-year period. More likely, respondents simply failed to take into account how high short-term stability has to be to allow for moderately high long-term stability.
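The compounding argument can be made concrete with a small Python sketch. It assumes the constant-rate model used throughout this post; both function names are hypothetical.

```python
def stability_over(years, one_year_stability):
    """Implied retest correlation after `years`, assuming a constant annual rate."""
    return one_year_stability ** years


def required_one_year_stability(target, years):
    """One-year stability needed to reach `target` stability after `years`."""
    return target ** (1.0 / years)


# A one-year stability of .7 leaves almost nothing after a decade.
print(f"{stability_over(10, 0.70):.2f}")
# A 10-year stability of .7 requires a one-year stability well above .9.
print(f"{required_one_year_stability(0.70, 10):.3f}")
```

The second call shows why intuitions fail here: even a moderate 10-year stability of .7 requires a one-year stability of roughly .96.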
The misinformation about personality stability is likely due to vague, verbal statements and the use of effect sizes that ignore the length of the retest interval. For example, Atherton, Grijalva, Roberts, and Robins (2021) published an article with a retest interval of 18-years. The abstract describes the results as “moderately-to-high stability over a 20-year period” (p. 841). Table 1 reports the observed correlations that control for random measurement error using a latent variable model with item-parcels as indicators.
The next table shows the results for the 4-year retest interval in adolescence and the 20-year retest interval in adulthood along with the implied 1-year rates. Consistent with Roberts and DelVecchio’s meta-analysis, the 1-year stability in adolescence is lower, r = .908, than in adulthood, r = .976.
However, even in adolescence the 1-year stability is high. Most important, the 1-year rate for adults is consistent with estimates in Conley’s (1984) meta-analysis and the first study in 1941 by Crook, and even Roberts and DelVecchio’s meta-analysis when measurement error is taken into account. However, Atherton et al. (2021) fail to cite historic articles and fail to mention that their results replicate nearly a century of research on personality stability in adulthood.
Stable Variance in Personality
So far, I have used a model that assumes a fixed rate of change. The model also assumes that there are no stable influences on personality. That is, all causes of variation in personality can change and given enough time will change. This model implies that retest correlations eventually approach zero. The only reason why this may not happen is that human lives are too short to observe retest correlations of zero. For example, with r = .98 over a 1-year period, the 100-year retest correlation is still r = .13, but the 200-year retest correlation is r = .02.
With more than two retest intervals, it is possible to see that this model may not fit the data. If there is no measurement error, the correlation from t1 to t3 should equal the product of the two lags from t1 to t2 and from t2 to t3. If the t1-t3 correlation is larger than this model predicts, the data suggest the presence of some stable causes that do not change over time (Anusic & Schimmack, 2016; Kenny & Zautra, 1995).
Take the data from Atherton et al. (2021) as an example. The average retest correlation from t1 (beginning of college) to t3 (age 40) was r = .55. The correlation from beginning to end of college was r = .68, and the correlation from end of college to age 40 was r = .62. We see that .55 > .68 * .62 = .42.
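The comparison can be written as a short Python check. This is a minimal sketch of the logic only, not a fit of the trait-state models cited above; the function name is hypothetical, and the correlations are treated as free of measurement error, as the argument requires.

```python
import math


def autoregressive_prediction(*adjacent_correlations):
    """Predicted t1-to-last correlation if change is purely autoregressive.

    Without stable causes and without measurement error, the long-lag
    correlation equals the product of the adjacent-lag correlations.
    """
    return math.prod(adjacent_correlations)


observed_t1_t3 = 0.55                                 # beginning of college to age 40
predicted_t1_t3 = autoregressive_prediction(0.68, 0.62)
excess = observed_t1_t3 - predicted_t1_t3

print(f"predicted: {predicted_t1_t3:.2f}, observed: {observed_t1_t3:.2f}")
print(f"excess over the autoregressive prediction: {excess:.2f}")
```

A positive excess is the pattern that points to stable causes; how much stable variance it implies depends on the model that is fitted to the full set of retest correlations.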
Anusic and Schimmack (2016) estimated the amount of stable variance in personality traits to be over 50%. This estimate may be revised in the future when better data become available. However, models with and without stable causes differ mainly in predictions over long time intervals where few data are currently available. The modeling has little influence on estimates of stability over time periods of less than 10 years.
This historic review of research on personality change and stability demonstrated that nearly a century of research has produced consistent findings. Unfortunately, many textbooks misrepresent this literature and cite evidence that does not correct for measurement error.
In their misleading, but influential meta-analysis, Roberts and DelVecchio concluded that “the average trait consistency over a 1-year period would be .55; at 5 years, it would be .52; at 10 years, it would be .49; at 20 years, it would be .41; and at 40 years, it would be .25” (p. 16).
The estimates that are corrected for measurement error are much higher. The present results suggest consistency over 1 year would be .98, at 5 years it would be .90, at 10 years it would be .82, at 20 years it would be .67, and at 40 years it would be .45. Long-term stability might even be higher if stable causes contribute substantially to variance in personality (Anusic & Schimmack, 2016).
The evidence of high stability in personality (yes, I think r = .8 over 10-years warrants the label high) has important practical and theoretical implications. First of all, stability of personality in adulthood is one of the few facts that students at the beginning of adulthood may find surprising. It may stimulate self-discovery and taking personality into account in major life decisions. Stability of personality also means that personality psychologists need to focus on the factors that cause stability in personality, but psychologists have traditionally focused on change because statistical tools are designed to focus on differences and deviations rather than invariances. However, just because the Earth is round or the speed of light is constant, natural sciences do not ignore these fixtures of life. It is time for personality psychologists to do the same. The results also have a (sobering) message for researchers interested in personality change. Real change takes time. Even a decade is a relatively short period to observe notable changes which is needed to find predictors of change. This may explain why there are currently no replicable findings of predictors of personality change.
So, what is the stability of personality over a one-year period in adulthood after taking measurement error into account? The correct answer is that it is greater than .9. You probably didn’t know this before reading this blog post. This does not, of course, mean that we are still the same person after one year or 10 years. However, the broader dispositions that are measured with the Big Five are unlikely to change in the near future for you, your spouse, or co-workers. Whether this is good or bad news depends on you.
Many models of science postulate a feedback loop between theories and data. Theories stimulate research that tests theoretical models. When the data contradict the theory and nobody can find flaws with the data, theories are revised to accommodate the new evidence. In reality, many sciences do not follow this idealistic model. Instead of testing theories, researchers try to accumulate evidence that supports their theories. In addition, evidence that contradicts the theory is ignored. As a result, theories never develop. These degenerative theories have been called paradigms. Psychology is filled with paradigms. One paradigm is the personality development paradigm. Accordingly, personality changes throughout adulthood towards the personality of a mature adult (emotionally stable, agreeable, and conscientious; Caspi, Roberts, & Shiner, 2005).
Many findings contradict this paradigm, but these findings are often ignored by personality development researchers. For example, a recent article on personality development (Zimmermann et al., 2021) claims that there is broad evidence for substantial rank-order and mean-level changes citing outdated references from 2000 (Roberts & DelVecchio, 2000) and 2006 (Roberts et al., 2006). It is not difficult to find more recent studies that challenge these claims based on newer evidence and better statistical analyses (Anusic & Schimmack, 2016; Costa et al., 2019). It is symptomatic of a paradigm that findings that do not fit the personality development paradigm are ignored.
Another symptom of paradigmatic research is that interpretations of research findings do not fit the data. Zimmermann et al. (2021) conducted an impressive study of N = 3,070 students’ personality over the course of a semester. Some of these students stayed at their university and others went abroad. The focus of the article was to examine the potential influence of spending time abroad on personality. The findings are summarized in Table 1.
The key prediction of the personality development paradigm is that neuroticism decreases with age and that agreeableness and conscientiousness increase with age. This trend might be accelerated by spending time abroad, but it is also predicted for students who stay at their university (Robins et al., 2001).
The data do not support this prediction. In the two control groups, neither conscientiousness (d = -.11, d = -.02) nor agreeableness increased (d = -.02, .00) and neuroticism increased (d = .08, .02). The group of students who were waiting to go abroad, but also stayed during the study period also showed no increase in conscientiousness (d = -.22, -.02) or agreeableness (d = -.16, .00), but showed a small decrease in neuroticism (d = -.08, -.01). The group that went abroad showed small increases in conscientiousness (d = .03, .09) and agreeableness (d = .14, .00), and a small decrease in neuroticism (d = -.14, d = .00). All of these effect sizes are very small, which may be due to the short time period. A semester is simply too short to see notable changes in personality.
These results are then interpreted as being fully consistent with the personality development paradigm.
A more accurate interpretation of these findings is that the effects of spending a semester abroad on personality are very small (d ~ .1) and that a semester is too short to discover changes in personality traits. The small effect sizes in this study are not surprising given the finding that even changes over a decade are no larger than d = .1 (Graham et al., 2020; also not cited by Zimmermann et al., 2021).
In short, the personality development paradigm is based on the assumption that personality changes substantially. However, empirical studies show much stronger evidence of stability than of change, and this evidence is often not cited by prisoners of the personality development paradigm. It is therefore necessary to fact check articles on personality development because the abstracts and discussion sections often do not match the data.
It was relatively quiet on academic twitter when most academics were enjoying the last weeks of summer before the start of a new, new-normal semester. This changed on August 17, when the datacolada crew published a new blog post that revealed fraud in a study of dishonesty (http://datacolada.org/98). Suddenly, the integrity of social psychology was once again discussed on twitter, in several newspaper articles, and an article in Science magazine (O’Grady, 2021). The discovery of fraud in one dataset raises questions about other studies in articles published by the same researcher as well as in social psychology in general (“some researchers are calling Ariely’s large body of work into question”; O’Grady, 2021).
The brouhaha about the discovery of fraud is understandable because fraud is widely considered an unethical behavior that violates standards of academic integrity and may end a career (e.g., Stapel). However, there are many other reasons to be suspicious of the credibility of Dan Ariely’s published results and those by many other social psychologists. Over the past decade, strong scientific evidence has accumulated that social psychologists’ research practices were inadequate and often failed to produce solid empirical findings that can inform theories of human behavior, including dishonest ones.
Arguably, the most damaging finding for social psychology was the finding that only 25% of published results could be replicated in a direct attempt to reproduce original findings (Open Science Collaboration, 2015). With such a low base-rate of successful replications, all published results in social psychology journals are likely to fail to replicate. The rational response to this discovery is to not trust anything that is published in social psychology journals unless there is evidence that a finding is replicable. Based on this logic, the discovery of fraud in a study published in 2012 is of little significance. Even without fraud, many findings are questionable.
Questionable Research Practices
The idealistic model of a scientist assumes that scientists test predictions by collecting data and then let the data decide whether the prediction was true or false. Articles are written to follow this script with an introduction that makes predictions, a results section that tests these predictions, and a conclusion that takes the results into account. This format makes articles look like they follow the ideal model of science, but it only covers up the fact that actual science is produced in a very different way; at least in social psychology before 2012. Either predictions are made after the results are known (Kerr, 1998) or the results are selected to fit the predictions (Simmons, Nelson, & Simonsohn, 2011).
This explains why most articles in social psychology support authors’ predictions (Sterling, 1959; Sterling et al., 1995; Motyl et al., 2017). This high success rate is not the result of brilliant scientists and deep insights into human behaviors. Instead, it is explained by selection for (statistical) significance. That is, when a study produces a statistically significant result that can be used to claim support for a prediction, researchers write a manuscript and submit it for publication. However, when the result is not significant, they do not write a manuscript. In addition, researchers will analyze their data in multiple ways. If they find one way that supports their predictions, they will report this analysis, and not mention that other ways failed to show the effect. Selection for significance has many names such as publication bias, questionable research practices, or p-hacking. Excessive use of these practices makes it easy to provide evidence for false predictions (Simmons, Nelson, & Simonsohn, 2011). Thus, the end-result of using questionable practices and fraud can be the same; published results are falsely used to support claims as scientifically proven or validated, when they actually have not been subjected to a real empirical test.
Although questionable practices and fraud have the same effect, scientists make a hard distinction between fraud and QRPs. While fraud is generally considered to be dishonest and punished with retractions of articles or even job losses, QRPs are tolerated. This leads to the false impression that articles that have not been retracted provide credible evidence and can be used to make scientific arguments (studies show ….). However, QRPs are much more prevalent than outright fraud and account for the majority of replication failures, but do not result in retractions (John, Loewenstein, & Prelec, 2012; Schimmack, 2021).
The good news is that the use of QRPs is detectable even when original data are not available, whereas fraud typically requires access to the original data to reveal unusual patterns. Over the past decade, my collaborators and I have worked on developing statistical tools that can reveal selection for significance (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020; Schimmack, 2012). I used the most advanced version of these methods, z-curve.2.0, to examine the credibility of results published in Dan Ariely’s articles.
To examine the credibility of results published in Dan Ariely’s articles I followed the same approach that I used for other social psychologists (Replicability Audits). I selected articles based on authors’ H-Index in WebOfKnowledge. At the time of coding, Dan Ariely had an H-Index of 47; that is, he published 47 articles that were cited at least 47 times. I also included the 48th article that was cited 47 times. I focus on the highly cited articles because dishonest reporting of results is more harmful, if the work is highly cited. Just like a falling tree may not make a sound if nobody is around, untrustworthy results in an article that is not cited have no real effect.
For all empirical articles, I picked the most important statistical test per study. The coding of focal results is important because authors may publish non-significant results when they made no prediction. They may also publish a non-significant result when they predict no effect. However, most claims are based on demonstrating a statistically significant result. The focus on a single result is needed to ensure statistical independence which is an assumption made by the statistical model. When multiple focal tests are available, I pick the first one unless another one is theoretically more important (e.g., featured in the abstract). Although this coding is subjective, other researchers including Dan Ariely can do their own coding and verify my results.
Thirty-one of the 48 articles reported at least one empirical study. As some articles reported more than one study, the total number of studies was k = 97. Most of the results were reported with test-statistics like t, F, or chi-square values. These values were first converted into two-sided p-values and then into absolute z-scores. 92 of these z-scores were statistically significant and used for a z-curve analysis.
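The second conversion step, from a two-sided p-value to an absolute z-score, can be sketched with the Python standard library. This is an illustrative sketch rather than the code used for the audit; the function name is hypothetical, and the first step (turning a t, F, or chi-square value into a p-value) is left out because it requires the corresponding distribution functions from a statistics package.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_to_abs_z(p_two_sided):
    """Convert a two-sided p-value into an absolute z-score."""
    if not (0 < p_two_sided <= 1):
        raise ValueError("p-value must be in (0, 1]")
    # Half of the two-sided p-value lies in each tail. Working in the lower
    # tail keeps precision for very small p-values.
    return abs(STANDARD_NORMAL.inv_cdf(p_two_sided / 2))


for p in (0.05, 0.01, 0.001):
    print(f"p = {p}: z = {p_to_abs_z(p):.2f}")
```

Converting every test statistic to the same metric is what makes it possible to plot results from t-tests, F-tests, and chi-square tests in a single distribution.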
The key results of the z-curve analysis are captured in Figure 1.
Visual inspection of the z-curve plot shows clear evidence of selection for significance. While a large number of z-scores are just statistically significant (z > 1.96 equals p < .05), there are very few z-scores that are just shy of significance (z < 1.96). Moreover, the few z-scores that do not meet the standard of significance were all interpreted as sufficient evidence for a prediction. Thus, Dan Ariely’s observed success rate is 100% or 95% if only p-values below .05 are counted. As pointed out in the introduction, this is not a unique feature of Dan Ariely’s articles, but a general finding in social psychology.
A formal test of selection for significance compares the observed discovery rate (95% of z-scores are greater than 1.96) to the expected discovery rate that is predicted by the statistical model. The prediction of the z-curve model is illustrated by the blue curve. Based on the distribution of significant z-scores, the model expected a lot more non-significant results. The estimated expected discovery rate is only 15%. Even though this is just an estimate, the 95% confidence interval around this estimate ranges from 5% to only 31%. Thus, the observed discovery rate is clearly much higher than one could expect. In short, we have strong evidence that Dan Ariely and his co-authors used questionable practices to report more successes than their actual studies produced.
Although these results cast a shadow over Dan Ariely’s articles, there is a silver lining. It is unlikely that the large pile of just significant results was obtained by outright fraud; not impossible, but unlikely. The reason is that QRPs are bound to produce just significant results, but fraud can produce extremely high z-scores. The fraudulent study that was flagged by datacolada has a z-score of 11, which is virtually impossible to produce with QRPs (Simmons et al., 2011). Thus, while we can disregard many of the results in Ariely’s articles, he does not have to fear losing his job (unless more fraud is uncovered by data detectives). Ariely is also in good company. The expected discovery rate for John A. Bargh is 15% (Bargh Audit) and the one for Roy F. Baumeister is 11% (Baumeister Audit).
The z-curve plot also shows some z-scores greater than 3 or even greater than 4. These z-scores are more likely to reveal true findings (unless they were obtained with fraud) because (a) it gets harder to produce high z-scores with QRPs and (b) replication studies show higher success rates for original studies with strong evidence (Schimmack, 2021). The problem is to find a reasonable criterion to distinguish between questionable results and credible results.
Z-curve makes it possible to do so because the EDR estimates can be used to estimate the false discovery risk (Schimmack & Bartos, 2021). As shown in Figure 1, with an EDR of 15% and a significance criterion of alpha = .05, the false discovery risk is 30%. That is, up to 30% of results with p-values below .05 could be false positive results. The false discovery risk can be reduced by lowering alpha. Figure 2 shows the results for alpha = .01. The estimated false discovery risk is now below 5%. This large reduction in the FDR was achieved by treating the pile of just significant results as no longer significant (i.e., it is now on the left side of the vertical red line that reflects significance with alpha = .01, z = 2.58).
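The step from the EDR to the false discovery risk can be sketched in Python. The sketch assumes that the false discovery risk is Sorić's (1989) upper bound on the false discovery rate, which reproduces the 30% reported above; the function name is hypothetical and this is not the z-curve package itself.

```python
def false_discovery_risk(edr, alpha=0.05):
    """Upper bound on the false discovery rate (assumed: Soric's formula).

    edr   : expected discovery rate, the share of all tests that are significant
    alpha : significance criterion
    """
    if not (0 < edr <= 1) or not (0 < alpha < 1):
        raise ValueError("edr must be in (0, 1] and alpha in (0, 1)")
    bound = (1.0 / edr - 1.0) * alpha / (1.0 - alpha)
    return min(bound, 1.0)  # a proportion cannot exceed 100%


# Point estimate and the limits of the 95% confidence interval of the EDR
for edr in (0.15, 0.05, 0.31):
    print(f"EDR = {edr:.0%}: false discovery risk = {false_discovery_risk(edr):.0%}")
```

The risk for alpha = .01 is not computed here because the EDR has to be re-estimated with the new criterion, as in Figure 2, before the formula can be applied.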
With the new significance criterion only 51 of the 97 tests are significant (53%). Thus, it is not necessary to throw away all of Ariely’s published results. About half of his published results might have produced some real evidence. Of course, this assumes that z-scores greater than 2.58 are based on real data. Any investigation should therefore focus on results with p-values below .01.
The final information that is provided by a z-curve analysis is the probability that a replication study with the same sample size produces a statistically significant result. This probability is called the expected replication rate (ERR). Figure 1 shows an ERR of 52% with alpha = 5%, but it includes all of the just significant results. Figure 2 excludes these studies, but uses alpha = 1%. Figure 3 estimates the ERR only for studies that had a p-value below .01 but using alpha = .05 to evaluate the outcome of a replication study.
In Figure 3 only z-scores greater than 2.58 (p = .01; on the right side of the dotted blue line) are used to fit the model using alpha = .05 (the red vertical line at 1.96) as criterion for significance. The estimated replication rate is 85%. Thus, we would predict mostly successful replication outcomes with alpha = .05, if these original studies were replicated and if the original studies were based on real data.
The discovery of a fraudulent dataset in a study on dishonesty has raised new questions about the credibility of social psychology. Meanwhile, the much bigger problem of selection for significance is neglected. Rather than treating studies as credible unless they are retracted, it is time to distrust studies unless there is evidence to trust them. Z-curve provides one way to assure readers that findings can be trusted by keeping the false discovery risk at a reasonably low level, say below 5%. Applying this method to Ariely’s most cited articles showed that nearly half of Ariely’s published results can be discarded because they entail a high false positive risk. This is also true for many other findings in social psychology, but social psychologists try to pretend that the use of questionable practices was harmless and can be ignored. Instead, undergraduate students, readers of popular psychology books, and policy makers may be better off by ignoring social psychology until social psychologists report all of their results honestly and subject their theories to real empirical tests that may fail. That is, if social psychology wants to be a science, social psychologists have to act like scientists.
The German term for development is Entwicklung and evokes the image of a blossom slowly unwrapping its petals. This process has a start and a finish. At some point the blossom is fully open. Similarly, human development has a clear start with conception and usually an end when an individual becomes an adult. Not surprisingly, developmental psychology initially focused on the first two decades of a human life.
At some point, developmental psychologists also started to examine the influence of age at the end of life. Here, the focus was on successful aging in the face of biological decline. The idea of development at the beginning of life and decline at the end of life is consistent with the circle of life that is observed in nature.
In contrast to the circular conception of life, some developmental psychologists propose that some psychological processes continue to develop throughout adulthood. The idea of life-long development or growth makes the most sense for psychological processes that depend on learning. Over the life course, individuals acquire knowledge and skills. Although practice or the lack thereof may influence performance, individuals with a lot of experience are able to build on their past experiences.
Personality psychologists have divergent views about the development of personality. Some assume that personality is like many other biological traits. They develop during childhood when the brain establishes connections. However, when this process is completed, personality remains fairly stable. Moreover, new experiences may still change neural patterns and personality, but these changes will be idiosyncratic and differ from person to person. These theories do not predict a uniform increase in some personality traits during adulthood.
An alternative view is that we can distinguish between immature and mature personalities and that personality changes towards a goal of the completely mature personality, akin to the completely unfolded blossom. Moreover, this process of personality development or maturation does not end at the end of childhood. Rather, it is a lifelong process that continues over the adult life-span. Accordingly, personality becomes more mature as individuals are getting older.
What is a Mature Personality?
The notion of personality development during adulthood implies that some personality traits are more mature than others. After all, developmental processes have an end goal and the end goal is the mature state of being.
However, it is difficult to combine the concepts of personality and development because personality implies variation across individuals, just like there is variation across different types of flowers in terms of the number, shape, and color of petals. Should we say that a blossom with more petals is a better blossom? Which shape or color would reflect a better blossom? The answer is that there is no optimal blossom. All blossoms are mature when they are completely unfolded, but this mature state can look very different for different flowers.
Some personality psychologists have not really solved this problem, but rather used the notion of personality development as a label for any personality changes irrespective of direction. “The term ‘personality development’, as used in this paper, is mute with regard to direction of change. This means that personality development is not necessarily positive change due to functional adjustment, growth or maturation” (Specht et al., 2014, p. 217). While it is annoying that researchers may falsely use the term development when they mean change, it does absolve the researchers from specifying a developmental theory of personality development.
However, others take the notion of a mature personality more seriously (e.g., Hogan & Roberts, 2004, see also Specht et al., 2014). Accordingly, “a mature person from the observer’s viewpoint would be agreeable (supportive and warm), emotionally stable (consistent and positive), and conscientious (honoring commitments and playing by the rules)” (Hogan & Roberts, 2008, p. 9). According to this conception of a mature personality, the goal of personality development is to achieve a low level of neuroticism and high levels of agreeableness and conscientiousness.
Another problem for personality development theories is the existence of variation in mature traits in adulthood. If agreeableness, conscientiousness, and emotional stability are so useful in adult life, it is not clear why some individuals are biologically disposed to have low levels of these traits. The main explanation for variability in traits is that there are trade-offs and that neither extreme is optimal. For example, too much conscientiousness may lead to over-regulated behaviors that are not adaptive when life changes and being too agreeable makes individuals vulnerable to exploitation. In contrast, developmental theories imply that individuals with high levels of neuroticism and low levels of agreeableness or conscientiousness are not fully developed and would have to explain why some individuals do not achieve maturity.
Developmental processes also tend to have a specified time for the process to be completed. For example, flowers blossom at a specified time of year that is optimal for pollination. In humans, sexual development is completed by the end of adolescence to enable reproduction. So, it is reasonable to ask why development of personality should not also have a normal time of completion. If maturity is required to take on the tasks of an adult, including having children and taking care of them, the process should be completed during early adulthood, so that these traits are fully developed when they are needed. It would therefore make sense to assume that most of the development is completed by age 20 or at least age 30, as proposed by Costa and McCrae (cf. Specht et al., 2014). It is not clear why maturation would still occur in middle age or old age.
One possible explanation for late development could be that some individuals have a delayed or “arrested” development. Maybe some environmental factors impede the normal process of development, but the causal forces persist and can still produce the normative change later in adulthood. Another possibility is that personality development is triggered by environmental events. Maybe having children or getting married are life events that trigger personality development in the same way men’s testosterone levels appear to decrease when they enter long-term relationships and have children.
In short, a theory of lifelong development faces some theoretical challenges and alternative predictions about personality in adulthood are possible.
Wrzus and Roberts (2017) claim that agreeableness, conscientiousness, and emotional stability increase from young to middle adulthood citing Roberts et al. (2006), Roberts & Mroczek (2008), and Lucas and Donnellan (2011). They also propose that these changes co-occur with life transitions citing Bleidorn (2012, 2015), Le, Donnellan, & Conger (2014), Lodi-Smith & Roberts (2012), Specht, Egloff, and Schmukle (2011) and Zimmermann and Neyer (2013). A causal role of life events is implied by the claim that mean levels of the traits decrease in old age (Berg & Johansson, 2014; Kandler, Kornadt, Hagemeyer, & Neyer, 2015; Lucas & Donnellan, 2011; Mottus, Johnson, Starr, & Neyer, 2012). Focusing on work experiences, Asselmann and Specht (2020) propose that conscientiousness increases when people enter the workforce and decreases again at the time of retirement.
A recent review article by Costa, McCrae, and Lockenhoff (2019) also suggests that neuroticism decreases and agreeableness and conscientiousness increase over the adult life-span. However, they also point out that these age-trends are “modest.” They suggest that traits change by about one T-score per decade, which is a standardized mean difference of less than .2 standard deviations per decade. However, this effect size implies that changes may be as large as 1 standard deviation from age 20 to age 70.
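The arithmetic behind these figures can be checked with a small sketch. It assumes the conventional T-score metric (mean 50, SD 10), so that one T-score point equals 0.1 standard deviations, and it assumes a constant linear rate of change, which is a simplification. The function name is illustrative and not taken from the review.

```python
T_SCORE_SD = 10.0  # assumption: conventional T-score scaling (mean 50, SD 10)


def cumulative_change_sd(sd_per_decade, start_age, end_age):
    """Total change in SD units under a constant (linear) rate of change."""
    decades = (end_age - start_age) / 10.0
    return sd_per_decade * decades


# One T-score point per decade, expressed in standard deviation units.
one_t_point = 1.0 / T_SCORE_SD

# Change from age 20 to age 70 at one T-score point per decade.
print(round(cumulative_change_sd(one_t_point, 20, 70), 2))

# Change over the same span at the upper bound of .2 SD per decade.
print(round(cumulative_change_sd(0.2, 20, 70), 2))
```

Under these assumptions, one T-score point per decade adds up to half a standard deviation over five decades, and the figure of 1 standard deviation corresponds to the upper bound of .2 standard deviations per decade.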
More recently, Graham et al. (2020) summarized the literature with the claim that “during the emerging adult and midlife years, agreeableness, conscientiousness, openness, and extraversion tend to increase and neuroticism tends to decrease” (p. 303). However, when they conducted an integrated analysis of 16 longitudinal studies, the results were rather different. Most importantly, agreeableness did not increase. The combined effect was b = .02, with a 95%CI that included zero, b = -.02 to .07. Despite the lack of evidence that agreeableness increases with age during adulthood, the authors “tentatively suggest that agreeableness may increase over time” (p. 312).
The results for conscientiousness are even more damaging for the maturation theory. Here most datasets show a decrease in conscientiousness and the average effect size is statistically significant, b = -.05, 95%CI = -.09 to -.02. However, the effect size is small, suggesting that there is no notable age trend in conscientiousness.
The only trait that showed the predicted age-trend was neuroticism, but the effect size was again small and the upper bound of the 95%CI was close to zero, b = -.05, 95%CI = -.09 to -.01.
In sum, recent evidence from several longitudinal studies challenges the claim that personality develops during adulthood. However, longitudinal studies are often limited by rather short time-intervals of a few years up to one decade. If effect sizes over one decade are small, they can be easily masked by method artifacts (Costa et al., 2019). Although cross-sectional studies have their own problems, they have the advantage that it is much easier to cover the full age-range of adulthood. The key problem in cross-sectional studies is that age-effects can be confounded with cohort effects. However, when multiple cross-sectional studies from different survey years are available, it is possible to separate cohort effects and age-effects (Fosse & Winship, 2019).
The maturity model also makes some predictions about age-trends for other constructs. One prediction is that well-being should increase as personality becomes more mature because numerous meta-analyses suggest that emotional stability, agreeableness, and conscientiousness predict higher well-being (Anglim et al., 2020). That being said, falsification of this prediction does not invalidate the maturity model. It is possible that other factors lower well-being in middle age or that higher maturity does not cause higher well-being. However, if the maturity model correctly predicts age effects on well-being, it would strengthen the model. I therefore tested age-effects on well-being and examined whether they are explained by personality development.
Fosse and Winship (2019) noted that “despite the existence of hundreds, if not thousands, of articles and dozens of books, there is little agreement on how to adequately analyze age, period, and cohort data” (p. 468). This is also true for studies of personality development. Many of these studies fail to take cohort effects into account or ignore inconsistencies between cross-sectional and longitudinal results.
Fosse and Winship point out that there is an identification problem when cohort, period, and age effects are linear, but not if the trends have different distributions. For example, if age effects are non-linear, it is possible to distinguish between linear cohort effects, linear period effects, and non-linear age effects. As maturation is expected to produce stronger effects during early adulthood than in middle adulthood and may actually show a decline in older age, it is plausible to expect a non-linear age effect. Thus, I examined age-effects in the German Socio-Economic Panel using a statistical model that examines non-linear age effects, while controlling for linear cohort and linear period effects.
Moreover, I included measures of marital status and work status to examine whether age effects are at least partially explained by these life experiences. The inclusion of these measures can also help with model identification (Fosse & Winship, 2019). For example, work and marriage have well-known age-effects. Thus, any age-effects on personality that are mediated by work or marriage are easily distinguished from cohort or period effects.
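The identification problem for linear effects can be illustrated with a short sketch. Because age equals survey year minus birth year, any amount of linear age effect can be reassigned to the period and cohort terms without changing a single prediction. The birth years, survey years, and coefficients below are hypothetical and are not the SOEP design or estimates.

```python
def apc_rows(birth_years, survey_years):
    """Build one row per cohort-by-period cell; age is fully determined."""
    rows = []
    for cohort in birth_years:
        for period in survey_years:
            rows.append({"age": period - cohort, "period": period, "cohort": cohort})
    return rows


def linear_prediction(row, b_age, b_period, b_cohort):
    """Prediction from a model with purely linear age, period, and cohort terms."""
    return b_age * row["age"] + b_period * row["period"] + b_cohort * row["cohort"]


# Hypothetical cohorts and survey waves.
rows = apc_rows(range(1940, 1990, 10), (2005, 2009, 2013, 2017))

# Move an arbitrary amount k of the age slope to the other two terms.
k = 0.3
original = [linear_prediction(r, 0.5, 0.1, -0.2) for r in rows]
shifted = [linear_prediction(r, 0.5 - k, 0.1 + k, -0.2 - k) for r in rows]

# Both coefficient sets fit the data equally well, so the data cannot
# tell them apart.
indistinguishable = all(abs(a - b) < 1e-9 for a, b in zip(original, shifted))
print(indistinguishable)
```

A non-linear age term breaks this exact dependency, which is why the non-linear age effects can be estimated while linear cohort and period effects are controlled.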
Measurement of Personality
Another limitation of many previous studies is the use of sum scores as measures of personality traits. It is well-known that these sum scores are biased by response styles (Anusic et al., 2009). Moreover, sum scores are influenced by the specific items that were selected to measure the Big Five traits and specific items can have their own age effects (Costa et al., 2019; Terracciano, McCrae, Brant, & Costa, 2005). Using a latent variable approach, it is possible to correct for random and systematic measurement errors and age effects on individual items. I therefore used a measurement model of personality that corrects for acquiescence and halo biases (Anusic et al., 2009). The specification of the model and detailed results can be found on OSF (https://osf.io/vpcfd/).
A model that assumed only age effects did not fit the data as well as a model that also allowed for cohort and period effects, chi2(df = 211) = 6651, CFI = .974, RMSEA = .021 vs. chi2(df = 201) = 5866, CFI = .977, RMSEA = .020, respectively. This finding shows that age-effects are confounded with other effects in models that do not specify cohort or period effects.
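One way to formalize this comparison is a chi-square difference test for nested models. The sketch below is an illustration and not necessarily the comparison that was run for this post: it assumes that the two models are nested and that the reported chi-square values can be subtracted directly, which does not hold for some robust estimators. The helper uses the closed-form tail probability that exists for even degrees of freedom, so no external library is needed.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate (even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("the closed form requires a positive, even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    # Sum of (x/2)^i / i! for i = 0 .. df/2 - 1.
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Fit statistics reported in the text.
delta_chi2 = 6651 - 5866  # age-only model minus age + cohort + period model
delta_df = 211 - 201

p_value = chi2_sf_even_df(delta_chi2, delta_df)
print(delta_chi2, delta_df, p_value < 0.001)
```

With samples this large, almost any added parameter improves fit significantly, so the small differences in CFI and RMSEA remain the more informative comparison.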
Figure 1 shows the age effects for the Big Five traits.
The results do not support the maturation model. The most inconsistent finding is a strong negative effect of age on agreeableness. However, other traits also did not show a continuous trend throughout adulthood. Conscientiousness increased from age 17 to 35, but remained unchanged afterwards, whereas Openness decreased slightly until age 30 and then increased continuously.
To examine the robustness of these results, I conducted sensitivity analyses with varying controls. The results for agreeableness are shown in Figure 2.
All models show a decreasing trend, but the effect sizes vary. Models without controls and models that control for either cohort effects or time effects alone produce a decreasing age trend, but the effect size is small, as most scores deviate less than .2 standard deviations from the mean (i.e., zero). However, controlling for time and cohort effects results in the strong decrease observed in Figure 1. Controlling for halo bias makes only a small difference. It is possible that the model that corrects for cohort and time effects overcorrects because it is difficult to distinguish age and time effects. However, none of these results are consistent with the predictions of the maturation model that agreeableness increases throughout adulthood.
Figure 3 takes a closer look at Neuroticism. Inconsistent with the maturation model, most models show a weak increase in neuroticism. The only model that shows a weak decrease controls for cohort effects only. One possible explanation for this finding is that it is difficult to distinguish between non-linear and linear age effects and that the negative time effect is actually an age effect. Even if this were true, the effect size of age is small.
The results for conscientiousness are most consistent with the maturation hypothesis. All models show a big increase from age 17 to age 20, and still a substantial increase from age 20 to age 35. At this point, conscientiousness levels remain fairly stable or decrease in the model that controls only for cohort effects. Although these results are most consistent with the maturation model, they do not support the prediction of a continuous process throughout adulthood. The increase is limited to early adulthood and is stronger at the beginning of adulthood, which is consistent with biological models of development (Costa et al., 2019).
Although not central to the maturation model, I also examined the influence of controls on age-effects for Extraversion and Openness.
Extraversion shows a very small increase over time in the model without controls and the model that controls only for period (time) effects. This trend turns negative in models that control for cohort effects, but all effect sizes are small.
Openness shows different results for models that control for cohort effects or not. Without taking cohort effects into account, openness appears to decrease. However, after taking cohort effects into account, openness stays relatively unchanged until age 30 and then increases gradually. These results suggest that previous cross-sectional studies may have falsely interpreted cohort effects as age-effects and that openness does not decrease with age.
Work and Marriage as Mediators
Personality psychologists have focussed on two theories to explain increases in conscientiousness during early adulthood. Some personality psychologists assume that it reflects the end stage of a biological process that increases self-regulation throughout childhood and adolescence (Costa & McCrae, 2006; Costa et al., 2019). The process is assumed to be complete by age 30. The present results suggest that it may be a bit later at age 35. The alternative theory is that social roles influence personality (Roberts, Wood, & Smith, 2005). A key prediction of the social investment theory is that personality development occurs when adults take on important social roles such as working full time, entering long-term romantic relationships (marriage), or parenting.
The SOEP makes it possible to test the social investment theory because it included questions about work and marital status. Most young adults start working full-time during their 20s, suggesting that work experiences may produce the increase in conscientiousness during this period. In Germany, marriage occurs later when individuals are in their 30s. Therefore marriage provides a particularly interesting test of the social investment theory because marriage occurs when biological maturation is mostly complete.
Figure 7 shows the age effect for work status. The age effect is clearly visible for all models and only slightly influenced by controlling for cohort or time effects.
Figure 8 shows the figure for marital status with cohabitating participants counted as married. The figure confirms that most Germans enter long-term relationships in their 30s.
To examine the contribution of work and marriage to the development of conscientiousness, I included marriage and work as predictors of conscientiousness. In this model the age-effects on conscientiousness can be decomposed into (a) an effect mediated by work (age -> work -> C), (b) an effect mediated by marriage (age -> married -> C), and (c) an effect of age that is mediated by unmeasured variables (e.g., biological processes). Results are similar for the various models and I present the results for the model that controls for cohort and time effects.
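The logic of this decomposition can be sketched with the product-of-coefficients rule for indirect effects. All numbers below are hypothetical placeholders chosen for illustration; the actual model estimates non-linear age effects within a latent variable model (see the OSF materials), which this sketch does not reproduce.

```python
def decompose_age_effect(a_work, b_work, a_married, b_married, residual_age):
    """Split a total age effect on conscientiousness (C) into three paths.

    a_*: effect of an age contrast on the mediator (work, marriage)
    b_*: effect of the mediator on C
    residual_age: age effect on C not carried by the measured mediators
    """
    via_work = a_work * b_work            # age -> work -> C
    via_marriage = a_married * b_married  # age -> married -> C
    total = via_work + via_marriage + residual_age
    return {
        "via_work": via_work,
        "via_marriage": via_marriage,
        "unmeasured": residual_age,
        "total": total,
    }


# Hypothetical values for one age contrast (e.g., age 20 vs. age 30).
parts = decompose_age_effect(
    a_work=0.60, b_work=0.25,        # hypothetical
    a_married=0.40, b_married=0.00,  # hypothetical: no effect of marriage on C
    residual_age=0.30,               # hypothetical
)
print(parts)
```

An indirect path contributes nothing when either of its two links is zero, which is how a strong age effect on marriage can coexist with no mediated effect of marriage on conscientiousness.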
The results show no effect of marriage; that is, the effect size for the indirect effect is close to zero. In contrast, both work and unmeasured mediators contribute to the total age effect. The unmeasured mediators produce a step increase in the early 20s. This finding is consistent with a biological maturation hypothesis. Moreover, the unmeasured mediators produce a gradual decline over the life span with a surprising uptick at the end. This trajectory may be a sign of cognitive decline. The work effect increases much more gradually and is consistent with the social-role theory. Accordingly, the decrease in conscientiousness after age 55 is related to retirement. The negative effect of retirement on conscientiousness raises some interesting theoretical questions about the definition of personality. Does retirement really alter personality or does it merely alter situational factors that influence conscientious behaviors? To separate these hypotheses, it would be important to examine behaviors outside of work, but the trait measure that was used in this study does not provide information about the consistency of behaviors across different situations.
The key finding is that the data are consistent with two theories that are often treated as mutually exclusive and competing hypotheses. The present results suggest that biological processes and social roles contribute to the development of conscientiousness during early adulthood. However, there is no evidence that this process continues in middle or late adulthood and role effects tend to disappear as soon as individuals are retiring.
Personality Development and Well-Being
One view of personality assumes that variation in personality is normal and that no personality trait is better than another. In contrast, the maturation model implies that some traits are more desirable, if only because they are instrumental to fulfill roles of adult life like working or maintaining relationships (McCrae & Costa, 1991). Accordingly, more mature individuals should have higher well-being. While meta-analyses suggest that this is the case, they often do not control for rating biases. When rating biases are taken into account, the positive effects of agreeableness and conscientiousness are not always found and are small (Schimmack, Schupp, & Wagner, 2008; Schimmack & Kim, 2020).
Another problem for the maturation theory is that well-being tends to decrease from early to middle adulthood when maturation should produce benefits. However, it is possible that other factors explain this decrease in well-being and maturation buffers these negative effects. To test this hypothesis, I added life-satisfaction to the model and examined mediators of age-effects on life-satisfaction.
An inspection of the direct relationships of personality traits and life-satisfaction confirmed that life-satisfaction ratings are most strongly influenced by neuroticism, b = -.37, se = .01. Response styles also had notable effects: halo, b = .15, se = .01; acquiescence, b = .19, se = .01. The effects of the remaining Big Five traits were weak: extraversion, b = .078, se = .01; agreeableness, b = .07, se = .01; conscientiousness, b = .02, se = .005; openness, b = .07, se = .01. The weak effect of conscientiousness makes it unlikely that age-effects on conscientiousness contribute to age-effects on life-satisfaction.
The next figure shows the age-effect for life-satisfaction. The total effect is rather flat and shows only an increase in the 60s.
The mostly stable level of life-satisfaction masks two opposing trends. As individuals enter the workforce and get married, life-satisfaction actually increases. The positive trajectory for work reverses when individuals retire, while the positive effect of marriage remains. However, the positive effects of work and marriage are undone by unexplained factors that decrease well-being until age 50, when a rebound is observed. Neuroticism is not a substantial mediator because there are no notable age-effects on neuroticism. Conscientiousness is not a notable mediator because it does not predict life-satisfaction.
The main insight from these findings is that achieving major milestones of adult life is associated with increased well-being, but that these positive effects are not explained by personality development.
Narrative reviews claim that personality develops steadily through adulthood. For example, in a just published review of the literature Roberts and Yoon claim that “agreeableness, conscientiousness, and emotional stability show increases steadily through midlife” (p. 10). Roberts and Yoon also claim that “forming serious partnerships is associated with decreases in neuroticism and increases in conscientiousness” (p. 11). The problem with these broad and vague statements is that they ignore inconsistencies across cross-sectional and longitudinal analyses (Lucas & Donnellan, 2011), inconsistencies across populations (Graham et al., 2020), and effect sizes (Costa et al., 2019).
The present results challenge this simplistic story of personality development. First, only conscientiousness shows a notable increase from late adolescence to middle age and most of the change occurs during early adulthood before the age of 35. Second, formation of long-term relationships had no effect on neuroticism or conscientiousness. Participation in the labor force did increase conscientiousness, but these gains were lost when older individuals retired. If conscientiousness were a sign of maturity, it is not clear why it would decrease after it was acquired. In short, the story of life-long development is not based on scientific facts.
The notion of personality development is also problematic from a theoretical perspective. It implies that some personality traits are better, more mature, than others. This has led to calls for interventions to help people to become more mature (Bleidorn et al., 2019). However, this proposal imposes values and implicitly devalues individuals with the wrong traits. An alternative view treats personality as variation without value judgment. Accordingly, it may be justified to help individuals to change their personality if they want to change their personality, just like gender changes are now considered a personal choice without imposing gender norms on individuals. However, it would be wrong to subject individuals to programs that aim to change their personality, just like it is now considered wrong to subject individuals to interventions that target their sexual orientation. Even if individuals want to change, it is not clear how much personality can be changed. Thus, another goal should be to help individuals with different personality traits to feel good about themselves and to live fulfilling lives that allow them to express their authentic personality. The rather weak relationships between many personality traits and well-being suggest that it is possible to have high well-being with a variety of personalities. The main exception is neuroticism, which has a strong negative effect on well-being. However, the question here is how much of this relationship is driven by mood disorders rather than normal variation in personality. The effect may also be moderated by social factors that create stress and anxiety.
In conclusion, the notion of personality development lacks clear theoretical foundations and empirical support. While there are some relatively small mean level changes in personality over the life span, they are relatively trivial compared to the large stable variance in personality traits across individuals. Rather than considering this variation as arrested forms of development, it should be celebrated as diversity that enriches everybody’s life.
Conflict of Interest: My views may be biased by my (immature) personality (high N, low A, low C).
P.S. I asked Brent W. Roberts for comments, but he declined the opportunity. Please share your comments in the comment section.
Peer Reviewed by Editors of Biostatistics “You have produced a nicely written paper that seems to be mathematically correct and I enjoyed reading” (Professor Dimitris Rizopoulos & Professor Sherri Rose)
Estimating the false discovery risk in medical journals
Ulrich Schimmack∗ Department of Psychology, University of Toronto Mississauga 3359 Mississauga Road N. Mississauga, Ontario Canada firstname.lastname@example.org
Frantisek Bartos Department of Psychology, University of Amsterdam; Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic
Jager and Leek (2014) proposed an empirical method to estimate the false discovery rate in top medical journals and found a false discovery rate of 14%. Their work received several critical comments and has had relatively little impact on meta-scientific discussions about medical research. We build on Jager and Leek’s work and present a new way to estimate the false discovery risk. Our results closely reproduce their original finding with a false discovery rate of 13%. In addition, our method shows clear evidence of selection bias in medical journals, but the expected discovery rate is 30%, much higher than we would expect if most published results were false. Our results provide further evidence that meta-science needs to be built on solid empirical foundations.
The successful development of vaccines against Covid-19 provides a vivid example of a scientific success story. At the same time, many sciences are facing a crisis of confidence in published results. The influential article “Why most published research findings are false” suggested that many published significant results are false discoveries (Ioannidis, 2005). One limitation of Ioannidis’s article was the reliance on a variety of unproven assumptions. For example, Ioannidis assumed that only 1 out of 11 exploratory epidemiological studies tests a true hypothesis. To address this limitation, Jager and Leek (2014) developed a statistical model to estimate the percentage of false-positive results in a set of significant p-values. They applied their model to 5,322 p-values from medical journals and found that only 14% of the significant results may be false-positives. This is a sizeable percentage, but it is much lower than the false-positive rates predicted by Ioannidis. Although Jager and Leek’s article was based on actual data, the article had a relatively weak impact on discussions about false-positive risks. So far, the article has received only 73 citations in WebOfScience. In comparison, Ioannidis’s purely theoretical article has been cited 518 times in 2020 alone. We believe that Jager and Leek’s article deserves a second look and that discussions about the credibility of published results benefit from empirical investigations.
Estimating the False Discovery Risk
To estimate the false discovery rate, Jager and Leek developed a model with two populations of studies. One population includes studies in which the null-hypothesis is true (H0). The other population includes studies in which the null-hypothesis is false; that is, the alternative hypothesis is true (H1). The model assumes that the observed distribution of significant p-values is a mixture of these two populations.
One problem for this model is that it can be difficult to distinguish between studies in which H0 is true and studies in which H1 is true, but it was tested with low statistical power. Furthermore, the distinction between the point-zero null-hypothesis, the nil-hypothesis (Cohen, 1994), and alternative hypotheses with very small effect sizes is rather arbitrary. Many effect sizes may not be exactly zero but too small to have practical significance. This makes it difficult to distinguish clearly between the two populations of studies and estimates based on models that assume distinct populations may be unreliable.
To avoid the distinction between two populations of p-values, we distinguish between the false discovery rate and the false discovery risk. The false discovery risk does not aim to estimate the actual rate of H0 among significant p-values. Rather, it provides an estimate of the worst-case scenario with the highest possible amount of false-positive results. To estimate the false discovery risk, we take advantage of Soric’s (1989) insight that the maximum false discovery rate is limited by statistical power to detect true effects. When power is 100%, all non-significant results are produced by tests in which the null-hypothesis is true (H0). As this scenario maximizes the number of non-significant H0, it also maximizes the number of significant H0 tests and the false discovery rate. Soric showed that the maximum false discovery rate is a direct function of the discovery rate. For example, if 100 studies produce 30 significant results, the discovery rate is 30%. And when the discovery rate is 30%, the maximum false discovery risk with α = 5% is ≈ 0.12. In general, the false discovery risk is a simple transformation of the discovery rate:
false discovery risk = (1 / discovery rate − 1) × (α / (1 − α)).
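Soric's bound is easy to compute directly. The following is a minimal sketch, with an illustrative function name, that maps a discovery rate and a significance criterion onto the maximum false discovery rate. Capping the result at 1 for discovery rates below α is a choice made here for readability, not part of Soric's formula.

```python
def soric_max_fdr(discovery_rate, alpha=0.05):
    """Maximum false discovery rate implied by a discovery rate (Soric, 1989)."""
    if not 0.0 < discovery_rate <= 1.0:
        raise ValueError("discovery_rate must be in (0, 1]")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    bound = (1.0 / discovery_rate - 1.0) * (alpha / (1.0 - alpha))
    # Assumption of this sketch: report at most 100% false discoveries.
    return min(1.0, bound)


# The example from the text: 30 significant results in 100 studies.
print(round(soric_max_fdr(0.30), 2))  # 0.12

# A few other discovery rates for comparison.
for rate in (0.10, 0.50, 0.90):
    print(rate, round(soric_max_fdr(rate), 3))
```

The bound falls quickly as the discovery rate rises, which is why an estimate of the discovery rate that is corrected for selection bias is the key quantity.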
Our suggestion to estimate the false discovery risk rather than the actual false discovery rate addresses concerns about Jager and Leek’s two-population model that were raised in several commentaries (Gelman and O’Rourke, 2014; Benjamini and Hechtlinger, 2014; Ioannidis, 2014; Goodman, 2014).
If all conducted hypothesis tests were reported, the false discovery risk could be determined simply by computing the percentage of significant results. However, it is well-known that journals are more likely to publish significant results than non-significant results. This selection bias renders the observed discovery rate in journals uninformative (Bartoš and Schimmack, 2021; Brunner and Schimmack, 2020). Thus, a major challenge for any empirical estimates of the false discovery risk is to take selection bias into account.
Biostatistics published several commentaries on Jager and Leek’s article. A commentary by Ioannidis (2014) may have contributed to the low impact of Jager and Leek’s article. Ioannidis claims that Jager and Leek’s results can be ignored because they used automatic extraction of p-values, a wrong method, and unreliable data. We address these concerns by means of a new extraction method, a new estimation method, and new simulation studies that evaluate the performance of Jager and Leek’s original method and a new method. To foreshadow the main results, we find that Jager and Leek’s method can sometimes produce biased estimates of the false discovery risk. However, our improved method produces even lower estimates of the false discovery risk. When we applied this method to p-values from medical journals, we obtained an estimate of 13% that closely matches Jager and Leek’s original results. Thus, although Ioannidis (2014) raised some valid objections, our results provide further evidence that false discovery rates in medical research are much lower than Ioannidis (2005) predicted.
Jager and Leek (2014) proposed a selection model that could be fitted to the observed distribution of significant p-values. This model assumed a flat distribution for p-values from the H0 population and a beta distribution for p-values from the H1 population. A single beta distribution can only approximate the actual distribution of p-values; a better solution is to use a mixture of several beta distributions or, alternatively, convert the p-values into z-scores and model the z-scores with several truncated normal distributions (similar to the suggestion by Cox, 2014). Since reported p-values often come from two-sided tests, the resulting z-scores need to be converted into absolute z-scores that can be modeled as a mixture of truncated folded normal distributions (Bartoš and Schimmack, 2021). The weights of the mixture components can then be used to compute the average power of studies that produced a significant result. As this estimate is limited to the set of studies that were significant, we refer to it as the average power after selection for statistical significance. As power determines the outcomes of replication studies, the average power after selection for statistical significance is an estimate of the expected replication rate.
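The conversion of two-sided p-values into absolute z-scores can be sketched with Python’s standard library. The helper name `p_to_abs_z` is an illustrative choice, not part of any published package.

```python
from statistics import NormalDist


def p_to_abs_z(p: float) -> float:
    """Absolute z-score that corresponds to a two-sided p-value."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    # Use the lower tail and flip the sign; this is more precise for tiny p
    # than computing inv_cdf(1 - p / 2).
    return -NormalDist().inv_cdf(p / 2.0)


for p in (0.05, 0.01, 0.001):
    print(p, round(p_to_abs_z(p), 2))  # 1.96, 2.58, 3.29
```

Only z-scores above the significance criterion (about 1.96 for α = .05) enter the model, which is why the mixture components are truncated.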
Although an estimate of the expected replication rate is valuable in its own right, it does not provide an estimate of the false discovery risk because it is based on the population of studies after selection for statistical significance. To estimate the expected discovery rate, z-curve models the selection process operating on the significance level and assumes that studies produce a statistically significant result proportionally to their power. For example, studies with 50% power produce one non-significant result for every significant result and studies with 20% power produce four statistically non-significant results for every significant result. It is therefore possible to estimate the average power before selection for statistical significance based on the weights of the mixture components that are obtained by fitting the model to only significant results. As power determines the percentage of significant results, we refer to average power before selection for statistical significance as the expected discovery rate.
Extensive simulation studies have demonstrated that z-curve produces good large-sample estimates of the expected discovery rate with exact p-values (Bartoš and Schimmack, 2021). Moreover, these simulation studies showed that z-curve produces robust confidence intervals with good coverage. As the false discovery risk is a simple transformation of the expected discovery rate (EDR), these confidence intervals also provide confidence intervals for estimates of the false discovery risk. To use z-curve for p-values from medical abstracts, we extended z-curve’s expectation-maximization (EM) algorithm (Dempster and others, 1977) to incorporate rounding and censoring similarly to Jager and Leek’s model. To demonstrate that z-curve can obtain valid estimates of the false discovery risk for medical journals, we conducted a simulation study that compared Jager and Leek’s method with z-curve.
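The step from mixture weights after selection to average power before selection can be illustrated with a simplified sketch. It assumes each mixture component is summarized by a single power value, and the function name and example numbers are hypothetical. The published z-curve method works with folded normal components and differs in detail, so this only shows the reweighting logic.

```python
def err_and_edr(weights_after_selection, powers):
    """Average power after and before selection for significance.

    weights_after_selection: share of each component among significant results.
    powers: power of each component (probability of a significant result).
    """
    total = sum(weights_after_selection)
    w = [x / total for x in weights_after_selection]
    # After selection: plain weighted mean of power (expected replication rate).
    err = sum(wi * pi for wi, pi in zip(w, powers))
    # Before selection: a component with power p needed 1/p attempts per
    # significant result, so its pre-selection weight is proportional to w/p.
    edr = 1.0 / sum(wi / pi for wi, pi in zip(w, powers))
    return err, edr


# Hypothetical example: half of the significant results come from studies
# with 20% power and half from studies with 80% power.
err, edr = err_and_edr([0.5, 0.5], [0.2, 0.8])
print(round(err, 3), round(edr, 3))  # 0.5 0.32
```

The low-power component dominates before selection (it accounts for 80% of all attempts in this example), which is why the expected discovery rate is lower than the expected replication rate.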
We extended the simulation performed by Jager and Leek in several ways. Instead of simulating H1 p-values directly from a beta distribution, we used power estimates from individual studies based on meta-analyses (Lamberink and others, 2018) and simulated p-values of two-sided z-tests with corresponding power (excluding all power estimates based on meta-analyses with non-significant results). This allows us to assess the performance of the methods under heterogeneity of power to detect H1 corresponding to the actual literature. To simulate H0 p-values, we used a uniform distribution.
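One way to simulate p-values of two-sided z-tests with a given power is sketched below. This is not the authors’ R code (which is in the supplementary materials); the function names, the bisection search, and the example power values are our own assumptions.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sided(delta, alpha=0.05):
    """Power of a two-sided z-test whose statistic is distributed N(delta, 1)."""
    crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return STD_NORMAL.cdf(delta - crit) + STD_NORMAL.cdf(-delta - crit)


def delta_for_power(power, alpha=0.05):
    """Find the noncentrality delta >= 0 that yields the requested power."""
    if not alpha < power < 1.0:
        raise ValueError("power must lie between alpha and 1")
    lo, hi = 0.0, 10.0
    for _ in range(60):  # bisection; power increases monotonically in delta
        mid = (lo + hi) / 2.0
        if power_two_sided(mid, alpha) < power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def simulate_h1_p(delta, rng):
    """Two-sided p-value of one simulated z-test with noncentrality delta."""
    z = rng.gauss(delta, 1.0)
    return 2.0 * STD_NORMAL.cdf(-abs(z))


def simulate_h0_p(rng):
    """Under H0 the p-value is uniform on (0, 1)."""
    return rng.random()


rng = random.Random(2021)
hypothetical_powers = [0.2, 0.5, 0.8]  # stand-ins for study-level estimates
deltas = [delta_for_power(p) for p in hypothetical_powers]
h1_p_values = [simulate_h1_p(rng.choice(deltas), rng) for _ in range(1000)]
significant = [p for p in h1_p_values if p < 0.05]
print(len(significant) / len(h1_p_values))
```

Drawing the power values from empirical estimates rather than fixing one value is what introduces the heterogeneity the simulation is meant to test.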
We manipulated the true false discovery rate from 0 to 1 with a step size of 0.01 and simulated 10,000 observed significant p-values. Similarly to Jager and Leek, we performed four simulation scenarios with an increasing percentage of imprecisely reported p-values. Scenario A used exact p-values, scenario B rounded p-values to three decimal places (with p-values lower than 0.001 censored at 0.001), scenario C rounded 20% of p-values to two decimal places (with p-values rounded to 0 censored at 0.01), and scenario D first rounded 20% of p-values to two decimal places and further censored 20% of p-values at one of the closest ceilings (0.05, 0.01, or 0.001).
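The reporting rules in these scenarios can be sketched as small functions that turn an exact significant p-value into a reported one. The representation as an (operator, value) pair, the function names, and the reading of scenario D as two disjoint 20% subsets with the remaining p-values left exact are our assumptions; the text does not spell out these details.

```python
import random


def report_three_decimals(p):
    """Scenario B rule: three decimals, values below .001 reported as p < .001."""
    if p < 0.001:
        return ("<", 0.001)
    return ("=", round(p, 3))


def report_two_decimals(p):
    """Scenario C rule: two decimals, values rounding to 0 reported as p < .01."""
    rounded = round(p, 2)
    if rounded == 0.0:
        return ("<", 0.01)
    return ("=", rounded)


def report_ceiling(p):
    """Scenario D rule: censor at the closest ceiling among .001, .01, .05."""
    for ceiling in (0.001, 0.01, 0.05):
        if p < ceiling:
            return ("<", ceiling)
    raise ValueError("p is not significant at .05")


def scenario_d(p, rng):
    """Assumed mix: 20% rounded to two decimals, 20% censored, 60% exact."""
    u = rng.random()
    if u < 0.20:
        return report_two_decimals(p)
    if u < 0.40:
        return report_ceiling(p)
    return ("=", p)


rng = random.Random(7)
print([scenario_d(p, rng) for p in (0.0004, 0.004, 0.03)])
```

A censored report such as ("<", 0.01) only tells the model that the p-value lies below the ceiling, which is the information loss the extended EM algorithm has to accommodate.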
Figure 1 displays the true (x-axis) vs. estimated (y-axis) false discovery rate (FDR) for Jager and Leek’s method and the false discovery risk for z-curve across the different scenarios (panels). We see that when precise p-values are reported (panel A in the upper left corner), z-curve can handle the heterogeneity in power very well across the whole range of false discovery rates and produces accurate estimates of false discovery risks. Higher estimates than the actual false discovery rates are expected because the false discovery risk is an estimate of the maximum false discovery rate. Discrepancies are especially expected when the power of true hypothesis tests is low. For the simulated scenarios, the discrepancies are less than 20 percentage points and decrease as the true false discovery rate increases. Even though Jager and Leek’s method aims to estimate the true false discovery rate, it produces higher estimates than z-curve. This is problematic because the method produces inflated estimates of the true false discovery rate. Even if the estimates were interpreted as maximum estimates, the method is less sensitive to the actual variation in the false discovery rate than the z-curve method.
Panel B shows that the z-curve method produces similar results when p-values are rounded to three decimals. Jager and Leek’s method, however, experiences estimation issues, especially at the lower end of the true false discovery rate, because the current implementation can only deal with rounding to two decimal places (we also tried specifying the p-values as a rounded input; however, the optimizing routine failed with several errors).
Panel C shows a surprisingly similar performance of the two methods when 20% of p-values are rounded to two decimals, except for very high levels of true false discovery rates, where Jager and Leek’s method starts to underestimate the false discovery rate. Despite the similar performance, the results have to be interpreted as estimates of the false discovery risk (maximum false discovery rate) because both methods overestimate the true false discovery rate for low false discovery rates.
Panel D shows that both methods have problems when 20% of p-values are at the closest ceiling of .05, .01, or .001 without providing clear information about the exact p-value. Z-curve does a little bit better than Jager and Leek’s method. Underestimation of true false discovery rates over 40% is not a serious problem because any actual false discovery rate over 40% is unacceptably high. One solution to the underestimation problem is to exclude p-values that are reported in this way from analyses.
Root mean square error and bias of the false discovery rate estimates for each scenario, summarized in Table 1, show that z-curve produces estimates with considerably lower root mean square error. The results for bias show that both methods tend to produce higher estimates than the true false discovery rate. For z-curve this is expected because it aims to estimate the maximum false discovery rate. It would only be a problem if estimates of the false discovery risk were lower than the actual false discovery rate. This is only the case in Scenario D, but as shown previously, underestimation only occurs when the true false discovery rate is high.
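For readers who want to reproduce this kind of summary, root mean square error and bias can be computed as follows. The function name and the numbers in the example are made up for illustration and are not the values in Table 1.

```python
from math import sqrt


def rmse_and_bias(estimates, truths):
    """Root mean square error and mean signed error (estimate minus truth)."""
    if len(estimates) != len(truths) or not estimates:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [e - t for e, t in zip(estimates, truths)]
    bias = sum(errors) / len(errors)
    rmse = sqrt(sum(x * x for x in errors) / len(errors))
    return rmse, bias


# Illustrative values only: every estimate is 5 percentage points too high.
rmse, bias = rmse_and_bias([0.15, 0.25, 0.35], [0.10, 0.20, 0.30])
print(round(rmse, 3), round(bias, 3))  # 0.05 0.05
```

A positive bias means the method overestimates on average, which is the acceptable direction for an estimate of the maximum false discovery rate.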
To summarize, our simulation confirms that Jager and Leek’s method provides meaningful estimates of the false discovery risk and that the method is likely to overestimate the true false discovery rate. Thus, it is likely that the reported estimate of 14% for top medical journals overestimates the actual false discovery rate. Our results also show that z-curve improves over the original method and that the modifications can handle rounding and imprecise reporting when the false discovery rates are below 40%.
Application to Medical Journals
Commentators raised more concerns about Jager and Leek’s mining of p-values than about their estimation method. To address these concerns, we extended Jager and Leek’s data mining approach in the following ways: (1) we extracted p-values only from abstracts labeled as “randomized controlled trial” or “clinical trial” as suggested by Goodman (2014), Ioannidis (2014), and Gelman and O’Rourke (2014); (2) we improved the regex script for extracting p-values to cover more possible notations as suggested by Ioannidis (2014); and (3) we extracted confidence intervals from abstracts not reporting p-values as suggested by Ioannidis (2014) and Benjamini and Hechtlinger (2014). We further scraped p-values from abstracts in “PLoS Medicine” to compare the false discovery rate estimates to a less-selective journal as suggested by Goodman (2014). Finally, we randomly subset the scraped p-values to include only a single p-value per abstract in all analyses, thus breaking the correlation between the estimates as suggested by Goodman (2014). Although there are additional limitations inherent to the chosen approach, these improvements, along with our improved estimation method, make it possible to test the prediction by several commentators that the false discovery rate is well above 14%.
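To give a sense of what regex-based extraction involves, here is a minimal Python sketch that finds p-values in abstract text and keeps one at random per abstract. It is not the authors’ script, which is available in the supplementary materials; the pattern, function names, and example sentence are hypothetical, and the sketch does not attempt to extract confidence intervals.

```python
import random
import re

# Matches notations such as "p < 0.05", "P=.003", and "p-value ≤ 0.001".
P_VALUE = re.compile(
    r"\bp\s*(?:-?\s*values?)?\s*([<>=≤≥])\s*(\d?\.\d+)",
    re.IGNORECASE,
)


def extract_p_values(abstract):
    """Return (operator, value) pairs for every p-value found in the text."""
    return [(op, float(num)) for op, num in P_VALUE.findall(abstract)]


def one_per_abstract(abstracts, rng):
    """Keep a single randomly chosen p-value from each abstract that has any."""
    picks = []
    for text in abstracts:
        found = extract_p_values(text)
        if found:
            picks.append(rng.choice(found))
    return picks


example = ("Mortality was lower (P=.003); no difference in stroke "
           "(p > 0.20), p-value < 0.001.")
print(extract_p_values(example))
print(one_per_abstract([example, "No statistics reported."], random.Random(0)))
```

Keeping the comparison operator matters because values reported as inequalities are censored rather than exact.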
We executed the scraping protocol in July 2021 and scraped abstracts published since 2000 (see Table 2 for a summary of the scraped data). An interactive visualization of the individual abstracts and scraped values can be accessed at https://tinyurl.com/zcurve-FDR.
Figure 2 shows the false discovery rate estimates from z-curve and from Jager and Leek’s method for scraped abstracts of clinical trials and randomized controlled trials, divided by journal and by whether the article was published up to and including 2010 (left) or after 2010 (right). We see that, in line with the simulation results, Jager and Leek’s method produces slightly higher false discovery rate estimates. Furthermore, z-curve produced considerably wider bootstrapped confidence intervals, suggesting that the confidence interval reported by Jager and Leek (± 1 percentage point) was too narrow.
A comparison of the false discovery estimates based on data before (and including) 2010 and after 2010 shows that confidence intervals overlap, suggesting that false discovery rates have not changed. Separate analyses based on clinical trials and randomized controlled trials also showed no significant differences (see Figure 3). Therefore, to reduce the uncertainty about the false discovery rate, we estimate the false discovery rate for each journal irrespective of publication year. The resulting false discovery rate estimates based on z-curve and Jager and Leek’s method are summarized in Table 3. We find that all false discovery rate estimates fall within a .05 to .30 interval. Finally, further aggregating data across the journals provides a false discovery rate estimate of 0.13, 95% CI [0.08, 0.21] based on z-curve and 0.19, 95% CI [0.17, 0.20] based on Jager and Leek’s method. This finding suggests that Jager and Leek’s extraction method slightly underestimated the false discovery rate, whereas their model overestimated the false discovery rate.
Additional Z-Curve Results
So far, we used the expected discovery rate only to estimate the false discovery risk, but the expected discovery rate provides valuable information in itself. Ioannidis’s predictions of the false discovery rate were based on scenarios that assumed that less than 10% of all tested hypotheses are true. The same assumption was made to recommend lowering α from .05 to .005 (Benjamin and others, 2018). If all true hypotheses were tested with 100% power, the discovery rate would match the percentage of true hypotheses plus the false-positive results: 10% + 90% × .05 = 14.5%. Because the actual power is less than 100%, the discovery rate would be even lower. However, the estimated expected discovery rate for top medical journals is 30% with a confidence interval ranging from 20% to 41%. Thus, our results suggest that previous speculations about discovery rates were overly pessimistic.
The expected discovery rate also provides valuable information about the extent of selection bias in medical journals. While the expected discovery rate is only 30%, the observed discovery rate (i.e., the percentage of significant results in abstracts) is more than double (69.7%). This discrepancy is visible in Figure 4. The histogram of observed non-significant z-scores does not match the predicted distribution (blue curve). This evidence of selection bias implies that reported effect sizes are inflated by selection bias. Thus, follow-up studies need to adjust effect sizes when planning the sample sizes via power analyses.
Z-curve also provides information about the replicability of significant results in medical abstracts. The expected replication rate is 65% with a confidence interval ranging from 61% to 69%. This result suggests that sample sizes should be increased to meet the recommended level of 80% power. Furthermore, this estimate may be overly optimistic because comparisons of actual replication rates and z-curve predictions show lower success rates for actual replication studies (Bartoš and Schimmack, 2021). One reason could be that exact replication studies are impossible and changes in population will result in lower power due to selection bias and regression to the mean. In the worst case, the actual replication rate might be as low as the expected discovery rate. Thus, our results predict that the success rate of actual replication studies in medicine will be somewhere between 30% and 65%.
Finally, z-curve can be used to adjust the significance level α retrospectively to maintain a false discovery risk of less than 5% (Goodman, 2014). To do so, it is only necessary to compute the expected discovery rate for different levels of α. With α = .01, the expected discovery rate decreases to 20% and the false discovery risk decreases to 4%. Adjusting α to the recommended level of .005 reduces the expected discovery rate to 17% and the false discovery risk to 2%. Based on these results, it is possible to use α = .01 as a criterion to reject the null-hypothesis while maintaining a false discovery risk below 5%.
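The two risk estimates reported above follow from Soric’s bound, which can be written as (1/EDR − 1) × α/(1 − α). The worked arithmetic below uses the rounded expected discovery rates from the text, so the results are approximate.

```latex
\begin{align*}
\alpha = .01,\ \text{EDR} = .20:&\quad
  \left(\frac{1}{.20} - 1\right) \times \frac{.01}{.99}
  = 4 \times .0101 \approx .04 \\
\alpha = .005,\ \text{EDR} = .17:&\quad
  \left(\frac{1}{.17} - 1\right) \times \frac{.005}{.995}
  \approx 4.88 \times .0050 \approx .02
\end{align*}
```

Lowering α shrinks the second factor faster than the falling discovery rate inflates the first, which is why the risk decreases overall.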
Like many other human activities, science relies on trust. Over the past decade, it has become clear that some aspects of modern science undermine trust. The biggest problem remains the prioritization of new discoveries that meet the traditional threshold of statistical significance. The selection for significance has many undesirable consequences. Although medicine has responded to this problem by demanding preregistration of clinical trials, our results suggest that selection for significance remains a pervasive problem in medical research. As a result, the observed discovery rate and reported effect sizes provide misleading information about the robustness of published results. To maintain trust in medical research, it is important to take selection bias into account.
Concerns about the replicability of published results have led to the emergence of meta-science as an active field of research over the past decade. Unlike meta-physics, meta-science is an empirical enterprise that uses data to investigate science. Data can range from survey studies of research practices to actual replication studies. Jager and Leek made a valuable contribution to meta-science by developing a method to estimate the false discovery rate based on published p-values using a statistical model that takes selection bias into account. Their work stimulated discussion, but their key finding that false discovery rates in medicine are not at an alarmingly high rate was ignored. We followed up on Jager and Leek’s seminal contribution with a different estimation model and an improved extraction method to harvest results from medical abstracts. Despite these methodological improvements, our results firmly replicated Jager and Leek’s key finding that false discovery rates in top medical journals are between 10% and 20%.
We also extended the meta-scientific investigation of medical research in several ways. First, we demonstrated that the false discovery risk can be reduced to less than 5% by lowering the criterion for statistical significance to .01. This recommendation is similar to other proposals to lower α to .005, but our proposal is based on empirical data. Moreover, the α level can be modified for different fields of study or it can be changed in the future in response to changes in research practices. Thus, rather than recommending one fixed α, we recommend justifying α (Lakens and others, 2018). Fields with low discovery rates should use a lower α than fields with high discovery rates to maintain a false discovery risk below 5%.
We also demonstrated that medical journals have substantial selection bias. Whereas the percentage of significant results in abstracts is over 60%, the expected discovery rate is only 30%. This test for selection bias is important because it would be unnecessary to use selection models if selection bias were negligible. Evidence of substantial selection bias may also help to change publication practices in order to reduce selection bias. For example, journals could be evaluated on the basis of the amount of selection bias just like they are being evaluated in terms of impact factors.
Finally, we provided evidence that the average power of studies with significant results is 65%. As power increases for studies with lower p-values, this estimate implies that power for studies that are significant at p < .01 to produce a p-value below .05 in a replication study would be even higher. Based on these findings, we would predict that at least 50% of results that achieved p < .01 can be successfully replicated. This is comparable to cognitive psychology, where 50% of significant results at p < .05 could be successfully replicated (Open Science Collaboration, 2015).
Limitations and Future Directions
Even though we were able to address several of the criticisms of Jager and Leek’s seminal article, we were unable to address all of them. The question is whether the remaining concerns are sufficient to invalidate our results. We think this is rather unlikely because our results are in line with findings in other fields. The main remaining concern is that mining p-values and confidence intervals from abstracts creates a biased sample of results. The only way to address this concern is to read the actual articles and to pick the focal hypothesis test for the z-curve analysis. Unfortunately, nobody seems to have taken on this daunting task for medical journals. However, social psychologists have hand-coded a large, representative sample of test statistics (Motyl and others, 2017). The coding used the actual test statistics rather than p-values. Thus, exact p-values were computed and no rounding or truncation problems are present in these data. A z-curve analysis of these data estimated an expected discovery rate of 19%, 95% CI = 6% to 36% (Schimmack, 2020). Given the low replication rate of social psychology, it is not surprising that the expected discovery rate is lower than for medical studies (Open Science Collaboration, 2015). However, even a low expected discovery rate of 19% limits the false discovery risk to 22%, which is not much higher than the false discovery risk in medicine and does not justify the claim that most published results are false. To provide more conclusive evidence for medicine, we strongly encourage hand-coding of medical journals and high-powered replication studies. Based on the present results, we predict false positive rates well below 50%.
Supplementary Materials including data and R scripts for reproducing the simulations, data scraping, and analyses are available from https://osf.io/y3gae/.
Conflict of Interest: None declared.
Bartoš, František and Schimmack, Ulrich. (2021). Z-curve 2.0: Estimating replication rates and discovery rates. Meta-Psychology.
Benjamin, Daniel J, Berger, James O, Johannesson, Magnus, Nosek, Brian A, Wagenmakers, E-J, Berk, Richard, Bollen, Kenneth A, Brembs, Bjorn, Brown, Lawrence, Camerer, Colin and others. (2018). Redefine statistical significance. Nature Human Behaviour 2(1), 6–10.
Benjamini, Yoav and Hechtlinger, Yotam. (2014). Discussion: An estimate of the science-wise false discovery rate and applications to top medical journals by Jager and Leek. Biostatistics 15(1), 13–16.
Brunner, Jerry and Schimmack, Ulrich. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology 4.
Cohen, Jacob. (1994). The earth is round (p < .05). American Psychologist 49(12), 997.
Cox, David R. (2014). Discussion: Comment on a paper by Jager and Leek. Biostatistics 15(1), 16–18.
Dempster, Arthur P, Laird, Nan M and Rubin, Donald B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39(1), 1–22.
Gelman, Andrew and O’Rourke, Keith. (2014). Difficulties in making inferences about scientific truth from distributions of published p-values. Biostatistics 15(1), 18–23.
Goodman, Steven N. (2014). Discussion: An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics 15(1), 13–16.
Ioannidis, John PA. (2005). Why most published research findings are false. PLoS Medicine 2(8), e124.
Ioannidis, John PA. (2014). Discussion: Why “an estimate of the science-wise false discovery rate and application to the top medical literature” is false. Biostatistics 15(1), 28–36.
Jager, Leah R and Leek, Jeffrey T. (2014). An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics 15(1), 1–12.
Lakens, Daniel, Adolfi, Federico G, Albers, Casper J, Anvari, Farid, Apps, Matthew AJ, Argamon, Shlomo E, Baguley, Thom, Becker, Raymond B, Benning, Stephen D, Bradford, Daniel E and others. (2018). Justify your alpha. Nature Human Behaviour 2(3), 168–171.
Lamberink, Herm J, Otte, Willem M, Sinke, Michel RT, Lakens, Daniel, Glasziou, Paul P, Tijdink, Joeri K and Vinkers, Christiaan H. (2018). Statistical power of clinical trials increased while effect size remained stable: an empirical analysis of 136,212 clinical trials between 1975 and 2014. Journal of Clinical Epidemiology 102, 123–128.
Motyl, Matt, Demos, Alexander P, Carsel, Timothy S, Hanson, Brittany E, Melton, Zachary J, Mueller, Allison B, Prims, JP, Sun, Jiaqing, Washburn, Anthony N, Wong, Kendal M and others. (2017). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology 113(1), 34.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science 349(6251).
Schimmack, Ulrich. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne.
Soric, Branko. (1989). Statistical “discoveries” and effect-size estimation. Journal of the American Statistical Association 84(406), 608–610.