Talk (the first 4 minutes were not recorded; it starts right away with my homage to Jacob Cohen).

The first part of the talk discusses the problems with Fisher’s approach to significance testing and the practice in psychology of publishing only significant results. I then discuss Neyman and Pearson’s alternative approach, statistical power, and Cohen’s seminal meta-analysis of power in social/abnormal psychology. I then point out that questionable research practices must have been used to publish 95% significant results with only 50% power.

The second part of the talk discusses Soric’s insight that we can estimate the false discovery risk based on the discovery rate. I discuss the Open Science Collaboration project as one way to estimate the discovery rate (pretty high for within-subject cognitive psychology, terribly low for between-subject social psychology), but point out that it doesn’t tell us about clinical psychology. I then introduce z-curve to estimate the discovery rate based on the distribution of significant p-values (converted into z-scores).
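Soric’s upper bound on the false discovery risk is simple enough to compute directly from the discovery rate. A minimal sketch (my own illustration, not code from the talk):

```python
def soric_fdr(discovery_rate, alpha=0.05):
    """Soric's (1989) upper bound on the false discovery risk.

    Assumes all discoveries were made with significance level `alpha`;
    the lower the discovery rate, the more non-significant attempts are
    implied, and the higher the maximum share of false discoveries.
    """
    if not 0 < discovery_rate <= 1:
        raise ValueError("discovery rate must be in (0, 1]")
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))

# A 14% discovery rate at alpha = .05 implies a false discovery risk
# of up to about one third.
print(round(soric_fdr(0.14), 2))
```

Note how quickly the bound grows: at a 100% discovery rate the risk is 0, but at a 5% discovery rate (no true effects at all) it reaches 1.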

In the empirical part, I show the z-curve for Positive Psychology Interventions that shows massive use of QRPs and a high false discovery risk.

I end with a comparison of the z-curve for the Journal of Abnormal Psychology in 2010 and 2020 that shows no change in research practices over time.

The discussion focused on changing the way we do research and what research we reward. I argue strongly against the implementation of alpha = .005 and for the adoption of Neyman and Pearson’s approach with pre-registration, which would allow researchers to study small populations (e.g., mental health issues in the African American community) with a higher false-positive risk to balance type-I and type-II errors.

I recorded a meeting with my research assistants who are coding articles to estimate the replicability of psychological research. It is unedited and raw, but you might find it interesting to listen to. Below I give a short description of the topics that were discussed starting from an explanation of effect sizes and ending with a discussion about the choice of a graduate supervisor.

1. Rant about Fisher’s approach to statistics that ignores effect sizes. – look for p < .05, and do a happy dance if you find it, now you can publish. – still the way statistics is taught to undergraduate students.

2. Explaining statistics starting with effect sizes. – unstandardized effect size (height difference between men and women in cm) – unstandardized effect sizes depend on the unit of measurement – to standardize effect sizes we divide by standard deviation (Cohen’s d)
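The standardization step in point 2 can be sketched in a few lines (the height numbers below are hypothetical, chosen only to illustrate the unit-free property of Cohen’s d):

```python
import math

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Standardized mean difference: raw difference divided by the pooled SD."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)

# Hypothetical example: a 13 cm height difference with SD = 7 cm.
# The raw difference depends on the unit (cm vs. inches), but d does not.
d = cohens_d(178, 165, 7, 7, 50, 50)
print(round(d, 2))
```

Converting the means and SDs to inches would change the raw difference but leave d unchanged, which is the point of standardizing.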

3. Why do/did social psychologists run studies with n = 20 per condition? – limited resources, small subject pool, statistics can be used with n = 20 ~ 30. – obvious that these sample sizes are too small after Cohen (1962) introduced power analysis – but some argued that low power is ok because it is more efficient to get significant results.

4. Simulation of social psychology: 50% of hypotheses are true, 50% are false, the effect size of true hypotheses is d = .4, and the sample size of studies is N = 20. – Analyzing the simulated results (with k = 200 studies) with z-curve 2.0. In this simulation, the true discovery rate is 14%. That is, 14% of the 200 studies produced a significant result. – Z-curve correctly estimates this discovery rate based on the distribution of the significant p-values, converted into z-scores. – If only significant results are published, the observed discovery rate is 100%, but the true discovery rate is only 14%. – Publication bias leads to false confidence in published results. – Selective publication is wasteful because we are discarding useful information.
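The simulation in point 4 can be reproduced along these lines. This is a sketch, not the original code; I assume n = 20 per condition (consistent with point 3), which yields a discovery rate close to the 14% reported in the talk:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k = 10_000       # many studies, so the discovery rate is estimated precisely
n = 20           # per condition (assumption; the talk says N = 20)
d_true = 0.4

significant = 0
for i in range(k):
    # 50% true hypotheses (d = .4), 50% false hypotheses (d = 0)
    delta = d_true if i % 2 == 0 else 0.0
    g1 = rng.normal(delta, 1, n)
    g2 = rng.normal(0, 1, n)
    _, p = stats.ttest_ind(g1, g2)
    significant += p < .05

print(f"discovery rate: {significant / k:.2f}")
```

True effects reach significance only about a quarter of the time at this sample size, and false hypotheses at the 5% alpha rate, which averages out to a discovery rate near 14%.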

5. Power analysis. – Fisher did not have power analysis. – Neyman and Pearson invented power analysis, but Fisher wrote the textbook for researchers. – We had 100 years to introduce students to power analysis, but it hasn’t happened. – Cohen wrote books about power analysis, but he was ignored. – Cohen suggested we should aim for 80% power (more is not efficient). – Think a priori about effect size to plan sample sizes. – Power analysis was ignored because it often implied very large samples (very hard to get participants in Germany with small subject pools). – no change because all p-values were treated as equal. p < .05 = truth. – Literature reviews and textbooks treat every published significant result as truth.
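Cohen-style planning in point 5 boils down to solving for n. A sketch using the normal approximation (the exact t-based sample size is slightly larger; this is my own illustration):

```python
from math import ceil
from scipy.stats import norm

def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group n for a two-sided, two-sample comparison,
    using the normal approximation underlying Cohen's tables."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# An effect of d = .4 needs roughly 99 participants per group for 80% power,
# a far cry from the traditional n = 20.
print(n_per_group(0.4))
```

This makes concrete why power analysis "often implied very large samples": halving the effect size quadruples the required n.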

6. Repeating simulation (50% true hypotheses, effect size d = .4) with 80% power, N = 200. – much higher discovery rate (58%) – much more credible evidence – z-curve makes it possible to distinguish between p-values from research with low or high discovery rate. – Will this change the way psychologists look at p-values? Maybe, but Cohen and others have tried to change psychology without success. Will z-curve be a game-changer?

7. Personalized p-values – P-values are being created by scientists. – Scientists have some control over the type of p-values they publish. – There are systemic pressures to publish more p-values based on low-powered studies. – But at some point, researchers get tenure. – nobody can fire you if you stop publishing – social media allow researchers to publish without censure from peers. – tenure also means you have a responsibility to do good research. – Researchers who are listed in the post with personalized p-values all have tenure. – Some researchers, like David Matsumoto, have a good z-curve. – Other researchers have way too many just-significant results. – The observed discovery rates of good and bad researchers are the same. – Z-curve shows that the significant results were produced very differently and differ in credibility and replicability; this could be a game changer if people care about it. – My own z-curve doesn’t look so good. 😦 – How can researchers improve their z-curve? – publish better research now – distance yourself from bad old research – So far, few people have distanced themselves from bad old work because there was no incentive to do so. – Now there is an incentive to do so, because researchers can increase the credibility of their good work. – some people may move up when we add the 2020 data. – hand-coding of articles will further improve the work.

8. Conclusion and Discussion – not all p-values are created equal. – working with undergraduates is easy because they are unbiased. – once you are in grad school, you have to produce significant results. – z-curve can help to avoid getting into labs that use questionable practices. – I was lucky to work in labs that cared about the science.

A naive model of science assumes that scientists are objective. That is, they derive hypotheses from theories, collect data to test these theories, and then report the results. In reality, scientists are passionate about theories and often want to confirm that their own theories are right. This leads to confirmation bias and the use of questionable research practices (QRPs, John et al., 2012; Schimmack, 2015). QRPs are defined as practices that increase the chances of the desired outcome (typically a statistically significant result) while at the same time inflating the risk of a false positive discovery. A simple QRP is to conduct multiple studies and to report only the results that support the theory.

The use of QRPs explains the astonishingly high rate of statistically significant results in psychology journals that is over 90% (Sterling, 1959; Sterling et al., 1995). While it is clear that this rate of significant results is too high, it is unclear how much it is inflated by QRPs. Given the lack of quantitative information about the extent of QRPs, motivated biases also produce divergent opinions about the use of QRPs by social psychologists. John et al. (2012) conducted a survey and concluded that QRPs are widespread. Fiedler and Schwarz (2016) criticized the methodology and their own survey of German psychologists suggested that QRPs are not used frequently. Neither of these studies is ideal because they relied on self-report data. Scientists who heavily use QRPs may simply not participate in surveys of QRPs or underreport the use of QRPs. It has also been suggested that many QRPs happen automatically and are not accessible to self-reports. Thus, it is necessary to study the use of QRPs with objective methods that reflect the actual behavior of scientists. One approach is to compare dissertations with published articles (Cairo et al., 2020). This method provided clear evidence for the use of QRPs, even though a published document could reveal their use. It is possible that this approach underestimates the use of QRPs because even the dissertation results could be influenced by QRPs and the supervision of dissertations by outsiders may reduce the use of QRPs.

With my colleagues, I developed a statistical method that can detect and quantify the use of QRPs (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020). Z-curve uses the distribution of statistically significant p-values to estimate the mean power of studies before selection for significance. This estimate predicts how many non-significant results were obtained in the search for the significant ones. This makes it possible to compute the estimated discovery rate (EDR). The EDR can then be compared to the observed discovery rate (ODR), which is simply the percentage of published results that are statistically significant. The bigger the difference between the ODR and the EDR, the more questionable research practices were used (see Schimmack, 2021, for a more detailed introduction).

I focus on social psychology because (a) I am a social/personality psychologist who is interested in the credibility of results in my field, and (b) social psychology has a large number of replication failures (Schimmack, 2020). Similar analyses are planned for other areas of psychology and other disciplines. I also focus on social psychology more than personality psychology because personality psychology is often more exploratory than confirmatory.

Method

I illustrate the use of z-curve to quantify the use of QRPs with the most extreme examples in the credibility rankings of social/personality psychologists (Schimmack, 2021). Figure 1 shows the z-value plot (ZVP) of David Matsumoto. To generate this plot, the test statistics from t-tests and F-tests were converted into exact p-values, which were then converted into the corresponding values on the standard normal distribution. As two-sided p-values are used, all z-scores are positive. However, because the curve is centered over the z-score that corresponds to the median power before selection for significance (and not zero, as when the null-hypothesis is true), the distribution can look relatively normal. The variance of the distribution will be greater than 1 when studies vary in statistical power.
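The conversion from test statistics to z-scores can be sketched as follows, using the z-curve convention of mapping the (two-sided) p-value back onto the standard normal distribution; the function names are my own:

```python
from scipy import stats

def t_to_z(t, df):
    """Convert a t-value into the z-score with the same two-sided p-value."""
    p = 2 * stats.t.sf(abs(t), df)
    return stats.norm.isf(p / 2)

def f_to_z(f, df1, df2):
    """Convert an F-value into the z-score with the same p-value
    (for df1 = 1, F = t**2 and the result matches t_to_z)."""
    p = stats.f.sf(f, df1, df2)
    return stats.norm.isf(p / 2)

# With large df a t-value is already nearly on the z metric; with small df
# the same t-value corresponds to a weaker z-score.
print(t_to_z(2.5, 1000))   # ~2.50
print(t_to_z(2.5, 10))     # ~2.15
```

Because every statistic is routed through its p-value, t-values and F-values with different degrees of freedom all end up on one common metric.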

The grey curve in Figure 1 shows the predicted distribution based on the observed distribution of z-scores that are significant (z > 1.96). In this case, the observed number of non-significant results is similar to the predicted number of non-significant results. As a result, the ODR of 78% closely matches the EDR of 79%.

Figure 2 shows the results for Shelly Chaiken. The first notable observation is that the ODR of 75% is very similar to Matsumoto’s ODR of 78%. Thus, if we simply count the number of significant and non-significant p-values, there is no difference between these two researchers. However, the z-value plot (ZVP) shows a dramatically different picture. The peak density is 0.3 for Matsumoto and 1.0 for Chaiken. As the maximum density of the standard normal distribution is .4, it is clear that the results in Chaiken’s articles are not from an actual sampling distribution. In other words, QRPs must have been used to produce too many just-significant results with p-values just below .05.

The comparison of the ODR and EDR shows a large discrepancy of 64 percentage points too many significant results (ODR = 75% minus EDR = 11%). This is clearly not a chance finding because the ODR falls well outside the 95% confidence interval of the EDR, 5% to 21%.

To examine the use of QRPs in social psychology, I computed the EDR and ODR for over 200 social/personality psychologists. Personality psychologists were excluded if they reported too few t-values and F-values. The actual values and additional statistics can be found in the credibility rankings (Schimmack, 2021). Here I used these data to examine the use of QRPs in social psychology.

Average Use of QRPs

The average ODR is 73.48% with a 95% confidence interval ranging from 72.67 to 74.29. The average EDR is 35.28% with a 95% confidence interval ranging from 33.14 to 37.43. The inflation due to QRPs is 38.20 percentage points, 95%CI = 36.10 to 40.30. This difference is highly significant, t(221) = 35.89, p < 2.2e-16 (too many zeros behind the decimal for R to give an exact value).

It is of course not surprising that QRPs have been used. More important is the effect size estimate. The results suggest that QRPs inflate the discovery rate by over 100%. This explains why unbiased replication studies in social psychology have only a 25% chance of being significant (Open Science Collaboration, 2015). In fact, we can use the EDR as a conservative predictor of replication outcomes (Bartos & Schimmack, 2020). While the EDR of 35% is a bit higher than the actual replication rate, this may be due to the inclusion of non-focal hypothesis tests in these analyses. Z-curve analyses of focal hypothesis tests typically produce lower EDRs. Fiedler and Schwarz, in contrast, failed to comment on the low replicability of social psychology. If social psychologists had not used QRPs, it would remain a mystery why their results are so hard to replicate.

In sum, the present results confirm that, on average, social psychologists heavily used QRPs to produce significant results that support their predictions. However, these averages mask differences between researchers like Matsumoto and Chaiken. The next analyses explore these individual differences between researchers.

Cohort Effects

I had no predictions about the effect of cohort on the use of QRPs. I conducted a Twitter poll that suggested a general intuition that the use of QRPs may not have changed over time, but there was a lot of uncertainty in these answers. Similar results were obtained in a Facebook poll in the Psychological Methods Discussion Group. Thus, the a priori hypothesis is a vague prior of no change.

The dataset includes different generations of researchers. I used the first publication listed in WebofScience to date researchers. The earliest date was 1964 (Robert S. Wyer). The latest date was 2012 (Kurt Gray). The histogram shows that researchers from the 1970s to 2000s were well-represented in the dataset.

There was a significant negative correlation between the ODR and cohort, r(N = 222) = -.25, 95%CI = -.12 to -.37, t(220) = 3.83, p = .0002. This finding suggests that over time the proportion of non-significant results increased. For researchers with the first publication in the 1970s, the average ODR was 76%, whereas it was 72% for researchers with the first publication in the 2000s. This is a modest trend. There are various explanations for this trend.

One possibility is that power decreased as researchers started looking for weaker effects. In this case, the EDR should also show a decrease. However, the EDR showed no relationship with cohort, r(N = 222) = -.03, 95%CI = -.16 to .10, t(220) = 0.48, p = .63. Thus, less power does not seem to explain the decrease in the ODR. At the same time, the finding that EDR does not show a notable, abs(r) < .2, relationship with cohort suggests that power has remained constant over time. This is consistent with previous examinations of statistical power in social psychology (Sedlmeier & Gigerenzer, 1989).

Although the ODR decreased significantly and the EDR did not decrease significantly, bias (ODR – EDR) did not show a significant relationship with cohort, r(N = 222) = -.06, 95%CI = -.19 to .07, t(220) = -0.94, p = .35, but the 95%CI allows for a slight decrease in bias that would be consistent with the significant decrease in the ODR.

In conclusion, there is a small, statistically significant decrease in the ODR, but the effect over the past 40 years is too small to have practical significance. The EDR and bias are not even statistically significantly related to cohort. These results suggest that research practices, including the use of questionable ones, have not changed notably since the beginning of empirical social psychology (Cohen, 1962; Sterling, 1959).

Achievement Motivation

Another possibility is that in each generation, QRPs are used more by researchers who are more achievement motivated (Janke et al., 2019). After all, the reward structure in science is based on the number of publications, and significant results are often needed to publish. In social psychology it is also necessary to present a package of significant results across multiple studies, which is nearly impossible without the use of QRPs (Schimmack, 2012). To examine this hypothesis, I correlated the EDR with researchers’ H-Index (as of 2/1/2021). The correlation was small, r(N = 222) = .10, 95%CI = -.03 to .23, and not significant, t(220) = 1.44, p = .15. This finding is only seemingly inconsistent with Janke et al.’s (2019) finding that self-reported QRPs were significantly correlated with self-reported ambition, r(217) = .20, p = .014. Both correlations are small and positive, suggesting that achievement motivated researchers may be slightly more likely to use QRPs. However, the evidence is by no means conclusive and the actual relationship is weak. Thus, there is no evidence to support that highly productive researchers with impressive H-indices achieved their success by using QRPs more than other researchers. Rather, they became successful in a field where QRPs are the norm. If the norms were different, they would have become successful following these other norms.

Impact

A common saying in science is that “extraordinary claims require extraordinary evidence.” Thus, we might expect stronger evidence for claims of time-reversed feelings (Bem, 2011) than for evidence that individuals from different cultures regulate their emotions differently (Matsumoto et al., 2008). However, psychologists have relied on statistical significance with alpha = .05 as a simple rule to claim discoveries. This is a problem because statistical significance is meaningless when results are selected for significance and replication failures with non-significant results remain unpublished (Sterling, 1959). Thus, psychologists have trusted an invalid criterion that does not distinguish between true and false discoveries. It is, however, possible that social psychologists used other information (e.g., gossip about replication failures at conferences) to focus on credible results and to ignore incredible ones. To examine this question, I correlated authors’ EDR with the number of citations in 2019. I used citation counts for 2019 because citation counts for 2020 are not yet final (the results will be updated with the 2020 counts). Using 2019 increases the chances of finding a significant relationship because replication failures over the past decade could have produced changes in citation rates.

The correlation between EDR and number of citations was statistically significant, r(N = 222) = .16, 95%CI = .03 to .28, t(220) = 2.39, p = .018. However, the lower limit of the 95% confidence interval is close to zero. Thus, it is possible that the real relationship is too small to matter. Moreover, the non-parametric correlation with Kendall’s tau was not significant, tau = .085, z = 1.88, p = .06. Thus, at present there is insufficient evidence to suggest that citation counts take the credibility of significant results into account. At present, p-values less than .05 are treated as equally credible no matter how they were produced.

Conclusion

There is general agreement that questionable research practices have been used to produce an unreal success rate of 90% or more in psychology journals (Sterling, 1959). However, there is less agreement about the amount of QRPs that are being used and the implications for the credibility of significant results in psychology journals (John et al., 2012; Fiedler & Schwarz, 2016). The problem is that self-reports may be biased because researchers are unable or unwilling to report the use of QRPs (Nisbett & Wilson, 1977). Thus, it is necessary to examine this question with alternative methods. The present study used a statistical method to compare the observed discovery rate with a statistically estimated discovery rate based on the distribution of significant p-values. The results showed that on average social psychologists have made extensive use of QRPs to inflate an expected discovery rate of around 35% to an observed discovery rate of 70%. Moreover, the estimated discovery rate of 35% is likely to be an inflated estimate of the discovery rate for focal hypothesis tests because the present analysis is based on focal and non-focal tests. This would explain why the actual success rate in replication studies is even lower than the estimated discovery rate of 35% (Open Science Collaboration, 2015).

The main novel contribution of this study was to examine individual differences in the use of QRPs. While the ODR was fairly consistent across researchers, the EDR varied considerably. However, this variation showed only very small relationships with a researcher’s cohort (first year of publication). This finding suggests that the use of QRPs varies more across research fields and other factors than over time. Additional analyses should explore predictors of the variation across researchers.

Another finding was that citations of authors’ work do not take credibility of p-values into account. Citations are influenced by popularity of topics and other factors and do not take the strength of evidence into account. One reason for this might be that social psychologists often publish multiple internal replications within a single article. This gives the illusion that results are robust and credible because it is very unlikely to replicate type-I errors. However, Bem’s (2011) article with 9 internal replications of time-reversed feelings showed that QRPs are also used to produce consistent results within a single article (Francis, 2012; Schimmack, 2012). Thus, number of significant results within an article or across articles is also an invalid criterion to evaluate the robustness of results.

In conclusion, social psychologists have conducted studies with low statistical power since the beginning of empirical social psychology. The main reason for this is the preference for between-subject designs that have low statistical power with small sample sizes of N = 40 participants and small to moderate effect sizes. Despite repeated warnings about the problems of selection for significance (Sterling, 1959) and the problems of small sample sizes (Cohen, 1962; Sedlmeier & Gigerenzer, 1989; Tversky & Kahneman, 1971), the practices have not changed since Festinger conducted his seminal study on dissonance with n = 20 per group. Over the past decades, social psychology journals have reported thousands of statistically significant results that are used in review articles, meta-analyses, textbooks, and popular books as evidence to support claims about human behavior. The problem is that it is unclear which of these significant results are true positives and which are false positives, especially if false positives are not just strictly nil-results, but also results with tiny effect sizes that have no practical significance. Without other reliable information, even social psychologists do not know which of their colleagues’ results are credible. Over the past decade, the inability to distinguish credible and incredible information has produced heated debates and a lack of confidence in published results. The present study shows that the general research practices of a researcher provide valuable information about credibility. For example, a p-value of .01 by a researcher with an EDR of 70% is more credible than a p-value of .01 by a researcher with an EDR of 15%. Thus, rather than stereotyping social psychologists based on the low replication rate in the Open Science Collaboration project, social psychologists should be evaluated based on their own research practices.

References

Cairo, A. H., Green, J. D., Forsyth, D. R., Behler, A. M. C., & Raldiris, T. L. (2020). Gray (Literature) Matters: Evidence of Selective Hypothesis Reporting in Social Psychological Research. Personality and Social Psychology Bulletin, 46(9), 1344–1362. https://doi.org/10.1177/0146167220903896

Janke, S., Daumiller, M., & Rudert, S. C. (2019). Dark pathways to achievement in science: Researchers’ achievement goals predict engagement in questionable research practices. Social Psychological and Personality Science, 10(6), 783–791. https://doi.org/10.1177/1948550618790227

Scientists have made a contribution when a phenomenon or a statistic is named after them. Thus, it is fair to say that Easterlin made a contribution to happiness research because researchers who write about income and happiness often mention his 1974 article “Does Economic Growth Improve the Human Lot? Some Empirical Evidence” (Easterlin, 1974).

To be fair, the article examines the relationship between income and happiness from three perspectives: (a) the correlation between income and happiness across individuals within nations, (b) the correlation of average incomes and average happiness across nations, and (c) the correlation between average income and average happiness within nations over time. A fourth perspective, namely the correlation between income and happiness within individuals over time, was not examined because no data were available in 1974.

Even for some of the other questions, the data were limited. Here I want to draw attention to Easterlin’s examination of correlations between nations’ wealth and well-being. He draws heavily on Cantril’s seminal contribution to this topic. Cantril (1965) not only developed a measure that can be used to compare well-being across nations, he also used this measure to compare the well-being of 14 nations (Cuba is not included in Table 1 because I did not have new data).

Cantril also correlated the happiness scores with a measure of nations’ wealth. The correlation was r = .5. Cantril also suggested that Cuba and the Dominican Republic were positive and negative outliers, respectively. Excluding these two nations increases the correlation to r = .7.

Easterlin took issue with these results.

“Actually the association between wealth and happiness indicated by Cantril’s international data is not so clear-cut. This is shown by a scatter diagram of the data (Fig. I). The inference about a positive association relies heavily on the observations for India and the United States. [According to Cantril (1965, pp. 130-131), the values for Cuba and the Dominican Republic reflect unusual political circumstances: the immediate aftermath of a successful revolution in Cuba and prolonged political turmoil in the Dominican Republic].

What is perhaps most striking is that the personal happiness ratings for 10 of the 14 countries lie virtually within half a point of the midpoint rating of 5, as is brought out by the broken horizontal lines in the diagram. While a difference of rating of only 0.2 is significant at the 0.05 level, nevertheless there is not much evidence, for these 10 countries, of a systematic association between income and happiness. The closeness of the happiness ratings implies also that a similar lack of association would be found between happiness and other economic magnitudes such as income inequality or the rate of change of income.

Nearly 50 years later, it is possible to revisit Easterlin’s challenge of Cantril’s claim that nations’ well-being is tied to their wealth with much better data from the Gallup World Poll. The Gallup World Poll used the same measure of well-being. However, it also provides a better measure of citizens’ wealth by asking for income. In contrast, GDP can be distorted and may not reflect the spending power of the average citizen very well. The data about well-being (World Happiness Report, 2020) and median per capita income (Gallup) are publicly available. All I needed to do was to compute the correlation and make a pretty graph.

The Pearson correlation between income and the ladder scores is r(126) = .75. The rank correlation is r(126) = .80, and the Pearson correlation between the log of income and the ladder scores is r(126) = .85. These results strongly support Cantril’s prediction based on his interpretation of the first cross-national study in the 1960s and refute Easterlin’s challenge that this correlation is merely driven by two outliers. Other researchers who analyzed the Gallup World Poll data also reported correlations of r = .8 and showed high stability of nations’ wealth and income over time (Zyphur et al., 2020).
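For readers who want to see why the log transform strengthens the correlation, here is a sketch with synthetic data (not the actual Gallup data; the parameters are made up to mimic a ladder score that is linear in log income):

```python
import numpy as np
from scipy import stats

# Synthetic illustration: 128 "nations" with log-normally distributed
# median incomes and ladder scores that are a linear function of log income.
rng = np.random.default_rng(1)
income = rng.lognormal(mean=9, sigma=1.2, size=128)
ladder = 4 + 0.8 * (np.log(income) - 9) + rng.normal(0, 0.5, 128)

r_raw, _ = stats.pearsonr(income, ladder)            # attenuated by skew
r_log, _ = stats.pearsonr(np.log(income), ladder)    # recovers the relationship
rho, _ = stats.spearmanr(income, ladder)             # rank correlation

print(f"raw: {r_raw:.2f}, log: {r_log:.2f}, rank: {rho:.2f}")
```

Because income is heavily right-skewed, the raw Pearson correlation understates the strength of a log-linear relationship; the log transform (or a rank correlation) removes that distortion, mirroring the .75 vs. .85 pattern above.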

Figure 2 also showed that Easterlin underestimated the range of well-being scores. Even ignoring additional factors like wars, income alone can move well-being from a 4 in one of the poorest countries in the world (Burundi) close to an 8 in one of the richest countries in the world (Norway). The graph also shows no evidence that Scandinavian countries have a happiness secret. The main reason for their high average well-being appears to be that median personal incomes are very high.

The main conclusion is that social scientists are often biased for a number of reasons. The bias is evident in Easterlin’s interpretation of Cantril’s data. The same anti-materialistic bias can be found in many other articles on this topic that claim that the benefits of wealth are limited.

To be clear, a log-function implies that the same amount of wealth buys more well-being in poor countries, but the graph shows no evidence that the benefits of wealth level off. It is also true that the relationship between GDP and happiness over time is more complicated. However, regarding cross-national differences the results are clear. There is a very strong relationship between wealth and well-being. Studies that do not control for this relationship may report spurious relationships that disappear when income is included as a predictor.

Furthermore, the focus on happiness ignores that wealth also buys longer lives. Thus, individuals in richer nations not only have happier lives; they also have more happy life years. The current Covid-19 pandemic further increases these inequalities.

In conclusion, one concern about subjective measures of well-being has been that individuals in poor countries may be happy with less and that happiness measures fail to reflect human suffering. This is not the case. Sustainable, global economic growth that raises per capita wealth remains a challenge to improve human well-being.

Please help out to improve this post. If you have conducted successful or unsuccessful replication studies of work done by Jens Forster, please share this information with me and I will add it to this blog post.

Jens Forster was a social psychologist from Germany. He was a rising star and on the way to receiving a prestigious 5 million Euro award from the Alexander von Humboldt Foundation (Retraction Watch, 2015). Then an anonymous whistleblower accused him of scientific misconduct. Under pressure, Forster returned the award without admitting to any wrongdoing.

He also was in transition to move from the University of Amsterdam to the University of Bochum. After a lengthy investigation, Forster was denied tenure and he is no longer working in academia (Science, 2016), despite the fact that an investigation by the German association of psychologists (DGP) did not conclude that he conducted fraud.

While the personal consequences for Forster are similar to those of Stapel, who admitted to fraud and left his tenured position, the effect on the scientific record is different. Stapel retracted over 50 articles that are no longer cited in large numbers. In contrast, Forster retracted only a few papers, and most of his articles are not flagged to readers as potentially fraudulent. We can see the differences in citation counts for Stapel and Forster.

Stapel’s citation counts peaked at 350 and are now down to 150 citations a year. Some of these citations are with co-authors and from papers that have been cleared as credible.

Citation counts for Forster peaked at 450. They also decreased by 200 citations to 250 citations a year, but we are also seeing an uptick of 100 citations in 2019. The question is whether this muted correction is due to Forster’s denial of wrongdoing or whether the articles that were not retracted actually are more credible.

The difficulty in proving fraud in social psychology is that social psychologists also used many questionable practices to produce significant results. These questionable practices have the same effect as fraud, but they were not considered unethical or illegal. Thus, there are two reasons why articles that have not been retracted may still lack credible evidence. First, it is difficult to prove fraud when authors do not confess. Second, even if no fraud was committed, the data may lack credible evidence because they were produced with questionable practices that are not considered data fabrication.

For readers of the scientific literature it is irrelevant whether incredible results (i.e., results with low credibility) were produced with fraud or with other methods. The only question is whether the published results provide credible evidence for the theoretical claims in an article. Fortunately, meta-scientists have made progress over the past decade in answering this question. One method relies on a statistical examination of an author’s published test statistics. Test statistics can be converted into p-values or z-scores so that they have a common metric (e.g., t-values can be compared to F-values). The higher the z-score, the stronger the evidence against the null-hypothesis. High z-scores are also difficult to obtain with questionable practices. Thus, they are either fraudulent or provide real evidence for a hypothesis (i.e. against the null-hypothesis).

I have published z-curve analyses of over 200 social/personality psychologists that show clear evidence of variation in research practices across researchers (Schimmack, 2021). I did not include Stapel or Forster in these analyses because doubts have been raised about their research practices. However, it is interesting to compare Forster’s z-curve plot to the plot of other researchers because it is still unclear whether anomalous statistical patterns in Forster’s articles are due to fraud or the use of questionable research practices.

The distribution of z-scores shows clear evidence that questionable practices were used because the observed discovery rate of 78% is much higher than the estimated discovery rate of 18% and the ODR is outside of the 95% CI of the EDR, 9% to 47%. An EDR of 18% places Forster at rank #181 in the ranking of 213 social psychologists. Thus, even if Forster did not conduct fraud, many of his published results are questionable.

The comparison of Forster with other social psychologists is helpful because humans are prone to overgeneralizing from salient examples, which is known as stereotyping. Fraud cases like Stapel and Forster have tainted the image of social psychology and undermined trust in social psychology as a science. The fact that Forster would rank very low in comparison to other social psychologists shows that he is not representative of research practices in social psychology. This does not mean that Stapel and Forster are bad apples and extreme outliers. The use of QRPs was widespread, but how much researchers used QRPs varied across researchers. Thus, we need to take an individual difference perspective and personalize credibility. The average z-curve plot for all social psychologists ignores that some researchers’ practices were much worse and others’ were much better. Thus, I argue against stereotyping social psychologists and in favor of evaluating each social psychologist based on their own merits. As much as all social psychologists acted within a reward structure that nearly rewarded Forster’s practices with a 5 million Euro prize, researchers navigated this reward structure differently. Hopefully, making research practices transparent can change the reward structure so that credibility gets rewarded.

Last update 4/9/2021 (includes 2020, expanded to 353 social/personality psychologists, minor corrections, added rank numbers for easy comparison)

Introduction

Since Fisher invented null-hypothesis significance testing, researchers have used p < .05 as a statistical criterion to interpret results as discoveries worthy of discussion (i.e., the null-hypothesis is false). Once published, these results are often treated as real findings even though alpha does not control the risk of false discoveries.

Statisticians have warned against the exclusive reliance on p < .05, but nearly 100 years after Fisher popularized this approach, it is still the most common way to interpret data. The main reason is that many attempts to improve on this practice have failed. The main problem is that a single statistical result is difficult to interpret. However, when individual results are interpreted in the context of other results, they become more informative. Based on the distribution of p-values it is possible to estimate the maximum false discovery rate (Bartos & Schimmack, 2020; Jager & Leek, 2014). This approach can be applied to the p-values published by individual authors to adjust the significance criterion so that the risk of false discoveries stays at a reasonable level, FDR < .05.

Researchers who mainly test true hypotheses with high power have a high discovery rate (many p-values below .05) and a low false discovery rate (FDR < .05). Figure 1 shows an example of a researcher who followed this strategy (for a detailed description of z-curve plots, see Schimmack, 2021).

We see that out of the 317 test-statistics retrieved from his articles, 246 were significant with alpha = .05. This is an observed discovery rate of 78%. We also see that this discovery rate closely matches the estimated discovery rate based on the distribution of the significant p-values, p < .05. The EDR is 79%. With an EDR of 79%, the maximum false discovery rate is only 1%. However, the 95%CI is wide and the lower bound of the CI for the EDR, 27%, allows for 14% false discoveries.

When the ODR matches the EDR, there is no evidence of publication bias. In this case, we can improve the estimates by fitting all p-values, including the non-significant ones. With a tighter CI for the EDR, we see that the 95%CI for the maximum FDR ranges from 1% to 3%. Thus, we can be confident that no more than 5% of the significant results with alpha = .05 are false discoveries. Readers can therefore continue to use alpha = .05 to look for interesting discoveries in Matsumoto’s articles.

Figure 3 shows the results for a different type of researcher who took a risk and studied weak effect sizes with small samples. This produces many non-significant results that are often not published. The selection for significance inflates the observed discovery rate, but the z-curve plot and the comparison with the EDR shows the influence of publication bias. Here the ODR is similar to Figure 1, but the EDR is only 11%. An EDR of 11% translates into a large maximum false discovery rate of 41%. In addition, the 95%CI of the EDR includes 5%, which means the risk of false positives could be as high as 100%. In this case, using alpha = .05 to interpret results as discoveries is very risky. Clearly, p < .05 means something very different when reading an article by David Matsumoto or Shelly Chaiken.

Rather than dismissing all of Chaiken’s results, we can try to lower alpha to reduce the false discovery rate. If we set alpha = .01, the FDR is 15%. If we set alpha = .005, the FDR is 8%. To get the FDR below 5%, we need to set alpha to .001.

A uniform criterion of FDR < 5% is applied to all researchers in the rankings below. For some this means no adjustment to the traditional criterion. For others, alpha is lowered to .01, and for a few even lower than that.

The rankings below are based on automatically extracted test-statistics from 40 journals (List of journals). The results should be interpreted with caution and treated as preliminary. They depend on the specific set of journals that were searched, the way results are being reported, and many other factors. The data are available (data.drop) and researchers can exclude articles or add articles and run their own analyses using the z-curve package in R (https://replicationindex.com/2020/01/10/z-curve-2-0/).

I am also happy to receive feedback about coding errors. I also recommend hand-coding articles to adjust alpha for focal hypothesis tests. This typically lowers the EDR and increases the FDR. For example, the automated method produced an EDR of 31% for Bargh, whereas hand-coding of focal tests produced an EDR of 12% (Bargh-Audit).

And here are the rankings. The results are fully automated and I was not able to cover up the fact that I placed only #139 out of 300 in the rankings. In another post, I will explain how researchers can move up in the rankings. Of course, one way to move up in the rankings is to increase statistical power in future studies. The rankings will be updated again when the 2021 data are available.

Despite their preliminary nature, I am confident that the results provide valuable information. Until now, all p-values below .05 have been treated as if they are equally informative. The rankings here show that this is not the case. While p = .02 can be informative for one researcher, p = .002 may still entail a high false discovery risk for another researcher.

Is there still something new to say about p-values? Yes, there is. Most discussions of p-values focus on a scenario where a researcher tests a new hypothesis, computes a p-value, and now has to interpret the result. The status quo follows Fisher’s 100-year-old approach of comparing the p-value to a value of .05. If the p-value is below .05 (two-sided), the inference is that the population effect size deviates from zero in the same direction as the observed effect in the sample. If the p-value is greater than .05 the results are deemed inconclusive.

This approach to the interpretation of the data assumes that we have no other information about our hypothesis or that we do not trust this information sufficiently to incorporate it in our inference about the population effect size. Over the past decade, Bayesian psychologists have argued that we should replace p-values with Bayes-Factors. The advantage of Bayes-Factors is that they can incorporate prior information to draw inferences from data. However, if no prior information is available, the use of Bayesian statistics may cause more harm than good. To use priors without prior information, Bayes-Factors are computed with generic, default priors that are not based on any information about a research question. Along with other problems of Bayes-Factors, this is not an appealing solution to the problem of p-values.

Here I introduce a new approach to the interpretation of p-values that has been called empirical Bayesian and has been successfully applied in genomics to control the field-wise false positive rate. That is, prior information does not rest on theoretical assumptions or default values, but rather on prior empirical information. The information that is used to interpret a new p-value is the distribution of prior p-values.

P-value distributions

Every study is a new study because it relies on a new sample of participants that produces sampling error that is independent of the previous studies. However, studies are not independent in other characteristics. A researcher who conducted a study with N = 40 participants is likely to have used similar sample sizes in previous studies. And a researcher who used N = 200 is also likely to have used larger sample sizes in previous studies. Researchers are also likely to use similar designs. Social psychologists, for example, prefer between-subject designs to better deceive their participants. Cognitive psychologists care less about deception and study simple behaviors that can be repeated hundreds of times within an hour. Thus, researchers who used a between-subject design are likely to have used a between-subject design in previous studies and researchers who used a within-subject design are likely to have used a within-subject design before. Researchers may also be chasing different effect sizes. Finally, researchers can differ in their willingness to take risks. Some may only test hypotheses that are derived from prior theories that have a high probability of being correct, whereas others may be willing to shoot for the moon. All of these consistent differences between researchers (i.e., sample size, effect size, research design) influence the unconditional statistical power of their studies, which is defined as the long-run probability of obtaining significant results, p < .05.

Over the past decade, in the wake of the replication crisis, interest in the distribution of p-values has increased dramatically. For example, one approach uses the distribution of significant p-values, which is known as p-curve analysis (Simonsohn et al., 2014). If p-values were obtained with questionable research practices when the null-hypothesis is true (p-hacking), the distribution of significant p-values is flat. Thus, if the distribution is monotonically decreasing from 0 to .05, the data have evidential value. Although p-curve analysis has been extended to estimate statistical power, simulation studies show that the p-curve algorithm is systematically biased when power varies across studies (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020).

As shown in simulation studies, a better way to estimate power is z-curve (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020). Here I show how z-curve analyses of prior p-values can be used to demonstrate that p-values from one researcher are not equal to p-values of other researchers when we take their prior research practices into account. By using this prior information, we can adjust the alpha level of individual researchers to take their research practices into account. To illustrate this use of z-curve, I first start with an illustration how different research practices influence p-value distributions.

Scenario 1: P-hacking

In the first scenario, we assume that a researcher only tests false hypotheses (i.e., the null-hypothesis is always true; Bem, 2011; Simonsohn et al., 2011). In theory, it would be easy to spot false positives because replication studies would produce 19 non-significant results for every significant one, and the significant ones would have different signs. However, questionable research practices lead to a pattern of results where only significant results in one direction are reported, which is the norm in psychology (Sterling, 1959, Sterling et al., 1995; Schimmack, 2012).

In a z-curve analysis, p-values are first converted into z-scores, z = -qnorm(p/2) with qnorm being the inverse normal function and p being a two-sided p-value. A z-curve plot shows the histogram of all z-scores, including non-significant ones (Figure 1).
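For readers who want to try this conversion themselves, here is a minimal sketch in Python rather than R; `statistics.NormalDist().inv_cdf` plays the role of `qnorm`:

```python
from statistics import NormalDist

def p_to_z(p: float) -> float:
    """Convert a two-sided p-value into an absolute z-score.

    Mirrors the R expression z = -qnorm(p/2): half of the p-value sits
    in each tail, so the inverse normal CDF of p/2 gives the (negative)
    tail cutoff, and the sign flip returns the positive z-score.
    """
    return -NormalDist().inv_cdf(p / 2)

# p = .05 corresponds to the familiar significance criterion z = 1.96
print(round(p_to_z(0.05), 2))   # 1.96
print(round(p_to_z(0.005), 2))  # 2.81
```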

Visual inspection of the z-curve plot shows that all 200 p-values are significant (on the right side of the criterion value z = 1.96). It also shows that the mode of the distribution is at the significance criterion. Most important, visual inspection shows a steep drop from the mode to the range of non-significant values. That is, while z = 1.96 is the most common value, z = 1.95 is never observed. This drop provides direct visual information that questionable research practices were used because normal sampling error cannot produce such dramatic changes in the distribution.

I am skipping the technical details how the z-curve model is fitted to the distribution of z-scores (Bartos & Schimmack, 2020). It is sufficient to know that the model is fitted to the distribution of significant z-scores with a limited number of model parameters that are equally spaced over the range of z-scores from 0 to 6 (7 parameters, z = 0, z = 1, z = 2, …. z = 6). The model gives different weights to these parameters to match the observed distribution. Based on these estimates, z-curve.2.0 computes several statistics that can be used to interpret single p-values that have been published or future p-values by the same researcher, assuming that the same research practices are used.

The most important statistic is the expected discovery rate (EDR), which corresponds to the average power of all studies that were conducted by a researcher. Importantly, the EDR is an estimate that is based on only the significant results, but it makes predictions about the number of non-significant results. In this example with 200 test statistics, the EDR is 7%. Of course, we know that it really is only 5% because the expected discovery rate for false hypotheses that are tested with alpha = .05 is 5%. However, sampling error can introduce biases in our estimates. Nevertheless, even with only 200 observations, the estimate of 7% is relatively close to 5%. Thus, z-curve tells us something important about the way these p-values were obtained. They were obtained in studies with very low power, close to the 5% rate that is expected when only false hypotheses are tested.

Z-curve uses bootstrapping to compute confidence intervals around the point estimate of the EDR. The 95%CI ranges from 5% to 18%. As the interval includes 5%, we cannot reject the hypothesis that all tests were false positives (which in this scenario is also the correct conclusion). At the upper end we can see that mean power is low, even if some true hypotheses are being tested.

The EDR can be used for two purposes. First, it can be used to examine the extent of selection for significance by comparing the EDR to the observed discovery rate (ODR; Schimmack, 2012). The ODR is simply the percentage of significant results that was observed in the sample of p-values. In this case, this is 200 out of 200 or 100%. The discrepancy between the EDR of 7% and 100% is large and 100% is clearly outside the 95%CI of the EDR. Thus, we have strong evidence that questionable research practices were used, which we know to be true in this simulation because the 200 tests were selected from a much larger sample of 4,000 tests.

Most important for the use of z-curve to interpret p-values is the ability to estimate the maximum False Discovery Rate (Soric, 1989). The false discovery rate is the percentage of significant results that are false positives or type-I errors. The false discovery rate is often confused with alpha, the long-run probability of making a type-I error. The significance criterion ensures that no more than 5% of all tests, significant and non-significant combined, produce false positive results. When we test 4,000 false hypotheses (i.e., the null-hypothesis is true), we are not going to have more than 5% (4,000 * .05 = 200) false positive results. This is true in general and it is true in this example. However, when only significant results are published, it is easy to make the mistake of assuming that no more than 5% of the published 200 results are false positives. This would be wrong because the 200 were selected to be significant and they are all false positives.
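A tiny simulation (a sketch, not the author's actual code) makes the distinction concrete: with 4,000 true null-hypotheses, alpha keeps the number of false positives near 200, but if only the significant results are published, every published result is a false positive.

```python
import random

random.seed(123)

# Under the null-hypothesis, two-sided p-values are uniform on [0, 1].
p_values = [random.random() for _ in range(4_000)]

significant = [p for p in p_values if p < .05]

# Alpha controls the rate among ALL tests: roughly 5% come out significant.
print(len(significant))  # close to 4,000 * .05 = 200

# But among the published (significant) results, every single one is a
# false positive, so the false discovery rate in the literature is 100%.
print(len(significant) / len(p_values))  # roughly .05
```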

The false discovery rate is the percentage of significant results that are false positives. It no longer matters whether non-significant results are published or not. We are only concerned with the population of p-values that are below .05 (z > 1.96). In our example, the question is how many of the 200 significant results could be false positives. Soric (1989) demonstrated that the EDR limits the number of false positive discoveries. The more discoveries there are, the lower is the risk that discoveries are false. Using a simple formula, we can compute the maximum false discovery rate from the EDR.

FDR = (1/EDR – 1)*(alpha/(1 – alpha)), which with alpha = .05 is (1/EDR – 1)*(.05/.95)

With an EDR of 7%, we obtained a maximum FDR of 68%. We know that the true FDR is 100%, thus, the estimate is too low. However, the reason is that sampling error can have dramatic effects on the FDR estimates when the EDR is low. With an EDR of 6%, the FDR estimate goes up to 82% and with an EDR estimate of 5% it is 100%. To take account of this uncertainty, we can use the 95%CI of the EDR to compute a 95%CI for the FDR estimate, 24% to 100%. Now we see that we cannot rule out that the FDR is 100%.
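Soric's bound is simple enough to sketch as a one-line function (with alpha as a parameter, since the same logic applies when alpha is lowered):

```python
def max_fdr(edr: float, alpha: float = .05) -> float:
    """Soric's (1989) upper bound on the false discovery rate.

    FDR_max = (1/EDR - 1) * (alpha / (1 - alpha)), capped at 1.0
    because a proportion cannot exceed 100%.
    """
    return min(1.0, (1 / edr - 1) * (alpha / (1 - alpha)))

print(round(max_fdr(0.06), 2))  # 0.82 -> the 82% in the text
print(round(max_fdr(0.05), 2))  # 1.0  -> 100%, all results may be false
```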

In short, scenario 1 introduced the use of p-value distributions to provide useful information about the risk that the published results are false discoveries. In this extreme example, we can dismiss the published p-values as inconclusive or as lacking in evidential value.

Scenario 2: The Typical Social Psychologist

It is difficult to estimate the typical effect size in a literature. However, a meta-analysis of meta-analyses suggested that the average effect size in social psychology is Cohen’s d = .4 (Richard et al., 2003). A smaller set of replication studies that did not select for significance estimated an effect size of d = .3 for social psychology (d = .2 for JPSP, d = .4 for Psych Science; Open Science Collaboration, 2015). The latter estimate may include an unknown number of hypotheses where the null-hypothesis is true and the true effect size is zero. Thus, I used d = .4 as a reasonable effect size for true hypotheses in social psychology (see also LeBel, Campbell, & Loving, 2017).

It is also known that a rule of thumb in experimental social psychology was to allocate n = 20 participants to a condition, resulting in a sample size of N = 40 in studies with two groups. In a 2 x 2 design, the main effect would be tested with N = 80. However, to keep this scenario simple, I used d = .4 and N = 40 for true effects. This affords 23% power to obtain a significant result.
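The 23% figure can be checked with a power calculation; the stdlib-only sketch below uses a normal approximation, which comes out slightly above the exact noncentral-t value of ~23%:

```python
from math import sqrt
from statistics import NormalDist

def two_sample_power(d: float, n_per_group: int, alpha: float = .05) -> float:
    """Approximate power of a two-sided, two-sample test of Cohen's d.

    Normal approximation: the noncentrality parameter is d * sqrt(n/2)
    for equal group sizes; the exact noncentral-t value is a bit lower.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)    # 1.96 for alpha = .05
    ncp = d * sqrt(n_per_group / 2)      # noncentrality parameter
    # Probability of landing beyond the criterion in either tail.
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)

print(round(two_sample_power(0.4, 20), 2))   # ~0.24 (exact t-test: ~23%)
print(round(two_sample_power(0.4, 100), 2))  # ~0.81 with n = 100 per group
```

With n = 20 per group (N = 40) this matches the ~23% power in the text; with n = 100 per group (N = 200) it matches the 80%-power design that Cohen recommended.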

Finkel, Eastwick, and Reis (2017) argued that power of 25% is optimal if 75% of the hypotheses that are being tested are true. However, the assumption that 75% of hypotheses are true may be on the optimistic side. Wilson and Wixted (2018) suggested that the false discovery risk is closer to 50%. With 23% power for true hypotheses, the implied false discovery rate depends on the proportion of true hypotheses that are being tested. Given uncertainty about the actual false discovery rate in social psychology, I used a scenario with 50% true and 50% false hypotheses.

I kept the number of significant results at 200. To obtain 200 significant results with an equal number of true and false hypotheses, we need 1,428 tests. The 714 true hypotheses contribute 714*.23 = 164 true positives and the 714 false hypotheses produce 714*.05 = 36 false positive results; 164 + 36 = 200. This implies a false discovery rate of 36/200 = 18%. The true EDR is (714*.23+714*.05)/(714+714) = 14%.
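This bookkeeping is easy to verify with a few lines of arithmetic (a sketch of the scenario, not z-curve itself):

```python
n_true, n_false = 714, 714       # hypotheses tested in the scenario
power, alpha = 0.23, 0.05

true_pos = n_true * power        # ~164 significant true hypotheses
false_pos = n_false * alpha      # ~36 significant false hypotheses

# EDR: significant results as a share of all tests conducted.
edr = (true_pos + false_pos) / (n_true + n_false)
# FDR: false positives as a share of the significant results.
fdr = false_pos / (true_pos + false_pos)

print(round(edr, 2))  # 0.14 -> the true expected discovery rate
print(round(fdr, 2))  # 0.18 -> the true false discovery rate
```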

The z-curve plot looks very similar to the previous plot, but they are not identical. Although the EDR estimate is higher, its 95%CI still includes 5%. The maximum FDR is well above the actual FDR of 18%, but the 95%CI includes the actual value of 18%.

A notable difference between Figure 1 and Figure 2 is the expected replication rate (ERR), which corresponds to the average power of significant p-values. It is called the expected replication rate (ERR) because it predicts the percentage of significant results if the studies that were selected for significance were replicated exactly (Brunner & Schimmack, 2020). When power is heterogeneous, power of the studies with significant results is higher than power of studies with non-significant results (Brunner & Schimmack, 2020). In this case, with only two power values, the reason is that false positives have a much lower chance to be significant (5%) than true positives (23%). As a result, the average power of significant studies is higher than the average power of all studies. In this simulation, the true average power of significant studies is the weighted average of true and false positives with significant results, (164*.23 +36*.05)/(164+36) = 20%. Z-curve perfectly estimated this value.
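The weighted average can be sketched directly from the scenario's counts:

```python
# Among the 200 significant results in this scenario: ~164 true positives
# that replicate with 23% power, and ~36 false positives that have only
# a 5% chance of producing a significant result again.
true_pos, false_pos = 164, 36
power_true, power_false = 0.23, 0.05

# ERR = power averaged over the *significant* studies only, which is why
# it exceeds the 14% EDR computed over all 1,428 tests.
err = (true_pos * power_true + false_pos * power_false) / (true_pos + false_pos)
print(round(err, 2))  # 0.2, i.e. the 20% replication rate in the text
```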

Importantly, the 95% CI of the ERR, 11% to 34%, does not include zero. Thus, we can reject the null-hypothesis that all of the significant results are false positives based on the ERR. In other words, the significant results have evidential value. However, we do not know the composition of this average. It could be a large percentage of false positives and a few true hypotheses with high power, or it could be many true positives with low power. We also do not know which of the 200 significant results is a true positive or a false positive. Thus, we would need to conduct replication studies to distinguish between true and false hypotheses. And given the low power, we would only have a 23% chance of successfully replicating a true positive result. This is exactly what happened with the reproducibility project. And the inconsistent results led to debates and require further replications. Thus, we have real-world evidence of how uninformative p-values are when they are obtained this way.

Social psychologists might argue that the use of small samples is justified because most hypotheses in psychology are true. Thus, we can use prior information to assume that significant results are true positives. However, this logic fails when social psychologists test false hypotheses. In this case, the observed distribution of p-values (Figure 1) is not that different from the distribution that is observed when most significant results are true positives that were obtained with low power (Figure 2). Thus, it is doubtful that this is really an optimal use of resources (Finkel et al., 2015). However, until recently this was the way experimental social psychologists conducted their research.

Scenario 3: Cohen’s Way

In 1962 (!), Cohen conducted a meta-analysis of statistical power in social psychology. The main finding was that studies had only a 50% chance to get significant results with a median effect size of d = .5. Cohen (1988) also recommended that researchers should plan studies to have 80% power. However, this recommendation was ignored.

To achieve 80% power with d = .4, researchers need N = 200 participants. Thus, the number of studies is reduced from 5 studies with N = 40 to one study with N = 200. As Finkel et al. (2017) point out, we can make more discoveries with many small studies than with a few large ones. However, this ignores that the results of the small studies are difficult to replicate. This was not a concern when social psychologists did not bother to test whether their discoveries are false discoveries or whether they can be replicated. The replication crisis shows the problems of this approach. Now we have results from decades of research that produced significant p-values without providing any information whether these significant results are true or false discoveries.

Scenario 3 examines what social psychology would look like today, if social psychologists had listened to Cohen. The scenario is the same as in the second scenario, including publication bias. There are 50% false hypotheses and 50% true hypotheses with an effect size of d = .4. The only difference is that researchers used N = 200 to test their hypotheses to achieve 80% power.

With 80% power, we need 470 tests (compared to 1,428 in Scenario 2) to produce 200 significant results, 235*.80 + 235*.05 = 188 + 12 = 200. Thus, the EDR is 200/470 = 43%. The true false discovery rate is 12/200 = 6%. The expected replication rate is (188*.80 + 12*.05)/200 = 76%. Thus, we see that higher power increases replicability from 20% to 76% and lowers the false discovery rate from 18% to 6%.
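The same bookkeeping as in Scenario 2, now with 80% power (a sketch using the counts from the text):

```python
n_true = n_false = 235           # hypotheses tested with N = 200 each
power, alpha = 0.80, 0.05

true_pos = n_true * power        # 188 significant true hypotheses
false_pos = n_false * alpha      # ~12 significant false hypotheses

edr = (true_pos + false_pos) / (n_true + n_false)
fdr = false_pos / (true_pos + false_pos)
err = (true_pos * power + false_pos * alpha) / (true_pos + false_pos)

print(round(edr, 3))  # 0.425 -> the ~43% discovery rate in the text
print(round(fdr, 2))  # 0.06  -> down from 18% in Scenario 2
print(round(err, 2))  # 0.76  -> up from 20% in Scenario 2
```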

Figure 3 shows the z-curve plot. Visual inspection shows that Figure 3 looks very different from Figures 1 and 2. The estimates are also different. In this example, sampling error inflated the EDR estimate to 58%, but the 95%CI includes the true value of 43%. The 95%CI does not include the ODR. Thus, there is evidence for publication bias, which is also visible in the steep drop of the distribution at 1.96.

Even with a low EDR of 20%, the maximum FDR is only 21%. Thus, we can conclude with confidence that at least 79% of the significant results are true positives. Remember, in the previous scenario, we could not rule out that most results are false positives. Moreover, the estimated replication rate is 73%, which underestimates the true replication rate of 76%, but the 95%CI includes the true value, 95%CI = 61% – 84%. Thus, if these studies were replicated, we would have a high success rate for actual replication studies.

Just imagine for a moment what social psychology might look like in a parallel universe where social psychologists followed Cohen’s advice. Why didn’t they? The reason is that they did not have z-curve. All they had was p < .05, and using p < .05, all three scenarios are identical. All three scenarios produced 200 significant results. Moreover, as Finkel et al. (2015) pointed out, smaller samples produce 200 significant results quicker than large samples. An additional advantage of small samples is that they inflate point estimates of the population effect size. Thus, the social psychologists with the smallest samples could brag about the biggest (illusory) effect sizes as long as nobody was able to publish replication studies with larger samples that deflated effect sizes of d = .8 to d = .08 (Joy-Gaba & Nosek, 2010).

This game is over, but social psychology – and other social sciences – have published thousands of significant p-values, and nobody knows whether they were obtained using scenario 1, 2, or 3, or probably a combination of these. This is where z-curve can make a difference. P-values are no longer equal when they are considered as a data point from a p-value distribution. In scenario 1, a p-value of .01 and even a p-value of .001 has no meaning. In contrast, in scenario 3 even a p-value of .02 is meaningful and more likely to reflect a true positive than a false positive result. This means that we can use z-curve analyses of published p-values to distinguish between probably false and probably true positives.

I illustrate this with three concrete examples from a project that examined the p-value distributions of over 200 social psychologists (Schimmack, in preparation). The first example has the lowest EDR in the sample. The EDR is 11% and because there are only 210 tests, the 95%CI is wide and includes 5%.

The maximum FDR estimate is high at 41% and the 95%CI includes 100%. This suggests that we cannot rule out the hypothesis that most significant results are false positives. However, the replication rate is 57% and the 95%CI, 45% to 69%, does not include 5%. Thus, some tests tested true hypotheses, but we do not know which ones.

Visual inspection of the plot shows a different distribution than Figure 2. There are more just-significant p-values (z = 2.0 to 2.2) and more large z-scores (z > 4). This shows more heterogeneity in power. A comparison of the ODR with the EDR shows that the ODR falls outside the 95%CI of the EDR. This is evidence of publication bias or the use of questionable research practices. One solution to the presence of publication bias is to lower the criterion for statistical significance. As a result, the large number of just-significant results is no longer significant and the ODR decreases. This is a post-hoc correction for publication bias. For example, we can lower alpha to .005.

As expected, the ODR decreases considerably, from 70% to 39%. In contrast, the EDR increases. The reason is that many questionable research practices produce a pile of just-significant p-values. As these values are no longer used to fit the z-curve, the model predicts far fewer non-significant p-values. The model now underestimates the frequency of z-scores between 2 and 2.2, but these values do not seem to come from a sampling distribution; rather, they stick out like a tower. By excluding them, the p-values that are still significant with alpha = .005 look more credible. Thus, we can correct for the use of QRPs by lowering alpha and examining whether the remaining p-values produced interesting discoveries. At the same time, we can set aside the p-values between .05 and .005 and await replication studies to show whether these hypotheses receive empirical support.

The second example was picked because it was close to the median EDR (33%) and ERR (66%) in the sample of 200 social psychologists.

The larger sample of tests (k = 1,529) helps to obtain more precise estimates. A comparison of the ODR, 76%, with the 95%CI of the EDR, 12% to 48%, shows that publication bias is present. However, with an EDR of 33%, the maximum FDR is only 11% and the upper limit of its 95%CI is 39%. Thus, we can conclude with confidence that fewer than 50% of the significant results are false positives, but numerous findings might still be false positives, and we do not know which ones. Only replication studies can provide this information.

In this example, lowering alpha to .005 did not align the ODR and the EDR. This suggests that these values come from a sampling distribution in which non-significant results were not published. Thus, there is no simple fix by adjusting the significance criterion. In this situation, we can conclude that the published p-values are unlikely to be false positives, but replication studies are still needed to confirm individual results.

The third example is the social psychologist with the highest EDR. In this case, the EDR is actually a little lower than the ODR, suggesting that there is no publication bias. The high EDR also means that the maximum FDR is very small; even the upper limit of its 95%CI is only 7%.

Another advantage of data without publication bias is that it is not necessary to exclude non-significant results from the analysis. Fitting the model to all p-values produces much tighter estimates of the EDR and the maximum FDR.

The upper limit of the 95%CI for the FDR is now 4%. Thus, we conclude that no more than 5% of the p-values less than .05 are false positives. Even p = .02 is unlikely to be a false positive. Finally, the estimated replication rate is 84% with a tight confidence interval ranging from 78% to 90%. Thus, most of the published p-values are expected to replicate in an exact replication study.

I hope these examples make it clear how useful it can be to evaluate single p-values with prior information about the p-value distribution of a lab. As labs differ in their research practices, significant p-values also differ in their evidential value. Only if we ignore the research context does p = .02 equal p = .02. Once we see the broader distribution, a p-value of .02 can provide stronger evidence against the null-hypothesis than a p-value of .002.

Implications

Cohen tried and failed to change the research culture of social psychologists. Meta-psychological articles have puzzled over why meta-analyses of power failed to increase power (Maxwell, 2004; Schimmack, 2012; Sedlmeier & Gigerenzer, 1989). Finkel et al. (2015) provided an explanation. In a game where the winner publishes as many significant results as possible, the optimal strategy is to conduct as many studies as possible with low power. This strategy continues to be rewarded in psychology, where jobs, promotions, grants, and pay raises are based on the number of publications. Cohen (1990) said less is more, but that is not true in a science that does not self-correct and treats every p-value less than .05 as a discovery.

To improve psychology as a science, we need to change the incentive structure, and author-wise z-curve analyses can do this. Rather than using p < .05 (or p < .005) as a general rule to claim discoveries, claims of discoveries can be adjusted to the research practices of a researcher. As demonstrated here, this will reward researchers who follow Cohen's rules and punish those who use questionable practices to produce p-values less than .05 (or Bayes factors > 3) without evidential value. And maybe, one day there will be a badge for credible p-values.


Ten years ago, a stunning article by Bem (2011) triggered a crisis of confidence about psychology as a science. The article presented nine studies that seemed to show time-reversed causal effects of subliminal stimuli on human behavior. Hardly anybody believed the findings, but everybody wondered how Bem was able to produce significant results for effects that do not exist. This triggered a debate about research practices in social psychology.

Over the past decade, most articles on the replication crisis in social psychology pointed out problems with existing practices, but some articles tried to defend the status quo (cf. Schimmack, 2020).

Finkel, Eastwick, and Reis (2015) contributed to the debate with a plea to balance false positives and false negatives.

I argue that the main argument in this article is deceptive, but before I do so it is important to elaborate a bit on the use of the word deceptive. Psychologists make a distinction between self-deception and other-deception. Other-deception is easy to explain. For example, a politician may spread a lie for self-gain knowing full well that it is a lie. The meaning of self-deception is also relatively clear. Here individuals are spreading false information because they are unaware that the information is false. The main problem for psychologists is to distinguish between self-deception and other-deception. For example, it is unclear whether Donald Trump's and his followers' defence mechanisms are so strong that they really believe the election was stolen without any evidence to support this belief, or whether he is merely using a lie for political gain. Similarly, it is unclear whether Finkel et al. were deceiving themselves when they characterized the research practices of relationship researchers as an error-balanced approach. However, the distinction between self-deception and other-deception is irrelevant here: self-deception also leads to the spreading of misinformation that needs to be corrected.

In short, my main thesis is that Finkel et al. misrepresent research practices in psychology and that they draw false conclusions about the status quo and the need for change based on a false premise.

Common Research Practices in Psychology

Psychological research practices follow a number of simple steps.

1. Researchers formulate a hypothesis that two variables are related (e.g., height is related to weight; dieting leads to weight loss).

2. They find ways to measure or manipulate a potential causal factor (height, dieting) and find a way to measure the effect (weight).

3. They recruit a sample of participants (e.g., N = 40).

4. They compute a statistic that reflects the strength of the relationship between the two variables (e.g., height and weight correlate r = .5).

5. They determine the amount of sampling error given their sample size.

6. They compute a test-statistic (t-value, F-value, z-score) that reflects the ratio of the effect size to the amount of sampling error (e.g., for r(40) = .5, t(38) = 3.56).

7. They use the test-statistic to decide whether the relationship in the sample (e.g., r = .5) is strong enough to reject the nil-hypothesis that the relationship in the population is zero (p = .001).
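Steps 5 to 7 can be sketched numerically for the running example. Python is used here purely for illustration (the analyses discussed in this post were done in R):

```python
import math

# Running example from the list above: a correlation of r = .5 with N = 40
n, r = 40, 0.5

se = math.sqrt((1 - r**2) / (n - 2))  # step 5: sampling error of r
t = r / se                            # step 6: test statistic with df = n - 2
print(round(t, 2))  # 3.56, matching the example in step 6
```

With df = 38, this t-value corresponds to the two-tailed p = .001 used in step 7.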

The important question is what researchers do after they compute a p-value. Here critics of the status quo (the evidential value movement) and Finkel et al. make divergent assumptions.

The Evidential Value Movement

The main assumption of the EVM is that psychologists, including relationship researchers, have interpreted p-values incorrectly. For the most part, the use of p-values in psychology follows Fisher’s original suggestion to use a fixed criterion value of .05 to decide whether a result is statistically significant. In our example of a correlation of r = .5 with N = 40 participants, a p-value of .001 is below .05 and therefore it is sufficiently unlikely that the correlation could have emerged by chance if the real correlation between height and weight was zero. We therefore can reject the nil-hypothesis and infer that there is indeed a positive correlation.

However, if a correlation is not significant (e.g., r = .2, p > .05), the results are inconclusive because we cannot infer from a non-significant result that the nil-hypothesis is true. This creates an asymmetry in the value of significant results. Significant results can be used to claim a discovery (a diet produces weight loss), but non-significant results cannot be used to claim that there is no relationship (a diet has no effect on weight).

This asymmetry explains why most published results in psychology are significant (Sterling, 1959; Sterling et al., 1995). As significant results are more conclusive, journals found it more interesting to publish studies with significant results.

As Sterling (1959) pointed out, if only significant results are published, statistical significance no longer provides valuable information, and as Rosenthal (1979) warned, in theory journals could be filled with significant results even if most results are false positives (i.e., the nil-hypothesis is actually true).

Importantly, Fisher did not prescribe to do studies only once and to publish only significant results. Fisher clearly stated that results should only be considered credible if replication studies confirm the original results most of the time (say, 8 out of 10 replication studies also produce p < .05). However, this important criterion of credibility was ignored by social psychologists, especially in resource-intensive research areas like relationship research.

To conclude, the main concern among critics of research practices in psychology is that selective publishing of significant results produces results that have a high risk of being false positives (cf. Schimmack, 2020).

The Error Balanced Approach

Although Finkel et al. (2015) do not mention Neyman and Pearson, their error-balanced approach is rooted in Neyman and Pearson's approach to the interpretation of p-values. This approach is rather different from Fisher's, and it is well documented that Fisher and Neyman-Pearson were in a bitter fight over this issue. Neyman and Pearson introduced the distinction between type-I errors, also called false positives, and type-II errors, also called false negatives.

The type-I error is the same error that one could make in Fisher's approach: a significant result, p < .05, is falsely interpreted as evidence for a relationship when there is no relationship between the two variables in the population and the observed relationship was produced by sampling error alone.

So, what is a type-II error? It only occurred to me yesterday that most explanations of type-II errors are based on a misunderstanding of Neyman-Pearson's approach. A simplistic explanation of a type-II error is the inference that there is no relationship when a relationship actually exists. For example, a type-II error would be a pregnancy test that indicates a pregnant woman is not pregnant.

This explains conceptually what a type-II error is, but it does not explain how psychologists could ever make a type-II error. To actually make type-II errors, researchers would have to approach research entirely differently than psychologists actually do. Most importantly, they would need to specify a theoretically expected effect size. For example, researchers could test the nil-hypothesis that a relationship between height and weight is r = 0 against the alternative hypothesis that the relationship is r = .4. They would then need to compute the probability of obtaining a non-significant result under the assumption that the correlation is r = .4. This probability is known as the type-II error probability (beta). Only then can a non-significant result be used to reject the alternative hypothesis that the effect size is .4 or larger with a pre-determined error rate beta. If this suddenly sounds very unfamiliar, the reason is that neither training nor published articles follow this approach. Thus, psychologists never make type-II errors, because they never specify a priori effect sizes and never use p-values greater than .05 to infer that population effect sizes are smaller than a specified effect size.
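To make this concrete, here is one way beta could actually be computed for the example above (r = 0 tested against r = .4 with N = 40). The Fisher r-to-z approximation is my choice for this sketch; the text does not prescribe a particular method:

```python
import math
from statistics import NormalDist

def beta_for_correlation(r_alt: float, n: int, alpha: float = 0.05) -> float:
    """Type-II error probability (beta) for testing r = 0 against a
    specified alternative r = r_alt, via the Fisher r-to-z approximation."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)         # two-tailed significance criterion
    nc = math.atanh(r_alt) * math.sqrt(n - 3)  # expected z-score under the alternative
    power = (1 - nd.cdf(z_crit - nc)) + nd.cdf(-z_crit - nc)
    return 1 - power

beta = beta_for_correlation(0.4, 40)  # ≈ .27, i.e., power of only ≈ 73%
```

Only with such an a priori computation could a non-significant result be used to reject the alternative that the correlation is .4 or larger, with a known error rate of about 27%.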

However, psychologists often seem to believe that they are following Neyman-Pearson because statistics is often taught as a convoluted, incoherent mishmash of the two approaches (Gigerenzer, 1993). It also seems that Finkel et al. (2015) falsely assumed that psychologists follow Neyman-Pearson’s approach and carefully weight the risks of type-I and type-II errors. For example, they write

Psychological scientists typically set alpha (the theoretical possibility of a false positive) at .05, and, following Cohen (1988), they frequently set beta (the theoretical possibility of a false negative) at .20.

It is easy to show that this is not the case. To set the probability of a type-II error at 20%, psychologists would need to specify an effect size that gives them an 80% probability (power) to reject the nil-hypothesis, and they would then report the results with the conclusion that the population effect size is less than their a priori specified effect size. I have read more than 1,000 research articles in psychology and I have never seen an article that followed this approach. Moreover, it has been noted repeatedly that sample sizes are determined on an ad hoc basis with little concerns about low statistical power (Cohen, 1962; Sedlmeier & Gigerenzer, 1989; Schimmack, 2012; Sterling et al., 1995). Thus, the claim that psychologists are concerned about beta (type-II errors) is delusional, even if many psychologists believe it.

Finkel et al. (2015) suggest that an optimal approach to research would balance the risk of false positive results with the risk of false negative results. However, once more they ignore that false negatives can only be specified with clearly specified effect sizes.

Estimates of false positive and false negative rates in situations like these would go a long way toward helping scholars who work with large datasets to refine their confirmatory and exploratory hypothesis testing practices to optimize the balance between false-positive and false-negative error rates.

Moreover, they are blissfully unaware that false positive rates are abstract entities because it is practically impossible to verify that the relationship between two variables in a population is exactly zero. Thus, neither false positives nor false negatives are clearly defined and therefore cannot be counted to compute rates of their occurrences.

Without any information about the actual rate of false positives and false negatives, it is of course difficult to say whether current practices produce too many false positives or false negatives. A simple recommendation would be to increase sample sizes because higher statistical power reduces the risk of false negatives and the risk of false positives. So, it might seem like a win-win. However, this is not what Finkel et al. considered to be best practices.

“As discussed previously, many policy changes oriented toward reducing false-positive rates will exacerbate false-negative rates”

This statement is blatantly false and ignores recommendations to test fewer hypotheses in larger samples (Cohen, 1990; Schimmack, 2012).

They further make unsupported claims about the difficulty of correcting false positive results and false negative results. The evidential value critics have pointed out that current research practices in psychology make it practically impossible to correct a false positive result. Classic findings that failed to replicate are often cited and replications are ignored. The reason is that p < .05 is treated as strong evidence, whereas p > .05 is treated as inconclusive, following Fisher’s approach. If p > .05 was considered evidence against a plausible hypothesis, there would be no reason not to publish it (e.g., a diet does not decrease weight by more than .3 standard deviations in a study with 95% power, p < .05).

We are especially concerned about the evidentiary value movement’s relative neglect of false negatives because, for at least two major reasons, false negatives are much less likely to be the subject of replication attempts. First, researchers typically lose interest in unsuccessful ideas, preferring to use their resources on more “productive” lines of research (i.e., those that yield evidence for an effect rather than lack of evidence for an effect). Second, others in the field are unlikely to learn about these failures because null results are rarely published (Greenwald, 1975). As a result, false negatives are unlikely to be corrected by the normal processes of reconsideration and replication. In contrast, false positives appear in the published literature, which means that, under almost all circumstances, they receive more attention than false negatives. Correcting false positive errors is unquestionably desirable, but the consequences of increasingly favoring the detection of false positives relative to the detection of false negatives are more ambiguous.

This passage makes no sense. As the authors themselves acknowledge, the key problem with existing research practices is that non-significant results are rarely published (“because null results are rarely published”). In combination with low statistical power to detect small effect sizes, this selection implies that researchers will often obtain non-significant results that are not published. However, it also means that published significant results often inflate the effect size, because the true population effect size alone is too weak to produce a significant result. Only with the help of sampling error is the observed relationship strong enough to be significant. So, many correlations that are truly r = .2 will be published as correlations of r = .5. The risk of false negatives is also reduced by publication bias. Because researchers do not know that a hypothesis was tested and produced a non-significant result, they will try again. Eventually, a study will produce a significant result (green jelly beans cause acne, p < .05), and the effect size estimate will be dramatically inflated. When follow-up studies fail to replicate this finding, these replication results are again not published because non-significant results are considered inconclusive. This means that current research practices in psychology never produce type-II errors, only type-I errors, and type-I errors are not corrected. This fundamentally flawed approach to science has created the replication crisis.
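The inflation mechanism is easy to demonstrate with a small simulation. The Fisher-z approximation to the sampling distribution of r and the specific numbers are my own illustration, not taken from Finkel et al.:

```python
import math
import random

random.seed(1)
rho, n, reps = 0.2, 40, 100_000  # true correlation .2, typical sample size
se = 1 / math.sqrt(n - 3)        # sampling error on the Fisher-z scale
crit = 1.96 * se                 # two-tailed significance threshold

published = []
for _ in range(reps):
    zr = random.gauss(math.atanh(rho), se)  # one study's estimate
    if zr > crit:                           # only significant positive results are published
        published.append(math.tanh(zr))     # back-transform to a correlation

avg_published = sum(published) / len(published)
```

Selection for significance alone turns a true correlation of r = .2 into published estimates that average roughly r = .4, with some reaching .5 or more.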

In short, while evidential value critics and Finkel et al. agree that statistical significance drives editorial decisions, they draw fundamentally different conclusions from this practice. Finkel et al. falsely label non-significant results in small samples false negatives, but they are not false negatives in Neyman-Pearson's approach to significance testing. They are, however, inconclusive results, and the best practice to avoid inconclusive results is to increase statistical power and to specify type-II error probabilities for reasonable effect sizes.

Finkel et al. (2015) are less concerned about calls for higher statistical power. They are more concerned with the introduction of badges for materials sharing, data sharing, and preregistration as “quick-and-dirty indicator of which studies, and which scholars, have strong research integrity” (p. 292).

Finkel et al. (2015) might therefore welcome cleaner and more direct indicators of research integrity that my colleagues and I have developed over the past decade that are related to some of their key concerns about false negative and false positive results (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020, Schimmack, 2012; Schimmack, 2020). To illustrate this approach, I am using Eli J. Finkel’s published results.

I first downloaded published articles from major social and personality journals (Schimmack, 2020). I then converted these pdf files into text files and used R-code to find statistical results that were reported in the text. I then used a separate R-code to search these articles for the name “Eli J. Finkel,” excluding thank-you notes, and selected the subset of test statistics that appeared in publications by Eli J. Finkel. The extracted test statistics are available in the form of an Excel file (data). The file contains 1,638 usable test statistics (z-scores between 0 and 100).

A z-curve analysis first converts all published test statistics into p-values. The p-values are then converted into z-scores on a standard normal distribution. Because the sign of an effect does not matter, all z-scores are positive. The higher a z-score, the stronger is the evidence against the null-hypothesis. Z-scores greater than 1.96 (the red line in the plot) are significant with the standard criterion of p < .05 (two-tailed). Figure 1 shows a histogram of the z-scores between 0 and 6; 143 z-scores exceed the upper value. They are included in the calculations, but not shown.
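The p-to-z conversion is straightforward. A minimal sketch (z-curve itself is distributed as an R package; Python is used here only for illustration):

```python
from statistics import NormalDist

def p_to_z(p: float) -> float:
    """Convert a two-tailed p-value into an absolute z-score."""
    return NormalDist().inv_cdf(1 - p / 2)

print(round(p_to_z(0.05), 3))   # 1.96, the standard significance criterion (red line)
print(round(p_to_z(0.005), 3))  # 2.807, the criterion if alpha is lowered to .005
```

The sign is discarded because the direction of an effect does not matter for the analysis; only the strength of evidence does.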

The first notable observation in Figure 1 is that the peak (mode) of the distribution is just to the right side of the significance criterion. It is also visible that there are more results just to the right (p < .05) than to the left (p > .05) around the peak. This pattern is common and reflects the well-known tendency for journals to favor significant results.

The advantage of a z-curve analysis is that it is possible to quantify the amount of publication bias. To do so, we can compare the observed discovery rate with the expected discovery rate. The observed discovery rate is simply the percentage of published results that are significant. Finkel published 1,031 significant results out of 1,638 test statistics, an observed discovery rate of 63%.

The expected discovery rate is based on a statistical model. The statistical model is fitted to the distribution of significant results. To produce the distribution of significant results in Figure 1, we assume that they were selected from a larger set of tests that produced significant and non-significant results. Based on the mean power of these tests, we can estimate the full distribution before selection for significance. Simulation studies show that these estimates match simulated true values reasonably well (Bartos & Schimmack, 2020).

The expected discovery rate is 26%. This estimate implies that the average power of statistical tests conducted by Finkel is low. With over 1,000 significant test statistics, it is possible to obtain a fairly close confidence interval around this estimate, 95%CI = 11% to 44%. The confidence interval does not include 50%, showing that the average power is below 50%, which is often considered a minimum value for good science (Tversky & Kahneman, 1971). The 95% confidence interval also does not include the observed discovery rate of 63%. This shows the presence of publication bias. These results are by no means unique to Finkel. I was displeased to see that a z-curve analysis of my own articles produced similar results (ODR = 74%, EDR = 25%).

The EDR estimate is not only useful to examine publication bias. It can also be used to estimate the maximum false discovery rate (Soric, 1989). That is, although it is impossible to specify how many published results are false positives, it is possible to quantify the worst case scenario. Finkel’s EDR estimate of 26% implies a maximum false discovery rate of 15%. Once again, this is an estimate and it is useful to compute a confidence interval around it. The 95%CI ranges from 7% to 43%. On the one hand, this makes it possible to reject Ioannidis’ claim that most published results are false. On the other hand, we cannot rule out that some of Finkel’s significant results were false positives. Moreover, given the evidence that publication bias is present, we cannot rule out the possibility that non-significant results that failed to replicate a significant result are missing from the published record.
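Soric's bound is simple enough to verify directly. This sketch computes the maximum false discovery rate from a discovery rate and a significance criterion:

```python
def soric_max_fdr(discovery_rate: float, alpha: float = 0.05) -> float:
    """Soric's (1989) upper bound on the false discovery rate: in the worst
    case, all discoveries of true effects were made with 100% power, so the
    remaining tests must have been tests of true null-hypotheses."""
    return (1 / discovery_rate - 1) * alpha / (1 - alpha)

print(round(soric_max_fdr(0.26), 2))  # 0.15: an EDR of 26% implies a maximum FDR of 15%
```

The bound also shows why a discovery rate of 5% is the worst case: with alpha = .05, an EDR of 5% yields a maximum FDR of 100%.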

A major problem for psychologists is the reliance on p-values to evaluate research findings. Some psychologists even falsely assume that p < .05 implies that 95% of significant results are true positives. As we see here, the risk of false positives can be much higher, but significance does not tell us which p-values below .05 are credible. One solution to this problem is to focus on the false discovery rate as a criterion. This approach has been used in genomics to reduce the risk of false positive discoveries. The same approach can also be used to control the risk of false positives in other scientific disciplines (Jager & Leek, 2014).

To reduce the false discovery rate, we need to lower the criterion to declare a finding a discovery. A team of researchers suggested lowering alpha from .05 to .005 (Benjamin et al., 2017). Figure 2 shows the results if this criterion is used for Finkel's published results. We now see that the number of significant results is only 579, but that is still a lot of discoveries. We see that the observed discovery rate decreased to 35%. The reason is that many of the just-significant results with p-values between .05 and .005 are no longer considered to be significant. We also see that the expected discovery rate increased! This requires some explanation. Figure 2 shows that there is an excess of significant results between .05 and .005. The model is not fitted to these results. The justification is that these results are likely to have been obtained with questionable research practices. By disregarding them, the remaining significant results below .005 are more credible, and the observed discovery rate is in line with the expected discovery rate.

The results look different if we do not assume that questionable practices were used. In this case, the model can be fitted to all p-values below .05.

If we assume that p-values are simply selected for significance, the decrease of p-values from .05 to .005 implies that there is a large file-drawer of non-significant results and the expected discovery rate with alpha = .005 is only 11%. This translates into a high maximum false discovery rate of 44%, but the 95%CI is wide and ranges from 14% to 100%. In other words, the published significant results provide no credible evidence for the discoveries that were made. It is therefore charitable to attribute the peak of just significant results to questionable research practices so that p-values below .005 provide some empirical support for the claims in Finkel’s articles.

Discussion

Ultimately, science relies on trust. For too long, psychologists have falsely assumed that most if not all significant results are discoveries. Bem's (2011) article made many psychologists realize that this is not the case, but this awareness created a crisis of confidence. Which significant results are credible and which ones are false positives? Are most published results false positives? During times of uncertainty, cognitive biases can have a strong effect. Some evidential value warriors saw false positive results everywhere. Others wanted to believe that most published results are credible. These extreme positions are not supported by evidence. The reproducibility project showed that some results replicate and others do not (Open Science Collaboration, 2015). To learn from the mistakes of the past, we need solid facts. Z-curve analyses can provide these facts. They can also help to separate more credible p-values from less credible ones. Here, I showed that about half of Finkel's discoveries can be salvaged from the wreckage of the replication crisis in social psychology by using p < .005 as a criterion for a discovery.

However, researchers may also have different risk preferences. Maybe some are more willing to build on a questionable, but intriguing finding than others. Z-curve analysis can accommodate personalized risk-preferences as well. I shared the data here and an R-package is available to fit z-curve with different alpha levels and selection thresholds.

Aside from these practical implications, this blog post also made a theoretical observation. The term type-II error, or false negative, is often used loosely and incorrectly. Until yesterday, I also made this mistake. Finkel et al. (2015) use the term false negative to refer to all non-significant results where the nil-hypothesis is false. They then worry that there is a high risk of false negatives that needs to be counterbalanced against the risk of false positives. However, not every trivial deviation from zero is meaningful. For example, a diet that reduces weight by 0.1 pounds is not worth studying. A real type-II error is made when researchers specify a meaningful effect size, conduct a high-powered study to find it, and then falsely conclude that an effect of this magnitude does not exist. To make a type-II error, it is necessary to conduct studies with high power; otherwise, beta is so high that it makes no sense to draw a conclusion from the data. As average power in psychology in general and in Finkel's studies is low, it is clear that they did not make any type-II errors. Thus, I recommend increasing power to finally achieve a balance between type-I and type-II errors, which requires making some type-II errors some of the time.

References

Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311–339). Hillsdale, NJ: Erlbaum, Inc.

The past decade has seen major replication failures in social psychology. This has led to a method revolution in social psychology. Thanks to technological advances, many social psychologists moved from studies with smallish undergraduate samples to online studies with hundreds of participants. Thus, findings published after 2016 are more credible than those published before 2016.

However, social psychologists have avoided taking a closer look at theories that were built on questionable results. Review articles continue to present these theories and cite old studies as if they provided credible evidence for them, as if the replication crisis never happened.

One influential theory in social psychology is that stimuli can bypass conscious awareness and still influence behavior. This assumption is based on theories of emotions that emerged in the 1980s. In the famous Lazarus-Zajonc debate most social psychologists sided with Zajonc who quipped that “Preferences need no inferences.”

The influence of Zajonc can be seen in hundreds of studies with implicit primes (Bargh et al., 1996; Devine, 1989) and in modern measures of implicit cognition such as the evaluative priming task and the affect misattribution paradigm (AMP; Payne et al., 2005).

Payne and Lundberg (2014) credit a study by Murphy and Zajonc (1993) for the development of the AMP. Interestingly, the AMP was developed because Payne was unable to replicate a key finding from Murphy and Zajonc's studies.

In these studies, a smiling or frowning face was presented immediately before a target stimulus (e.g., a Chinese character). Participants had to evaluate the target. The key finding was that the faces influenced evaluations of the targets only when the faces were processed without awareness. When participants were aware of the faces, they had no effect. When Payne developed the AMP, he found that preceding stimuli (e.g., faces of African Americans) still influenced evaluations of Chinese characters, even though the faces were presented long enough (75ms) to be clearly visible.

Although research with the AMP has blossomed, there has been little interest in exploring the discrepancy between Murphy and Zajonc’s (1993) findings and Payne’s findings.

One possible explanation for the discrepancy is that the Murphy and Zajonc’s (1993) results were obtained with questionable research practices (QRPs, John et al., 2012). Fortunately, it is possible to detect the use of QRPs using forensic statistical tools. Here I use these tools to examine the credibility of Murphy and Zajonc’s claims that subliminal presentations of emotional faces produce implicit priming effects.

Before I examine the small set of studies from this article, it is important to point out that the use of QRPs in this literature is highly probable. This is revealed by examining the broader literature of implicit priming, especially with subliminal stimuli (Schimmack, 2020).

Figure 1 shows that published studies rarely report non-significant results, although the distribution of significant results shows low power and a high probability of non-significant results. While the observed discovery rate is 90%, the expected discovery rate is only 13%. This shows that QRPs were used to suppress results that did not show the expected implicit priming effects.
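Soric's insight makes it possible to quantify what an expected discovery rate of 13% implies. Here is a minimal sketch of his upper bound on the false discovery risk (my own illustration; the z-curve package reports these estimates directly):

```python
def soric_fdr(discovery_rate, alpha=0.05):
    """Soric's upper bound on the false discovery rate, derived under the
    assumption that all tests of true hypotheses have perfect power."""
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))

# expected discovery rate of 13% from the z-curve analysis above
print(f"maximum false discovery rate = {soric_fdr(0.13):.0%}")
```

With a 13% discovery rate, up to roughly a third of the significant implicit priming results could be false positives.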

Study 1

Study 1 in Murphy and Zajonc (1993) had 32 participants; 16 with subliminal presentations and 16 with supraliminal presentations. There were 4 within-subject conditions (smiling, frowning, & two control conditions). The means of the affect ratings were 3.46 for smiling faces, 3.06 for both control conditions, and 2.70 for frowning faces. The perfect ordering of means is a bit suspicious, but even more problematic is that the mean differences between experimental conditions and control conditions were all statistically significant. The t-values (df = 15) are 2.23, 2.31, 2.31, and 2.59. Too many significant contrasts were the downfall of a German social psychologist. Here we can only say that Murphy and Zajonc were very lucky that the two control conditions fell smack in the middle of the two experimental conditions. Any deviation in one direction would have increased one comparison but decreased the other and increased the risk of a non-significant result.
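How close these contrasts sit to the significance threshold can be checked directly. A small sketch (my own check, not part of the original article) that converts the four reported t-values into two-tailed p-values:

```python
from scipy import stats

t_values = [2.23, 2.31, 2.31, 2.59]
df = 15
# two-tailed p-values: twice the upper-tail probability of the t distribution
p_values = [2 * stats.t.sf(t, df) for t in t_values]
for t, p in zip(t_values, p_values):
    print(f"t({df}) = {t:.2f}, p = {p:.3f}")
```

All four p-values fall in the narrow window between .02 and .05, exactly the clustering of just-significant results one would expect if QRPs were used to clear the significance threshold.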

Study 2

Study 2 was similar, except that the judgment was changed from subjective liking to objective goodness vs. badness judgments.

The means for the two control conditions were again right in the middle, nearly identical to each other, and nearly identical to the means in Study 1 (M = 3.05, 3.06). Given sampling error, it is extremely unlikely that even the same condition would produce identical means twice. Without reporting actual t-values, the authors further claim that all four comparisons of experimental and control conditions are significant.

Taken together, these two studies with surprisingly similar t-values and 32 participants provide the only evidence for the claim that stimuli outside of awareness can elicit affective reactions. This weak evidence has garnered nearly 1,000 citations without ever being questioned and without any published replication attempts.

Studies 3-5 did not examine affective priming, but Study 6 did. The paradigm here was different. Participants were subliminally presented with a smiling or a frowning face. Then they had to choose between two pictures, the prime and a foil. The foil either had the same facial expression or a different facial expression. Another manipulation was to have the same or a different gender. This study showed a strong effect of facial expression, t(62) = 6.26, but not of gender.

I liked this design and conducted several conceptual replication studies with emotional pictures (beautiful beaches, dirty toilets). It did not work. Participants were not able to use their affect to pick the right picture from a prime-foil pair. I also manipulated presentation times and with increasing presentation times, participants could pick out the picture, even if the affect was the same (e.g., prime and foil were both pleasant).

Study 6 also explains why Payne was unable to get priming effects for subliminal stimuli that varied in race or other features.

One possible explanation for the results in Study 6 is that it is extremely difficult to mask facial expressions, especially smiles. I also ran some studies that attempted this, and at least with computer presentations it was impossible to prevent detection of smiling faces.

Thus, we are left with some questionable results in Studies 1 and 2 as the sole evidence that subliminal stimuli can elicit affective reactions that are transferred to other stimuli.

Conclusion

I have tried to get implicit priming effects on affect measures and failed. It was difficult to publish these failures in the early 2000s. I am sure there are many other replication failures (see Figure 1), and Payne and Lundberg's (2014) account of the development of the AMP implies as much. Social psychology is still in the process of cleaning up the mess that the use of QRPs created. Implicit priming research is a poster child of the replication crisis, and researchers should stop citing these old articles as if they produced credible evidence.

Emotion researchers may also benefit from revisiting the Lazarus-Zajonc debate. Appraisal theory may not have the sex appeal of unconscious emotions, but it may be a more robust and accurate theory of emotions. Preferences may not always require inferences, but preferences that are based on solid inferences are likely to be a better guide to behavior. Therefore, I prefer Lazarus over Zajonc.

This is the third part in a mini-series of building a monster-model of well-being. The first part (Part 1) introduced the measurement of well-being and the relationship between affect and well-being. The second part added measures of satisfaction with life-domains (Part 2). Part 2 ended with the finding that most of the variance in global life-satisfaction judgments is based on evaluations of important life domains. Satisfaction in important life domains also influences the amount of happiness and sadness individuals experience, but affect had relatively small unique effects on global life-satisfaction judgments. In fact, happiness made a trivial, non-significant unique contribution.

The effects of the various life domains on happiness, sadness, and the weighted average of domain satisfactions are shown in the table below. Regarding happy affective experiences, the results showed that friendships and recreation are important for high levels of positive affect (experiencing happiness), whereas health and money are relatively unimportant.

In part 3, I am examining how we can add the personality trait extraversion to the model. Evidence that extraverts have higher well-being was first reviewed by Wilson (1967). An influential article by Costa and McCrae (1980) showed that this relationship is stable over a period of 10 years, suggesting that stable dispositions contribute to this relationship. Since then, meta-analyses have repeatedly reaffirmed that extraversion is related to well-being (DeNeve & Cooper, 1998; Heller et al., 2004; Horwood, Smillie, Marrero, Wood, 2020).

Here, I am examining the question of how extraversion influences well-being. One criticism of structural equation modeling of correlational, cross-sectional data is that causal arrows are arbitrary and that the results do not provide evidence of causality. This is nonsense. Whether a causal model is plausible depends on what we know about the constructs and measures that are being used in a study. Not every study can test all assumptions, but we can build models that make plausible assumptions given well-established findings in the literature. Fortunately, personality psychology has established some robust findings about extraversion and well-being.

First, personality traits and well-being measures show evidence of heritability in twin studies. If well-being showed no evidence of heritability, we could not postulate that a heritable trait like extraversion influences well-being because genetic variance in a cause would produce genetic variance in an outcome.

Second, both personality and well-being have a highly stable variance component. However, the stable variance in extraversion is larger than the stable variance in well-being (Anusic & Schimmack, 2016). This implies that extraversion causes well-being rather than the other way around because causality runs from the more stable variable to the less stable variable (Conley, 1984). The reasoning is that a variable that changes quickly and influences another variable would produce changes in that variable, which contradicts the finding that the outcome is stable. For example, if height were correlated with mood, we would know that height causes variation in mood rather than the other way around because mood changes daily, but height does not. We also have direct evidence that life events that influence well-being, such as unemployment, can change well-being without changing extraversion (Schimmack, Wagner, & Schupp, 2008). This implies that well-being does not cause extraversion because the changes in well-being due to unemployment would then produce changes in extraversion, which is contradicted by the evidence. In short, even though the cross-sectional data used here cannot test the assumption that extraversion causes well-being, the broader literature makes it very likely that causality runs from extraversion to well-being rather than the other way around.

Despite 50 years of research, it is still unknown how extraversion influences well-being. "It is widely appreciated that extraversion is associated with greater subjective well-being. What is not yet clear is what processes relate the two" (Harris, English, Harms, Gross, & Jackson, 2017, p. 170). Costa and McCrae (1980) proposed that extraversion is a disposition to experience more pleasant affective experiences independent of actual stimuli or life circumstances. That is, extraverts are disposed to be happier than introverts. A key problem with this affect-level model is that it is difficult to test. One way of doing so is to falsify alternative models. One alternative is the affective reactivity model, according to which extraverts are only happier in situations with rewarding stimuli. This model implies personality x situation interactions that can be tested. So far, however, the affective reactivity model has received little support despite several attempts to find it (Lucas & Baird, 2004). Another model assumes that extraversion is related to situation selection. Extraverts may spend more time in situations that elicit pleasure. Accordingly, both introverts and extraverts enjoy socializing, but extraverts actually spend more time socializing than introverts. This model implies person-situation correlations that can be tested.

Nearly 20 years ago, I proposed a mediation model that assumes extraversion has a direct influence on affective experiences and that the amount of affective experiences is used to evaluate life-satisfaction (Schimmack, Diener, & Oishi, 2002). Although the article is cited relatively frequently, none of these citations are replication studies. The findings above cast doubt on this model because there is no direct influence of positive affect (happiness) on life-satisfaction judgments.

The following analyses examine how extraversion is related to well-being in the Mississauga Family Study dataset.

1. A multi-method study of extraversion and well-being

I start with a very simple model that predicts well-being from extraversion, CFI = .989, RMSEA = .027. The correlated residuals show some rater-specific correlations between ratings of extraversion and life-satisfaction. Most important, the correlation between the extraversion and well-being factors is only r = .11, 95%CI = .03 to .19.
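A confidence interval of this width can be approximated with the Fisher z-transformation. The following is a rough sketch of that calculation; note that the sample size of 600 is a hypothetical value chosen for illustration, not the actual Mississauga Family Study N, and the model itself was of course fit with SEM software:

```python
import numpy as np
from scipy import stats

def fisher_ci(r, n, conf=0.95):
    """Approximate confidence interval for a correlation
    via the Fisher z-transformation."""
    z = np.arctanh(r)                       # r -> z
    se = 1 / np.sqrt(n - 3)                 # standard error of z
    zcrit = stats.norm.ppf((1 + conf) / 2)  # e.g., 1.96 for 95%
    return np.tanh(z - zcrit * se), np.tanh(z + zcrit * se)

lo, hi = fisher_ci(r=0.11, n=600)  # n = 600 is an assumed value for illustration
print(f"r = .11, 95% CI = {lo:.2f} to {hi:.2f}")
```

With an assumed n of about 600, this reproduces a confidence interval of roughly .03 to .19, which shows how a small correlation can still be reliably different from zero in a reasonably large sample.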

The effect size is noteworthy because extraversion is often considered to be a very powerful predictor of well-being. For example, Kesebir and Diener (2008) write "Other than extraversion and neuroticism, personality traits such as extraversion … have been found to be strong predictors of happiness" (p. 123).

There are several explanations for the weak relationship in this model. First, many studies did not control for shared method variance. Even McCrae and Costa (1991) found a weak relationship when they used informant ratings of extraversion to predict self-ratings of well-being, but they ignored the effect size estimate.

Another possible explanation is that Mississauga is a highly diverse community and that the influence of extraversion on well-being may be weaker in non-Western samples (r ~ .2; Kim et al., 2017).

I next added the two affect factors (happiness and sadness) to the model to test the mediation model. This model had good fit, CFI = .986, RMSEA = .026. The moderate to strong relationships from extraversion to happy feelings and from happy feelings to life-satisfaction were highly significant, z > 5. Thus, without taking domain satisfaction into account, the results appear to replicate Schimmack et al.'s (2002) findings.

However, including domain satisfaction changes the results, CFI = .988, RMSEA = .015.

Although extraversion is a direct predictor of happy feelings, b = .25, z = 6.5, the non-significant path from happy feelings to life-satisfaction implies that extraversion does not influence life-satisfaction via this path, indirect effect b = .00, z = 0.2. Thus, the total effect of b = .14, z = 3.7, is fully mediated by the domain satisfactions.

A broad affective disposition model would predict that extraversion enhances positive affect across all domains, including work. However, the path coefficients show that extraversion is a stronger predictor of satisfaction with some domains than with others. The strongest coefficients are obtained for satisfaction with friendships and recreation. In contrast, extraversion has only very small, statistically non-significant relationships with financial satisfaction, health satisfaction, and housing satisfaction. Inspection of the indirect effects shows that friendship (b = .026), leisure (b = .022), romance (b = .026), and work (b = .024) account for most of the total effect. However, power is too low to test the significance of individual path coefficients.
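The decomposition of the total effect can be verified with simple arithmetic. A small sketch of my own bookkeeping of the coefficients reported above:

```python
# indirect effects of extraversion on life-satisfaction via domain satisfactions
indirect = {"friendship": 0.026, "leisure": 0.022, "romance": 0.026, "work": 0.024}
total_effect = 0.14  # total effect of extraversion on life-satisfaction

explained = sum(indirect.values())
print(f"sum of listed indirect effects = {explained:.3f}")
print(f"share of total effect = {explained / total_effect:.0%}")
```

The four listed domains sum to b = .098, about 70% of the total effect of .14, consistent with the claim that they account for most of it.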

Conclusion

The results replicate previous work. First, extraversion is a statistically significant predictor of life-satisfaction, even when method variance is controlled, but the effect size is small. Second, extraversion is a stronger predictor of happy feelings than of life-satisfaction and is unrelated to sad feelings. However, the inclusion of domain satisfaction judgments shows that happy feelings do not mediate the influence of extraversion on life-satisfaction. Rather, extraversion predicts higher satisfaction with some life domains. It may seem surprising that this is a new finding in 2021, 40 years after Costa and McCrae (1980) emphasized the importance of extraversion for well-being. The reason is that few psychological studies of well-being include measures of domain satisfaction and few sociological studies of well-being include personality measures (Schimmack, Schupp, & Wagner, 2008). The present results show that it would be fruitful to examine how extraversion is related to satisfaction with friendships, romantic relationships, and recreation. This is an important avenue for future research. However, for the monster model of well-being, the next step will be to include neuroticism in the model. Stay tuned.