
If Consumer Psychology Wants to be a Science It Has to Behave Like a Science

Consumer psychology is an applied branch of social psychology that uses social psychological insights to understand consumers’ behavior. Although there is cross-fertilization and authors may publish in both more basic and more applied journals, it is its own field in psychology with its own journals. As a result, it has escaped the close attention that has been paid to the replicability of studies published in mainstream social psychology journals (see Schimmack, 2020, for a review). However, given the similarity in theories and research practices, it is fair to ask why consumer research should be more replicable and credible than basic social psychology. This question was indirectly addressed in a dialogue about the merits of pre-registration that was published in the Journal of Consumer Psychology (Krishna, 2021).

Open science proponents advocate pre-registration to increase the credibility of published results. The main concern is that researchers can use questionable research practices (QRPs) to produce significant results (John et al., 2012). Preregistration of analysis plans would reduce the opportunity to use QRPs and increase the chances of obtaining a non-significant result. This would make the reporting of significant results more valuable because significance was produced by the data and not by the creativity of the data analyst.

In my opinion, the focus on pre-registration in the dialogue is misguided. As Pham and Oh (2021) point out, pre-registration would not be necessary if there were no problem that needs to be fixed. Thus, a proper assessment of the replicability and credibility of consumer research should inform discussions about preregistration.

The problem is that the past decade has seen more articles talking about replications than actual replication studies, especially outside of social psychology. Thus, most of the discussion about actual and ideal research practices occurs without facts about the status quo. How often do consumer psychologists use questionable research practices? How many published results are likely to replicate? What is the typical statistical power of studies in consumer psychology? What is the false positive risk?

Rather than writing another meta-psychological article that is based on paranoid or wishful thinking, I would like to add to the discussion by providing some facts about the health of consumer psychology.

Do Consumer Psychologists Use Questionable Research Practices?

John et al. (2012) conducted a survey study to examine the use of questionable research practices. They found that respondents admitted to using these practices and that they did not consider them to be wrong. In 2021, however, nobody is defending the use of questionable practices that can inflate the risk of false positive results and hide replication failures. Consumer psychologists could have conducted an internal survey to find out how prevalent these practices are among consumer psychologists. However, Pham and Oh (2021) do not present any evidence about the use of QRPs by consumer psychologists. Instead, they cite a survey among German social psychologists to suggest that QRPs may not be a big problem in consumer psychology. Below, I will show that QRPs are a big problem in consumer psychology and that consumer psychologists have done nothing over the past decade to curb the use of these practices.

Are Studies in Consumer Psychology Adequately Powered?

Concerns about low statistical power go back to the 1960s (Cohen, 1961; Maxwell, 2004; Schimmack, 2012; Sedlmeier & Gigerenzer, 1989; Smaldino & McElreath, 2016). Tversky and Kahneman (1971) refused “to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis” (p. 110). Yet, results from the reproducibility project suggest that social psychologists conduct studies with less than 50% power all the time (Open Science Collaboration, 2015). It is not clear why we should expect higher power from consumer research. More concerning is that Pham and Oh (2021) do not even mention low power as a potential problem for consumer psychology. One advantage of pre-registration is that researchers are forced to think ahead of time about the sample size that is required to have a good chance to show the desired outcome, assuming the theory is right. More than 20 years ago, the APA taskforce on statistical inference recommended a priori power analysis, but researchers continued to conduct underpowered studies. Pre-registration, however, would not be necessary if consumer psychologists already conducted studies with adequate power. Here I show that power in consumer psychology is unacceptably low and has not increased over the past decade.

False Positive Risk

Pham and Oh note that Simmons, Nelson, and Simonsohn’s (2011) influential article relied exclusively on simulations and speculations, and they suggest that the fear of massive p-hacking may be unfounded: “Whereas Simmons et al. (2011) highly influential computer simulations point to massive distortions of test statistics when QRPs are used, recent empirical estimates of the actual impact of self-serving analyses suggest more modest degrees of distortion of reported test statistics in recent consumer studies (see Krefeld-Schwalb & Scheibehenne, 2020).” Here I present empirical analyses to estimate the false discovery risk in consumer psychology.

Data

The data are part of a larger project that examines research practices in psychology over the past decade. For this purpose, my research team and I downloaded all articles from 2010 to 2020 published in 120 psychology journals that cover a broad range of disciplines. Four journals represent research in consumer psychology, namely the Journal of Consumer Behavior, the Journal of Consumer Psychology, the Journal of Consumer Research, and Psychology and Marketing. The articles were converted into text files and the text files were searched for test statistics. All F, t, and z-tests were used, but most test statistics were F and t tests. There were 2,304 tests for the Journal of Consumer Behavior, 8,940 for the Journal of Consumer Psychology, 10,521 for the Journal of Consumer Research, and 5,913 for Psychology and Marketing.

Results

I first conducted z-curve analyses for each journal and year separately. The 40 results were analyzed with year as a continuous and journal as a categorical predictor variable. No time trends were significant, but the main effect of journal on the expected replication rate was significant, F(3,36) = 9.63, p < .001. Inspection of the means showed higher values for the Journal of Consumer Psychology and Psychology & Marketing than for the other two journals. No other effects were significant. Therefore, I combined the data of the Journal of Consumer Psychology and Psychology & Marketing into one set, and the data of the Journal of Consumer Behavior and the Journal of Consumer Research into another.

Figure 1 shows the z-curve analysis for the first set of journals. The observed discovery rate (ODR) is simply the percentage of results that are significant. Out of the 14,853 tests, 10,636 were significant, which yields an ODR of 72%. To examine the influence of questionable research practices, the ODR can be compared to the estimated discovery rate (EDR). The EDR is an estimate that is based on a finite mixture model that is fitted to the distribution of the significant test statistics. Figure 1 shows that the fitted grey curve closely matches the observed distribution of test statistics, which are all converted into z-scores. Figure 1 also shows the projected distribution that is expected for non-significant results. Contrary to the predicted distribution, the observed non-significant results drop off sharply at the level of significance (z = 1.96). This pattern provides visual evidence that non-significant results do not follow a sampling distribution. The EDR is the area under the curve for the significant values relative to the total distribution. The EDR is only 34%. The 95% CI of the EDR can be used to test statistical significance. The ODR of 72% is well outside the 95% confidence interval of the EDR, which ranges from 17% to 34%. Thus, there is strong evidence that consumer researchers use QRPs and publish too many significant results.

The EDR can also be used to assess the risk of publishing false positive results; that is, significant results without a true population effect. Using a formula from Soric (1989), we can use the EDR to estimate the maximum percentage of false positive results. As the EDR decreases, the false discovery risk increases. With an EDR of 34%, the FDR is 10%, with a 95% confidence interval ranging from 7% to 26%. Thus, the present results do not suggest that most results in consumer psychology journals are false positives, as some meta-scientists have suggested (Ioannidis, 2005; Simmons et al., 2011).
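
To make the arithmetic transparent, here is a minimal sketch in Python (not an implementation of the z-curve model itself) that reproduces the ODR from the counts above and applies Soric’s formula to the EDR estimate:

```python
# Minimal sketch: observed discovery rate (ODR) and Soric's (1989)
# maximum false discovery rate, using the numbers reported above.

def soric_max_fdr(discovery_rate, alpha=0.05):
    """Upper bound on the false discovery rate for a given discovery rate."""
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))

n_tests, n_significant = 14_853, 10_636
odr = n_significant / n_tests
print(f"ODR = {odr:.0%}")                       # ~72%

edr = 0.34                                      # z-curve estimate from Figure 1
print(f"max FDR = {soric_max_fdr(edr):.0%}")    # ~10%
```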

It is more difficult to assess the replicability of results published in these two journals. On the one hand, z-curve provides an estimate of the expected replication rate (ERR), that is, the probability that a significant result produces a significant result again in an exact replication study (Brunner & Schimmack, 2020). The ERR is higher than the EDR because studies that produced a significant result have higher power than studies that did not produce a significant result. The ERR of 63% suggests that more than 50% of significant results can be successfully replicated. However, a comparison of the ERR with the success rate in actual replication studies showed that the ERR overestimates actual replication rates (Brunner & Schimmack, 2020). There are a number of reasons for this discrepancy. One reason is that replication studies in psychology are never exact replications and that regression to the mean lowers the chances of reproducing the same effect size in a replication study. In social psychology, the EDR is actually a better predictor of the actual success rate. Thus, the present results suggest that actual replication studies in consumer psychology are likely to produce as many replication failures as studies in social psychology have (Schimmack, 2020).

Figure 2 shows the results for the Journal of Consumer Behavior and the Journal of Consumer Research.

The results are even worse. The ODR of 73% is well above the EDR of 26% and falls outside the 95% CI of the EDR. This EDR implies a false discovery risk of 15%.

Conclusion

The present results show that consumer psychology is plagued by the same problems that have produced replication failures in social psychology. Given the similarities between consumer psychology and social psychology, it is not surprising that the two disciplines are alike. Researchers conduct underpowered studies and use QRPs to report inflated success rates. These illusory results cannot be replicated, and it is unclear which statistically significant results reveal effects that have practical significance and which ones are mere false positives. To make matters worse, social psychologists have responded to awareness of these problems by increasing the power of their studies and by implementing changes in their research practices. In contrast, z-curve analyses of consumer psychology show no improvement in research practices over the past decade. In light of this disappointing trend, it is disconcerting to read an article that suggests improvements in consumer psychology are not needed and that everything is well (Pham & Oh, 2021). I demonstrated with hard data and objective analyses that this assessment is false. It is time for consumer psychologists to face reality and to follow in the footsteps of social psychologists to increase the credibility of their science. While preregistration may be optional, increasing power is not.

Guest Post by Peter Holtz: From Experimenter Bias Effects To the Open Science Movement

This post was first shared in the Facebook Psychological Methods Discussion Group (Group, Post). I thought it was interesting and deserved a wider audience.

Peter Holtz

I know that this is too long for this group, but I don’t have a blog …

A historical anecdote:

In 1963, Rosenthal and Fode published a famous paper on the Experimenter Bias Effect (EBE): There were of course several different experiments and conditions etc., but for example, research assistants were given a set of 20 photos of people that were to be rated by participants on a scale from -10 ([will experience …] “extreme failure”) to + 10 (…“extreme success”).

The research assistants (e.g., participants in a class on experimental psychology) were told to replicate a “well-established” psychological finding just like “students in physics labs are expected to do” (p. 494). On average, the sets of photos had been rated in a large pre-study as neutral (M=0), but some research assistants were told that the expected mean of their photos was -5, whereas others were told that it was +5. When the research assistants, who were not allowed to communicate with each other during the experiments, handed in the results of their studies, their findings were biased in the direction of the effect that they had expected. Funnily enough, similar biases could be found for experiments with rats in Skinner boxes as well (Rosenthal & Fode, 1963b).

The findings on the EBE were met with skepticism from other psychologists since they cast doubt on experimental psychology’s self-concept as a true and unbiased natural science. And what have researchers done since the days of Socrates when they doubt the findings of a colleague? Sure, they attempt to replicate them. Whereas Rosenthal and colleagues (by and large) produced several successful “conceptual replications” in slightly different contexts (for a summary see e.g. Rosenthal, 1966), others (most notably T. X. Barber) couldn’t replicate Rosenthal and Fode’s original study (e.g., Barber et al., 1969; Barber & Silver, 1968, but also Jacob, 1968; Wessler & Strauss, 1968).

Rosenthal, a well-versed statistician, responded (e.g., Rosenthal, 1969) that the difference between significant and non-significant may not itself be significant, and he used several techniques that about ten years later came to be known as “meta-analysis” to argue that although Barber’s and others’ replications, which of course used other groups of participants and materials etc., most often did not yield significant results, a summary of the results suggests that there may still be an EBE (1968; albeit probably smaller than in Rosenthal and Fode’s initial studies – let me think… how can we explain that…).

Of course, Barber and friends responded to Rosenthal’s responses (e.g., Barber, 1969, titled “invalid arguments, post-mortem analyses, and the experimenter bias effect”) and vice versa, and a serious discussion of psychology’s methodology emerged. Other notables weighed in as well, and authors frequently quoted statisticians such as Rozeboom (1960) and Bakan (1966), who had by then already done their best to explain to their colleagues the problems of the p-ritual that psychologists use(d) as a verification procedure. (On a side note: To me, Bakan’s 1966 paper is better than much of the recent work on the problems with the p-ritual; in particular, the paragraph on the problematic assumption of an “automaticity of inference” on p. 430 is still worth reading.)

Lykken (1968) and Meehl (1967) soon joined the melee and attacked the p-ritual from an epistemological perspective as well. In 1969, Levy wrote an interesting piece about the value of replications in which he argued that replicating the EBE studies doesn’t make much sense as long as there are no attempts to embed the EBE into a wider explanatory theory that allows for deducing other falsifiable hypotheses as well. Levy knew very well, already in 1969, that the question whether some effect “exists” or “does not exist” is only in very rare cases relevant (namely, when there are strong reasons to assume that an effect does not exist – as is the case, for example, with para-psychological phenomena).

Eventually Rosenthal himself (e.g., 1968a) came to think critically of the “reassuring nature of the null hypothesis decision procedure”. What happened then? At some point Rosenthal moved away from experimenter expectancy effects in the lab to Pygmalion effects in the classroom (1968b) – an idea that is much less likely to provoke criticism and replication attempts: Who doesn’t believe that teachers’ stereotypes influence the way they treat children and consequently the children’s chances to succeed in school? The controversy fizzled out and if you take up a social psychology textbook, you may find the comforting story in it that this crisis was finally “overcome” (Stroebe, Hewstone, & Jonas, 2013, p. 18) by enlarging psychology’s methodological arsenal, for example, with meta-analytic practices and by becoming a stronger and better science with a more rigid methodology etc. Hooray!

So psychology was finally great again from the 1970s on … was it? What can we learn from this episode?

– It is not the case that psychologists didn’t know the replication game; they only played it whenever results went against their beliefs – and that was rarely the case (exceptions, apart from Rosenthal’s studies, are of course Bem’s “feeling the future” experiments).

– Science is self-correcting – but only when there are controversies (and not if subcommunities just happily produce evidence in favor of their pet theories).

– Everybody who wanted to know could know by the 1960s that something was wrong with the p-ritual – but no one cared. This was the game that needed to be played to produce evidence in favor of theories, to get published, and to make a career; consequently, people learned to play the verification game more and more effectively. (Bakan writes on p. 423: “What will be said in this paper is hardly original. It is, in a certain sense, what “everybody knows.” To say it “out loud” is, as it were, to assume the role of the child who pointed out that the emperor was really outfitted only in his underwear.” – in 1966!)

– Just making it more difficult to verify a theory will not solve the problem imo; ambitious psychologists will again find ways to play the game – and to win.

– I see two risks with the changes that have been proposed by the “open science community” (in particular preregistration): First, I am afraid that since the verification game still dominates in psychology, researchers will simply shift towards “proving” more boring hypotheses; second, there is the risk that psychological theories will be shielded even more from criticism since only criticism based on “good science” (preregistered experiments with a priori power analysis and open data) will be considered valid, whereas criticism based on other types of research activities (e.g., simulations, case studies … or just rational thinking for a change) will be dismissed as “unscientific” => no criticism => no controversy => no improvement => no progress. And of course, pre-registration and open science etc. still allow psychologists to maintain the misguided, unfortunate, and highly destructive myth of the “automaticity of inferences”; no inductive mechanism whatsoever can ensure “true discovery”.

– I think what is needed more is a discussion about the relationship between data and theory and about epistemological questions such as what a “growth of knowledge” in science could look like and how it can be facilitated (I call this a “falsificationist turn”).

– Irrespective of what is going to happen, authors of textbooks will find ways to write up the history of psychology as a flawless cumulative success story …

A Z-Curve Analysis of a Self-Replication: Shah et al. (2012) Science

Since 2011, psychologists have been wondering which published results are credible and which results are not. One way to answer this question would be for researchers to self-replicate their most important findings. However, most psychologists have avoided conducting or publishing self-replications (Schimmack, 2020).

It is therefore always interesting when a self-replication is published. I just came across Shah, Mullainathan, and Shafir (2019). The authors conducted high-powered (much larger sample sizes) replications of five studies that were published in Shah, Mullainathan, and Shafir’s (2012) Science article.

The article reported five studies with 1, 6, 2, 3, and 1 focal hypothesis tests, respectively. One additional test was significant, but the authors focussed on the small effect size and considered it not theoretically important. The replication studies successfully replicated 9 of the 13 significant results, a success rate of 69%. This is higher than the success rate in the famous reproducibility project of 100 studies in social and cognitive psychology (37%; OSC, 2015).

One interesting question is whether this success rate was predictable based on the original findings. An even more interesting question is whether the original results provide clues about the replicability of specific effects. For example, why were the results of Studies 1 and 5 harder to replicate than those of the other studies?

Z-curve relies on the strength of the evidence against the null-hypothesis in the original studies to predict replication outcomes (Brunner & Schimmack, 2020; Bartos & Schimmack, 2020). It also takes into account that original results may be selected for significance. For example, the original article reported 14 out of 14 significant results. It is unlikely that all statistical tests of critical hypotheses produce significant results (Schimmack, 2012). Thus, some questionable practices were probably used although the authors do not mention this in their self-replication article.

I converted the 13 test statistics into exact p-values and converted the exact p-values into z-scores. Figure 1 shows the z-curve plot and the results of the z-curve analysis. The first finding is that the observed success rate of 100% is much higher than the expected discovery rate of 15%. Given the small sample of tests, the 95%CI around the estimated discovery rate is wide, but it does not include 100%. This suggests that some questionable practices were used to produce a pretty picture of results. This practice is in line with widespread practices in psychology in 2012.
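
The conversion itself is mechanical. The sketch below illustrates it with scipy; the test statistics shown are placeholders for illustration, not the values reported by Shah et al. (2012).

```python
from scipy import stats

def t_to_z(t, df):
    """Convert a t-value into the absolute z-score with the same two-sided p-value."""
    p = 2 * stats.t.sf(abs(t), df)     # exact two-sided p-value
    return stats.norm.isf(p / 2)       # z-score with the same tail probability

def f_to_z(f, df1, df2):
    """Convert an F-value (numerator df1, denominator df2) the same way."""
    p = stats.f.sf(f, df1, df2)
    return stats.norm.isf(p / 2)

# Placeholder values for illustration, not the statistics reported in the article
print(t_to_z(2.50, 60))      # ~2.4
print(f_to_z(6.25, 1, 60))   # F(1, df) = t^2, so the same z (~2.4)
```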

The next finding is that, despite a low estimated discovery rate, the estimated replication rate of 66% is in line with the observed replication rate of 69%. The reason for the difference between the estimated replication rate and the estimated discovery rate is that the latter includes the large set of non-significant results that the model predicts. Selection for significance selects studies with higher power that have a higher chance to be significant (Brunner & Schimmack, 2020).

It is unlikely that the authors conducted many additional studies to get only significant results. It is more likely that they used a number of other QRPs. Whatever method they used, QRPs make just significant results questionable. One solution to this problem is to alter the significance criterion post-hoc. This can be done gradually. For example, a first adjustment might lower the significance criterion to alpha = .01.

Figure 2 shows the adjusted results. The observed discovery rate decreased to 69%. In addition, the estimated discovery rate increased to 48% because the model no longer needs to predict the large number of just significant results. Thus, the expected and observed discovery rate are much more in line and suggest little need for additional QRPs. The estimated replication rate decreased because it uses the more stringent criterion of alpha = .01. Otherwise, it would be even more in line with the observed replication rate.

Thus, a simple explanation for the replication outcomes is that some results were obtained with QRPs that produced just significant results with p-values between .01 and .05. These results did not replicate, but the other results did replicate.

There was also a strong point-biserial correlation between the original z-scores and the dichotomous replication outcome. When the original p-values were split into those above and below .01, they perfectly predicted the replication outcome: results with p-values greater than .01 did not replicate, and those with p-values below .01 did replicate.
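
A minimal sketch of this kind of analysis is shown below. The z-scores and outcomes are invented to mirror the pattern described here (four just-significant results that failed to replicate, nine stronger results that replicated); they are not the actual values from the two articles.

```python
import numpy as np
from scipy import stats

# Hypothetical values that mirror the described pattern; not the actual data.
z_original = np.array([2.1, 2.2, 2.3, 2.4, 3.0, 3.4, 3.6, 3.8, 3.9, 4.2, 4.5, 4.8, 5.1])
replicated = np.array([0,   0,   0,   0,   1,   1,   1,   1,   1,   1,   1,   1,   1])

r, _ = stats.pointbiserialr(replicated, z_original)
print(f"point-biserial r = {r:.2f}")

# Split at p = .01 (two-sided), i.e. z > 2.576
predicted = (z_original > stats.norm.isf(0.005)).astype(int)
print("agreement with replication outcome:", np.mean(predicted == replicated))  # 1.0
```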

In conclusion, a single p-value from a single analysis provides little information about replicability, although replicability increases as p-values decrease. However, meta-analyses of p-values with models that take QRPs and selection for significance into account are a promising tool to predict replication outcomes and to distinguish between questionable and solid results in the psychological literature.

Meta-analyses that take QRPs into account can also help to avoid replication studies that merely confirm highly robust results. Four of the z-scores in Shah et al.’s (2019) project were above 4, which makes it very likely that these results replicate. Resources are better spent on findings that have high theoretical importance but weak evidence. Z-curve can help to identify these results because it corrects for the influence of QRPs.

Conflict of Interest statement: Z-curve is my baby.

How Credible is Clinical Psychology?

Don Lynam and the clinical group at Purdue University invited me to give a talk and they generously gave me permission to share it with you.

Talk (the first 4 min. were not recorded, it starts right away with my homage to Jacob Cohen).

The first part of the talk discusses the problems with Fisher’s approach to significance testing and the practice in psychology to publish only significant results. I then discuss Neyman-Pearson’s alternative approach, statistical power, and Cohen’s seminal meta-analysis of power in social/abnormal psychology. I then point out that questionable research practices must have been used to publish 95% significant results with only 50% power.

The second part of the talk discusses Soric’s insight that we can estimate the false discovery risk based on the discovery rate. I discuss the Open Science Collaboration project as one way to estimate the discovery rate (pretty high for within-subject cognitive psychology, terribly low for between-subject social psychology), but point out that it doesn’t tell us about clinical psychology. I then introduce z-curve to estimate the discovery rate based on the distribution of significant p-values (converted into z-scores).

In the empirical part, I show the z-curve for Positive Psychology Interventions that shows massive use of QRPs and a high false discovery risk.

I end with a comparison of the z-curve for the Journal of Abnormal Psychology in 2010 and 2020 that shows no change in research practices over time.

The discussion focussed on changing the way we do research and what research we reward. I argue strongly against the implementation of alpha = .005 and for the adoption of Neyman-Pearson’s approach with pre-registration, which would allow researchers to study small populations (e.g., mental health issues in the African American community) with a higher false-positive risk in order to balance type-I and type-II errors.

A tutorial about effect sizes, power, z-curve analysis, and personalized p-values.

I recorded a meeting with my research assistants who are coding articles to estimate the replicability of psychological research. It is unedited and raw, but you might find it interesting to listen to. Below I give a short description of the topics that were discussed starting from an explanation of effect sizes and ending with a discussion about the choice of a graduate supervisor.

Link to video

The meeting is based on two blog posts that introduce personalized p-values.
1. https://replicationindex.com/2021/01/15/men-are-created-equal-p-values-are-not/
2. https://replicationindex.com/2021/01/19/personalized-p-values/

1. Rant about Fisher’s approach to statistics that ignores effect sizes.
– look for p < .05, and do a happy dance if you find it, now you can publish.
– still the way statistics is taught to undergraduate students.

2. Explaining statistics starting with effect sizes.
– unstandardized effect size (height difference between men and women in cm)
– unstandardized effect sizes depend on the unit of measurement
– to standardize effect sizes, we divide by the standard deviation (Cohen’s d); a minimal sketch follows below.
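
As a concrete illustration of the last point, here is a minimal sketch that computes Cohen’s d from two groups of scores; the height values are simulated for illustration only.

```python
import numpy as np

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

# Illustrative height data in cm; the raw mean difference changes if we switch
# to inches, but d stays the same because it is unit-free.
rng = np.random.default_rng(0)
men = rng.normal(178, 7, 1000)
women = rng.normal(165, 7, 1000)
print(np.mean(men) - np.mean(women))   # ~13 cm (unstandardized)
print(cohens_d(men, women))            # ~1.9 (standardized)
```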

3. Why do/did social psychologists run studies with n = 20 per condition?
– limited resources, small subject pool, statistics can be used with n = 20 ~ 30.
– obvious that these sample sizes are too small after Cohen (1961) introduced power analysis
– but some argued that low power is ok because it is more efficient to get significant results.

4. Simulation of social psychology: 50% of hypotheses are true, 50% are false, the effect size of true hypotheses is d = .4, and the sample size of studies is N = 20 per condition.
– Analyzing the simulated results (with k = 200 studies) with z-curve 2.0. In this simulation, the true discovery rate is 14%. That is, 14% of the 200 studies produced a significant result.
– Z-curve correctly estimates this discovery rate based on the distribution of the significant p-values, converted into z-scores.
– If only significant results are published, the observed discovery rate is 100%, but the true discovery rate is only 14%.
– Publication bias leads to false confidence in published results.
– Publication bias is also wasteful because we are discarding useful information. (A simulation sketch follows below.)
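
The sketch below reruns this simulation under the assumptions stated above (k = 200 studies, 50% true hypotheses with d = .4, n = 20 per condition, alpha = .05). The exact result varies from run to run, but the discovery rate hovers around 14%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulated_discovery_rate(k=200, n=20, d=0.4, share_true=0.5, alpha=0.05):
    """Share of k two-group studies (n per condition) with p < alpha when
    half the tested hypotheses are true (effect d) and half are false (d = 0)."""
    significant = 0
    for _ in range(k):
        effect = d if rng.random() < share_true else 0.0
        group1 = rng.normal(effect, 1, n)
        group2 = rng.normal(0, 1, n)
        significant += stats.ttest_ind(group1, group2).pvalue < alpha
    return significant / k

print(simulated_discovery_rate())   # ~0.14 on average
```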

5. Power analysis.
– Fisher did not have power analysis.
– Neyman and Pearson invented power analysis, but Fisher wrote the textbook for researchers.
– We had 100 years to introduce students to power analysis, but it hasn’t happened.
– Cohen wrote books about power analysis, but he was ignored.
– Cohen suggested we should aim for 80% power (more is not efficient).
– Think a priori about effect size to plan sample sizes (see the power-analysis sketch after this list).
– Power analysis was ignored because it often implied very large samples.
(very hard to get participants in Germany with small subject pools).
– no change because all p-values were treated as equal. p < .05 = truth.
– Literature reviews and textbooks treat every published significant result as truth.
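
For completeness, here is what such an a priori power analysis looks like in code, assuming a two-sided independent-samples t-test with d = .4, alpha = .05, and 80% power (using statsmodels as one possible tool):

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per group for d = .4, alpha = .05 (two-sided), 80% power
n_per_group = TTestIndPower().solve_power(effect_size=0.4, alpha=0.05, power=0.8)
print(round(n_per_group))   # ~99 per group, i.e. roughly N = 200 in total
```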

6. Repeating simulation (50% true hypotheses, effect size d = .4) with 80% power, N = 200.
– much higher discovery rate (58%)
– much more credible evidence
– z-curve makes it possible to distinguish between p-values from research with low or high discovery rate.
– Will this change the way psychologists look at p-values? Maybe, but Cohen and others have tried to change psychology without success. Will z-curve be a game-changer?

7. Personalized p-values
– P-values are being created by scientists.
– Scientists have some control over the type of p-values they publish.
– There are systemic pressures to publish more p-values based on low powered studies.
– But at some point, researchers get tenure.
– nobody can fire you if you stop publishing
– social media allow researchers to publish without censure from peers.
– tenure also means you have a responsibility to do good research.
– Researchers who are listed on the post with personalized p-values all have tenure.
– Some researchers, like David Matsumoto, have a good z-curve.
– Other researchers have way too many just significant results.
– The observed discovery rates of good and bad researchers are the same.
– Z-curve shows that the significant results were produced very differently and differ in credibility and replicability; this could be a game changer if people care about it.
– My own z-curve doesn’t look so good. 🙁
– How can researchers improve their z-curve?
– publish better research now
– distance yourself from bad old research
– So far, few people have distanced themselves from bad old work because there was no incentive to do so.
– Now there is an incentive to do so, because researchers can increase credibility of their good work.
– some people may move up when we add the 2020 data.
– hand-coding of articles will further improve the work.

8. Conclusion and Discussion
– not all p-values are created equal.
– working with undergraduates is easy because they are unbiased.
– once you are in grad school, you have to produce significant results.
– z-curve can help to avoid getting into labs that use questionable practices.
– I was lucky to work in labs that cared about the science.

Nations’ Well-Being and Wealth

Scientists have made a contribution when a phenomenon or a statistic is named after them. Thus, it is fair to say that Easterlin made a contribution to happiness research because researchers who write about income and happiness often mention his 1974 article “Does Economic Growth Improve the Human Lot? Some Empirical Evidence” (Easterlin, 1974).

To be fair, the article examines the relationship between income and happiness from three perspectives: (a) the correlation between income and happiness across individuals within nations, (b) the correlation of average incomes and average happiness across nations, and (c) the correlation between average income and average happiness within nations over time. A fourth perspective, namely the correlation between income and happiness within individuals over time, was not examined because no data were available in 1974.

Even for some of the other questions, the data were limited. Here I want to draw attention to Easterlin’s examination of correlations between nations’ wealth and well-being. He draws heavily on Cantril’s seminal contribution to this topic. Cantril (1965) not only developed a measure that can be used to compare well-being across nations, he also used this measure to compare the well-being of 14 nations (Cuba is not included in Table 1 because I did not have new data).

[Table 1: Cantril’s cross-cultural well-being data]

Cantril also correlated the happiness scores with a measure of nations’ wealth. The correlation was r = .5. Cantril also suggested that Cuba and the Dominican Republic were positive and negative outliers, respectively. Excluding these two nations increases the correlation to r = .7.

Easterlin took issue with these results.

“Actually the association between wealth and happiness indicated by Cantril’s international data is not so clear-cut. This is shown by a scatter diagram of the data (Fig. I). The inference about a positive association relies heavily on the observations for India and the United States. [According to Cantril (1965, pp. 130-131), the values for Cuba and the Dominican Republic reflect unusual political circumstances – the immediate aftermath of a successful revolution in Cuba and prolonged political turmoil in the Dominican Republic].

What is perhaps most striking is that the personal happiness ratings for 10 of the 14 countries lie virtually within half a point of the midpoint rating of 5, as is brought out by the broken horizontal lines in the diagram. While a difference of rating of only 0.2 is significant at the 0.05 level, nevertheless there is not much evidence, for these 10 countries, of a systematic association between income and happiness. The closeness of the happiness ratings implies also that a similar lack of association would be found between happiness and other economic magnitudes such as income inequality or the rate of change of income.”

Nearly 50 years later, it is possible to revisit Easterlin’s challenge of Cantril’s claim that nations’ well-being is tied to their wealth with much better data from the Gallup World Poll. The Gallup World Poll used the same measure of well-being. However, it also provides a better measure of citizens’ wealth by asking for income. In contrast, GDP can be distorted and may not reflect the spending power of the average citizen very well. The data about well-being (World Happiness Report, 2020) and median per capita income (Gallup) are publicly available. All I needed to do was to compute the correlation and make a pretty graph.

The Pearson correlation between income and the ladder scores is r(126) = .75. The rank correlation is r(126) = .80, and the Pearson correlation between the log of income and the ladder scores is r(126) = .85. These results strongly support Cantril’s prediction based on his interpretation of the first cross-national study in the 1960s and refute Easterlin’s challenge that this correlation is merely driven by two outliers. Other researchers who analyzed the Gallup World Poll data also reported correlations of r = .8 and showed high stability of nations’ wealth and income over time (Zyphur et al., 2020).
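
The computation itself is simple. The sketch below shows one way to do it; the file name and column names are placeholders for however one stores the ladder scores and Gallup’s median income figures, not an actual data file.

```python
import numpy as np
import pandas as pd

# Placeholder file/column names: one row per nation with the Cantril ladder
# score (World Happiness Report) and Gallup's median per-capita income.
df = pd.read_csv("nations.csv")

pearson = df["ladder"].corr(df["median_income"])                     # ~ .75
rank    = df["ladder"].corr(df["median_income"], method="spearman")  # ~ .80
log_r   = df["ladder"].corr(np.log(df["median_income"]))             # ~ .85

print(pearson, rank, log_r)
```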

Figure 2 also shows that Easterlin underestimated the range of well-being scores. Even ignoring additional factors like wars, income alone can move well-being from a 4 in one of the poorest countries in the world (Burundi) to close to an 8 in one of the richest countries in the world (Norway). It also does not show that Scandinavian countries have a happiness secret. The main reason for their high average well-being appears to be that median personal incomes are very high.

The main conclusion is that social scientists are often biased for a number of reasons. The bias is evident in Easterlin’s interpretation of Cantril’s data. The same anti-materialistic bias can be found in many other articles on this topic that claim the benefits of wealth are limited.

To be clear, a log-function implies that the same amount of wealth buys more well-being in poor countries, but the graph shows no evidence that the benefits of wealth level off. It is also true that the relationship between GDP and happiness over time is more complicated. However, regarding cross-national differences the results are clear. There is a very strong relationship between wealth and well-being. Studies that do not control for this relationship may report spurious relationships that disappear when income is included as a predictor.

Furthermore, the focus on happiness ignores that wealth also buys longer lives. Thus, individuals in richer nations not only have happier lives they also have more happy life years. The current Covid-19 pandemic further increases these inequalities.

In conclusion, one concern about subjective measures of well-being has been that individuals in poor countries may be happy with less and that happiness measures fail to reflect human suffering. This is not the case. Sustainable, global economic growth that raises per capita wealth remains a challenge to improve human well-being.

Jens Forster and the Credibility Crisis in Social Psychology

  • Please help out to improve this post. If you have conducted successful or unsuccessful replication studies of work done by Jens Forster, please share this information with me and I will add it to this blog post.

Jens Forster was a social psychologist from Germany. He was a rising star and on the way to receiving a prestigious 5 million Euro award from the Alexander von Humboldt Foundation (Retraction Watch, 2015). Then an anonymous whistleblower accused him of scientific misconduct. Under pressure, Forster returned the award without admitting to any wrongdoing.

He was also in the process of moving from the University of Amsterdam to the University of Bochum. After a lengthy investigation, Forster was denied tenure and he is no longer working in academia (Science, 2016), even though an investigation by the German association of psychologists (DGP) did not conclude that he committed fraud.

While the personal consequences for Forster are similar to those of Stapel, who admitted to fraud and left his tenured position, the effect on the scientific record is different. Stapel retracted over 50 articles, which are no longer cited in high numbers. In contrast, Forster retracted only a few papers, and most of his articles are not flagged to readers as potentially fraudulent. We can see the differences in the citation counts for Stapel and Forster.

Stapel’s Citation Counts

Stapel’s citation counts peaked at 350 and are now down to 150 citations a year. Some of these citations are with co-authors and from papers that have been cleared as credible.

Jens Forster Citations

Citation counts for Forster peaked at 450. They also decreased by 200 citations to 250 citations a year, but we are also seeing an uptick of 100 citations in 2019. The question is whether this muted correction is due to Forster’s denial of wrongdoing or whether the articles that were not retracted actually are more credible.

The difficulty in proving fraud in social psychology is that social psychologists also used many questionable practices to produce significant results. These questionable practices have the same effect as fraud, but they were not considered unethical or illegal. Thus, there are two reasons why articles that have not been retracted may still lack credible evidence. First, it is difficult to prove fraud when authors do not confess. Second, even if no fraud was committed, the data may lack credible evidence because they were produced with questionable practices that are not considered data fabrication.

For readers of the scientific literature it is irrelevant whether incredible results (i.e., results with low credibility) were produced with fraud or with other methods. The only question is whether the published results provide credible evidence for the theoretical claims in an article. Fortunately, meta-scientists have made progress over the past decade in answering this question. One method relies on a statistical examination of an author’s published test statistics. Test statistics can be converted into p-values or z-scores so that they have a common metric (e.g., t-values can be compared to F-values). The higher the z-score, the stronger is the evidence against the null-hypothesis. High z-scores are also difficult to obtain with questionable practices. Thus, they are either fraudulent or provide real evidence for a hypothesis (i.e., against the null-hypothesis).

I have published z-curve analyses of over 200 social/personality psychologists that show clear evidence of variation in research practices across researchers (Schimmack, 2021). I did not include Stapel or Forster in these analyses because doubts have been raised about their research practices. However, it is interesting to compare Forster’s z-curve plot to the plot of other researchers because it is still unclear whether anomalous statistical patterns in Forster’s articles are due to fraud or the use of questionable research practices.

The distribution of z-scores shows clear evidence that questionable practices were used because the observed discovery rate of 78% is much higher than the estimated discovery rate of 18% and the ODR is outside of the 95% CI of the EDR, 9% to 47%. An EDR of 18% places Forster at rank #181 in the ranking of 213 social psychologists. Thus, even if Forster did not conduct fraud, many of his published results are questionable.

The comparison of Forster with other social psychologists is helpful because humans are prone to overgeneralize from salient examples, which is known as stereotyping. Fraud cases like Stapel and Forster have tainted the image of social psychology and undermined trust in social psychology as a science. The fact that Forster would rank very low in comparison to other social psychologists shows that he is not representative of research practices in social psychology. This does not mean that Stapel and Forster are bad apples and extreme outliers. The use of QRPs was widespread, but how much researchers used QRPs varied across researchers. Thus, we need to take an individual difference perspective and personalize credibility. The average z-curve plot for all social psychologists ignores that some research practices were much worse and others were much better. Thus, I argue against stereotyping social psychologists and in favor of evaluating each social psychologist based on their own merits. As much as all social psychologists acted within a reward structure that nearly rewarded Forster’s practices with a 5 million Euro award, researchers navigated this reward structure differently. Hopefully, making research practices transparent can change the reward structure so that credibility gets rewarded.

Unconscious Emotions: Mindless Citations of Questionable Evidence

The past decade has seen major replication failures in social psychology. This has led to a method revolution in social psychology. Thanks to technological advances, many social psychologists moved from studies with smallish undergraduate samples to online studies with hundreds of participants. Thus, findings published after 2016 are more credible than those published before 2016.

However, social psychologists have avoided taking a closer look at theories that were built on the basis of questionable results. Review articles continue to present these theories and cite old studies as if they provided credible evidence for them, as if the replication crisis never happened.

One influential theory in social psychology is that stimuli can bypass conscious awareness and still influence behavior. This assumption is based on theories of emotions that emerged in the 1980s. In the famous Lazarus-Zajonc debate most social psychologists sided with Zajonc who quipped that “Preferences need no inferences.”

The influence of Zajonc can be seen in hundreds of studies with implicit primes (Bargh et al., 1996; Devine, 1989) and in modern measures of implicit cognition such as the evaluative priming task and the affect misattribution paradigm (AMP; Payne et al., 2005).

Payne and Lundberg (2014) credit a study by Murphy and Zajonc (1993) for the development of the AMP. Interestingly, the AMP was developed because Payne was unable to replicate a key finding from Murphy and Zajonc’s studies.

In these studies, a smiling or frowning face was presented immediately before a target stimulus (e.g., a Chinese character). Participants had to evaluate the target. The key finding was that the faces influenced evaluations of the targets only when the faces were processed without awareness. When participants were aware of the faces, they had no effect. When Payne developed the AMP, he found that preceding stimuli (e.g., faces of African Americans) still influenced evaluations of Chinese characters, even though the faces were presented long enough (75ms) to be clearly visible.

Although research with the AMP has blossomed, there has been little interest in exploring the discrepancy between Murphy and Zajonc’s (1993) findings and Payne’s findings.

Payne and Lundberg (2014)

One possible explanation for the discrepancy is that Murphy and Zajonc’s (1993) results were obtained with questionable research practices (QRPs, John et al., 2012). Fortunately, it is possible to detect the use of QRPs with forensic statistical tools. Here I use these tools to examine the credibility of Murphy and Zajonc’s claims that subliminal presentations of emotional faces produce implicit priming effects.

Before I examine the small set of studies from this article, it is important to point out that the use of QRPs in this literature is highly probable. This is revealed by examining the broader literature of implicit priming, especially with subliminal stimuli (Schimmack, 2020).


Figure 1 shows that published studies rarely report non-significant results, although the distribution of significant results shows low power and a high probability of non-significant results. While the observed discovery rate is 90%, the expected discovery rate is only 13%. This shows that QRPs were used to suppress results that did not show the expected implicit priming effects.

Study 1

Study 1 in Murphy and Zajonc (1993) had 32 participants; 16 with subliminal presentations and 16 with supraliminal presentations. There were 4 within-subject conditions (smiling, frowning, and two control conditions). The means of the affect ratings were 3.46 for smiling, 3.06 for both control conditions, and 2.70 for the frowning faces. The perfect ordering of means is a bit suspicious, but even more problematic is that the mean differences between experimental and control conditions were all statistically significant. The t-values, df = 15, are 2.23, 2.31, 2.31, and 2.59. Too many significant contrasts have been the downfall of a German social psychologist. Here we can only say that Murphy and Zajonc were very lucky that the two control conditions fell smack in the middle of the two experimental conditions. Any deviation in one direction would have increased one comparison but decreased the other comparison and increased the risk of a non-significant result.

Study 2

Study 2 was similar, except that the judgment was changed from subjective liking to objective goodness vs. badness judgments.

The means for the two control conditions were again right in the middle, nearly identical to each other, and nearly identical to the means in Study 1 (M = 3.05, 3.06). Given sampling error, it is extremely unlikely that even the same condition produces the same means. Without reporting actual t-values, the authors further claim that all four comparisons of experimental and control conditions are significant.

Taken together, these two studies with surprisingly similar t-values and 32 participants provide the only evidence for the claim that stimuli outside of awareness can elicit affective reactions. This weak evidence has garnered nearly 1,000 citations without ever being questioned and without published replication attempts.

Studies 3-5 did not examine affective priming, but Study 6 did. The paradigm here was different. Participants were subliminally presented with a smiling or a frowning face. Then they had to choose between two pictures, the prime and a foil. The foil either had the same facial expression or a different facial expression. Another manipulation was to have the same or a different gender. This study showed a strong effect of facial expression, t(62) = 6.26, but not of gender.

I liked this design and conducted several conceptual replication studies with emotional pictures (beautiful beaches, dirty toilets). It did not work. Participants were not able to use their affect to pick the right picture from a prime-foil pair. I also manipulated presentation times and with increasing presentation times, participants could pick out the picture, even if the affect was the same (e.g., prime and foil were both pleasant).

Study 6 also explains why Payne was unable to get priming effects for subliminal stimuli that varied race or other features.

One possible explanation for the results in Study 6 is that it is extremely difficult to mask facial expressions, especially smiles. I also did some studies that tried that and at least with computers it was impossible to prevent detection of smiling faces.

Thus, we are left with some questionable results in Studies 1 and 2 as the sole evidence that subliminal stimuli can elicit affective reactions that are transferred to other stimuli.

Conclusion

I have tried to get implicit priming effects on affect measures and failed. It was difficult to publish these failures in the early 2000s. I am sure there are many other replication failures (see Figure 1), and Payne and Lundberg’s (2014) account of the development of the AMP implies as much. Social psychology is still in the process of cleaning up the mess that the use of QRPs created. Implicit priming research is a poster child of the replication crisis, and researchers should stop citing these old articles as if they produced credible evidence.

Emotion researchers may also benefit from revisiting the Lazarus-Zajonc debate. Appraisal theory may not have the sex appeal of unconscious emotions, but it may be a more robust and accurate theory of emotions. Preference may not always require inferences, but preferences that are based on solid inferences are likely to be a better guide of behavior. Therefore I prefer Lazarus over Zajonc.

Open Science Reveals Most Published Results are Not False

CORRECTION: Open science also means that our mistakes are open and transparent. Shortly after I posted this blog, Spencer Greenberg pointed out that I made a mistake when I used the discovery rate in the OSC replication project to estimate the discovery rate in psychological science. I am glad he caught my mistake quickly and I can warn readers that my conclusions do not hold. A 50% success rate for replications in cognitive psychology suggests that most results in cognitive psychology are not false positives, but the low replication rate of 25% for social psychology does allow for a much higher false discovery rate than I estimated in this blog post.

===========================================================================

Money does not make the world go round, it cannot buy love, but it does pretty much everything else. Money is behind most scientific discoveries. Just like investments in stock markets, investments in science are unpredictable. Some of these investments are successful (e.g., Covid-19 vaccines), but others are not.

Most scientists, like myself, rely on government funding that is distributed in a peer-reviewed process by scientists to scientists. It is difficult to see how scientists would fund research that aims to show that most of their work is useless, if not fraudulent. This is where private money comes in.

The Arnold foundation handed out two big grants to reform science (Arnold Foundation Awards $6 Million to Boost Quality of Research; The Center for Open Science receives $7.5 million in additional funding from the Laura and John Arnold Foundation).

One grant was given to Ioannidis, who was famous for declaring that “most published results are false” (Ioannidis, 2005). The other grant was given to Nosek, to establish the Open Science Foundation.

Ioannidis and Nosek also worked together as co-authors (Button et al., 2013). In terms of traditional metrics of impact, the Arnold Foundation’s investment paid off. Ioannidis’s (2005) article has been cited over 4,000 times. Button et al.’s article has been cited over 2,000 times. And an influential article by Nosek and many others that replicated 100 studies from psychology has been cited over 2,000 times.

These articles are go-to citations for authors to claim that science is in a replication crisis, most published results are false, and major reforms to scientific practices are needed. It is no secret that many authors who cite these articles have not read the actual article. This explains why thousands of citations do not include a single article that points out that the Open Science Collaboration findings contradict Ioannidis’s claim that most published results are false.

The Claim

Ioannidis (2005) used hypothetical examples to speculate that most published results are false. The main assumption underlying these scenarios was that researchers are much more likely to test false hypotheses (a vaccine has no effect) than true hypotheses (a vaccine has an effect). The second assumption was that even when researchers test true hypotheses, they do so with a low probability to provide enough evidence (p < .05) that an effect occurred.

Under these assumptions, most empirical tests of hypotheses produce non-significant results (p > .05) and among those that are significant, the majority come from the large number of tests that tested a false hypothesis (false positives).

In theory, it would be easy to verify Ioannidis’s predictions because he predicts that most results are not significant, p > .05. Thus, a simple count of significant and non-significant results would reveal that many published results are false. The problem is that not all hypotheses tests are published and that significant results are more likely to be published than non-significant results. This bias in the selection of results is known as publication bias. Ioannidis (2005) called it researcher bias. As the amount of researcher bias is unknown, there is ample room to suggest that it is large enough to fit Ioannidis’s prediction that most published significant results are false positives.

The Missing Piece

Fifteen years after Ioannidis claimed that most published results are false, there have been few attempts to test this hypothesis empirically. One attempt was made by Jager and Leek (2014). This article made two important contributions. First, Jager and Leek created a program to harvest statistical results from abstracts in medical journals. Second, they developed a model to analyze the harvested p-values to estimate the percentage of false positive results in the medical literature. They ended up with an estimate of 14%, which is well below Ioannidis’s claim that over 50% of published results are false.

Ioannidis’s reply made it clear that a multi-million investment in his idea made it impossible to look at this evidence objectively. Clearly, his speculations based on no data must be right and an actual empirical test must be wrong, if it didn’t confirm his prediction. In science this is known as confirmation bias. Ironically, confirmation bias is one of the main obstacles that prevents science from making progress and to correct false beliefs.

[Quoted passage from Ioannidis’s (2014) reply, p. 34]

Fortunately, there is a much easier way to test Ioannidis’s claim than Jager and Leek’s model, which may have underestimated the false discovery risk. All we need to estimate the false discovery rate under the worst-case scenario is a credible estimate of the discovery rate (i.e., the percentage of significant results). Once we know how many tests produced a positive result, we can compute the maximum false discovery rate using a simple formula developed by Soric (1989).

Maximum False Discovery Rate = (1/Discovery Rate – 1)*(.05/.95)
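
For readers who want to apply the formula to their own numbers, here is a minimal Python sketch. The function name soric_max_fdr and the example discovery rates are mine, chosen only for illustration; the calculation is just the formula above with the conventional alpha of .05.

```python
def soric_max_fdr(discovery_rate, alpha=0.05):
    """Maximum false discovery rate implied by a discovery rate (Soric, 1989)."""
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))

# A discovery rate of 50% implies a maximum FDR of about 5%.
print(round(soric_max_fdr(0.50), 3))  # 0.053

# A discovery rate of 25% implies a maximum FDR of about 16%.
print(round(soric_max_fdr(0.25), 3))  # 0.158
```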

The only challenge is to find a discovery rate that is not inflated by publication bias. And that is where Nosek and the Center for Open Science come in.

The Reproducibility Project

It has been known for decades that psychology has a publication bias problem. Sterling (1959) observed that over 90% of published results report a statistically significant result. This finding was replicated in 1995 (Sterling et al., 1995) and again in 2015, when a large team of psychologists replicated 100 studies and found that 97% of the original studies had reported a statistically significant result (Open Science Collaboration, 2015).

Using Soric’s formula, a discovery rate of 97% would imply a maximum false discovery rate of less than 1%. However, the replication studies showed that this high discovery rate is inflated by publication bias. More importantly, the replication studies provide an unbiased estimate of the actual discovery rate in psychology. Thus, these results can be used to estimate the maximum false discovery rate in psychology, using Soric’s formula.

The headline finding of this article was that 36% (35/97) of the replication studies reproduced a significant result.

Using Soric’s formula, this implies a maximum (!) false discovery rate of 9%, which is well below the 50% predicted by Ioannidis. The difference is so large that no statistical test is needed to infer that the Open Science Collaboration’s results falsify Ioannidis’s claim.

Table 1 also shows the discovery rates for specific journals or research areas. The discovery rate for cognitive psychology in the journal Psychological Science is 53%, which implies a maximum FDR of 5%. For cognitive psychology published in the Journal of Experimental Psychology: Learning, Memory, and Cognition the DR of 48% implies a maximum FDR of 6%.

Things look worse for social psychology, which has also seen a string of major replication failures (Schimmack, 2020). However, even here we do not get false discovery rates over 50%. For social psychology published in Psychological Science, the discovery rate of 29% implies a maximum false discovery rate of 13%, and social psychology published in JPSP has a discovery rate of 23% and a maximum false discovery rate of 18%.

These results do not imply that everything is going well in social psychology, but they do show how unrealistic the scenarios were that led Ioannidis to false discovery rates over 50%.

Conclusion

The Arnold Foundation has funded major attempts to improve science. This is a laudable goal and I have spent the past 10 years working towards the same goal. Here I simply point out that one big successful initiative, the Reproducibility Project (Open Science Collaboration, 2015), produced valuable data that can be used to test a fundamental assumption in the open science movement, namely the fear that most published results are false. Using the empirical data from the Open Science Collaboration, we find no empirical support for this claim. Rather, the results are in line with Jager and Leek’s (2014) finding that strictly false results, where the null hypothesis is true, are the exception rather than the norm.

This does not mean that everything is going well in science because rejecting the null-hypothesis is only a first step towards testing a theory. However, it is also not helpful to spread false claims about science that may undermine trust in science. “Most published results are false” is an eye-catching claim, but it lacks empirical support. In fact, it has been falsified in every empirical test that has been conducted. Ironically, the strongest empirical evidence based on actual replication studies comes from a project that used open science practices that would not have happened without Ioannidis’s alarmist claim. This shows the advantages of open science practices and implementing these practices remains a valuable goal even if most published results are not strictly false positives.

Empirical Standards for Statistical Significance

Many sciences, including psychology, rely on statistical significance to draw inferences from data. A widely accepted practice is to consider results with a p-value less than .05 as evidence that an effect occurred.

Hundreds of articles have discussed the problems of this approach, but few have offered attractive alternatives. As a result, very little has changed in the way results are interpreted and published in 2020.

Even if this would suddenly change, researchers still have to decide what they should do with the results that have been published so far. At present, there are only two options. Either trust all results and hope for the best, or assume that most published results are false and start from scratch. Trust everything or trust nothing are not very attractive options. Ideally, we would want a method that can separate more credible findings from less credible ones.

One solution to this problem comes from molecular genetics. When it became possible to measure genetic variation across individuals, geneticists started correlating single variants with phenotypes (e.g., the serotonin transporter gene variation and neuroticism). These studies used the standard approach of declaring results with p-values below .05 as discoveries. Actual replication studies showed that many of these results could not be replicated. In response to these replication failures, the field moved towards genome-wide association studies that tested many genetic variants simultaneously. This further increased the risk of false discoveries. To avoid this problem, geneticists lowered the criterion for a significant finding. This criterion was not picked arbitrarily. Rather, it was determined by estimating the false discovery rate or false discovery risk. The classic article that recommended this approach has been cited over 40,000 times (Benjamini & Hochberg, 1995).
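
As a concrete illustration of the approach Benjamini and Hochberg proposed, the sketch below applies their step-up procedure to a handful of hypothetical p-values (the values are made up for this example). It uses the multipletests function from the statsmodels Python package; genome-wide analyses apply the same logic to thousands of tests.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from a small set of association tests.
p_values = np.array([0.0001, 0.004, 0.019, 0.03, 0.41, 0.62])

# Benjamini-Hochberg step-up procedure, controlling the FDR at 5%.
rejected, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')

for p, p_adj, is_discovery in zip(p_values, p_adjusted, rejected):
    print(f"p = {p:.4f}  BH-adjusted p = {p_adj:.4f}  discovery: {is_discovery}")
```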

In genetics, a single study produces thousands of p-values that require a correction for multiple comparisons. Studies in other disciplines usually produce far fewer p-values (typically less than 100). However, an entire scientific field also generates thousands of p-values. This makes it necessary to control for multiple comparisons and to lower the significance criterion from the nominal value of .05 to maintain a reasonably low false discovery rate.

The main difference between original studies in genomics and meta-analyses of studies in other fields is that publication bias can inflate the percentage of significant results. This leads to biased estimates of the actual false discovery rate (Schimmack, 2020).

One solution to this problem is selection models that take publication bias into account. Jager and Leek (2014) used this approach to estimate the false discovery rate in medical journals for statistically significant results, p < .05. In response to this article, Goodman (2014) suggested asking a different question:

What significance criterion would ensure a false discovery rate of 5%?

Although this is a useful question, selection models have not been used to answer it. Instead, recommendations for adjusting alpha have been based on ad-hoc assumptions about the number of true hypotheses that are being tested and about the power of studies.

For example, the false positive rate is greater than 33% with prior odds of 1:10 and a P value threshold of 0.05, regardless of the level of statistical power. Reducing the threshold to 0.005 would reduce this minimum false positive rate to 5% (D. J. Benjamin et al., 2017, p. 7).
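
The numbers in this quote follow from a standard calculation, reproduced below under the quote’s own assumptions: prior odds of 1:10 that a tested hypothesis is true, and power set to its maximum of 1 to obtain the minimum false positive rate. The function name and the code are mine; only the assumptions come from Benjamin et al. (2017).

```python
def false_positive_rate(alpha, power, prior_odds_true=1/10):
    """Share of significant results that are false positives, given the
    significance threshold, statistical power, and the prior odds that a
    tested hypothesis is true (Benjamin et al., 2017, assume 1:10)."""
    odds_false = 1 / prior_odds_true       # 10 false hypotheses per true one
    false_positives = alpha * odds_false   # expected false positives
    true_positives = power * 1             # expected true positives
    return false_positives / (false_positives + true_positives)

# Minimum false positive rate (power = 1) at the two thresholds in the quote.
print(round(false_positive_rate(alpha=0.05, power=1.0), 3))   # 0.333
print(round(false_positive_rate(alpha=0.005, power=1.0), 3))  # 0.048
```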

Rather than relying on assumptions, it is possible to estimate the maximum false discovery rate based on the distribution of statistically significant p-values (Bartos & Schimmack, 2020).

Here, I illustrate this approach with p-values from 120 psychology journals for articles published between 2010 and 2019. An automated extraction of test statistics found 670,055 usable test statistics. All test statistics were converted into absolute z-scores that reflect the amount of evidence against the null-hypothesis.
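
The conversion follows the usual convention of turning each test statistic into a two-sided p-value and mapping that p-value onto the standard normal distribution. The snippet below is a simplified sketch of this step for a t-value and an F-value (the example statistics and degrees of freedom are made up); it is not the actual extraction code.

```python
from scipy import stats

def t_to_abs_z(t, df):
    """Two-sided p-value of a t-test, converted to an absolute z-score."""
    p = 2 * stats.t.sf(abs(t), df)
    return stats.norm.isf(p / 2)

def f_to_abs_z(f, df1, df2):
    """P-value of an F-test, converted to an absolute z-score."""
    p = stats.f.sf(f, df1, df2)
    return stats.norm.isf(p / 2)

print(round(t_to_abs_z(2.1, df=38), 2))          # about 2.0
print(round(f_to_abs_z(9.0, df1=1, df2=80), 2))  # about 2.9
```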

Figure 1 shows the distribution of the absolute z-scores. The first notable observation is the drop (from right to left) in the distribution right at the standard level for statistical significance, p < .05 (two-tailed), which corresponds to a z-score of 1.96. This drop reveals publication bias. The amount of bias is reflected in a comparison of the observed discovery rate and the estimated discovery rate. The observed discovery rate of 67% is simply the percentage of p-values below .05. The estimated discovery rate is the percentage of significant results based on the z-curve model that is fitted to the significant results (grey curve). The estimated discovery rate is only 38%, and the 95% confidence interval around this estimate, 32% to 49%, does not include the observed discovery rate. This shows that significant results are more likely to be reported and that non-significant results are missing from published articles.

If we used the observed discovery rate of 67%, we would underestimate the risk of false positive results. Using Soric’s (1989) formula,

FDR = (1/DR – 1)*(.05/.95)

a discovery rate of 67% implies a maximum false discovery rate of 3%. Thus, no adjustment to the significance criterion would be needed to maintain a false discovery rate below 5%.

However, publication bias is present and inflates the discovery rate. To adjust for this, we can use the estimated discovery rate of 38% and get a maximum false discovery rate of 9%. As this value exceeds the desired maximum of 5%, we need to lower alpha to reduce the false discovery rate.
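
The contrast between the two discovery rates can be checked with the same Soric formula; only the input changes.

```python
def soric_max_fdr(discovery_rate, alpha=0.05):
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))

print(round(soric_max_fdr(0.67), 3))  # observed discovery rate: about 0.026 (3%)
print(round(soric_max_fdr(0.38), 3))  # estimated discovery rate: about 0.086 (9%)
```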

Figure 2 shows the results when alpha is set to .005 (z = 2.80), as recommended by Benjamin et al. (2017). The model is only fitted to data that are significant with this new criterion. We now see that the observed discovery rate (44%) is even lower than the estimated discovery rate (49%), although the difference is not significant. Thus, there is no evidence of publication bias with this new criterion for significance. The reason is that many questionable practices that are used to report significant results produce just-significant results. This is seen in the excess of just-significant results between z = 2 and z = 2.8. These results no longer inflate the discovery rate because they are no longer counted as discoveries. We also see that the estimated discovery rate produces a maximum false discovery rate of 6%, which may be close enough to the desired level of 5%.

Another piece of useful information is the estimated replication rate (ERR). This is the average power of results that are significant with p < .005 as criterion. Although lowering the alpha level decreases power, the average power of 66% suggests that many results should replicate successfully in exact replication studies with the same sample size. Increasing sample sizes could help to achieve 80% power.

In conclusion, we can use the distribution of p-values in the psychological literature to evaluate published findings. Based on the present results, readers of published articles could use p < .005 (rule of thumb: z > 2.8, t > 3, chi-square > 9, or F > 9) to evaluate statistical evidence.
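
The rule of thumb can be checked against the exact critical values for a two-sided p of .005. The degrees of freedom below are illustrative choices of mine, not values from the text; the rounded cutoffs of 9 for chi-square and F are deliberately a bit conservative.

```python
from scipy import stats

alpha = 0.005  # proposed significance criterion (two-sided)

print(round(stats.norm.isf(alpha / 2), 2))           # z threshold: about 2.81
print(round(stats.t.isf(alpha / 2, df=30), 2))       # t threshold (df = 30): about 3.03
print(round(stats.chi2.isf(alpha, df=1), 2))         # chi-square threshold (df = 1): about 7.88
print(round(stats.f.isf(alpha, dfn=1, dfd=100), 2))  # F threshold (1, 100 df): about 8.24
```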

The empirical approach to justify alpha with FDRs has the advantage that it can be adjusted for different literatures. This is illustrated with the Attitudes and Social Cognition section of JPSP. Social cognition research has experienced a replication crisis due to massive use of questionable research practices. It is possible that even alpha = .005 is too liberal for this research area.

Figure 3 shows the results for test statistics published in JPSP-ASC from 2000 to 2020.

There is clear evidence of publication bias (ODR = 71%, EDR = 31%). Based on the EDR of 31%, the maximum false discovery rate is 11%, well above the desired level of 5%. Even the 95%CI around the FDR does not include 5%. Thus, it is necessary to lower the alpha criterion.

Using p = .005 as the criterion improves things, but not fully. First, a comparison of the ODR and EDR suggests that publication bias was not fully removed (43% vs. 35%). Second, the EDR of 35% still implies a maximum FDR of 10%; the 95%CI now touches 5%, but its upper limit is 35%. Thus, even with p = .005, the social cognition literature is not credible.

Lowering the criterion further does not solve this problem. The reason is that there are now so few significant results that the discovery rate remains low. This is shown in the next figure, where the criterion is set to p < .0005 (z = 3.5). The model cannot be fitted to z-scores this extreme because there is insufficient information about lower-powered studies. Thus, the model was fitted to z-scores greater than 2.8 (p < .005). In this scenario, the expected discovery rate is 27%, which implies a maximum false discovery rate of 14%, and the 95%CI still does not include 5%.

These results illustrate the problem of conducting many studies with low power. The false discovery risk remains high because there are only a few test statistics with extreme values, and a few extreme test statistics are expected by chance.

In short, setting alpha to .005 is still too liberal for this research area. Given the ample replication failures in social cognition research, most results cannot be trusted. This conclusion is also consistent with the actual replication rate in the Open Science Collaboration (2015) project, which could only replicate 7/31 (23%) of the results. With a discovery rate of 23%, the maximum false discovery rate is 18%. This is still way below Ioannidis’s claim that most published results are false positives, but it is also well above 5%.

Different results are expected for the Journal of Experimental Psychology: Learning, Memory, and Cognition (JEP:LMC). Here the OSC project was able to replicate 13/27 (48%) of the results. A discovery rate of 48% implies a maximum false discovery rate of 6%. Thus, no adjustment to the alpha level may be needed for this journal.

Figure 6 shows the results for the z-curve analysis of test statistics published from 2000 to 2020. There is evidence of publication bias: the ODR of 67% falls outside the 95%CI of the EDR of 45%. However, with an EDR of 45%, the maximum FDR is 7%. This is close to the estimate based on the OSC results and close to the desired level of 5%.

For this journal it was sufficient to set the alpha criterion to p < .03. This produced a fairly close match between the ODR (61%) and EDR (58%) and a maximum FDR of 4%.

Conclusion

Significance testing was introduced by Fisher about 100 years ago, and he would recognize the way scientists analyze their data today because not much has changed. Over the past 100 years, many statisticians and practitioners have pointed out problems with this approach, but few practical alternatives have been offered. Adjusting the significance criterion depending on the research question is one reasonable modification, but it often requires more a priori knowledge than researchers have (Lakens et al., 2018). Lowering alpha makes sense when there is a concern about too many false positive results, but it can be a costly mistake when false positive results are fewer than feared (Benjamin et al., 2017). Here I presented a solution to this problem. It is possible to use the maximum false discovery rate to pick alpha so that the percentage of false discoveries is kept at a reasonable minimum.

Even if this recommendation does not influence the behavior of scientists or the practices of journals, it can be helpful to compute alpha values that ensure a low false discovery rate. At present, consumers of scientific research (mostly other scientists) are used to treating all significant results with p-values less than .05 as discoveries. Literature reviews mention studies with p = .04 as if they have the same status as studies with p = .000001. Once a p-value crosses the magic .05 level, it becomes a solid fact. This is wrong because statistical significance alone does not ensure that a finding is a true positive. To avoid this fallacy, consumers of research can do their own adjustment to the alpha level. Readers of JEP:LMC may use .05 or .03 because these alpha levels are sufficient for this journal. Readers of JPSP-ASC may lower alpha to .001.

Once readers demand stronger evidence from journals that publish weak evidence, researchers may actually change their practices. As long as consumers buy every p-value less than .05, there is little incentive for producers of p-values to try harder to produce stronger evidence, but when consumers demand p-values below .005, supply will follow. Unfortunately, consumers have been gullible, and it was easy to sell them results that do not replicate with a p < .05 warranty because they had no rational way to decide which p-values to trust. Maintaining a reasonably low false discovery rate has proved useful in genomics; it may also prove useful for other sciences.