Category Archives: Uncategorized

Totalitarian Scientists

I have read many, if not most, highly influential articles in the history of psychology, but once in a while I stumble upon an article I didn’t know. Here is one of them: “The Totalitarian Ego: Fabrication and Revision of Personal History” by Anthony (Tony) Greenwald (1980). Back in the day, Greenwald was a revolutionary who recognized many of the flaws that prevent psychology from being a real science. For example, in 1975 he published a critical article about the tendency to hide disconfirming evidence. This article is mentioned in the 1980 article.

The 1980 article was written during a time when social psychologists discovered cognitive biases and started to examine why humans often make errors in the processing of information. One influential hypothesis was that cognitive biases are actually beneficial for individuals, which led to Taylor and Brown’s (1988) claim that positive illusions are a sign of mental health.

The same argument is made by Greenwald (1980), and he compares the benefits of biases for individuals to those for totalitarian regimes and scientific theories. The main function of biases is to preserve either the ego of individuals, the organization of a totalitarian regime, or the integrity of a theory.

The view of biases as beneficial has been challenged. Illusions about reality can have dramatic negative consequences for individuals. In fact, there is little evidence to support the claim that positive illusions are beneficial for well-being (Schimmack & Kim, 2020). The idea that illusions are beneficial for scientific theory is even more questionable. After all, the very idea of science is that scientific theories should be subjected to empirical tests and revised or abandoned when they fail these tests. Greenwald (1980) first seems to agree.

But then he cites Popper and Kuhn to come to the opposite conclusion.

At least in the short term, it is beneficial for individuals and scientific theories to be protected against disconfirming evidence. It is only in the long run, when hard evidence makes a theory untenable, that individuals or theories need to change. For individuals, these hard facts may be life experiences that are not under their control. It may take years before it becomes clear that a marriage is not worth saving. However, scientists can avoid this moment of painful reckoning as long as they can hide disconfirming evidence by avoiding strong tests of theories, dismissing disconfirming evidence in their own studies, and using their status as experts in the peer-review process to keep disconfirming evidence from being published. Thus, scientists have a strong incentive to protect their egos and their theories (brain-children) from a confrontation with reality. Scientists whose ego is invested in a theory, as Greenwald is invested in the theory of implicit bias, are therefore the least trustworthy individuals to evaluate that theory; as Feynman observed, scientists should not fool themselves, but when it comes to their own theories, they are the easiest ones to fool.

Thus, scientists end up behaving like totalitarian societies. They will use all of their energy to preserve theories, even when they are false. Moreover, the biggest fools have an advantage because they have the least doubt about their theories, which facilitates goal attainment. The research program on implicit bias is a great example. The theory that individuals have unconscious, hidden biases that guide their behavior has become a dominant theory in social cognition research, without much evidence to support it (Schimmack, 2020). Criticism was sporadic and drowned out by the forces that pushed the theory.

While this has been extremely advantageous for the scientists pushing the theory, these totalitarian forces are bad for science as a whole. Thus, psychology needs to find a mechanism to counteract totalitarianism in science. Fortunately, there are some positive signs that this is happening. The 2010s have seen a string of major replication failures in social psychology that would have been difficult to publish when psychology was prejudiced against null-findings (Greenwald, 1975). Other changes are needed to subject theories to stronger tests so that they can fail before they have become too big to fail.

In conclusion, Greenwald’s (1980) article deserves some recognition for pointing out some similarities between ego-defense mechanisms, totalitarian regimes, and scientific theories. They all want to live forever, but eternal life is an unattainable goal. The goal of empirical research should not be to feed an illusion, but a process of evolution where old theories are constantly replaced by new theories that are better adapted to reality. Implicit bias theory had a good life. It’s time to die.

References

Greenwald, A. G. (1980). The totalitarian ego: Fabrication and revision of personal history. American Psychologist, 35(7), 603–618. https://doi.org/10.1037/0003-066X.35.7.603

Denial is Not Going to Fix Social Psychology

In 2015, social psychologists replicated published results in psychology journals. While the original articles, which often included multiple studies, reported nearly exclusively significant results (a 97% success rate), the replication studies produced only 25% significant results (Open Science Collaboration, 2015).

Since this embarrassing finding has been published, leaders of social psychology have engaged in damage control, using a string of false arguments to suggest that a replication rate of 25% is normal and not a crisis (see Schimmack, 2020, for a review).

One open question about the OSC results is what they actually mean. One explanation is that the original studies reported false positive results. That is, a significant result was reported although there is actually no effect of an experimental manipulation. The other explanation is that the original studies merely reported inflated effect sizes, but did get the direction of an effect right. As social psychologists do not care much about effect sizes, the latter explanation is not a problem for them. Unfortunately, a replication rate of 25% does not tell us how many original results were false positives, but there have been attempts to estimate the false discovery rate in the OSC studies.

Brent M. Wilson, a post-doctoral researcher at UC San Diego, and John T. Wixted, a distinguished professor at the same university, published an article that used sign changes between original studies and replication studies to estimate the false discovery rate (Wilson & Wixted, 2018). The logic is straightforward. A true null result is equally likely to show an effect in one direction (increase) or the other direction (decrease) due to sampling error alone. Thus, a sign change in a replication study may suggest that the original result was a statistical fluke. Based on this logic, the authors concluded that 49% of the results in social psychology were false positive results.
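To make the sign-change logic concrete, here is a minimal sketch of the arithmetic under two simplifying assumptions: true null effects flip sign in a replication half of the time, and true effects essentially never flip. The numbers are illustrative only and are not the authors’ actual model or data.

```python
# Minimal sketch of the sign-change logic (illustrative assumptions, not the authors' model):
# true null effects flip sign in a replication with probability .5; true effects are assumed
# never to flip. Solving p_sign_change = fdr * 0.5 + (1 - fdr) * p_flip_if_true for fdr:

def fdr_from_sign_changes(p_sign_change, p_flip_if_true=0.0):
    """Back out the false discovery rate implied by an observed sign-change rate."""
    return (p_sign_change - p_flip_if_true) / (0.5 - p_flip_if_true)

# If roughly a quarter of replications change sign and true effects never flip,
# about half of the original significant results would be false positives.
print(fdr_from_sign_changes(0.25))  # 0.5
```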

The implications of this conclusion cannot be overstated. Every other result published in social psychology is a false positive result. Half of the studies in social psychology textbooks support false claims, unless textbook writers are clairvoyant and can tell true effects from false effects. As if this were not bad enough, the estimate of 49% relies on the nil-hypothesis to classify a reported result as false. However, effects in the same direction that are very small have no practical significance, especially when effect sizes are difficult to estimate because they are sensitive to small changes in experimental procedures. Thus, the implication of Wilson and Wixted’s article is that social psychology has a replication crisis because it is not clear which published results can be replicated with practically meaningful effect sizes. I cited the article accordingly in my review article (Schimmack, 2020).

You may understand my surprise when the same authors, a couple of years later, published another article claiming that most published results are true (Wilson, Harris, & Wixted, 2020).

Although the authors do not suffer from full-blown amnesia and do recall and cite their previous article, they fail to mention that they previously estimated that 49% of published results in social psychology are false positives. Instead, they blur the distinction between cognitive and social psychology, although cognitive psychology had an estimate of 19% false positives, compared to the 49% for social psychology.

So, apparently the authors remembered that they published an article on this topic, but they forgot their main argument and the conclusions of their original article. In fact, in the original article they found a silver lining in the fact that 49% or more of the results in social psychology are false positives. They argued that this finding shows that social psychologists are willing to test risky hypotheses that have a high chance of being false. In contrast, cognitive psychologists should be ashamed of their 81% rate of true positives, which only shows that they make obvious predictions.

Assuming their estimate is correct, it is not good news that only 1 out of 17 hypotheses tested by social psychologists is true. The problem is that social psychologists do not just test a hypothesis and give up when they get a non-significant result. Rather, they continue to run a series of conceptual replication studies with minor variations until a significant result is found. Thus, the chance that false findings are published is rather high, which would explain why findings are difficult to replicate.
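For readers who want to see how such a low rate of true hypotheses translates into false positives in the published literature, here is a minimal sketch. The 1-in-17 prior is taken from the passage above, while the 80% power value is purely an illustrative assumption.

```python
# Sketch: false discovery rate among significant results, assuming a prior probability
# of testing a true hypothesis, alpha = .05, and an assumed average power (illustrative).
def false_discovery_rate(prior_true, power, alpha=0.05):
    false_pos = (1 - prior_true) * alpha   # significant results from false hypotheses
    true_pos = prior_true * power          # significant results from true hypotheses
    return false_pos / (false_pos + true_pos)

# With 1 out of 17 tested hypotheses true and an assumed power of 80%,
# about half of all significant results would be false positives.
print(round(false_discovery_rate(prior_true=1/17, power=0.80), 2))  # ~0.5
```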

In conclusion, Wilson and Wixted published two articles with opposing conclusions. One article claims that social psychology is a wild chase after effects in which most experiments test hypotheses that are false (i.e., the null-hypothesis is true). This leads to the publication of many false positive results that fail to replicate in honest replication studies that do not select for significance. Two years later, social psychology is a respectable science that may not be much different from cognitive psychology, and most published results are true, which also implies that most tested hypotheses must be true, because a high proportion of false hypotheses would result in false positive results in journals.

What caused this flip-flop about the replication crisis is unclear. Maybe the fact that Susan T. Fiske was in charge of publishing the new article in PNAS has something to do with it. Maybe she pressured the authors into saying nice things about social psychology. Maybe they were willing accomplices in whitewashing the embarrassing replication outcome for social psychology. I don’t know and I don’t care. Their new PNAS article is nonsense and ignores other evidence that social psychology has a major replication problem (Schimmack, 2020). Fiske may wish that articles like the PNAS article hide the fact that social psychologists made a mockery of the scientific method (publish only studies that work, err on the side of discovery, never replicate a study so that you have plausible deniability). I can only hope that young scholars realize that the old practices produce a pile of results that have no theoretical or practical meaning and work towards improving scientific practices. The future of social psychology depends on it.

References

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology, in press. https://replicationindex.com/2020/01/05/replication-crisis-review/

Wilson, B. M., & Wixted, J. T. (2018). The prior odds of testing a true effect in cognitive and social psychology. Advances in Methods and Practices in Psychological Science, 1, 186–197.

Wilson, B. M., Harris, C. R., & Wixted, J. T. (2020). Science is not a signal detection problem. Proceedings of the National Academy of Sciences. http://www.pnas.org/cgi/doi/10.1073/pnas.1914237117

Replicability Rankings of 120 Psychology Journals (2010-2019)

The individual z-curve plots can be found by clicking on the journal names.

Table 1 shows the results for the expected replication rate (ERR). The values are high because the analysis includes manipulation checks and because the ERR assumes exact replications. Actual replication attempts of focal hypotheses are likely to produce lower success rates. The ERR is still useful for examining differences between journals and changes over time.

Table 1 reports the individual results for the past three years, aggregated results for 2017-2019, 2014-2016, and 2010-2013. These aggregates produce more stable estimates, especially for journals with fewer test statistics. Correlations among the aggregates are r = .53 for 17-19 with 14-16, r = .65 for 14-16 with 10-13, and r = .32 for 17-19 with 10-13. The lower stability over the longer time period indicates that rank order changes are not just random measurement error.

The Change column is the difference between the average of the last three years (17-19) and the first four years (10-13). The results give some indication of whether replicability increased, but the scores are still subject to sampling error. Individual time-trend analyses showed statistically significant changes for 13 journals. Statistical significance is indicated by adding a + and printing the change score in bold. Some journals benefit from low sampling error and can show significance with a small increase. Others show larger increases, but sampling error is too large to show a clear linear trend. As more data become available, persistent trends will become significant.
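To give a sense of what such a time-trend test looks like, here is a minimal sketch that regresses yearly replicability estimates for one journal on year; the yearly values below are made up for illustration and are not the estimates behind Table 1.

```python
# Sketch of a per-journal time-trend test: regress yearly ERR estimates on year and
# check whether the slope is significantly positive. The yearly values are hypothetical.
from scipy.stats import linregress

years = list(range(2010, 2020))
err = [63, 64, 62, 66, 67, 65, 70, 71, 74, 76]  # hypothetical yearly ERR values (%)

result = linregress(years, err)
print(f"slope = {result.slope:.2f} per year, p = {result.pvalue:.3f}")
```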

Rank   Journal 2019 2018 2017 17-19 14-16 10-13 Change
1 Journal of Religion and Health 89 75 74 79 81 79 0
2 Journal of Individual Differences 88 69 83 80 66 76 4
3 Journal of Business and Psychology 87 72 86 82 73 77 5
4 Journal of Research in Personality 87 78 78 81 78 72 +9
5 Journal of Happiness Studies 85 69 57 70 79 80 -10
6 Journal of Occupational Health Psychology 85 65 55 68 72 68 0
7 Journal of Youth and Adolescence 85 66 70 74 80 75 -1
8 Journal of Nonverbal Behavior 84 79 88 84 70 68 +16
9 Journal of Research on Adolescence 84 64 66 71 67 73 -2
10 Cognitive Psychology 83 82 72 79 76 76 3
11 Evolution & Human Behavior 83 78 73 78 76 64 +14
12 Psychology of Men and Masculinity 83 69 57 70 69 82 -12
13 Developmental Psychology 82 78 76 79 74 68 +11
14 Evolutionary Psychology 82 78 78 79 78 75 4
15 Experimental Psychology 82 78 74 78 73 72 6
16 Journal of Anxiety Disorders 82 77 78 79 73 75 4
17 Psychological Science 82 76 71 76 67 63 +13
18 Attention, Perception and Psychophysics 81 80 79 80 73 76 4
19 Cognition 81 77 76 78 73 74 4
20 Journal of Behavioral Decision Making 81 76 70 76 75 68 8
21 Journal of Organizational Psychology 81 68 73 74 71 69 5
22 Aggressive Behavior 80 78 72 77 71 68 9
23 Cognitive Development 80 73 79 77 75 70 7
24 Consciousness and Cognition 80 78 77 78 70 70 8
25 Depression & Anxiety 80 82 75 79 74 84 -5
26 European Journal of Personality 80 73 76 76 76 72 4
27 Judgment and Decision Making 80 72 81 78 77 72 6
28 Memory and Cognition 80 80 79 80 76 77 3
29 Psychology and Aging 80 72 79 77 79 74 3
30 Psychonomic Bulletin and Review 80 79 75 78 79 76 2
31 Journal of Applied Psychology 79 67 78 75 74 70 5
32 Journal of Cross-Cultural Psychology 79 78 76 78 78 76 2
33 Journal of Experimental Psychology – General 79 78 78 78 74 70 +8
34 Journal of Memory and Language 79 78 80 79 78 75 4
35 Journal of Occupational and Organizational Psychology 79 82 71 77 76 72 5
36 Journal of Positive Psychology 79 71 82 77 71 68 9
37 Journal of Sex Research 79 80 81 80 78 80 0
38 Journal of Social Psychology 79 79 75 78 70 72 6
39 Law and Human Behavior 79 80 76 78 68 76 2
40 Personality and Individual Differences 79 76 76 77 77 72 +5
41 Perception 79 73 76 76 76 86 -10
42 Acta Psychologica 78 71 77 75 75 74 1
43 Asian Journal of Social Psychology 78 80 68 75 74 70 5
44 Journal of Child and Family Studies 78 75 73 75 70 73 2
45 Journal of Counseling Psychology 78 85 69 77 76 75 2
46 J. of Exp. Psychology – Learning, Memory & Cognition 78 78 79 78 78 76 2
47 Journal of Experimental Social Psychology 78 75 71 75 64 56 +19
48 Memory 78 74 74 75 78 81 -6
49 British Journal of Social Psychology 77 70 75 74 63 63 11
50 Cognitive Therapy and Research 77 74 75 75 67 71 4
51 European Journal of Social Psychology 77 64 71 71 70 64 7
52 Social Psychological and Personality Science 77 80 75 77 63 59 +18
53 Archives of Sexual Behavior 76 72 78 75 78 80 -5
54 Emotion 76 73 73 74 70 70 4
55 Journal of Affective Disorders 76 75 75 75 82 77 -2
56 J. of Exp. Psychology – Human Perception and Performance 76 78 76 77 76 76 1
57 Journal of Pain 76 79 68 74 77 72 2
58 Personal Relationships 76 83 76 78 69 64 +14
59 Psychology of Religion and Spirituality 76 82 69 76 79 69 7
60 Appetite 75 76 77 76 67 72 4
61 Group Processes & Intergroup Relations 75 65 70 70 68 65 5
62 Journal of Cognition and Development 75 82 74 77 70 65 12
63 Journal of Cognitive Psychology 75 79 75 76 76 77 -1
64 Journal of Experimental Psychology – Applied 75 75 80 77 70 70 7
65 JPSP-Personality Processes and Individual Differences 75 79 64 73 72 66 7
66 Political Psychology 75 88 76 80 74 63 17
67 Psychopharmacology 75 68 73 72 74 72 0
68 Psychophysiology 75 79 78 77 74 74 3
69 Quarterly Journal of Experimental Psychology 75 79 76 77 75 74 3
70 Animal Behavior 74 71 77 74 70 72 2
71 Behaviour Research and Therapy 74 75 70 73 74 71 2
72 British Journal of Developmental Psychology 74 77 72 74 72 76 -2
73 Frontiers in Psychology 74 74 76 75 74 72 3
74 Journal of Abnormal Psychology 74 71 69 71 64 70 1
75 Journal of Applied Social Psychology 74 75 73 74 73 72 2
76 Journal of Consumer Behaviour 74 81 72 76 76 79 -3
77 Journal of Health Psychology 74 81 63 73 75 72 1
78 JPSP-Attitudes & Social Cognition 74 80 79 78 66 58 20
79 Journal of Social and Personal Relationships 74 74 71 73 67 73 0
80 Psychology and Marketing 74 68 70 71 70 68 3
81 Behavioral Neuroscience 73 66 73 71 69 70 1
82 Canadian Journal of Experimental Psychology 73 87 73 78 77 76 2
83 Cognition and Emotion 73 75 66 71 72 82 -11
84 European Journal of Developmental Psychology 73 90 85 83 75 71 12
85 Journal of Child Psychology and Psychiatry and Allied Disciplines 73 62 68 68 66 64 4
86 Journal of Educational Psychology 73 73 79 75 71 77 -2
87 Organizational Behavior and Human Decision Processes 73 70 68 70 70 69 1
88 Psychological Medicine 73 76 73 74 75 73 1
89 Sex Roles 73 83 81 79 76 75 4
90 Social Psychology 73 83 73 76 73 71 5
91 Behavioural Brain Research 72 69 71 71 70 72 -1
92 British Journal of Psychology 72 79 76 76 79 74 2
93 Developmental Science 71 67 73 70 67 69 1
94 Journal of Personality 71 79 77 76 70 67 9
95 Behavior Therapy 70 67 71 69 71 72 -3
96 Child Development 70 73 66 70 71 72 -2
97 International Journal of Psychophysiology 70 71 73 71 67 68 3
98 Journal of Consulting and Clinical Psychology 70 71 77 73 64 65 8
99 Journal of Experimental Child Psychology 70 73 71 71 75 73 -2
100 Motivation and Emotion 70 62 73 68 65 70 -2
101 Frontiers in Behavioral Neuroscience 69 72 74 72 70 70 2
102 Journal of Comparative Psychology 69 68 67 68 75 70 -2
103 JPSP-Interpersonal Relationships and Group Processes 69 72 68 70 67 58 +12
104 Frontiers in Human Neuroscience 68 73 71 71 74 75 -4
105 Journal of Consumer Research 68 68 65 67 60 59 8
106 Journal of Family Psychology 68 76 70 71 68 66 5
107 Journal of Vocational Behavior 68 83 75 75 78 80 -5
108 Personality and Social Psychology Bulletin 68 71 74 71 65 61 +10
109 Self and Identity 68 60 67 65 66 72 -7
110 Hormones & Behavior 66 69 61 65 63 63 2
111 Psychoneuroendocrinology 66 68 65 66 64 62 +4
112 Annals of Behavioral Medicine 65 74 70 70 69 74 -4
113 Biological Psychology 65 68 63 65 68 66 -1
114 Cognitive Behavioral Therapy 65 67 75 69 73 71 -2
115 Health Psychology 63 77 70 70 64 65 5
116 Infancy 62 58 61 60 62 64 -4
117 Journal of Consumer Psychology 62 66 57 62 62 62 0
118 Psychology of Music 62 74 81 72 74 79 -7
119 Developmental Psychobiology 60 66 63 63 67 69 -6
120 Social Development 55 80 73 69 73 72 -3

The Lost Decades in Psychological Science

Methodologists have criticized psychological research for decades (Cohen, 1962; Maxwell, 2004; Sedlmeier & Gigerenzer, 1989; Sterling, 1959). A key concern is that psychologists conduct studies that are only meaningful when they reject the nil-hypothesis that results are just a chance finding (p < .05), yet many studies have low statistical power to do so. As a result, many studies that were conducted remained unpublished, while studies that were published often obtained significance only with the help of chance. Despite repeated attempts to educate psychologists about statistical power, there has been little evidence that researchers increased the power of their studies. The main reason is that power analyses often showed that large samples were required, which could mean years of data collection for a single study. At the same time, the pressure to publish increased, and nobody could afford to work on a study for years without publishing. Therefore, psychologists found ways to produce significant results with smaller samples. The problem is that these questionable practices inflate effect sizes and make it difficult to replicate results. This produced the replication crisis in psychology (see Schimmack, 2020, for a review).

So far, the replication crisis has played out mostly in social psychology because social psychologists have conducted the most replication attempts and produced a pile of replication failures in the 2010s. The replication crisis has produced a lot of discussion about reforms and many suggestions to increase statistical power. However, the incentive structure has not changed. Graduate students today are required to have a CV with many original articles to be competitive on the job market. Thus, strong forces counteract reforms of research practices in psychology.

In this blog post, I examine whether psychologists have changed their research practices in ways that increase statistical power. To do so, I use automatically extracted test statistics from 121 psychology journals that cover a broad range of psychology, including social, cognitive, personality, developmental, clinical, physiological, and brain sciences. With the help of undergraduate students, I downloaded all articles from these journals from 2010 to 2019. To keep this post short, I am only presenting the results for 2010 and 2019. The latest year is particularly important because reforms take time, and the most recent year provides the best opportunity to see their effects.

All test statistics are converted into absolute z-scores. A bigger z-score means that a test statistic provides stronger evidence against the nil-hypothesis that there is no effect. The higher the z-scores, the greater the power of studies to reject the nil-hypothesis. Thus, any increase in power would shift the distribution of z-scores to the right. Z-curve (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020) uses the distribution of significant z-scores (z > 1.96) to estimate several statistics. One statistic is the expected replication rate (ERR): the percentage of significant results that would be significant again if the studies were replicated exactly. Another statistic is the expected discovery rate (EDR): the percentage of all conducted tests that are expected to produce a significant result, estimated from the distribution of the significant results. The EDR can be lower than the observed discovery rate (ODR) if there is selection for significance.
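As an illustration of this conversion, the sketch below turns reported t-values and F-values into two-sided p-values and then into absolute z-scores. This follows one common convention; the exact conversion used for the analyses reported here may differ in detail.

```python
# Convert reported test statistics into absolute z-scores via their two-sided p-values.
# One common convention (p-value -> standard-normal quantile); details may differ from
# the conversion used for the reported analyses.
from scipy import stats

def z_from_t(t_value, df):
    p = 2 * stats.t.sf(abs(t_value), df)  # two-sided p-value of the t-test
    return stats.norm.isf(p / 2)          # absolute z-score with the same p-value

def z_from_f(f_value, df1, df2):
    p = stats.f.sf(f_value, df1, df2)     # p-value of the F-test
    return stats.norm.isf(p / 2)

print(z_from_t(2.45, 34))       # roughly 2.3
print(z_from_f(5.67, 1, 120))   # roughly 2.3
```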

Figure 1 shows the results for 2010.

First, visual inspection shows clear evidence of the use of questionable research practices: there are a lot fewer z-scores just below 1.96 (not significant) than just above 1.96. The grey curve in Figure 1 shows the expected distribution of non-significant results. The amount of selection is reflected in the discrepancy between the observed discovery rate of 69% and the expected discovery rate of 32%, 95%CI = xx to xx.

The expected replication rate is 71%, 95%CI = xx to xx. This is much higher than the estimate of 37% based on the Open Science Collaboration project (Open Science Collaboration, 2015). However, there are two caveats. First, the present analysis is based on all reported statistical tests, including manipulation checks that should produce very strong evidence against the null-hypothesis. Results for focal and risky hypothesis tests will be lower. Second, the ERR assumes that studies can be replicated exactly. This is typically not the case in psychology (Lykken, 1968; Stroebe & Strack, 2014). Bartos and Schimmack (2020) found that the EDR was a better predictor of actual replication outcomes. Assuming that this finding generalizes, the present EDR estimate of 32% would be consistent with the Open Science Collaboration result of 37%.

Figure 1 establishes baseline results for 2010 at the beginning of the replication crisis in psychology. Figure 2 shows the results nine years later.

A comparison of the results shows a small improvement. The observed discovery rate decreased from 69% to 64%. This means that researchers are reporting more non-significant results. The EDR increased from 32% to 37%. The ERR also increased from 71% to 75%. However, there is still clear evidence that questionable research practices inflate the percentage of significant results in psychology journals, and the small increases in the EDR and ERR predict only a small increase in replicability. Thus, if the Open Science Collaboration project were repeated with studies from 2019, it would still be likely to produce fewer than 50% successful replications.

Figure 3 shows the ERR (black), ODR (grey), and EDR (black, dotted) for all 10 years. The continuous upward trend for the ERR suggests that power is increasing a bit, at least in some areas of psychology. However, the trend for the EDR shows no consistent improvement, suggesting that editors are unwilling or unable to reject manuscripts that used QRPs to produce just-significant results. A simple solution might be to require either (a) pre-registration with a clear power analysis that is followed exactly or (b) a more stringent criterion of significance, p < .005, to compensate for hidden multiple comparisons.

Conclusion

These results provide important scientific evidence about research practices in psychology. Dozens of articles have discussed the replication crisis. Dozens of editorials by new editors have introduced new policies to increase the replicability of psychological results. However, without hard evidence, claims about progress are essentially projective tests that say more about the authors than about the state of psychology as a science. The present results provide no evidence that psychology as a field has successfully addressed problems that are decades old, or that it can be considered a leader in openness and transparency.

Most important, there is no evidence that researchers listen to Cohen’s famous saying “Less is more, except for sample size.” Instead, they are publishing more and more studies in more and more articles, which leads to more and more citations, which looks like progress on quantitative indicators like Impact Factors. However, most of these findings are only replicated in conceptual replication studies that are also selected for significance, giving false evidence of robustness (Schimmack, 2012). Thus, it is unclear which results are replicable and which results are not.

It would require enormous resources to follow up on these questionable results with actual replication studies. To improve psychology, psychologists need to change the incentive structure. To do so, we need to quantify strength of evidence and stop treating all results that are p < .05 as equally significant. A z-score of 2 is just significant, while a z-score of 5 corresponds to the criterion in particle physics that is used to claim a decisive result. Investing resources in decisive studies needs to be rewarded because a single well-designed experiment with z > 5 provides stronger evidence than many weak studies with z = 2, especially if just-significant results may be obtained with questionable practices that inflate effect sizes.
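As a quick check of these thresholds, the two-sided p-values for z = 2 and z = 5 can be computed directly (a minimal sketch; the 5-sigma convention in particle physics is usually stated one-sided, so the exact value differs slightly).

```python
# Two-sided p-values for a just-significant and a decisive z-score.
from scipy.stats import norm

print(2 * norm.sf(2))   # ~0.046, just below the .05 criterion
print(2 * norm.sf(5))   # ~5.7e-7, on the order of the 5-sigma criterion
```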

The only difference between my criticism and previous criticisms of low-powered studies is that technology now makes it possible to track the research practices of psychologists in real time. Students downloaded the articles for 2019 at the beginning of February, and processing this information took a couple of days (mainly to convert PDFs into txt files). Running a z-curve analysis with bootstrapped confidence intervals takes only a couple of minutes. Therefore, we have hard empirical evidence about research practices in 2019, and the results show that questionable research practices continue to be used to publish more significant results than the power of studies warrants. I hope that demonstrating and quantifying the use of these practices helps to curb their use and to reward researchers who conduct well-powered studies.

A simple way to change the incentive structure is to ban QRPs and treat them like other research fraud. John et al. (2012) introduced the term “scientific doping” for questionable research practices. If sports organizations can ban doping to create fair competitions, why couldn’t scientists do the same? The past decade has shown that they are unable to self-regulate. It is time for funders and consumers (science journalists, undergraduate students, textbook writers) to demand transparency about research practices and an end to fishing for significance.

Estimating Replicability in the “British Journal of Social Psychology”

Introduction

There is a replication crisis in social psychology (see Schimmack, 2020, for a review). One major cause of the replication crisis is selection for statistical significance. Researchers conduct many studies with low power, but only the significant results get published. As these results are often significant only with the help of sampling error, replication studies fail to reproduce them. Awareness of these problems has led some journal editors to change submission guidelines in the hope of attracting more replicable results. As replicability depends on power, higher replicability would mean that the mean power of statistical tests has increased. This can be tested by estimating the mean power before and after selection for significance (Bartos & Schimmack, 2020; Brunner & Schimmack, 2019).

In 2017, John Drury and Hanna Zagefka took over as editors of the “British Journal of Social Psychology” (BJSP). Their editorial directly addresses the replication crisis in social psychology.

A third small change has to do with the continuing crisis in social psychology (especially in quantitative experimental social psychology). We see the mission of social psychology to be to make sense of our social world, in a way which is necessarily selective and has subjective aspects (such as choice of topic and motivation for the research). This sense-making, however, must not entail deliberate distortions, fabrications, and falsifications. It seems apparent to us that the fundamental causes of the growth of data fraud, selective reporting of results and other issues of trust we now face are the institutional pressures to publish and the related reward structure of academic career progression. These factors need to be addressed.

In response to this analysis of problems in the field, they introduced new submission guidelines.

Current debate demonstrates that there is a considerable grey area when deciding which methodological choices are defensible and which ones are not. Clear guidelines are therefore essential. We have added to the submission portal a set of statements to which authors respond in relation to determining sample size, criteria for data exclusion, and reporting of all manipulations, conditions, and measures. We will also encourage authors to share their data with interested parties upon request. These responses will help authors understand what is considered acceptable, and they will help associate editors judge the scientific soundness of the work presented.

In this blog post, I examine the replicability of results published in BJSP and I examine whether changes in submission guidelines have increased replicability. To do so, I downloaded articles from 2000 to 2019 and automatically extracted test-statistics (t-values, F-values) from those articles. All test-statistics were converted into absolute z-scores. Higher z-scores provide stronger evidence against the nil-hypothesis. I then submitted the 8,605 z-scores to a z-curve analysis. Figure 1 shows the results.

First, visual inspection shows a clear drop around z = 1.96. This value corresponds to the typical significance criterion of .05 (two-sided). This drop shows the influence of selectively publishing significant results. A quantitative test of selection can be made by comparing the observed discovery rate to the expected discovery rate. The observed discovery rate (ODR) is the percentage of reported results that are significant, 70%, 95%CI = 69% to 71%. The expected discovery rate (EDR) is estimated by z-curve on the basis of the distribution of the significant results (grey curve). The EDR is lower, 46%, and its 95%CI, 25% to 57%, does not include the ODR. Thus, there is clear evidence that results in BJSP are biased towards significant results.
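For transparency, the confidence interval for the observed discovery rate is just a binomial proportion interval; the sketch below reproduces it approximately from the roughly 8,605 extracted tests mentioned above (the exact counts may differ slightly).

```python
# Approximate 95% confidence interval for the observed discovery rate (a simple
# binomial proportion interval); counts are reconstructed from the numbers in the text.
from statsmodels.stats.proportion import proportion_confint

n_tests = 8605                          # extracted test statistics (from the text)
n_significant = round(0.70 * n_tests)   # observed discovery rate of about 70%

low, high = proportion_confint(n_significant, n_tests, alpha=0.05, method="wilson")
print(f"ODR = {n_significant / n_tests:.2f}, 95%CI = {low:.2f} to {high:.2f}")
```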

Z-curve also estimates the replicability of significant results. The expected replication rate (ERR) is the percentage of significant results that would be expected in exact replication studies. The ERR is 68%, with a 95%CI ranging from 68% to 73%. This is not a bad replication rate, but there are two caveats. First, automatic extraction does not distinguish theoretically important focal tests from other tests such as manipulation checks. A comparison of automated extraction and hand-coding shows that replication rates for focal tests are lower than the ERR from automated extraction (cf. the analysis of JESP). The results for BJSP are slightly better than the results for JESP (ERR: 68% vs. 63%; EDR: 46% vs. 35%), but the differences are not statistically significant (the confidence intervals overlap). Hand-coding of JESP articles produces an ERR of 39% and an EDR of 12%. Thus, the overall analysis of BJSP suggests that replication rates for actual replication studies are similar to those for social psychology in general. The Open Science Collaboration found that only 25% could be replicated.

Figure 2 examines time-trends by computing the ERR and EDR for each year. It also computes the ERR (solid) and EDR (dotted) in analyses that are limited to p-values smaller than .005 (grey), which are less likely to be produced by questionable practices. The EDR estimates are highly variable because they are very sensitive to the number of just significant p-values. The ERR estimates are more stable. Importantly, none of them show a significant trend over time. Visual inspection also suggests that editorial changes in 2017 haven’t yet produced changes in published results in 2018 or 2019.

Given concerns about questionable practices and low replicability in social psychology, readers should be cautious about empirical claims, especially when they are based on just-significant results. P-values should be at least below .005 to be considered empirical evidence.

Magical Moderated-Multiple Regression

A naive model of science is that scientists conduct studies and then report the results. At least for psychological science, this model does not describe the actual research practices. It has been documented repeatedly that psychological scientists pick and choose the results that they report. This explains how psychology journals publish mostly significant results (p < .05) although most studies have only a small chance to produce a significant result. One study found that social psychology journals publish nearly 100% significant results, when the actual chance to do so is only 25% (Open Science Collaboration, 2015). The discrepancy is explained by questionable research practices. Just like magic, questionable research practices can produce stunning results that never happened (Bem, 2011). I therefore compared articles that used QRPs to a magic show (Schimmack, 2012).

Over the past decades, several methods have been developed to distinguish real findings from magical ones. Applications of these methods have revealed the use of QRPs, especially in experimental social psychology. So far, the focus has been on simple statistical analysis, where an independent variable (e.g., an experimental manipulation) is used to predict variation in a dependent variable. A recent article focused on more complex statistical analysis, called moderated-multiple regression (O’Boyle, Banks, Carter, Walter & Yuan, 2019).

There are two reasons to suspect that moderated-multiple regression results are magical. First, moderated regression requires large sample sizes to have sufficient power to detect small effects (Murphy & Russell, 2016). Second, interaction terms in regression models are optional. Researchers can focus on the main results to publish and add interaction terms only when they produce a significant result. Thus, outcome reporting bias (O’Boyle et al., 2019) is an easy and seemingly harmless QRP that may produce a large file-drawer of studies where moderated-regression was tried, but failed to produce significant results. This is not the only possible QRP. It is also possible to try multiple interaction terms, until a specific combination of variables produces a significant result.
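To illustrate the first point (the power problem), here is a minimal simulation of how often a small interaction effect reaches p < .05 in a moderated multiple regression with a typical sample size; the sample size and effect sizes are assumptions chosen for illustration, not values from the article.

```python
# Minimal power simulation for a small interaction effect in moderated multiple regression.
# Sample size and effect sizes are illustrative assumptions, not values from O'Boyle et al.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

def interaction_power(n=100, b_interaction=0.1, n_sims=500):
    hits = 0
    for _ in range(n_sims):
        x1 = rng.normal(size=n)
        x2 = rng.normal(size=n)
        y = 0.3 * x1 + 0.3 * x2 + b_interaction * x1 * x2 + rng.normal(size=n)
        data = pd.DataFrame({"y": y, "x1": x1, "x2": x2})
        p = smf.ols("y ~ x1 * x2", data=data).fit().pvalues["x1:x2"]
        hits += p < 0.05
    return hits / n_sims

# With n = 100 and a small interaction, far fewer than 80% of simulated studies are significant.
print(interaction_power())
```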

O’Boyle et al. hand-coded results from 343 articles in six management and applied psychology journals that were published between 1995 and 2014. Evidence for the use of QRPs was provided by examining the prevalence of just significant p-values (right figure). There is an unexplained peak just below .05 (.045 to .05).

P-value distributions are less informative about the presence of QRPs than distributions of the same results converted into z-scores. O’Boyle et al. shared their data with me, and I conducted a z-curve analysis of the moderated regression results in applied psychology. The dataset contained information about 449 results that could be used to compute exact p-values. The z-curve plot shows clear evidence of QRPs.

Visual inspection shows a cliff around z = 1.96, which corresponds to a p-value of .05 (two-tailed). This indicates that there should be more non-significant results than are reported. Z-curve also estimates how many non-significant results there should be given the distribution of significant results (grey curve). The plot shows that a much larger number of non-significant results are expected than are actually reported. Z-curve quantifies the use of QRPs by comparing the observed discovery rate (how many reported results are significant) to the expected discovery rate (the area under the grey curve for significant results). The ODR is 52%, the EDR is only 12%, and the confidence intervals do not overlap. The 95%CI for the EDR ranges from 5% to 32%. A value of 5% would imply that discoveries occur at the rate expected by chance alone. Thus, based on these results, it is impossible to reject the nil-hypothesis that all significant results are false positives. This does not mean that all of the results are false positives. Soric’s maximum false discovery rate is estimated to be 39%, but the 95%CI is very wide and ranges from 11% to 100%. Thus, we simply have insufficient evidence to draw strong conclusions from the data.
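The Soric bound mentioned here follows directly from the expected discovery rate; a minimal sketch of the calculation is shown below, with the wide interval obtained by plugging in the bounds of the EDR’s confidence interval.

```python
# Soric's upper bound on the false discovery rate, computed from the expected discovery rate.
def soric_max_fdr(edr, alpha=0.05):
    return (1 / edr - 1) * (alpha / (1 - alpha))

# Point estimate and the bounds obtained from the EDR's 95% confidence interval.
print(round(soric_max_fdr(0.12), 2))  # ~0.39
print(round(soric_max_fdr(0.32), 2))  # ~0.11 (EDR upper bound -> FDR lower bound)
print(round(soric_max_fdr(0.05), 2))  # ~1.00 (EDR lower bound -> FDR upper bound)
```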

Z-curve also computes the expected replication rate (ERR). The ERR is the percentage of analyses with significant results that are expected to produce a significant result again if the studies were replicated exactly with the same sample sizes. The ERR is only 40%. One caveat is that it is difficult or impossible to replicate studies in psychology exactly. Bartos and Schimmack (2020) found that the EDR is a better predictor of actual replication outcomes, which suggests that as few as 12% of results would replicate.

In conclusion, these results confirm suspicions that moderated regression results are magical. Readers should be cautious about these results or ignore them entirely unless a study has a large sample size and the statistical evidence is strong (p < .001). Magic is fun, but it has no place in scientific journals. In the future, researchers should clearly state that their analyses are exploratory, report outcomes independently of the results, or pre-register their data-analysis plan and follow it exactly.

References

Murphy, K. R., & Russell, C. J. (2016). Mend it or end it: Redirecting the search for interactions in the organizational sciences. Organizational Research Methods. 1094428115625322.

O’Boyle, E., Banks, G.C., Carter, K., Walter, S., & Yuan, Z. (2019). A 20-year review of outcome reporting bias in moderated multiple regression. Journal of Business and Psychology, 34, 19–37. https://doi.org/10.1007/s10869-018-9539-8

Estimating the Replicability of Results in the “Journal of Experimental Social Psychology”

Picture Credit: Wolfgang Viechtbauer

Abstract

Social psychology, or to be more precise, experimental social psychology, has a replication problem. Although articles mostly report successful attempts to reject the null-hypothesis, these results are obtained with questionable research practices that select for significance. This renders reports of statistically significant results meaningless (Sterling, 1959). Since 2011, some social psychologists have been actively trying to improve the credibility of published results. A z-curve analysis of results in JESP shows that these reforms have had a mild positive effect, but that studies are still underpowered and that non-significant results are still suspiciously absent from published articles. Even pre-registration has been unable to ensure that results are reported honestly. The problem is that there are no clear norms that outlaw practices that undermine the credibility of the field. As a result, some bad actors continue to engage in questionable practices that advance their careers at the expense of their colleagues and the reputation of the field. They may not be as culpable as Stapel, who simply made up data, but their use of questionable practices also hurts the reputation of experimental social psychology. Given the strong incentives to cheat, it is wildly optimistic to assume that self-control and nudges are enough to curb bad practices. Strict rules and punishment are unpopular among liberal-leaning social psychologists (Fiske, 2016), but they may be the most effective way to curb these practices. Clear guidelines about research ethics would not affect the practices of most researchers, who are honest and motivated by truth, but they would make it possible to take action against those who abuse the system for personal gain.

Introduction

There is a replication crisis in social psychology (see Schimmack, 2020, for a review). Based on actual replication studies, it is estimated that only 25% of significant results in social psychology journals can be replicated (Open Science Collaboration, 2015). The response to the replication crisis by social psychologists has been mixed (Schimmack, 2020).

The “Journal of Experimental Social Psychology” provides an opportunity to examine the effectiveness of new initiatives to improve the credibility of social psychology because the current editor, Roger Giner-Sorolla, has introduced several new policies to improve the quality of the journal.

Giner-Sorolla (2016) correctly points out that selective reporting of statistically significant results is the key problem of the replication crisis. Given modest power, it is unlikely that multiple hypothesis tests within an article are all significant (Schimmack, 2012). Thus, the requirement to report only supporting evidence leads to dishonest reporting of results.

“A group of five true statements and one lie is more dishonest than a group of six true ones; but a group of five significant results and one nonsignificant is more to be expected than a group of six significant results, when sampling at 80% statistical power.” (Giner-Sorolla, 2016, p. 2)
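A one-line calculation makes this point concrete: even with 80% power for every test, a string of uniformly significant results quickly becomes improbable (a minimal sketch, assuming independent tests).

```python
# Probability that all k independent tests are significant when each has 80% power.
power = 0.80
for k in (3, 5, 6):
    print(k, round(power ** k, 2))  # 3 -> 0.51, 5 -> 0.33, 6 -> 0.26
```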

There are three solutions to this problem. First, researchers could reduce the number of hypothesis tests that are conducted. For example, a typical article in JESP reports three studies, which implies a minimum of three hypothesis tests, although often more than one hypothesis is tested within a study. The number of tests could be reduced to a third if researchers conducted one high-powered study rather than three moderately powered studies (Schimmack, 2012). However, the editorial did not encourage publication of single-study articles, and there is no evidence that the number of studies in JESP articles has decreased.

Another possibility is to increase power to ensure that nearly all tests can produce significant results. To examine whether researchers increased power accordingly, it is necessary to examine the actual power of hypothesis tests reported in JESP. In this blog post, I am using z-curve to estimate power.

Finally, researchers may report more non-significant results. If studies are powered at 80%, and most hypotheses are true, one would expect that about 20% (1 out of 5) hypothesis tests produce a non-significant result. A simple count of significant results in JESP can answer this question. Sterling (1959) found that social psychology journals nearly exclusively report confirmation of predictions with p < .05. Motyl et al. (2017) replicated this finding for results from 2003 to 2014. The interesting question is whether new editorial policies have reduced this rate since 2016.

JESP has also adopted open-science badges that reward researchers for sharing materials, sharing data, or pre-registering hypotheses. Of these badges, pre-registration is the most interesting because it aims to curb the use of questionable research practices (QRPs; John et al., 2012) that are used to produce significant results with low power. Currently, there are relatively few articles in which all studies are preregistered. However, JESP is interesting because editors sometimes request a final preregistered study following studies that were not preregistered. Thus, JESP has published 58 articles with at least one preregistered study. This makes it possible to examine the effectiveness of preregistration in ensuring more honest reporting of results.

Automated Extraction of Test Statistics

The first analyses are based on automatically extracted test statistics. The main drawback of automatic extraction is that it does not distinguish between manipulation checks and focal hypothesis tests. Thus, the absolute estimates do not reveal how replicable focal hypothesis tests are. The advantage of automatic extraction is that it uses all test statistics that are reported in the text (t-values, F-values), which makes it possible to examine trends over time. If the power of studies increases, test statistics for focal and non-focal hypotheses will increase.
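As an illustration of what automated extraction involves, here is a minimal sketch of regular expressions for simple APA-style test reports; the actual extraction code used for these analyses is more elaborate and may differ.

```python
# Minimal sketch of automated extraction of t- and F-statistics from article text.
# The patterns cover simple APA-style reports; the actual extraction code is more elaborate.
import re

T_PATTERN = re.compile(r"t\s*\(\s*(\d+(?:\.\d+)?)\s*\)\s*=\s*(-?\d+(?:\.\d+)?)")
F_PATTERN = re.compile(r"F\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*=\s*(\d+(?:\.\d+)?)")

text = "The effect was significant, t(34) = 2.45, p = .02, and F(1, 120) = 5.67, p = .02."
t_tests = [(float(df), float(value)) for df, value in T_PATTERN.findall(text)]
f_tests = [(int(df1), int(df2), float(value)) for df1, df2, value in F_PATTERN.findall(text)]
print(t_tests)  # [(34.0, 2.45)]
print(f_tests)  # [(1, 120, 5.67)]
```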

To examine time-trends in JESP, I downloaded articles from ZZZZ to 2019, extracted test-statistics, converted them into absolute z-scores, and analyzed the results with z-curve (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020). To illustrate z-curve, I present the z-curve for all 45,792 test-statistics.

Visual inspection of the z-curve plot shows clear evidence that questionable research practices contributed to significant results in JESP. The distribution of significant z-scores peaks at z = 1.96, which corresponds to p = .05 (two-sided). At this point, there is a steep drop in reported results. Based on the distribution of significant results, z-curve also estimates the expected distribution of non-significant results (grey curve). There is a clear discrepancy between the observed frequencies of non-significant results and the expected frequencies. This discrepancy is quantified by comparing the discovery rates, that is, the percentages of significant results. The observed discovery rate is 70%. The expected discovery rate is only 35%, and the 95%CI ranges from 21% to 44%. Thus, the observed discovery rate is much higher than we would expect if there were no selection for significance in the reporting of results.

Z-curve also provides an estimate of the expected replication rate (ERR). This is the percentage of significant results that would be significant again if the studies could be replicated exactly with the same sample size. The ERR is 63% with a 95%CI ranging from 60% to 68%. Although this is lower than the recommended level of 80% power, it does not seem to justify the claim of a replication crisis. However, there are two caveats. First, the estimate includes manipulation checks. Although we cannot take replication of manipulation checks for granted, they are not the main concern. The main concern is that theoretically important, novel results do not replicate. The replicability of these results will be lower than 63%. Another problem is that the ERR is based on the assumption that studies in social psychology can be replicated exactly. This is not possible, nor is it informative. It is also important that results generalize across similar conditions and populations. To estimate the outcome of actual replication studies that are only similar to the original studies, the EDR is a better estimate (Bartos & Schimmack, 2020), and the estimate of 35% is more in line with the result that only 25% of results in social psychology journals can be replicated (Open Science Collaboration, 2015).

Questionable research practices are more likely to produce just-significant results with p-values between .05 and .005 than p-values below .005. Thus, one solution to the problem of low credibility is to focus on p-values below .005 (z = 2.8). Figure 2 shows the results when z-curve is limited to these test statistics.
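Before turning to Figure 2, the z-value corresponding to this stricter criterion can be checked directly (a minimal sketch using the same two-sided convention as above).

```python
# z-score corresponding to a two-sided p-value of .005.
from scipy.stats import norm
print(round(norm.isf(0.005 / 2), 2))  # ~2.81
```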

The influence of QRPs now shows up as a pile of just-significant results that are not consistent with the z-curve model. For the more trustworthy results, the ERR increased to 85%, but more importantly, the EDR increased to 75%. Thus, readers of JESP should treat p-values above .005 as questionable results, while p-values below .005 are more likely to replicate. It is of course unclear how many of these trustworthy results are manipulation checks or interesting findings.

Figures 1 and 2 help to illustrate the ERR and EDR. The next figure shows time trends in the ERR (solid) and EDR (dotted) using results that are significant at .05 (black) and those that are significant at .005.

Visual inspection suggests no changes, or even a decrease, in the years leading up to the beginning of the replication crisis in 2011. The ERR and the EDR for p-values below .005 show an increasing trend in the years since 2011. This is confirmed by linear regression analyses for the years 2012 to 2019, t(6)s > 4.62. However, the EDR for all significant results does not show a clear trend, suggesting that QRPs are still being used, t(6) = 0.84.

Figure 3 shows the z-curve plot for the years 2017 to 2019 to get a stable estimate for current results.

The main difference from Figure 1 is that there are more highly significant results, which is reflected in the higher ERR of 86%, 95%CI = 82% to 91%. However, the EDR of 36%, 95%CI = 24% to 57%, is still low and significantly lower than the observed discovery rate of 66%. Thus, there is still evidence that QRPs are being used. However, EDR estimates are highly sensitive to the percentage of just-significant results. Even excluding only results between z = 2 and z = 2.2 leads to a very different picture.

Most importantly, the EDR jumps from 36% to 73%, which is even higher than the ODR. Thus, one interpretation of the results is that a few bad actors continue to use QRPs that produce p-values between .05 and .025, while most other results are reported honestly.

In sum, the results based on automated extraction of test statistics show a clear improvement in recent years, especially for p-values below .005. This is consistent with observations that sample sizes have increased in social psychology (reference). The main drawback of these analyses is that estimates based on automated extraction do not reveal the robustness of focal hypothesis tests. This requires hand-coding of test statistics. These results are examined next.

Motyl et al.’s Hand-Coding of JESP (2003, 2004, 2013, 2014)

The results for hand-coded focal tests justify the claim of a replication crisis in experimental social psychology. Even if experiments could be replicated exactly, the expected replication rate is only 39%, 95%CI = 26% to 49%. Given that they cannot be replicated exactly, the EDR suggests that as few as 12% of replications would be successful, and the 95%CI, 5% to 33%, includes 5%, which would mean that all significant results could be false positives. The comparison with the observed discovery rate of 86% shows the massive use of QRPs to produce mostly significant results with low power. The time-trend analysis suggests that these numbers are representative of results in experimental social psychology until very recently (see also Cohen, 1962).

Focusing only on p-values below .005 may be a solution, but the figure shows that few focal tests reach this criterion. Thus, for the most part, articles in JESP do not provide empirical evidence for social psychological theories of human behavior. Only trustworthy replication studies can provide this information.

Hand-Coding of Articles in 2017

To examine improvement, I personally hand-coded articles published in 2017.

The ERR increased from 39% to 55%, 95%CI = 46% to 66%, and the confidence intervals barely overlap. However, the ERR did not show a positive trend in the automated analysis, and even a value of 55% is still low. The EDR also improved from 12% to 35%, but the confidence intervals are much wider, which makes it hard to conclude from these results that this is a real trend. More importantly, an EDR of 35% is still not good. Finally, the results continue to show the influence of questionable research practices. The comparison of the ODR and EDR shows that many non-significant results that are obtained are not reported. Thus, despite some signs of improvement, these results do not show the radical shift in research practices that is needed to make social psychology more trustworthy.

Pre-Registration

A lot of reformers pin their hopes on pre-registration as a way to curb the use of questionable research practices. An analysis of registered reports suggests that this can be the case. Registered reports are studies that are accepted for publication before data are collected. Researchers then collect the data and report the results. This publishing model makes it unnecessary to use QRPs to produce significant results in order to get a publication. Preregistration in JESP is different. Here, authors voluntarily post a data-analysis plan before they collect data and then follow the preregistered plan in their analysis. To the extent that they do follow their plan exactly, the results are also not selected to be significant. However, there are still ways in which selection for significance may occur. For example, researchers may choose not to publish a preregistered study that produced a non-significant result, or editors may not accept these studies for publication. It is therefore necessary to test the effectiveness of pre-registration empirically. For this purpose, I coded 210 studies in 58 articles that included at least one pre-registered study. There were 3 studies in 2016, 15 in 2017, 92 in 2018, and 100 in 2019. Five studies were not coded because they did not test a focal hypothesis or used sequential testing.

On a positive note, the ERR and EDR are higher than the comparison data for all articles in 2017. However, it is not clear how much of this difference is due to a general improvement over time or to preregistration. Less encouraging is the finding that the observed discovery rate is still high (86%), and this does not even count marginally significant results that are also used to claim a discovery. This high discovery rate is not justified by an increase in power. The EDR suggests that only 62% of results should be significant, and its 95%CI, 23% to 77%, does not include 86%. Thus, there is still evidence that QRPs are being used, even in articles that receive a pre-registration badge.

One possible explanation is that articles can receive a pre-registration badge if at least one of the studies was pre-registered. Often this is the last study, requested by the editor to ensure that the non-preregistered results are credible. I therefore also ran a z-curve analysis only on the studies that were pre-registered. There were 134 pre-registered studies.

The results are very similar to the previous results with ERR of 72% vs. 70% and EDR of 66% vs. 64%. Thus, there is no evidence that pre-registered studies are qualitatively better and stronger. Moreover, there is also no evidence that pre-registration leads to more honest reporting of non-significant results. The observed discovery rate is 84% and rises to 90% when marginally significant results are included.

Conclusion

Social psychology, or to be more precise, experimental social psychology, has a replication problem. Although articles mostly report successful attempts to reject the null-hypothesis, these results are obtained with questionable research practices that select for significance. This renders reports of statistically significant results meaningless (Sterling, 1959). Since 2011, some social psychologists have been actively trying to improve the credibility of published results. A z-curve analysis of results in JESP shows that these reforms have had a mild positive effect, but studies are still underpowered and non-significant results are still suspiciously absent from published articles. Even pre-registration has been unable to ensure that results are reported honestly. The problem is that there are no clear norms that outlaw practices that undermine the credibility of a field. As a result, some bad actors continue to engage in questionable practices that advance their careers at the expense of their colleagues and the reputation of the field. They may not be as culpable as Stapel, who simply made up data, but their use of questionable practices also hurts the reputation of experimental social psychology. Given the strong incentives to cheat, it is wildly optimistic to assume that self-control and nudges are enough to curb bad practices. Strict rules and punishment are unpopular among liberal-leaning social psychologists (Fiske, 2016). The problem is that QRPs hurt social psychology, even if it is just a few bad actors who engage in these practices. Implementing clear standards with consequences would not affect the practices of most researchers, who are honest and motivated by truth, but it would make it possible to take action against those who abuse the system for personal gain.

A Recipe to Improve Psychological Science

Raw! First Draft! Manuscript in Preparation for Meta-Psychology
Open Comments are welcome.

The f/utility of psychological research has been debated since psychology became an established discipline after the Second World War (Cohen, 1962, 1994; Lykken, 1968; Sterling, 1959; lots of Meehl). There have also been many proposals to improve psychological science. However, most articles published today follow the same old recipe that was established decades ago; a procedure that Gigerenzer (2018) called the significance-testing ritual.

Step 1 is to assign participants to experimental conditions.

Step 2 is to expose groups to different stimuli or interventions.

Step 3 is to examine whether the differences between means of the groups are statistically significant.

Step 4a: If Step 3 produces a p-value below .05, write up the results and submit to a journal.

Step 4b: If Step 3 produces a p-value above .05, forget about the study, and go back to Step 1.

This recipe produces a literature in which the empirical content of journal articles consists only of significant results that suggest the manipulation had an effect. As Sterling (1959) pointed out, this selective publishing of significant results essentially renders significance testing meaningless. The problem with this recipe became apparent when Bem (2011) published nine successful demonstrations of a phenomenon that does not exist: mental time travel in which feelings about random future events appeared to cause behavior. If only successes are reported, significant results only show how motivated researchers are to collect data that support their beliefs.
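To make this consequence concrete, here is a minimal simulation sketch (my own illustration, not code from any of the cited papers): it runs many small two-group experiments with no true effect and "publishes" only those with p < .05. The published record then consists entirely of significant results with inflated effect sizes.

```python
# A minimal simulation of the significance-testing ritual (my own sketch, not
# from the cited papers): n = 20 per cell, true effect of zero, and only
# p < .05 results get "published" (Step 4a); the rest are discarded (Step 4b).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_per_cell, n_studies, published = 20, 10_000, []

for _ in range(n_studies):
    treatment = rng.normal(loc=0.0, scale=1.0, size=n_per_cell)  # true d = 0
    control = rng.normal(loc=0.0, scale=1.0, size=n_per_cell)
    _, p = ttest_ind(treatment, control)
    if p < .05:  # Step 4a: write it up and submit
        pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
        published.append((treatment.mean() - control.mean()) / pooled_sd)
    # Step 4b: otherwise forget the study and run a new one

print(f"share of studies that get published: {len(published) / n_studies:.2%}")
print(f"mean |d| in the published record:    {np.mean(np.abs(published)):.2f}")
```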

I argue that the key problem in psychology is the specification of the null-hypothesis. The most common approach is to specify the null-hypothesis as the absence of an effect. Cohen called this the nil-hypothesis: the effect size is zero. Even after an original study rejects the nil-hypothesis, follow-up studies (direct or conceptual replication studies) again specify the nil-hypothesis as the hypothesis that has to be rejected, although the original study already rejected it. I propose to abandon nil-hypothesis testing and to replace it with null-hypothesis testing in which the null-hypothesis specifies an effect size. Contrary to the common practice of starting by rejecting the nil-hypothesis, I argue that original studies should start by testing large effect sizes. Subsequent studies should use information from the earlier studies to modify the null-hypothesis. This recipe can be considered a stepwise process of parameter estimation. The advantage of a step-wise approach is that parameter estimation requires large samples that are often impossible to obtain during the early stages of a research program. Moreover, parameter estimation may be wasteful when the ultimate conclusion is that an effect size is too small to be meaningful. I illustrate the approach with a simple between-subject design that compares two groups. For mean differences, the most common effect size is the standardized mean difference (the mean difference when the dependent variable is standardized), and Cohen suggested d = .2, .5, and .8 as values for small, medium, and large effect sizes, respectively.

The first test of a novel hypothesis (e.g., taking Daniel Lakens’ course on statistics improves understanding of statistics) starts with the assumption that the effect size is large (H0: |d| = .8).

The next step is to specify what value should be considered a meaningful deviation from this effect size. A reasonable value would be d = .5, which is only a moderate effect size. Another reasonable approach is to halve the starting effect size, d = .8/2 = .4. I use d = .4.

The third step is to conduct a power analysis for a mean difference of d = .4. This power analysis is not identical to a typical power analysis with H0: d = 0 and an effect size of d = .4 because the t-distribution is no longer symmetrical when it is centered over values other than zero (this may be a statistical reason for the choice of the nil-hypothesis). However, conceptually the power analysis does not differ. We postulate a null-hypothesis of d = .8 and are willing to reject it when the population effect size is a meaningfully smaller effect size of d = .4 or less. With some trial and error, we find a sample size of N = 68 (n = 34 per cell). With this sample size, observed d-values below .4 occur only 5% of the time when the population effect size is d = .8. Thus, we can reject the null-hypothesis of d = .8 if the study produces an effect size below .4.
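As a rough check on this step, the sketch below (mine, not the author's code) uses the common approximation that the sampling error of d in a two-group design is about 2/√N and searches for the sample size at which observed d-values below .4 occur only 5% of the time when the population effect size is d = .8; a noncentral-t calculation gives a very similar number.

```python
# Trial-and-error search for the Step-3 sample size, assuming SE(d) ~ 2/sqrt(N).
import numpy as np
from scipy.stats import norm

def prob_d_below(d_crit, d_true, total_n):
    """Approximate P(observed d < d_crit) when the population effect is d_true."""
    se = 2 / np.sqrt(total_n)
    return norm.cdf((d_crit - d_true) / se)

for N in range(10, 200, 2):          # even N, equal cell sizes
    if prob_d_below(0.4, 0.8, N) <= 0.05:
        print(f"N = {N} (n = {N // 2} per cell): "
              f"P(d_obs < .4 | d = .8) = {prob_d_below(0.4, 0.8, N):.3f}")
        break
# prints N = 68, the value used in the text
```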

The next step depends on the outcome of the first study. If the first study produced a result with an effect size estimate greater than .4, the null-hypothesis lives another day. Thus, the replication study is conducted with the same sample size as the original study (N = 68). The rationale is that we have good reason to believe that the effect size is large, and it would be wasteful to conduct replication studies with much larger samples (e.g., 2.5 times larger than the original study, N = 170). It is also not necessary to use much larger samples to demonstrate that the original finding was obtained with questionable research practices. An honest replication study has high power to reject the null-hypothesis of d = .8 if the true effect size is only d = .2 or even closer to zero. This makes it easier to reveal the use of questionable research practices with actual replication studies. The benefits are obtained because the original study makes a strong claim that the effect size is large rather than merely claiming that the effect size is positive or negative without specifying an effect size.
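Under the same 2/√N approximation, a short check (again my sketch) shows how often an honest replication with N = 68 would produce d < .4, and thereby reject H0: d = .8, if the true effect is small or absent.

```python
# Power of an honest N = 68 replication to reject H0: d = .8 (i.e., to observe
# d < .4) when the true effect is small or zero. Same 2/sqrt(N) approximation.
import numpy as np
from scipy.stats import norm

se = 2 / np.sqrt(68)
for true_d in (0.2, 0.0):
    print(f"true d = {true_d}: P(d_obs < .4) = {norm.cdf((0.4 - true_d) / se):.2f}")
# roughly .80 for d = .2 and .95 for d = 0
```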

If the original study produces a significant result with an effect size less than d = .4, the null-hypothesis is rejected. The new null-hypothesis is the point estimate of the study. Given a significant result, we know that this value is somewhere between 0 and .4. Let’s assume it is d = .25. This estimate comes with a two-sided 95% confidence interval ranging from d = -.23 to d = .74. The wide confidence interval shows that we can reject d = .8, but not a medium effect size of d = .5 or even a small effect in the opposite direction, d = -.2. Thus, we need to increase the sample size in the next study to provide a meaningful test of the new null-hypothesis that the effect size is positive, but small (d = .25). We want to ensure that the effect size is indeed positive, d > 0, but weaker than a medium effect size, d = .5. Thus, we need to power the study to be able to reject the null-hypothesis (H0: d = .25) in both directions. This is achieved with a sample size of N = 256 (n = 128 per cell) and a sampling error of .125. The 95% confidence interval centered over d = .25 ranges from 0 to .5. Thus, any observed d-value greater than .25 rejects the hypothesis that there is no effect, and any value below .25 rejects the hypothesis of a medium effect size, d = .5.
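The interval arithmetic in this step can be reproduced with the same approximation (a sketch, not the exact computation behind the numbers in the text):

```python
# 95% CIs around d = .25 with the original N = 68 and with the N chosen so
# that the sampling error is .125 (assumes SE(d) ~ 2/sqrt(N)).
import numpy as np

def ci95(d, total_n):
    se = 2 / np.sqrt(total_n)
    return d - 1.96 * se, d + 1.96 * se

print(ci95(0.25, 68))              # about (-0.23, 0.73): d = .8 is excluded, d = .5 and d = -.2 are not
target_se = 0.125                  # the text's choice for the follow-up study
print(int((2 / target_se) ** 2))   # N = 256 gives exactly this sampling error
print(ci95(0.25, 256))             # about (0.005, 0.495), i.e., roughly 0 to .5
```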

The next step depends again on the outcome of the study. If the observed effect size is d = .35 with a 95% confidence interval ranging from d = .11 to d = .60, the new question is whether the effect size is at least small, d = .2, or whether it is even moderate. We could meta-analyze the results of both studies, but as the second study is larger, it will have a stronger influence on the weighted average. In this case, the weighted average of d = .33 is very close to the estimate of the larger second study. Thus, I am using the estimate of Study 2 for the planning of the next study. With the null-hypothesis of d = .35, a sample size of N = 484 (n = 242 per cell) is required to have 95% power to obtain a significant result if the population effect size is d = .2 or less; the 90% confidence interval around d = .35 then ranges from d = .20 to d = .50. Thus, if an effect size less than d = .2 is observed, it is possible to reject the hypothesis that there is at least a statistically small effect size of d = .2. In this case, researchers have to decide whether they want to invest in a much larger study to see whether there is a positive effect at all or whether they would rather abandon this line of research because the effect size is too small to be theoretically or practically meaningful. The estimation of the effect size makes it at least clear that any further studies with small samples are meaningless because they have insufficient power to demonstrate that a small effect exists. This can be a meaningful result in itself because researchers currently waste resources on studies that test small effects with small samples.
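Assuming Study 1 produced the d = .25 from the previous step, the weighted average and the N = 484 interval can be checked as follows (sample-size weights are equivalent to inverse-variance weights under the 2/√N approximation):

```python
# Meta-analytic weighted average of the two studies and the 90% CI that a
# sample of N = 484 puts around d = .35 (2/sqrt(N) approximation).
import numpy as np

d1, n1 = 0.25, 68      # Study 1 estimate assumed in the earlier step
d2, n2 = 0.35, 256     # Study 2
print(round((n1 * d1 + n2 * d2) / (n1 + n2), 2))   # 0.33, as stated above

se = 2 / np.sqrt(484)
print(0.35 - 1.645 * se, 0.35 + 1.645 * se)        # about .20 to .50
```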

If the effect size in Study 2 is less than d = .25, researchers know (with a 5% error probability) that the effect size is less than d = .5. However, it is not clear whether there is a positive effect or not. Say the observed effect size was d = .10 with a 95%CI ranging from d = -.08 to d = .28. This leaves open the possibility of no effect, but also of a statistically small effect of d = .2. Researchers may find it worthwhile to pursue this research in the hope that the effect size is at least greater than d = .10, assuming a population effect size of d = .2. Adequate power is achieved with a sample size of N = 1,100 (n = 550 per cell). In this case, the 90% confidence interval around d = .2 ranges from d = .10 to d = .30. Thus, any value less than d = .10 rejects the hypothesis that the effect size is statistically small, d = .2, while any value greater than d = .30 would confirm that the effect size is at least a small effect size of d = .2.
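A quick check of this final sample size, under the same approximation:

```python
# With N = 1,100, the 90% CI around a small effect of d = .2 spans about
# .10 to .30, which is the decision interval described above.
import numpy as np

se = 2 / np.sqrt(1100)
print(0.2 - 1.645 * se, 0.2 + 1.645 * se)   # about 0.10 to 0.30
```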

This new way of thinking about null-hypothesis testing requires some mental effort (it is still difficult for me). To illustrate it further, I used open data from the Many Labs project (Klein et al., 2014). I start with a paradigm that produces a strong and well-replicated effect.

Anchoring

The first sample in the ML dataset is from Penn State U – Abington (‘abington’) with N = 84. Thus, the sample has good power to test the first hypothesis that d > .4, assuming an effect size of d = .8. The statistical test of the first anchoring effect (distance from New York to LA with 1,500 mile vs. 6,000 mile anchor) produced a standardized effect size of d = .98 with a 95%CI ranging from d = .52 to 1.44. The confidence interval includes a value of d = .8. Therefore the null-hypothesis cannot be rejected. Contrary to nil-hypothesis testing, however, this finding is highly informative and significant. It does suggest that anchoring is a strong effect.
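For readers who want to reproduce intervals like this from summary statistics, a common approximation for the sampling variance of d can be used (a sketch that assumes equal cell sizes; the original analysis used the raw Many Labs data, so the numbers will not match exactly):

```python
# Approximate 95% CI for a standardized mean difference from summary data,
# using Var(d) ~ (n1 + n2)/(n1 * n2) + d^2 / (2 * (n1 + n2)).
import numpy as np

def ci95_from_d(d, n1, n2):
    se = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d - 1.96 * se, d + 1.96 * se

print(ci95_from_d(0.98, 42, 42))   # about (0.53, 1.43), close to the reported .52 to 1.44
```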

As Study 1 was consistent with the null-hypothesis of a strong effect, Study 2 replicates the effect with the same sample size. To make this a conceptual replication study, I used the second anchoring question (anchoring2, population of Chicago with 200,000 vs. 6 million as anchors). The sample from Charles University, Prague, Czech Republic provided the same sample size of N = 84. The study replicated the finding of Study 1 in that the 95%CI again includes d = .8, 95%CI = .72 to 1.41.

To further examine the robustness of the effect, Study 3 used a different anchoring problem (height of Mt. Everest with 2,000 vs. 45,500 feet as anchors). To keep sample sizes similar, I used the UVA sample (N = 81). This time, the null-hypothesis was rejected with an even larger effect size, d = 1.47, 95%CI = 1.19 to 1.76.

Although additional replication studies can further examine the generalizability of the main effect, the three studies alone are sufficient to provide robust evidence for anchoring effects, even with a modest total sample size of N = 249 participants. Researchers could therefore examine replicability and generalizability in the context of new research questions that explore boundary conditions, mediators, or moderators. More replication studies or replication studies with larger samples would be unnecessary.

Flag Priming

To maintain good comparability, I start again with the Penn State U – Abington sample (N = 84). The effect size estimate for the flag prime is close to zero, d = .05. More important, the 95% confidence interval does not include d = .8, 95%CI = -.28 to .39. Thus, the null-hypothesis that flag priming is a strong effect is rejected. The results are so disappointing that even a moderate effect size is not included in the confidence interval. Thus, the only question is whether there could be a small effect. If this is theoretically interesting, the study would have to be sufficiently powered to distinguish a small effect size from zero. Thus, the study could be powered to examine whether the effect size is at least d = .1, assuming an effect size of d = .2. The previous power analysis suggested that a sample of N = 1,100 participants is needed to test this hypothesis. I used the MTurk sample (N = 1,000) and the OSU sample (N = 107) to get this sample size.

The results showed a positive effect size of d = .12. Using traditional NHST, this finding rejects the nil-hypothesis, but allows for extremely small effect sizes close to zero, 95%CI = .0003 to .25. More important, the results do not reject the actual null-hypothesis that there is a small effect size d = .2, but also do not ensure that the effect size is greater than d = .10. Thus, the results remain inconclusive.

To make use of the large sample of Study 2, it is not necessary to increase the sample size again. Rather, a third study can be conducted with the same sample size, and the results of the two studies can be combined to test the null-hypothesis that d is at least d = .10. I used the Project Implicit sample, although it is a bit bigger (N = 1329).

Study 3 alone produced an effect size of d = .03, 95%CI = -.09 to .14. An analysis that combines data from all three samples produces an estimate of d = .02, 95%CI = -.06 to .10. These results clearly reject the null-hypothesis that d = .2, and they even suggest that d = .10 is unlikely. At this point, it seems reasonable to stop further study of this phenomenon, at least with the same paradigm. Although this program required over 2,000 participants, the results are conclusive and publishable with the conclusion that flag priming has negligible effects on ratings of political values. The ability to provide meaningful results arises from the specification of the null-hypothesis with an effect size, rather than the nil-hypothesis, which can only test the direction of effects without making claims about effect sizes.

The comparison of the two examples shows why it is important to think about effect sizes, even when these effect sizes do not generalize to the real world. Effect sizes are needed to calibrate sample sizes so that resources are not wasted on overpowered studies (studying anchoring with N = 1,000) or underpowered studies (studying flag priming with N = 100). Using a simple recipe that starts with the assumption that effect sizes are large, it is possible to use few resources at first and then increase sample sizes as needed if effect sizes turn out to be small.

Low vs. High Category Scales

To illustrate the recipe with a small-to-medium effect size, I picked Schwartz et al.’s (1985) manipulation of high versus low frequencies as labels for a response category. I started again with the Penn State U – Abington sample (N = 84). The effect size was d = .33, but the 95% confidence interval ranged from d = -.17 to d = .84. Although the interval does not exclude d = .8, it seems unlikely that the effect size is large, but it is not unreasonable to assume that the effect size could be moderate rather than small. Thus, the next study used d = .5 as the null-hypothesis and examined whether the effect size is at least d = .2. A power analysis shows that N = 120 (n = 60 per cell) participants are needed. I picked the sample from Brasilia (N = 120) for this purpose. The results showed a strong effect size, d = .88. The 95% confidence interval, d = .51 to d = 1.23, even excluded a medium effect size, but given the results of Study 1, it is reasonable to conclude that the effect size is not small and could be medium or even large. A sample size of N = 120 seems reasonable for replication studies that examine the generalizability of results across populations (or for conceptual replication studies, but they were not available in this dataset).
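The power analysis for this step (null-hypothesis of d = .5, sensitivity to d = .2) can be checked with the same 2/√N approximation used above (a sketch; the exact number depends on the method used):

```python
# With N = 120 and a population effect of d = .5, observed d-values below .2
# occur only about 5% of the time, so d_obs < .2 can reject H0: d = .5.
import numpy as np
from scipy.stats import norm

se = 2 / np.sqrt(120)
print(norm.cdf((0.2 - 0.5) / se))   # about .05
```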

To further examine generalizability, I picked the sample from Istanbul (N = 113). Surprisingly, the 95% confidence interval, d = -.31 to d = .14, did not include d = .5. The confidence interval also does not overlap with the confidence interval in Study 2. Thus, there is some uncertainty about the effect and the conditions under which it can be produced. However, a meta-analysis across all three studies shows a 95%CI that includes a medium effect size, 95%CI = .21 to .65.

Thus, it seems reasonable to examine replicability in other samples with the same sample size. The next sample with a similar sample size is Laurier (N = 112). The results show an effect size of d = .43, and the 95%CI includes d = .5, 95%CI = .17 to .69. The meta-analytic confidence interval, 95%CI = .27 to .61, excludes small effect sizes of d = .2 and large effect sizes of d = .8.

Thus, a research program with four samples and a total sample size of N = 429 participants helped to establish a medium effect size for the effect of low versus high scale labels on ratings. The effect size estimate based on the full ML dataset is d = .48.

At this point, it may seem as if I cherry-picked samples to make the recipe look good. I didn’t, but I don’t have a preregistered analysis plan to show that I did not. I suggest others try it out with other open data where we have a credible estimate of the real effect based on a large sample and then try to approach this effect size using the recipe I proposed here.

The main original contribution of this blog post is to move away from nil-hypothesis significance testing. I am not aware of any other suggestions that are similar to the proposed recipe, but the ideas are firmly based on Neyman-Pearson’s approach to significance testing and Cohen’s recommendation to think about effect sizes in the planning of studies. The use of confidence intervals makes the proposal similar to Cumming’s suggestion to focus more on estimation than on hypothesis testing. However, I am not aware of a recipe for the systematic planning of sample sizes that vary as a function of effect sizes. Too often confidence intervals are presented as if the main goal is to provide precise effect size estimates, although the meaning of these precise effect sizes in psychological research is unclear. What a medium effect size for category labels means in practice is not clear, but knowing that it is medium allows researchers to plan studies with adequate power. Finally, the proposal is akin to sequential testing, where researchers look at their data to avoid collecting too many data. However, sequential testing still suffers from the problem that it tests the nil-hypothesis and that a non-significant result is inconclusive. In contrast, this recipe provides valuable information even if the first study produces a non-significant result. If the first study fails to produce a significant result, it suggests that the effect size is large. This is valuable and publishable information. Significant results are also meaningful because they suggest that the effect size is not large. Thus, results are informative whether they are significant or non-significant, removing the asymmetry of nil-hypothesis testing, where non-significant results are uninformative. The only studies that are not informative are studies where confidence intervals are too wide to be meaningful or replication studies that are underpowered. The recipe helps researchers to avoid these mistakes.

The proposal also addresses the main reason why researchers do not use power analysis to plan sample sizes: the mistaken belief that it is necessary to guess the population effect size. Here I showed that this is not necessary. Rather, researchers can start with the most optimistic assumption and test the hypothesis that their effect is large. More often than not, the result will be disappointing, but not useless. The results of the first study provide valuable information for the planning of future studies.

I would be foolish to believe that my proposal can actually change research practices in psychology. Yet, I cannot help thinking that it is a novel proposal that may appeal to some researchers who struggle with planning the sample sizes for their studies. The present proposal allows them to shoot for the moon and fail, as long as they document this failure and then replicate with a larger sample. It may not solve all problems, but it is better than p-rep or Bayes-Factors and several other proposals that have failed to fix psychological science.

Fiske and the Permanent Crisis in Social Psychology

“Remedies include tracking one’s own questionable research practices” (Susan T. Fiske)

In 1959, Sterling observed that the results sections of psychological articles provide no information. The reason is that studies nearly always reject the null-hypothesis. As a result, it is not necessary to read the results section. It is sufficient to read the predictions that are made in the introduction because the outcome of the empirical test is a foregone conclusion.

In 1962, Cohen found that studies published in the Journal of Abnormal and Social Psychology (now separated into the Journal of Abnormal Psychology and the Journal of Personality and Social Psychology) had modest power to produce significant results. Nearly three decades later, Sedlmeier and Gigerenzer (1989) replicated this finding.

Thus, for nearly 60 years social psychologists have been publishing many more significant results than they actually obtain in their laboratories, making the empirical results in their articles essentially meaningless. Every claim in social psychology, including crazy findings that nobody believes (Bem, 2011), is significant.

Over the past decades, some social psychologists have rebelled against the status quo in social psychology. To show that significant results do not provide empirical evidence, they have conducted replication studies and reported the results even when they did not show a significant result. Suddenly, the success rate of nearly 100% dropped to 25%. Faced with this dismal result that reveals the extent of questionable practices in the field, some social psychologists have tried to downplay the significance of replication failures. The leader of this disinformation movement is Susan Fiske, who was invited to comment on a special issue on the replication crisis in the Journal of Experimental Social Psychology (Fiske, 2016). Her article “How to publish rigorous experiments in the 21st century” is an interesting example of deceptive publishing that avoids dealing with the real issue.

First, it is always important to examine the reference list for bias in the literature review. For example, Fiske does not mention Bem’s embarrassing article that started the crisis, John et al.’s article on the use of questionable research practices, or Francis and Schimmack’s work on bias detection, although these articles are mentioned in several of the articles she comments on. For example, Hales (2016) writes:

“In fact, in some cases failed replications have been foreshadowed by analyses showing that the evidence reported in support of a finding can be implausibly positive. For example, multiple analyses have questioned whether findings in support of precognition (Bem, 2011) are too good to be obtained without using questionable research practices (Francis, 2012; Schimmack, 2012). In line with these analyses, researchers who have replicated Bem’s procedures have not replicated his results (Galak, LeBoeuf, Nelson, & Simmons, 2012; Ritchie, Wiseman, & French, 2012; Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012).”

Fiske concludes that the replication crisis is an opportunity to improve research practices. She writes: “Constructive advice for 21st century publication standards includes appropriate theory, internal validity, and external validity.” Again, it is interesting what she is not saying. If theory, internal validity, and external validity are advice for social psychologists in the 21st century, it implies that 20th-century social psychologists did not have good theories and that their studies lacked internal and external validity. After all, we do not give advice when things are going well.

Fiske (2016) discusses the replication crisis under the heading of internal validity.

“Hales (2016) points out that, in the effort to report effects that are both significant and interesting, researchers may go beyond what the data allow. Over-claiming takes forms beyond the familiar Type I (false positive) and Type II (false negative) errors. A proposed Type III error describes reaching an accurate conclusion but by flawed methods (e.g., confirmation bias, hypothesizing after results are known, discarding data). A proposed Type IV error describes reaching an accurate conclusion based on faulty evidence (insufficient power, invalid measures). Remedies include tracking one’s own questionable research practices (e.g., ad hoc stopping, non-disclosure of failed replications, exploration reported as confirmation) or calculating the plausibility of one’s data (e.g., checking for experimenter bias during analysis). Pre-registration and transparency are encouraged.”

This is as close as Fiske comes to talking about the fundamental problem in social psychology, but Type-III errors are not just a hypothetical possibility; they are the norm in social psychology. Type-III errors explain how social psychologists can be successful most of the time even though their studies have a low probability of being successful.

Fiske’s recommendations for improvement are obscure. What does it mean for researchers to “track their own questionable practices?” Is there an acceptable quota for using these practices? What should researchers do when they find that they are using these questionable practices? How would researchers calculate the plausibility of their data, and why is pre-registration useful? Fiske does not elaborate on this because she is not really interested in improving practices. At least she makes it very clear what she does not want to happen: she opposes a clear code of research ethics that specifies which practices violate research integrity.

“Norms about acceptable research methods change by social influence, not by regulation. As social psychology tells us, people internalize change when they trust and respect the source. A punishing, feared source elicits at best compliance and at worst reactance, not to mention the source’s own reputational damage.”

This naive claim ignores that many human behaviors are regulated by social norms that are enforced with laws. Even scientists have social norms about fraud and Stapel was fired for fabricating data. Clearly, academic freedom has limits. If fabricating data is unethical, it is not clear why hiding disconfirming evidence should be a personal choice.

Fiske also expresses her dislike of blog posts and so-called vigilantes.

“For the most part, the proposals in this special issue are persuasive communications, not threats. And all are peer-reviewed, not mere blog posts. And they are mostly reasoned advisory proposals, not targeted bullying. As such, they appropriately treat other researchers as colleagues, not miscreants. This respectful discourse moves the field forward better than vigilantism.”

Maybe, as a social psychologist, she should be aware that disobedience and protest have always been part of social change, especially when powerful leaders opposed it. Arguments that results sections in social psychology are meaningless have been made by eminent researchers in peer-reviewed publications (e.g., Cohen, 1994; Schimmack, 2012) and on blog posts (e.g., the R-Index blog). The validity of an argument does not depend on the medium or on peer review, but on the internal and external validity of the evidence, and the evidence for sixty years has shown that social psychologists inflate their success rate.

There is also no evidence that social psychologists follow Fiske’s advice to track their own questionable research practices or that they avoid the use of these practices. This is not surprising. There is no real incentive to change behavior, and behavior does not change when the reinforcement schedule does not change. As long as p < .05 is rewarded and p > .05 is punished, psychologists will continue to publish meaningless p-values (Sterling, 1959). History has shown again and again that powerful elites do not change for the greater good. Real change will come from public pressure (e.g., from undergraduate students and funders) to demand honest reporting of results.

Expressing Uncertainty about Analysis Plans with Conservative Confidence Intervals

Unless researchers specify an analysis plan and follow it exactly, it is possible to analyze the same data in several ways. If all analyses lead to the same conclusion, this is not a problem. However, what should we do when the analyses lead to different conclusions? The problem generally arises when one analysis shows a p-value less than .05 and another plausible analysis shows a p-value greater than .05. The inconsistency introduces uncertainty about the proper conclusion. Traditionally, researchers selectively picked the more favorable analysis, which is a questionable research practice because it undermines the purpose of significance testing, namely to control the long-run error rate. However, what do we do if researchers honestly present both results, p = .02 and p = .08? As many statisticians have pointed out, the difference between these two results is itself not significant and negligible.

A simple solution to the problem is to switch from hypothesis testing with p-values to hypothesis testing with confidence intervals (Schimmack, 2020). With p = .08 and p = .02, the corresponding confidence intervals could be d = -.05 to .30 and d = .05 to .40, respectively. It is simple to express the uncertainty about the proper analysis by picking the lower value for the lower limit and the higher value for the upper limit to create a conservative confidence interval, d = -.05 to .40. This confidence interval captures both the uncertainty about the proper analysis and the uncertainty due to sampling error. Inferences can then be drawn based on this confidence interval. In this case, there is insufficient information to reject the null-hypothesis. Yet the data still provide evidence that the effect size is unlikely to be moderate. If this is theoretically meaningful or contradicts previous studies (e.g., studies that used QRPs to inflate effect sizes), the results are still important and publishable.

A further problem arises when there are many ways to analyze the data. A recent suggestion has been to do a multiverse analysis, that is, to run all possible analyses and see what you get. The problem is that this may create extremely divergent results, and it is not clear how results from a multiverse analysis should be integrated. Conservative confidence intervals provide an easy way to do so, but they may be extremely wide if the multiverse analysis is not limited to a small range of reasonable analyses. It is therefore crucial that researchers think carefully about reasonable alternative ways to analyze the data without trying all possible ways of doing so, which would make the results uninformative.
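A conservative confidence interval is easy to compute once the candidate analyses have been run. The sketch below uses the two intervals from the example above plus two hypothetical additional intervals (my own, for illustration only) to show the multiverse case.

```python
# Conservative confidence interval: lowest lower limit and highest upper limit
# across a set of reasonable alternative analyses.
def conservative_ci(intervals):
    lowers, uppers = zip(*intervals)
    return min(lowers), max(uppers)

two_analyses = [(-0.05, 0.30), (0.05, 0.40)]          # the example in the text
print(conservative_ci(two_analyses))                  # (-0.05, 0.4)

# hypothetical additional analyses from a constrained multiverse
multiverse = two_analyses + [(0.00, 0.35), (-0.10, 0.28)]
print(conservative_ci(multiverse))                    # (-0.1, 0.4)
```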