Since 2011, it is an open secret that many published results in psychology journals do not replicate. The replicability of published results is particularly low in social psychology (Open Science Collaboration, 2015).
A key reason for low replicability is that researchers are rewarded for publishing as many articles as possible without concerns about the replicability of the published findings. This incentive structure is maintained by journal editors, review panels of granting agencies, and hiring and promotion committees at universities.
To change the incentive structure, I developed the Replicability Index, a blog that critically examined the replicability, credibility, and integrity of psychological science. In 2016, I created the first replicability rankings of psychology departments (Schimmack, 2016). Based on scientific criticisms of these methods, I have improved the selection process of articles to be used in departmental reviews.
1. I am using Web of Science to obtain lists of published articles from individual authors (Schimmack, 2022). This method minimizes the chance that articles that do not belong to an author are included in a replicability analysis. It also allows me to classify researchers into areas based on the frequency of publications in specialized journals. Currently, I cannot evaluate neuroscience research. So, the rankings are limited to cognitive, social, developmental, clinical, and applied psychologists.
2. I am using department’s websites to identify researchers that belong to the psychology department. This eliminates articles that are from other departments.
3. I am only using tenured, active professors. This eliminates emeritus professors from the evaluation of departments. I am not including assistant professors because the published results might negatively impact their chances to get tenure. Another reason is that they often do not have enough publications at their current university to produce meaningful results.
Like all empirical research, the present results rely on a number of assumptions and have some limitations. The main limitations are that (a) only results that were found in an automatic search are included (b) only results published in 120 journals are included (see list of journals) (c) published significant results (p < .05) may not be a representative sample of all significant results (d) point estimates are imprecise and can vary based on sampling error alone.
These limitations do not invalidate the results. Large difference in replicability estimates are likely to predict real differences in success rates of actual replication studies (Schimmack, 2022).
New York University
I used the department website to find core members of the psychology department. I found 13 professors and 6 associate professors. Figure 1 shows the z-curve for all 12,365 tests statistics in articles published by these 19 faculty members. I use the Figure to explain how a z-curve analysis provides information about replicability and other useful meta-statistics.
1. All test-statistics are converted into absolute z-scores as a common metric of the strength of evidence (effect size over sampling error) against the null-hypothesis (typically H0 = no effect). A z-curve plot is a histogram of absolute z-scores in the range from 0 to 6. The 1,239 (~ 10%) of z-scores greater than 6 are not shown because z-scores of this magnitude are extremely unlikely to occur when the null-hypothesis is true (particle physics uses z > 5 for significance). Although they are not shown, they are included in the meta-statistics.
2. Visual inspection of the histogram shows a steep drop in frequencies at z = 1.96 (dashed blue/red line) that corresponds to the standard criterion for statistical significance, p = .05 (two-tailed). This shows that published results are selected for significance. The dashed red/white line shows significance for p < .10, which is often used for marginal significance. There is another drop around this level of significance.
3. To quantify the amount of selection bias, z-curve fits a statistical model to the distribution of statistically significant results (z > 1.96). The grey curve shows the predicted values for the observed significant results and the unobserved non-significant results. The full grey curve is not shown to present a clear picture of the observed distribution. The statistically significant results (including z > 6) make up 20% of the total area under the grey curve. This is called the expected discovery rate because the results provide an estimate of the percentage of significant results that researchers actually obtain in their statistical analyses. In comparison, the percentage of significant results (including z > 6) includes 70% of the published results. This percentage is called the observed discovery rate, which is the rate of significant results in published journal articles. The difference between a 70% ODR and a 20% EDR provides an estimate of the extent of selection for significance. The difference of 50 percentage points is large. The upper level of the 95% confidence interval for the EDR is 28%. Thus, the discrepancy is not just random. To put this result in context, it is possible to compare it to the average for 120 psychology journals in 2010 (Schimmack, 2022). The ODR is similar (70 vs. 72%), but the EDR is a bit lower (20% vs. 28%), although the difference might be largely due to chance.
4. The EDR can be used to estimate the risk that published results are false positives (i.e., a statistically significant result when H0 is true), using Soric’s (1989) formula for the maximum false discovery rate. An EDR of 20% implies that no more than 20% of the significant results are false positives, however the upper limit of the 95%CI of the EDR, 28%, allows for 36% false positive results. Most readers are likely to agree that this is an unacceptably high risk that published results are false positives. One solution to this problem is to lower the conventional criterion for statistical significance (Benjamin et al., 2017). Figure 2 shows that alpha = .005 reduces the point estimate of the FDR to 3% with an upper limit of the 95% confidence interval of XX%. Thus, without any further information readers could use this criterion to interpret results published by NYU faculty members.
5. The estimated replication rate is based on the mean power of significant studies (Brunner & Schimmack, 2020). Under ideal condition, mean power is a predictor of the success rate in exact replication studies with the same sample sizes as the original studies. However, as NYU professor van Bavel pointed out in an article, replication studies are never exact, especially in social psychology (van Bavel et al., 2016). This implies that actual replication studies have a lower probability of producing a significant result, especially if selection for significance is large. In the worst case scenario, replication studies are not more powerful than original studies before selection for significance. Thus, the EDR provides an estimate of the worst possible success rate in actual replication studies. In the absence of further information, I have proposed to use the average of the EDR and ERR as a predictor of actual replication outcomes. With an ERR of 62% and an EDR of 20%, this implies an actual replication prediction of 41%. This is close to the actual replication rate in the Open Science Reproducibility Project (Open Science Collaboration, 2015). The prediction for results published in 120 journals in 2010 was (ERR = 67% + ERR = 28%)/ 2 = 48%. This suggests that results published by NYU faculty are slightly less replicable than the average result published in psychology journals, but the difference is relatively small and might be mostly due to chance.
6. There are two reasons for low replication rates in psychology. One possibility is that psychologists test many false hypotheses (i.e., H0 is true) and many false positive results are published. False positive results have a very low chance of replicating in actual replication studies (i.e. 5% when .05 is used to reject H0), and will lower the rate of actual replications a lot. Alternative, it is possible that psychologists tests true hypotheses (H0 is false), but with low statistical power (Cohen, 1961). It is difficult to distinguish between these two explanations because the actual rate of false positive results is unknown. However, it is possible to estimate the typical power of true hypotheses tests using Soric’s FDR. If 20% of the significant results are false positives, the power of the 80% true positives has to be (.62 – .2*.05)/.8 = 76%. This would be close to Cohen’s recommended level of 80%, but with a high level of false positive results. Alternatively, the null-hypothesis may never be really true. In this case, the ERR is an estimate of the average power to get a significant result for a true hypothesis. Thus, power is estimated to be between 62% and 76%. The main problem is that this is an average and that many studies have less power. This can be seen in Figure 1 by examining the local power estimates for different levels of z-scores. For z-scores between 2 and 2.5, the ERR is only 47%. Thus, many studies are underpowered and have a low probability of a successful replication with the same sample size even if they showed a true effect.
The results in Figure 1 provide highly aggregated information about replicability of research published by NYU faculty. The following analyses examine potential moderators. First, I examined social and cognitive research. Other areas were too small to be analyzed individually.
The z-curve for the 11 social psychologists was similar to the z-curve in Figure 1 because they provided more test statistics and had a stronger influence on the overall result.
The z-curve for the 6 cognitive psychologists looks different. The EDR and ERR are higher for cognitive psychology, and the 95%CI for social and cognitive psychology do not overlap. This suggests systematic differences between the two fields. These results are consistent with other comparisons of the two fields, including actual replication outcomes (OSC, 2015). With an EDR of 44%, the false discovery risk for cognitive psychology is only 7% with an upper limit of the 95%CI at 12%. This suggests that the conventional criterion of .05 does keep the false positive risk at a reasonably low level or that an adjustment to alpha = .01 is sufficient. In sum, the results show that results published by cognitive researchers at NYU are more replicable than those published by social psychologists.
Since 2015 research practices in some areas of psychology, especially social psychology, have changed to increase replicability. This would imply that research by younger researchers is more replicable than research by more senior researchers that have more publications before 2015. A generation effect would also imply that a department’s replicability increases when older faculty members retire. On the other hand, associate professors are relatively young and likely to influence the reputation of a department for a long time.
The figure above shows that most test statistics come from the (k = 13) professors. As a result, the z-curve looks similar to the z-curve for all test values in Figure 1. The results for the 6 associate professors (below) are more interesting. Although five of the six associate professors are in the social area, the z-curve results show a higher EDR and less selection bias than the plot for all social psychologists. This suggests that the department will improve when full professors in social psychology retire.
Some researchers have changed research practices in response to the replication crisis. It is therefore interesting to examine whether replicability of newer research has improved. To examine this question, I performed a z-curve analysis for articles published in the past five year (2016-2021).
The results show very little signs of improvement. The EDR increased from 20% to 26%, but the confidence intervals are too wide to infer that this is a systematic change. In contrast, Stanford University improved from 22% to 50%, a significant increase. For now, NYU results should be interpreted with alpha = .005 as threshold for significance to maintain a reasonable false positive risk.
The table below shows the meta-statistics of all 19 faculty members. You can see the z-curve for each faculty member by clicking on their name.
It was relatively quiet on academic twitter when most academics were enjoying the last weeks of summer before the start of a new, new-normal semester. This changed on August 17, when the datacolada crew published a new blog post that revealed fraud in a study of dishonesty (http://datacolada.org/98). Suddenly, the integrity of social psychology was once again discussed on twitter, in several newspaper articles, and an article in Science magazine (O’Grady, 2021). The discovery of fraud in one dataset raises questions about other studies in articles published by the same researcher as well as in social psychology in general (“some researchers are calling Ariely’s large body of work into question”; O’Grady, 2021).
The brouhaha about the discovery of fraud is understandable because fraud is widely considered an unethical behavior that violates standards of academic integrity that may end a career (e.g., Stapel). However, there are many other reasons to be suspect of the credibility of Dan Ariely’s published results and those by many other social psychologists. Over the past decade, strong scientific evidence has accumulated that social psychologists’ research practices were inadequate and often failed to produce solid empirical findings that can inform theories of human behavior, including dishonest ones.
Arguably, the most damaging finding for social psychology was the finding that only 25% of published results could be replicated in a direct attempt to reproduce original findings (Open Science Collaboration, 2015). With such a low base-rate of successful replications, all published results in social psychology journals are likely to fail to replicate. The rational response to this discovery is to not trust anything that is published in social psychology journals unless there is evidence that a finding is replicable. Based on this logic, the discovery of fraud in a study published in 2012 is of little significance. Even without fraud, many findings are questionable.
Questionable Research Practices
The idealistic model of a scientist assumes that scientists test predictions by collecting data and then let the data decide whether the prediction was true or false. Articles are written to follow this script with an introduction that makes predictions, a results section that tests these predictions, and a conclusion that takes the results into account. This format makes articles look like they follow the ideal model of science, but it only covers up the fact that actual science is produced in a very different way; at least in social psychology before 2012. Either predictions are made after the results are known (Kerr, 1998) or the results are selected to fit the predictions (Simmons, Nelson, & Simonsohn, 2011).
This explains why most articles in social psychology support authors’ predictions (Sterling, 1959; Sterling et al., 1995; Motyl et al., 2017). This high success rate is not the result of brilliant scientists and deep insights into human behaviors. Instead, it is explained by selection for (statistical) significance. That is, when a result produces a statistically significant result that can be used to claim support for a prediction, researchers write a manuscript and submit it for publication. However, when the result is not significant, they do not write a manuscript. In addition, researchers will analyze their data in multiple ways. If they find one way that supports their predictions, they will report this analysis, and not mention that other ways failed to show the effect. Selection for significance has many names such as publication bias, questionable research practices, or p-hacking. Excessive use of these practices makes it easy to provide evidence for false predictions (Simmons, Nelson, & Simonsohn, 2011). Thus, the end-result of using questionable practices and fraud can be the same; published results are falsely used to support claims as scientifically proven or validated, when they actually have not been subjected to a real empirical test.
Although questionable practices and fraud have the same effect, scientists make a hard distinction between fraud and QRPs. While fraud is generally considered to be dishonest and punished with retractions of articles or even job losses, QRPs are tolerated. This leads to the false impression that articles that have not been retracted provide credible evidence and can be used to make scientific arguments (studies show ….). However, QRPs are much more prevalent than outright fraud and account for the majority of replication failures, but do not result in retractions (John, Loewenstein, & Prelec, 2012; Schimmack, 2021).
The good news is that the use of QRPs is detectable even when original data are not available, whereas fraud typically requires access to the original data to reveal unusual patterns. Over the past decade, my collaborators and I have worked on developing statistical tools that can reveal selection for significance (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020; Schimmack, 2012). I used the most advanced version of these methods, z-curve.2.0, to examine the credibility of results published in Dan Ariely’s articles.
To examine the credibility of results published in Dan Ariely’s articles I followed the same approach that I used for other social psychologists (Replicability Audits). I selected articles based on authors’ H-Index in WebOfKnowledge. At the time of coding, Dan Ariely had an H-Index of 47; that is, he published 47 articles that were cited at least 47 times. I also included the 48th article that was cited 47 times. I focus on the highly cited articles because dishonest reporting of results is more harmful, if the work is highly cited. Just like a falling tree may not make a sound if nobody is around, untrustworthy results in an article that is not cited have no real effect.
For all empirical articles, I picked the most important statistical test per study. The coding of focal results is important because authors may publish non-significant results when they made no prediction. They may also publish a non-significant result when they predict no effect. However, most claims are based on demonstrating a statistically significant result. The focus on a single result is needed to ensure statistical independence which is an assumption made by the statistical model. When multiple focal tests are available, I pick the first one unless another one is theoretically more important (e.g., featured in the abstract). Although this coding is subjective, other researchers including Dan Ariely can do their own coding and verify my results.
Thirty-one of the 48 articles reported at least one empirical study. As some articles reported more than one study, the total number of studies was k = 97. Most of the results were reported with test-statistics like t, F, or chi-square values. These values were first converted into two-sided p-values and then into absolute z-scores. 92 of these z-scores were statistically significant and used for a z-curve analysis.
The key results of the z-curve analysis are captured in Figure 1.
Visual inspection of the z-curve plot shows clear evidence of selection for significance. While a large number of z-scores are just statistically significant (z > 1.96 equals p < .05), there are very few z-scores that are just shy of significance (z < 1.96). Moreover, the few z-scores that do not meet the standard of significance were all interpreted as sufficient evidence for a prediction. Thus, Dan Ariely’s observed success rate is 100% or 95% if only p-values below .05 are counted. As pointed out in the introduction, this is not a unique feature of Dan Ariely’s articles, but a general finding in social psychology.
A formal test of selection for significance compares the observed discovery rate (95% z-scores greater than 1.96) to the expected discovery rate that is predicted by the statistical model. The prediction of the z-curve model is illustrated by the blue curve. Based on the distribution of significant z-scores, the model expected a lot more non-significant results. The estimated expected discovery rate is only 15%. Even though this is just an estimate, the 95% confidence interval around this estimate ranges from 5% to only 31%. Thus, the observed discovery rate is clearly much much higher than one could expect. In short, we have strong evidence that Dan Ariely and his co-authors used questionable practices to report more successes than their actual studies produced.
Although these results cast a shadow over Dan Ariely’s articles, there is a silver lining. It is unlikely that the large pile of just significant results was obtained by outright fraud; not impossible, but unlikely. The reason is that QRPs are bound to produce just significant results, but fraud can produce extremely high z-scores. The fraudulent study that was flagged by datacolada has a z-score of 11, which is virtually impossible to produce with QRPs (Simmons et al., 2001). Thus, while we can disregard many of the results in Ariely’s articles, he does not have to fear to lose his job (unless more fraud is uncovered by data detectives). Ariely is also in good company. The expected discovery rate for John A. Bargh is 15% (Bargh Audit) and the one for Roy F. Baumester is 11% (Baumeister Audit).
The z-curve plot also shows some z-scores greater than 3 or even greater than 4. These z-scores are more likely to reveal true findings (unless they were obtained with fraud) because (a) it gets harder to produce high z-scores with QRPs and replication studies show higher success rates for original studies with strong evidence (Schimmack, 2021). The problem is to find a reasonable criterion to distinguish between questionable results and credible results.
Z-curve make it possible to do so because the EDR estimates can be used to estimate the false discovery risk (Schimmack & Bartos, 2021). As shown in Figure 1, with an EDR of 15% and a significance criterion of alpha = .05, the false discovery risk is 30%. That is, up to 30% of results with p-values below .05 could be false positive results. The false discovery risk can be reduced by lowering alpha. Figure 2 shows the results for alpha = .01. The estimated false discovery risk is now below 5%. This large reduction in the FDR was achieved by treating the pile of just significant results as no longer significant (i.e., it is now on the left side of the vertical red line that reflects significance with alpha = .01, z = 2.58).
With the new significance criterion only 51 of the 97 tests are significant (53%). Thus, it is not necessary to throw away all of Ariely’s published results. About half of his published results might have produced some real evidence. Of course, this assumes that z-scores greater than 2.58 are based on real data. Any investigation should therefore focus on results with p-values below .01.
The final information that is provided by a z-curve analysis is the probability that a replication study with the same sample size produces a statistically significant result. This probability is called the expected replication rate (ERR). Figure 1 shows an ERR of 52% with alpha = 5%, but it includes all of the just significant results. Figure 2 excludes these studies, but uses alpha = 1%. Figure 3 estimates the ERR only for studies that had a p-value below .01 but using alpha = .05 to evaluate the outcome of a replication study.
In Figure 3 only z-scores greater than 2.58 (p = .01; on the right side of the dotted blue line) are used to fit the model using alpha = .05 (the red vertical line at 1.96) as criterion for significance. The estimated replication rate is 85%. Thus, we would predict mostly successful replication outcomes with alpha = .05, if these original studies were replicated and if the original studies were based on real data.
The discovery of a fraudulent dataset in a study on dishonesty has raised new questions about the credibility of social psychology. Meanwhile, the much bigger problem of selection for significance is neglected. Rather than treating studies as credible unless they are retracted, it is time to distrust studies unless there is evidence to trust them. Z-curve provides one way to assure readers that findings can be trusted by keeping the false discovery risk at a reasonably low level, say below 5%. Applying this methods to Ariely’s most cited articles showed that nearly half of Ariely’s published results can be discarded because they entail a high false positive risk. This is also true for many other findings in social psychology, but social psychologists try to pretend that the use of questionable practices was harmless and can be ignored. Instead, undergraduate students, readers of popular psychology books, and policy makers may be better off by ignoring social psychology until social psychologists report all of their results honestly and subject their theories to real empirical tests that may fail. That is, if social psychology wants to be a science, social psychologists have to act like scientists.
Social psychologists have failed to clean up their act and their literature. Here I show unusually high effect sizes in non-retracted articles by Sanna, who retracted several articles. I point out that non-retraction does not equal credibility and I show that co-authors like Norbert Schwarz lack any motivation to correct the published record. The inability of social psychologists to acknowledge and correct their mistakes renders social psychology a para-science that lacks credibility. Even meta-analyses cannot be trusted because they do not correct properly for the use of questionable research practices.
When I grew up, a popular German Schlager was the song “Aber bitte mit Sahne.” The song is about Germans love of deserts with whipped cream. So, when I saw articles by Sanna, I had to think about whipped cream, which is delicious. Unfortunately, articles by Sanna are the exact opposite. In the early 2010s, it became apparent that Sanna had fabricated data. However, unlike the thorough investigation of a similar case in the Netherlands, the extent of Sanna’s fraud remains unclear (Retraction Watch, 2012). The latest count of Sanna’s retracted articles was 8 (Retraction Watch, 2013).
WebOfScience shows 5 retraction notices for 67 articles, which means 62 articles have not been retracted. The question is whether these article can be trusted to provide valid scientific information? The answer to this question matters because Sanna’s articles are still being cited at a rate of over 100 citations per year.
Meta-Analysis of Ease of Retrieval
The data are also being used in meta-analyses (Weingarten & Hutchinson, 2018). Fraudulent data are particularly problematic for meta-analysis because fraud can produce large effect size estimates that may inflate effect size estimates. Here I report the results of my own investigation that focusses on the ease-of-retrieval paradigm that was developed by Norbert Schwarz and colleagues (Schwarz et al., 1991).
The meta-analysis included 7 studies from 6 articles. Two studies produced independent effect size estimates for 2 conditions for a total of 9 effect sizes.
Sanna, L. J., Schwarz, N., & Small, E. M. (2002). Accessibility experiences and the hindsight bias: I knew it all along versus it could never have happened. Memory & Cognition, 30(8), 1288–1296. https://doi.org/10.3758/BF03213410 [Study 1a, 1b]
Sanna, L. J., Schwarz, N., & Stocker, S. L. (2002). When debiasing backfires: Accessible content and accessibility experiences in debiasing hindsight. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28(3), 497–502. https://doi.org/10.1037/0278-73126.96.36.1997 [Study 1 & 2]
Sanna, L. J., & Schwarz, N. (2003). Debiasing the hindsight bias: The role of accessibility experiences and (mis)attributions. Journal of Experimental Social Psychology, 39(3), 287–295. https://doi.org/10.1016/S0022-1031(02)00528-0 [Study 1]
Sanna, L. J., Chang, E. C., & Carter, S. E. (2004). All Our Troubles Seem So Far Away: Temporal Pattern to Accessible Alternatives and Retrospective Team Appraisals. Personality and Social Psychology Bulletin, 30(10), 1359–1371. https://doi.org/10.1177/0146167204263784 [Study 3a]
Sanna, L. J., Parks, C. D., Chang, E. C., & Carter, S. E. (2005). The Hourglass Is Half Full or Half Empty: Temporal Framing and the Group Planning Fallacy. Group Dynamics: Theory, Research, and Practice, 9(3), 173–188. https://doi.org/10.1037/1089-26188.8.131.52 [Study 3a, 3b]
Carter, S. E., & Sanna, L. J. (2008). It’s not just what you say but when you say it: Self-presentation and temporal construal. Journal of Experimental Social Psychology, 44(5), 1339–1345. https://doi.org/10.1016/j.jesp.2008.03.017 [Study 2]
When I examined Sanna’s results, I found that all 9 of these 9 effect sizes were extremely large with effect size estimates being larger than one standard deviation. A logistic regression analysis that predicted authorship (With Sanna vs. Without Sanna) showed that the large effect sizes in Sanna’s articles were unlikely to be due to sampling error alone, b = 4.6, se = 1.1, t(184) = 4.1, p = .00004 (1 / 24,642).
These results show that Sanna’s effect sizes are not typical for the ease-of-retrieval literature. As one of his retracted articles used the ease-of retrieval paradigm, it is possible that these articles are equally untrustworthy. As many other studies have investigated ease-of-retrieval effects, it seems prudent to exclude articles by Sanna from future meta-analysis.
These articles should also not be cited as evidence for specific claims about ease-of-retrieval effects for the specific conditions that were used in these studies. As the meta-analysis shows, there have been no credible replications of these studies and it remains unknown how much ease of retrieval may play a role under the specified conditions in Sanna’s articles.
The blog post is also a warning for young scientists and students of social psychology that they cannot trust researchers who became famous with the help of questionable research practices that produced too many significant results. As the reference list shows, several articles by Sanna were co-authored by Norbert Schwarz, the inventor of the ease-of-retrieval paradigm. It is most likely that he was unaware of Sanna’s fraudulent practices. However, he seemed to lack any concerns that the results might be too good to be true. After all, he encountered replicaiton failures in his own lab.
“of course, we had studies that remained unpublished. Early on we experimented with different manipulations. The main lesson was: if you make the task too blatantly difficult, people correctly conclude the task is too difficult and draw no inference about themselves. We also had a couple of studies with unexpected gender differences” (Schwarz, email communication, 5/18,21).
So, why was he not suspicious when Sanna only produced successful results? I was wondering whether Schwarz had some doubts about these studies with the help of hindsight bias. After all, a decade or more later, we know that he committed fraud for some articles on this topic, we know about replication failures in larger samples (Yeager et al., 2019), and we know that the true effect sizes are much smaller than Sanna’s reported effect sizes (Weingarten & Hutchinson, 2018).
Hi Norbert, thank you for your response. I am doing my own meta-analysis of the literature as I have some issues with the published one by Evan. More about that later. For now, I have a question about some articles that I came across, specifically Sanna, Schwarz, and Small (2002). The results in this study are very strong (d ~ 1). Do you think a replication study powered for 95% power with d = .4 (based on meta-analysis) would produce a significant result? Or do you have concerns about this particular paradigm and do not predict a replication failure? Best, Uli (email
His response shows that he is unwilling or unable to even consider the possibility that Sanna used fraud to produce the results in this article that he co-authored.
Uli, that paper has 2 experiments, one with a few vs many manipulation and one with a facial manipulation. I have no reason to assume that the patterns won’t replicate. They are consistent with numerous earlier few vs many studies and other facial manipulation studies (introduced by Stepper & Strack, JPSP, 1993). The effect sizes always depend on idiosyncracies of topic, population, and context, which influence accessible content and accessibility experience. The theory does not make point predictions and the belief that effect sizes should be identical across decades and populations is silly — we’re dealing with judgments based on accessible content, not with immutable objects.
This response is symptomatic of social psychologists response to decades of research that has produced questionable results that often fail to replicate (see Schimmack, 2020, for a review). Even when there is clear evidence of questionable practices, journals are reluctant to retract articles that make false claims based on invalid data (Kitayama, 2020). And social psychologist Daryl Bem wants rather be remembered as loony para-psychologists than as real scientists (Bem, 2021).
The problem with these social psychologists is not that they made mistakes in the way they conducted their studies. The problem is their inability to acknowledge and correct their mistakes. While they are clinging to their CVs and H-Indices to protect their self-esteem, they are further eroding trust in psychology as a science and force junior scientists who want to improve things out of academia (Hilgard, 2021). After all, the key feature of science that distinguishes it from ideologies is the ability to correct itself. A science that shows no signs of self-correction is a para-science and not a real science. Thus, social psychology is currently para-science (i.e., “Parascience is a broad category of academic disciplines, that are outside the scope of scientific study, Wikipedia).
The only hope for social psychology is that young researchers are unwilling to play by the old rules and start a credibility revolution. However, the incentives still favor conformists who suck up to the old guard. Thus, it is unclear if social psychology will ever become a real science. A first sign of improvement would be to retract articles that make false claims based on results that were produced with questionable research practices. Instead, social psychologists continue to write review articles that ignore the replication crisis (Schwarz & Strack, 2016) as if repression can bend reality.
Citation: Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487
In 2011 I wrote a manuscript in response to Bem’s (2011) unbelievable and flawed evidence for extroverts’ supernatural abilities. It took nearly two years for the manuscript to get published in Psychological Methods. While I was proud to have published in this prestigious journal without formal training in statistics and a grasp of Greek notation, I now realize that Psychological Methods was not the best outlet for the article, which may explain why even some established replication revolutionaries do not know it (comment: I read your blog, but I didn’t know about this article). So, I decided to publish an abridged (it is still long), lightly edited (I have learned a few things since 2011), and commented (comments are in […]) version here.
I also learned a few things about titles. So the revised version, has a new title.
Finally, I can now disregard the request from the editor, Scott Maxwell, on behave of reviewer Daryl Bem, to change the name of my statistical index from magic index to incredibilty index. (the advantage of publishing without the credentials and censorship of peer-review).
For readers not familiar with experimental social psychology, it is also important to understand what a multiple study article is. Most science are happy with one empirical study per article. However, social psychologists didn’t trust the results of a single study with p < .05. Therefore, they wanted to see internal conceptual replications of phenomena. Magically, Bem was able to provide evidence for supernatural abilities in not just 1 or 2 or 3 studies, but 8 conceptual replication studies with 9 successful tests. The chance of a false positive result in 9 statistical tests is smaller than the chance of finding evidence for the Higgs-Bosson particle, which was a big discovery in physics. So, readers in 2011 had a difficult choice to make: either supernatural phenomena are real or multiple study articles are unreal. My article shows that the latter is likely to be true, as did an article by Greg Francis.
Aside from Alcock’s demonstration of a nearly perfect negative correlation between effect sizes and sample sizes and my demonstration of insufficient variance in Bem’s p-values, Francis’s article and my article remain the only article that question the validity of Bem’s origina findings. Other articles have shown that the results cannot be replicated, but I showed that the original results were already too good to be true. This blog post explains, how I did it.
Why most multiple-study articles are false: An Introduction to the Magic Index
(the article formerly known as “The Ironic Effect of Significant Results on the Credibility of Multiple-Study Articles”)
Cohen (1962) pointed out the importance of statistical power for psychology as a science, but statistical power of studies has not increased, while the number of studies in a single article has increased. It has been overlooked that multiple studies with modest power have a high probability of producing nonsignificant results because power decreases as a function of the number of statistical tests that are being conducted (Maxwell, 2004). The discrepancy between the expected number of significant results and the actual number of significant results in multiple-study articles undermines the credibility of the reported
results, and it is likely that questionable research practices have contributed to the reporting of too many significant results (Sterling, 1959). The problem of low power in multiple-study articles is illustrated using Bem’s (2011) article on extrasensory perception and Gailliot et al.’s (2007) article on glucose and self-regulation. I conclude with several recommendations that can increase the credibility of scientific evidence in psychological journals. One major recommendation is to pay more attention to the power of studies to produce positive results without the help of questionable research practices and to request that authors justify sample sizes with a priori predictions of effect sizes. It is also important to publish replication studies with nonsignificant results if these studies have high power to replicate a published finding.
Less is more, except of course for sample size. (Cohen, 1990, p. 1304)
In 2011, the prestigious Journal of Personality and Social Psychology published an article that provided empirical support for extrasensory perception (ESP; Bem, 2011). The publication of this controversial article created vigorous debates in psychology
departments, the media, and science blogs. In response to this debate, the acting editor and the editor-in-chief felt compelled to write an editorial accompanying the article. The editors defended their decision to publish the article by noting that Bem’s (2011) studies were performed according to standard scientific practices in the field of experimental psychology and that it would seem inappropriate to apply a different standard to studies of ESP (Judd & Gawronski, 2011).
Others took a less sanguine view. They saw the publication of Bem’s (2011) article as a sign that the scientific standards guiding publication decisions are flawed and that Bem’s article served as a glaring example of these flaws (Wagenmakers, Wetzels, Borsboom,
& van der Maas, 2011). In a nutshell, Wagenmakers et al. (2011) argued that the standard statistical model in psychology is biased against the null hypothesis; that is, only findings that are statistically significant are submitted and accepted for publication.
This bias leads to the publication of too many positive (i.e., statistically significant) results. The observation that scientific journals, not only those in psychology,
publish too many statistically significant results is by no means novel. In a seminal article, Sterling (1959) noted that selective reporting of statistically significant results can produce literatures that “consist in substantial part of false conclusions” (p.
Three decades later, Sterling, Rosenbaum, and Weinkam (1995) observed that the “practice leading to publication bias have [sic] not changed over a period of 30 years” (p. 108). Recent articles indicate that publication bias remains a problem in psychological
journals (Fiedler, 2011; John, Loewenstein, & Prelec, 2012; Kerr, 1998; Simmons, Nelson, & Simonsohn, 2011; Strube, 2006; Vul, Harris, Winkielman, & Pashler, 2009; Yarkoni, 2010).
Other sciences have the same problem (Yong, 2012). For example, medical journals have seen an increase in the percentage of retracted articles (Steen, 2011a, 2011b), and there is the concern that a vast number of published findings may be false (Ioannidis,
However, a recent comparison of different scientific disciplines suggested that the bias is stronger in psychology than in some of the older and harder scientific disciplines at the top of a hierarchy of sciences (Fanelli, 2010).
It is important that psychologists use the current crisis as an opportunity to fix problems in the way research is being conducted and reported. The proliferation of eye-catching claims based on biased or fake data can have severe negative consequences for a
science. A New Yorker article warned the public that “all sorts of well-established, multiply confirmed findings have started to look increasingly uncertain. It’s as if our facts were losing their truth: claims that have been enshrined in textbooks are suddenly unprovable” (Lehrer, 2010, p. 1).
If students who read psychology textbooks and the general public lose trust in the credibility of psychological science, psychology loses its relevance because
objective empirical data are the only feature that distinguishes psychological science from other approaches to the understanding of human nature and behavior. It is therefore hard to exaggerate the seriousness of doubts about the credibility of research findings published in psychological journals.
In an influential article, Kerr (1998) discussed one source of bias, namely, hypothesizing after the results are known (HARKing). The practice of HARKing may be attributed to the
high costs of conducting a study that produces a nonsignificant result that cannot be published. To avoid this negative outcome, researchers can design more complex studies that test multiple hypotheses. Chances increase that at least one of the hypotheses
will be supported, if only because Type I error increases (Maxwell, 2004). As noted by Wagenmakers et al. (2011), generations of graduate students were explicitly advised that this questionable research practice is how they should write scientific manuscripts
It is possible that Kerr’s (1998) article undermined the credibility of single-study articles and added to the appeal of multiple-study articles (Diener, 1998; Ledgerwood & Sherman, 2012). After all, it is difficult to generate predictions for significant effects
that are inconsistent across studies. Another advantage is that the requirement of multiple significant results essentially lowers the chances of a Type I error, that is, the probability of falsely rejecting the null hypothesis. For a set of five independent studies,
the requirement to demonstrate five significant replications essentially shifts the probability of a Type I error from p < .05 for a single study to p < .0000003 (i.e., .05^5) for a set of five studies.
This is approximately the same stringent criterion that is being used in particle physics to claim a true discovery (Castelvecchi, 2011). It has been overlooked, however, that researchers have to pay a price to meet more stringent criteria of credibility. To demonstrate significance at a more stringent criterion of significance, it is
necessary to increase sample sizes to reduce the probability of making a Type II error (failing to reject the null hypothesis). This probability is called beta. The inverse probability (1 – beta) is called power. Thus, to maintain high statistical power to demonstrate an effect with a more stringent alpha level requires an
increase in sample sizes, just as physicists had to build a bigger collider to have a chance to find evidence for smaller particles like the Higgs boson particle.
Yet there is no evidence that psychologists are using bigger samples to meet more stringent demands of replicability (Cohen, 1992; Maxwell, 2004; Rossi, 1990; Sedlmeier & Gigerenzer, 1989). This raises the question of how researchers are able to replicate findings in multiple-study articles despite modest power to demonstrate significant effects even within a single study. Researchers can use questionable research
practices (e.g., snooping, not reporting failed studies, dropping dependent variables, etc.; Simmons et al., 2011; Strube, 2006) to dramatically increase the chances of obtaining a false-positive result. Moreover, a survey of researchers indicated that these
practices are common (John et al., 2012), and the prevalence of these practices has raised concerns about the credibility of psychology as a science (Yong, 2012).
An implicit assumption in the field appears to be that the solution to these problems is to further increase the number of positive replication studies that need to be presented to ensure scientific credibility (Ledgerwood & Sherman, 2012). However, the assumption that many replications with significant results provide strong evidence for a hypothesis is an illusion that is akin to the Texas sharpshooter fallacy (Milloy, 1995). Imagine a Texan farmer named Joe. One day he invites you to his farm and shows you a target with nine shots in the bull’s-eye and one shot just outside the bull’s-eye. You are impressed by his shooting abilities until you find out that he cannot repeat this performance when you challenge him to do it again.
[So far, well-known Texan sharpshooters in experimental social psychology have carefully avoided demonstrating their sharp shooting abilities in open replication studies to avoid the embarrassment of not being able to do it again].
Over some beers, Joe tells you that he first fired 10 shots at the barn and then drew the targets after the shots were fired. One problem in science is that reading a research
article is a bit like visiting Joe’s farm. Readers only see the final results, without knowing how the final results were created. Is Joe a sharpshooter who drew a target and then fired 10 shots at the target? Or was the target drawn after the fact? The reason why multiple-study articles are akin to a Texan sharpshooter is that psychological studies have modest power (Cohen, 1962; Rossi, 1990; Sedlmeier & Gigerenzer, 1989). Assuming
60% power for a single study, the probability of obtaining 10 significant results in 10 studies is less than 1% (.6^10 = 0.6%).
I call the probability to obtain only significant results in a set of studies total power. Total power parallels Maxwell’s (2004) concept of all-pair power for multiple comparisons in analysis-of variance designs. Figure 1 illustrates how total power decreases with the number of studies that are being conducted. Eventually, it becomes extremely unlikely that a set of studies produces only significant results. This is especially true if a single study has modest power. When total power is low, it is incredible that a set
of studies yielded only significant results. To avoid the problem of incredible results, researchers would have to increase the power of studies in multiple-study articles.
Table 1 shows how the power of individual studies has to be adjusted to maintain 80% total power for a set of studies. For example, to have 80% total power for five replications, the power of each study has to increase to 96%.
Table 1 also shows the sample sizes required to achieve 80% total power, assuming a simple between-group design, an alpha level of .05 (two-tailed), and Cohen’s
(1992) guidelines for a small (d = .2), moderate, (d = .5), and strong (d = .8) effect.
[To demonstrate a small effect 7 times would require more than 10,000 participants.]
In sum, my main proposition is that psychologists have falsely assumed that increasing the number of replications within an article increases credibility of psychological science. The problem of this practice is that a truly programmatic set of multiple studies
is very costly and few researchers are able to conduct multiple studies with adequate power to achieve significant results in all replication attempts. Thus, multiple-study articles have intensified the pressure to use questionable research methods to compensate for low total power and may have weakened rather than strengthened the credibility of psychological science.
[I believe this is one reason why the replication crisis has hit experimental social psychology the hardest. Other psychologists could use HARKing to tell a false story about a single study, but experimental social psychologists had to manipulate the data to get significance all the time. Experimental cognitive psychologists also have multiple study articles, but they tend to use more powerful within-subject designs, which makes it more credible to get significant results multiple times. The multiple study BS design made it impossible to do so, which resulted in the publication of BS results.]
What Is the Allure of Multiple-Study Articles?
One apparent advantage of multiple-study articles is to provide stronger evidence against the null hypothesis (Ledgerwood & Sherman, 2012). However, the number of studies is irrelevant because the strength of the empirical evidence is a function of the
total sample size rather than the number of studies. The main reason why aggregation across studies reduces randomness as a possible explanation for observed mean differences (or correlations) is that p values decrease with increasing sample size. The
number of studies is mostly irrelevant. A study with 1,000 participants has as much power to reject the null hypothesis as a meta-analysis of 10 studies with 100 participants if it is reasonable to assume a common effect size for the 10 studies. If true effect sizes vary across studies, power decreases because a random-effects model may be more appropriate (Schmidt, 2010; but see Bonett, 2009). Moreover, the most logical approach to reduce concerns about Type I error is to use more stringent criteria for significance (Mudge, Baker, Edge, & Houlahan, 2012). For controversial or very important research findings, the significance level could be set to p < .001 or, as in particle physics, to p <
[Ironically, five years later we have a debate about p < .05 versus p < .005, without even thinking about p < .0000005 or any mention that even a pair of studies with p < .05 in each study effectively have an alpha less than p < .005, namely .0025 to be exact.]
It is therefore misleading to suggest that multiple-study articles are more credible than single-study articles. A brief report with a large sample (N = 1,000) provides more credible evidence than a multiple-study article with five small studies (N = 40, total
N = 200).
The main appeal of multiple-study articles seems to be that they can address other concerns (Ledgerwood & Sherman, 2012). For example, one advantage of multiple studies could be to test the results across samples from diverse populations (Henrich, Heine, & Norenzayan, 2010). However, many multiple-study articles are based on samples drawn from a narrowly defined population (typically, students at the local university). If researchers were concerned about generalizability across a wider range of individuals, multiple-study articles should examine different populations. However, it is not clear why it would be advantageous to conduct multiple independent studies with different populations. To compare populations, it would be preferable to use the same procedures and to analyze the data within a single statistical model with population as a potential moderating factor. Moreover, moderator tests often have low power. Thus, a single study with a large sample and moderator variables is more informative than articles that report separate analyses with small samples drawn from different populations.
Another attraction of multiple-study articles appears to be the ability to provide strong evidence for a hypothesis by means of slightly different procedures. However, even here, single studies can be as good as multiple-study articles. For example, replication across different dependent variables in different studies may mask the fact that studies included multiple dependent variables and researchers picked dependent variables that produced significant results (Simmons et al., 2011). In this case, it seems preferable to
demonstrate generalizability across dependent variables by including multiple dependent variables within a single study and reporting the results for all dependent variables.
One advantage of a multimethod assessment in a single study is that the power to
demonstrate an effect increases for two reasons. First, while some dependent variables may produce nonsignificant results in separate small studies due to low power (Maxwell, 2004), they may all show significant effects in a single study with the total sample size
of the smaller studies. Second, it is possible to increase power further by constraining coefficients for each dependent variable or by using a latent-variable measurement model to test whether the effect is significant across dependent variables rather than for each one independently.
Multiple-study articles are most common in experimental psychology to demonstrate the robustness of a phenomenon using slightly different experimental manipulations. For example, Bem (2011) used a variety of paradigms to examine ESP. Demonstrating
a phenomenon in several different ways can show that a finding is not limited to very specific experimental conditions. Analogously, if Joe can hit the bull’s-eye nine times from different angles, with different guns, and in different light conditions, Joe
truly must be a sharpshooter. However, the variation of experimental procedures also introduces more opportunities for biases (Ioannidis, 2005).
[This is my take down of social psychologists’ claim that multiple conceptual replications test theories, Stroebe & Strack, 2004]
The reason is that variation of experimental procedures allows researchers to discount null findings. Namely, it is possible to attribute nonsignificant results to problems with the experimental procedure rather than to the absence of an effect. In this way, empirical studies no longer test theoretical hypotheses because they can only produce two results: Either they support the theory (p < .05) or the manipulation did not work (p > .05). It is therefore worrisome that Bem noted that “like most social psychological experiments, the experiments reported here required extensive pilot testing” (Bem, 2011, p. 421). If Joe is a sharpshooter, who can hit the bull’s-eye from different angles and with different guns, why does he need extensive training before he can perform the critical shot?
The freedom of researchers to discount null findings leads to the paradox that conceptual replications across multiple studies give the impression that an effect is robust followed by warnings that experimental findings may not replicate because they depend “on subtle and unknown factors” (Bem, 2011, p. 422).
If experimental results were highly context dependent, it would be difficult to explain how studies reported in research articles nearly always produce the expected results. One possible explanation for this paradox is that sampling error in small samples creates the illusion that effect sizes vary systematically, although most of the variation is random. Researchers then pick studies that randomly produced inflated effect sizes and may further inflate them by using questionable research methods to achieve significance (Simmons et al., 2011).
[I was polite when I said “may”. This appears to be exactly what Bem did to get his supernatural effects.]
The final set of studies that worked is then published and gives a false sense of the effect size and replicability of the effect (you should see the other side of Joe’s barn). This may explain why research findings initially seem so impressive, but when other researchers try to build on these seemingly robust findings, it becomes increasingly uncertain whether a phenomenon exists at all (Ioannidis, 2005; Lehrer, 2010).
At this point, a lot of resources have been wasted without providing credible evidence for an effect.
[And then Stroebe and Strack in 2014 suggest that real replication studies that let the data determine the outcome are a waste of resources.]
To increase the credibility of reported findings, it would be better to use all of the resources for one powerful study. For example, the main dependent variable in Bem’s (2011) study of ESP was the percentage of correct predictions of future events.
Rather than testing this ability 10 times with N = 100 participants, it would have been possible to test the main effect of ESP in a single study with 10 variations of experimental procedures and use the experimental conditions as a moderating factor. By testing one
main effect of ESP in a single study with N = 1,000, power would be greater than 99.9% to demonstrate an effect with Bem’s a priori effect size.
At the same time, the power to demonstrate significant moderating effects would be much lower. Thus, the study would lead to the conclusion that ESP does exist but that it is unclear whether the effect size varies as a function of the actual experimental
paradigm. This question could then be examined in follow-up studies with more powerful tests of moderating factors.
In conclusion, it is true that a programmatic set of studies is superior to a brief article that reports a single study if both articles have the same total power to produce significant results (Ledgerwood & Sherman, 2012). However, once researchers use questionable research practices to make up for insufficient total power, multiple-study articles lose their main advantage over single-study articles, namely, to demonstrate generalizability across different experimental manipulations or other extraneous factors.
Moreover, the demand for multiple studies counteracts the demand for more
powerful studies (Cohen, 1962; Maxwell, 2004; Rossi, 1990) because limited resources (e.g., subject pool of PSY100 students) can only be used to increase sample size in one study or to conduct more studies with small samples.
It is therefore likely that the demand for multiple studies within a single article has eroded rather than strengthened the credibility of published research findings
(Steen, 2011a, 2011b), and it is problematic to suggest that multiple-study articles solve the problem that journals publish too many positive results (Ledgerwood & Sherman, 2012). Ironically, the reverse may be true because multiple-study articles provide a
false sense of credibility.
Joe the Magician: How Many Significant Results Are Too Many?
Most people enjoy a good magic show. It is fascinating to see something and to know at the same time that it cannot be real. Imagine that Joe is a well-known magician. In front of a large audience, he fires nine shots from impossible angles, blindfolded, and seemingly through the body of an assistant, who miraculously does not bleed. You cannot figure out how Joe pulled off the stunt, but you know it was a stunt. Similarly, seeing Joe hit the bull’s-eye 1,000 times in a row raises concerns about his abilities as a sharpshooter and suggests that some magic is contributing to this miraculous performance. Magic is fun, but it is not science.
[Before Bem’s article appeared, Steve Heine gave a talk at the University of Toront where he presented multiple studies with manipulations of absurdity (absurdity like Monty Python’s “Biggles: Pioneer Air Fighter; cf. Proulx, Heine, & Vohs, PSPB, 2010). Each absurd manipulation was successful. I didn’t have my magic index then, but I did understand the logic of Sterling et al.’s (1995) argument. So, I did ask whether there were also manipulations that did not work and the answer was affirmative. It was rude at the time to ask about a file drawer before 2011, but a recent twitter discussion suggests that it wouldn’t be rude in 2018. Times are changing.]
The problem is that some articles in psychological journals appear to be more magical than one would expect on the basis of the normative model of science (Kerr, 1998). To increase the credibility of published results, it would be desirable to have a diagnostic tool that can distinguish between credible research findings and those that are likely to be based on questionable research practices. Such a tool would also help to
counteract the illusion that multiple-study articles are superior to single-study articles without leading to the erroneous reverse conclusion that single-study articles are more trustworthy.
[I need to explain why I targeted multiple-study articles in particular. Even the personality section of JPSP started to demand multiple studies because they created the illusion of being more rigorous, e.g., the crazy glucose article was published in that section. At that time, I was still trying to publish as many articles as possible in JPSP and I was not able to compete with crazy science.]
Articles should be evaluated on the basis of their total power to demonstrate consistent evidence for an effect. As such, a single-study article with 80% (total) power is superior to a multiple-study article with 20% total power, but a multiple-study article with 80% total power is superior to a single-study article with 80% power.
The Magic Index (formerly known as the Incredibility Index)
The idea to use power analysis to examine bias in favor of theoretically predicted effects and against the null hypothesis was introduced by Sterling et al. (1995). Ioannidis and Trikalinos (2007) provided a more detailed discussion of this approach for the detection of bias in meta-analyses. Ioannidis and Trikalinos’s exploratory test estimates the probability of the number of reported significant results given the average power of the reported studies. Low p values suggest that there are too many significant results, suggesting that questionable research methods contributed to the reported results. In contrast, the inverse inference is not justified because high p values do not justify the inference that questionable research practices did not contribute to the results. To emphasize this asymmetry in inferential strength, I suggest reversing the exploratory test, focusing on the probability of obtaining more nonsignificant results than were reported in a multiple-study article and calling this index the magic index.
Higher values indicate that there is a surprising lack of nonsignificant results (a.k.a., shots that missed the bull’s eye). The higher the magic index is, the more incredible the observed outcome becomes.
Too many significant results could be due to faking, fudging, or fortune. Thus, the statistical demonstration that a set of reported findings is magical does not prove that questionable research methods contributed to the results in a multiple-study article. However, even when questionable research methods did not contribute to the results, the published results are still likely to be biased because fortune helped to inflate effect sizes and produce more significant results than total power justifies.
Computation of the Incredibility Index
To understand the basic logic of the M-index, it is helpful to consider a concrete example. Imagine a multiple-study article with 10 studies with an average observed effect size of d = .5 and 84 participants in each study (42 in two conditions, total N = 840) and all studies producing a significant result. At first sight, these 10 studies seem to provide strong support against the null hypothesis. However, a post hoc power analysis with the average effect size of d = .5 as estimate of the true effect size reveals that each study had
only 60% power to obtain a significant result. That is, even if the true effect size were d = .5, only six out of 10 studies should have produced a significant result.
The M-index quantifies the probability of the actual outcome (10 out of 10 significant results) given the expected value (six out of 10 significant results) using binomial
probability theory. From the perspective of binomial probability theory, the scenario
is analogous to an urn problem with replacement with six green balls (significant) and four red balls (nonsignificant). The binomial probability to draw at least one red ball in 10 independent draws is 99.4%. (Stat Trek, 2012).
That is, 994 out of 1,000 multiple-study articles with 10 studies and 60% average power
should have produced at least one nonsignificant result in one of the 10 studies. It is therefore incredible if an article reports 10 significant results because only six out of 1,000 attempts would have produced this outcome simply due to chance alone.
[I now realize that observed power of 60% would imply that the null-hypothesis is true because observed power is also inflated by selecting for significance. As 50% observed poewr is needed to achieve significance and chance cannot produce the same observed power each time, the minimum observed power is 62%!]
One of the main problems for power analysis in general and the computation of the IC-index in particular is that the true effect size is unknown and has to be estimated. There are three basic approaches to the estimation of true effect sizes. In rare cases, researchers provide explicit a priori assumptions about effect sizes (Bem, 2011). In this situation, it seems most appropriate to use an author’s stated assumptions about effect sizes to compute power with the sample sizes of each study. A second approach is to average reported effect sizes either by simply computing the mean value or by weighting effect sizes by their sample sizes. Averaging of effect sizes has the advantage that post hoc effect size estimates of single studies tend to have large confidence intervals. The confidence intervals shrink when effect sizes are aggregated across
studies. However, this approach has two drawbacks. First, averaging of effect sizes makes strong assumptions about the sampling of studies and the distribution of effect sizes (Bonett, 2009). Second, this approach assumes that all studies have the same effect
size, which is unlikely if a set of studies used different manipulations and dependent variables to demonstrate the generalizability of an effect. Ioannidis and Trikalinos (2007) were careful to warn readers that “genuine heterogeneity may be mistaken for bias” (p.
[I did not know about Ioannidis and Trikalinos’s (2007) article when I wrote the first draft. Maybe that is a good thing because I might have followed their approach. However, my approach is different from their approach and solves the problem of pooling effect sizes. Claiming that my method is the same as Trikalinos’s method is like confusing random effects meta-analysis with fixed-effect meta-analysis]
To avoid the problems of average effect sizes, it is promising to consider a third option. Rather than pooling effect sizes, it is possible to conduct post hoc power analysis for each study. Although each post hoc power estimate is associated with considerable sampling error, sampling errors tend to cancel each other out, and the M-index for a set of studies becomes more accurate without having to assume equal effect sizes in all studies.
Unfortunately, this does not guarantee that the M-index is unbiased because power is a nonlinear function of effect sizes. Yuan and Maxwell (2005) examined the implications of this nonlinear relationship. They found that the M-index may provide inflated estimates of average power, especially in small samples where observed effect sizes vary widely around the true effect size. Thus, the M-index is conservative when power is low and magic had to be used to create significant results.
In sum, it is possible to use reported effect sizes to compute post hoc power and to use post hoc power estimates to determine the probability of obtaining a significant result. The post hoc power values can be averaged and used as the probability for a successful
outcome. It is then possible to use binomial probability theory to determine the probability that a set of studies would have produced equal or more nonsignificant results than were actually reported. This probability is [now] called the M-index.
[Meanwhile, I have learned that it is much easier to compute observed power based on reported test statistics like t, F, and chi-square values because observed power is determined by these statistics.]
Example 1: Extrasensory Perception (Bem, 2011)
I use Bem’s (2011) article as an example because it may have been a tipping point for the current scientific paradigm in psychology (Wagenmakers et al., 2011).
[I am still waiting for EJ to return the favor and cite my work.]
The editors explicitly justified the publication of Bem’s article on the grounds that it was subjected to a rigorous review process, suggesting that it met current standards of scientific practice (Judd & Gawronski, 2011). In addition, the editors hoped that the publication of Bem’s article and Wagenmakers et al.’s (2011) critique would stimulate “critical further thoughts about appropriate methods in research on social cognition and attitudes” (Judd & Gawronski, 2011, p. 406).
A first step in the computation of the M-index is to define the set of effects that are being examined. This may seem trivial when the M-index is used to evaluate the credibility of results in a single article, but multiple-study articles contain many results and it is not always obvious that all results should be included in the analysis (Maxwell, 2004).
[Same here. Maxwell accepted my article, but apparently doesn’t think it is useful to cite when he writes about the replication crisis.]
[deleted minute details about Bem’s study here.]
Another decision concerns the number of hypotheses that should be examined. Just as multiple studies reduce total power, tests of multiple hypotheses within a single study also reduce total power (Maxwell, 2004). Francis (2012b) decided to focus only on the
hypothesis that ESP exists, that is, that the average individual can foresee the future. However, Bem (2011) also made predictions about individual differences in ESP. Therefore, I used all 19 effects reported in Table 7 (11 ESP effects and eight personality effects).
[I deleted the section that explains alternative approaches that rely on effect sizes rather than observed power here.]
I used G*Power 3.1.2 to obtain post hoc power on the basis of effect sizes and sample sizes (Faul, Erdfelder, Buchner, & Lang, 2009).
The M-index is more powerful when a set of studies contains only significant results. In this special case, the M-index is the inverse probability of total power.
[An article by Fabrigar and Wegener misrepresents my article and confuses the M-Index with total power. When articles do report non-significant result and honestly report them as failures to reject the null-hypothesis (not marginal significance), it is necessary to compute the binomial probability to get the M-Index.]
[Again, I deleted minute computations for Bem’s results.]
Using the highest magic estimates produces a total Magic-Index of 99.97% for Bem’s 17 results. Thus, it is unlikely that Bem (2011) conducted 10 studies, ran 19 statistical tests of planned hypotheses, and obtained 14 statisstically significant results.
Yet the editors felt compelled to publish the manuscript because “we can only take the author at his word that his data are in fact genuine and that the reported findings have not been taken from a larger set of unpublished studies showing null effects” (Judd & Gawronski, 2011, p. 406).
[It is well known that authors excluded disconfirming evidence and that editors sometimes even asked authors to engage in this questionable research practice. However, this quote implies that the editors asked Bem about failed studies and that he assured them that there are no failed studies, which may have been necessary to publish these magical results in JPSP. If Bem did not disclose failed studies on request and these studies exist, it would violate even the lax ethical standards of the time that mostly operated on a “don’t ask don’t tell” basis. ]
The M-index provides quantitative information about the credibility of this assumption and would have provided the editors with objective information to guide their decision. More importantly, awareness about total power could have helped Bem to plan fewer studies with higher total power to provide more credible evidence for his hypotheses.
Example 2: Sugar High—When Rewards Undermine Self-Control
Bem’s (2011) article is exceptional in that it examined a controversial phenomenon. I used another nine-study article that was published in the prestigious Journal of Personality and Social Psychology to demonstrate that low total power is also a problem
for articles that elicit less skepticism because they investigate less controversial hypotheses. Gailliot et al. (2007) examined the relation between blood glucose levels and self-regulation. I chose this article because it has attracted a lot of attention (142 citations in Web of Science as of May 2012; an average of 24 citations per year) and it is possible to evaluate the replicability of the original findings on the basis of subsequent studies by other researchers (Dvorak & Simons, 2009; Kurzban, 2010).
[If anybody needs evidence that citation counts are a silly indicator of quality, here it is: the article has been cited 80 times in 2014, 64 times in 2015, 63 times in 2016, and 61 times in 2017. A good reason to retract it, if JPSP and APA cares about science and not just impact factors.]
Sample sizes were modest, ranging from N = 12 to 102. Four studies had sample sizes of N < 20, which Simmons et al. (2011) considered to require special justification. The total N is 359 participants. Table 1 shows that this total sample
size is sufficient to have 80% total power for four large effects or two moderate effects and is insufficient to demonstrate a [single] small effect. Notably, Table 4 shows that all nine reported studies produced significant results.
The M-Index for these 9 studies was greater than 99%. This indicates that from a statistical point of view, Bem’s (2011) evidence for ESP is more credible than Gailliot et al.’s (2007) evidence for a role of blood glucose in self-regulation.
A more powerful replication study with N = 180 participants provides more conclusive evidence (Dvorak & Simons, 2009). This study actually replicated Gailliot et al.’s (1997) findings in Study 1. At the same time, the study failed to replicate the results for Studies 3–6 in the original article. Dvorak and Simons (2009) did not report the correlation, but the authors were kind enough to provide this information. The correlation was not significant in the experimental group, r(90) = .10, and the control group, r(90) =
.03. Even in the total sample, it did not reach significance, r(180) = .11. It is therefore extremely likely that the original correlations were inflated because a study with a sample of N = 90 has 99.9% power to produce a significant effect if the true effect
size is r = .5. Thus, Dvorak and Simons’s results confirm the prediction of the M-index that the strong correlations in the original article are incredible.
In conclusion, Gailliot et al. (2007) had limited resources to examine the role of blood glucose in self-regulation. By attempting replications in nine studies, they did not provide strong evidence for their theory. Rather, the results are incredible and difficult to replicate, presumably because the original studies yielded inflated effect sizes. A better solution would have been to test the three hypotheses in a single study with a large sample. This approach also makes it possible to test additional hypotheses, such as mediation (Dvorak & Simons, 2009). Thus, Example 2 illustrates that
a single powerful study is more informative than several small studies.
Fifty years ago, Cohen (1962) made a fundamental contribution to psychology by emphasizing the importance of statistical power to produce strong evidence for theoretically predicted effects. He also noted that most studies at that time had only sufficient power to provide evidence for strong effects. Fifty years later, power
analysis remains neglected. The prevalence of studies with insufficient power hampers scientific progress in two ways. First, there are too many Type II errors that are often falsely interpreted as evidence for the null hypothesis (Maxwell, 2004). Second, there
are too many false-positive results (Sterling, 1959; Sterling et al., 1995). Replication across multiple studies within a single article has been considered a solution to these problems (Ledgerwood & Sherman, 2012). The main contribution of this article is to point out that multiple-study articles do not provide more credible evidence simply because they report more statistically significant results. Given the modest power of individual studies, it is even less credible that researchers were able to replicate results repeatedly in a series of studies than that they obtained a significant effect in a single study.
The demonstration that multiple-study articles often report incredible results might help to reduce the allure of multiple-study articles (Francis, 2012a, 2012b). This is not to say that multiple-study articles are intrinsically flawed or that single-study articles are superior. However, more studies are only superior if total power is held constant, yet limited resources create a trade-off between the number of studies and total power of a set of studies.
To maintain credibility, it is better to maximize total power rather than number of studies. In this regard, it is encouraging that some editors no longer consider number ofstudies as a selection criterion for publication (Smith, 2012).
[Over the past years, I have been disappointed by many psychologists that I admired or respected. I loved ER Smith’s work on exemplar models that influenced my dissertation work on frequency estimation of emotion. In 2012, I was hopeful that he would make real changes, but my replicability rankings show that nothing changed during his term as editor of the JPSP section that published Bem’s article. Five wasted years and nobody can say he couldn’t have known better.]
Subsequently, I first discuss the puzzling question of why power continues to be ignored despite the crucial importance of power to obtain significant results without the help of questionable research methods. I then discuss the importance of paying more attention to total power to increase the credibility of psychology as a science. Due to space limitations, I will not repeat many other valuable suggestions that have been made to improve the current scientific model (Schooler, 2011; Simmons et al., 2011; Spellman, 2012; Wagenmakers et al., 2011).
In my discussion, I will refer to Bem’s (2011) and Gailliot et al.’s (2007) articles, but it should be clear that these articles merely exemplify flaws of the current scientific
paradigm in psychology.
Why Do Researchers Continue to Ignore Power?
Maxwell (2004) proposed that researchers ignore power because they can use a shotgun approach. That is, if Joe sprays the barn with bullets, he is likely to hit the bull’s-eye at least once. For example, experimental psychologists may use complex factorial
designs that test multiple main effects and interactions to obtain at
least one significant effect (Maxwell, 2004).
Psychologists who work with many variables can test a large number of correlations
to find a significant one (Kerr, 1998). Although studies with small samples have modest power to detect all significant effects (low total power), they have high power to detect at least one significant effect (Maxwell, 2004).
The shotgun model is unlikely to explain incredible results in multiple-study articles because the pattern of results in a set of studies has to be consistent. This has been seen as the main strength of multiple-study articles (Ledgerwood & Sherman, 2012).
However, low total power in multiple-study articles makes it improbable that all studies produce significant results and increases the pressure on researchers to use questionable research methods to comply with the questionable selection criterion that
manuscripts should report only significant results.
A simple solution to this problem would be to increase total power to avoid
having to use questionable research methods. It is therefore even more puzzling why the requirement of multiple studies has not resulted in an increase in power.
One possible explanation is that researchers do not care about effect sizes. Researchers may not consider it unethical to use questionable research methods that inflate effect sizes as long as they are convinced that the sign of the reported effect is consistent
with the sign of the true effect. For example, the theory that implicit attitudes are malleable is supported by a positive effect of experimental manipulations on the implicit association test, no matter whether the effect size is d = .8 (Dasgupta & Greenwald,
2001) or d = .08 (Joy-Gaba & Nosek, 2010), and the influence of blood glucose levels on self-control is supported by a strong correlation of r = .6 (Gailliot et al., 2007) and a weak correlation of r = .1 (Dvorak & Simons, 2009).
The problem is that in the real world, effect sizes matter. For example, it matters whether exercising for 20 minutes twice a week leads to a weight loss of one
pound or 10 pounds. Unbiased estimates of effect sizes are also important for the integrity of the field. Initial publications with stunning and inflated effect sizes produce underpowered replication studies even if subsequent researchers use a priori power analysis.
As failed replications are difficult to publish, inflated effect sizes are persistent and can bias estimates of true effect sizes in meta-analyses. Failed replication studies in file drawers also waste valuable resources (Spellman, 2012).
In comparison to one small (N = 40) published study with an inflated effect size and
nine replication studies with nonsignificant replications in file drawers (N = 360), it would have been better to pool the resources of all 10 studies for one strong test of an important hypothesis (N = 400).
A related explanation is that true effect sizes are often likely to be small to moderate and that researchers may not have sufficient resources for unbiased tests of their hypotheses. As a result, they have to rely on fortune (Wegner, 1992) or questionable research
methods (Simmons et al., 2011; Vul et al., 2009) to report inflated observed effect sizes that reach statistical significance in small samples.
Another explanation is that researchers prefer small samples to large samples because small samples have less power. When publications do not report effect sizes, sample sizes become an imperfect indicator of effect sizes because only strong effects
reach significance in small samples. This has led to the flawed perception that effect sizes in large samples have no practical significance because even effects without practical significance can reach statistical significance (cf. Royall, 1986). This line of
reasoning is fundamentally flawed and confounds credibility of scientific evidence with effect sizes.
The most probable and banal explanation for ignoring power is poor statistical training at the undergraduate and graduate levels. Discussions with colleagues and graduate students suggest that power analysis is mentioned, but without a sense of importance.
[I have been preaching about power for years in my department and it became a running joke for students to mention power in their presentation without having any effect on research practices until 2011. Fortunately, Bem unintentionally made it able to convince some colleagues that power is important.]
Research articles also reinforce the impression that power analysis is not important as sample sizes vary seemingly at random from study to study or article to article. As a result, most researchers probably do not know how risky their studies are and how lucky they are when they do get significant and inflated effects.
I hope that this article will change this and that readers take total power into account when they read the next article with five or more studies and 10 or more significant results and wonder whether they have witnessed a sharpshooter or have seen a magic show.
Finally, it is possible that researchers ignore power simply because they follow current practices in the field. Few scientists are surprised that published findings are too good to be true. Indeed, a common response to presentations of this work has been that the M-index only shows the obvious. Everybody knows that researchers use a number of questionable research practices to increase their chances of reporting significant results, and a high percentage of researchers admit to using these practices, presumably
because they do not consider them to be questionable (John et al., 2012).
[Even in 2014, Stroebe and Strack claim that it is not clear which practices should be considered questionable, whereas my undergraduate students have no problem realizing that hiding failed studies undermines the purpose of doing an empirical study in the first place.]
The benign view of current practices is that successful studies provide all of the relevant information. Nobody wants to know about all the failed attempts of alchemists to turn base metals into gold, but everybody would want to know about a process that
actually achieves this goal. However, this logic rests on the assumption that successful studies were really successful and that unsuccessful studies were really flawed. Given the modest power of studies, this conclusion is rarely justified (Maxwell, 2004).
To improve the status of psychological science, it will be important to elevate the scientific standards of the field. Rather than pointing to limited resources as an excuse,
researchers should allocate resources more wisely (spend less money on underpowered studies) and conduct more relevant research that can attract more funding. I think it would be a mistake to excuse the use of questionable research practices by pointing out that false discoveries in psychological research have less dramatic consequences than drugs with little benefits, huge costs, and potential side effects.
Therefore, I disagree with Bem’s (2000) view that psychologists should “err on the side of discovery” (p. 5).
[Yup, he wrote that in a chapter that was used to train graduate students in social psychology in the art of magic.]
Recommendations for Improvement
Use Power in the Evaluation of Manuscripts
Granting agencies often ask that researchers plan studies with adequate power (Fritz & MacKinnon, 2007). However, power analysis is ignored when researchers report their results. The reason is probably that (a priori) power analysis is only seen as a way to ensure that a study produces a significant result. Once a significant finding has been found, low power no longer seems to be a problem. After all, a significant effect was found (in one condition, for male participants, after excluding two outliers, p =
One way to improve psychological science is to require researchers to justify sample sizes in the method section. For multiple-study articles, researchers should be asked to compute total power.
[This is something nobody has even started to discuss. Although there are more and more (often questionable) a priori power calculations in articles, they tend to aim for 80% power for a single hypothesis test, but these articles often report multiple studies or multiple hypothesis tests in a single article. The power to get two significant results with 80-% for each test is only 64%. ]
If a study has 80% total power, researchers should also explain how they would deal with the possible outcome of a nonsignificant result. Maybe it would change the perception of research contributions when a research article reports 10 significant
results, although power was only sufficient to obtain six. Implementing this policy would be simple. Thus, it is up to editors to realize the importance of statistical power and to make power an evaluation criterion in the review process (Cohen, 1992).
Implementing this policy could change the hierarchy of psychological
journals. Top journals would no longer be the journals with the most inflated effect sizes but, rather, the journals with the most powerful studies and the most credible scientific evidence.
[Based on this idea, I started developing my replicability rankings of journals. And they show that impact factors still do not take replicability into account.]
Reward Effort Rather Than Number of Significant Results
Another recommendation is to pay more attention to the total effort that went into an empirical study rather than the number of significant p values. The requirement to have multiple studies with no guidelines about power encourages a frantic empiricism in
which researchers will conduct as many cheap and easy studies as possible to find a set of significant results.
[And if power is taken into account, researchers now do six cheap Mturk studies. Although this is better than six questionable studies, it does not correct the problem that good research often requires a lot of resources.]
It is simply too costly for researchers to invest in studies with observation of real behaviors, high ecological validity, or longitudinal assessments that take
time and may produce a nonsignificant result.
Given the current environmental pressures, a low-quality/high-quantity strategy is
more adaptive and will ensure survival (publish or perish) and reproductive success (more graduate students who pursue a lowquality/ high-quantity strategy).
[It doesn’t help to become a meta-psychologists. Which smart undergraduate student would risk the prospect of a career by becoming a meta-psychologist?]
A common misperception is that multiple-study articles should be rewarded because they required more effort than a single study. However, the number of studies is often a function of the difficulty of conducting research. It is therefore extremely problematic to
assume that multiple studies are more valuable than single studies.
A single longitudinal study can be costly but can answer questions that multiple cross-sectional studies cannot answer. For example, one of the most important developments in psychological measurement has been the development of the implicit association test
(Greenwald, McGhee, & Schwartz, 1998). A widespread belief about the implicit association test is that it measures implicit attitudes that are more stable than explicit attitudes (Gawronski, 2009), but there exist hardly any longitudinal studies of the stability of implicit attitudes.
[I haven’t checked but I don’t think this has changed much. Cross-sectional Mturk studies can still produce sexier results than a study that simply estimates the stability of the same measure over time. Social psychologists tend to be impatient creatures (e.g., Bem)]
A simple way to change the incentive structure in the field is to undermine the false belief that multiple-study articles are better than single-study articles. Often multiple studies are better combined into a single study. For example, one article published four studies that were identical “except that the exposure duration—suboptimal (4 ms)
or optimal (1 s)—of both the initial exposure phase and the subsequent priming phase was orthogonally varied” (Murphy, Zajonc, & Monahan, 1995, p. 589). In other words, the four studies were four conditions of a 2 x 2 design. It would have been more efficient and
informative to combine the information of all studies in a single study. In fact, after reporting each study individually, the authors reported the results of a combined analysis. “When all four studies are entered into a single analysis, a clear pattern emerges” (Murphy et al., 1995, p. 600). Although this article may be the most extreme example of unnecessary multiplicity, other multiple-study articles could also be more informative by reducing the number of studies in a single article.
Apparently, readers of scientific articles are aware of the limited information gain provided by multiple-study articles because citation counts show that multiple-study articles do not have more impact than single-study articles (Haslam et al., 2008). Thus, editors should avoid using number of studies as a criterion for accepting articles.
Allow Publication of Nonsignificant Results
The main point of the M-index is to alert researchers, reviewers, editors, and readers of scientific articles that a series of studies that produced only significant results is neither a cause for celebration nor strong evidence for the demonstration of a scientific discovery; at least not without a power analysis that shows the results are credible.
Given the typical power of psychological studies, nonsignificant findings should be obtained regularly, and the absence of nonsignificant results raises concerns about the credibility of published research findings.
Most of the time, biases may be benign and simply produce inflated effect sizes, but occasionally, it is possible that biases may have more serious consequences (e.g.,
demonstrate phenomena that do not exist).
A perfectly planned set of five studies, where each study has 80% power, is expected to produce one nonsignificant result. It is not clear why editors sometimes ask researchers to remove studies with nonsignificant results. Science is not a beauty contest, and a
nonsignificant result is not a blemish.
This wisdom is captured in the Japanese concept of wabi-sabi, in which beautiful objects are designed to have a superficial imperfection as a reminder that nothing is perfect. On the basis of this conception of beauty, a truly perfect set of studies is one that echoes the imperfection of reality by including failed studies or studies that did not produce significant results.
Even if these studies are not reported in great detail, it might be useful to describe failed studies and explain how they informed the development of studies that produced significant results. Another possibility is to honestly report that a study failed to produce a significant result with a sample size that provided 80% power and that the researcher then added more participants to increase power to 95%. This is different from snooping (looking at the data until a significant result has been found), especially if it is stated clearly that the sample size was increased because the effect was not significant with the originally planned sample size and the significance test has been adjusted to take into account that two significance tests were performed.
The M-index rewards honest reporting of results because reporting of null findings renders the number of significant results more consistent with the total power of the studies. In contrast, a high M-index can undermine the allure of articles that report more significant results than the power of the studies warrants. In this
way, post-hoc power analysis could have the beneficial effect that researchers finally start paying more attention to a priori power.
Limited resources may make it difficult to achieve high total power. When total power is modest, it becomes important to report nonsignificant results. One way to report nonsignificant results would be to limit detailed discussion to successful studies but to
include studies with nonsignificant results in a meta-analysis. For example, Bem (2011) reported a meta-analysis of all studies covered in the article. However, he also mentioned several pilot studies and a smaller study that failed to produce a significant
result. To reduce bias and increase credibility, pilot studies or other failed studies could be included in a meta-analysis at the end of a multiple-study article. The meta-analysis could show that the effect is significant across an unbiased sample of studies that produced significant and nonsignificant results.
This overall effect is functionally equivalent to the test of the hypothesis in a single
study with high power. Importantly, the meta-analysis is only credible if it includes nonsignificant results.
[Since then, several articles have proposed meta-analyses and given tutorials on mini-meta-analysis without citing my article and without clarifying that these meta-analysis are only useful if all evidence is included and without clarifying that bias tests like the M-Index can reveal whether all relevant evidence was included.]
It is also important that top journals publish failed replication studies. The reason is that top journals are partially responsible for the contribution of questionable research practices to published research findings. These journals look for novel and groundbreaking studies that will garner many citations to solidify their position
as top journals. As everywhere else (e.g., investing), the higher payoff comes with a higher risk. In this case, the risk is publishing false results. Moreover, the incentives for researchers to get published in top journals or get tenure at Ivy League universities
increases the probability that questionable research practices contribute
to articles in the top journals (Ledford, 2010). Stapel faked data to get a publication in Science, not to get a publication in Psychological Reports.
There are positive signs that some journal editors are recognizing their responsibility for publication bias (Dirnagl & Lauritzen, 2010). The medical journal Journal of Cerebral Blood Flow and Metabolism created a section that allows researchers to publish studies with disconfirmatory evidence so that this evidence is published in the same journal. One major advantage of having this section in top journals is that it may change the evaluation criteria of journal editors toward a more careful assessment of Type I error when they accept a manuscript for publication. After all, it would be quite embarrassing to publish numerous articles that erred on the side of discovery if subsequent issues reveal that these discoveries were illusory.
[After some pressure from social media, JPSP did publish failed replications of Bem, and it now has a replication section (online only). Maybe somebody can dig up some failed replications of glucose studies, I know they exist, or do one more study to publish in JPSP that, just like ESP, glucose is a myth.]
It could also reduce the use of questionable research practices by researchers eager to publish in prestigious journals if there was a higher likelihood that the same journal will publish failed replications by independent researchers.It might also motivate more researchers to conduct rigorous replication studies if they can bet against a finding and hope to get a publication in a prestigious journal.
The M-index can be helpful in putting pressure on editors and journals to curb the proliferation of false-positive results because it can be used to evaluate editors and journals in terms of the credibility of the results that are published in these journals.
As everybody knows, the value of a brand rests on trust, and it is easy to destroy this value when consumers lose that trust. Journals that continue to publish incredible results and suppress contradictory replication studies are not going to survive, especially given the fact that the Internet provides an opportunity for authors of repressed replication studies to get their findings out (Spellman, 2012).
[I wrote this in the third revision when I thought the editor would not want to see the manuscript again.]
[I deleted the section where I pick on Ritchie’s failed replications of Bem because three studies with small studies of N = 50 are underpowered and can be dismissed as false positives. Replication studies should have at least the sample size of original studies which was N = 100 for most of Bem’s studies.]
Another solution would be to ignore p values altogether and to focus more on effect sizes and confidence intervals (Cumming & Finch, 2001). Although it is impossible to demonstrate that the true effect size is exactly zero, it is possible to estimate
true effect sizes with very narrow confidence intervals. For example, a sample of N = 1,100 participants would be sufficient to demonstrate that the true effect size of ESP is zero with a narrow confidence interval of plus or minus .05.
If an even more stringent criterion is required to claim a null effect, sample sizes would have to increase further, but there is no theoretical limit to the precision of effect size estimates. No matter whether the focus is on p values or confidence intervals, Cohen’s recommendation that bigger is better, at least for sample sizes, remains true because large samples are needed to obtain narrow confidence intervals (Goodman & Berlin, 1994).
Changing paradigms is a slow process. It took decades to unsettle the stronghold of behaviorism as the main paradigm in psychology. Despite Cohen’s (1962) important contribution to the field 50 years ago and repeated warnings about the problems of underpowered studies, power analysis remains neglected (Maxwell, 2004; Rossi, 1990; Sedlmeier & Gigerenzer, 1989). I hope the M-index can make a small contribution toward the goal of improving the scientific standards of psychology as a science.
Bem’s (2011) article is not going to be a dagger in the heart of questionable research practices, but it may become the historic marker of a paradigm shift.
There are positive signs in the literature on meta-analysis (Sutton & Higgins, 2008), the search for better statistical methods (Wagenmakers, 2007)*, the call for more
open access to data (Schooler, 2011), changes in publication practices of journals (Dirnagl & Lauritzen, 2010), and increasing awareness of the damage caused by questionable research practices (Francis, 2012a, 2012b; John et al., 2012; Kerr, 1998; Simmons
et al., 2011) to be hopeful that a paradigm shift may be underway.
[Another sad story. I did not understand Wagenmaker’s use of Bayesian methods at the time and I honestly thought this work might make a positive contribution. However, in retrospect I realize that Wagenmakers is more interested in selling his statistical approach at any cost and disregards criticisms of his approach that have become evident in recent years. And, yes, I do understand how the method works and why it will not solve the replication crisis (see commentary by Carlsson et al., 2017, in Psychological Science).]
Even the Stapel debacle (Heatherton, 2010), where a prominent psychologist admitted to faking data, may have a healthy effect on the field.
[Heaterton emailed me and I thought he was going to congratulate me on my nice article or thank me for citing him, but he was mainly concerned that quoting him in the context of Stapel might give the impression that he committed fraud.]
After all, faking increases Type I error by 100% and is clearly considered unethical. If questionable research practices can increase Type I error by up to 60% (Simmons et al., 2011), it becomes difficult to maintain that these widely used practices are questionable but not unethical.
[I guess I was a bit optimistic here. Apparently, you can hide as many studies as you want, but you cannot change one data point because that is fraud.]
During the reign of a paradigm, it is hard to imagine that things will ever change. However, for most contemporary psychologists, it is also hard to imagine that there was a time when psychology was dominated by animal research and reinforcement schedules. Older psychologists may have learned that the only constant in life is change.
[Again, too optimistic. Apparently, many old social psychologists still believe things will remain the same as they always were. Insert head in the sand cartoon here.]
I have been fortunate enough to witness historic moments of change such as the falling of the Berlin Wall in 1989 and the end of behaviorism when Skinner gave his last speech at the convention of the American Psychological Association in 1990. In front of a packed auditorium, Skinner compared cognitivism to creationism. There was dead silence, made more audible by a handful of grey-haired members in the audience who applauded
[Only I didn’t realize that research in 1990 had other problems. Nowadays I still think that Skinner was just another professor with a big ego and some published #me_too allegations to his name, but he was right in his concerns about (social) cognitivism as not much more scientific than creationism.]
I can only hope to live long enough to see the time when Cohen’s valuable contribution to psychological science will gain the prominence that it deserves. A better understanding of the need for power will not solve all problems, but it will go a long way toward improving the quality of empirical studies and the credibility of results published in psychological journals. Learning about power not only empowers researchers to conduct studies that can show real effects without the help of questionable research practices but also empowers them to be critical consumers of published research findings.
Knowledge about power is power.
Bem, D. J. (2000). Writing an empirical article. In R. J. Sternberg (Ed.), Guide
to publishing in psychological journals (pp. 3–16). Cambridge, England:
Cambridge University Press. doi:10.1017/CBO9780511807862.002
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous
retroactive influences on cognition and affect. Journal of Personality
and Social Psychology, 100, 407–425. doi:10.1037/a0021524
Bonett, D. G. (2009). Meta-analytic interval estimation for standardized
and unstandardized mean differences. Psychological Methods, 14, 225–
Cohen, J. (1962). Statistical power of abnormal–social psychological research:
A review. Journal of Abnormal and Social Psychology, 65,
Cohen, J. (1990). Things I have learned (so far). American Psychologist,
45, 1304–1312. doi:10.1037/0003-066X.45.12.1304
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
Dasgupta, N., & Greenwald, A. G. (2001). On the malleability of automatic
attitudes: Combating automatic prejudice with images of admired and
disliked individuals. Journal of Personality and Social Psychology, 81,
Diener, E. (1998). Editorial. Journal of Personality and Social Psychology,
74, 5–6. doi:10.1037/h0092824
Dirnagl, U., & Lauritzen, M. (2010). Fighting publication bias: Introducing
the Negative Results section. Journal of Cerebral Blood Flow and
Metabolism, 30, 1263–1264. doi:10.1038/jcbfm.2010.51
Dvorak, R. D., & Simons, J. S. (2009). Moderation of resource depletion
in the self-control strength model: Differing effects of two modes of
self-control. Personality and Social Psychology Bulletin, 35, 572–583.
Erdfelder, E., Faul, F., & Buchner, A. (1996). GPOWER: A general power
analysis program. Behavior Research Methods, 28, 1–11. doi:10.3758/
Fanelli, D. (2010). “Positive” results increase down the hierarchy of the
sciences. PLoS One, 5, Article e10068. doi:10.1371/journal.pone
Faul, F., Erdfelder, E., Buchner, A., & Lang, A. G. (2009). Statistical
power analyses using G*Power 3.1: Tests for correlation and regression
analyses. Behavior Research Methods, 41, 1149–1160. doi:10.3758/
Faul, F., Erdfelder, E., Lang, A. G., & Buchner, A. (2007). G*Power 3: A
flexible statistical power analysis program for the social, behavioral, and
biomedical sciences. Behavior Research Methods, 39, 175–191. doi:
Fiedler, K. (2011). Voodoo correlations are everywhere—not only in
neuroscience. Perspectives on Psychological Science, 6, 163–171. doi:
Francis, G. (2012a). The same old New Look: Publication bias in a study
of wishful seeing. i-Perception, 3, 176–178. doi:10.1068/i0519ic
Francis, G. (2012b). Too good to be true: Publication bias in two prominent
studies from experimental psychology. Psychonomic Bulletin & Review,
19, 151–156. doi:10.3758/s13423-012-0227-9
Fritz, M. S., & MacKinnon, D. P. (2007). Required sample size to detect
the mediated effect. Psychological Science, 18, 233–239. doi:10.1111/
Gailliot, M. T., Baumeister, R. F., DeWall, C. N., Maner, J. K., Plant,
E. A., Tice, D. M., & Schmeichel, B. J. (2007). Self-control relies on
glucose as a limited energy source: Willpower is more than a metaphor.
Journal of Personality and Social Psychology, 92, 325–336. doi:
Gawronski, B. (2009). Ten frequently asked questions about implicit
measures and their frequently supposed, but not entirely correct answers.
Canadian Psychology/Psychologie canadienne, 50, 141–150. doi:
Goodman, S. N., & Berlin, J. A. (1994). The use of predicted confidence
intervals when planning experiments and the misuse of power when
interpreting results. Annals of Internal Medicine, 121, 200–206.
Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring
individual differences in implicit cognition: The implicit association test.
Journal of Personality and Social Psychology, 74, 1464–1480. doi:
Haslam, N., Ban, L., Kaufmann, L., Loughnan, S., Peters, K., Whelan, J.,
& Wilson, S. (2008). What makes an article influential? Predicting
impact in social and personality psychology. Scientometrics, 76, 169–
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in
the world? Behavioral and Brain Sciences, 33, 61–83. doi:10.1017/
Ioannidis, J. P. A. (2005). Why most published research findings are false.
PLoS Medicine, 2(8), Article e124. doi:10.1371/journal.pmed.0020124
Ioannidis, J. P. A., & Trikali nos, T. A. (2007). An exploratory test for an
excess of significant findings. Clinical Trials, 4, 245–253. doi:10.1177/
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence
of questionable research practices with incentives for truth telling.
Psychological Science, 23, 524–532. doi:10.1177/0956797611430953
Joy-Gaba, J. A., & Nosek, B. A. (2010). The surprisingly limited malleability
of implicit racial evaluations. Social Psychology, 41, 137–146.
Judd, C. M., & Gawronski, B. (2011). Editorial comment. Journal of
Personality and Social Psychology, 100, 406. doi:10.1037/0022789
Kerr, N. L. (1998). HARKing: Hypothezising after the results are known.
Personality and Social Psychology Review, 2, 196–217. doi:10.1207/
Kurzban, R. (2010). Does the brain consume additional glucose during
self-control tasks? Evolutionary Psychology, 8, 244–259.
Ledford, H. (2010, August 17). Harvard probe kept under wraps. Nature,
466, 908–909. doi:10.1038/466908a
Ledgerwood, A., & Sherman, J. W. (2012). Short, sweet, and problematic?
The rise of the short report in psychological science. Perspectives on Psychological Science, 7, 60–66. doi:10.1177/1745691611427304
Maxwell, S. E. (2004). The persistence of underpowered studies in psychological
research: Causes, consequences, and remedies. Psychological Methods, 9, 147–163. doi:10.1037/1082-989X.9.2.147
Milloy, J. S. (1995). Science without sense: The risky business of public
health research. Washington, DC: Cato Institute.
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting
an optimal that minimizes errors in null hypothesis significance tests.
PLoS One, 7(2), Article e32734. doi:10.1371/journal.pone.0032734
Murphy, S. T., Zajonc, R. B., & Monahan, J. L. (1995). Additivity of
nonconscious affect: Combined effects of priming and exposure. Journal
of Personality and Social Psychology, 69, 589–602. doi:10.1037/0022-
Ritchie, S. J., Wiseman, R., & French, C. C. (2012a). Failing the future:
Three unsuccessful attempts to replicate Bem’s “retroactive facilitation
of recall” effect. PLoS One, 7(3), Article e33423. doi:10.1371/
Rossi, J. S. (1990). Statistical power of psychological research: What have
we gained in 20 years? Journal of Consulting and Clinical Psychology,
58, 646–656. doi:10.1037/0022-006X.58.5.646
Royall, R. M. (1986). The effect of sample size on the meaning of
significance tests. American Statistician, 40, 313–315. doi:10.2307/
Schmidt, F. (2010). Detecting and correcting the lies that data tell. Perspectives
on Psychological Science, 5, 233–242. doi:10.1177/
Schooler, J. (2011, February 23). Unpublished results hide the decline
effect. Nature, 470, 437. doi:10.1038/470437a
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power
have an effect on the power of studies? Psychological Bulletin, 105,
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive
psychology: Undisclosed flexibility in data collection and analysis allows
presenting anything as significant. Psychological Science, 22,
Smith, E. R. (2012). Editorial. Journal of Personality and Social Psychology,
102, 1–3. doi:10.1037/a0026676
Spellman, B. A. (2012). Introduction to the special section: Data, data,
everywhere . . . especially in my file drawer. Perspectives on Psychological
Science, 7, 58–59. doi:10.1177/1745691611432124
Steen, R. G. (2011a). Retractions in the scientific literature: Do authors
deliberately commit research fraud? Journal of Medical Ethics, 37,
Steen, R. G. (2011b). Retractions in the scientific literature: Is the incidence
of research fraud increasing? Journal of Medical Ethics, 37,
Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance— or vice versa. Journal of the American Statistical Association, 54(285), 30–34. doi:10.2307/ 2282137
Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice-versa. American Statistician, 49, 108–112. doi:10.2307/2684823
Strube, M. J. (2006). SNOOP: A program for demonstrating the consequences
of premature and repeated null hypothesis testing. Behavior
Research Methods, 38, 24–27. doi:10.3758/BF03192746
Sutton, A. J., & Higgins, J. P. I. (2008). Recent developments in metaanalysis.
Statistics in Medicine, 27, 625–650. doi:10.1002/sim.2934
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high
correlations in fMRI studies of emotion, personality, and social cognition.
Perspectives on Psychological Science, 4, 274–290. doi:10.1111/
Wagenmakers, E. J. (2007). A practical solution to the pervasive problems
of p values. Psychonomic Bulletin & Review, 14, 779–804. doi:10.3758/
Wagenmakers, E. J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J.
(2011). Why psychologists must change the way they analyze their data:
The case of psi: Comment on Bem (2011). Journal of Personality and
Social Psychology, 100, 426–432. doi:10.1037/a0022790
Wegner, D. M. (1992). The premature demise of the solo experiment.
Personality and Social Psychology Bulletin, 18, 504–508. doi:10.1177/
Yarkoni, T. (2009). Big correlations in little studies: Inflated fMRI correlations
reflect low statistical power—Commentary on Vul et al. (2009).
Perspectives on Psychological Science, 4, 294–298. doi:10.1111/j.1745-
Yong, E. (2012, May 16). Bad copy. Nature, 485, 298–300. doi:10.1038/
Yuan, K. H., & Maxwell, S. (2005). On the post hoc power in testing mean
differences. Journal of Educational and Behavioral Statistics, 30, 141–
Received May 30, 2011
Revision received June 18, 2012
Accepted June 25, 2012
Further Revised February 18, 2018
Thanks to social media, geography is no longer a barrier for scientific discourse. However, language is still a barrier. Fortunately, I understand German and I can respond to the official statement of the board of the German Psychological Association (DGPs), which was posted on the DGPs website (in German).
On September 1, 2015, Prof. Dr. Andrea Abele-Brehm, Prof. Dr. Mario Gollwitzer, and Prof. Dr. Fritz Strack published an official response to the results of the OSF-Replication Project – Psychology (in German) that was distributed to public media in order to correct potentially negative impressions about psychology as a science.
Numerous members of DGPs felt that this official statement did not express their views and noticed that members were not consulted about the official response of their organization. In response to this criticism, DGfP opened a moderated discussion page, where members could post their personal views (mostly in German).
On October 6, 2015, the board closed the discussion page and posted some final words (Schlussbeitrag). In this blog, I provide a critical commentary on these final words.
BOARD’S RESPONSE TO COMMENTS
The board members provide a summary of the core insights and arguments of the discussion from their (personal/official) perspective.
„Wir möchten nun die aus unserer Sicht zentralen Erkenntnisse und Argumente der unterschiedlichen Forumsbeiträge im Folgenden zusammenfassen und deutlich machen, welche vorläufigen Erkenntnisse wir im Vorstand aus ihnen ziehen.“
1. 68% success rate?
The first official statement suggested that the replication project showed that 68% of studies. This number is based on significance in a meta-analysis of the original and replication study. Critics pointed out that this approach is problematic because the replication project showed clearly that the original effect sizes were inflated (on average by 100%). Thus, the meta-analysis is biased and the 68% number is inflated.
In response to this criticism, the DGPs board states that “68% is the maximum [größtmöglich] optimistic estimate.” I think the term “biased and statistically flawed estimate” is a more accurate description of this estimate. It is common practice to consider fail-safe-N or to correct meta-analysis for publication bias. When there is clear evidence of bias, it is unscientific to report the biased estimate. This would be like saying that the maximum optimistic estimate of global warming is that global warming does not exist. This is probably a true statement about the most optimistic estimate, but not a scientific estimate of the actual global warming that has been taking place. There is no place for optimism in science. Optimism is a bias and the aim of science is to remove bias. If DGPs wants to represent scientific psychology, the board should post what they consider the most accurate estimate of replicability in the OSF-project.
2. The widely cited 36% estimate is negative.
The board members then justify the publication of the maximally optimistic estimate as a strategy to counteract negative perceptions of psychology as a science in response to the finding that only 36% of results were replicated. The board members felt that these negative responses misrepresent the OSF-project and psychology as a scientific discipline.
„Dies wird weder dem Projekt der Open Science Collaboration noch unserer Disziplin insgesamt gerecht. Wir sollten jedoch bei der konstruktiven Bewältigung der Krise Vorreiter innerhalb der betroffenen Wissenschaften sein.“
However, reporting the dismal 36% replication rate of the OSF-replication project is not a criticism of the OSF-project. Rather, it assumes that the OSF-replication project was a rigorous and successful attempt to provide an estimate of the typical replicability of results published in top psychology journals. The outcome could have been 70% or 35%. The quality of the project does not depend on the result. The result is also not a negatively biased perception of psychology as a science. It is an objective scientific estimate of the probability that a reported significant result in a journal would produce a significant result again in a replication study. Whether 36% is acceptable or not can be debated, but it seems problematic to post a maximally optimistic estimate to counteract negative implications of an objective estimate.
3. Is 36% replicability good or bad?
Next, the board ponders the implications of the 36% success rate. “How should we evaluate this number?” The board members do not know. According to their official conclusion, this question is complex as divergent contributions on the discussion page suggest.
„Im Science-Artikel wurde die relative Häufigkeit der in den Replikationsstudien statistisch bedeutsamen Effekte mit 36% angegeben. Wie ist diese Zahl zu bewerten? Wie komplex die Antwort auf diese Frage ist, machen die Forumsbeiträge von Roland Deutsch, Klaus Fiedler, Moritz Heene (s.a. Heene & Schimmack) und Frank Renkewitz deutlich.“
To help the board members to understand the number, I can give a brief explanation of replicability. Although there are several ways to define replicability, one plausible definition of replicability is to equate it with statistical power. Statistical power is the probability that a study will produce a significant result. A study with 80% power has an 80% probability to produce a significant result. For a set of 100 studies, one would expect roughly 80 significant results and 20 non-significant results. For 100 studies with 36% power, one would expect roughly 36 significant results and 64 non-significant results. If researchers would publish all studies, the percentage of published significant results would provide an unbiased estimate of the typical power of studies. However, it is well known that significant results are more likely to be written up, submitted for publication, and accepted for publication. These reporting biases explain why psychology journals report over 90% significant results, although the actual power of studies is less than 90%.
In 1962, Jacob Cohen provided the first attempt to estimate replicability of psychological results. His analysis suggested that psychological studies have approximately 50% power. He suggested that psychologists should increase power to 80% to provide robust evidence for effects and to avoid wasting resources on studies that cannot detect small, but practically important effects. For the next 50 years, psychologists have ignored Cohen’s warning that most studies are underpowered, despite repeated reminders that there are no signs of improvement, including reminders by prominent German psychologists like Gerg Giegerenzer, director of a Max Planck Institute (Sedlmeier & Giegerenzer, 1989; Maxwell, 2004; Schimmack, 2012).
The 36% success rate for an unbiased set of 100 replication studies, suggest that the actual power of published studies in psychology journals is 36%. The power of all studies conducted is even lower because the p < .05 selection criterion favors studies with higher power. Does the board think 36% power is an acceptable amount of power?
4. Psychologists should improve replicability in the future
On a positive note, the board members suggest that, after careful deliberation, psychologists need to improve replicability so that it can be demonstrated in a few years that replicability has increased.
„Wir müssen nach sorgfältiger Diskussion unter unseren Mitgliedern Maßnahmen ergreifen (bei Zeitschriften, in den Instituten, bei Förderorganisationen, etc.), die die Replikationsquote im temporalen Vergleich erhöhen können.“
The board members do not mention a simple solution to the replicabilty problem that was advocated over 50 years ago by Jacob Cohen. To increase replicability, psychologists have to think about the strength of the effects that they are investigating and they have to conduct studies that have a realistic chance to distinguish these effects from variation due to random error. This often means investing more resources (larger samples, repeated trials, etc.) in a single study. Unfortunately, the leaders of German psychologists appear to be unaware of this important and simple solution to the replication crisis. They neither mention power as a cause of the problem, nor do they recommend increasing power to increase replicability in the future.
5. Do the Results Reveal Fraud?
The DGPs board members then discuss the possibility that the OSF-reproducibilty results reveal fraud, like the fraud committed by Stapel. The board points out that the OSF-results do not imply that psychologists commit fraud because failed replications can occur for various reasons.
„Viele Medien (und auch einige Kolleginnen und Kollegen aus unserem Fach) nennen die Befunde der Science-Studie im gleichen Atemzug mit den Betrugsskandalen, die unser Fach in den letzten Jahren erschüttert haben. Diese Assoziation ist unserer Meinung nach problematisch: sie suggeriert, die geringe Replikationsrate sei auf methodisch fragwürdiges Verhalten der Autor(inn)en der Originalstudien zurückzuführen.“
It is true that the OSF-results do not reveal fraud. However, the board members confuse fraud with questionable research practices. Fraud is defined as fabricating data that were never collected. Only one of the 100 studies in the OSF-replication project (by Jens Förster, a former student of Fritz Strack, one of the board members) is currently being investigated for fraud by the University of Amsterdam. Despite very strong results in the original study, it failed to replicate.
The more relevant question is how much questionable research practices contributed to the results. Questionable research practices are practices where data are being collected, but statistical results are only being reported if they produce a significant result (studies, conditions, dependent variables, data points that do not produce significant results are excluded from the results that are being submitted for publication. It has been known for over 50 years that these practices produce a discrepancy between the actual power of studies and the rate of significant results that are published in psychology journals (Sterling, 1959).
Recent statistical developments have made it possible to estimate the true power of studies after correcting for publication bias. Based on these calculations, the true power of the original studies in the OSF-project was only 50%. Thus a large portion of the discrepancy between nearly 100% reported significant results and a replication success rate of 36% is explained by publication bias (see R-Index blogs for social psychology and cognitive psychology).
Other factors may contribute to the discrepancy between the statistical prediction that the replication success rate would be 50% and the actual success rate of 36%. Nevertheless, the lion share of the discrepancy can be explained by the questionable practice to report only evidence that supports a hypothesis that a researcher wants to support. This motivated bias undermines the very foundations of science. Unfortunately, the board ignores this implication of the OSF results.
6. What can we do?
The board members have no answer to this important question. In the past four years, numerous articles have been published that have made suggestions how psychology can improve its credibility as a science. Yet, the DPfP board seems to be unaware of these suggestions or unable to comment on these proposals.
„Damit wären wir bei der Frage, die uns als Fachgesellschaft am stärksten beschäftigt und weiter beschäftigen wird. Zum einen brauchen wir eine sorgfältige Selbstreflexion über die Bedeutung von Replikationen in unserem Fach, über die Bedeutung der neuesten Science-Studie sowie der weiteren, zurzeit noch im Druck oder in der Phase der Auswertung befindlichen Projekte des Center for Open Science (wie etwa die Many Labs-Studien) und über die Grenzen unserer Methoden und Paradigmen“
The time for more discussion has passed. After 50 years of ignoring Jacob Cohen’s recommendation to increase statistical power it is time for action. If psychologists are serious about replicability, they have to increase the power of their studies.
The board then discusses the possibility of measuring and publishing replication rates at the level of departments or individual scientists. They are not in favor of such initiatives, but they provide no argument for their position.
„Datenbanken über erfolgreiche und gescheiterte Replikationen lassen sich natürlich auch auf der Ebene von Instituten oder sogar Personen auswerten (wer hat die höchste Replikationsrate, wer die niedrigste?). Sinnvoller als solche Auswertungen sind Initiativen, wie sie zurzeit (unter anderem) an der LMU an der LMU München implementiert wurden (siehe den Beitrag von Schönbrodt und Kollegen).“
The question is why replicability should not be measured and used to evaluate researchers. If the board really valued replicability and wanted to increase replicability in a few years, wouldn’t it be helpful to have a measure of replicability and to reward departments or researchers who invest more resources in high powered studies that can produce significant results without the need to hide disconfirming evidence in file-drawers? A measure of replicability is also needed because current quantitative measures of scientific success are one of the reasons for the replicability crisis. The most successful researchers are those who publish the most significant results, no matter how these results were obtained (with the exception of fraud). To change this unscientific practice of significance chasing, it is necessary to have an alternative indicator of scientific quality that reflects how significant results were obtained.
The board makes some vague concluding remarks that are not worthwhile repeating here. So let me conclude with my own remarks.
The response of the DGPs board is superficial and does not engage with the actual arguments that were exchanged on the discussion page. Moreover, it ignores some solid scientific insights into the causes of the replicability crisis and it makes no concrete suggestions how German psychologists should change their behaviors to improve the credibility of psychology as a science. Not once do they point out that the results of the OSF-project were predictable based on the well-known fact that psychological studies are underpowered and that failed studies are hidden in file-drawers.
I received my education in Germany all the way to the Ph.D at the Free University in Berlin. I had several important professors and mentors that educated me about philosophy of science and research methods (Rainer Reisenzein, Hubert Feger, Hans Westmeyer, Wolfgang Schönpflug). I was a member of DGPs for many years. I do not believe that the opinion of the board members represent a general consensus among German psychologists. I hope that many German psychologists recognize the importance of replicability and are motivated to make changes to the way psychologists conduct research. As I am no longer a member of DGfP, I have no direct influence on it, but I hope that the next election will elect a candidate that will promote open science, transparency, and above all scientific integrity.
Imagine an NBA player has an 80% chance to make one free throw. What is the chance that he makes both free throws? The correct answer is 64% (80% * 80%).
Now consider the possibility that it is possible to distinguish between two types of free throws. Some free throws are good; they don’t touch the rim and make a swishing sound when they go through the net (all net). The other free throws bounce of the rim and go in (rattling in).
What is the probability that an NBA player with an 80% free throw percentage makes a free throw that is all net or rattles in? It is more likely that an NBA player with an 80% free throw average makes a perfect free throw because a free throw that rattles in could easily have bounded the wrong way, which would lower the free throw percentage. To achieve an 80% free throw percentage, most free throws have to be close to perfect.
Let’s say the probability of hitting the rim and going in is 30%. With an 80% free throw average, this means that the majority of free throws are in the close-to-perfect category (20% misses, 30% rattle-in, 50% close-to-perfect).
What does this have to do with science? A lot!
The reason is that the outcome of a scientific study is a bit like throwing free throws. One factor that contributes to a successful study is skill (making correct predictions, avoiding experimenter errors, and conducting studies with high statistical power). However, another factor is random (a lucky or unlucky bounce).
The concept of statistical power is similar to an NBA players’ free throw percentage. A researcher who conducts studies with 80% statistical power is going to have an 80% success rate (that is, if all predictions are correct). In the remaining 20% of studies, a study will not produce a statistically significant result, which is equivalent to missing a free throw and not getting a point.
Many years ago, Jacob Cohen observed that researchers often conduct studies with relatively low power to produce a statistically significant result. Let’s just assume right now that a researcher conducts studies with 60% power. This means, researchers would be like NBA players with a 60% free-throw average.
Now imagine that researchers have to demonstrate an effect not only once, but also a second time in an exact replication study. That is researchers have to make two free throws in a row. With 60% power, the probability to get two significant results in a row is only 36% (60% * 60%). Moreover, many of the freethrows that are made rattle in rather than being all net. The percentages are about 40% misses, 30% rattling in and 30% all net.
One major difference between NBA players and scientists is that NBA players have to demonstrate their abilities in front of large crowds and TV cameras, whereas scientists conduct their studies in private.
Imagine an NBA player could just go into a private room, throw two free throws and then report back how many free throws he made and the outcome of these free throws determine who wins game 7 in the playoff finals. Would you trust the player to tell the truth?
If you would not trust the NBA player, why would you trust scientists to report failed studies? You should not.
It can be demonstrated statistically that scientists are reporting more successes than the power of their studies would justify (Sterling et al., 1995; Schimmack, 2012). Amongst scientists this fact is well known, but the general public may not fully appreciate the fact that a pair of exact replication studies with significant results is often just a selection of studies that included failed studies that were not reported.
Fortunately, it is possible to use statistics to examine whether the results of a pair of studies are likely to be honest or whether failed studies were excluded. The reason is that an amateur is not only more likely to miss a free throw. An amateur is also less likely to make a perfect free throw.
Based on the theory of statistical power developed by Nyman and Pearson and popularized by Jacob Cohen, it is possible to make predictions about the relative frequency of p-values in the non-significant (failure), just significant (rattling in), and highly significant (all net) ranges.
As for made-free-throws, the distinction between lucky and clear successes is somewhat arbitrary because power is continuous. A study with a p-value of .0499 is very lucky because p = .501 would have been not significant (rattled in after three bounces on the rim). A study with p = .000001 is a clear success. Lower p-values are better, but where to draw the line?
As it turns out, Jacob Cohen’s recommendation to conduct studies with 80% power provides a useful criterion to distinguish lucky outcomes and clear successes.
Imagine a scientist conducts studies with 80% power. The distribution of observed test-statistics (e.g. z-scores) shows that this researcher has a 20% chance to get a non-significant result, a 30% chance to get a lucky significant result (p-value between .050 and .005), and a 50% chance to get a clear significant result (p < .005). If the 20% failed studies are hidden, the percentage of results that rattled in versus studies with all-net results are 37 vs. 63%. However, if true power is just 20% (an amateur), 80% of studies fail, 15% rattle in, and 5% are clear successes. If the 80% failed studies are hidden, only 25% of the successful studies are all-net and 75% rattle in.
One problem with using this test to draw conclusions about the outcome of a pair of exact replication studies is that true power is unknown. To avoid this problem, it is possible to compute the maximum probability of a rattling-in result. As it turns out, the optimal true power to maximize the percentage of lucky outcomes is 66% power. With true power of 66%, one would expect 34% misses (p > .05), 32% lucky successes (.050 < p < .005), and 34% clear successes (p < .005).
For a pair of exact replication studies, this means that there is only a 10% chance (32% * 32%) to get two rattle-in successes in a row. In contrast, there is a 90% chance that misses were not reported or that an honest report of successful studies would have produced at least one all-net result (z > 2.8, p < .005).
Example: Unconscious Priming Influences Behavior
I used this test to examine a famous and controversial set of exact replication studies. In Bargh, Chen, and Burrows (1996), Dr. Bargh reported two exact replication studies (studies 2a and 2b) that showed an effect of a subtle priming manipulation on behavior. Undergraduate students were primed with words that are stereotypically associated with old age. The researchers then measured the walking speed of primed participants (n = 15) and participants in a control group (n = 15).
The two studies were not only exact replications of each other; they also produced very similar results. Most readers probably expected this outcome because similar studies should produce similar results, but this false belief ignores the influence of random factors that are not under the control of a researcher. We do not expect lotto winners to win the lottery again because it is an entirely random and unlikely event. Experiments are different because there could be a systematic effect that makes a replication more likely, but in studies with low power results should not replicate exactly because random sampling error influences results.
Study 1: t(28) = 2.86, p = .008 (two-tailed), z = 2.66, observed power = 76%
Study 2: t(28) = 2.16, p = .039 (two-tailed), z = 2.06, observed power = 54%
The median power of these two studies is 65%. However, even if median power were lower or higher, the maximum probability of obtaining two p-values in the range between .050 and .005 remains just 10%.
Although this study has been cited over 1,000 times, replication studies are rare.
One of the few published replication studies was reported by Cesario, Plaks, and Higgins (2006). Naïve readers might take the significant results in this replication study as evidence that the effect is real. However, this study produced yet another lucky success.
Study 3: t(62) = 2.41, p = .019, z = 2.35, observed power = 65%.
The chances of obtaining three lucky successes in a row is only 3% (32% *32% * 32*). Moreover, with a median power of 65% and a reported success rate of 100%, the success rate is inflated by 35%. This suggests that the true power of the reported studies is considerably lower than the observed power of 65% and that observed power is inflated because failed studies were not reported.
The R-Index corrects for inflation by subtracting the inflation rate from observed power (65% – 35%). This means the R-Index for this set of published studies is 30%.
This R-Index can be compared to several benchmarks.
An R-Index of 22% is consistent with the null-hypothesis being true and failed attempts are not reported.
An R-Index of 40% is consistent with 30% true power and all failed attempts are not reported.
It is therefore not surprising that other researchers were not able to replicate Bargh’s original results, even though they increased statistical power by using larger samples (Pashler et al. 2011, Doyen et al., 2011).
In conclusion, it is unlikely that Dr. Bargh’s original results were the only studies that they conducted. In an interview, Dr. Bargh revealed that the studies were conducted in 1990 and 1991 and that they conducted additional studies until the publication of the two studies in 1996. Dr. Bargh did not reveal how many studies they conducted over the span of 5 years and how many of these studies failed to produce significant evidence of priming. If Dr. Bargh himself conducted studies that failed, it would not be surprising that others also failed to replicate the published results. However, in a personal email, Dr. Bargh assured me that “we did not as skeptics might presume run many studies and only reported the significant ones. We ran it once, and then ran it again (exact replication) in order to make sure it was a real effect.” With a 10% probability, it is possible that Dr. Bargh was indeed lucky to get two rattling-in findings in a row. However, his aim to demonstrate the robustness of an effect by trying to show it again in a second small study is misguided. The reason is that it is highly likely that the effect will not replicate or that the first study was already a lucky finding after some failed pilot studies. Underpowered studies cannot provide strong evidence for the presence of an effect and conducting multiple underpowered studies reduces the credibility of successes because the probability of this outcome to occur even when an effect is present decreases with each study (Schimmack, 2012). Moreover, even if Bargh was lucky to get two rattling-in results in a row, others will not be so lucky and it is likely that many other researchers tried to replicate this sensational finding, but failed to do so. Thus, publishing lucky results hurts science nearly as much as the failure to report failed studies by the original author.
Dr. Bargh also failed to realize how lucky he was to obtain his results, in his response to a published failed-replication study by Doyen. Rather than acknowledging that failures of replication are to be expected, Dr. Bargh criticized the replication study on methodological grounds. There would be a simple solution to test Dr. Bargh’s hypothesis that he is a better researcher and that his results are replicable when the study is properly conducted. He should demonstrate that he can replicate the result himself.
In an interview, Tom Bartlett asked Dr. Bargh why he didn’t conduct another replication study to demonstrate that the effect is real. Dr. Bargh’s response was that “he is aware that some critics believe he’s been pulling tricks, that he has a “special touch” when it comes to priming, a comment that sounds like a compliment but isn’t. “I don’t think anyone would believe me,” he says.” The problem for Dr. Bargh is that there is no reason to believe his original results, either. Two rattling-in results alone do not constitute evidence for an effect, especially when this result could not be replicated in an independent study. NBA players have to make free-throws in front of a large audience for a free-throw to count. If Dr. Bargh wants his findings to count, he should demonstrate his famous effect in an open replication study. To avoid embarrassment, it would be necessary to increase the power of the replication study because it is highly unlikely that even Dr. Bargh can continuously produce significant results with samples of N = 30 participants. Even if the effect is real, sampling error is simply too large to demonstrate the effect consistently. Knowledge about statistical power is power. Knowledge about post-hoc power can be used to detect incredible results. Knowledge about a priori power can be used to produce credible results.
The authors distinguish between fraud and QRPs. Fraud is typically limited to cases in which researchers create false data. In contrast, QRPs typically involve the exclusion of data that are inconsistent with a theoretical hypothesis. QRPs are treated differently than fraud because QRPs can sometimes be used for legitimate purposes.
For example, a data entry error may produce a large outlier that leads to a non-significant result when all data are included in the analysis. The results are significant when the outlier is removed. Statistical textbook often advise to exclude outliers for this reason. However, removal of outliers becomes a QRP when it is used selectively. That is, outliers are not removed when a result is significant or when the outlier helps to produce a significant result, but outliers are removed when removal of outliers helps to get a significant result.
The use of QRPs is damaging because published results provide false impressions about the replicability of empirical results and misleading evidence about the size of an effect.
Below is a list of QRPs.
Selective reporting of (dependent) variables. For example, a researcher may include 10 items to measure depression. Typically, the 10 items are averaged to get the best measure of depression. However, if this analysis does not produce a significant result, the researcher can conduct analyses of each individual item or average items that trend in the right direction. By creating different dependent variables after the study is completed, a researcher increases the chances of obtaining a significant result that will not replicate in a replication study with the same dependent variable.
A simple solution to preventing this QRP is to ask authors to use well-established measures as dependent variables and/or to ask for pre-registration of all measures that are relevant to the test of a theoretical hypothesis (i.e., it is not necessary to specify that the study also asked about handedness because handedness is not a measure of depression).
Deciding whether to collect more data after looking to see whether the results will be significant. It is difficult to distinguish random variation from a true effect in small samples. At the same time, it can be a costly waste of resources (or even unethical in animal research) to conduct studies with large samples, when the effect can be detected in a smaller sample. It is also difficult to know a priori how large a sample should be to obtain a significant result. It therefore seems reasonable to check data while they are being collected for significance. If an effect does not seem to be present in a reasonably large sample size, it may be better to abandon a study. None of these practices are problematic unless a researcher constantly checks for significance and stops data collection immediately after the data show a significant result. This practice capitalizes on sampling error and the experiment will typically stop when sampling error inflates the true effect size.
A simple solution to this problem is to set some a priori rules about the end of data collection. For example, a researcher may calculate sample size based on a rough power analysis. Based on an optimistic assumption that the true effect is large, the data will be checked when the study has 80% power for a large effect (d = .8). If this does not result in a significant result, the researcher continues with the revised hypothesis that the true effect is moderate and then checks the data again when 80% power for a moderate effect is reached. If this does not result in a significant result, the researcher may give up or continue with the revised hypothesis that the true effect is small. This procedure would allow researchers to use an optimal amount of resources. Moreover, they can state there sampling strategy openly so that meta-analysts can make corrections for the small amount of biases that is still introduced by this reasonable form of optional stopping.
Failing to disclose experimental conditions. There are no justifiable reasons for the exclusion of conditions. Evidently, researchers are not going to exclude conditions that are consistent with theoretical predictions. So, the exclusion of conditions can only produce results that are overly consistent with theoretical predictions. If there are reasonable doubts about a condition (e.g., a manipulation check shows that it did not work), the condition can be included and it can be explained why the results may not conform to predictions).
A simple solution to the problem of conditions with unexpected results is that researchers may include too many conditions in their design. A 2 x 2 x 2 factorial design has 8 cells, which allows for 28 comparisons of means. What are the chances that all of these 28 comparisons produce results that are consistent with theoretical predictions?
Another simple solution is to avoid the use of statistical methods with low power. To demonstrate a three-way interaction requires a lot more data than to demonstrate that a pattern of means is consistent with an a priori theoretically predicted pattern.
In a paper reporting selectively studies that worked.
There is no reason for excluding studies that did not work. Excluding studies that were planned as demonstrations of an effect need to be reported. Otherwise the published evidence provides an overly positive picture of the robustness of a phenomenon and effect sizes are inflated.
Just like failed conditions, failed studies can be reported if there is a plausible explanation why it failed whereas other studies worked. However, to justify this claim, it should be demonstrated that the effects in failed and successful studies are really significantly different (a significant moderator effect). If this is not the case, there is no reason to treat failed and successful studies as different from each other.
A simple solution to this problem is to conduct studies with high statistical power because the main reason for failed studies is that studies have low power. If a study has only 30% power, only one out of three studies will produce a significant result. The other two studies are likely to produce a type-II error (not show a significant result when the effect exists). Rather than throwing away the two failed studies, a researcher should have conducted one study with higher power. Another solution is to report all three studies and to test for significance only in a meta-analysis across the three studies.
In a paper, rounding off a p-value just above .054 and claim that it is below .05. This is a minor problem. It is silly to change a p-value, but it does not bias a meta-analysis of effect sizes because researchers do not change effect size information. Moreover, it would be even more silly not to change the p-value and conclude that there is no effect, which is often the case when results are not significant. After all, a p-value of .054 means that the effect in this study would have occurred if the true effect is zero or has the opposite sign.
If a type-I error probability of 5.4% is considered too high, it would be possible to collect more data and test again with a larger sample (taking multiple testing into account).
Moreover, this problem should arise very infrequently. Even if a study is underpowered and has only 50% power, only 2% of p-values are expected to fall into the narrow range between .050 and .054.
In a paper, reporting an unexpected finding as having been predicted from the start. I am sure some statisticians disagree with me and I may be wrong about this one, but I simply do not understand how a statistical analysis of some data cares about the expectations of a researcher. Say, I analyze some data and find a significant effect in the data. How can this effect be influenced by the way I report it later? It may be a type-I error or it is not a type-I error, but my expectations have no influence on the causal processes that produced the empirical data. I think the practice of writing exploratory studies as if they were conducted an a priori hypothesis is considered questionable because it often requires other QRPs (e.g., excluding additional tests that didn’t work) to produce a story that is concocted to explain unexpected results. However, if the results are presented honestly and one out of five predictor variables in a multiple-regression is significant at p < .0001, it is likely to be a replicable finding, even if it is presented with a post-hoc prediction.
In a paper, claiming that results are unaffected by demographic variables (e.g., gender) when one is actually unsure (or knows that they do). Again, this is a relatively minor point because it only speaks about potential moderators of a reported effect. Moderation is important, but the conclusion about the main effect remains unchanged. For example, if an effect exists for men, but not for women, it is still true that on average there is an effect. Furthermore, a more common mistake is often to claim that gender or other factors did not moderate an effect based on an underpowered comparison of 10 men and 30 women in a study with 40 participants. Thus, false claims about moderating variables are annoying, but not a threat to the replicability of empirical results.
Falsifying Data. I personally do not include falsifying or fabricating of data in the list of questionable research practices. I think falsifying and fabrication of data is not a research practice. It is also something that is clearly considered fraudulent and punished when it is discovered. In contrast, questionable research practices are tolerated in many scientific communities and there are no clear guidelines against the use of these practices.
In conclusion, the most problematic research practices that undermine the replicability of published studies are selective reporting of dependent variables, conditions, or entire studies, and optional stopping when significance is reached. These practices make it possible to produce significant results when a study has insufficient power. However, to achieve significance without power, the type-I error rate also increases and replicability decreases. John et al. (2012) aptly compared these QRPs to the use of doping in sports. I consider the R-Index a doping test for science because it reveals that researchers used these QRPs. I hope that the R-Index will discourage the use of QRPs and increase the power and replicability of published studies.
Whether scientific organizations should ban QRPs just like sports organizations ban doping is an interesting question. Meanwhile the R-Index can be used without draconian consequences. Researchers can self-examine the replicability of their findings and they can examine the replicability of published results before they invest resources, time, and the future of their graduate students in research projects that fail. Granting agencies can use the R-Index to reward researchers who conduct fewer studies with replicable results rather than researchers with many studies that fail to replicate. Finally, the R-Index can be used to track how successful current initiatives are to increase the replicability of published studies.
In 2011, Dr. Förster published an article in Journal of Experimental Psychology: General. The article reported 12 studies and each study reported several hypothesis tests. The abstract reports that “In all experiments, global/local processing in 1 modality shifted to global/local processing in the other modality”.
For a while this article was just another article that reported a large number of studies that all worked and neither reviewers nor the editor who accepted the manuscript for publication found anything wrong with the reported results.
In 2012, an anonymous letter voiced suspicion that Jens Forster violated rules of scientific misconduct. The allegation led to an investigation, but as of today (January 1, 2015) there is no satisfactory account of what happened. Jens Förster maintains that he is innocent (5b. Brief von Jens Förster vom 10. September 2014) and blames the accusations about scientific misconduct on a climate of hypervigilance after the discovery of scientific misconduct by another social psychologist.
The accusation is based on an unusual statistical pattern in three publications. The 3 articles reported 40 experiments with 2284 participants, that is an average sample size of N = 57 participants in each experiment. The 40 experiments all had a between-subject design with three groups: one group received a manipulation design to increase scores on the dependent variable. A second group received the opposite manipulation to decrease scores on the dependent variable. And a third group served as a control condition with the expectation that the average of the group would fall in the middle of the two other groups. To demonstrate that both manipulations have an effect, both experimental groups have to show significant differences from the control group.
The accuser noticed that the reported means were unusually close to a linear trend. This means that the two experimental conditions showed markedly symmetrical deviations from the control group. For example, if one manipulation increased scores on the dependent variables by half a standard deviation (d = +.5), the other manipulation decreased scores on the dependent variable by half a standard deviation (d = -.5). Such a symmetrical pattern can be expected when the two manipulations are equally strong AND WHEN SAMPLE SIZES ARE LARGE ENOUGH TO MINIMIZE RANDOM SAMPLING ERROR. However, the sample sizes were small (n = 20 per condition, N = 60 per study). These sample sizes are not unusual and social psychologists often use n = 20 per condition to plan studies. However, these sample sizes have low power to produce consistent results across a large number of studies.
The accuser computed the statistical probability of obtaining the reported linear trend. The probability of obtaining the picture-perfect pattern of means by chance alone was incredibly small.
Based on this finding, the Dutch National Board for Research Integrity (LOWI) started an investigation of the causes for this unlikely finding. An English translation of the final report was published on retraction watch. An important question was whether the reported results could have been obtained by means of questionable research practices or whether the statistical pattern can only be explained by data manipulation. The English translation of the final report includes two relevant passages.
According to one statistical expert “QRP cannot be excluded, which in the opinion of the expert is a common, if not “prevalent” practice, in this field of science.” This would mean that Dr. Förster acted in accordance with scientific practices and that his behavior would not constitute scientific misconduct.
In response to this assessment the Complainant “extensively counters the expert’s claim that the unlikely patterns in the experiments can be explained by QRP.” This led to the decision that scientific misconduct occurred.
Four QRPs were considered.
Improper rounding of p-values. This QRP can only be used rarely when p-values happen to be close to .05. It is correct that this QRP cannot produce highly unusual patterns in a series of replication studies. It can also be easily checked by computing exact p-values from reported test statistics.
Selecting dependent variables from a set of dependent variables. The articles in question reported several experiments that used the same dependent variable. Thus, this QRP cannot explain the unusual pattern in the data.
Collecting additional research data after an initial research finding revealed a non-significant result. This description of an QRP is ambiguous. Presumably it refers to optional stopping. That is, when the data trend in the right direction to continue data collection with repeated checking of p-values and stopping when the p-value is significant. This practices lead to random variation in sample sizes. However, studies in the reported articles all have more or less 20 participants per condition. Thus, optional stopping can be ruled out. However, if a condition with 20 participants does not produce a significant result, it could simply be discarded, and another condition with 20 participants could be run. With a false-positive rate of 5%, this procedure will eventually yield the desired outcome while holding sample size constant. It seems implausible that Dr. Förster conducted 20 studies to obtain a single significant result. Thus, it is even more plausible that the effect is actually there, but that studies with n = 20 per condition have low power. If power were just 30%, the effect would appear in every third study significantly, and only 60 participants were used to produce significant results in one out of three studies. The report provides insufficient information to rule out this QRP, although it is well-known that excluding failed studies is a common practice in all sciences.
Selectively and secretly deleting data of participants (i.e., outliers) to arrive at significant results. The report provides no explanation how this QRP can be ruled out as an explanation. Simmons, Nelson, and Simonsohn (2011) demonstrated that conducting a study with 37 participants and then deleting data from 17 participants can contribute to a significant result when the null-hypothesis is true. However, if an actual effect is present, fewer participants need to be deleted to obtain a significant result. If the original sample size is large enough, it is always possible to delete cases to end up with a significant result. Of course, at some point selective and secretive deletion of observation is just data fabrication. Rather than making up data, actual data from participants are deleted to end up with the desired pattern of results. However, without information about the true effect size, it is difficult to determine whether an effect was present and just embellished (see Fisher’s analysis of Mendel’s famous genetics studies) or whether the null-hypothesis is true.
The English translation of the report does not contain any statements about questionable research practices from Dr. Förster. In an email communication on January 2, 2014, Dr. Förster revealed that he in fact ran multiple studies, some of which did not produce significant results, and that he only reported his best studies. He also mentioned that he openly admitted to this common practice to the commission. The English translation of the final report does not mention this fact. Thus, it remains an open question whether QRPs could have produced the unusual linearity in Dr. Förster’s studies.
A New Perspective: The Curse of Low Powered Studies
One unresolved question is why Dr. Förster would manipulate data to produce a linear pattern of means that he did not even mention in his articles. (Discover magazine).
One plausible answer is that the linear pattern is the by-product of questionable research practices to claim that two experimental groups with opposite manipulations are both significantly different from a control group. To support this claim, the articles always report contrasts of the experimental conditions and the control condition (see Table below).
In Table 1 the results of these critical tests are reported with subscripts next to the reported means. As the direction of the effect is theoretically determined, a one-tailed test was used. The null-hypothesis was rejected when p < .05.
Table 1 reports 9 comparisons of global processing conditions and control groups and 9 comparisons of local processing conditions with a control group; a total of 18 critical significance tests. All studies had approximately 20 participants per condition. The average effect size across the 18 studies is d = .71 (median d = .68). An a priori power analysis with d = .7, N = 40, and significance criterion .05 (one-tailed) gives a power estimate of 69%.
An alternative approach is to compute observed power for each study and to use median observed power (MOP) as an estimate of true power. This approach is more appropriate when effect sizes vary across studies. In this case, it leads to the same conclusion, MOP = 67.
The MOP estimate of power implies that a set of 100 tests is expected to produce 67 significant results and 33 non-significant results. For a set of 18 tests, the expected values are 12.4 significant results and 5.6 non-significant results.
The actual success rate in Table 1 should be easy to infer from Table 1, but there are some inaccuracies in the subscripts. For example, Study 1a shows no significant difference between means of 38 and 31 (d = .60, but it shows a significant difference between means 31 and 27 (d = .33). Most likely the subscript for the control condition should be c not a.
Based on the reported means and standard deviations, the actual success rate with N = 40 and p < .05 (one-tailed) is 83% (15 significant and 3 non-significant results).
The actual success rate (83%) is higher than one would expect based on MOP (67%). This inflation in the success rate suggests that the reported results are biased in favor of significant results (the reasons for this bias are irrelevant for the following discussion, but it could be produced by not reporting studies with non-significant results, which would be consistent with Dr. Förster’s account ).
The R-Index was developed to correct for this bias. The R-Index subtracts the inflation rate (83% – 67% = 16%) from MOP. For the data in Table 1, the R-Index is 51% (67% – 16%).
Given the use of a between-subject design and approximately equal sample sizes in all studies, the inflation in power can be used to estimate inflation of effect sizes. A study with N = 40 and p < .05 (one-tailed) has 50% power when d = .50.
Thus, one interpretation of the results in Table 1 is that the true effect sizes of the manipulation is d = .5, that 9 out of 18 tests should have produced a significant contrast at p < .05 (one-tailed) and that questionable research practices were used to increase the success rate from 50% to 83% (15 vs. 9 successes).
The use of questionable research practices would also explain unusual linearity in the data. Questionable research practices will increase or omit effect sizes that are insufficient to produce a significant result. With a sample size of N = 40, an effect size of d = .5 is insufficient to produce a significant result, d = .5, se = 32, t(38) = 1.58, p = .06 (one-tailed). Random sampling error that works against the hypothesis can only produce non-significant results that have to be dropped or moved upwards using questionable methods. Random error that favors the hypothesis will inflate the effect size and start producing significant results. However, random error is normally distributed around the true effect size and is more likely to produce results that are just significant (d = .8) than to produce results that are very significant (d = 1.5). Thus, the reported effect sizes will be clustered more closely around the median inflated effect size than one would expect based on an unbiased sample of effect sizes.
The clustering of effect sizes will happen for the positive effects in the global processing condition and for the negative effects in the local processing condition. As a result, the pattern of all three means will be more linear than an unbiased set of studies would predict. In a large set of studies, this bias will produce a very low p-value.
One way to test this hypothesis is to examine the variability in the reported results. The Test of Insufficient Variance (TIVA) was developed for this purpose. TIVA first converts p-values into z-scores. The variance of z-scores is known to be 1. Thus, a representative sample of z-scores should have a variance of 1, but questionable research practices lead to a reduction in variance. The probability that a set of z-scores is a representative set of z-scores can be computed with a chi-square test and chi-square is a function of the ratio of the expected and observed variance and the number of studies. For the set of studies in Table 1, the variance in z-scores is .33. The chi-square value is 54. With 17 degrees of freedom, the p-value is 0.00000917 and the odds of this event occurring by chance are 1 out of 109,056 times.
Previous discussions about abnormal linearity in Dr. Förster’s studies have failed to provide a satisfactory answer. An anonymous accuser claimed that the data were fabricated or manipulated, which the author vehemently denies. This blog proposes a plausible explanation of what could have [edited January 19, 2015] happened. Dr. Förster may have conducted more studies than were reported and included only studies with significant results in his articles. Slight variation in sample sizes suggests that he may also have removed a few outliers selectively to compensate for low power. Importantly, neither of these practices would imply scientific misconduct. The conclusion of the commission that scientific misconduct occurred rests on the assumption that QRPs cannot explain the unusual linearity of means, but this blog points out how selective reporting of positive results may have inadvertently produced this linear pattern of means. Thus, the present analysis support the conclusion by an independent statistical expert mentioned in the LOWI report: “QRP cannot be excluded, which in the opinion of the expert is a common, if not “prevalent” practice, in this field of science.”
How Unusual is an R-Index of 51?
The R-Index for the 18 statistical tests reported in Table 1 is 51% and TIVA confirms that the reported p-values have insufficient variance. Thus, it is highly probable that questionable research practices contributed to the results and in a personal communication Dr. Förster confirmed that additional studies with non-significant results exist. However, in response to further inquiries [see follow up blog] Dr. Förster denied having used QRPs that could explain the linearity in his data.
“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.”
Sadly, it is quite common to find an R-Index of 50% or lower for prominent publications in social psychology. This is not surprising because questionable research practices were considered good practices until recently. Even at present, it is not clear whether these practices constitute scientific misconduct (see discussion in Dialogue, Newsletter of the Society for Personality and Social Psychology).
How to Avoid Similar Sad Stories in the Future
One way to avoid accusations of scientific misconduct is to conduct a priori power analyses and to conduct only studies with a realistic chance to produce a significant result when the hypothesis is correct. When random error is small, true patterns in data can emerge without the help of QRPs.
Another important lesson from this story is to reduce the number of statistical tests as much as possible. Table 1 reported 18 statistical tests with the aim to demonstrate significance in each test. Even with a liberal criterion of .1 (one-tailed), it is highly unlikely that so many significant tests will produce positive results. Thus, a non-significant result is likely to emerge and researchers should think ahead of time how they would deal with non-significant results.
For the data in Table 1, Dr. Förster could have reported the means of 9 small studies without significance tests and conduct significance tests only once for the pattern in all 9 studies. With a total sample size of 360 participants (9 * 40), this test would have 90% power even if the effect size is only d = .35. With 90% power, the total power to obtain significant differences from the control condition for both manipulations would be 81%. Thus, the same amount of resources that were used for the controversial findings could have been used to conduct a powerful empirical test of theoretical predictions without the need to hide inconclusive, non-significant results in studies with low power.
Jacob Cohen has been trying to teach psychologists the importance of statistical power for decades and psychologists stubbornly ignored his valuable contribution to research methodology until he died in 1998. Methodologists have been mystified by the refusal of psychologists to increase power in their studies (Maxwell, 2004).
One explanation is that small samples provided a huge incentive. A non-significant result can be discarded with little cost of resources, whereas a significant result can be published and have the additional benefit of an inflated effect size, which allows boosting the importance of published results.
The R-Index was developed to balance the incentive structure towards studies with high power. A low R-Index reveals that a researcher is reporting biased results that will be difficult to replicate by other researchers. The R-Index reveals this inconvenient truth and lowers excitement about incredible results that are indeed incredible. The R-Index can also be used by researchers to control their own excitement about results that are mostly due to sampling error and to curb the excitement of eager research assistants that may be motivated to bias results to please a professor.
Curbed excitement does not mean that the R-Index makes science less exciting. Indeed, it will be exciting when social psychologists start reporting credible results about social behavior that boost a high R-Index because for a true scientist nothing is more exciting than the truth.
It has been known for decades that published results tend to be biased (Sterling, 1959). For most of the past decades this inconvenient truth has been ignored. In the past years, there have been many suggestions and initiatives to increase the replicability of reported scientific findings (Asendorpf et al., 2013). One approach is to examine published research results for evidence of questionable research practices (see Schimmack, 2014, for a discussion of existing tests). This blog post introduces a new test of bias in reported research findings, namely the Test of Insufficient Variance (TIVA).
TIVA is applicable to any set of studies that used null-hypothesis testing to conclude that empirical data provide support for an empirical relationship and reported a significance test (p-values).
Rosenthal (1978) developed a method to combine results of several independent studies by converting p-values into z-scores. This conversion uses the well-known fact that p-values correspond to the area under the curve of a normal distribution. Rosenthal did not discuss the relation between these z-scores and power analysis. Z-scores are observed scores that should follow a normal distribution around the non-centrality parameter that determines how much power a study has to produce a significant result. In the Figure, the non-centrality parameter is 2.2. This value is slightly above a z-score of 1.96, which corresponds to a two-tailed p-value of .05. A study with a non-centrality parameter of 2.2 has 60% power. In specific studies, the observed z-scores vary as a function of random sampling error. The standardized normal distribution predicts the distribution of observed z-scores. As observed z-scores follow the standard normal distribution, the variance of an unbiased set of z-scores is 1. The Figure on top illustrates this with the nine purple lines, which are nine randomly generated z-scores with a variance of 1.
In a real data set the variance can be greater than 1 for two reasons. First, if the nine studies are exact replication studies with different sample sizes, larger samples will have a higher non-centrality parameter than smaller samples. This variance in the true non-centrality variances adds to the variance produced by random sampling error. Second, a set of studies that are not exact replication studies can have variance greater than 1 because the true effect sizes can vary across studies. Again, the variance in true effect sizes produces variance in the true non-centrality parameters that add to the variance produced by random sampling error. In short, the variance is 1 in exact replication studies that also hold the sample size constant. When sample sizes and true effect sizes vary, the variance in observed z-scores is greater than 1. Thus, an unbiased set of z-scores should have a minimum variance of 1.
If the variance in z-scores is less than 1, it suggests that the set of z-scores is biased. One simple reason for insufficient variance is publication bias. If power is 50% and the non-centrality parameter matches the significance criterion of 1.96, 50% of studies that were conducted would not be significant. If these studies are omitted from the set of studies, variance decreases from 1 to .36. Another reason for insufficient variance is that researchers do not report non-significant results or used questionable research practices to inflate effect size estimates. The effect is that variance in observed z-scores is restricted. Thus, insufficient variance in observed z-scores reveals that the reported results are biased and provide an inflated estimate of effect size and replicability.
In small sets of studies, insufficient variance may be due to chance alone. It is possible to quantify how lucky a researcher was to obtain significant results with insufficient variance. This probability is a function of two parameters: (a) the ratio of the observed variance (OV) in a sample over the population variance (i.e., 1), and (b) the number of z-scores minus 1 as the degrees of freedom (k -1).
The product of these two parameters follows a chi-square distribution with k-1 degrees of freedom.
Formula 1: Chi-square = OV * (k – 1) with k-1 degrees of freedom.
Bem (2011) published controversial evidence that appear to demonstrate precognition. Subsequent studies failed to replicate these results (Galak et al.,, 2012) and other bias tests show evidence that the reported results are biased Schimmack (2012). For this reason, Bem’s article provides a good test case for TIVA.
The article reported results of 10 studies with 9 z-scores being significant at p < .05 (one-tailed). The observed variance in the 10 z-scores is 0.19. Using Formula 1, the chi-square value is chi^2 (df = 9) = 1.75. Importantly, chi-square tests are usually used to test whether variance is greater than expected by chance (right tail of the distribution). The reason is that variance is not expected to be less than the variance expected by chance because it is typically assumed that a set of data is unbiased. To obtain a probability of insufficient variance, it is necessary to test the left-tail of the chi-square distribution. The corresponding p-value for chi^2 (df = 9) = 1.75 is p = .005. Thus, there is only a 1 out of 200 probability that a random set of 10 studies would produce a variance as low as Var = .19.
This outcome cannot be attributed to publication bias because all studies were published in a single article. Thus, TIVA supports the hypothesis that the insufficient variance in Bem’s z-scores is the result of questionable research methods and that the reported effect size of d = .2 is inflated. The presence of bias does not imply that the true effect size is 0, but it does strongly suggest that the true effect size is smaller than the average effect size in a set of studies with insufficient variance.
Vohs et al. (2006) published a series of studies that he results of nine experiments in which participants were reminded of money. The results appeared to show that “money brings about a self-sufficient orientation.” Francis and colleagues suggested that the reported results are too good to be true. An R-Index analysis showed an R-Index of 21, which is consistent with a model in which the null-hypothesis is true and only significant results are reported.
Because Vohs et al. (2006) conducted multiple tests in some studies, the median p-value was used for conversion into z-scores. The p-values and z-scores for the nine studies are reported in Table 2. The Figure on top of this blog illustrates the distribution of the 9 z-scores relative to the expected standard normal distribution.
Study p z
Study 1 .026 2.23
Study 2 .050 1.96
Study 3 .046 1.99
Study 4 .039 2.06
Study 5 .021 2.99
Study 6 .040 2.06
Study 7 .026 2.23
Study 8 .023 2.28
Study 9 .006 2.73
The variance of the 9 z-scores is .054. This is even lower than the variance in Bem’s studies. The chi^2 test shows that this variance is significantly less than expected from an unbiased set of studies, chi^2 (df = 8) = 1.12, p = .003. An unusual event like this would occur in only 1 out of 381 studies by chance alone.
In conclusion, insufficient variance in z-scores shows that it is extremely likely that the reported results overestimate the true effect size and replicability of the reported studies. This confirms earlier claims that the results in this article are too good to be true (Francis et al., 2014). However, TIVA is more powerful than the Test of Excessive Significance and can provide more conclusive evidence that questionable research practices were used to inflate effect sizes and the rate of significant results in a set of studies.
TIVA can be used to examine whether a set of published p-values was obtained with the help of questionable research practices. When p-values are converted into z-scores, the variance of z-scores should be greater or equal to 1. Insufficient variance suggests that questionable research practices were used to avoid publishing non-significant results; this includes simply not reporting failed studies.
At least within psychology, these questionable research practices are used frequently to compensate for low statistical power and they are not considered scientific misconduct by governing bodies of psychological science (APA, APS, SPSP). Thus, the present results do not imply scientific misconduct by Bem or Vohs, just like the use of performance enhancing drugs in sports is not illegal unless a drug is put on an anti-doping list. However, jut because a drug is not officially banned, it does not mean that the use of a drug has no negative effects on a sport and its reputation.
One limitation of TIVA is that it requires a set of studies and that variance in small sets of studies can vary considerably just by chance. Another limitation is that TIVA is not very sensitive when there is substantial heterogeneity in true non-centrality parameters. In this case, the true variance in z-scores can mask insufficient variance in random sampling error. For this reason, TIVA is best used in conjunction with other bias tests. Despite these limitations, the present examples illustrate that TIVA can be a powerful tool in the detection of questionable research practices. Hopefully, this demonstration will lead to changes in the way researchers view questionable research practices and how the scientific community evaluates results that are statistically improbable. With rejection rates at top journals of 80% or more, one would hope that in the future editors will favor articles that report results from studies with high statistical power that obtain significant results that are caused by the predicted effect.