A Critical Examination of “Research Practices That Can Prevent an Inflation of False-Positive Rates” by Murayama, Pekrun, and Fiedler (2014) in Personality and Social Psychology Review.
The article by Murayama, Pekrun, and Fiedler (MPK) discusses the probability of false-positive results (evidence for an effect when no effect is present, also known as a type-I error) in multiple-study articles. When researchers conduct a single study, the nominal probability of obtaining a significant result without a real effect (a type-I error) is typically set to 5% (p < .05, two-tailed). Thus, when no effect exists, one would expect 19 non-significant results for every significant result. A false-positive finding (type-I error) would be followed by several failed replications. Thus, replication studies can quickly correct false discoveries. Or so one would like to believe. However, journals traditionally reported only significant results. Thus, false-positive results remained uncorrected in the literature because failed replications were not published.
In the 1990s, experimental psychologists who run relatively cheap studies found a solution to this problem. Journals demanded that researchers replicate their findings in a series of studies that were then published in a single article.
MPK point out that the probability of a type-I error decreases exponentially as the number of studies increases. With two studies, the probability is less than 1% (.05 * .05 = .0025). It is easier to see the exponential effect in terms of ratios (1 out of 20, 1 out of 400, 1 out of 8,000, etc.). In top journals of experimental social psychology, a typical article contains four studies. The probability that all four studies produce a type-I error is only 1 out of 160,000. The corresponding value on a standard normal distribution is z = 4.52, which means the strength of evidence is 4.5 standard deviations away from 0, which represents the absence of an effect. In particle physics a value of z = 5 is used to rule out false positives. Thus, getting 4 out of 4 significant results in four independent tests of an effect provides strong evidence for an effect.
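The arithmetic behind these numbers can be checked with a short Python sketch. It assumes independent studies, a per-study alpha of .05, and a two-tailed conversion from p to z; the function names are my own and are not taken from MPK.

```python
from statistics import NormalDist

ALPHA = 0.05  # nominal two-tailed type-I error rate per study


def joint_false_positive(k, alpha=ALPHA):
    """Probability that k independent studies are all type-I errors."""
    return alpha ** k


def two_tailed_z(p):
    """z-score on the standard normal whose two-tailed p-value equals p."""
    return NormalDist().inv_cdf(1 - p / 2)


for k in range(1, 5):
    p = joint_false_positive(k)
    print(f"{k} studies: p = {p:.3g} (1 in {round(1 / p):,}), z = {two_tailed_z(p):.2f}")
```

For k = 4 this reproduces the 1 in 160,000 ratio and a z-score of about 4.52.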
I am in full agreement with MPK and I made the same point in Schimmack (2012). The only difference is that I also point out that there is no difference between a series of 4 studies with small samples (e.g., n = 20 in 2 conditions for a total of N = 40) and a single study with the total number of participants (N = 160). A real effect will produce stronger evidence for an effect as sample size increases. Getting four significant results at the 5% level is not more impressive than getting a single significant result at the p < .00001 level.
However, the strength of evidence from multiple-study articles depends on one crucial condition. This condition is so elementary and self-evident that it is not even mentioned in statistics. The condition is that a researcher honestly reports all results. Four significant results are only impressive when a researcher went into the lab, conducted four studies, and obtained significant results in all of them. Similarly, 4 free throws are only impressive when there were only 4 attempts. Making 4 out of 20 free throws is not that impressive, and 4 out of 80 attempts is horrible. Thus, the absolute number of successes is not important. What matters is the relative frequency of successes among all attempts that were made.
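To make the free-throw point concrete, here is a minimal sketch of how likely it is to collect at least four significant results by chance alone, depending on how many studies were actually run. It assumes independent studies, no true effect, and alpha = .05; the function name is hypothetical.

```python
from math import comb


def prob_at_least(successes, attempts, alpha=0.05):
    """Binomial probability of at least `successes` significant results
    in `attempts` independent studies when no effect exists."""
    below = sum(
        comb(attempts, i) * alpha**i * (1 - alpha) ** (attempts - i)
        for i in range(successes)
    )
    return 1 - below


for attempts in (4, 20, 80):
    print(f"4+ significant results in {attempts} attempts: {prob_at_least(4, attempts):.6f}")
```

With 4 attempts the probability is the 1 in 160,000 discussed above, with 20 attempts it is already above 1%, and with 80 attempts four or more chance hits are more likely than not.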
Schimmack (2012) developed the incredibility index to examine whether a set of significant results is based on honest reporting or whether it was obtained by omitting non-significant results or by using questionable statistical practices to produce significant results. Evidence for dishonest reporting of results would undermine the credibility of the published results.
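The logic of such a check can be illustrated with a simplified sketch. This is not the published procedure: it assumes that every study has the same known power, whereas in practice power has to be estimated from the reported results. The function names and the power values are hypothetical.

```python
def prob_all_significant(power, k):
    """Probability that k independent studies are all significant
    when each study has the same power."""
    return power ** k


def incredibility(power, k):
    """Probability of at least one non-significant result among k studies,
    i.e. how surprising a perfect record of k significant results is."""
    return 1 - prob_all_significant(power, k)


for power in (0.5, 0.8):
    print(f"power = {power}: P(4 of 4 significant) = {prob_all_significant(power, 4):.4f}, "
          f"incredibility = {incredibility(power, 4):.4f}")
```

Under these assumptions, four studies with 50% power each would all be significant only about 6% of the time, so an article that reports four out of four successes with such studies invites the question of what was left out.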
MPK have the following to say about dishonest reporting of results.
“On a related note, Francis (2012a, 2012b, 2012c, 2012d; see also Schimmack, 2012) recently published a series of analyses that indicated the prevalence of publication bias (i.e., file-drawer problem) in multi-study papers in the psychological literature.” (p. 111). They also note that Francis used a related method to reveal that many multiple-study articles show statistical evidence of dishonest reporting. “Francis argued that there may be many cases in which the findings reported in multi-study papers are too good to be true” (p. 111).
In short, Schimmack and Francis argued that multiple-study articles can be misleading because they provide the illusion of replicability (a researcher was able to demonstrate the effect again, and again, and again, therefore it must be a robust effect), but in reality it is not clear how robust the effect is because the results were not obtained in the way the studies are described in the article (first we did Study 1, then we did Study 2, etc., and voila, all of the studies worked and showed the effect).
One objection to Schimmack and Francis would be to find a problem with their method of detecting bias. However, MPK do not comment on the method at all. They sidestep this issue when they write “it is beyond the scope of this article to discuss whether publication bias actually exists in these articles or how prevalent it is in general” (p. 111).
After sidestepping the issue, MPK are faced with a dilemma or paradox. Do multiple-study articles strengthen the evidence because the combined type-I error probability decreases, or do they weaken the evidence because it becomes more likely that researchers did not report the results of their research program honestly? “Should multi-study findings be regarded as reliable or shaky evidence?” (p. 111).
MPK solve this paradox with a semantic trick. First, they point out that dishonest reporting has undesirable effects on effect size estimates.
“A publication bias, if it exists, leads to overestimation of effect sizes because some null findings are not reported (i.e., only studies with relatively large effect sizes that produce significant results are reported). The overestimation of effect sizes is problematic” (p. 111).
They do not explain why researchers should be allowed to omit studies with non-significant results from an article, given that this practice leads to the undesirable consequences of inflated effect sizes. Accurate estimates of effect sizes would be obtained if researchers published all of their results. In fact, Schimmack (2012) suggested that researchers report all results and then conduct a meta-analysis of their set of studies to examine how strong the evidence of a set of studies is. This meta-analysis would provide an unbiased measure of the true effect size and unbiased evidence about the probability that the results of all studies were obtained in the absence of an effect.
The semantic trick occurs when the authors suggest that dishonest reporting practices are only a problem for effect size estimates, but not for the question whether an effect actually exists.
“However, the presence of publication bias does not necessarily mean that the effect is absent (i.e., that the findings are falsely positive).” (p. 111) and “Publication bias simply means that the effect size is overestimated—it does not necessarily imply that the effect is not real (i.e., falsely positive).” (p. 112).
This statement is true because it is practically impossible to demonstrate false positives, which would require demonstrating that the true effect size is exactly 0. The presence of bias does not warrant the conclusion that the effect size is zero and that reported results are false positives.
However, this is not the point of revealing dishonest practices. The point is that dishonest reporting of results undermines the credibility of the evidence that was used to claim that an effect exists. The issue is the lack of credible evidence for an effect, not credible evidence for the lack of an effect. These two statements are distinct and MPK use the truth of the second statement to suggest that we can ignore whether the first statement is true.
Finally, MPK present a scenario of a multiple-study article with 8 studies that all produced significant results. They state that it is “unrealistic that as many as eight statistically significant results were produced by a non-existent effect” (p. 112).
This blue-eyed view of multiple-study articles ignores the fact that the replication crisis in psychology was triggered by Bem’s (2011) infamous article that contained 9 out of 9 statistically significant results (one marginal result was attributed to methodological problems; see Schimmack, 2012, for details) that supposedly demonstrated humans’ ability to foresee the future and to influence the past (e.g., learning after a test increased performance on a test that was taken before learning for the test). Schimmack (2012) used this article to demonstrate how important it can be to evaluate the credibility of multiple-study articles, and the incredibility index predicted correctly that these results would not replicate. So, it is simply naïve to assume that articles with more studies automatically strengthen evidence for the existence of an effect and that 8 significant results cannot occur in the absence of a true effect (maybe MPK believe in ESP).
It is also not clear why researchers should have to wonder about the credibility of results in multiple-study articles. A simple solution to the paradox is to report all results honestly. If an honest set of studies provides evidence for an effect, it is not clear why researchers would prefer to engage in dishonest reporting practices. MPK provide no explanation for this practice and make no recommendation to increase honesty in the reporting of results as a simple solution to the replicability crisis in psychology.
They write, “the researcher may have conducted 10, or even 20, experiments until he/she obtained 8 successful experiments, but far more studies would have been needed had the effect not existed at all”. This is true, but we do not know how many studies a researcher conducted or what else a researcher did to the data unless all of this information is reported. If the combined evidence of 20 studies with 8 significant results shows that an effect is present, a researcher could just publish all 20 studies. What is the reason to hide over 50% of the evidence?
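The arithmetic behind MPK's statement can be sketched as follows. The sketch assumes independent studies with a constant probability of a significant result and no other questionable practices; the power values of 0.8 and 0.4 are hypothetical and were chosen only because they match the 10 and 20 experiments in the quote.

```python
def expected_studies(successes, p_significant):
    """Expected number of independent studies needed to collect the given
    number of significant results (mean of a negative binomial)."""
    return successes / p_significant


scenarios = {
    "no effect (alpha = .05)": 0.05,
    "real effect, power = .40": 0.40,
    "real effect, power = .80": 0.80,
}
for label, p in scenarios.items():
    print(f"{label}: about {expected_studies(8, p):.0f} studies for 8 significant results")
```

The contrast between roughly 160 studies without an effect and 10 to 20 with one is what MPK rely on, but it can only be evaluated when the number of studies that were actually conducted is reported.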
In the end, MPK assure readers that they “do not intend to defend underpowered studies” and they do suggest that “the most straightforward solution to this paradox is to conduct studies that have sufficient statistical power” (p. 112). I fully agree with these recommendations because powerful studies can provide real evidence for an effect and decrease the incentive to engage in dishonest practices.
It is discouraging that this article was published in a major review journal in social psychology. It is difficult to see how social psychology can regain trust if social psychologists believe they can simply continue to engage in dishonest reporting of results. Unfortunately, social psychologists continue to downplay the replication crisis and the shaky foundations of many textbook claims.
6 thoughts on “Klaus Fiedler “it is beyond the scope of this article to discuss whether publication bias actually exists””
“The only difference is that I also point out that there is no difference between a series of 4 studies with small samples (e.g., n = 20 in 2 conditions for a total of N = 40) or a single study with the total number of participants (N = 160). A real effect will produce stronger evidence for an effect as sample size increase. Getting four significant results at the 5% level is not more impressive than getting a single significant result at the p < .00001 level."
While this is true, I don't think it's a fair accounting of the way social psychologists view the problem. Typically, a social psychologist won't just replicate the experiment exactly, but make modifications to the design with different settings, people, and times. This is to demonstrate the robustness of the experimental results. This result shows up in a coffee shop with a female experimenter; will it also show up in a department store with a male experimenter? While it doesn't increase replicability, it can increase robustness assuming no other issues, and I think this is what most researchers mean when they say the evidence is stronger with multiple experiments. They view multiple experiments as a way to reduce the limitations of any one particular experimental design.
This in turn creates problems of its own for if a result shows up in one study, but not in another study, the experimenter can just change the hypothesis to accommodate this new result. I think social psychology would be improved if they focused more on replication, and worried less about robustness. The absence of direct replication makes it easier for researchers to explain away contrary results. They don't have to deliberately hide evidence, but can just say the negative results aren't applicable to their hypothesis, which is often true because the hypothesis was tailored to the results. Many of them probably do believe they are reporting the results honestly.
Thank you for your comment. I discuss this issue in more detail in Schimmack (2012). Assume all four studies used the same dependent variable. It would be possible to analyze the data as a single study with N = 160 and include the four conditions (male vs. female experimenter, in coffee shop vs. on Mturk) as possible moderator variables. The power to show the main effect is now as high as in a single study without the moderators. The power to show moderator effects is low due to the small sample sizes in each condition. This approach would also help to avoid the problem of either publishing a non-significant result and falsely interpreting it as evidence that the effect is weaker in this condition (without a proper test of moderation) or dropping the study based on some post-hoc rationalization of why it did not work.
To me, Fiedler is the epitome of social psychology. This being his magnum opus:
I’ve tried to summarize the main point of his writing for myself, but so far i am having a hard time doing so. I wonder if there is anything useful in it that i am not grasping.
Perhaps it’s the case that i just don’t agree with what i assume is one of the main points, that “It will be seen that the current discourse on questionable research practices, usability of science, and fraud, mainly fueled by whistle-blowers who are themselves members of the scientific community, does not live up to higher levels of moral judgment according to Piaget and Kohlberg.”
More importantly, i wonder why that should matter in the first place.
“It will be seen…”
Looks like the epitome of social psychology also has ESP powers.
That can be helpful in the planning of studies without statistical power because one can predict the effect size in the sample.
“The only difference is that I also point out that there is no difference between a series of 4 studies with small samples (e.g., n = 20 in 2 conditions for a total of N = 40) or a single study with the total number of participants (N = 160).”
The differences between four studies with N=40 and one study with N=160 are HUGE. Please, Uli, never ever make the same mistake again. 🙂
ONLY if no publication bias and no qrps and no heterogeneity exists, the two situations are very similar. But well… we know that publication bias and qrps may exist.
But let us assume we live in a perfect world without qrps and without publication bias. And we do not know if heterogeneity exists or not. What would you prefer, four studies of N=40 or one study of N=160? The four studies can be shown to estimate the overall effect size more efficiently under heterogeneity AND may inform us a little bit about heterogeneity.
So, please, do not ever state again that “The only difference is that I also point out that there is no difference between a series of 4 studies with small samples (e.g., n = 20 in 2 conditions for a total of N = 40) or a single study with the total number of participants (N = 160).” 🙂
ONLY if no publication bias and no qrps and no heterogeneity exists, the two situations are very similar. But well… we know that publication bias and qrps may exist.
Please read Schimmack (2012). It is long but worth it. LOL.
I make exactly your point in the article. In theory there is no gain in doing 4 studies with N = 40 vs. one study with N = 160, but top social psych journals pretend that a “programmatic set of studies” is superior and reject single-study articles even if N = 10,000 for one study and total N for 5 studies is 200.
Number of study fetishism is bullshit because these “programmatic studies” never have the power to really be significant all the time. But they are, see Bem (2011), which only shows that QRPs were used and the results are meaningless. 30 years later, replication crisis, and no empirical evidence to show for it.