A previous blog post examined how and why Dr. Förster’s data showed improbable linearity.

The main hypothesis was that two experimental manipulations have opposite effects on a dependent variable.

Assuming that the average effect size of a single manipulation is similar to typical effect sizes in social psychology, a single manipulation is expected to have an effect size of d = .5 (a change of half a standard deviation). As the two manipulations are expected to have opposite effects, the mean difference between the two experimental groups should be one standard deviation (0.5 + 0.5 = 1). With N = 40 and d = 1, a study has 87% power to produce a significant effect (p < .05, two-tailed). With power of this magnitude, it would not be surprising to get significant results in all 12 comparisons (Table 1).
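The 87% figure can be verified with a standard power calculation for a two-sample t-test; a minimal sketch using the noncentral t-distribution in scipy:

```python
import numpy as np
from scipy import stats

# Power of a two-sample t-test with n = 20 per group (N = 40),
# true effect size d = 1, alpha = .05 two-tailed.
n, d, alpha = 20, 1.0, 0.05
df = 2 * n - 2                      # degrees of freedom
nc = d * np.sqrt(n / 2)             # noncentrality parameter
t_crit = stats.t.ppf(1 - alpha / 2, df)

# Probability that |t| exceeds the critical value under the alternative
power = (1 - stats.nct.cdf(t_crit, df, nc)) + stats.nct.cdf(-t_crit, df, nc)
print(round(power, 2))  # ~0.87
```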

**The R-Index for the comparison of the two experimental groups in Table 1 is Ř = 87%**

(Success Rate = 100%, Median Observed Power = 94%, Inflation Rate = 6%).

The Test of Insufficient Variance (TIVA) shows that the variance in the z-scores is less than 1, but the probability that this would occur by chance is about 10%, Var(z) = .63, Chi-square (df = 11) = 17.43, p = .096.
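TIVA itself is easy to compute: if the k studies provide independent z-scores, those z-scores should have a variance of 1, and (k − 1)·Var(z) follows a chi-square distribution with k − 1 degrees of freedom, with the left tail indicating insufficient variance. A sketch with hypothetical z-scores (the actual z-scores from Table 1 are not reproduced here):

```python
import numpy as np
from scipy import stats

# Hypothetical z-scores from k = 12 focal comparisons (illustration only)
z = np.array([2.1, 2.3, 1.9, 2.0, 2.2, 2.4, 2.1, 2.0, 2.3, 1.8, 2.2, 2.1])
k = len(z)

var_z = np.var(z, ddof=1)           # sample variance of the z-scores
chi2 = (k - 1) * var_z              # test statistic under H0: Var(z) = 1
p = stats.chi2.cdf(chi2, df=k - 1)  # left tail: variance too LOW

print(round(var_z, 3), round(p, 3))
```

With these hypothetical, tightly clustered values, the variance is far below 1 and the left-tail p is very small; real heterogeneous studies produce larger variance and a non-significant test.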

Thus, the results for the two experimental groups are perfectly consistent with real empirical data and the large effect size could be the result of two moderately strong manipulations with opposite effects.

The problem for Dr. Förster started when he included a control condition and wanted to demonstrate in each study that each of the two experimental groups also differed significantly from the control group. As already pointed out in the original post, samples of 20 participants per condition do not provide sufficient power to demonstrate effect sizes of d = .5 consistently.

To make matters worse, the three-group design has even less power than two independent studies because the same control group is used in both comparisons. When sampling error inflates the mean in the control group (e.g., true mean = 33, estimated mean = 36), it benefits the comparison with the experimental group with the lower mean, but it hurts the comparison with the experimental group with the higher mean (e.g., M = 27, M = 33, M = 39 vs. M = 27, M = 36, M = 39). When sampling error leads to an underestimation of the true mean in the control group (e.g., true mean = 33, estimated mean = 30), it benefits the comparison of the higher experimental group with the control group, but it hurts the comparison of the lower experimental group with the control group.

Thus, total power to produce significant results for both comparisons is even lower than for two independent studies.
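The cost of sharing one control group can be illustrated with a small Monte Carlo simulation; a sketch under the assumptions above (d = .5 per manipulation, n = 20 per cell, SD = 1):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, d, sims = 20, 0.5, 5000

def sig(a, b):
    """Two-sample t-test, significant at p < .05 two-tailed."""
    return stats.ttest_ind(a, b).pvalue < 0.05

shared = indep = 0
for _ in range(sims):
    low   = rng.normal(-d, 1, n)   # experimental group pushed down
    high  = rng.normal(+d, 1, n)   # experimental group pushed up
    ctrl  = rng.normal(0, 1, n)    # one control group shared by both tests
    ctrl2 = rng.normal(0, 1, n)    # independent control for comparison
    shared += sig(low, ctrl) and sig(high, ctrl)
    indep  += sig(low, ctrl) and sig(high, ctrl2)

# Joint power (both comparisons significant) is lower with a shared control,
# because sampling error in the one control group helps one comparison
# exactly when it hurts the other.
print(shared / sims, indep / sims)
```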

It follows that the problem for a researcher with real data would have been the control group. Most studies would have produced significant results for the comparison of the two experimental groups, but failed to show significant differences between one of the experimental groups and the control group.

At this point, it is unclear how Jens Förster achieved significant results under the contested assumption that real data were collected. However, it seems most plausible that QRPs would be used to move the mean of the control group to the center so that both experimental groups show a significant difference. When this was impossible, the control group could be dropped, which may explain why 3 studies in Table 1 did not report results for a control group.

The influence of QRPs on the control group can be detected by examining the variation of means in Table 1 across the 12 studies (9 of which included a control group). Sampling error should randomly increase or decrease means relative to the overall mean of an experimental condition. Thus, there is no reason to expect a correlation in the pattern of means. Consistent with this prediction, the means of the two experimental groups are unrelated, *r*(12) = .05, *p* = .889; *r*(9) = .36, *p* = .347. In contrast, the means of the control group are correlated with the means of the two experimental groups, *r*(9) = .73 and *r*(9) = .71. If the means in the control group were derived from the unbiased means in the experimental groups, the control means should be predictable from the means in the two experimental groups. A regression equation shows that 77% of the variance in the means of the control group is explained by the variation in the means in the experimental groups, R = .88, *F*(2,6) = 10.06, *p* = .01.
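The regression itself is straightforward; a sketch with hypothetical study means (not the values from Table 1), in which the control mean happens to sit near the midpoint of the two experimental means:

```python
import numpy as np

# Hypothetical means for 9 studies with a control group (illustration only)
low  = np.array([27.0, 25.0, 30.0, 28.0, 26.0, 29.0, 24.0, 31.0, 33.0])
high = np.array([39.0, 41.0, 36.0, 40.0, 38.0, 35.0, 42.0, 37.0, 43.0])
ctrl = np.array([33.5, 32.4, 33.2, 34.6, 31.8, 32.1, 33.3, 33.7, 37.5])

# Regress the control means on the two experimental means
X = np.column_stack([np.ones(len(ctrl)), low, high])
beta, *_ = np.linalg.lstsq(X, ctrl, rcond=None)
pred = X @ beta
r2 = 1 - np.sum((ctrl - pred) ** 2) / np.sum((ctrl - ctrl.mean()) ** 2)
print(round(r2, 2))  # high R^2: control means track the experimental means
```

Under pure sampling error the control means would be uncorrelated with the experimental means and R^2 should be near zero; a high R^2 is the diagnostic signature discussed above.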

This analysis clarifies the source of the unusual linearity in the data. Studies with n = 20 per condition have very low power to demonstrate significant differences between a control group and opposite experimental groups because sampling error in the control group is likely to move the mean of the control group too close to one of the experimental groups to produce a significant difference.

This problem of low power may lead researchers to use QRPs to move the mean of the control group to the center. The problem for users of QRPs is that this statistical boost of power leaves a trace in the data that can be detected with various bias tests. The pattern of the three means will be too linear, there will be insufficient variance in the effect sizes, p-values, and observed power in the comparisons of experimental groups and control groups, the success rate will exceed median observed power, and, as shown here, the means in the control group will be correlated with the means in the experimental group across conditions.

In a personal email, Dr. Förster did not comment on the statistical analyses because, by his own account, his background in statistics is insufficient to follow them. However, he rejected this scenario as an account for the unusual linearity in his data: “I never changed any means.” Another problem for this account of what could have happened is that dropping cases from the middle group would lower the sample size of this group, but the sample size is always close to n = 20. Moreover, oversampling and dropping of cases would be a QRP that Dr. Förster would remember and could report. Thus, I now agree with the conclusion of the LOWI commission that the data cannot be explained by QRPs, mainly because Dr. Förster denies having used any plausible QRPs that could have produced his results.

Some readers may be confused about this conclusion because it may appear to contradict my first blog. However, my first blog merely challenged the claim by the LOWI commission that linearity cannot be explained by QRPs. I found a plausible way in which QRPs could have produced linearity, and these new analyses still suggest that secretive and selective dropping of cases from the middle group could be used to show significant contrasts. Depending on the strength of the original evidence, this use of QRPs would be consistent with the widespread use of QRPs in the field and would not be considered scientific misconduct. As Roy F. Baumeister, a prominent social psychologist put it, “this is just how the field works.” However, unlike Roy Baumeister, who explained improbable results with the use of QRPs, Dr. Förster denies any use of QRPs that could potentially explain the improbable linearity in his results.

In conclusion, the following facts have been established with sufficient certainty:

(a) the reported results are too improbable to reflect only true effects and sampling error; they are not credible.

(b) the main obstacle to obtaining valid significant results is the low power of multiple-study articles and the difficulty of demonstrating significant differences between one control group and two opposite experimental groups.

(c) to avoid reporting non-significant results, a researcher would have to drop failed studies and selectively drop cases from the middle group to move its mean toward the center.

(d) Dr. Förster denies the use of QRPs and he denies data manipulation.

Evidently, the facts do not add up.

The new analyses suggest that there is one simple way for Dr. Förster to show that his data have some validity. The comparison of the two experimental groups shows an R-Index of 87%, which implies that there is nothing statistically improbable about these data. If the reported results are based on real data, a replication study is highly likely to reproduce the mean difference between the two experimental groups. With n = 20 in each cell (N = 40), it would be relatively easy to conduct a preregistered and transparent replication study. However, without further evidence, the published data lack scientific credibility, and it would be prudent to retract all articles that show unusual statistical patterns that cannot be explained by the author.

Dear Ulrich Schimmack,

I really appreciate your analysis. I have been thinking all along that what could be going on here is some selection effect caused by repeating a study many, many times and only picking some “good” results. I personally found the idea that the data have been massively fabricated quite implausible, because why on earth would you fabricate such totally senseless (unrealistic) straight-line patterns? However, other experts see that as the only explanation. Who is to say?

The power of an F test against inequality of three groups is greatest when the three groups have equally spaced means. This suggests to me that selecting the best from many many experiments (all with some systematic true, non-linear, effect, but too small to have any reasonable power) might tend to deliver one with the observed group averages much more evenly spread than the true group means.

One can come up with a simulation experiment which does not generate this pattern, but what does that prove? Maybe it only proves we haven’t been creative enough. Took the wrong parameters …

In fact, possibly because of my doubt on this issue, the initial investigation by the UvA did not conclude a breach of scientific integrity (I was one of the experts who was consulted). Then the affair was taken to a “higher court” at which some new experts were involved, and they managed to convince the new committee that the only explanation was fraud. Actually, from the committee’s report one can see quite well that they (the committee – who are mainly [academic] lawyers…) did not understand the statistical issues very well anyway; but they did understand the confidence with which the new experts stated their opinion.

But it is also clear that Förster does not understand the statistical issues either. That means that he is not competent to be running large-scale, intensive statistical experiments. It doesn’t mean he is a cheat. In my opinion, the burden of proof in an accusation of fraud is rather high. Innocent till proven guilty. On the other hand, the papers are clearly scientifically compromised and need to be withdrawn. We must distinguish between the moral integrity of scientific researchers and the scientific integrity of scientific works. Judging the one and judging the other is done with different criteria.

My own conclusion was that Forster should have gotten his academic prize and be obliged to do his experiments once more under strict supervision so that the scientific record gets “cleaned”. His earlier papers should have been withdrawn, yes.

Jens Förster gave me permission to further analyse the SPSS files belonging to one of the published papers under certain standard and reasonable conditions, and anyone who wants them can get them from me, along with the same set of conditions. But of course they can also ask Förster himself.

Reblogged this on Replication-Index and commented:

The Jens Forster saga continues. Results are too good to be true, but how did this happen?

I’ve read this saga and commend those involved for their measured approach in turning this thing over and examining it from all sides. One perspective that I have not seen offered is the question of whether the original calculation of the low probability of a linear result by the anonymous critic considered the field-wise probability or just the probability of this particular linear relationship among the means given the set-up of this study. I see a serious post-hoc fallacy in the probability calculation. I’m certain that if thousands of studies were examined, we could find studies with point estimates for arbitrary parameters that give a ratio equal to pi to the 5th decimal place. What would that demonstrate? Nothing. It’s the same fallacy as feeling “destined for great things” and “touched by the hand of God” when you’re saved in war, or winning the lottery and feeling “blessed”. This whole thing could easily be explained by “sooner or later, a study under this design would have a linear relationship”, and Förster may reasonably be exonerated as an unlucky, and, I might add, potentially targeted, individual. Auto-trolling studies for odd patterns would have its own Type 1, Type 2, Type M, and Type S errors, etc.