How Power Analysis Could Have Prevented the Sad Story of Dr. Förster

[further information can be found in a follow up blog]

Background

In 2011, Dr. Förster published an article in Journal of Experimental Psychology: General. The article reported 12 studies and each study reported several hypothesis tests. The abstract reports that “In all experiments, global/local processing in 1 modality shifted to global/local processing in the other modality”.

For a while this article was just another article that reported a large number of studies that all worked and neither reviewers nor the editor who accepted the manuscript for publication found anything wrong with the reported results.

In 2012, an anonymous letter voiced suspicion that Jens Förster had committed scientific misconduct. The allegation led to an investigation, but as of today (January 1, 2015) there is no satisfactory account of what happened. Jens Förster maintains that he is innocent (5b. Brief von Jens Förster vom 10. September 2014) and blames the accusations about scientific misconduct on a climate of hypervigilance after the discovery of scientific misconduct by another social psychologist.

The Accusation

The accusation is based on an unusual statistical pattern in three publications. The 3 articles reported 40 experiments with 2284 participants, that is an average sample size of N = 57 participants in each experiment. The 40 experiments all had a between-subject design with three groups: one group received a manipulation designed to increase scores on the dependent variable. A second group received the opposite manipulation to decrease scores on the dependent variable. And a third group served as a control condition with the expectation that the average of the group would fall in the middle of the two other groups. To demonstrate that both manipulations have an effect, both experimental groups have to show significant differences from the control group.

The accuser noticed that the reported means were unusually close to a linear trend. This means that the two experimental conditions showed markedly symmetrical deviations from the control group. For example, if one manipulation increased scores on the dependent variables by half a standard deviation (d = +.5), the other manipulation decreased scores on the dependent variable by half a standard deviation (d = -.5). Such a symmetrical pattern can be expected when the two manipulations are equally strong AND WHEN SAMPLE SIZES ARE LARGE ENOUGH TO MINIMIZE RANDOM SAMPLING ERROR. However, the sample sizes were small (n = 20 per condition, N = 60 per study). These sample sizes are not unusual and social psychologists often use n = 20 per condition to plan studies. However, these sample sizes have low power to produce consistent results across a large number of studies.

The accuser computed the statistical probability of obtaining the reported linear trend. The probability of obtaining the picture-perfect pattern of means by chance alone was incredibly small.

Based on this finding, the Dutch National Board for Research Integrity (LOWI) started an investigation of the causes for this unlikely finding. An English translation of the final report was published on Retraction Watch. An important question was whether the reported results could have been obtained by means of questionable research practices (QRPs) or whether the statistical pattern can only be explained by data manipulation. The English translation of the final report includes two relevant passages.

According to one statistical expert “QRP cannot be excluded, which in the opinion of the expert is a common, if not “prevalent” practice, in this field of science.” This would mean that Dr. Förster acted in accordance with scientific practices and that his behavior would not constitute scientific misconduct.

In response to this assessment the Complainant “extensively counters the expert’s claim that the unlikely patterns in the experiments can be explained by QRP.” This led to the decision that scientific misconduct occurred.

Four QRPs were considered.

  1. Improper rounding of p-values. This QRP can only be used rarely when p-values happen to be close to .05. It is correct that this QRP cannot produce highly unusual patterns in a series of replication studies. It can also be easily checked by computing exact p-values from reported test statistics.
  2. Selecting dependent variables from a set of dependent variables. The articles in question reported several experiments that used the same dependent variable. Thus, this QRP cannot explain the unusual pattern in the data.
  3. Collecting additional research data after an initial research finding revealed a non-significant result. This description of a QRP is ambiguous. Presumably it refers to optional stopping: when the data trend in the right direction, data collection continues with repeated checking of p-values and stops as soon as the p-value is significant. This practice leads to random variation in sample sizes. However, studies in the reported articles all have more or less 20 participants per condition. Thus, optional stopping can be ruled out. However, if a condition with 20 participants does not produce a significant result, it could simply be discarded, and another condition with 20 participants could be run. With a false-positive rate of 5%, this procedure will eventually yield the desired outcome while holding sample size constant. It seems implausible that Dr. Förster conducted 20 studies to obtain a single significant result. Thus, it is even more plausible that the effect is actually there, but that studies with n = 20 per condition have low power. If power were just 30%, the effect would be significant in roughly every third study, and only 60 participants were used to produce significant results in one out of three studies. The report provides insufficient information to rule out this QRP, although it is well-known that excluding failed studies is a common practice in all sciences.
  4. Selectively and secretly deleting data of participants (i.e., outliers) to arrive at significant results. The report provides no explanation of how this QRP can be ruled out. Simmons, Nelson, and Simonsohn (2011) demonstrated that conducting a study with 37 participants and then deleting data from 17 participants can contribute to a significant result when the null-hypothesis is true. However, if an actual effect is present, fewer participants need to be deleted to obtain a significant result. If the original sample size is large enough, it is always possible to delete cases to end up with a significant result. Of course, at some point selective and secretive deletion of observations is just data fabrication. Rather than making up data, actual data from participants are deleted to end up with the desired pattern of results. However, without information about the true effect size, it is difficult to determine whether an effect was present and just embellished (see Fisher’s analysis of Mendel’s famous genetics studies) or whether the null-hypothesis is true.

The English translation of the report does not contain any statements about questionable research practices from Dr. Förster. In an email communication on January 2, 2014, Dr. Förster revealed that he in fact ran multiple studies, some of which did not produce significant results, and that he only reported his best studies. He also mentioned that he openly admitted to this common practice to the commission. The English translation of the final report does not mention this fact. Thus, it remains an open question whether QRPs could have produced the unusual linearity in Dr. Förster’s studies.

A New Perspective: The Curse of Low Powered Studies

One unresolved question is why Dr. Förster would manipulate data to produce a linear pattern of means that he did not even mention in his articles. (Discover magazine).

One plausible answer is that the linear pattern is the by-product of questionable research practices to claim that two experimental groups with opposite manipulations are both significantly different from a control group. To support this claim, the articles always report contrasts of the experimental conditions and the control condition (see Table below).

In Table 1 the results of these critical tests are reported with subscripts next to the reported means. As the direction of the effect is theoretically determined, a one-tailed test was used. The null-hypothesis was rejected when p < .05.

Table 1 reports 9 comparisons of global processing conditions and control groups and 9 comparisons of local processing conditions with a control group; a total of 18 critical significance tests. All studies had approximately 20 participants per condition. The average effect size across the 18 studies is d = .71 (median d = .68).   An a priori power analysis with d = .7, N = 40, and significance criterion .05 (one-tailed) gives a power estimate of 69%.
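The power figure above can be roughly checked with a few lines of Python. This is a minimal sketch that assumes a normal approximation to the t-test; the function name `approx_power_one_tailed` is made up for this illustration. Because the approximation ignores the extra uncertainty of the t-distribution in small samples, it runs a couple of percentage points higher than an exact calculation such as the 69% reported above.

```python
from math import sqrt
from statistics import NormalDist


def approx_power_one_tailed(d, n_per_group, alpha=0.05):
    """Approximate power of a one-tailed two-group comparison.

    Assumption: the test statistic is treated as normal rather than
    t-distributed, so the result is slightly optimistic for small n.
    """
    std_normal = NormalDist()
    # Expected value of the test statistic for standardized effect d
    noncentrality = d * sqrt(n_per_group / 2)
    # One-tailed critical value for the chosen alpha
    z_crit = std_normal.inv_cdf(1 - alpha)
    return std_normal.cdf(noncentrality - z_crit)


if __name__ == "__main__":
    # n = 20 per condition, i.e. N = 40 per contrast, as in Table 1
    for d in (0.7, 0.5):
        power = approx_power_one_tailed(d, n_per_group=20)
        print(f"d = {d}: approximate power = {power:.2f}")
```

An exact power analysis would use the noncentral t-distribution, which is what dedicated power software does.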

An alternative approach is to compute observed power for each study and to use median observed power (MOP) as an estimate of true power. This approach is more appropriate when effect sizes vary across studies. In this case, it leads to the same conclusion, MOP = 67%.

The MOP estimate of power implies that a set of 100 tests is expected to produce 67 significant results and 33 non-significant results. For a set of 18 tests, the expected values are about 12 significant results and 6 non-significant results (18 × .67 ≈ 12.1).

The actual success rate should be easy to infer from Table 1, but there are some inaccuracies in the subscripts. For example, Study 1a shows no significant difference between means of 38 and 31 (d = .60), but it shows a significant difference between means of 31 and 27 (d = .33). Most likely the subscript for the control condition should be c, not a.

Based on the reported means and standard deviations, the actual success rate with N = 40 and p < .05 (one-tailed) is 83% (15 significant and 3 non-significant results).

The actual success rate (83%) is higher than one would expect based on MOP (67%). This inflation in the success rate suggests that the reported results are biased in favor of significant results (the reasons for this bias are irrelevant for the following discussion, but it could be produced by not reporting studies with non-significant results, which would be consistent with Dr. Förster’s account ).
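To get a feeling for the size of this gap, one can ask how often 15 or more out of 18 tests would be significant if every test really had 67% power. The sketch below is not a calculation from the original analysis. It assumes that the 18 tests are independent and equally powered, which is a simplification because the global and local contrasts within a study share a control group. The function name `prob_at_least` is hypothetical.

```python
from math import comb


def prob_at_least(successes, tests, power):
    """Binomial probability of at least `successes` significant results.

    Assumption: all tests are independent and share the same power.
    """
    return sum(
        comb(tests, k) * power**k * (1 - power) ** (tests - k)
        for k in range(successes, tests + 1)
    )


if __name__ == "__main__":
    tests, power = 18, 0.67
    print(f"expected significant results: {tests * power:.1f}")
    print(f"P(at least 15 of 18 significant): {prob_at_least(15, tests, power):.3f}")
```

Under these simplifying assumptions, a success rate of 83% is higher than expected but not impossible on its own, which is why the inflation is combined with other evidence below.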

The R-Index was developed to correct for this bias. The R-Index subtracts the inflation rate (83% – 67% = 16%) from MOP. For the data in Table 1, the R-Index is 51% (67% – 16%).
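The arithmetic of the R-Index as described here fits in a few lines. The following is a minimal sketch using the values reported for Table 1; the function name `r_index` is chosen for this illustration.

```python
def r_index(success_rate, median_observed_power):
    """R-Index = median observed power minus the inflation rate,
    where inflation = success rate - median observed power."""
    inflation = success_rate - median_observed_power
    return median_observed_power - inflation


if __name__ == "__main__":
    success_rate = 15 / 18  # 15 significant results out of 18 tests
    mop = 0.67              # median observed power reported above
    print(f"inflation = {success_rate - mop:.2f}")
    print(f"R-Index   = {r_index(success_rate, mop):.2f}")
```

Because the inflation is subtracted from MOP, the R-Index equals twice the MOP minus the success rate, so any excess of successes over power lowers the index.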

Given the use of a between-subject design and approximately equal sample sizes in all studies, the inflation in power can be used to estimate inflation of effect sizes. A study with N = 40 and p < .05 (one-tailed) has 50% power when d = .50.

Thus, one interpretation of the results in Table 1 is that the true effect size of the manipulations is d = .5, that 9 out of 18 tests should have produced a significant contrast at p < .05 (one-tailed), and that questionable research practices were used to increase the success rate from 50% to 83% (15 vs. 9 successes).

The use of questionable research practices would also explain unusual linearity in the data. Questionable research practices will increase or omit effect sizes that are insufficient to produce a significant result. With a sample size of N = 40, an effect size of d = .5 is insufficient to produce a significant result, d = .5, se = .32, t(38) = 1.58, p = .06 (one-tailed). Random sampling error that works against the hypothesis can only produce non-significant results that have to be dropped or moved upwards using questionable methods. Random error that favors the hypothesis will inflate the effect size and start producing significant results. However, random error is normally distributed around the true effect size and is more likely to produce results that are just significant (d = .8) than to produce results that are very significant (d = 1.5). Thus, the reported effect sizes will be clustered more closely around the median inflated effect size than one would expect based on an unbiased sample of effect sizes.

The clustering of effect sizes will happen for the positive effects in the global processing condition and for the negative effects in the local processing condition. As a result, the pattern of all three means will be more linear than an unbiased set of studies would predict. In a large set of studies, this bias will produce a very low p-value.

One way to test this hypothesis is to examine the variability in the reported results. The Test of Insufficient Variance (TIVA) was developed for this purpose. TIVA first converts p-values into z-scores. The variance of z-scores is known to be 1. Thus, a representative sample of z-scores should have a variance of 1, but questionable research practices lead to a reduction in variance. The probability that a set of z-scores is a representative set of z-scores can be computed with a chi-square test and chi-square is a function of the ratio of the expected and observed variance and the number of studies. For the set of studies in Table 1, the variance in z-scores is .33. The chi-square value is 54. With 17 degrees of freedom, the p-value is 0.00000917 and the odds of this event occurring by chance are 1 out of 109,056 times.
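The final step, turning the chi-square value into a probability, can be sketched without a statistics package. The code below takes the reported values (chi-square of about 54, 17 degrees of freedom) as given and does not reproduce how the chi-square was derived from the variance ratio. The function name `chi2_upper_tail` and the series implementation of the incomplete gamma function are choices made for this illustration.

```python
import math


def chi2_upper_tail(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable.

    Uses the power series of the regularized lower incomplete gamma
    function P(a, h) with a = df / 2 and h = x / 2.
    """
    if x <= 0:
        return 1.0
    a, h = df / 2.0, x / 2.0
    # Series: sum over n of h^n / (a * (a + 1) * ... * (a + n))
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-17 * abs(total):
        n += 1
        term *= h / (a + n)
        total += term
    # Prefactor h^a * exp(-h) / Gamma(a), computed in log space
    log_prefactor = a * math.log(h) - h - math.lgamma(a)
    lower = total * math.exp(log_prefactor)
    return 1.0 - lower


if __name__ == "__main__":
    p = chi2_upper_tail(54.0, 17)
    print(f"p = {p:.3g}, about 1 in {1 / p:,.0f}")
```

With the rounded input of 54 the result is on the order of 1 in 100,000, in line with the figure reported above; small differences are expected because the reported chi-square value is rounded.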

Conclusion

Previous discussions about abnormal linearity in Dr. Förster’s studies have failed to provide a satisfactory answer. An anonymous accuser claimed that the data were fabricated or manipulated, which the author vehemently denies. This blog proposes a plausible explanation of what could have [edited January 19, 2015] happened. Dr. Förster may have conducted more studies than were reported and included only studies with significant results in his articles. Slight variation in sample sizes suggests that he may also have removed a few outliers selectively to compensate for low power. Importantly, neither of these practices would imply scientific misconduct. The conclusion of the commission that scientific misconduct occurred rests on the assumption that QRPs cannot explain the unusual linearity of means, but this blog points out how selective reporting of positive results may have inadvertently produced this linear pattern of means. Thus, the present analysis supports the conclusion by an independent statistical expert mentioned in the LOWI report: “QRP cannot be excluded, which in the opinion of the expert is a common, if not “prevalent” practice, in this field of science.”

How Unusual is an R-Index of 51?

The R-Index for the 18 statistical tests reported in Table 1 is 51% and TIVA confirms that the reported p-values have insufficient variance. Thus, it is highly probable that questionable research practices contributed to the results and in a personal communication Dr. Förster confirmed that additional studies with non-significant results exist. However, in response to further inquiries [see follow up blog] Dr. Förster denied having used QRPs that could explain the linearity in his data.

Nevertheless, an R-Index of 51% is not unusual and has been explained with the use of QRPs. For example, the R-Index for a set of studies by Roy Baumeister was 49%, and Roy Baumeister stated that QRPs were used to obtain significant results.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.”

Sadly, it is quite common to find an R-Index of 50% or lower for prominent publications in social psychology. This is not surprising because questionable research practices were considered good practices until recently. Even at present, it is not clear whether these practices constitute scientific misconduct (see discussion in Dialogue, Newsletter of the Society for Personality and Social Psychology).

How to Avoid Similar Sad Stories in the Future

One way to avoid accusations of scientific misconduct is to conduct a priori power analyses and to conduct only studies with a realistic chance to produce a significant result when the hypothesis is correct. When random error is small, true patterns in data can emerge without the help of QRPs.

Another important lesson from this story is to reduce the number of statistical tests as much as possible. Table 1 reported 18 statistical tests with the aim to demonstrate significance in each test. Even with a liberal criterion of .1 (one-tailed), it is highly unlikely that so many tests will all produce significant results. Thus, a non-significant result is likely to emerge and researchers should think ahead of time how they would deal with non-significant results.

For the data in Table 1, Dr. Förster could have reported the means of 9 small studies without significance tests and conducted a significance test only once for the pattern in all 9 studies. With a total sample size of 360 participants (9 * 40), this test would have 90% power even if the effect size is only d = .35. With 90% power, the total power to obtain significant differences from the control condition for both manipulations would be 81%. Thus, the same amount of resources that were used for the controversial findings could have been used to conduct a powerful empirical test of theoretical predictions without the need to hide inconclusive, non-significant results in studies with low power.
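The planning step recommended here can be sketched as a small sample-size calculation. This is a rough illustration that assumes a normal approximation and a two-tailed test at alpha = .05 (the post does not state which criterion the 90% figure uses); the function name `n_per_group_for_power` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_for_power(d, power, alpha=0.05, two_tailed=True):
    """Approximate participants per group needed for a two-group comparison.

    Assumption: normal approximation, n = 2 * ((z_alpha + z_power) / d)^2.
    """
    std_normal = NormalDist()
    tail_alpha = alpha / 2 if two_tailed else alpha
    z_alpha = std_normal.inv_cdf(1 - tail_alpha)
    z_power = std_normal.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


if __name__ == "__main__":
    n = n_per_group_for_power(d=0.35, power=0.90)
    print(f"n per group = {n}, total N for one contrast = {2 * n}")
    # Chance that both contrasts against the control group are significant,
    # treating them as independent (a simplification).
    print(f"joint power for two contrasts = {0.90 * 0.90:.2f}")
```

Under these assumptions the required total for one contrast falls just below the 360 participants mentioned above.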

Jacob Cohen has been trying to teach psychologists the importance of statistical power for decades and psychologists stubbornly ignored his valuable contribution to research methodology until he died in 1998. Methodologists have been mystified by the refusal of psychologists to increase power in their studies (Maxwell, 2004).

One explanation is that small samples provided a huge incentive. A non-significant result can be discarded with little cost of resources, whereas a significant result can be published and have the additional benefit of an inflated effect size, which allows boosting the importance of published results.

The R-Index was developed to balance the incentive structure towards studies with high power. A low R-Index reveals that a researcher is reporting biased results that will be difficult to replicate by other researchers. The R-Index reveals this inconvenient truth and lowers excitement about incredible results that are indeed incredible. The R-Index can also be used by researchers to control their own excitement about results that are mostly due to sampling error and to curb the excitement of eager research assistants that may be motivated to bias results to please a professor.

Curbed excitement does not mean that the R-Index makes science less exciting. Indeed, it will be exciting when social psychologists start reporting credible results about social behavior that boast a high R-Index because for a true scientist nothing is more exciting than the truth.

15 thoughts on “How Power Analysis Could Have Prevented the Sad Story of Dr. Förster”

  1. `Jacob Cohen has been trying to teach psychologists the importance of statistical power for decades and psychologists stubbornly ignored his valuable contribution to research methodology until he died in 1998. Methodologists have been mystified by the refusal of psychologists to increase power in their studies (Maxwell, 2004).`

    I am sure you are aware of the fact that some journals now recognize certain practices by handing out badges:

    https://osf.io/tvyxz/

    http://www.psychologicalscience.org/index.php/publications/observer/2014/december-14/psychological-science-authors-earn-badges-for-open-practices.html

    Do you think a badge that depicts the study used 95% (or perhaps 80%) power would be a possible useful addition to the existing badges?

      1. Thanks for your reply Dr. R.! I have contacted the people behind the existing badges and asked them if they would find such a “80%/ 95% power badge” a good idea. They would follow up with me once they’d bring it up in the next meeting.

        I am curious to see what the power results of the first issue of Psychological Science in 2015 will be!

  2. The assertion that the excessive linearity of results in the 40 experiments presented by Dr. Förster could be due to selective reporting is interesting but incorrect. This can be easily shown by using the simulation script in the complainant’s report and selecting only those studies in which the medium group deviates significantly (one-tailed at Alpha=.05) from both the low and high conditions. In that scenario of extreme publication bias, the linearity of results in Dr. Förster’s data still far exceeds what can be expected to occur in real data. So even if Förster committed questionable research practices on an excessive scale (and he explicitly denied ever having used any QRP), his linear results remain practically impossible. The LOWI investigation of course went farther than simply looking at the data and the final verdict in this case was also based on the complete loss of all raw data and Dr. Förster’s inability to indicate when, where, and by whom any of the 40 experiments were run. To date, no one has been able to convincingly come up with an honest scenario that could account for the linearity of results presented by Dr. Förster.

    1. Dear Critical Remark,

      Thank you for your very informative response.

      For readers interested in this topic, it might be helpful to have a link to the complainant’s report with the simulation script.
      I also think the case of Dr. Forster is so important because it sets a precedent that statistical patterns in data triggered an investigation into scientific misconduct. It will be important for the wider scientific community, especially in the Netherlands, to specify clearly how improbable results have to be to infer that QRPs no longer explain unusual patterns in the data and trigger an investigation of scientific misconduct.

      Unusual patterns in Dr. Forster’s data were only noticed because they produced linearity in a comparison of three means. However, other tools for the detection of QRPs like the Test of Insufficient Variance can be applied to any set of p-values and can show extremely low probabilities very quickly.

      In this regard, it is important that you highlight additional evidence about the lack of raw data. I think anybody who used QRPs and has a file drawer should learn from this case that it is better to keep copies of failed studies to demonstrate that data were actually collected because results that look way too good to be true without evidence of dropped outliers and failed studies can only suggest data manipulation.

      Of course, I am hoping that sad cases like this one will teach everybody the valuable lesson that extreme use of QRPs is practically indistinguishable from data fabrication and may damage a researcher’s reputation in the long run.

      Based on your publications on this topic, I think we are in agreement that the best solution to this problem is to conduct more powerful studies that can provide strong evidence for theoretical claims.

      Sincerely, Dr. R

      Another important lesson from cases like this is the need for more transparency about the process of data collection (open science). As statistical methods for detecting unusual patterns in data become more widely known, fraudsters will get more sophisticated in the way they manipulate data.

      1. Your position is now a bit unclear to me: “critical remark” pointed out that QRPs are unlikely to explain JF’s results. Yet in your reply you state that cases like JF’s “will teach everybody the valuable lesson that extreme use of QRPs…” etc. So you surmise that JF did engage in (extreme use of) QRPs, even though JF emphatically denies this, and “critical remark” emphasizes that QRPs (the type mentioned in your post) cannot explain JF’s results?

  3. Dear Dr. Dolan,

    thank you for your comment. I am happy to clarify my position.

    My PERSONAL OPINION is that extreme use of QRPs (e.g., run 20 studies and publish 1 successful one) is scientific misconduct and should be treated like data fabrication and data manipulation. However, this is not the position taken by scientific organizations in psychology, where adding five data points that were never observed is considered scientific misconduct, but selectively removing 20 data points that were observed is not. I believe the main reason for this distinction is that it is harder to draw the line when existing data are removed. Sometimes this can be a reasonable way to clean data. So where does cleaning end, and where does cleansing start?

    The R-Index can be helpful because modest use of QRPs will only have a modest effect on the R-Index, whereas dramatic inflation of effect sizes will be punished by a low R-Index. Whether the field will adopt the R-Index or similar metrics to evaluate scientific integrity is an open question.

    My position regarding Jens Forster is unclear because the information about the case is unclear.

    There are claims that raw data are not available, and there are claims that statisticians looked at the SPSS data files and found that the reported statistics were reproduced when these data files were analyzed.

    There are claims that simulations show that QRPs alone cannot explain the amount of linearity, but these simulations have not been made available.

    There are claims that Jens Forster vehemently denied using QRPs, but in the personal communication with me he admitted to the use of QRPs (dropping failed studies) and he claimed that he mentioned this to the LOWI commission.

    In my opinion it is clear that the reported results are biased, but I personally have insufficient information to decide whether the data are just made up or whether the data are the result of QRPs.

    It is my understanding that this is an ongoing investigation and I hope that the truth will emerge. It would be valuable for everybody in this scientific community to find out whether this is a case of fabrication or whether careless use of QRPs can lead to dramatically implausible patterns in published results that may trigger an investigation when a complaint is being made.

    Sincerely, Dr. R

  4. The analysis by Dr. Schimmack is brilliant! Congratulations. I think that there were only a few people that truly believed in the 1:Xtrillion analysis… Why would an intelligent person like Forster make up linear data? Why would he fake data in a situation where he got everything already? Why would he fake at UvA, a place known for its strict statistics department. The accusation lacks a psychological basis.
    Dr. Forster answered to all the details in his blog http://www.socolab.de. It is obvious that many do not read them carefully. For example, he cannot be accused for throwing away material anymore if his boss suggested this to him. It is also obvious that he ran more studies and did not publish non significant results. This is what everybody does in psychology. It is not useful to repeat all the old arguments here that were addressed by Dr. Forster already.
    This new analysis presented by Dr. Schimmack is more enlightening than anything else I read before.

    1. In response to K. Meier’s comment:

      “I think that there were only a few people that truly believed in the 1:Xtrillion analysis…”

      Well, as best I can tell, Dr. Schimmack makes no attempt to deny the veracity of that analysis. Instead, he attempts to explain how this “unusual linearity” could have emerged.

      “For example, he cannot be accused for throwing away material anymore if his boss suggested this to him.”

      I think it is misleading to refer to a university department head as a boss. A tenured professor does not really have a boss, just like it is in Germany.

      1. Hello there,
        I agree that one should not repeat arguments again and again, but the University of Amsterdam should ask other colleagues at the department whether they also followed the head of department’s advice to throw away questionnaires.
        I have heard that this happened but the university does not say anything about this important aspect.
        It actually seems to me that they are silent on a lot of evidence that could be held in favor of Forster.
        Finally, one should not mix up university systems. At UvA the head of the department has much more weight than he or she would have at other universities (in Germany such positions do not even exist). And the university structure is much more hierarchical than in Germany. Note that for weeks, harsh student protests have taken place at the UvA. Students occupy the president’s office and threaten not to leave the building before the president’s team is kicked out. The reason for this protest is the lack of democracy at UvA. (see the http://www.volkskrant.nl coverage).
