Why Frontiers Should Retract Baumeister’s Critique of Carter’s Meta-Analysis

This blog post is heavily based on one of my first blog-posts in 2014 (Schimmack, 2014).  The blog post reports a meta-analysis of ego-depletion studies that used the hand-grip paradigm.  When I first heard about the hand-grip paradigm, I thought it was stupid because there is so much between-subject variance in physical strength.  However, then I learned that it is the only paradigm that uses a pre-post design, which removes between-subject variance from the error term. This made the hand-grip paradigm the most interesting paradigm because it has the highest power to detect ego-depletion effects.  I conducted a meta-analysis of the hand-grip studies and found clear evidence of publication bias.  This finding is very damaging to the wider ego-depletion research because other studies used between-subject designs with small samples which have very low power to detect small effects.

This prediction was confirmed in meta-analyses by Carter, Kofler, Forster, and McCullough (2015) that revealed publication bias in ego-depletion studies with other paradigms.

The results also explain why attempts to show ego-depletion effects with within-subject designs failed (Francis et al., 2018).  Within-subject designs increase power by removing fixed between-subject variance such as physical strength.  However, given the lack of evidence with the hand-grip paradigm, it is not surprising that within-subject designs also failed to show ego-depletion effects with other dependent variables.  Thus, these results further suggest that ego-depletion effects are too small to be used for experimental investigations of will-power.

Of course, Roy F. Baumeister doesn’t like this conclusion because his reputation is to a large extent based on the resource model of will-power.  His response to the finding that most of the published evidence rests on questionable research practices that produced illusory results has been to attack the critics (cf. Schimmack, 2019).

In 2016, he paid to publish a critique of Carter et al.’s (2015) meta-analysis in Frontiers in Psychology (Cunningham & Baumeister, 2016). In this article, the authors question the results of the bias tests, which reveal publication bias and suggest that there is no evidence for ego-depletion effects.

Unfortunately, Cunningham and Baumeister’s (2016) article is cited frequently as if it contained some valid scientific arguments.

For example, Christodoulou, Lac, and Moore (2017) cite the article to dismiss the results of a PEESE analysis that suggests publication bias is present and there is no evidence that infants can add and subtract. Thus, there is a real danger that meta-analysts will use Cunningham & Baumeister’s (2016) article to dismiss evidence of publication bias and to provide false evidence for claims that rest on questionable research practices.

Fact Checking Cunningham and Baumeister’s Criticisms

Cunningham and Baumeister (2016) claim that results from bias tests are difficult to interpret, but their criticism is based on false arguments and inaccurate claims.

Confusing Samples and Populations

This scientific-sounding paragraph is a load of bull. The authors claim that inferential tests require sampling from a population and raise a question about the adequacy of a sample. However, bias tests do not work this way. They are tests of the population, namely the population of all of the studies that could be retrieved that tested a common hypothesis (e.g., all handgrip studies of ego-depletion). Maybe more studies exist than are available. Maybe the results based on the available studies differ from results if all studies were available, but that is irrelevant. The question is only whether the available studies are biased or not. So, why do we even test for significance? That is a good question. The test for significance only tells us whether bias is merely a product of random chance or whether it was introduced by questionable research practices. However, even random bias is bias. If a set of studies reports only significant results, and the observed power of the studies is only 70%, there is a discrepancy. If this discrepancy is not statistically significant, there is still a discrepancy. If it is statistically significant, we are allowed to attribute it to questionable research practices such as those that Baumeister and several others admitted using.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication) (Schimmack, 2014).

Given the widespread use of questionable research practices in experimental social psychology, it is not surprising that bias-tests reveal bias. It is actually more surprising when these tests fail to reveal bias, which is most likely a problem of low statistical power (Renkewitz & Keiner, 2019).

Misunderstanding Power

The claims about power are not based on clearly defined constructs in statistics. Statistical power is a function of the strength of a signal (the population effect size) and the amount of noise (sampling error). Researchers’ skills are not a part of statistical power. Results should be independent of a researcher. A researcher could of course pick procedures that maximize a signal (powerful interventions) or reduce sampling error (e.g., pre-post designs), but these factors play a role in the designing of a study. Once a study is carried out, the population effect size is what it was and the sampling error is what it was. Thus, honestly reported test statistics tell us about the signal-to-noise ratio in a study that was conducted. Skillful researchers would produce stronger test-statistics (higher t-values, F-values) than unskilled researchers. The problem for Baumeister and other ego-depletion researchers is that the t-values and F-values tend to be weak and suggest that questionable research practices rather than skill produced significant results. In short, meta-analyses of test-statistics reveal whether researchers used skill or questionable research practices to produce significant results.

The reference to Morey (2013) suggests that there is a valid criticism of bias tests, but that is not the case. Power-based bias tests are based on sound statistical principles that were outlined by a statistician in the journal American Statistician (Sterling, Rosenbaum, & Weinkam, 1995). Building on this work, Jerry Brunner (professor of statistics) and I published theorems that provide the basis of bias tests like TES to reveal the use of questionable research practices (Brunner & Schimmack, 2019). The real challenge for bias tests is to estimate mean power without information about the population effect sizes. In this regard, TES is extremely conservative because it relies on a meta-analysis of observed effect sizes to estimate power. These effect sizes are inflated when questionable research practices were used, which makes the test conservative. However, there is a problem with TES when effect sizes are heterogeneous. This problem is avoided by alternative bias tests like the R-Index that I used to demonstrate publication bias in the handgrip studies of ego-depletion. In sum, bias tests like the R-Index and TES are based on solid mathematical foundations and simulation studies show that they work well in detecting the use of questionable research practices.

Confusing Absence of Evidence with Evidence of Absence

PET and PEESE are extensions of Egger’s regression test of publication bias. All methods relate sample sizes (or sampling error) to effect size estimates. Questionable research practices tend to introduce a negative correlation between sample size and effect sizes or a positive correlation between sampling error and effect sizes. The reason is that significance requires a signal-to-noise ratio of about 2:1 for t-tests or 4:1 for F-tests. To achieve this ratio with more noise (smaller sample, more sampling error), the signal has to be inflated more.

The novel contribution of PET and PEESE was to use the intercept of the regression model as an effect size estimate that corrects for publication bias. This estimate needs to be interpreted in the context of the sampling error of the regression model, using a 95%CI around the point estimate.
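To make the role of the intercept concrete, here is a minimal sketch of the two regressions in plain Python. It assumes the usual specification (PET regresses effect sizes on standard errors, PEESE on squared standard errors, both weighted by inverse variance); the function names and the toy data are hypothetical and are not taken from Carter et al. (2015). The sketch returns only point estimates and omits the confidence interval that is needed for interpretation.

```python
def wls_line(effects, predictors, std_errors):
    """Weighted least squares fit of effect = b0 + b1 * predictor.

    Weights are inverse sampling variances (1 / SE^2).
    Returns (intercept, slope).
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    w_sum = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, predictors)) / w_sum
    y_bar = sum(w * y for w, y in zip(weights, effects)) / w_sum
    sxy = sum(w * (x - x_bar) * (y - y_bar)
              for w, x, y in zip(weights, predictors, effects))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, predictors))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def pet(effects, std_errors):
    """PET: predictor is the standard error; intercept is the corrected estimate."""
    return wls_line(effects, std_errors, std_errors)


def peese(effects, std_errors):
    """PEESE: predictor is the sampling variance (SE^2)."""
    return wls_line(effects, [se ** 2 for se in std_errors], std_errors)


# Hypothetical data in which effect sizes grow with sampling error.
std_errors = [0.10, 0.20, 0.30, 0.40]
effects = [0.1 + 2.0 * se for se in std_errors]
intercept, slope = pet(effects, std_errors)
print(f"PET intercept = {intercept:.2f}, slope = {slope:.2f}")
```

A positive slope is the pattern described above: noisier studies report larger effects, and the intercept is the predicted effect for a study with no sampling error.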

Carter et al. (2015) found that the 95%CI often included a value of zero, which implies that the data are too weak to reject the null-hypothesis. Such non-significant results are notoriously difficult to interpret because they neither support nor refute the null-hypothesis. The main conclusion that can be drawn from this finding is that the existing data are inconclusive.

This main conclusion does not change when the number of studies is less than 20. Stanley and Doucouliagos (2014) were commenting on the trustworthiness of point estimates and confidence intervals in smaller samples. Smaller samples introduce more uncertainty and we should be cautious in the interpretation of results that suggest there is an effect because the assumptions of the model are violated. However, if the results already show that there is no evidence, small samples merely further increase uncertainty and make the existing evidence even less conclusive.

Aside from the issues regarding the interpretation of the intercept, Cunningham and Baumeister also fail to address the finding that sample sizes and effect sizes were negatively correlated. If this negative correlation is not caused by questionable research practices, it must be caused by something else. Cunningham and Baumeister fail to provide an answer to this important question.

No Evidence of Flair and Skill

Earlier Cunningham and Baumeister (2016) claimed that power depends on researchers’ skills and they argue that new investigators may be less skilled than the experts who developed paradigms like Baumeister and colleagues.

However, they then point out that Carter et al. (2015) examined lab as a moderator and found no difference between studies conducted by Baumeister and colleagues and studies conducted by other laboratories.

Thus, there is no evidence whatsoever that Baumeister and colleagues were more skillful and produced more credible evidence for ego-depletion than other laboratories. The fact that everybody got ego-depletion effects can be attributed to the widespread use of questionable research practices that made it possible to get significant results even for implausible phenomena like extrasensory perception (John et al., 2012; Schimmack, 2012). Thus, the large number of studies that support ego-depletion merely shows that everybody used questionable research practices like Baumeister did (Schimmack, 2014; Schimmack, 2016), which is also true for many other areas of research in experimental social psychology (Schimmack, 2019). Francis (2014) found that 80% of articles showed evidence that QRPs were used.

Handgrip Replicability Analysis

The meta-analysis included 18 effect sizes based on handgrip studies. Two unpublished studies (Ns = 24, 37) were not included in this analysis. Seeley and Gardner’s (2003) study was excluded because it failed to use a pre-post design, which could explain the non-significant result. The meta-analysis reported two effect sizes for this study. Thus, 4 effects were excluded and the analysis below is based on the remaining 14 studies.

All articles presented significant effects of will-power manipulations on handgrip performance. Bray et al. (2008) reported three tests; one was deemed not significant (p = .10), one marginally significant (p = .06), and one was significant at p < .05 (p = .01). The results from the lowest p-value were used. As a result, the success rate was 100%.

Median observed power was 63%. The inflation rate is 37% and the R-Index is 26%. An R-Index of 22% is consistent with a scenario in which the null-hypothesis is true and all reported findings are type-I errors. Thus, the R-Index supports Carter and McCullough’s (2014) conclusion that the existing evidence does not provide empirical support for the hypothesis that will-power manipulations lower performance on a measure of will-power.
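The arithmetic behind these numbers can be sketched in a few lines of Python. The function name is hypothetical; the definitions (inflation = success rate minus median observed power, R-Index = median observed power minus inflation) follow the description in the text.

```python
def r_index(success_rate, median_observed_power):
    """Return (inflation, R-Index), both as proportions.

    inflation = success rate - median observed power
    R-Index   = median observed power - inflation
    """
    inflation = success_rate - median_observed_power
    return inflation, median_observed_power - inflation


# Handgrip studies: 100% success rate, 63% median observed power.
inflation, index = r_index(1.00, 0.63)
print(f"inflation = {inflation:.0%}, R-Index = {index:.0%}")
```

Because the inflation is subtracted from observed power, every percentage point of unexplained success lowers the R-Index by one point.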

The R-Index can also be used to examine whether a subset of studies provides some evidence for the will-power hypothesis, but that this evidence is masked by the noise generated by underpowered studies with small samples. Only 7 studies had samples with more than 50 participants. The R-Index for these studies remained low (20%). Only two studies had samples with 80 or more participants. The R-Index for these studies increased to 40%, which is still insufficient to estimate an unbiased effect size.

One reason for the weak results is that several studies used weak manipulations of will-power (e.g., sniffing alcohol vs. sniffing water in the control condition). The R-Index of individual studies shows two studies with strong results (R-Index > 80). One study used a physical manipulation (standing on one leg). This manipulation may lower handgrip performance, but this effect may not reflect an influence on will-power. The other study used a mentally taxing (and boring) task that is not physically taxing as well, namely crossing out “e”s. This task seems promising for a replication study.

Power analysis with an effect size of d = .2 suggests that a serious empirical test of the will-power hypothesis requires a sample size of N = 300 (150 per cell) to have 80% power in a pre-post study of will-power.

[Figure: HandgripRindex]

Conclusion

Baumeister has lost any credibility as a scientist. He is pretending to engage in a scientific dispute about the validity of ego-depletion research, but he is ignoring the most obvious evidence that has accumulated during the past decade. Social psychologists have misused the scientific method and engaged in a silly game of producing significant p-values that support their claims. Data were never used to test predictions and studies that failed to support hypotheses were not published.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)

As a result, the published record lacks credibility and cannot be used to provide empirical evidence for scientific claims. Ego-depletion is a glaring example of everything that went wrong in experimental social psychology. This is not surprising because Baumeister and his students used questionable research practices more than other social psychologists (Schimmack, 2018). Now he is trying to repress this truth, which should not surprise any psychologist familiar with motivated biases and repressive coping. However, scientific journals should not publish his pathetic attempts to dismiss criticism of his work. Cunningham and Baumeister’s article provides not a single valid scientific argument. Frontiers in Psychology should retract the article.

References

Carter, E. C., Kofler, L. M., Forster, D. E., & McCullough, M. E. (2015). A series of meta-analytic tests of the depletion effect: Self-control does not seem to rely on a limited resource. Journal of Experimental Psychology: General, 144, 796–815. doi:10.1037/xge0000083

Using the R-index to detect questionable research practices in SSRI studies

Amna Shakil and Ulrich Schimmack

Turner and colleagues (2008) examined the presence of publication bias in clinical trials of antidepressants. They found that out of 74 FDA-registered studies, 51% showed positive results. However, positive results were much more likely to be published, as 94% of the published results were positive. There were two reasons for the inflated percentage of positive results. First, negative results were not published. Second, negative results were published as positive results. Turner and colleagues’ (2008) results received a lot of attention and cast doubt on the effectiveness of antidepressants.

A year after Turner and colleagues (2008) published their study, Moreno, Sutton, Turner, Abrams, Cooper and Palmer (2009) examined the influence of publication bias on the effect-size estimate in clinical trials of antidepressants. They found no evidence of publication bias in the FDA-registered trials, leading the researchers to conclude that the FDA data provide an unbiased gold standard to examine biases in the published literature.

The effect size for treatment with anti-depressants in the FDA data was g = 0.31, 95% confidence interval 0.27 to 0.35. In contrast, the uncorrected average effect size in the published studies was g = 0.41, 95% confidence interval 0.37 to 0.45. This finding shows that publication bias inflates effect size estimates by 32% ((0.41 – 0.31)/0.31).

Moreno et al. (2009) also used regression analysis to obtain a corrected effect size estimate based on the biased effect sizes in the published literature. In this method, effect sizes are regressed on sampling error under the assumption that studies with smaller samples (and larger sampling error) have more bias. The intercept is used as an estimate of the population effect size when sampling error is zero. This correction method yielded an effect size estimate of g = 0.29, 95% confidence interval 0.23 to 0.35, which is similar to the gold standard estimate (.31).

The main limitation of the regression method is that other factors can produce a correlation between sample size and effect size (e.g., higher quality studies are more costly and use smaller samples). To avoid this problem, we used an alternative correction method that does not make this assumption.

The method uses the R-Index to examine bias in a published data set. The R-Index increases as statistical power increases and it decreases when publication bias is present. To obtain an unbiased effect size estimate, studies are selected to maximize the R-Index.

Since the actual data files were not available, graphs A and B from Moreno et al.’s (2009) study were used to obtain information about the effect sizes and sampling error of all the FDA-registered studies and the published journal articles.

The FDA-registered studies had a success rate of 53% and an observed power of 56%, resulting in an inflation rate close to 0. The close match between the success rate and observed power confirms that the FDA studies are not biased. Given the lack of bias (inflation), the most accurate estimate of the effect size is obtained by using all studies.

The published journal articles had a success rate of 86% and the observed power of 73%, resulting in the inflation rate of 12%. The inflation rate of 12% confirms that the published data set is biased. The R-Index subtracts the inflation rate from observed power to correct for inflation. Thus, the R-Index for the published studies is 73-12 = 61. The weighted effect size estimate was d = .40.

The next step was to select sets of studies to maximize the R-Index. As most studies were significant, the success rate could not change much. As a result, most of the increase would be achieved by selecting studies with larger sample sizes in order to increase power. The maximum R-Index was obtained for a cut-off point of N = 225. This left 14 studies with a total sample size of 4,170 participants. The success rate was 100% with median observed power of 85%. The inflation rate was still 15%, but the R-Index was higher than it was for the full set of studies (70 vs. 61). The weighted average effect size in the selected set of powerful studies was d = .34. This result is very similar to the gold standard in the FDA data. The small discrepancy can be attributed to the fact that even studies with 85% power still have a small bias in the estimation of the true effect size.
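The selection step can be sketched as a simple search over sample-size cut-offs. This is only an illustration of the idea: the post does not spell out the exact search procedure, so the function names, the `min_studies` guard, and the five hypothetical studies below are assumptions, not the antidepressant data.

```python
from statistics import median


def r_index(significant, observed_power):
    """R-Index for a set of studies: median observed power minus inflation."""
    success_rate = sum(significant) / len(significant)
    mop = median(observed_power)
    return mop - (success_rate - mop)


def best_cutoff(studies, min_studies=2):
    """Try each observed sample size as a cut-off and keep studies with N >= cut-off.

    studies: list of (sample_size, significant, observed_power) tuples.
    Returns (cutoff, r_index, number_of_studies) for the highest R-Index.
    """
    best = None
    for cutoff in sorted({n for n, _, _ in studies}):
        subset = [s for s in studies if s[0] >= cutoff]
        if len(subset) < min_studies:  # avoid picking a single study
            continue
        value = r_index([s[1] for s in subset], [s[2] for s in subset])
        if best is None or value > best[1]:
            best = (cutoff, value, len(subset))
    return best


# Hypothetical studies: (N, significant?, observed power).
studies = [
    (40, True, 0.52),
    (50, True, 0.55),
    (60, True, 0.58),
    (230, True, 0.84),
    (300, True, 0.88),
]
cutoff, value, k = best_cutoff(studies)
print(f"cut-off N = {cutoff}, R-Index = {value:.2f}, studies kept = {k}")
```

With an all-significant literature, the success rate stays at 100%, so the R-Index can only rise by dropping the low-powered studies, which is the pattern described above.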

In conclusion, our alternative effect size estimation procedure confirms Moreno et al.’s (2009) results using an alternative bias-correction method and shows that the R-Index can be a valuable tool to detect and correct for publication bias in other meta-analyses.

These results have important practical implications. The R-Index confirms that published clinical trials are biased and can provide false information about the effectiveness of drugs. It is therefore important to ensure that clinical trials are preregistered and that all results of clinical trials are published. The R-Index can be used to detect violations of these practices that lead to biased evidence. Another important finding is that clinical trials of antidepressants do show effectiveness and that antidepressants can be used as effective treatments of depression. The presence of publication bias should not be used to claim that antidepressants lack effectiveness.

References

Moreno, S. G., Sutton, A. J., Turner, E. H., Abrams, K. R., Cooper, N. J., Palmer, T. M., & Ades, A. E. (2009). Novel methods to deal with publication biases: secondary analysis of antidepressant trials in the FDA trial registry database and related journal publications. BMJ, 339, b2981.

Turner, E. H., Matthews, A. M., Linardatos, E., Tell, R. A., & Rosenthal, R. (2008). Selective publication of antidepressant trials and its influence on apparent efficacy. New England Journal of Medicine, 358(3), 252-260.

R-INDEX BULLETIN (RIB): Share the Results of your R-Index Analysis with the Scientific Community

R-Index Bulletin

The world of scientific publishing is changing rapidly and there is a growing need to share scientific information as quickly and cheaply as possible.

Traditional journals with pre-publication peer-review are too slow and focussed on major ground-breaking discoveries.

Open-access journals can be expensive.

R-Index Bulletin offers a new opportunity to share results with the scientific community quickly and free of charge.

R-Index Bulletin also avoids the problems of pre-publication peer-review by moving to a post-publication peer-review process. Readers are welcome to comment on posted contributions and to post their own analyses. This process ensures that scientific disputes and their resolution are open and part of the scientific process.

For the time being, submissions can be uploaded as comments to this blog. In the future, R-Index Bulletin may develop into a free online journal.

A submission should contain a brief description of the research question (e.g., what is the R-Index of studies on X, by X, or in the journal X?), the main statistical results (median observed power, success rate, inflation rate, R-Index) and a brief discussion of the implications of the analysis. There is no page restriction and analyses of larger data sets can include moderator analysis. Inclusion of other bias tests (Egger’s regression, TIVA, P-Curve, P-Uniform) is also welcome.

If you have conducted an R-Index analysis, please submit it to R-Index Bulletin to share your findings.

Submissions can be made anonymously or with an author’s name.

Go ahead and press the “Leave a comment” or “Leave a reply” button or scroll to the bottom of the page and paste your results in the “Leave a reply” box.

Questionable Research Practices: Definition, Detect, and Recommendations for Better Practices

How Power Analysis Could Have Prevented the Sad Story of Dr. Förster

[further information can be found in a follow up blog]

Background

In 2011, Dr. Förster published an article in Journal of Experimental Psychology: General. The article reported 12 studies and each study reported several hypothesis tests. The abstract reports that “In all experiments, global/local processing in 1 modality shifted to global/local processing in the other modality”.

For a while this article was just another article that reported a large number of studies that all worked and neither reviewers nor the editor who accepted the manuscript for publication found anything wrong with the reported results.

In 2012, an anonymous letter voiced suspicion that Jens Förster had committed scientific misconduct. The allegation led to an investigation, but as of today (January 1, 2015) there is no satisfactory account of what happened. Jens Förster maintains that he is innocent (5b. Brief von Jens Förster vom 10. September 2014) and blames the accusations about scientific misconduct on a climate of hypervigilance after the discovery of scientific misconduct by another social psychologist.

The Accusation

The accusation is based on an unusual statistical pattern in three publications. The 3 articles reported 40 experiments with 2284 participants, that is, an average sample size of N = 57 participants in each experiment. The 40 experiments all had a between-subject design with three groups: one group received a manipulation designed to increase scores on the dependent variable. A second group received the opposite manipulation to decrease scores on the dependent variable. And a third group served as a control condition with the expectation that the average of the group would fall in the middle of the two other groups. To demonstrate that both manipulations have an effect, both experimental groups have to show significant differences from the control group.

The accuser noticed that the reported means were unusually close to a linear trend. This means that the two experimental conditions showed markedly symmetrical deviations from the control group. For example, if one manipulation increased scores on the dependent variables by half a standard deviation (d = +.5), the other manipulation decreased scores on the dependent variable by half a standard deviation (d = -.5). Such a symmetrical pattern can be expected when the two manipulations are equally strong AND WHEN SAMPLE SIZES ARE LARGE ENOUGH TO MINIMIZE RANDOM SAMPLING ERROR. However, the sample sizes were small (n = 20 per condition, N = 60 per study). These sample sizes are not unusual and social psychologists often use n = 20 per condition to plan studies. However, these sample sizes have low power to produce consistent results across a large number of studies.

The accuser computed the statistical probability of obtaining the reported linear trend. The probability of obtaining the picture-perfect pattern of means by chance alone was incredibly small.

Based on this finding, the Dutch National Board for Research Integrity (LOWI) started an investigation of the causes for this unlikely finding. An English translation of the final report was published on Retraction Watch. An important question was whether the reported results could have been obtained by means of questionable research practices or whether the statistical pattern can only be explained by data manipulation. The English translation of the final report includes two relevant passages.

According to one statistical expert “QRP cannot be excluded, which in the opinion of the expert is a common, if not “prevalent” practice, in this field of science.” This would mean that Dr. Förster acted in accordance with scientific practices and that his behavior would not constitute scientific misconduct.

In response to this assessment the Complainant “extensively counters the expert’s claim that the unlikely patterns in the experiments can be explained by QRP.” This led to the decision that scientific misconduct occurred.

Four QRPs were considered.

  1. Improper rounding of p-values. This QRP can only be used rarely when p-values happen to be close to .05. It is correct that this QRP cannot produce highly unusual patterns in a series of replication studies. It can also be easily checked by computing exact p-values from reported test statistics.
  2. Selecting dependent variables from a set of dependent variables. The articles in question reported several experiments that used the same dependent variable. Thus, this QRP cannot explain the unusual pattern in the data.
  3. Collecting additional research data after an initial research finding revealed a non-significant result. This description of a QRP is ambiguous. Presumably it refers to optional stopping, that is, continuing data collection when the data trend in the right direction, checking p-values repeatedly, and stopping as soon as the p-value is significant. This practice leads to random variation in sample sizes. However, studies in the reported articles all have more or less 20 participants per condition. Thus, optional stopping can be ruled out. However, if a condition with 20 participants does not produce a significant result, it could simply be discarded, and another condition with 20 participants could be run. With a false-positive rate of 5%, this procedure will eventually yield the desired outcome while holding sample size constant. It seems implausible that Dr. Förster conducted 20 studies to obtain a single significant result. Thus, it is even more plausible that the effect is actually there, but that studies with n = 20 per condition have low power. If power were just 30%, the effect would be significant in about every third study, and only 60 participants would be needed to produce a significant result in one out of three studies. The report provides insufficient information to rule out this QRP, although it is well-known that excluding failed studies is a common practice in all sciences.
  4. Selectively and secretly deleting data of participants (i.e., outliers) to arrive at significant results. The report provides no explanation of how this QRP can be ruled out. Simmons, Nelson, and Simonsohn (2011) demonstrated that conducting a study with 37 participants and then deleting data from 17 participants can contribute to a significant result when the null-hypothesis is true. However, if an actual effect is present, fewer participants need to be deleted to obtain a significant result. If the original sample size is large enough, it is always possible to delete cases to end up with a significant result. Of course, at some point selective and secretive deletion of observations is just data fabrication. Rather than making up data, actual data from participants are deleted to end up with the desired pattern of results. However, without information about the true effect size, it is difficult to determine whether an effect was present and just embellished (see Fisher’s analysis of Mendel’s famous genetics studies) or whether the null-hypothesis is true.
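The reasoning under the third QRP (discarding failed conditions and rerunning them) can be made concrete with a small sketch. It assumes independent attempts with a constant probability of a significant result, so the number of attempts until the first success follows a geometric distribution with mean 1/p. The function names are hypothetical.

```python
def expected_attempts(p_significant):
    """Mean number of independent attempts until the first significant result."""
    return 1.0 / p_significant


def expected_participants(p_significant, n_per_attempt=20):
    """Mean number of participants per condition used up until the first success."""
    return expected_attempts(p_significant) * n_per_attempt


for label, p in [("null is true (alpha = .05)", 0.05), ("true effect, 30% power", 0.30)]:
    print(f"{label}: {expected_attempts(p):.1f} attempts, "
          f"{expected_participants(p):.0f} participants per condition")
```

Under the null hypothesis, this amounts to 20 attempts on average, which is why the file-drawer explanation is far more plausible when a real but underpowered effect exists.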

The English translation of the report does not contain any statements about questionable research practices from Dr. Förster. In an email communication on January 2, 2014, Dr. Förster revealed that he in fact ran multiple studies, some of which did not produce significant results, and that he only reported his best studies. He also mentioned that he openly admitted to this common practice to the commission. The English translation of the final report does not mention this fact. Thus, it remains an open question whether QRPs could have produced the unusual linearity in Dr. Förster’s studies.

A New Perspective: The Curse of Low Powered Studies

One unresolved question is why Dr. Förster would manipulate data to produce a linear pattern of means that he did not even mention in his articles. (Discover magazine).

One plausible answer is that the linear pattern is the by-product of questionable research practices to claim that two experimental groups with opposite manipulations are both significantly different from a control group. To support this claim, the articles always report contrasts of the experimental conditions and the control condition (see Table below).

[Table image: ForsterTable]

In Table 1 the results of these critical tests are reported with subscripts next to the reported means. As the direction of the effect is theoretically determined, a one-tailed test was used. The null-hypothesis was rejected when p < .05.

Table 1 reports 9 comparisons of global processing conditions and control groups and 9 comparisons of local processing conditions with a control group; a total of 18 critical significance tests. All studies had approximately 20 participants per condition. The average effect size across the 18 studies is d = .71 (median d = .68).   An a priori power analysis with d = .7, N = 40, and significance criterion .05 (one-tailed) gives a power estimate of 69%.
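The a priori power estimate can be checked roughly by simulation. The sketch below is not the calculation used for the post; it assumes normally distributed scores with equal variances, a pooled-variance t-test, and a hard-coded one-tailed critical value of t = 1.686 for df = 38, which is only valid for n = 20 per group. The function name and the seed are arbitrary.

```python
import random

T_CRIT_DF38 = 1.686  # one-tailed critical t for alpha = .05 and df = 38 (n = 20 per group)


def simulate_power(d, n_per_group=20, reps=10000, t_crit=T_CRIT_DF38, seed=1):
    """Estimate power of a one-tailed two-sample t-test by Monte Carlo simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(d, 1.0) for _ in range(n_per_group)]
        m_c = sum(control) / n_per_group
        m_t = sum(treated) / n_per_group
        var_c = sum((x - m_c) ** 2 for x in control) / (n_per_group - 1)
        var_t = sum((x - m_t) ** 2 for x in treated) / (n_per_group - 1)
        pooled_var = (var_c + var_t) / 2  # equal group sizes
        se = (2 * pooled_var / n_per_group) ** 0.5
        if (m_t - m_c) / se > t_crit:
            hits += 1
    return hits / reps


print(f"simulated power for d = .7: {simulate_power(0.7):.2f}")
```

The same function can be called with other effect sizes (for example `simulate_power(0.5)`) to see how quickly power drops with n = 20 per condition.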

An alternative approach is to compute observed power for each study and to use median observed power (MOP) as an estimate of true power. This approach is more appropriate when effect sizes vary across studies. In this case, it leads to the same conclusion, MOP = 67%.

The MOP estimate of power implies that a set of 100 tests is expected to produce 67 significant results and 33 non-significant results. For a set of 18 tests, the expected values are 12.1 significant results and 5.9 non-significant results.

The actual success rate should be easy to infer from Table 1, but there are some inaccuracies in the subscripts. For example, Study 1a shows no significant difference between means of 38 and 31 (d = .60), but it shows a significant difference between means of 31 and 27 (d = .33). Most likely the subscript for the control condition should be c, not a.

Based on the reported means and standard deviations, the actual success rate with N = 40 and p < .05 (one-tailed) is 83% (15 significant and 3 non-significant results).

The actual success rate (83%) is higher than one would expect based on MOP (67%). This inflation in the success rate suggests that the reported results are biased in favor of significant results (the reasons for this bias are irrelevant for the following discussion, but it could be produced by not reporting studies with non-significant results, which would be consistent with Dr. Förster’s account).

The R-Index was developed to correct for this bias. The R-Index subtracts the inflation rate (83% – 67% = 16%) from MOP. For the data in Table 1, the R-Index is 51% (67% – 16%).
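The arithmetic behind the R-Index is simple enough to write down as a small function. The sketch below only restates the definition given above (inflation = success rate minus MOP; R-Index = MOP minus inflation). The function name and the dictionary keys are illustrative choices of mine.

```python
def r_index(n_significant, n_tests, median_observed_power):
    """Compute success rate, inflation rate, and R-Index for a set of tests."""
    success_rate = n_significant / n_tests
    # Inflation: how far the success rate exceeds median observed power.
    inflation = success_rate - median_observed_power
    return {
        "success_rate": success_rate,
        "inflation": inflation,
        # The R-Index subtracts the inflation from median observed power.
        "r_index": median_observed_power - inflation,
    }


# Values from Table 1: 15 of 18 tests significant, MOP = .67
result = r_index(15, 18, 0.67)
print({name: round(value, 2) for name, value in result.items()})
```

For the Table 1 values this yields an R-Index of about .51, matching the 51% computed by hand.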

Given the use of a between-subject design and approximately equal sample sizes in all studies, the inflation in power can be used to estimate inflation of effect sizes. A study with N = 40 and p < .05 (one-tailed) has 50% power when d = .50.

Thus, one interpretation of the results in Table 1 is that the true effect size of the manipulation is d = .5, that 9 out of 18 tests should have produced a significant contrast at p < .05 (one-tailed) and that questionable research practices were used to increase the success rate from 50% to 83% (15 vs. 9 successes).

The use of questionable research practices would also explain unusual linearity in the data. Questionable research practices will increase or omit effect sizes that are insufficient to produce a significant result. With a sample size of N = 40, an effect size of d = .5 is insufficient to produce a significant result, d = .5, se = .32, t(38) = 1.58, p = .06 (one-tailed). Random sampling error that works against the hypothesis can only produce non-significant results that have to be dropped or moved upwards using questionable methods. Random error that favors the hypothesis will inflate the effect size and start producing significant results. However, random error is normally distributed around the true effect size and is more likely to produce results that are just significant (d = .8) than to produce results that are very significant (d = 1.5). Thus, the reported effect sizes will be clustered more closely around the median inflated effect size than one would expect based on an unbiased sample of effect sizes.

The clustering of effect sizes will happen for the positive effects in the global processing condition and for the negative effects in the local processing condition. As a result, the pattern of all three means will be more linear than an unbiased set of studies would predict. In a large set of studies, this bias will produce a very low p-value.

One way to test this hypothesis is to examine the variability in the reported results. The Test of Insufficient Variance (TIVA) was developed for this purpose. TIVA first converts p-values into z-scores. The variance of z-scores is known to be 1. Thus, a representative sample of z-scores should have a variance of 1, but questionable research practices lead to a reduction in variance. The probability that a set of z-scores is a representative set of z-scores can be computed with a chi-square test and chi-square is a function of the ratio of the expected and observed variance and the number of studies. For the set of studies in Table 1, the variance in z-scores is .33. The chi-square value is 54. With 17 degrees of freedom, the p-value is 0.00000917 and the odds of this event occurring by chance are 1 out of 109,056 times.
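The first steps of TIVA can be sketched as follows. This is only an illustration under stated assumptions: p-values are treated as one-tailed, the chi-square statistic is computed as degrees of freedom times the ratio of expected to observed variance (my reading of the description above), and the p-values in the example are hypothetical, not the values from Table 1. Turning the statistic into a probability requires a chi-square distribution function from a statistics library, which is left out here.

```python
from statistics import NormalDist, variance


def tiva_statistic(p_values, expected_variance=1.0):
    """Return observed z-score variance, chi-square statistic, and df.

    Assumptions: one-tailed p-values; chi-square taken as
    df * expected variance / observed variance.
    """
    norm = NormalDist()
    # Convert each p-value into the z-score that cuts off that upper tail.
    z_scores = [norm.inv_cdf(1 - p) for p in p_values]
    # Sample variance of the z-scores (n - 1 in the denominator).
    observed_variance = variance(z_scores)
    df = len(z_scores) - 1
    chi_square = df * expected_variance / observed_variance
    return observed_variance, chi_square, df


# Hypothetical p-values that all fall just below .05
example_p = [0.04, 0.03, 0.045, 0.02, 0.01, 0.035]
obs_var, chi_sq, df = tiva_statistic(example_p)
print(round(obs_var, 3), round(chi_sq, 1), df)
```

When p-values bunch up just below the significance criterion, the observed variance falls well below 1 and the statistic grows large.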

Conclusion

Previous discussions about abnormal linearity in Dr. Förster’s studies have failed to provide a satisfactory answer. An anonymous accuser claimed that the data were fabricated or manipulated, which the author vehemently denies. This blog proposes a plausible explanation of what could have [edited January 19, 2015] happened. Dr. Förster may have conducted more studies than were reported and included only studies with significant results in his articles. Slight variation in sample sizes suggests that he may also have removed a few outliers selectively to compensate for low power. Importantly, neither of these practices would imply scientific misconduct. The conclusion of the commission that scientific misconduct occurred rests on the assumption that QRPs cannot explain the unusual linearity of means, but this blog points out how selective reporting of positive results may have inadvertently produced this linear pattern of means. Thus, the present analysis supports the conclusion by an independent statistical expert mentioned in the LOWI report: “QRP cannot be excluded, which in the opinion of the expert is a common, if not “prevalent” practice, in this field of science.”

How Unusual is an R-Index of 51?

The R-Index for the 18 statistical tests reported in Table 1 is 51% and TIVA confirms that the reported p-values have insufficient variance. Thus, it is highly probable that questionable research practices contributed to the results and in a personal communication Dr. Förster confirmed that additional studies with non-significant results exist. However, in response to further inquiries [see follow up blog] Dr. Förster denied having used QRPs that could explain the linearity in his data.

Nevertheless, an R-Index of 51% is not unusual and has been explained with the use of QRPs. For example, the R-Index for a set of studies by Roy Baumeister was 49%, and Roy Baumeister stated that QRPs were used to obtain significant results.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.”

Sadly, it is quite common to find an R-Index of 50% or lower for prominent publications in social psychology. This is not surprising because questionable research practices were considered good practices until recently. Even at present, it is not clear whether these practices constitute scientific misconduct (see discussion in Dialogue, Newsletter of the Society for Personality and Social Psychology).

How to Avoid Similar Sad Stories in the Future

One way to avoid accusations of scientific misconduct is to conduct a priori power analyses and to conduct only studies with a realistic chance to produce a significant result when the hypothesis is correct. When random error is small, true patterns in data can emerge without the help of QRPs.

Another important lesson from this story is to reduce the number of statistical tests as much as possible. Table 1 reported 18 statistical tests with the aim to demonstrate significance in each test. Even with a liberal criterion of .1 (one-tailed), it is highly unlikely that so many tests will all produce significant results. Thus, a non-significant result is likely to emerge and researchers should think ahead of time how they would deal with non-significant results.

For the data in Table 1, Dr. Förster could have reported the means of 9 small studies without significance tests and conducted significance tests only once for the pattern in all 9 studies. With a total sample size of 360 participants (9 * 40), this test would have 90% power even if the effect size is only d = .35. With 90% power, the total power to obtain significant differences from the control condition for both manipulations would be 81%. Thus, the same amount of resources that were used for the controversial findings could have been used to conduct a powerful empirical test of theoretical predictions without the need to hide inconclusive, non-significant results in studies with low power.
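A rough check of these numbers is possible with a normal approximation. The sketch assumes the 360 participants are split evenly between one experimental condition and the control condition (180 each) and that a two-tailed criterion of .05 is used; the text does not state the criterion, so this is my assumption. The joint power treats the two contrasts as independent, as the calculation in the text does.

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

d = 0.35
n_per_group = 180  # assumption: 360 participants split evenly across two conditions

# Noncentrality of the pooled test: effect size over its standard error.
ncp = d * sqrt(n_per_group / 2)
# Assumption: two-tailed test at alpha = .05, normal approximation.
power_single = norm.cdf(ncp - norm.inv_cdf(0.975))

# Joint power for two contrasts at 90% power each, treated as independent.
power_both = 0.90 * 0.90

print(round(power_single, 2), round(power_both, 2))
```

Under these assumptions the pooled test has power of roughly 90%, and two such tests together succeed about 81% of the time.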

Jacob Cohen tried to teach psychologists the importance of statistical power for decades, and psychologists stubbornly ignored his valuable contribution to research methodology until he died in 1998. Methodologists have been mystified by the refusal of psychologists to increase power in their studies (Maxwell, 2004).

One explanation is that small samples provided a huge incentive. A non-significant result can be discarded with little cost of resources, whereas a significant result can be published and have the additional benefit of an inflated effect size, which allows boosting the importance of published results.

The R-Index was developed to balance the incentive structure towards studies with high power. A low R-Index reveals that a researcher is reporting biased results that will be difficult to replicate by other researchers. The R-Index reveals this inconvenient truth and lowers excitement about incredible results that are indeed incredible. The R-Index can also be used by researchers to control their own excitement about results that are mostly due to sampling error and to curb the excitement of eager research assistants that may be motivated to bias results to please a professor.

Curbed excitement does not mean that the R-Index makes science less exciting. Indeed, it will be exciting when social psychologists start reporting credible results about social behavior that boast a high R-Index because for a true scientist nothing is more exciting than the truth.

A Playful Way to Learn about Power, Publication Bias, and the R-Index: Simulate questionable research methods and see what happens.

This blog introduces a simple excel spreadsheet that simulates the effect of excluding non-significant results from an unbiased set of studies.

The results in the leftmost column show the results for an unbiased set of 100 studies (N = 100, dropped = 0). The power value is used to compute the observed power in the 100 studies based on a normal distribution around the non-centrality parameter corresponding to the power value (e.g., power = .50, ncp = 1.96).

For an unbiased set of studies, median observed power is equivalent to the success rate (percentage of significant results) in a set of studies. For example, with 50% power, the median observed ncp is 1.96, which is equivalent to the true ncp of 1.96 that corresponds to 50% power. In this case, the success rate is 50%. As the success rate is equivalent to median observed power, there is no inflation in the success rate and the inflation rate is 0. As a result, the R-Index is equivalent to median observed power and success rate. R-Index = Median Observed Power – Inflation Rate; .50 = .50 – 0.

Moving to the right, studies with the lowest observed ncp values (equivalent to the highest p-values) are dropped in sets of 5 studies. However, you can make changes to the way results are excluded or altered to simulate questionable research practices. When non-significant studies are dropped, median observed power and success rate increase. Eventually, the success rate increases faster than median observed power, leading to a positive inflation rate. As the inflation rate is subtracted from median observed power, the R-Index starts to correct for publication bias. For example, in the example with 50% true power, median observed power is inflated to 63% by dropping 25 non-significant results. The success rate is 67%, the inflation rate is 4% and the R-Index is 59%. Thus, the R-Index still overestimates true power by 9%, but it provides a better estimate of true power than median observed power without a correction (63%).

An important special case is the scenario where all non-significant results are dropped. This scenario is automatically highlighted with orange cells for the number of studies and success rate. With 50% true power, the event occurs when 50% of the studies are dropped. In this scenario, median observed power is 76%, the success rate is 100%, inflation rate is 24% and the R-Index is 51%. These values are slightly different from more exact simulations which show 75% median observed power, 25% inflation rate and an R-Index of 50%.
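The logic of the spreadsheet can also be written as a short function. This is not the spreadsheet itself but a large-sample stand-in: it assumes observed test statistics are normally distributed around the true ncp and that the studies with the lowest values are dropped first. The name selection_summary and its arguments are my own.

```python
from statistics import NormalDist


def selection_summary(true_power, fraction_dropped):
    """Median observed power, success rate, inflation, and R-Index
    after dropping the given fraction of studies with the lowest z-values.

    Assumption: infinitely many studies, z ~ Normal(ncp, 1).
    """
    norm = NormalDist()
    kept = 1 - fraction_dropped
    # Share of significant results among retained studies (capped at 100%).
    success_rate = min(1.0, true_power / kept)
    # Quantile of the full distribution where the median retained study sits.
    median_quantile = fraction_dropped + kept / 2
    # Observed power of that study; the critical value cancels out because
    # ncp = critical value + inv_cdf(true_power).
    mop = norm.cdf(norm.inv_cdf(true_power) + norm.inv_cdf(median_quantile))
    inflation = success_rate - mop
    return mop, success_rate, inflation, mop - inflation


for dropped in (0.0, 0.25, 0.5):
    mop, success, inflation, r = selection_summary(0.5, dropped)
    print(dropped, round(mop, 3), round(success, 3), round(inflation, 3), round(r, 3))
```

For 50% true power, dropping nothing leaves all values at .50, dropping a quarter of the studies gives a median observed power of about .63 with an R-Index near .58, and dropping half gives .75 and .50, in line with the more exact values mentioned above.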

The table below lists the results for different levels of true power when all non-significant results are dropped. The scenario with 5% power implies that the null-hypothesis is true, but that 5% of significant results are obtained due to sampling error alone.

True Power    MOP    IR    R-Index
 5%           66     34    32
30%           70     30    40
50%           75     25    50
65%           80     20    60
80%           87     13    73
95%           96      4    91
All values are percentages. Success rate is fixed at 100%; MOP = median observed power; IR = inflation rate.

The results show that the R-Index tracks observed power, but it is not an unbiased estimate of true power. In real data the process that leads to bias is unknown and it is impossible to obtain an unbiased estimate of true power from a biased set of studies. This is the reason why it is important to eliminate biases in publications as much as possible. However, the R-Index provides some useful information about the true power and replicability in a biased set of studies.

Simulation R-Index [click on link to download spreadsheet]