Tag Archives: replicability

How Power Analysis Could Have Prevented the Sad Story of Dr. Förster

[further information can be found in a follow up blog]

Background

In 2011, Dr. Förster published an article in Journal of Experimental Psychology: General. The article reported 12 studies and each study reported several hypothesis tests. The abstract reports that “In all experiments, global/local processing in 1 modality shifted to global/local processing in the other modality”.

For a while this article was just another article that reported a large number of studies that all worked and neither reviewers nor the editor who accepted the manuscript for publication found anything wrong with the reported results.

In 2012, an anonymous letter voiced suspicion that Jens Forster violated rules of scientific misconduct. The allegation led to an investigation, but as of today (January 1, 2015) there is no satisfactory account of what happened. Jens Förster maintains that he is innocent (5b. Brief von Jens Förster vom 10. September 2014) and blames the accusations about scientific misconduct on a climate of hypervigilance after the discovery of scientific misconduct by another social psychologist.

The Accusation

The accusation is based on an unusual statistical pattern in three publications. The 3 articles reported 40 experiments with 2284 participants, that is an average sample size of N = 57 participants in each experiment. The 40 experiments all had a between-subject design with three groups: one group received a manipulation design to increase scores on the dependent variable. A second group received the opposite manipulation to decrease scores on the dependent variable. And a third group served as a control condition with the expectation that the average of the group would fall in the middle of the two other groups. To demonstrate that both manipulations have an effect, both experimental groups have to show significant differences from the control group.

The accuser noticed that the reported means were unusually close to a linear trend. This means that the two experimental conditions showed markedly symmetrical deviations from the control group. For example, if one manipulation increased scores on the dependent variables by half a standard deviation (d = +.5), the other manipulation decreased scores on the dependent variable by half a standard deviation (d = -.5). Such a symmetrical pattern can be expected when the two manipulations are equally strong AND WHEN SAMPLE SIZES ARE LARGE ENOUGH TO MINIMIZE RANDOM SAMPLING ERROR. However, the sample sizes were small (n = 20 per condition, N = 60 per study). These sample sizes are not unusual and social psychologists often use n = 20 per condition to plan studies. However, these sample sizes have low power to produce consistent results across a large number of studies.

The accuser computed the statistical probability of obtaining the reported linear trend. The probability of obtaining the picture-perfect pattern of means by chance alone was incredibly small.

Based on this finding, the Dutch National Board for Research Integrity (LOWI) started an investigation of the causes for this unlikely finding. An English translation of the final report was published on retraction watch. An important question was whether the reported results could have been obtained by means of questionable research practices or whether the statistical pattern can only be explained by data manipulation. The English translation of the final report includes two relevant passages.

According to one statistical expert “QRP cannot be excluded, which in the opinion of the expert is a common, if not “prevalent” practice, in this field of science.” This would mean that Dr. Förster acted in accordance with scientific practices and that his behavior would not constitute scientific misconduct.

In response to this assessment the Complainant “extensively counters the expert’s claim that the unlikely patterns in the experiments can be explained by QRP.” This led to the decision that scientific misconduct occurred.

Four QRPs were considered.

  1. Improper rounding of p-values. This QRP can only be used rarely when p-values happen to be close to .05. It is correct that this QRP cannot produce highly unusual patterns in a series of replication studies. It can also be easily checked by computing exact p-values from reported test statistics.
  2. Selecting dependent variables from a set of dependent variables. The articles in question reported several experiments that used the same dependent variable. Thus, this QRP cannot explain the unusual pattern in the data.
  3. Collecting additional research data after an initial research finding revealed a non-significant result. This description of an QRP is ambiguous. Presumably it refers to optional stopping. That is, when the data trend in the right direction to continue data collection with repeated checking of p-values and stopping when the p-value is significant. This practices lead to random variation in sample sizes. However, studies in the reported articles all have more or less 20 participants per condition. Thus, optional stopping can be ruled out. However, if a condition with 20 participants does not produce a significant result, it could simply be discarded, and another condition with 20 participants could be run. With a false-positive rate of 5%, this procedure will eventually yield the desired outcome while holding sample size constant. It seems implausible that Dr. Förster conducted 20 studies to obtain a single significant result. Thus, it is even more plausible that the effect is actually there, but that studies with n = 20 per condition have low power. If power were just 30%, the effect would appear in every third study significantly, and only 60 participants were used to produce significant results in one out of three studies. The report provides insufficient information to rule out this QRP, although it is well-known that excluding failed studies is a common practice in all sciences.
  4. Selectively and secretly deleting data of participants (i.e., outliers) to arrive at significant results. The report provides no explanation how this QRP can be ruled out as an explanation. Simmons, Nelson, and Simonsohn (2011) demonstrated that conducting a study with 37 participants and then deleting data from 17 participants can contribute to a significant result when the null-hypothesis is true. However, if an actual effect is present, fewer participants need to be deleted to obtain a significant result. If the original sample size is large enough, it is always possible to delete cases to end up with a significant result. Of course, at some point selective and secretive deletion of observation is just data fabrication. Rather than making up data, actual data from participants are deleted to end up with the desired pattern of results. However, without information about the true effect size, it is difficult to determine whether an effect was present and just embellished (see Fisher’s analysis of Mendel’s famous genetics studies) or whether the null-hypothesis is true.

The English translation of the report does not contain any statements about questionable research practices from Dr. Förster. In an email communication on January 2, 2014, Dr. Förster revealed that he in fact ran multiple studies, some of which did not produce significant results, and that he only reported his best studies. He also mentioned that he openly admitted to this common practice to the commission. The English translation of the final report does not mention this fact. Thus, it remains an open question whether QRPs could have produced the unusual linearity in Dr. Förster’s studies.

A New Perspective: The Curse of Low Powered Studies

One unresolved question is why Dr. Förster would manipulate data to produce a linear pattern of means that he did not even mention in his articles. (Discover magazine).

One plausible answer is that the linear pattern is the by-product of questionable research practices to claim that two experimental groups with opposite manipulations are both significantly different from a control group. To support this claim, the articles always report contrasts of the experimental conditions and the control condition (see Table below). ForsterTable

In Table 1 the results of these critical tests are reported with subscripts next to the reported means. As the direction of the effect is theoretically determined, a one-tailed test was used. The null-hypothesis was rejected when p < .05.

Table 1 reports 9 comparisons of global processing conditions and control groups and 9 comparisons of local processing conditions with a control group; a total of 18 critical significance tests. All studies had approximately 20 participants per condition. The average effect size across the 18 studies is d = .71 (median d = .68).   An a priori power analysis with d = .7, N = 40, and significance criterion .05 (one-tailed) gives a power estimate of 69%.

An alternative approach is to compute observed power for each study and to use median observed power (MOP) as an estimate of true power. This approach is more appropriate when effect sizes vary across studies. In this case, it leads to the same conclusion, MOP = 67.

The MOP estimate of power implies that a set of 100 tests is expected to produce 67 significant results and 33 non-significant results. For a set of 18 tests, the expected values are 12.4 significant results and 5.6 non-significant results.

The actual success rate in Table 1 should be easy to infer from Table 1, but there are some inaccuracies in the subscripts. For example, Study 1a shows no significant difference between means of 38 and 31 (d = .60, but it shows a significant difference between means 31 and 27 (d = .33). Most likely the subscript for the control condition should be c not a.

Based on the reported means and standard deviations, the actual success rate with N = 40 and p < .05 (one-tailed) is 83% (15 significant and 3 non-significant results).

The actual success rate (83%) is higher than one would expect based on MOP (67%). This inflation in the success rate suggests that the reported results are biased in favor of significant results (the reasons for this bias are irrelevant for the following discussion, but it could be produced by not reporting studies with non-significant results, which would be consistent with Dr. Förster’s account ).

The R-Index was developed to correct for this bias. The R-Index subtracts the inflation rate (83% – 67% = 16%) from MOP. For the data in Table 1, the R-Index is 51% (67% – 16%).

Given the use of a between-subject design and approximately equal sample sizes in all studies, the inflation in power can be used to estimate inflation of effect sizes. A study with N = 40 and p < .05 (one-tailed) has 50% power when d = .50.

Thus, one interpretation of the results in Table 1 is that the true effect sizes of the manipulation is d = .5, that 9 out of 18 tests should have produced a significant contrast at p < .05 (one-tailed) and that questionable research practices were used to increase the success rate from 50% to 83% (15 vs. 9 successes).

The use of questionable research practices would also explain unusual linearity in the data. Questionable research practices will increase or omit effect sizes that are insufficient to produce a significant result. With a sample size of N = 40, an effect size of d = .5 is insufficient to produce a significant result, d = .5, se = 32, t(38) = 1.58, p = .06 (one-tailed). Random sampling error that works against the hypothesis can only produce non-significant results that have to be dropped or moved upwards using questionable methods. Random error that favors the hypothesis will inflate the effect size and start producing significant results. However, random error is normally distributed around the true effect size and is more likely to produce results that are just significant (d = .8) than to produce results that are very significant (d = 1.5). Thus, the reported effect sizes will be clustered more closely around the median inflated effect size than one would expect based on an unbiased sample of effect sizes.

The clustering of effect sizes will happen for the positive effects in the global processing condition and for the negative effects in the local processing condition. As a result, the pattern of all three means will be more linear than an unbiased set of studies would predict. In a large set of studies, this bias will produce a very low p-value.

One way to test this hypothesis is to examine the variability in the reported results. The Test of Insufficient Variance (TIVA) was developed for this purpose. TIVA first converts p-values into z-scores. The variance of z-scores is known to be 1. Thus, a representative sample of z-scores should have a variance of 1, but questionable research practices lead to a reduction in variance. The probability that a set of z-scores is a representative set of z-scores can be computed with a chi-square test and chi-square is a function of the ratio of the expected and observed variance and the number of studies. For the set of studies in Table 1, the variance in z-scores is .33. The chi-square value is 54. With 17 degrees of freedom, the p-value is 0.00000917 and the odds of this event occurring by chance are 1 out of 109,056 times.

Conclusion

Previous discussions about abnormal linearity in Dr. Förster’s studies have failed to provide a satisfactory answer. An anonymous accuser claimed that the data were fabricated or manipulated, which the author vehemently denies. This blog proposes a plausible explanation of what could have [edited January 19, 2015] happened. Dr. Förster may have conducted more studies than were reported and included only studies with significant results in his articles. Slight variation in sample sizes suggests that he may also have removed a few outliers selectively to compensate for low power. Importantly, neither of these practices would imply scientific misconduct. The conclusion of the commission that scientific misconduct occurred rests on the assumption that QRPs cannot explain the unusual linearity of means, but this blog points out how selective reporting of positive results may have inadvertently produced this linear pattern of means. Thus, the present analysis support the conclusion by an independent statistical expert mentioned in the LOWI report: “QRP cannot be excluded, which in the opinion of the expert is a common, if not “prevalent” practice, in this field of science.”

How Unusual is an R-Index of 51?

The R-Index for the 18 statistical tests reported in Table 1 is 51% and TIVA confirms that the reported p-values have insufficient variance. Thus, it is highly probable that questionable research practices contributed to the results and in a personal communication Dr. Förster confirmed that additional studies with non-significant results exist. However, in response to further inquiries [see follow up blog] Dr. Förster denied having used QRPs that could explain the linearity in his data.

Nevertheless, an R-Index of 51% is not unusual and has been explained with the use of QRPs.  For example, the R-Index for a set of studies by Roy Baumeister was 49%, . and Roy Baumeister stated that QRPs were used to obtain significant results.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.”

Sadly, it is quite common to find an R-Index of 50% or lower for prominent publications in social psychology. This is not surprising because questionable research practices were considered good practices until recently. Even at present, it is not clear whether these practices constitute scientific misconduct (see discussion in Dialogue, Newsletter of the Society for Personality and Social Psychology).

How to Avoid Similar Sad Stories in the Future

One way to avoid accusations of scientific misconduct is to conduct a priori power analyses and to conduct only studies with a realistic chance to produce a significant result when the hypothesis is correct. When random error is small, true patterns in data can emerge without the help of QRPs.

Another important lesson from this story is to reduce the number of statistical tests as much as possible. Table 1 reported 18 statistical tests with the aim to demonstrate significance in each test. Even with a liberal criterion of .1 (one-tailed), it is highly unlikely that so many significant tests will produce positive results. Thus, a non-significant result is likely to emerge and researchers should think ahead of time how they would deal with non-significant results.

For the data in Table 1, Dr. Förster could have reported the means of 9 small studies without significance tests and conduct significance tests only once for the pattern in all 9 studies. With a total sample size of 360 participants (9 * 40), this test would have 90% power even if the effect size is only d = .35. With 90% power, the total power to obtain significant differences from the control condition for both manipulations would be 81%. Thus, the same amount of resources that were used for the controversial findings could have been used to conduct a powerful empirical test of theoretical predictions without the need to hide inconclusive, non-significant results in studies with low power.

Jacob Cohen has been trying to teach psychologists the importance of statistical power for decades and psychologists stubbornly ignored his valuable contribution to research methodology until he died in 1998. Methodologists have been mystified by the refusal of psychologists to increase power in their studies (Maxwell, 2004).

One explanation is that small samples provided a huge incentive. A non-significant result can be discarded with little cost of resources, whereas a significant result can be published and have the additional benefit of an inflated effect size, which allows boosting the importance of published results.

The R-Index was developed to balance the incentive structure towards studies with high power. A low R-Index reveals that a researcher is reporting biased results that will be difficult to replicate by other researchers. The R-Index reveals this inconvenient truth and lowers excitement about incredible results that are indeed incredible. The R-Index can also be used by researchers to control their own excitement about results that are mostly due to sampling error and to curb the excitement of eager research assistants that may be motivated to bias results to please a professor.

Curbed excitement does not mean that the R-Index makes science less exciting. Indeed, it will be exciting when social psychologists start reporting credible results about social behavior that boost a high R-Index because for a true scientist nothing is more exciting than the truth.

The Test of Insufficient Variance (TIVA): A New Tool for the Detection of Questionable Research Practices

It has been known for decades that published results tend to be biased (Sterling, 1959). For most of the past decades this inconvenient truth has been ignored. In the past years, there have been many suggestions and initiatives to increase the replicability of reported scientific findings (Asendorpf et al., 2013). One approach is to examine published research results for evidence of questionable research practices (see Schimmack, 2014, for a discussion of existing tests). This blog post introduces a new test of bias in reported research findings, namely the Test of Insufficient Variance (TIVA).

TIVA is applicable to any set of studies that used null-hypothesis testing to conclude that empirical data provide support for an empirical relationship and reported a significance test (p-values).

Rosenthal (1978) developed a method to combine results of several independent studies by converting p-values into z-scores. This conversion uses the well-known fact that p-values correspond to the area under the curve of a normal distribution. Rosenthal did not discuss the relation between these z-scores and power analysis. Z-scores are observed scores that should follow a normal distribution around the non-centrality parameter that determines how much power a study has to produce a significant result. In the Figure, the non-centrality parameter is 2.2. This value is slightly above a z-score of 1.96, which corresponds to a two-tailed p-value of .05. A study with a non-centrality parameter of 2.2 has 60% power.  In specific studies, the observed z-scores vary as a function of random sampling error. The standardized normal distribution predicts the distribution of observed z-scores. As observed z-scores follow the standard normal distribution, the variance of an unbiased set of z-scores is 1.  The Figure on top illustrates this with the nine purple lines, which are nine randomly generated z-scores with a variance of 1.

In a real data set the variance can be greater than 1 for two reasons. First, if the nine studies are exact replication studies with different sample sizes, larger samples will have a higher non-centrality parameter than smaller samples. This variance in the true non-centrality variances adds to the variance produced by random sampling error. Second, a set of studies that are not exact replication studies can have variance greater than 1 because the true effect sizes can vary across studies. Again, the variance in true effect sizes produces variance in the true non-centrality parameters that add to the variance produced by random sampling error.  In short, the variance is 1 in exact replication studies that also hold the sample size constant. When sample sizes and true effect sizes vary, the variance in observed z-scores is greater than 1. Thus, an unbiased set of z-scores should have a minimum variance of 1.

If the variance in z-scores is less than 1, it suggests that the set of z-scores is biased. One simple reason for insufficient variance is publication bias. If power is 50% and the non-centrality parameter matches the significance criterion of 1.96, 50% of studies that were conducted would not be significant. If these studies are omitted from the set of studies, variance decreases from 1 to .36. Another reason for insufficient variance is that researchers do not report non-significant results or used questionable research practices to inflate effect size estimates. The effect is that variance in observed z-scores is restricted.  Thus, insufficient variance in observed z-scores reveals that the reported results are biased and provide an inflated estimate of effect size and replicability.

In small sets of studies, insufficient variance may be due to chance alone. It is possible to quantify how lucky a researcher was to obtain significant results with insufficient variance. This probability is a function of two parameters: (a) the ratio of the observed variance (OV) in a sample over the population variance (i.e., 1), and (b) the number of z-scores minus 1 as the degrees of freedom (k -1).

The product of these two parameters follows a chi-square distribution with k-1 degrees of freedom.

Formula 1: Chi-square = OV * (k – 1) with k-1 degrees of freedom.

Example 1:

Bem (2011) published controversial evidence that appear to demonstrate precognition. Subsequent studies failed to replicate these results (Galak et al.,, 2012) and other bias tests show evidence that the reported results are biased Schimmack (2012). For this reason, Bem’s article provides a good test case for TIVA.

Bem_p_ZThe article reported results of 10 studies with 9 z-scores being significant at p < .05 (one-tailed). The observed variance in the 10 z-scores is 0.19. Using Formula 1, the chi-square value is chi^2 (df = 9) = 1.75. Importantly, chi-square tests are usually used to test whether variance is greater than expected by chance (right tail of the distribution). The reason is that variance is not expected to be less than the variance expected by chance because it is typically assumed that a set of data is unbiased. To obtain a probability of insufficient variance, it is necessary to test the left-tail of the chi-square distribution.  The corresponding p-value for chi^2 (df = 9) = 1.75 is p = .005. Thus, there is only a 1 out of 200 probability that a random set of 10 studies would produce a variance as low as Var = .19.

This outcome cannot be attributed to publication bias because all studies were published in a single article. Thus, TIVA supports the hypothesis that the insufficient variance in Bem’s z-scores is the result of questionable research methods and that the reported effect size of d = .2 is inflated. The presence of bias does not imply that the true effect size is 0, but it does strongly suggest that the true effect size is smaller than the average effect size in a set of studies with insufficient variance.

Example 2:  

Vohs et al. (2006) published a series of studies that he results of nine experiments in which participants were reminded of money. The results appeared to show that “money brings about a self-sufficient orientation.” Francis and colleagues suggested that the reported results are too good to be true. An R-Index analysis showed an R-Index of 21, which is consistent with a model in which the null-hypothesis is true and only significant results are reported.

Because Vohs et al. (2006) conducted multiple tests in some studies, the median p-value was used for conversion into z-scores. The p-values and z-scores for the nine studies are reported in Table 2. The Figure on top of this blog illustrates the distribution of the 9 z-scores relative to the expected standard normal distribution.

Table 2

Study                    p             z          

Study 1                .026       2.23
Study 2                .050       1.96
Study 3                .046       1.99
Study 4                .039       2.06
Study 5                .021       2.99
Study 6                .040       2.06
Study 7                .026       2.23
Study 8                .023       2.28
Study 9                .006       2.73
                                                           

The variance of the 9 z-scores is .054. This is even lower than the variance in Bem’s studies. The chi^2 test shows that this variance is significantly less than expected from an unbiased set of studies, chi^2 (df = 8) = 1.12, p = .003. An unusual event like this would occur in only 1 out of 381 studies by chance alone.

In conclusion, insufficient variance in z-scores shows that it is extremely likely that the reported results overestimate the true effect size and replicability of the reported studies. This confirms earlier claims that the results in this article are too good to be true (Francis et al., 2014). However, TIVA is more powerful than the Test of Excessive Significance and can provide more conclusive evidence that questionable research practices were used to inflate effect sizes and the rate of significant results in a set of studies.

Conclusion

TIVA can be used to examine whether a set of published p-values was obtained with the help of questionable research practices. When p-values are converted into z-scores, the variance of z-scores should be greater or equal to 1. Insufficient variance suggests that questionable research practices were used to avoid publishing non-significant results; this includes simply not reporting failed studies.

At least within psychology, these questionable research practices are used frequently to compensate for low statistical power and they are not considered scientific misconduct by governing bodies of psychological science (APA, APS, SPSP). Thus, the present results do not imply scientific misconduct by Bem or Vohs, just like the use of performance enhancing drugs in sports is not illegal unless a drug is put on an anti-doping list. However, jut because a drug is not officially banned, it does not mean that the use of a drug has no negative effects on a sport and its reputation.

One limitation of TIVA is that it requires a set of studies and that variance in small sets of studies can vary considerably just by chance. Another limitation is that TIVA is not very sensitive when there is substantial heterogeneity in true non-centrality parameters. In this case, the true variance in z-scores can mask insufficient variance in random sampling error. For this reason, TIVA is best used in conjunction with other bias tests. Despite these limitations, the present examples illustrate that TIVA can be a powerful tool in the detection of questionable research practices.  Hopefully, this demonstration will lead to changes in the way researchers view questionable research practices and how the scientific community evaluates results that are statistically improbable. With rejection rates at top journals of 80% or more, one would hope that in the future editors will favor articles that report results from studies with high statistical power that obtain significant results that are caused by the predicted effect.

Christmas Special: R-Index of “Women Are More Likely to Wear Red or Pink at Peak Fertility”

An article in Psychological Science titled “Women Are More Likely to Wear Red or Pink at Peak Fertility” reported two studies that related women’s cycle to the color of their shirts. Study 1 (N = 100) found that women were more likely to wear red or pink shirts around the time of ovulation. Study 2 (N = 25) replicated this finding. An article in Slate magazine, “Too good to be true” questioned the credibility of the reported results. The critique led to a lively discussion about research practices, statistics, and psychological science in general.

The R-Index provides some useful information about some unresolved issues in the debate.

The main finding in Study 1 was a significant chi-square test, chi-square (1, N = 100) = 5.32, p = .021, z = 2.31, observed power 64%.

The main finding in Study 2 was a chi-square test, chi-square (1, N = 25) = 3.82, p = .051, z = 1.95, observed power 50%.

One way to look at these results is to assume that the authors planned the two studies, including sample sizes, conducted two statistical significance tests and reported the results of their planned analysis. Both tests have to produce significant results in the predicted direction at p = .05 (two-tailed) to be published in Psychological Science. The authors claim that the probability of this event to occur by chance is only 0.25% (5% * 5%). In fact, the probability is even lower because a two-tailed can be significant when the effect is opposite to the hypothesis (i.e., women are less likely to wear red at peak fertility, p < .05, two-tailed). The probability to get significant results in a theoretically predicted direction with p = .05 (two-tailed) is equivalent to a one-tailed test with p = .025 as significance criterion. The probability of this happening twice in a row is only 0.06%.  According to this scenario, the significant results in the two studies are very unlikely to be a chance finding. Thus, they provide evidence that women are more likely to wear red at peak fertility.

The R-Index takes a different perspective. The focus is on replicability of the results reported in the two studies. Replicability is defined as the long-run probability to produce significant results in exact replication studies; everything but random sampling error is constant.

The first step is to estimate replicability of each study. Replicabilty is estimated by converting p-values into observed power estimates. As shown above, observed power is estimated to be 64% in Study 1 and 50% in Study 2.   If these estimates were correct, the probability to replicate significant results in two exact replication studies would be 32%. This also implies that the chance of obtaining significant results in the original studies was only 32%. This raises the question of what researchers would do when a non-significant result is obtained. If reporting or publication bias prevent these results from being published, published results provide an inflated estimate of replicability (100% success rate with 32% probability to be successful).

The R-Index uses the median as the best estimate of the typical power in a set of studies. Median observed power is 57%.  However, the success rate is 100% (two significant results in two reported attempts).  The discrepancy between the success rate (100%) and the expected rate of significant results (57%) shows the inflated rate of significant results that is expected based on the long-run success rate of 57% (100% – 57% = 43%).   This would be equivalent to getting red twice in a roulette game with a 50% chance of red or black (ignoring 0 here).  Ultimately, an unbiased roulette table would produce black outcomes to get the expected rate of 50% red and 50% black numbers.

The R-Index corrects for this inflation by subtracting the inflation rate from observed power.

The R-Index is 57% – 43% = 14%.  

To interpret an R-Index of 14%, the following scenarios are helpful.

When the null-hypothesis is true and non-significant results are not reported, the R-Index is 22%. Thus, the R-Index for this pair of studies is lower than the R-Index for the null-hypothesis.

With just two studies, it is possible that researchers were just lucky to get two significant results despite a low probability of this event to occur.

For other researchers it is not important why reported results are likely to be too good to be true. For science, it is more important that the reported results can be generalized to future studies and real world situations. The main reason to publish studies in scientific journals is to provide evidence that can be replicated even in studies that are not exact replication studies, but provide sufficient opportunity for the same causal process (peak fertility influences women’s clothing choices) to be observed. With this goal in mind, a low R-Index reveals that the two studies provide rather weak evidence for the hypothesis and that the generalizability to future studies and real world scenarios is uncertain.

In fact, only 28% of studies with an average R-Index of 43% replicated in a test of the R-Index (reference!). Failed replication studies consistently tend to have an R-Index below 50%.

For this reason, Psychological Science should have rejected the article and asked the authors to provide stronger evidence for their hypothesis.

Psychological Science should also have rejected the article because the second study had only a quarter of the sample size of Study 1 (N = 25 vs. 100). Given the effect size in Study 1 and observed power of only 63% in Study 1, cutting the sample sizes by 75% reduces the probability to obtain a significant effect in Study 2 to 20%. Thus, the authors were extremely lucky to produce a significant result in Study 2. It would have been better to conduct the replication study with a sample of 150 participants to have 80% power to replicate the effect in Study 1.

Conclusion

The R-Index of “Women Are More Likely to Wear Red or Pink at Peak Fertility” is 19. This is a low value and suggests that the results will not replicate in an exact replication study. It is possible that the authors were just lucky to get two significant results. However, lucky results distort the scientific evidence and these results should not be published without a powerful replication study that does not rely on luck to produce significant results. To avoid controversies like these and to increase the credibility of published results, researchers should conduct more powerful tests of hypothesis and scientific journals should favor studies that have a high R-Index.

The R-Index for 18 Multiple Study Articles in Science (Francis et al., 2014)

tide_naked

“Only when the tide goes out do you discover who has been swimming naked.”  Warren Buffet (Value Investor).

 

 

Francis, Tanzman, and Matthews (2014) examined the credibility of psychological articles published in the prestigious journal Science. They focused on articles that contained four or more articles because (a) the statistical test that they has insufficient power for smaller sets of studies and (b) the authors assume that it is only meaningful to focus on studies that are published within a single article.

They found 26 articles published between 2006 and 2012. Eight articles could not be analyzed with their method.

The remaining 18 articles had a 100% success rate. That is, they never reported that a statistical hypothesis test failed to produce a significant result. Francis et al. computed the probability of this outcome for each article. When the probability was less than 10%, they made the recommendation to be skeptical about the validity of the theoretical claims.

For example, a researcher may conduct five studies with 80% power. As expected, one of the five studies produced a non-significant result. It is rational to assume that this finding is a type-II error as the Type-II error should occur in 1 out of 5 studies. The scientist decides not to include the non-significant result. In this case, there is bias, the average effect size across the four significant studies is slightly inflated, but the empirical results do support empirical claims.

If, however, the null-hypothesis is true and a researcher conducts many statistical tests and reports only significant results, demonstrating excessive significant results would also reveal that the reported results provide no empirical support for the theoretical claims in this article.

The problem with Francis et al.’s approach is that it does not clearly distinguish between these two scenarios.

The R-Index addresses this problem. It provides quantitative information about the replicability of a set of studies. Like Francis et al., the R-Index is based on the observed power of individual statistical tests (see Schimmack, 2012, for details), but the next steps are different. Francis et al. multiply observed power estimates. This approach is only meaningful for sets of studies that reported only significant results. The R-Index can be computed for studies that reported significant and non-significant results. Here are the steps:

Compute median observed power for all theoretically important statistical tests from a single study; then compute the median of these medians. This median estimates the median true power of a set of studies.

Compute the rate of significant results for the same set of statistical tests; then average the rates across the same set of studies. This average estimates the reported success rate for a set of studies.

Median observed power and average success rate are both estimates of true power or replicability of a set of studies. Without bias, these two estimates should converge as the number of studies increase.

If the success rate is higher than median observed power, it suggests that the reported results provide an inflated picture of the true effect size and replicability of a phenomenon.

The R-Index uses the difference between success rate and median observed power to correct the inflated estimate of replicability by subtracting the inflation rate (success rate – median observed power) from the median observed power.

R-Index = Median Observed Power – (Success rate – Median Observed Power)

The R-Index is a quantitative index, where higher values suggest a higher probability that an exact replication study will be successful and it avoids simple dichotomous decisions. Nevertheless, it can be useful to provide some broad categories that distinguish different levels of replicability.

An R-Index of more than 80% is consistent with true power of 80%, even when some results are omitted. I chose 80% as a boundary because Jacob Cohen advised researchers that they should plan studies with 80% power. Many undergraduates learn this basic fact about power and falsely assume that researchers are following a rule that is mentioned in introductory statistics.

An R-Index between 50% and 80% suggests that the reported results support an empirical phenomenon, but that power was less than ideal. Most important, this also implies that these studies make it difficult to distinguish non-significant results and type-II errors. For example, two tests with 50% power are likely to produce one significant result and one non-significant result. Researches are tempted to interpret the significant one and to ignore the non-significant one. However, in a replication study the opposite pattern is just as likely to occur.

An R-Index between25% and 50% raises doubts about the empirical support for the conclusions. The reason is that an R-Index of 22% can be obtained when the null-hypothesis is true and all non-significant results are omitted. In this case, observed power is inflated from 5% to 61%. With a 100% success rate, the inflation rate is 39%, and the R-Index is 22% (61% – 39% = 22%).

An R-Index below 20% suggest that researchers used questionable research methods (importantly, these method are questionable but widely accepted in many research communities and not considered to be ethical misconduct) to obtain results that are statistically significant (e.g., systematically deleting outliers until p < .05).

Table 1 list Francis et al.’s results and the R-Index. Studies are arranged in order of the R-Index.  Only 1 study is in the exemplary category with an R-Index greater than 80%.
4 studies have an R-Index between 50% and 80%.
8 studies have an R-Index in the range between 20% and 50%.
5 studies have an R-Index below 20%.

There are good reasons why researchers should not conduct studies with less than 50% power.  However, 13 of the 18 studies have an R-Index below 50%, which suggests that the true power in these studies was less than 50%.

FrancisScienceTable

Conclusion

The R-Index provides an alternative approach to Francis’s TES to examine the credibility of a set of published studies. Whereas Francis concluded that 15 out of 18 articles show bias that invalidates the theoretical claims of the original article, the R-Index provides quantitative information about the replicability of reported results.

The R-Index does not provide a simple answer about the validity of published findings, but in many cases the R-Index raises concerns about the strength of the empirical evidence and reveals that editorial decisions failed to take replicability into account.

The R-Index provides a simple tool for editors and reviewers to increase the credibility of published results and to increase the replicability of published findings. Editors and reviewers can compute, or ask authors who submit manuscripts to compute, the R-Index and use this information in their editorial decision. There is no clear criterion value, but a higher R-Index is better and moderate R-values should be justified by other criteria (e.g., uniqueness of sample).

The R-Index can be used to examine whether editors continue to accept articles with low replicability or are committed to the publication of empirical results that are credible and replicable.

Nature Neuroscience: R-Index

R-Index of Nature Neuroscience

An article in nature review, neuroscience suggested that the median power in neuroscience studies is just 21% (Katherine S. Button, John P. A. Ioannidis, Claire Mokrysz, Brian A.Nosek, Jonathan Flint, Emma S.J. Robinson and Marcus R. Munafò, 2013).

The authors of this article examined meta-analyses of primary studies in neuroscience that were published in 2011. They analyzed 49 meta-analyses that were based on a total of 730 original studies (on average, 15 studies per meta-analysis, range 2 to 57).

For each primary study, the authors computed observed power based on the sample size and the estimated effect size in the meta-analysis.

Based on their analyses, the authors concluded that the median power in neuroscience is 21%.

Importantly, this is an estimate of the median observed power in neuroscience studies. There is a major problem with this estimate. It is incredibly low because a study with 21% observed power is not statistically significant, p = .25. If median power were 21%, it would mean that over 50% of the original studies in the meta-analysis reported a non-significant result (p > .05). This seems rather unlikely because journals tend to publish mostly significant results.

The estimate is even less plausible because it is based on meta-analytic averages without any correction for bias. These effect sizes are likely to be inflated, which means that median power estimate is inflated. Thus, true power is even less than 21%.

What could explain this implausible result?

  1. A meta-analysis includes published and unpublished studies. It is possible that the published studies reported significant results with observed power greater than 50% (p < .05) and the unpublished studies reported non-significant results with power less than 50%. However, this would imply that meta-analysts were able to retrieve as many unpublished studies as published studies. The authors did not report whether power of published and unpublished studies differed.
  2. A second possibility is that the power analyses produced false results. The authors relied on Ioannidis and Trikalinos’s (2007) approach to the estimation of power. This approach assumes that studies in a meta-analysis have the same true effect size and that the meta-analytic average (weighted mean) provides the best estimate of the true effect size. This estimate of the true effect size is then used to estimate power in individual studies based on the sample size of the study. As already noted by Ioannidis and Trikalinos (2007), this approach can produce biased results when effect sizes in a meta-analysis are heterogeneous.
  3. Estimating power simply on the basis of effect size and sample size can be misleading when the design is not a simple comparison of two groups. Between-subject designs are common in animal studies in neuroscience. However, many fMRI studies use within-subject designs that achieve high statistical power with a few participants because participants serve as their own controls.

Schimmack (2012) proposed an alternative procedure that does not have this limitation. Power is estimated individually for each study based on the observed effect size in this study. This approach makes it possible to estimate median power for heterogeneous sets of studies with different effect sizes. Moreover, this approach makes it possible to compute power when power is not simply a function of sample size and effect size (e.g., within-subject designs).

R-Index of Nature Neuroscience: Analysis

To examine the replicability of research published in nature and neuroscience, I retrieved the most cited articles in this journal until I had a sample of 20 studies. I needed 14 articles to meet this goal. The number of studies in these articles ranged from 1 to 7.

The success rate for focal significance tests was 97%. This implies that the vast majority of significance tests reported a significant result. The median observed power was 84%. The inflation rate is 13% (97% – 84% = 13%). The R-Index is 71%. Based on these numbers, the R-Index predicts that the majority of studies in nature neuroscience would replicate in an exact replication study.

This conclusion differs dramatically from Button et al.’s (2013) conclusion. I therefore examined some of the articles that were used for Button et al.’s analyses.

A study by Davidson et al. (2003) examined treatment effects in 12 depressed patients and compared them to 5 healthy controls. The main findings in this article were three significant interactions between time of treatment and group with z-scores of 3.84, 4.60, and 4.08. Observed power for these values with p = .05 is over 95%. If a more conservative significance level of p = .001 is used, power is still over 70%. However, the meta-analysis focused on the correlation between brain activity at baseline and changes in depression over time. This correlation is shown with a scatterplot without reporting the actual correlation or testing it for significance. The text further states that a similar correlation was observed for an alternative depression measure with r = .46 and noting correctly that this correlation is not significant, t(10) = 1.64, p = .13, d = .95, obs. power = 32%. The meta-analysis found a mean effect size of .92. A power analysis with d = .92 and N = 12 yields a power estimate of 30%. Presumably, this is the value that Button et al. used to estimate power for the Davidson et al. (2003) article. However, the meta-analysis did not include the more powerful analyses that compared patients and controls over time.

Conclusion

In the current replication crisis, there is a lot of confusion about the replicability of published findings. Button et al. (2013) aimed to provide some objective information about the replicability of neuroscience research. They concluded that replicability is very low with a median estimate of 21%. In this post, I point out some problems with their statistical approach and the focus on meta-analyses as a way to make inferences about replicability of published studies. My own analysis shows a relatively high R-Index of 71%. To make sense of this index it is instructive to compare it to the following R-Indices.

In a replication project of psychological studies, I found an R-Index of 43% and 28% of studies were successfully replicated.

In the many-labs replication project, 10 out of 12 studies were successfully replicated, a replication rate of 83% and the R-Index was 72%.

Caveat

Neuroscience studies may have high observed power and still not replicate very well in exact replications. The reason is that measuring brain activity is difficult and requires many steps to convert and reduce observed data into measures of brain activity in specific regions. Actual replication studies are needed to examine the replicability of published results.

Dr. Schnall’s R-Index

In several blog posts, Dr. Schnall made some critical comments about attempts to replicate her work and these blogs created a heated debate about replication studies. Heated debates are typically a reflection of insufficient information. Is the Earth flat? This question created heated debates hundreds of years ago. In the age of space travel it is no longer debated. In this blog, I presented some statistical information that sheds light on the debate about the replicability of Dr. Schnall’s research.

The Original Study

Dr. Schnall and colleagues conducted a study with 40 participants. A comparison of two groups on a dependent variable showed a significant difference, F(1,38) = 3.63. In these days, Psychological Science asked researchers to report P-Rep instead of p-values. P-rep was 90%. The interpretation of P-rep was that there is a 90% chance to find an effect with the SAME SIGN in an exact replication study with the same sample size. The conventional p-value for F(1,38) = 3.63 is p = .06, a finding that commonly is interpreted as marginally significant. The standardized effect size is d = .60, which is considered a moderate effect size. The 95% confidence interval is -.01 to 1.47.

The wide confidence interval makes it difficult to know the true effect size. A post-hoc power analysis, assuming the true effect size is d = .60 suggests that an exact replication study has a 46% chance to produce a significant results (p < .05, two-tailed). However, if the true effect size is lower, actual power is lower. For example, if the true effect size is small (d = .2), a study with N = 40 has only 9% power (that is a 9% chance) to produce a significant result.

The First Replication Study

Drs. Johnson, Cheung, and Donnellan conducted a replication study with 209 participants. Assuming the effect size in the original study is the true effect size, this replication study has 99% power. However, assuming the true effect size is only d = .2, the study has only 31% power to produce a significant result. The study produce a non-significant result, F(1, 206) = .004, p = .95. The effect size was d = .01 (in the same direction). Due to the larger sample, the confidence interval is smaller and ranges from -.26 to .28. The confidence interval includes d = 2. Thus, both studies are consistent with the hypothesis that the effect exists and that the effect size is small, d = .2.

The Second Replication Study

Dr. Huang conducted another replication study with N = 214 participants (Huang, 2004, Study 1). Based on the previous two studies, the true effect might be expected to be somewhere between -.01 and .28, which includes a small effect size of d = .20. A study with N = 214 participants has 31% power to produce a significant result. Not surprisingly, the study produce a non-significant result, t(212) = 1.22, p = .23. At the same time, the effect size fell within the confidence interval set by the previous two studies, d = .17.

A Third Replication Study

Dr. Hung conducted a replication study with N = 440 participants (Study 2). Maintaining the plausible effect size of d = .2 as the best estimate of the true effect size, the study has 55% power to produce a significant result, which means it is nearly as likely to produce a non-significant result as it is to produce a significant result, if the effect size is small (d = .2). The study failed to produce a significant result, t(438) = .042, p = 68. The effect size was d = .04 with a confidence interval ranging from -.14 to .23. Again, this confidence interval includes a small effect size of d = .2.

A Fourth Replication Study

Dr. Hung published a replication study in the supplementary materials to the article. The study again failed to demonstrate a main effect, t(434) = 0.42, p = .38. The effect size is d = .08 with a confidence interval of -.11 to .27. Again, the confidence interval is consistent with a small true effect size of d = .2. However, the study with 436 participants had only a 55% chance to produce a significant result.

If Dr. Huang had combined the two samples to conduct a more powerful study, a study with 878 participants would have 80% power to detect a small effect size of d = .2. However, the combined effect size of d = .06 for the combined samples is still not significant, t(876) = .89. The confidence interval ranges from -.07 to .19. It no longer includes d = .20, but the results are still consistent with a positive, yet small effect in the range between 0 and .20.

Conclusion

In sum, nobody has been able to replicate Schnall’s finding that a simple priming manipulation with cleanliness related words has a moderate to strong effect (d = .6) on moral judgments of hypothetical scenarios. However, all replication studies show a trend in the same direction. This suggests that the effect exists, but that the effect size is much smaller than in the original study; somewhere between 0 and .2 rather than .6.

Now there are three possible explanations for the much larger effect size in Schnall’s original study.

1. The replication studies were not exact replications and the true effect size in Schnall’s version of the experiment is stronger than in the other studies.

2. The true effect size is the same in all studies, but Dr. Schnall was lucky to observe an effect size that was three times as large as the true effect size and large enough to produce a marginally significant result.

3. It is possible that Dr. Schnall did not disclose all of the information about her original study. For example, she may have conducted additional studies that produced smaller and non-significant results and did not report these results. Importantly, this practice is common and legal and in an anonymous survey many researchers admitted using practices that produce inflated effect sizes in published studies. However, it is extremely rare for researchers to admit that these common practices explain one of their own findings and Dr. Schnall has attributed the discrepancy in effect sizes to problems with replication studies.

Dr. Schnall’s Replicability Index

Based on Dr. Schnall’s original study it is impossible to say which of these explanations accounts for her results. However, additional evidence makes it possible to test the third hypothesis that Dr. Schnall knows more than she was reporting in her article. The reason is that luck does not repeat itself. If Dr. Schnall was just lucky, other studies by her should have failed because Lady Luck is only on your side half the time. If, however, disconfirming evidence is systematically excluded from a manuscript, the rate of successful studies is higher than the observed statistical power in published studies (Schimmack, 2012).

To test this hypothesis, I downloaded Dr. Schnall’s 10 most cited articles (in Web of Science, July, 2014). These 10 articles contained 23 independent studies. For each study, I computed the median observed power of statistical tests that tested a theoretically important hypothesis. I also calculated the success rate for each study. The average success rate was 91% (ranging from 45% to 100%, median = 100%). The median observed power was 61%. The inflation rate is 30% (91%-61%). Importantly, observed power is an inflated estimate of replicability when the success rate is inflated. I created the replicability index (R-index) to take this inflation into account. The R-Index subtracts the inflation rate from observed median power.

Dr. Schnall’s R-Index is 31% (61% – 30%).

What does an R-Index of 31% mean? Here are some comparisons that can help to interpret the Index.

Imagine the null-hypothesis is always true, and a researcher publishes only type-I errors. In this case, observed power is 61% and the success rate is 100%. The R-Index is 22%.

Dr. Baumeister admitted that his publications select studies that report the most favorable results. His R-Index is 49%.

The Open Science Framework conducted replication studies of psychological studies published in 2008. A set of 25 completed studies in November 2014 had an R-Index of 43%. The actual rate of successful replications was 28%.

Given this comparison standards, it is hardly surprising that one of Dr. Schnall’s study did not replicate even when the sample size and power of replication studies were considerably higher.

Conclusion

Dr. Schnall’s R-Index suggests that the omission of failed studies provides the most parsimonious explanation for the discrepancy between Dr. Schnall’s original effect size and effect sizes in the replication studies.

Importantly, the selective reporting of favorable results was and still is an accepted practice in psychology. It is a statistical fact that these practices reduce the replicability of published results. So why do failed replication studies that are entirely predictable create so much heated debate? Why does Dr. Schnall fear that her reputation is tarnished when a replication study reveals that her effect sizes were inflated? The reason is that psychologists are collectively motivated to exaggerate the importance and robustness of empirical results. Replication studies break with the code to maintain an image that psychology is a successful science that produces stunning novel insights. Nobody was supposed to test whether published findings are actually true.

However, Bem (2011) let the cat out of the bag and there is no turning back. Many researchers have recognized that the public is losing trust in science. To regain trust, science has to be transparent and empirical findings have to be replicable. The R-Index can be used to show that researchers reported all the evidence and that significant results are based on true effect sizes rather than gambling with sampling error.

In this new world of transparency, researchers still need to publish significant results. Fortunately, there is a simple and honest way to do so that was proposed by Jacob Cohen over 50 years ago. Conduct a power analysis and invest resources only in studies that have high statistical power. If your expertise led you to make a correct prediction, the force of the true effect size will be with you and you do not have to rely on Lady Luck or witchcraft to get a significant result.

P.S. I nearly forgot to comment on Dr. Huang’s moderator effects. Dr. Huang claims that the effect of the cleanliness manipulation depends on how much effort participants exert on the priming task.

First, as noted above, no moderator hypothesis is needed because all studies are consistent with a true effect size in the range between 0 and .2.

Second, Dr. Huang found significant interaction effects in two studies. In Study 2, the effect was F(1,438) = 6.05, p = .014, observed power = 69%. In Study 2a, the effect was F(1,434) = 7.53, p = .006, observed power = 78%. The R-Index for these two studies is 74% – 26% = 48%.   I am waiting for an open science replication with 95% power before I believe in the moderator effect.

Third, even if the moderator effect exists, it doesn’t explain Dr. Schnall’s main effect of d = .6.

Roy Baumeister’s R-Index

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)

The R-Index can be used to evaluate the replicability of a set of statistical results. It can be used to evaluate the statistical research integrity of journals, articles on a specific topic (meta-analysis), and researchers. Just like the H-Index has become a popular metric of research excellence, the R-Index of individual researchers can be used to evaluate the replicability of their findings.

I chose Roy Baumeister as an example for several reasons. First, the R-Index is based on my earlier work on the incredibility-index (Schimmack, 2012). In this article, I demonstrated how power analysis can be used to reveal that researchers used questionable research practices to produce statistically significant results. I illustrated this approach with two articles. One article published 10 experiments that appeared to demonstrate time-reversed causality. Independent replication studies failed to replicate this incredible finding. The Incredibility-Index predicted this failure. The second article was a study on glucose consumption and will-power with Roy Baumeister as the senior author. The Incredibility-Index showed that the statistical results reported in this article were even less credible than the time-travel studies in Bem’s (2011) article.

Not surprisingly, Roy Baumeister was a reviewer of the incredibility article. During the review process, Roy Baumeister explained why his article reported more significant results than one would expect on the basis of the statistical power of these studies.

“My paper with Gailliot et al. (2007) is used as an illustration here. Of course, I am quite familiar with the process and history of that one. We initially submitted it with more studies, some of which had weaker results. The editor said to delete those. He wanted the paper shorter so as not to use up a lot of journal space with mediocre results. It worked: the resulting paper is shorter and stronger. Does that count as magic? The studies deleted at the editor’s request are not the only story. I am pretty sure there were other studies that did not work. Let us suppose that our hypotheses were correct and that our research was impeccable. Then several of our studies would have failed, simply given the realities of low power and random fluctuations. Is anyone surprised that those studies were not included in the draft we submitted for publication? If we had included them, certainly the editor and reviewers would have criticized them and formed a more negative impression of the paper. Let us suppose that they still thought the work deserved publication (after all, as I said, we are assuming here that the research was impeccable and the hypotheses correct). Do you think the editor would have wanted to include those studies in the published version?”

To my knowledge this is one of the few frank acknowledgements that questionable research practices (i.e., excluding evidence that does not support an author’s theory) contributed to the picture-perfect results in a published article. It is therefore instructive to examine the R-Index of a researcher who openly acknowledged that the reported results are a biased selection of the empirical evidence.

A tricky issue in any statistical analysis is the sampling of studies. In this case it would be possible to conduct the analysis on the full set of articles published by Roy Baumeister. However, for my analysis I selected a sample. To ensure that the sample is unbiased, I chose a sampling strategy that makes a priori sense and does not involve random sampling because I have control over the random generator. My sampling strategy was to focus on the Top 10 most cited original research articles.

To evaluate the R-Index, it is instructive to keep the following scenarios in mind.

  1. The null-hypothesis is true and a researcher uses questionable research practices to obtain just significant results (p = .049999). The observed power for this set of studies is 50%, but all statistical results are significant, 100% success rate. The success rate is inflated by 50%. The R-Index is observed power minus inflation rate, which is 0%.
  2. The null-hypothesis is true and a researcher drops non-significant results and/or uses questionable research methods that capitalize on chance. In this case, p-values above .05 are not reported and p-values below .05 have a uniform distribution with a median of .025. A p-value of .025 corresponds to 61% observed power. With 100% significant results, the inflation rate is 39%, and the R-Index is 22% (61%-39%).
  3. The null-hypothesis is false and researcher conducts studies with 30% power. The non-significant studies are not published. In this case, observed power is 70%. With 100% success rate, the inflation rate is 30%. The R-Index is 40%.
  4. The null-hypothesis is false and researcher conducts studies with 50% power. The non-significant studies are not published. In this case, observed power is 75%. With 100% success rate, the inflation rate is 25%. The R-Index is 50%.
  5. The null-hypothesis is false and researchers conduct studies with 80% power, as recommended by Cohen. The non-significant results are not published (20% missing). In this case, observed power is 90% with 100% significant results. With 10% inflation rate, the R-Index is 80% (90% – 10%).
  6. A sample of psychological studies published in 2008 produced an R-Index of 43% (Observed Power = 72%, Success Rate = 100%, Inflation Rate = 28%). Exact replications of these studies produced a success rate of 28%.

Roy Baumeister’s Top-10 articles contained 40 studies. Each study reported multiple statistical tests. I computed the median observed power of statistical tests that tested a theoretically relevant hypothesis. I also recorded whether the test was considered supportive of the theoretical hypothesis (typically, p < .05). The median observed power in this set of 40 studies was 69%. The success rate was 89%. The inflation rate is 20% and the R-Index is 49% (69% – 20%).

Roy Baumeister’s R-Index of 49% is consistent with his statement that his articles do not contain all of the studies that tested a theoretical prediction. Studies that tested theoretical predictions and failed to support them are missing. An R-Index of 49% is also consistent with Roy Baumeister’s claim that his practices reflect the common practices in the field. Other sets of studies in social psychology produce similar indices (e.g., replicability project of psychological studies, R-Index = 43%; success rate in empirical replication studies 28%).

In conclusion, Roy Baumeister’s acknowledged the use of questionable research practices (i.e., excluding evidence that does not support a theoretical hypothesis) and his R-Index is 49%. The R-Index of a representative set of studies in psychology in 2008 produced an R-Index of 42%. This suggests that the use of questionable research practices in psychology is widespread and the R-Index is able to detect the use of these practices. A set of studies that were subjected to empirical replication attempts produced a R-Index of 38%, and 28% of replication attempts were successful (72% failed).

The R-Index makes it possible to quantify and compare the use of questionable research practices and I hope it will encourage researchers to conduct fewer and more powerful studies. I also hope that a quantitative index makes it possible to make replicability an evaluation criterion for scientists.

So what could Roy Baumeister have done? He published 9 studies that supported his hypothesis and excluded several more studies because they were underpowered.  I suggest running fewer studies with higher power so that all studies can produce significant results, assuming the null-hypothesis is really false.