
The Test of Insufficient Variance (TIVA): A New Tool for the Detection of Questionable Research Practices

It has been known for decades that published results tend to be biased (Sterling, 1959). For most of this time, this inconvenient truth has been ignored. In recent years, there have been many suggestions and initiatives to increase the replicability of published scientific findings (Asendorpf et al., 2013). One approach is to examine published research results for evidence of questionable research practices (see Schimmack, 2014, for a discussion of existing tests). This blog post introduces a new test of bias in reported research findings: the Test of Insufficient Variance (TIVA).

TIVA is applicable to any set of studies that used null-hypothesis significance testing to claim support for an empirical relationship and that reported the results of these tests (p-values).

Rosenthal (1978) developed a method to combine the results of several independent studies by converting p-values into z-scores. This conversion uses the well-known correspondence between p-values and areas under the curve of the standard normal distribution. Rosenthal did not discuss the relation between these z-scores and power analysis. Z-scores are observed scores that should follow a normal distribution around the non-centrality parameter, which determines how much power a study has to produce a significant result. In the Figure, the non-centrality parameter is 2.2. This value is slightly above a z-score of 1.96, which corresponds to a two-tailed p-value of .05. A study with a non-centrality parameter of 2.2 has 60% power. In individual studies, the observed z-scores vary as a function of random sampling error. Because observed z-scores follow a standard normal distribution around the non-centrality parameter, the variance of an unbiased set of z-scores is 1. The Figure on top illustrates this with the nine purple lines, which represent nine randomly generated z-scores with a variance of 1.
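
Rosenthal's p-to-z conversion can be sketched in a few lines of Python (a hypothetical helper for illustration, not code from any of the cited papers):

```python
from statistics import NormalDist

def p_to_z(p: float, tails: int = 2) -> float:
    """Convert a reported p-value into the z-score of a standard
    normal test statistic (Rosenthal, 1978)."""
    return NormalDist().inv_cdf(1 - p / tails)
```

For example, p_to_z(.05) returns about 1.96, the two-tailed significance criterion, and p_to_z(.05, tails=1) returns about 1.65.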

In a real data set the variance can be greater than 1 for two reasons. First, if the nine studies are exact replication studies with different sample sizes, larger samples will have a higher non-centrality parameter than smaller samples. This variance in the true non-centrality parameters adds to the variance produced by random sampling error. Second, a set of studies that are not exact replications can have variance greater than 1 because the true effect sizes can vary across studies. Again, the variance in true effect sizes produces variance in the true non-centrality parameters that adds to the variance produced by random sampling error. In short, the variance is 1 only in exact replication studies that also hold the sample size constant. When sample sizes and true effect sizes vary, the variance in observed z-scores is greater than 1. Thus, an unbiased set of z-scores has a minimum variance of 1.

If the variance of a set of z-scores is less than 1, the set is likely to be biased. One simple reason for insufficient variance is publication bias. If power is 50% and the non-centrality parameter matches the significance criterion of 1.96, half of the studies that were conducted would not be significant. If these studies are omitted from the set, the variance decreases from 1 to .36. Another reason for insufficient variance is that researchers did not report non-significant results or used questionable research practices to inflate effect size estimates. The effect is that the variance in observed z-scores is restricted. Thus, insufficient variance in observed z-scores reveals that the reported results are biased and provide an inflated estimate of effect size and replicability.
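
The drop from a variance of 1 to about .36 can be verified with a small Monte Carlo simulation (a sketch under the assumption of 50% power, i.e., a non-centrality parameter equal to the 1.96 criterion):

```python
import random
import statistics

random.seed(1)
ncp = 1.96  # non-centrality parameter: 50% power at the two-tailed .05 criterion
draws = [random.gauss(ncp, 1.0) for _ in range(200_000)]

# Publication bias: only significant z-scores (z > 1.96) get "published".
published = [z for z in draws if z > 1.96]

full_var = statistics.variance(draws)           # close to 1.0
truncated_var = statistics.variance(published)  # close to 0.36
```

Dropping the non-significant half of the distribution cuts the variance to roughly a third of its unbiased value.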

In small sets of studies, insufficient variance may be due to chance alone. It is possible to quantify how lucky a researcher was to obtain significant results with insufficient variance. This probability is a function of two parameters: (a) the ratio of the observed variance (OV) of the sample over the population variance (i.e., 1), and (b) the number of z-scores minus 1 as the degrees of freedom (k – 1).

The product of these two parameters follows a chi-square distribution with k-1 degrees of freedom.

Formula 1: Chi-square = OV * (k – 1) with k-1 degrees of freedom.
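
Formula 1 translates directly into code. The sketch below uses only the Python standard library, computing the left-tail chi-square probability from the series expansion of the regularized lower incomplete gamma function (a hypothetical implementation, not the author's spreadsheet):

```python
import math
import statistics

def chi2_cdf(x: float, df: int) -> float:
    """Left-tail CDF of the chi-square distribution, via the series
    expansion of the regularized lower incomplete gamma function."""
    a, t = df / 2.0, x / 2.0
    if t <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while term > total * 1e-12:
        n += 1
        term *= t / (a + n)
        total += term
    return total * math.exp(a * math.log(t) - t - math.lgamma(a))

def tiva(z_scores):
    """Test of Insufficient Variance: is the variance of the observed
    z-scores significantly SMALLER than the unbiased minimum of 1?"""
    k = len(z_scores)
    ov = statistics.variance(z_scores)  # observed variance, df = k - 1
    chi_sq = ov * (k - 1)               # Formula 1
    return ov, chi_sq, chi2_cdf(chi_sq, df=k - 1)  # left-tail p-value
```

For a set of 10 z-scores with an observed variance of .19, this yields chi^2 of about 1.7 (df = 9) and a left-tail p of about .005, in line with the Bem example below.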

Example 1:

Bem (2011) published controversial evidence that appeared to demonstrate precognition. Subsequent studies failed to replicate these results (Galak et al., 2012), and other bias tests show evidence that the reported results are biased (Schimmack, 2012). For this reason, Bem’s article provides a good test case for TIVA.

The article reported results of 10 studies, with 9 z-scores being significant at p < .05 (one-tailed). The observed variance of the 10 z-scores is 0.19. Using Formula 1, the chi-square value is chi^2 (df = 9) = 1.75. Importantly, chi-square tests are usually used to test whether variance is greater than expected by chance (the right tail of the distribution), because data sets are typically assumed to be unbiased and variance is not expected to fall below the chance level. To obtain a probability of insufficient variance, it is necessary to test the left tail of the chi-square distribution. The corresponding p-value for chi^2 (df = 9) = 1.75 is p = .005. Thus, there is only a 1 in 200 probability that a random set of 10 studies would produce a variance as low as Var = .19.

This outcome cannot be attributed to publication bias because all studies were published in a single article. Thus, TIVA supports the hypothesis that the insufficient variance in Bem’s z-scores is the result of questionable research methods and that the reported effect size of d = .2 is inflated. The presence of bias does not imply that the true effect size is 0, but it does strongly suggest that the true effect size is smaller than the average effect size in a set of studies with insufficient variance.

Example 2:  

Vohs et al. (2006) published a series of nine experiments in which participants were reminded of money. The results appeared to show that “money brings about a self-sufficient orientation.” Francis and colleagues (2014) suggested that the reported results are too good to be true. An R-Index analysis showed an R-Index of 21, which is consistent with a model in which the null-hypothesis is true and only significant results are reported.

Because Vohs et al. (2006) conducted multiple tests in some studies, the median p-value was used for conversion into z-scores. The p-values and z-scores for the nine studies are reported in Table 2. The Figure on top of this blog illustrates the distribution of the 9 z-scores relative to the expected standard normal distribution.

Table 2

Study      p        z

Study 1    .026     2.23
Study 2    .050     1.96
Study 3    .046     1.99
Study 4    .039     2.06
Study 5    .021     2.31
Study 6    .040     2.06
Study 7    .026     2.23
Study 8    .023     2.28
Study 9    .006     2.73

The variance of the 9 z-scores is .054. This is even lower than the variance in Bem’s studies. The chi^2 test shows that this variance is significantly less than expected from an unbiased set of studies, chi^2 (df = 8) = 1.12, p = .003. An unusual event like this would occur in only 1 out of 381 studies by chance alone.
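
The reported variance can be checked from the p-values in Table 2 alone (a quick sketch; the conversion is the standard two-tailed p-to-z transformation):

```python
from statistics import NormalDist, variance

# Median p-values per study from Table 2 (Vohs et al., 2006)
p_values = [.026, .050, .046, .039, .021, .040, .026, .023, .006]
z_scores = [NormalDist().inv_cdf(1 - p / 2) for p in p_values]
ov = variance(z_scores)  # close to the reported value of .054
```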

In conclusion, insufficient variance in z-scores shows that it is extremely likely that the reported results overestimate the true effect size and replicability of the reported studies. This confirms earlier claims that the results in this article are too good to be true (Francis et al., 2014). However, TIVA is more powerful than the Test of Excessive Significance and can provide more conclusive evidence that questionable research practices were used to inflate effect sizes and the rate of significant results in a set of studies.

Conclusion

TIVA can be used to examine whether a set of published p-values was obtained with the help of questionable research practices. When p-values are converted into z-scores, the variance of the z-scores should be greater than or equal to 1. Insufficient variance suggests that questionable research practices were used to avoid publishing non-significant results; this includes simply not reporting failed studies.

At least within psychology, these questionable research practices are frequently used to compensate for low statistical power, and they are not considered scientific misconduct by the governing bodies of psychological science (APA, APS, SPSP). Thus, the present results do not imply scientific misconduct by Bem or Vohs, just like the use of performance-enhancing drugs in sports is not illegal unless a drug is put on an anti-doping list. However, just because a drug is not officially banned does not mean that its use has no negative effects on a sport and its reputation.

One limitation of TIVA is that it requires a set of studies, and variance estimates from small sets of studies vary considerably by chance. Another limitation is that TIVA is not very sensitive when there is substantial heterogeneity in the true non-centrality parameters. In this case, the true variance in z-scores can mask insufficient variance due to random sampling error. For this reason, TIVA is best used in conjunction with other bias tests. Despite these limitations, the present examples illustrate that TIVA can be a powerful tool in the detection of questionable research practices. Hopefully, this demonstration will lead to changes in the way researchers view questionable research practices and in how the scientific community evaluates results that are statistically improbable. With rejection rates of 80% or more at top journals, one would hope that editors will in the future favor articles that report results from studies with high statistical power that obtain significant results caused by the predicted effect.

The R-Index of Ego-Depletion Studies with the Handgrip Paradigm

In 1998 Baumeister and colleagues introduced a laboratory experiment to study will-power. Participants are assigned to one of two conditions. In one condition, participants have to exert will-power to work on an effortful task. The other condition is a control condition with a task that does not require will-power. After the manipulation all participants have to perform a second task that requires will-power. The main hypothesis is that participants who already used will-power on the first task will perform more poorly on the second task than participants in the control condition.

In 2010, a meta-analysis examined the results of studies that had used this paradigm (Hagger, Wood, & Chatzisarantis, 2010). The meta-analysis uncovered 198 studies with a total of 10,782 participants. The overall effect size in the meta-analysis suggested strong support for the hypothesis, with an average effect size of d = .62.

However, the authors of the meta-analysis did not examine the contribution of publication bias to the reported results. Carter and McCullough (2013) compared the percentage of significant results to average observed power. This test showed clear evidence that studies with significant results and inflated effect sizes were overrepresented in the meta-analysis. Carter and McCullough (2014) used meta-regression to examine bias (Stanley & Doucouliagos, 2013). This approach relies on the fact that several sources of reporting bias and publication bias produce a correlation between sampling error and effect size. When effect sizes are regressed on sampling error, the intercept provides an estimate of the unbiased effect size; that is, the estimated effect size in the population when sampling error is zero. Stanley and Doucouliagos (2013) use two regression methods. One method uses sampling error as a predictor (PET). The other method uses the squared sampling error as a predictor (PEESE). Carter and McCullough (2014) used both methods. PET showed bias and no evidence for the key hypothesis. PEESE also showed evidence of bias, but suggested that the effect is present.

There are several problems with the regression-based approach as a way to correct for biases (Replication-Index, December 17, 2014). One problem is that other factors can produce a correlation between sampling error and effect sizes. In this specific case, it is possible that effect sizes vary across experimental paradigms. Hagger and Chatzisarantis (2014) use these problems to caution readers that it is premature to disregard an entire literature on ego-depletion. The R-Index can provide some additional information about the empirical foundation of ego-depletion theory.

The analyses here focus on the handgrip paradigm because it has high power to detect moderate to strong effects: these studies measured handgrip strength before and after the manipulation of will-power. Based on published studies, it is possible to estimate the retest correlation of handgrip performance (r ~ .8). Below are a priori power analyses for common sample sizes and Cohen’s benchmarks for small, moderate, and large effect sizes.

[Figure: A priori power of the pre-post handgrip paradigm for small, moderate, and large effect sizes]

The power analysis shows that the pre-post design is very powerful for detecting moderate to large effect sizes. Even with a sample size of just 40 participants (20 per condition), power is 71%. If reporting bias and publication bias exclude 30% non-significant results from the evidence, observed power is inflated to 82%. The comparison of the success rate (100%) and observed power (82%) leads to an estimated inflation rate of 18% and an R-Index of 64% (82% – 18%). Thus, a moderate effect size in studies with 40 or more participants is expected to produce an R-Index greater than 64%.
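
These power figures can be approximated with a change-score model (a sketch under the stated assumptions: retest correlation r ~ .8, two-tailed alpha = .05, normal approximation; this is not the author's own calculation):

```python
from math import sqrt
from statistics import NormalDist

def prepost_power(d: float, n_total: int, r: float = 0.8,
                  alpha: float = 0.05) -> float:
    """Approximate power of a two-group pre-post design: the pre-post
    correlation r shrinks the error variance by a factor of 2 * (1 - r)."""
    d_adj = d / sqrt(2 * (1 - r))     # effective effect size for change scores
    ncp = d_adj * sqrt(n_total) / 2   # non-centrality parameter
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(z_crit - ncp)
```

Under these assumptions, a moderate effect (d = .5) with N = 40 gives power of about 71%, matching the figure above; a small effect (d = .2) requires roughly N = 300 for 80% power.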

However, with typical sample sizes of less than 120 participants, the expected rate of significant results is less than 50%. With N = 80 and true power of 31%, the reporting of only significant results would boost the observed power to 64%. The inflation rate would be 30% and the R-Index would be 39%. In this case, the R-Index overestimates true power by 9%. Thus, an R-Index less than 50% suggests that the true effect size is small or that the null-hypothesis is true (importantly, the null-hypothesis refers to the effect in the handgrip-paradigm, not to the validity of the broader theory that it becomes more difficult to sustain effort over time).

R-Analysis

The meta-analysis included 18 effect sizes based on handgrip studies. Two unpublished studies (Ns = 24, 37) were not included in this analysis. Seeley and Gardner’s (2003) study, for which the meta-analysis reported two effect sizes, was excluded because it did not use a pre-post design, which could explain its non-significant result. Thus, 4 effects were excluded and the analysis below is based on the remaining 14 studies.

All articles presented significant effects of will-power manipulations on handgrip performance. Bray et al. (2008) reported three tests: one was not significant (p = .10), one was marginally significant (p = .06), and one was significant at the .05 level (p = .01). The result with the lowest p-value was used. As a result, the success rate was 100%.

Median observed power was 63%. The inflation rate is 37% and the R-Index is 26%. This is close to the R-Index of 22% that is expected when the null-hypothesis is true and all reported findings are type-I errors. Thus, the R-Index supports Carter and McCullough’s (2014) conclusion that the existing evidence does not provide empirical support for the hypothesis that will-power manipulations lower performance on a measure of will-power.

The R-Index can also be used to examine whether a subset of studies provides some evidence for the will-power hypothesis, but that this evidence is masked by the noise generated by underpowered studies with small samples. Only 7 studies had samples with more than 50 participants. The R-Index for these studies remained low (20%). Only two studies had samples with 80 or more participants. The R-Index for these studies increased to 40%, which is still insufficient to estimate an unbiased effect size.

One reason for the weak results is that several studies used weak manipulations of will-power (e.g., sniffing alcohol vs. sniffing water in the control condition). The R-Index of individual studies shows two studies with strong results (R-Index > 80). One study used a physical manipulation (standing on one leg). This manipulation may lower handgrip performance, but the effect may not reflect an influence on will-power. The other study used a task that is mentally taxing (and boring) but not physically taxing, namely crossing out “e”s. This task seems promising for a replication study.

Power analysis with an effect size of d = .2 suggests that a serious empirical test of the will-power hypothesis requires a sample size of N = 300 (150 per cell) to have 80% power in a pre-post study of will-power.

[Figure: R-Index estimates for individual handgrip studies]

Conclusion

The R-Index of 14 will-power studies with the powerful pre-post handgrip paradigm confirms Carter and McCullough’s (2014) conclusion that the meta-analysis of will-power studies (Hagger, Wood, & Chatzisarantis, 2010) provided an inflated estimate of the true effect size and that the existing studies provide no empirical support for the effect of will-power manipulations on a second effortful task. The existing studies have insufficient statistical power to distinguish a true null-effect from a small effect (d = .2). Power analysis suggests that future studies should use strong manipulations of will-power and sample sizes of N = 300 participants.

Limitation

This analysis examined only the small set of studies in the meta-analysis that used handgrip performance as the dependent variable. Other studies may show different results, but those studies often used a simple between-subject design with small samples. This paradigm has low power to detect even moderate effect sizes. It is therefore likely that the R-Index will also confirm Carter and McCullough’s (2014) conclusion.

The R-Index of Nicotine-Replacement-Therapy Studies: An Alternative Approach to Meta-Regression

Stanley and Doucouliagos (2013) demonstrated how meta-regression can be used to obtain unbiased estimates of effect sizes from a biased set of original studies. The regression approach relies on the fact that small samples often need luck or questionable practices to produce significant results, whereas large samples can show true effects without the help of luck and questionable practices. If questionable practices or publication bias are present, effect sizes in small samples are inflated and this bias is evident in a regression of effect sizes on sampling error. When bias is present, the intercept of the regression equation can provide a better estimate of the average effect size in a set of studies.
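
The regression logic can be sketched in a few lines (a hypothetical minimal PET-style estimator using ordinary least squares of effect size on standard error; Stanley and Doucouliagos's actual procedure is more elaborate):

```python
def pet_intercept(effect_sizes, std_errors):
    """Regress effect sizes on standard errors; the intercept estimates
    the effect size when sampling error is zero (the PET idea)."""
    n = len(effect_sizes)
    mean_se = sum(std_errors) / n
    mean_d = sum(effect_sizes) / n
    sxx = sum((se - mean_se) ** 2 for se in std_errors)
    sxy = sum((se - mean_se) * (d - mean_d)
              for se, d in zip(std_errors, effect_sizes))
    slope = sxy / sxx  # a positive slope signals small-study bias
    return mean_d - slope * mean_se
```

With synthetic data in which bias adds 1.5 * se to a true effect of d = .2, the intercept recovers the unbiased value of .2.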

One limitation of this approach is that other factors can also produce a correlation between effect size and sampling error. Another problem is that the regression equation can only approximate the effect of bias on effect size estimates.

The R-Index can complement meta-regression in several ways. First, it can be used to examine whether a correlation between effect size and sampling error reflects bias. If small samples have higher effect sizes due to bias, they should also yield more significant results than the power of these studies justifies. If this is not the case, the correlation may simply show that smaller samples examined stronger effects. Second, the R-Index can be used as an alternative way to estimate unbiased effect sizes that does not rely on the relationship between sample size and effect size.

The usefulness of the R-Index is illustrated with Stanley and Doucouliagos’s (2013) meta-analysis of the effectiveness of nicotine replacement therapy (the patch). Table A1 lists the sampling errors and t-values of 42 studies. Stanley and Doucouliagos (2013) found that the 42 studies suggested that the patch increases smoking cessation by 93%, but that effectiveness decreased to 22% in a regression that controlled for biased reporting of results. This suggests that published studies inflate the true effect by more than 300%.

I entered the t-values and standard errors into the R-Index spreadsheet. I used the sampling error to estimate sample sizes and degrees of freedom, based on the relation se = 2 / sqrt(N). I used one-tailed t-tests to allow for negative t-values, because the predicted direction of the effect is known in a meta-analysis of studies that try to show treatment effects. Significance was tested at p = .025 one-tailed, which is equivalent to p = .05 in a two-tailed test (z > 1.96).
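
The sample-size reconstruction is a one-line inversion (assuming, as in the text, that the standard error of a standardized two-group effect is approximately 2 / sqrt(N)):

```python
import math

def n_from_se(se: float) -> int:
    """Invert se = 2 / sqrt(N) to recover the approximate total N."""
    return round((2 / se) ** 2)
```

A standard error of .2 thus implies roughly N = 100 and df = 98.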

The R-Index for all 42 studies was 27%. The low R-Index was mostly explained by the low power of studies with small samples. Median observed power was just 34%. The percentage of significant results (40%) was only slightly higher. The inflation rate was only 7%.

As studies with low power add mostly noise, Stanley (2010) showed that it can be preferable to exclude them from estimates of actual effect sizes. The problem is that it is difficult to find a principled way to determine which studies should be included or excluded. One solution is to retain only studies with large samples. The problem with this approach is that this often limits a meta-analysis to a small set of studies.

One solution is to compute the R-Index for different sets of studies and to base conclusions on the largest unbiased set of studies. For the 42 studies of nicotine replacement therapy, the following effect size estimates were obtained (effect sizes are d-values, d = t * se).
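
The effect-size recovery and weighting can be sketched as follows (weighting by 1/se² is proportional to weighting by sample size, since se is approximately 2/sqrt(N); this is an illustrative helper, not the spreadsheet's code):

```python
def weighted_d(t_values, std_errors):
    """Recover d = t * se for each study and average with precision weights."""
    ds = [t * se for t, se in zip(t_values, std_errors)]
    weights = [1.0 / se ** 2 for se in std_errors]
    return sum(w * d for w, d in zip(weights, ds)) / sum(weights)
```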

[Table: R-Index and weighted effect size estimates for subsets of the nicotine-patch studies]

The results show the highest R-Index for studies with more than 80 participants. For these studies, observed power is 83% and the percentage of significant results is also 83%, suggesting that this set of studies is an unbiased sample of studies. The weighted average effect size for this set of studies is d = .44. The results also show that the weighted average effect size does not change much as a function of the selection of studies. When all studies are included, there is evidence of bias (8% inflation) and the weighted average effect size is inflated, but the amount of inflation is small (d = .56 vs. d = .44, difference d = .12).

The small amount of bias appears to be inconsistent with Stanley and Doucouliagos’s (2013) estimate that an uncorrected meta-analysis overestimates the true effect size by over 300% (93% vs. 22% RR). I therefore also examined the log(RR) values in Table A1.

The average is .68 (compared to the simple mean reported as .66); the median is .53 and the weighted average is .49. The regression-corrected estimate reported by Stanley and Doucouliagos (2013) is .31. The weighted mean for studies with more than 80 participants is .43. It is now clear why Stanley and Doucouliagos (2013) reported a large effect of the bias correction. First, they used the simple mean as the comparison standard (.68 vs. .31). The effect would be smaller if they had used the weighted mean as the comparison standard (.49 vs. .31). Another factor is that the regression procedure produces a lower estimate than the R-Index approach (.31 vs. .43). More research is needed to compare these methods, but the R-Index has a simple logic: when there is no evidence of bias, the weighted average provides a reasonable estimate of the true effect size.

Conclusion

Stanley and Doucouliagos (2013) used regression of effect sizes on sampling error to reveal biases and to obtain an unbiased estimate of the typical effect size in a set of studies. This approach provides a useful tool in the fight against biased reporting of research results. One limitation of this approach is that other factors can produce a correlation between sampling error and effect size. The R-Index can be used to examine how much reporting biases contribute to this correlation. The R-Index can also be used to obtain an unbiased estimate of effect size by computing a weighted average for a select set of studies with a high R-Index.

A meta-analysis of 42 studies of nicotine replacement therapy illustrates this approach. The R-Index for the full set of studies was low (27%). This reveals that many studies had low power to demonstrate an effect. These studies provide little information about effectiveness because their non-significant results are just as likely to be type-II errors as demonstrations of low effectiveness.

The R-Index increased when studies with larger samples were selected. The maximum R-Index was obtained for studies with at least 80 participants. In this case, observed power was above 80% and there was no evidence of bias. The weighted average effect size for this set of studies was only slightly lower than the weighted average effect size for all studies (log(RR) = .43 vs. .49, RR = 54% vs. 63%, respectively). This finding suggests that smokers who use a nicotine patch are about 50% more likely to quit smoking than smokers without a nicotine patch.

The estimate of 50% risk reduction challenges Stanley and Doucouliagos’s (2013) preferred estimate that bias correction “reduces the efficacy of the patch to only 22%.” The R-Index suggests that this bias-corrected estimate is itself biased.

Another important conclusion is that studies with low power are wasteful and uninformative. They generate a lot of noise, are likely to be systematically biased, and contribute little to a meta-analysis that weights studies by sample size. The best estimate of effect size was based on only 6 of the 42 studies. Researchers should not conduct studies with low power, and editors should not publish them.

The R-Index of Simmons et al.’s 21 Word Solution

Simmons, Nelson, and Simonsohn (2011) demonstrated how researchers can omit inconvenient details from research reports. For example, researchers may fail to mention a manipulation that did not produce a theoretically predicted effect. Such questionable practices have the undesirable consequence that reported results are difficult to replicate. Simmons et al. (2011, 2012) proposed a simple solution to this problem. Researchers who are not engaging in questionable research practices can state that they did not engage in these practices. In contrast, researchers who used questionable research practices would have to lie or honestly report that they engaged in these practices. Simmons et al. (2012) proposed a simple 21-word statement and encouraged researchers to include it in their manuscripts.

“We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study.”

A search in WebofScience in June 2014 retrieved 326 articles that cited Simmons et al. (2011). To examine the effectiveness of this solution to the replication crisis, a set of articles was selected that reported original research results and claimed adherence to Simmons et al.’s standards. The sample size was determined by the rule to sample a minimum of 10 articles and a minimum of 20 studies. The R-Index is based on 11 articles with 21 studies.

The average R-Index for the set of 11 articles is 75%. There are 6 articles with an R-Index greater than 90%, suggesting that these studies had very high statistical power to produce statistically significant results.

To interpret this outcome it is helpful to use the following comparison standards.

When true power is 50% and all non-significant results are deleted to inflate the success rate to 100%, the R-Index is 50%.

A set of 18 multiple-study articles in the prestigious journal Science had only 1 article with an R-Index over 90% and 13 articles with an R-Index below 50%.

[Table: R-Index of original research articles citing Simmons et al. (2011)]

Conclusion

The average R-Index of original research articles that cite Simmons et al.’s (2011) article is fairly high and close to the ideal of 80%. This shows that some researchers are reporting results that are likely to replicate and that these researchers use the Simmons et al. reference to signal their research integrity. It is notable that the average number of studies in these 11 articles is about two. None of these articles reported four or more studies, and six articles reported a single study. This observation highlights the fact that it is easier to produce replicable results when resources are devoted to a single study with high statistical power rather than spread across several underpowered studies that either fail or require luck and questionable research practices to produce statistically significant results (Schimmack, 2012).

Although it is encouraging that some researchers are now including a statement that they did not engage in questionable research practices, the number of articles that contain these statements is still low. Only 10 articles in Psychological Science, the journal that published Simmons et al.’s article, make a reference to Simmons et al., and none of them cited it to declare compliance with Simmons et al.’s recommendations. At present, it is therefore unclear whether researchers have changed their practices.

The R-Index provides an alternative approach to examine whether reported results are credible and replicable. Studies with high statistical power and honest reporting of non-significant results are more likely to replicate. The R-Index is easy to compute. Editors could ask authors to compute the R-Index for submitted manuscripts. Reviewers can compute the R-Index during their review. Editors can use the R-Index to decide which manuscripts get accepted and can ask authors to include the R-Index in publications. Most important, readers can compute the R-Index to examine whether they can trust a set of published results.

The R-Index for 18 Multiple Study Articles in Science (Francis et al., 2014)


“Only when the tide goes out do you discover who has been swimming naked.”  Warren Buffett (Value Investor).

Francis, Tanzman, and Matthews (2014) examined the credibility of psychological articles published in the prestigious journal Science. They focused on articles that contained four or more studies because (a) the statistical test that they used has insufficient power for smaller sets of studies and (b) the authors assume that it is only meaningful to focus on studies that are published within a single article.

They found 26 articles published between 2006 and 2012. Eight articles could not be analyzed with their method.

The remaining 18 articles had a 100% success rate. That is, they never reported that a statistical hypothesis test failed to produce a significant result. Francis et al. computed the probability of this outcome for each article. When the probability was less than 10%, they made the recommendation to be skeptical about the validity of the theoretical claims.

For example, a researcher may conduct five studies with 80% power. As expected, one of the five studies produces a non-significant result. It is rational to assume that this finding is a type-II error, as a type-II error should occur in 1 out of 5 studies. The scientist decides not to include the non-significant result. In this case there is bias, and the average effect size across the four significant studies is slightly inflated, but the empirical results do support the theoretical claims.

If, however, the null-hypothesis is true and a researcher conducts many statistical tests but reports only the significant results, a demonstration of excessive significance also reveals that the reported results provide no empirical support for the theoretical claims in the article.

The problem with Francis et al.’s approach is that it does not clearly distinguish between these two scenarios.

The R-Index addresses this problem. It provides quantitative information about the replicability of a set of studies. Like Francis et al., the R-Index is based on the observed power of individual statistical tests (see Schimmack, 2012, for details), but the next steps are different. Francis et al. multiply observed power estimates. This approach is only meaningful for sets of studies that reported only significant results. The R-Index can be computed for studies that reported significant and non-significant results. Here are the steps:

Compute median observed power for all theoretically important statistical tests from a single study; then compute the median of these medians. This median estimates the median true power of a set of studies.

Compute the rate of significant results for the same set of statistical tests; then average the rates across the same set of studies. This average estimates the reported success rate for a set of studies.

Median observed power and average success rate are both estimates of the true power or replicability of a set of studies. Without bias, these two estimates should converge as the number of studies increases.

If the success rate is higher than median observed power, it suggests that the reported results provide an inflated picture of the true effect size and replicability of a phenomenon.

The R-Index uses the difference between success rate and median observed power to correct the inflated estimate of replicability by subtracting the inflation rate (success rate – median observed power) from the median observed power.

R-Index = Median Observed Power – (Success rate – Median Observed Power)
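Under the usual two-tailed normal approximation, the three steps above can be sketched in a few lines of Python. The p-values below are hypothetical, chosen only to illustrate the procedure; this is a sketch, not the official R-Index implementation:

```python
from statistics import NormalDist, median

norm = NormalDist()

def observed_power(p, alpha=0.05):
    """Observed power: probability that an exact replication of a study
    with this two-tailed p-value would again reach significance."""
    z = norm.inv_cdf(1 - p / 2)           # convert p-value to z-score
    z_crit = norm.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    return 1 - norm.cdf(z_crit - z)

# Hypothetical set: each inner list holds the focal tests of one study.
studies = [[0.01, 0.04], [0.03], [0.20], [0.001]]

# Step 1: median observed power per study, then the median of these medians.
mop = median(median(observed_power(p) for p in tests) for tests in studies)

# Step 2: average the rate of significant results across the same studies.
success = sum(sum(p < 0.05 for p in tests) / len(tests)
              for tests in studies) / len(studies)

# Step 3: subtract the inflation rate from median observed power.
r_index = mop - (success - mop)
print(round(mop, 2), round(success, 2), round(r_index, 2))
```

For this hypothetical set, median observed power is about .61, the success rate is .75, and the R-Index works out to about .47.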

The R-Index is a quantitative index, where higher values suggest a higher probability that an exact replication study will be successful and it avoids simple dichotomous decisions. Nevertheless, it can be useful to provide some broad categories that distinguish different levels of replicability.

An R-Index of more than 80% is consistent with a true power of 80%, even when some results are omitted. I chose 80% as a boundary because Jacob Cohen advised researchers to plan studies with 80% power. Many undergraduates learn this basic fact about power and falsely assume that researchers actually follow this rule from introductory statistics.

An R-Index between 50% and 80% suggests that the reported results support an empirical phenomenon, but that power was less than ideal. Most important, this also implies that these studies make it difficult to distinguish true negative results from Type-II errors. For example, two tests with 50% power are likely to produce one significant result and one non-significant result. Researchers are tempted to interpret the significant one and to ignore the non-significant one. However, in a replication study the opposite pattern is just as likely to occur.

An R-Index between 20% and 50% raises doubts about the empirical support for the conclusions. The reason is that an R-Index of 22% can be obtained when the null-hypothesis is true and all non-significant results are omitted. In this case, observed power is inflated from 5% to 61%. With a 100% success rate, the inflation rate is 39%, and the R-Index is 22% (61% – 39% = 22%).
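The 5%-to-61% inflation can be checked with a quick simulation, assuming two-tailed z-tests and a true null hypothesis: when only significant results survive selection, median observed power rises to roughly 61% and the R-Index lands near 22%.

```python
import random
from statistics import NormalDist, median

norm = NormalDist()
random.seed(1)

z_crit = norm.inv_cdf(0.975)  # 1.96 for a two-tailed test at alpha = .05

# Draw z-scores for 200,000 studies in which the null hypothesis is true,
# then "publish" only the significant ones.
z_scores = [abs(random.gauss(0.0, 1.0)) for _ in range(200_000)]
published = [z for z in z_scores if z > z_crit]

# Observed power of each published (significant) result.
powers = [1 - norm.cdf(z_crit - z) for z in published]

mop = median(powers)             # about .61
success = 1.0                    # 100% success rate by construction
r_index = mop - (success - mop)  # about .22
print(round(mop, 2), round(r_index, 2))
```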

An R-Index below 20% suggests that researchers used questionable research methods (importantly, these methods are questionable but widely accepted in many research communities and are not considered ethical misconduct) to obtain results that are statistically significant (e.g., systematically deleting outliers until p < .05).

Table 1 lists Francis et al.’s results and the R-Index. Studies are arranged in order of the R-Index.  Only 1 study is in the exemplary category with an R-Index greater than 80%.
4 studies have an R-Index between 50% and 80%.
8 studies have an R-Index in the range between 20% and 50%.
5 studies have an R-Index below 20%.

There are good reasons why researchers should not conduct studies with less than 50% power.  However, 13 of the 18 studies have an R-Index below 50%, which suggests that the true power in these studies was less than 50%.

[Table 1: Francis et al.’s results and the R-Index for the 18 multiple-study articles in Science]

Conclusion

The R-Index provides an alternative approach to Francis’s TES to examine the credibility of a set of published studies. Whereas Francis concluded that 15 out of 18 articles show bias that invalidates the theoretical claims of the original article, the R-Index provides quantitative information about the replicability of reported results.

The R-Index does not provide a simple answer about the validity of published findings, but in many cases the R-Index raises concerns about the strength of the empirical evidence and reveals that editorial decisions failed to take replicability into account.

The R-Index provides a simple tool for editors and reviewers to increase the credibility of published results and to increase the replicability of published findings. Editors and reviewers can compute, or ask authors who submit manuscripts to compute, the R-Index and use this information in their editorial decision. There is no clear criterion value, but a higher R-Index is better and moderate R-values should be justified by other criteria (e.g., uniqueness of sample).

The R-Index can be used to examine whether editors continue to accept articles with low replicability or are committed to the publication of empirical results that are credible and replicable.

Do it yourself: R-Index Spreadsheet and Manual are now available.

Science is self-correcting, but it often takes too long.

A spreadsheet to compute the R-Index and a manual that shows how to use the spreadsheet are now available on the www.r-index.org website. Researchers from all fields of science that use statistics are welcome to use the R-Index to examine the statistical integrity of published research findings. A high R-Index suggests that a set of studies reported results that are likely to replicate in an EXACT replication study with high statistical power. A low R-Index suggests that published results may be biased and may not replicate. Researchers can share the results of their R-Index analyses by submitting the completed spreadsheets to www.r-index.org, and the results will be posted anonymously. Results and spreadsheets will be openly accessible.

Nature Neuroscience: R-Index

R-Index of Nature Neuroscience

An article in Nature Reviews Neuroscience suggested that the median power in neuroscience studies is just 21% (Katherine S. Button, John P. A. Ioannidis, Claire Mokrysz, Brian A. Nosek, Jonathan Flint, Emma S. J. Robinson, and Marcus R. Munafò, 2013).

The authors of this article examined meta-analyses of primary studies in neuroscience that were published in 2011. They analyzed 49 meta-analyses that were based on a total of 730 original studies (on average, 15 studies per meta-analysis, range 2 to 57).

For each primary study, the authors computed observed power based on the sample size and the estimated effect size in the meta-analysis.

Based on their analyses, the authors concluded that the median power in neuroscience is 21%.

Importantly, this is an estimate of the median observed power in neuroscience studies, and there is a major problem with this estimate: it is implausibly low. A result with 21% observed power is not statistically significant (the corresponding two-tailed p-value is .25). If median power were 21%, more than 50% of the original studies in the meta-analyses would have reported a non-significant result (p > .05). This seems rather unlikely because journals tend to publish mostly significant results.
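The correspondence between 21% observed power and a p-value of about .25 can be verified directly, assuming a two-tailed z-test: solve observed power = 1 − Φ(z_crit − z) for the z-score, then convert that z-score back into a p-value.

```python
from statistics import NormalDist

norm = NormalDist()
z_crit = norm.inv_cdf(0.975)  # 1.96 for a two-tailed test at alpha = .05

# Invert observed power = 1 - cdf(z_crit - z) to recover the z-score
# that corresponds to 21% observed power.
z = z_crit - norm.inv_cdf(1 - 0.21)  # about 1.15
p = 2 * (1 - norm.cdf(z))            # two-tailed p, about .25
print(round(z, 2), round(p, 2))
```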

The estimate is even less plausible because it is based on meta-analytic averages without any correction for bias. These effect sizes are likely to be inflated, which means that median power estimate is inflated. Thus, true power is even less than 21%.

What could explain this implausible result?

  1. A meta-analysis includes published and unpublished studies. It is possible that the published studies reported significant results with observed power greater than 50% (p < .05) and the unpublished studies reported non-significant results with power less than 50%. However, this would imply that meta-analysts were able to retrieve as many unpublished studies as published studies. The authors did not report whether power of published and unpublished studies differed.
  2. A second possibility is that the power analyses produced false results. The authors relied on Ioannidis and Trikalinos’s (2007) approach to the estimation of power. This approach assumes that studies in a meta-analysis have the same true effect size and that the meta-analytic average (weighted mean) provides the best estimate of the true effect size. This estimate of the true effect size is then used to estimate power in individual studies based on the sample size of the study. As already noted by Ioannidis and Trikalinos (2007), this approach can produce biased results when effect sizes in a meta-analysis are heterogeneous.
  3. Estimating power simply on the basis of effect size and sample size can be misleading when the design is not a simple comparison of two groups. Between-subject designs are common in animal studies in neuroscience. However, many fMRI studies use within-subject designs that achieve high statistical power with a few participants because participants serve as their own controls.

Schimmack (2012) proposed an alternative procedure that does not have this limitation. Power is estimated individually for each study based on the observed effect size in this study. This approach makes it possible to estimate median power for heterogeneous sets of studies with different effect sizes. Moreover, this approach makes it possible to compute power when power is not simply a function of sample size and effect size (e.g., within-subject designs).

R-Index of Nature Neuroscience: Analysis

To examine the replicability of research published in Nature Neuroscience, I retrieved the most cited articles in this journal until I had a sample of 20 studies. I needed 14 articles to meet this goal. The number of studies in these articles ranged from 1 to 7.

The success rate for focal significance tests was 97%. This implies that the vast majority of significance tests reported a significant result. The median observed power was 84%. The inflation rate is 13% (97% – 84% = 13%). The R-Index is 71% (84% – 13% = 71%). Based on these numbers, the R-Index predicts that the majority of studies in Nature Neuroscience would replicate in an exact replication study.

This conclusion differs dramatically from Button et al.’s (2013) conclusion. I therefore examined some of the articles that were used for Button et al.’s analyses.

A study by Davidson et al. (2003) examined treatment effects in 12 depressed patients and compared them to 5 healthy controls. The main findings in this article were three significant interactions between time of treatment and group with z-scores of 3.84, 4.60, and 4.08. Observed power for these values with a significance criterion of .05 is over 95%. Even with a more conservative criterion of .001, power is still over 70%. However, the meta-analysis focused on the correlation between brain activity at baseline and changes in depression over time. This correlation is shown in a scatterplot without reporting the actual correlation or testing it for significance. The text further states that a similar correlation was observed for an alternative depression measure, r = .46, and notes correctly that this correlation is not significant, t(10) = 1.64, p = .13, d = .95, observed power = 32%. The meta-analysis found a mean effect size of .92, and a power analysis with d = .92 and N = 12 yields a power estimate of 30%. Presumably, this is the value that Button et al. used to estimate power for the Davidson et al. (2003) article. However, the meta-analysis did not include the more powerful analyses that compared patients and controls over time.

Conclusion

In the current replication crisis, there is a lot of confusion about the replicability of published findings. Button et al. (2013) aimed to provide some objective information about the replicability of neuroscience research. They concluded that replicability is very low with a median estimate of 21%. In this post, I point out some problems with their statistical approach and the focus on meta-analyses as a way to make inferences about replicability of published studies. My own analysis shows a relatively high R-Index of 71%. To make sense of this index it is instructive to compare it to the following R-Indices.

In a replication project of psychological studies, I found an R-Index of 43%, and 28% of the studies were successfully replicated.

In the many-labs replication project, 10 out of 12 studies were successfully replicated, a replication rate of 83% and the R-Index was 72%.

Caveat

Neuroscience studies may have high observed power and still not replicate very well in exact replications. The reason is that measuring brain activity is difficult and requires many steps to convert and reduce observed data into measures of brain activity in specific regions. Actual replication studies are needed to examine the replicability of published results.

Dr. Schnall’s R-Index

In several blog posts, Dr. Schnall made some critical comments about attempts to replicate her work and these blogs created a heated debate about replication studies. Heated debates are typically a reflection of insufficient information. Is the Earth flat? This question created heated debates hundreds of years ago. In the age of space travel it is no longer debated. In this blog, I presented some statistical information that sheds light on the debate about the replicability of Dr. Schnall’s research.

The Original Study

Dr. Schnall and colleagues conducted a study with 40 participants. A comparison of two groups on a dependent variable showed a marginally significant difference, F(1,38) = 3.63. In those days, Psychological Science asked researchers to report P-rep instead of p-values. P-rep was 90%. The interpretation of P-rep is that there is a 90% chance of finding an effect with the SAME SIGN in an exact replication study with the same sample size. The conventional p-value for F(1,38) = 3.63 is p = .06, a finding that is commonly interpreted as marginally significant. The standardized effect size is d = .60, which is considered a moderate effect size. The 95% confidence interval is -.01 to 1.47.
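As a sanity check, P-rep can be approximated from the reported p-value with Killeen's (2005) normal approximation, p_rep = Φ(z/√2). The exact figure depends on how precisely the p-value is computed from F(1,38) = 3.63, so this is only an approximation:

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()
p = 0.06                       # two-tailed p-value for F(1,38) = 3.63
z = norm.inv_cdf(1 - p / 2)    # observed z-score, about 1.88
p_rep = norm.cdf(z / sqrt(2))  # Killeen's approximation, about .90
print(round(p_rep, 2))
```

This reproduces the reported P-rep of roughly 90%.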

The wide confidence interval makes it difficult to know the true effect size. A post-hoc power analysis, assuming the true effect size is d = .60, suggests that an exact replication study has a 46% chance to produce a significant result (p < .05, two-tailed). However, if the true effect size is lower, actual power is lower. For example, if the true effect size is small (d = .2), a study with N = 40 has only 9% power (that is, a 9% chance) to produce a significant result.

The First Replication Study

Drs. Johnson, Cheung, and Donnellan conducted a replication study with 209 participants. Assuming the effect size in the original study is the true effect size, this replication study has 99% power. However, assuming the true effect size is only d = .2, the study has only 31% power to produce a significant result. The study produced a non-significant result, F(1, 206) = .004, p = .95. The effect size was d = .01 (in the same direction). Due to the larger sample, the confidence interval is narrower and ranges from -.26 to .28. The confidence interval includes d = .2. Thus, both studies are consistent with the hypothesis that the effect exists and that the effect size is small, d = .2.
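The power figures in this and the previous section can be roughly reproduced with a normal approximation to the two-tailed, two-sample test (exact noncentral-t values differ by a percentage point or so, which is why the text reports 46% where the approximation gives about 47%):

```python
from math import sqrt
from statistics import NormalDist

norm = NormalDist()

def power_two_sample(d, n_total, alpha=0.05):
    """Approximate power of a two-tailed test comparing two equal groups."""
    n = n_total / 2                       # participants per group
    ncp = d * sqrt(n / 2)                 # expected z-score (noncentrality)
    z_crit = norm.inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    return 1 - norm.cdf(z_crit - ncp)

print(round(power_two_sample(0.6, 40), 2))   # original study, d = .6
print(round(power_two_sample(0.2, 40), 2))   # original study, if d = .2
print(round(power_two_sample(0.2, 209), 2))  # first replication, d = .2
```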

The Second Replication Study

Dr. Huang conducted another replication study with N = 214 participants (Huang, 2014, Study 1). Based on the previous two studies, the true effect might be expected to be somewhere between -.01 and .28, which includes a small effect size of d = .20. A study with N = 214 participants has 31% power to produce a significant result for d = .20. Not surprisingly, the study produced a non-significant result, t(212) = 1.22, p = .23. At the same time, the effect size, d = .17, fell within the confidence interval set by the previous two studies.

A Third Replication Study

Dr. Huang conducted a further replication study with N = 440 participants (Study 2). Maintaining the plausible effect size of d = .2 as the best estimate of the true effect size, the study has 55% power to produce a significant result, which means it is nearly as likely to produce a non-significant result as a significant one if the effect size is small (d = .2). The study failed to produce a significant result, t(438) = 0.42, p = .68. The effect size was d = .04 with a confidence interval ranging from -.14 to .23. Again, this confidence interval includes a small effect size of d = .2.

A Fourth Replication Study

Dr. Huang also published a replication study in the supplementary materials to the article. The study again failed to demonstrate a main effect, t(434) = 0.42, p = .38. The effect size is d = .08 with a confidence interval of -.11 to .27. Again, the confidence interval is consistent with a small true effect size of d = .2. However, the study with 436 participants had only a 55% chance to produce a significant result.

If Dr. Huang had combined the two samples into one more powerful study, a study with 878 participants would have 80% power to detect a small effect size of d = .2. However, the combined effect size of d = .06 for the combined samples is still not significant, t(876) = .89. The confidence interval ranges from -.07 to .19. It no longer includes d = .20, but the results are still consistent with a positive yet small effect in the range between 0 and .20.

Conclusion

In sum, nobody has been able to replicate Schnall’s finding that a simple priming manipulation with cleanliness related words has a moderate to strong effect (d = .6) on moral judgments of hypothetical scenarios. However, all replication studies show a trend in the same direction. This suggests that the effect exists, but that the effect size is much smaller than in the original study; somewhere between 0 and .2 rather than .6.

Now there are three possible explanations for the much larger effect size in Schnall’s original study.

1. The replication studies were not exact replications and the true effect size in Schnall’s version of the experiment is stronger than in the other studies.

2. The true effect size is the same in all studies, but Dr. Schnall was lucky to observe an effect size that was three times as large as the true effect size and large enough to produce a marginally significant result.

3. It is possible that Dr. Schnall did not disclose all of the information about her original study. For example, she may have conducted additional studies that produced smaller and non-significant results and did not report these results. Importantly, this practice is common and legal, and in an anonymous survey many researchers admitted using practices that produce inflated effect sizes in published studies. However, it is extremely rare for researchers to admit that these common practices explain one of their own findings, and Dr. Schnall has attributed the discrepancy in effect sizes to problems with the replication studies.

Dr. Schnall’s Replicability Index

Based on Dr. Schnall’s original study it is impossible to say which of these explanations accounts for her results. However, additional evidence makes it possible to test the third hypothesis that Dr. Schnall knows more than she was reporting in her article. The reason is that luck does not repeat itself. If Dr. Schnall was just lucky, other studies by her should have failed because Lady Luck is only on your side half the time. If, however, disconfirming evidence is systematically excluded from a manuscript, the rate of successful studies is higher than the observed statistical power in published studies (Schimmack, 2012).

To test this hypothesis, I downloaded Dr. Schnall’s 10 most cited articles (in Web of Science, July, 2014). These 10 articles contained 23 independent studies. For each study, I computed the median observed power of statistical tests that tested a theoretically important hypothesis. I also calculated the success rate for each study. The average success rate was 91% (ranging from 45% to 100%, median = 100%). The median observed power was 61%. The inflation rate is 30% (91%-61%). Importantly, observed power is an inflated estimate of replicability when the success rate is inflated. I created the replicability index (R-index) to take this inflation into account. The R-Index subtracts the inflation rate from observed median power.

Dr. Schnall’s R-Index is 31% (61% – 30%).

What does an R-Index of 31% mean? Here are some comparisons that can help to interpret the Index.

Imagine the null-hypothesis is always true, and a researcher publishes only type-I errors. In this case, observed power is 61% and the success rate is 100%. The R-Index is 22%.

Dr. Baumeister admitted that his publications select studies that report the most favorable results. His R-Index is 49%.

The Open Science Framework conducted replication studies of psychological studies published in 2008. A set of 25 completed studies in November 2014 had an R-Index of 43%. The actual rate of successful replications was 28%.

Given these comparison standards, it is hardly surprising that one of Dr. Schnall’s studies did not replicate even when the sample size and power of the replication studies were considerably higher.

Conclusion

Dr. Schnall’s R-Index suggests that the omission of failed studies provides the most parsimonious explanation for the discrepancy between Dr. Schnall’s original effect size and effect sizes in the replication studies.

Importantly, the selective reporting of favorable results was and still is an accepted practice in psychology. It is a statistical fact that these practices reduce the replicability of published results. So why do failed replication studies that are entirely predictable create so much heated debate? Why does Dr. Schnall fear that her reputation is tarnished when a replication study reveals that her effect sizes were inflated? The reason is that psychologists are collectively motivated to exaggerate the importance and robustness of empirical results. Replication studies break with the code to maintain an image that psychology is a successful science that produces stunning novel insights. Nobody was supposed to test whether published findings are actually true.

However, Bem (2011) let the cat out of the bag and there is no turning back. Many researchers have recognized that the public is losing trust in science. To regain trust, science has to be transparent and empirical findings have to be replicable. The R-Index can be used to show that researchers reported all the evidence and that significant results are based on true effect sizes rather than gambling with sampling error.

In this new world of transparency, researchers still need to publish significant results. Fortunately, there is a simple and honest way to do so that was proposed by Jacob Cohen over 50 years ago. Conduct a power analysis and invest resources only in studies that have high statistical power. If your expertise led you to make a correct prediction, the force of the true effect size will be with you and you do not have to rely on Lady Luck or witchcraft to get a significant result.

P.S. I nearly forgot to comment on Dr. Huang’s moderator effects. Dr. Huang claims that the effect of the cleanliness manipulation depends on how much effort participants exert on the priming task.

First, as noted above, no moderator hypothesis is needed because all studies are consistent with a true effect size in the range between 0 and .2.

Second, Dr. Huang found significant interaction effects in two studies. In Study 2, the effect was F(1,438) = 6.05, p = .014, observed power = 69%. In Study 2a, the effect was F(1,434) = 7.53, p = .006, observed power = 78%. The R-Index for these two studies is 74% – 26% = 48%.   I am waiting for an open science replication with 95% power before I believe in the moderator effect.

Third, even if the moderator effect exists, it doesn’t explain Dr. Schnall’s main effect of d = .6.

The Replicability-Index (R-Index): Quantifying Research Integrity

ANNIVERSARY POST.  Slightly edited version of first R-Index Blog on December 1, 2014.

In a now infamous article, Bem (2011) produced 9 (out of 10) statistically significant results that appeared to show time-reversed causality.  Not surprisingly, subsequent studies failed to replicate this finding.  Although Bem never admitted it, it is likely that he used questionable research practices to produce his results. That is, he did not just run 10 studies and found 9 significant results. He may have dropped failed studies, deleted outliers, etc.  It is well-known among scientists (but not lay people) that researchers routinely use these questionable practices to produce results that advance their careers.  Think, doping for scientists.

I have developed a statistical index that tracks whether published results were obtained by conducting a series of studies with a good chance of producing a positive result (high statistical power) or whether researchers used questionable research practices.  The R-Index is a function of the observed power in a set of studies. More power means that results are likely to replicate in a replication attempt.  The second component of the R-index is the discrepancy between observed power and the rate of significant results. 100 studies with 80% power should produce, on average, 80% significant results. If observed power is 80% and the success rate is 100%, questionable research practices were used to obtain more significant results than the data justify.  In this case, the actual power is less than 80% because questionable research practices inflate observed power. The R-index subtracts the discrepancy (in this case 20% too many significant results) from observed power to adjust for the inflation.  For example, if observed power is 80% and success rate is 100%, the discrepancy is 20% and the R-index is 60%.

In a paper, I show that the R-index predicts success in empirical replication studies.

The R-index also sheds light on the recent controversy about failed replications in psychology (repligate) between replicators and “replihaters.”   Replicators sometimes imply that failed replications are to be expected because original studies used small samples with surprisingly large effects, possibly due to the use of questionable research practices. Replihaters counter that replicators are incompetent researchers who are motivated to produce failed studies.  The R-Index makes it possible to evaluate these claims objectively and scientifically.  It shows that the rampant use of questionable research practices in original studies makes it extremely likely that replication studies will fail.  Replihaters should take note that questionable research practices can be detected and that many failed replications are predicted by low statistical power in original articles.