Citation: Dr. R (2015). Meta-analysis of observed power. R-Index Bulletin, Vol(1), A2.

In a previous blog post, I presented an introduction to the concept of observed power. Observed power is an estimate of the true power on the basis of observed effect size, sampling error, and significance criterion of a study. Yuan and Maxwell (2005) concluded that observed power is a useless construct when it is applied to a single study, mainly because sampling error in a single study is too large to obtain useful estimates of true power. However, sampling error decreases as the number of studies increases and observed power in a set of studies can provide useful information about the true power in a set of studies.

This blog post introduces various methods that can be used to estimate power on the basis of a set of studies (meta-analysis). I then present simulation studies that compare the various estimation methods in terms of their ability to estimate true power under a variety of conditions. In this blog post, I examine only unbiased sets of studies. That is, the sample of studies in a meta-analysis is a representative sample from the population of studies with specific characteristics. The first simulation assumes that samples are drawn from a population of studies with fixed effect size and fixed sampling error. As a result, all studies have the same true power (homogeneous). The second simulation assumes that all studies have a fixed effect size, but that sampling error varies across studies. As power is a function of effect size and sampling error, this simulation models heterogeneity in true power. The next simulations assume heterogeneity in population effect sizes. One simulation uses a normal distribution of effect sizes. Importantly, a normal distribution has no influence on the mean because effect sizes are symmetrically distributed around the mean effect size. The next simulations use skewed normal distributions. This simulation provides a realistic scenario for meta-analysis of heterogeneous sets of studies such as a meta-analysis of articles in a specific journal or articles on different topics published by the same author.

Observed Power Estimation Method 1: The Percentage of Significant Results

The simplest method to determine observed power is to compute the percentage of significant results. As power is defined as the long-range percentage of significant results, the percentage of significant results in a set of studies is an unbiased estimate of the long-term percentage. The main limitation of this method is that the dichotomous measure (significant versus insignificant) is likely to be imprecise when the number of studies is small. For example, two studies can only show observed power values of 0, 25%, 50%, or 100%, even if true power were 75%. However, the percentage of significant results plays an important role in bias tests that examine whether a set of studies is representative. When researchers hide non-significant results or use questionable research methods to produce significant results, the percentage of significant results will be higher than the percentage of significant results that could have been obtained on the basis of the actual power to produce significant results.

Observed Power Estimation Method 2: The Median

Schimmack (2012) proposed to average observed power of individual studies to estimate observed power. Yuan and Maxwell (2005) demonstrated that the average of observed power is a biased estimator of true power. It overestimates power when power is less than 50% and it underestimates true power when power is above 50%. Although the bias is not large (no more than 10 percentage points), Yuan and Maxwell (2005) proposed a method that produces an unbiased estimate of power in a meta-analysis of studies with the same true power (exact replication studies). Unlike the average that is sensitive to skewed distributions, the median provides an unbiased estimate of true power because sampling error is equally likely (50:50 probability) to inflate or deflate the observed power estimate. To avoid the bias of averaging observed power, Schimmack (2014) used median observed power to estimate the replicability of a set of studies.

Observed Power Estimation Method 3: P-Curve’s KS Test

Another method is implemented in Simonsohn’s (2014) pcurve. Pcurve was developed to obtain an unbiased estimate of a population effect size from a biased sample of studies. To achieve this goal, it is necessary to determine the power of studies because bias is a function of power. The pcurve estimation uses an iterative approach that tries out different values of true power. For each potential value of true power, it computes the location (quantile) of observed test statistics relative to a potential non-centrality parameter. The best fitting non-centrality parameter is located in the middle of the observed test statistics. Once a non-central distribution has been found, it is possible to assign each observed test-value a cumulative percentile of the non-central distribution. For the actual non-centrality parameter, these percentiles have a uniform distribution. To find the best fitting non-centrality parameter from a set of possible parameters, pcurve tests whether the distribution of observed percentiles follows a uniform distribution using the Kolmogorov-Smirnov test. The non-centrality parameter with the smallest test statistics is then used to estimate true power.

Observed Power Estimation Method 4: P-Uniform

van Assen, van Aert, and Wicherts (2014) developed another method to estimate observed power. Their method is based on the use of the gamma distribution. Like the pcurve method, this method relies on the fact that observed test-statistics should follow a uniform distribution when a potential non-centrality parameter matches the true non-centrality parameter. P-uniform transforms the probabilities given a potential non-centrality parameter with a negative log-function (-log[x]). These values are summed. When probabilities form a uniform distribution, the sum of the log-transformed probabilities matches the number of studies. Thus, the value with the smallest absolute discrepancy between the sum of negative log-transformed percentages and the number of studies provides the estimate of observed power.

Observed Power Estimation Method 5: Averaging Standard Normal Non-Centrality Parameter

In addition to these existing methods, I introduce to novel estimation methods. The first new method converts observed test statistics into one-sided p-values. These p-values are then transformed into z-scores. This approach has a long tradition in meta-analysis that was developed by Stouffer et al. (1949). It was popularized by Rosenthal during the early days of meta-analysis (Rosenthal, 1979). Transformation of probabilities into z-scores makes it easy to aggregate probabilities because z-scores follow a symmetrical distribution. The average of these z-scores can be used as an estimate of the actual non-centrality parameter. The average z-score can then be used to estimate true power. This approach avoids the problem of averaging power estimates that power has a skewed distribution. Thus, it should provide an unbiased estimate of true power when power is homogenous across studies.

Observed Power Estimation Method 6: Yuan-Maxwell Correction of Average Observed Power

Yuan and Maxwell (2005) demonstrated a simple average of observed power is systematically biased. However, a simple average avoids the problems of transforming the data and can produce tighter estimates than the median method. Therefore I explored whether it is possible to apply a correction to the simple average. The correction is based on Yuan and Maxwell’s (2005) mathematically derived formula for systematic bias. After averaging observed power, Yuan and Maxwell’s formula for bias is used to correct the estimate for systematic bias. The only problem with this approach is that bias is a function of true power. However, as observed power becomes an increasingly good estimator of true power in the long run, the bias correction will also become increasingly better at correcting the right amount of bias.

The Yuan-Maxwell correction approach is particularly promising for meta-analysis of heterogeneous sets of studies such as sets of diverse studies in a journal. The main advantage of this method is that averaging of power makes no assumptions about the distribution of power across different studies (Schimmack, 2012). The main limitation of averaging power was the systematic bias, but Yuan and Maxwell’s formula makes it possible to reduce this systematic bias, while maintaining the advantage of having a method that can be applied to heterogeneous sets of studies.

RESULTS

Homogeneous Effect Sizes and Sample Sizes

The first simulation used 100 effect sizes ranging from .01 to 1.00 and 50 sample sizes ranging from 11 to 60 participants per condition (Ns = 22 to 120), yielding 5000 different populations of studies. The true power of these studies was determined on the basis of the effect size, sample size, and the criterion p < .025 (one-tailed), which is equivalent to .05 (two-tailed). Sample sizes were chosen so that average power across the 5,000 studies was 50%. The simulation drew 10 random samples from each of the 5,000 populations of studies. Each sample of a study simulated a between-subject design with the given population effect size and sample size. The results were stored as one-tailed p-values. For the meta-analysis p-values were converted into z-scores. To avoid biases due to extreme outliers, z-scores greater than 5 were set to 5 (observed power = .999).

The six estimation methods were then used to compute observed power on the basis of samples of 10 studies. The following figures show observed power as a function of true power. The green lines show the 95% confidence interval for different levels of true power. The figure also includes red dashed lines for a value of 50% power. Studies with more than 50% observed power would be significant. Studies with less than 50% observed power would be non-significant. The figures also include a blue line for 80% true power. Cohen (1988) recommended that researchers should aim for a minimum of 80% power. It is instructive how accurate estimation methods are in evaluating whether a set of studies met this criterion.

The histogram shows the distribution of true power across the 5,000 populations of studies.

The histogram shows that the simulation covers the full range of power. It also shows that high-powered studies are overrepresented because moderate to large effect sizes can achieve high power for a wide range of sample sizes. The distribution is not important for the evaluation of different estimation methods and benefits all estimation methods equally because observed power is a good estimator of true power when true power is close to the maximum (Yuan & Maxwell, 2005).

The next figure shows scatterplots of observed power as a function of true power. Values above the diagonal indicate that observed power overestimates true power. Values below the diagonal show that observed power underestimates true power.

Visual inspection of the plots suggests that all methods provide unbiased estimates of true power. Another observation is that the count of significant results provides the least accurate estimates of true power. The reason is simply that aggregation of dichotomous variables requires a large number of observations to approximate true power. The third observation is that visual inspection provides little information about the relative accuracy of the other methods. Finally, the plots show how accurate observed power estimates are in meta-analysis of 10 studies. When true power is 50%, estimates very rarely exceed 80%. Similarly, when true power is above 80%, observed power is never below 50%. Thus, observed power can be used to examine whether a set of studies met Cohen’s recommended guidelines to conduct studies with a minimum of 80% power. If observed power is 50%, it is nearly certain that the studies did not have the recommended 80% power.

To examine the relative accuracy of different estimation methods quantitatively, I computed bias scores (observed power – true power). As bias can overestimate and underestimate true power, the standard deviation of these bias scores can be used to quantify the precision of various estimation methods. In addition, I present the mean to examine whether a method has large sample accuracy (i.e. the bias approaches zero as the number of simulations increases). I also present the percentage of studies with no more than 20% points bias. Although 20% bias may seem large, it is not important to estimate power with very high precision. When observed power is below 50%, it suggests that a set of studies was underpowered even if the observed power estimate is an underestimation.

The quantitative analysis also shows no meaningful differences among the estimation methods. The more interesting question is how these methods perform under more challenging conditions when the set of studies are no longer exact replication studies with fixed power.

The next simulation simulated variation in sample sizes. For each population of studies, sample sizes were varied by multiplying a particular sample size by factors of 1 to 5.5 (1.0, 1.5,2.0…,5.5). Thus, a base-sample-size of 40 created a range of sample sizes from 40 to 220. A base-sample size of 100 created a range of sample sizes from 100 to 2,200. As variation in sample sizes increases the average sample size, the range of effect sizes was limited to a range from .004 to .4 and effect sizes were increased in steps of d = .004. The histogram shows the distribution of power in the 5,000 population of studies.

The simulation covers the full range of true power, although studies with low and very high power are overrepresented.

The results are visually not distinguishable from those in the previous simulation.

The quantitative comparison of the estimation methods also shows very similar results.

In sum, all methods perform well even when true power varies as a function of variation in sample sizes. This conclusion may not generalize to more extreme simulations of variation in sample sizes, but more extreme variations in sample sizes would further increase the average power of a set of studies because the average sample size would increase as well. Thus, variation in effect sizes poses a more realistic challenge for the different estimation methods.

Heterogeneous, Normally Distributed Effect Sizes

The next simulation used a random normal distribution of true effect sizes. Effect sizes were simulated to have a reasonable but large variation. Starting effect sizes ranged from .208 to 1.000 and increased in increments of .008. Sample sizes ranged from 10 to 60 and increased in increments of 2 to create 5,000 populations of studies. For each population of studies, effect sizes were sampled randomly from a normal distribution with a standard deviation of SD = .2. Extreme effect sizes below d = -.05 were set to -.05 and extreme effect sizes above d = 1.20 were set to 1.20. The first histogram of effect sizes shows the 50,000 population effect sizes. The histogram on the right shows the distribution of true power for the 5,000 sets of 10 studies.

The plots of observed and true power show that the estimation methods continue to perform rather well even when population effect sizes are heterogeneous and normally distributed.

The quantitative comparison suggests that puniform has some problems with heterogeneity. More detailed studies are needed to examine whether this is a persistent problem for puniform, but given the good performance of the other methods it seems easier to use these methods.

Heterogeneous, Skewed Normal Effect Sizes

The next simulation puts the estimation methods to a stronger challenge by introducing skewed distributions of population effect sizes. For example, a set of studies may contain mostly small to moderate effect sizes, but a few studies examined large effect sizes. To simulated skewed effect size distributions, I used the rsnorm function of the fGarch package. The function creates a random distribution with a specified mean, standard deviation, and skew. I set the mean to d = .2, the standard deviation to SD = .2, and skew to 2. The histograms show the distribution of effect sizes and the distribution of true power for the 5,000 sets of studies (k = 10).

This time the results show differences between estimation methods in the ability of various estimation methods to deal with skewed heterogeneity. The percentage of significant results is unbiased, but is imprecise due to the problem of averaging dichotomous variables. The other methods show systematic deviations from the 95% confidence interval around the true parameter. Visual inspection suggests that the Yuan-Maxwell correction method has the best fit.

This impression is confirmed in quantitative analyses of bias. The quantitative comparison confirms major problems with the puniform estimation method. It also shows that the median, p-curve, and the average z-score method have the same slight positive bias. Only the Yuan-Maxwell corrected average power shows little systematic bias.

To examine biases in more detail, the following graphs plot bias as a function of true power. These plots can reveal that a method may have little average bias, but has different types of bias for different levels of power. The results show little evidence of systematic bias for the Yuan-Maxwell corrected average of power.

The following analyses examined bias separately for simulation with less or more than 50% true power. The results confirm that all methods except the Yuan-Maxwell correction underestimate power when true power is below 50%. In contrast, most estimation methods overestimate true power when true power is above 50%. The exception is puniform which still underestimated true power. More research needs to be done to understand the strange performance of puniform in this simulation. However, even if p-uniform could perform better, it is likely to be biased with skewed distributions of effect sizes because it assumes a fixed population effect size.

Conclusion

This investigation introduced and compared different methods to estimate true power for a set of studies. All estimation methods performed well when a set of studies had the same true power (exact replication studies), when effect sizes were homogenous and sample sizes varied, and when effect sizes were normally distributed and sample sizes were fixed. However, most estimation methods were systematically biased when the distribution of effect sizes was skewed. In this situation, most methods run into problems because the percentage of significant results is a function of the power of individual studies rather than the average power.

The results of these analyses suggest that the R-Index (Schimmack, 2014) can be improved by simply averaging power and then applying the Yuan-Maxwell correction. However, it is important to realize that the median method tends to overestimate power when power is greater than 50%. This makes it even more difficult for the R-Index to produce an estimate of low power when power is actually high. The next step in the investigation of observed power is to examine how different methods perform in unrepresentative (biased) sets of studies. In this case, the percentage of significant results is highly misleading. For example, Sterling et al. (1995) found percentages of 95% power, which would suggest that studies had 95% power. However, publication bias and questionable research practices create a bias in the sample of studies that are being published in journals. The question is whether other observed power estimates can reveal bias and can produce accurate estimates of the true power in a set of studies.

The authors distinguish between fraud and QRPs. Fraud is typically limited to cases in which researchers create false data. In contrast, QRPs typically involve the exclusion of data that are inconsistent with a theoretical hypothesis. QRPs are treated differently than fraud because QRPs can sometimes be used for legitimate purposes.

For example, a data entry error may produce a large outlier that leads to a non-significant result when all data are included in the analysis. The results are significant when the outlier is removed. Statistical textbook often advise to exclude outliers for this reason. However, removal of outliers becomes a QRP when it is used selectively. That is, outliers are not removed when a result is significant or when the outlier helps to produce a significant result, but outliers are removed when removal of outliers helps to get a significant result.

The use of QRPs is damaging because published results provide false impressions about the replicability of empirical results and misleading evidence about the size of an effect.

Below is a list of QRPs.

Selective reporting of (dependent) variables. For example, a researcher may include 10 items to measure depression. Typically, the 10 items are averaged to get the best measure of depression. However, if this analysis does not produce a significant result, the researcher can conduct analyses of each individual item or average items that trend in the right direction. By creating different dependent variables after the study is completed, a researcher increases the chances of obtaining a significant result that will not replicate in a replication study with the same dependent variable.

A simple solution to preventing this QRP is to ask authors to use well-established measures as dependent variables and/or to ask for pre-registration of all measures that are relevant to the test of a theoretical hypothesis (i.e., it is not necessary to specify that the study also asked about handedness because handedness is not a measure of depression).

Deciding whether to collect more data after looking to see whether the results will be significant. It is difficult to distinguish random variation from a true effect in small samples. At the same time, it can be a costly waste of resources (or even unethical in animal research) to conduct studies with large samples, when the effect can be detected in a smaller sample. It is also difficult to know a priori how large a sample should be to obtain a significant result. It therefore seems reasonable to check data while they are being collected for significance. If an effect does not seem to be present in a reasonably large sample size, it may be better to abandon a study. None of these practices are problematic unless a researcher constantly checks for significance and stops data collection immediately after the data show a significant result. This practice capitalizes on sampling error and the experiment will typically stop when sampling error inflates the true effect size.

A simple solution to this problem is to set some a priori rules about the end of data collection. For example, a researcher may calculate sample size based on a rough power analysis. Based on an optimistic assumption that the true effect is large, the data will be checked when the study has 80% power for a large effect (d = .8). If this does not result in a significant result, the researcher continues with the revised hypothesis that the true effect is moderate and then checks the data again when 80% power for a moderate effect is reached. If this does not result in a significant result, the researcher may give up or continue with the revised hypothesis that the true effect is small. This procedure would allow researchers to use an optimal amount of resources. Moreover, they can state there sampling strategy openly so that meta-analysts can make corrections for the small amount of biases that is still introduced by this reasonable form of optional stopping.

Failing to disclose experimental conditions. There are no justifiable reasons for the exclusion of conditions. Evidently, researchers are not going to exclude conditions that are consistent with theoretical predictions. So, the exclusion of conditions can only produce results that are overly consistent with theoretical predictions. If there are reasonable doubts about a condition (e.g., a manipulation check shows that it did not work), the condition can be included and it can be explained why the results may not conform to predictions).

A simple solution to the problem of conditions with unexpected results is that researchers may include too many conditions in their design. A 2 x 2 x 2 factorial design has 8 cells, which allows for 28 comparisons of means. What are the chances that all of these 28 comparisons produce results that are consistent with theoretical predictions?

Another simple solution is to avoid the use of statistical methods with low power. To demonstrate a three-way interaction requires a lot more data than to demonstrate that a pattern of means is consistent with an a priori theoretically predicted pattern.

In a paper reporting selectively studies that worked.

There is no reason for excluding studies that did not work. Excluding studies that were planned as demonstrations of an effect need to be reported. Otherwise the published evidence provides an overly positive picture of the robustness of a phenomenon and effect sizes are inflated.

Just like failed conditions, failed studies can be reported if there is a plausible explanation why it failed whereas other studies worked. However, to justify this claim, it should be demonstrated that the effects in failed and successful studies are really significantly different (a significant moderator effect). If this is not the case, there is no reason to treat failed and successful studies as different from each other.

A simple solution to this problem is to conduct studies with high statistical power because the main reason for failed studies is that studies have low power. If a study has only 30% power, only one out of three studies will produce a significant result. The other two studies are likely to produce a type-II error (not show a significant result when the effect exists). Rather than throwing away the two failed studies, a researcher should have conducted one study with higher power. Another solution is to report all three studies and to test for significance only in a meta-analysis across the three studies.

In a paper, rounding off a p-value just above .054 and claim that it is below .05. This is a minor problem. It is silly to change a p-value, but it does not bias a meta-analysis of effect sizes because researchers do not change effect size information. Moreover, it would be even more silly not to change the p-value and conclude that there is no effect, which is often the case when results are not significant. After all, a p-value of .054 means that the effect in this study would have occurred if the true effect is zero or has the opposite sign.

If a type-I error probability of 5.4% is considered too high, it would be possible to collect more data and test again with a larger sample (taking multiple testing into account).

Moreover, this problem should arise very infrequently. Even if a study is underpowered and has only 50% power, only 2% of p-values are expected to fall into the narrow range between .050 and .054.

In a paper, reporting an unexpected finding as having been predicted from the start. I am sure some statisticians disagree with me and I may be wrong about this one, but I simply do not understand how a statistical analysis of some data cares about the expectations of a researcher. Say, I analyze some data and find a significant effect in the data. How can this effect be influenced by the way I report it later? It may be a type-I error or it is not a type-I error, but my expectations have no influence on the causal processes that produced the empirical data. I think the practice of writing exploratory studies as if they were conducted an a priori hypothesis is considered questionable because it often requires other QRPs (e.g., excluding additional tests that didn’t work) to produce a story that is concocted to explain unexpected results. However, if the results are presented honestly and one out of five predictor variables in a multiple-regression is significant at p < .0001, it is likely to be a replicable finding, even if it is presented with a post-hoc prediction.

In a paper, claiming that results are unaffected by demographic variables (e.g., gender) when one is actually unsure (or knows that they do). Again, this is a relatively minor point because it only speaks about potential moderators of a reported effect. Moderation is important, but the conclusion about the main effect remains unchanged. For example, if an effect exists for men, but not for women, it is still true that on average there is an effect. Furthermore, a more common mistake is often to claim that gender or other factors did not moderate an effect based on an underpowered comparison of 10 men and 30 women in a study with 40 participants. Thus, false claims about moderating variables are annoying, but not a threat to the replicability of empirical results.

Falsifying Data. I personally do not include falsifying or fabricating of data in the list of questionable research practices. I think falsifying and fabrication of data is not a research practice. It is also something that is clearly considered fraudulent and punished when it is discovered. In contrast, questionable research practices are tolerated in many scientific communities and there are no clear guidelines against the use of these practices.

In conclusion, the most problematic research practices that undermine the replicability of published studies are selective reporting of dependent variables, conditions, or entire studies, and optional stopping when significance is reached. These practices make it possible to produce significant results when a study has insufficient power. However, to achieve significance without power, the type-I error rate also increases and replicability decreases. John et al. (2012) aptly compared these QRPs to the use of doping in sports. I consider the R-Index a doping test for science because it reveals that researchers used these QRPs. I hope that the R-Index will discourage the use of QRPs and increase the power and replicability of published studies.

Whether scientific organizations should ban QRPs just like sports organizations ban doping is an interesting question. Meanwhile the R-Index can be used without draconian consequences. Researchers can self-examine the replicability of their findings and they can examine the replicability of published results before they invest resources, time, and the future of their graduate students in research projects that fail. Granting agencies can use the R-Index to reward researchers who conduct fewer studies with replicable results rather than researchers with many studies that fail to replicate. Finally, the R-Index can be used to track how successful current initiatives are to increase the replicability of published studies.

A previous blog examined how and why Dr. Förster’s data showed incredibly improbable linearity.

The main hypothesis was that two experimental manipulations have opposite effects on a dependent variable.

Assuming that the average effect size of a single manipulation is similar to effect sizes in social psychology, a single manipulation is expected to have an effect size of d = .5 (change by half a standard deviation). As the two manipulations are expected to have opposite effects, the mean difference between the two experimental groups should be one standard deviation (0.5 + 0.5 = 1). With N = 40, and d = 1, a study has 87% power to produce a significant effect (p < .05, two-tailed). With power of this magnitude, it would not be surprising to get significant results in 12 comparisons (Table 1).

The R-Index for the comparison of the two experimental groups in Table is Ř = 87%
(Success Rate = 100%, Median Observed Power = 94%, Inflation Rate = 6%).

The Test of Insufficient Variance (TIVA) shows that the variance in z-scores is less than 1, but the probability of this event to occur by chance is 10%, Var(z) = .63, Chi-square (df = 11) = 17.43, p = .096.

Thus, the results for the two experimental groups are perfectly consistent with real empirical data and the large effect size could be the result of two moderately strong manipulations with opposite effects.

The problem for Dr. Förster started when he included a control condition and want to demonstrate in each study that the two experimental groups also differed significantly from the experimental group. As already pointed out in the original post, samples of 20 participants per condition do not provide sufficient power to demonstrate effect sizes of d = .5 consistently.

To make matters worse, the three-group design has even less power than two independent studies because the same control group is used in a three-group comparison. When sampling error inflates the mean in the control group (e.g, true mean = 33, estimated mean = 36), it benefits the comparison for the experimental group with the lower mean, but it hurts the comparison for the experimental group with the higher mean (e.g., M = 27, M = 33, M = 39 vs. M = 27, M = 36, M = 39). When sampling error leads to an underestimation of the true mean in the control group (e.g., true mean = 33, estimated mean = 30), it benefits the comparison of the higher experimental group with the control group, but it hurts the comparison of the lower experimental group and the control group.

Thus, total power to produce significant results for both comparisons is even lower than for two independent studies.

It follows that the problem for a researcher with real data was the control group. Most studies would have produced significant results for the comparison of the two experimental groups, but failed to show significant differences between one of the experimental groups and the control group.

At this point, it is unclear how Jens Förster achieved significant results under the contested assumption that real data were collected. However, it seems most plausible that QRPs would be used to move the mean of the control group to the center so that both experimental groups show a significant difference. When this was impossible, the control group could be dropped, which may explain why 3 studies in Table 1 did not report results for a control group.

The influence of QRPs on the control group can be detected by examining the variation of means in Table 1 across the 12(9) studies. Sampling error should randomly increase or decrease means relative to the overall mean of an experimental condition. Thus, there is no reason to expect a correlation in the pattern of means. Consistent with this prediction, the means of the two experimental groups are unrelated, r(12) = .05, p = .889; r(9) = .36, p = .347. In contrast, the means of the control group are correlated with the means of the two experimental groups, r(9) = .73, r(9) = .71. If the means in the control group are the result of the unbiased means in the experimental groups, it makes sense to predict the means in the control group from the means in the two experimental groups. A regression equation shows that 77% of the variance in the means of the control group is explained by the variation in the means in the experimental groups, R = .88, F(2,6) = 10.06, p = .01.

This analysis clarifies the source of the unusual linearity in the data. Studies with n = 20 per condition have very low power to demonstrate significant differences between a control group and opposite experimental groups because sampling error in the control group is likely to move the mean of the control group too close to one of the experimental groups to produce a significant difference.

This problem of low power may lead researchers to use QRPs to move the mean of the control group to the center. The problem for users of QRPs is that this statistical boost of power leaves a trace in the data that can be detected with various bias tests. The pattern of the three means will be too linear, there will be insufficient variance in the effect sizes, p-values, and observed power in the comparisons of experimental groups and control groups, the success rate will exceed median observed power, and, as shown here, the means in the control group will be correlated with the means in the experimental group across conditions.

In a personal email Dr. Förster did not comment on the statistical analyses because his background in statistics is insufficient to follow the analyses. However, he rejected this scenario as an account for the unusual linearity in his data; “I never changed any means.” Another problem for this account of what could have happened is that dropping cases from the middle group would lower the sample size of this group, but the sample size is always close to n = 20. Moreover, oversampling and dropping of cases would be a QRP that Dr. Förster would remember and could report. Thus, I now agree with the conclusion of the LOWI commission that the data cannot be explained by using QRPs, mainly because Dr. Förster denies having used any plausible QRPs that could have produced his results.

Some readers may be confused about this conclusion because it may appear to contradict my first blog. However, my first blog merely challenged the claim by the LOWI commission that linearity cannot be explained by QRPs. I found a plausible way in which QRPs could have produced linearity, and these new analyses still suggest that secretive and selective dropping of cases from the middle group could be used to show significant contrasts. Depending on the strength of the original evidence, this use of QRPs would be consistent with the widespread use of QRPs in the field and would not be considered scientific misconduct. As Roy F. Baumeister, a prominent social psychologist put it, “this is just how the field works.” However, unlike Roy Baumeister, who explained improbable results with the use of QRPs, Dr. Förster denies any use of QRPs that could potentially explain the improbable linearity in his results.

In conclusion, the following facts have been established with sufficient certainty:
(a) the reported results are too improbable to reflect just true effects and sampling error; they are not credible.
(b) the main problem for a researcher to obtain valid results is the low power of multiple-study articles and the difficulty of demonstrating statistical differences between one control group and two opposite experimental groups.
(c) to avoid reporting non-significant results, a researcher must drop failed studies and selectively drop cases from the middle group to move the mean of the middle group to the middle.
(d) Dr. Förster denies the use of QRPs and he denies data manipulation.
Evidently, the facts do not add up.

The new analyses suggest that there is one simple way for Dr. Förster to show that his data have some validity. The reason is that the comparison of the two experimental groups shows an R-Index of 87%. This implies that there is nothing statistically improbable about the comparison of these data. If these reported results are based on real data, a replication study is highly likely to replicate the mean difference between the two experimental groups. With n = 20 in each cell (N = 40), it would be relatively easy to conduct a preregistered and transparent replication study. However, without further credible evidence the published data lack credible scientific evidence and it would be prudent to retract all articles that show unusual statistical patterns that cannot be explained by the author.

In 2011, Dr. Förster published an article in Journal of Experimental Psychology: General. The article reported 12 studies and each study reported several hypothesis tests. The abstract reports that “In all experiments, global/local processing in 1 modality shifted to global/local processing in the other modality”.

For a while this article was just another article that reported a large number of studies that all worked and neither reviewers nor the editor who accepted the manuscript for publication found anything wrong with the reported results.

In 2012, an anonymous letter voiced suspicion that Jens Forster violated rules of scientific misconduct. The allegation led to an investigation, but as of today (January 1, 2015) there is no satisfactory account of what happened. Jens Förster maintains that he is innocent (5b. Brief von Jens Förster vom 10. September 2014) and blames the accusations about scientific misconduct on a climate of hypervigilance after the discovery of scientific misconduct by another social psychologist.

The Accusation

The accusation is based on an unusual statistical pattern in three publications. The 3 articles reported 40 experiments with 2284 participants, that is an average sample size of N = 57 participants in each experiment. The 40 experiments all had a between-subject design with three groups: one group received a manipulation design to increase scores on the dependent variable. A second group received the opposite manipulation to decrease scores on the dependent variable. And a third group served as a control condition with the expectation that the average of the group would fall in the middle of the two other groups. To demonstrate that both manipulations have an effect, both experimental groups have to show significant differences from the control group.

The accuser noticed that the reported means were unusually close to a linear trend. This means that the two experimental conditions showed markedly symmetrical deviations from the control group. For example, if one manipulation increased scores on the dependent variables by half a standard deviation (d = +.5), the other manipulation decreased scores on the dependent variable by half a standard deviation (d = -.5). Such a symmetrical pattern can be expected when the two manipulations are equally strong AND WHEN SAMPLE SIZES ARE LARGE ENOUGH TO MINIMIZE RANDOM SAMPLING ERROR. However, the sample sizes were small (n = 20 per condition, N = 60 per study). These sample sizes are not unusual and social psychologists often use n = 20 per condition to plan studies. However, these sample sizes have low power to produce consistent results across a large number of studies.

The accuser computed the statistical probability of obtaining the reported linear trend. The probability of obtaining the picture-perfect pattern of means by chance alone was incredibly small.

Based on this finding, the Dutch National Board for Research Integrity (LOWI) started an investigation of the causes for this unlikely finding. An English translation of the final report was published on retraction watch. An important question was whether the reported results could have been obtained by means of questionable research practices or whether the statistical pattern can only be explained by data manipulation. The English translation of the final report includes two relevant passages.

According to one statistical expert “QRP cannot be excluded, which in the opinion of the expert is a common, if not “prevalent” practice, in this field of science.” This would mean that Dr. Förster acted in accordance with scientific practices and that his behavior would not constitute scientific misconduct.

In response to this assessment the Complainant “extensively counters the expert’s claim that the unlikely patterns in the experiments can be explained by QRP.” This led to the decision that scientific misconduct occurred.

Four QRPs were considered.

Improper rounding of p-values. This QRP can only be used rarely when p-values happen to be close to .05. It is correct that this QRP cannot produce highly unusual patterns in a series of replication studies. It can also be easily checked by computing exact p-values from reported test statistics.

Selecting dependent variables from a set of dependent variables. The articles in question reported several experiments that used the same dependent variable. Thus, this QRP cannot explain the unusual pattern in the data.

Collecting additional research data after an initial research finding revealed a non-significant result. This description of an QRP is ambiguous. Presumably it refers to optional stopping. That is, when the data trend in the right direction to continue data collection with repeated checking of p-values and stopping when the p-value is significant. This practices lead to random variation in sample sizes. However, studies in the reported articles all have more or less 20 participants per condition. Thus, optional stopping can be ruled out. However, if a condition with 20 participants does not produce a significant result, it could simply be discarded, and another condition with 20 participants could be run. With a false-positive rate of 5%, this procedure will eventually yield the desired outcome while holding sample size constant. It seems implausible that Dr. Förster conducted 20 studies to obtain a single significant result. Thus, it is even more plausible that the effect is actually there, but that studies with n = 20 per condition have low power. If power were just 30%, the effect would appear in every third study significantly, and only 60 participants were used to produce significant results in one out of three studies. The report provides insufficient information to rule out this QRP, although it is well-known that excluding failed studies is a common practice in all sciences.

Selectively and secretly deleting data of participants (i.e., outliers) to arrive at significant results. The report provides no explanation how this QRP can be ruled out as an explanation. Simmons, Nelson, and Simonsohn (2011) demonstrated that conducting a study with 37 participants and then deleting data from 17 participants can contribute to a significant result when the null-hypothesis is true. However, if an actual effect is present, fewer participants need to be deleted to obtain a significant result. If the original sample size is large enough, it is always possible to delete cases to end up with a significant result. Of course, at some point selective and secretive deletion of observation is just data fabrication. Rather than making up data, actual data from participants are deleted to end up with the desired pattern of results. However, without information about the true effect size, it is difficult to determine whether an effect was present and just embellished (see Fisher’s analysis of Mendel’s famous genetics studies) or whether the null-hypothesis is true.

The English translation of the report does not contain any statements about questionable research practices from Dr. Förster. In an email communication on January 2, 2014, Dr. Förster revealed that he in fact ran multiple studies, some of which did not produce significant results, and that he only reported his best studies. He also mentioned that he openly admitted to this common practice to the commission. The English translation of the final report does not mention this fact. Thus, it remains an open question whether QRPs could have produced the unusual linearity in Dr. Förster’s studies.

A New Perspective: The Curse of Low Powered Studies

One unresolved question is why Dr. Förster would manipulate data to produce a linear pattern of means that he did not even mention in his articles. (Discover magazine).

One plausible answer is that the linear pattern is the by-product of questionable research practices to claim that two experimental groups with opposite manipulations are both significantly different from a control group. To support this claim, the articles always report contrasts of the experimental conditions and the control condition (see Table below).

In Table 1 the results of these critical tests are reported with subscripts next to the reported means. As the direction of the effect is theoretically determined, a one-tailed test was used. The null-hypothesis was rejected when p < .05.

Table 1 reports 9 comparisons of global processing conditions and control groups and 9 comparisons of local processing conditions with a control group; a total of 18 critical significance tests. All studies had approximately 20 participants per condition. The average effect size across the 18 studies is d = .71 (median d = .68). An a priori power analysis with d = .7, N = 40, and significance criterion .05 (one-tailed) gives a power estimate of 69%.

An alternative approach is to compute observed power for each study and to use median observed power (MOP) as an estimate of true power. This approach is more appropriate when effect sizes vary across studies. In this case, it leads to the same conclusion, MOP = 67.

The MOP estimate of power implies that a set of 100 tests is expected to produce 67 significant results and 33 non-significant results. For a set of 18 tests, the expected values are 12.4 significant results and 5.6 non-significant results.

The actual success rate in Table 1 should be easy to infer from Table 1, but there are some inaccuracies in the subscripts. For example, Study 1a shows no significant difference between means of 38 and 31 (d = .60, but it shows a significant difference between means 31 and 27 (d = .33). Most likely the subscript for the control condition should be c not a.

Based on the reported means and standard deviations, the actual success rate with N = 40 and p < .05 (one-tailed) is 83% (15 significant and 3 non-significant results).

The actual success rate (83%) is higher than one would expect based on MOP (67%). This inflation in the success rate suggests that the reported results are biased in favor of significant results (the reasons for this bias are irrelevant for the following discussion, but it could be produced by not reporting studies with non-significant results, which would be consistent with Dr. Förster’s account ).

The R-Index was developed to correct for this bias. The R-Index subtracts the inflation rate (83% – 67% = 16%) from MOP. For the data in Table 1, the R-Index is 51% (67% – 16%).

Given the use of a between-subject design and approximately equal sample sizes in all studies, the inflation in power can be used to estimate inflation of effect sizes. A study with N = 40 and p < .05 (one-tailed) has 50% power when d = .50.

Thus, one interpretation of the results in Table 1 is that the true effect sizes of the manipulation is d = .5, that 9 out of 18 tests should have produced a significant contrast at p < .05 (one-tailed) and that questionable research practices were used to increase the success rate from 50% to 83% (15 vs. 9 successes).

The use of questionable research practices would also explain unusual linearity in the data. Questionable research practices will increase or omit effect sizes that are insufficient to produce a significant result. With a sample size of N = 40, an effect size of d = .5 is insufficient to produce a significant result, d = .5, se = 32, t(38) = 1.58, p = .06 (one-tailed). Random sampling error that works against the hypothesis can only produce non-significant results that have to be dropped or moved upwards using questionable methods. Random error that favors the hypothesis will inflate the effect size and start producing significant results. However, random error is normally distributed around the true effect size and is more likely to produce results that are just significant (d = .8) than to produce results that are very significant (d = 1.5). Thus, the reported effect sizes will be clustered more closely around the median inflated effect size than one would expect based on an unbiased sample of effect sizes.

The clustering of effect sizes will happen for the positive effects in the global processing condition and for the negative effects in the local processing condition. As a result, the pattern of all three means will be more linear than an unbiased set of studies would predict. In a large set of studies, this bias will produce a very low p-value.

One way to test this hypothesis is to examine the variability in the reported results. The Test of Insufficient Variance (TIVA) was developed for this purpose. TIVA first converts p-values into z-scores. The variance of z-scores is known to be 1. Thus, a representative sample of z-scores should have a variance of 1, but questionable research practices lead to a reduction in variance. The probability that a set of z-scores is a representative set of z-scores can be computed with a chi-square test and chi-square is a function of the ratio of the expected and observed variance and the number of studies. For the set of studies in Table 1, the variance in z-scores is .33. The chi-square value is 54. With 17 degrees of freedom, the p-value is 0.00000917 and the odds of this event occurring by chance are 1 out of 109,056 times.

Conclusion

Previous discussions about abnormal linearity in Dr. Förster’s studies have failed to provide a satisfactory answer. An anonymous accuser claimed that the data were fabricated or manipulated, which the author vehemently denies. This blog proposes a plausible explanation of what could have [edited January 19, 2015] happened. Dr. Förster may have conducted more studies than were reported and included only studies with significant results in his articles. Slight variation in sample sizes suggests that he may also have removed a few outliers selectively to compensate for low power. Importantly, neither of these practices would imply scientific misconduct. The conclusion of the commission that scientific misconduct occurred rests on the assumption that QRPs cannot explain the unusual linearity of means, but this blog points out how selective reporting of positive results may have inadvertently produced this linear pattern of means. Thus, the present analysis support the conclusion by an independent statistical expert mentioned in the LOWI report: “QRP cannot be excluded, which in the opinion of the expert is a common, if not “prevalent” practice, in this field of science.”

How Unusual is an R-Index of 51?

The R-Index for the 18 statistical tests reported in Table 1 is 51% and TIVA confirms that the reported p-values have insufficient variance. Thus, it is highly probable that questionable research practices contributed to the results and in a personal communication Dr. Förster confirmed that additional studies with non-significant results exist. However, in response to further inquiries [see follow up blog] Dr. Förster denied having used QRPs that could explain the linearity in his data.

Nevertheless, an R-Index of 51% is not unusual and has been explained with the use of QRPs. For example, the R-Index for a set of studies by Roy Baumeister was 49%, . and Roy Baumeister stated that QRPs were used to obtain significant results.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.”

Sadly, it is quite common to find an R-Index of 50% or lower for prominent publications in social psychology. This is not surprising because questionable research practices were considered good practices until recently. Even at present, it is not clear whether these practices constitute scientific misconduct (see discussion in Dialogue, Newsletter of the Society for Personality and Social Psychology).

How to Avoid Similar Sad Stories in the Future

One way to avoid accusations of scientific misconduct is to conduct a priori power analyses and to conduct only studies with a realistic chance to produce a significant result when the hypothesis is correct. When random error is small, true patterns in data can emerge without the help of QRPs.

Another important lesson from this story is to reduce the number of statistical tests as much as possible. Table 1 reported 18 statistical tests with the aim to demonstrate significance in each test. Even with a liberal criterion of .1 (one-tailed), it is highly unlikely that so many significant tests will produce positive results. Thus, a non-significant result is likely to emerge and researchers should think ahead of time how they would deal with non-significant results.

For the data in Table 1, Dr. Förster could have reported the means of 9 small studies without significance tests and conduct significance tests only once for the pattern in all 9 studies. With a total sample size of 360 participants (9 * 40), this test would have 90% power even if the effect size is only d = .35. With 90% power, the total power to obtain significant differences from the control condition for both manipulations would be 81%. Thus, the same amount of resources that were used for the controversial findings could have been used to conduct a powerful empirical test of theoretical predictions without the need to hide inconclusive, non-significant results in studies with low power.

Jacob Cohen has been trying to teach psychologists the importance of statistical power for decades and psychologists stubbornly ignored his valuable contribution to research methodology until he died in 1998. Methodologists have been mystified by the refusal of psychologists to increase power in their studies (Maxwell, 2004).

One explanation is that small samples provided a huge incentive. A non-significant result can be discarded with little cost of resources, whereas a significant result can be published and have the additional benefit of an inflated effect size, which allows boosting the importance of published results.

The R-Index was developed to balance the incentive structure towards studies with high power. A low R-Index reveals that a researcher is reporting biased results that will be difficult to replicate by other researchers. The R-Index reveals this inconvenient truth and lowers excitement about incredible results that are indeed incredible. The R-Index can also be used by researchers to control their own excitement about results that are mostly due to sampling error and to curb the excitement of eager research assistants that may be motivated to bias results to please a professor.

Curbed excitement does not mean that the R-Index makes science less exciting. Indeed, it will be exciting when social psychologists start reporting credible results about social behavior that boost a high R-Index because for a true scientist nothing is more exciting than the truth.

It has been known for decades that published results tend to be biased (Sterling, 1959). For most of the past decades this inconvenient truth has been ignored. In the past years, there have been many suggestions and initiatives to increase the replicability of reported scientific findings (Asendorpf et al., 2013). One approach is to examine published research results for evidence of questionable research practices (see Schimmack, 2014, for a discussion of existing tests). This blog post introduces a new test of bias in reported research findings, namely the Test of Insufficient Variance (TIVA).

TIVA is applicable to any set of studies that used null-hypothesis testing to conclude that empirical data provide support for an empirical relationship and reported a significance test (p-values).

Rosenthal (1978) developed a method to combine results of several independent studies by converting p-values into z-scores. This conversion uses the well-known fact that p-values correspond to the area under the curve of a normal distribution. Rosenthal did not discuss the relation between these z-scores and power analysis. Z-scores are observed scores that should follow a normal distribution around the non-centrality parameter that determines how much power a study has to produce a significant result. In the Figure, the non-centrality parameter is 2.2. This value is slightly above a z-score of 1.96, which corresponds to a two-tailed p-value of .05. A study with a non-centrality parameter of 2.2 has 60% power. In specific studies, the observed z-scores vary as a function of random sampling error. The standardized normal distribution predicts the distribution of observed z-scores. As observed z-scores follow the standard normal distribution, the variance of an unbiased set of z-scores is 1. The Figure on top illustrates this with the nine purple lines, which are nine randomly generated z-scores with a variance of 1.

In a real data set the variance can be greater than 1 for two reasons. First, if the nine studies are exact replication studies with different sample sizes, larger samples will have a higher non-centrality parameter than smaller samples. This variance in the true non-centrality variances adds to the variance produced by random sampling error. Second, a set of studies that are not exact replication studies can have variance greater than 1 because the true effect sizes can vary across studies. Again, the variance in true effect sizes produces variance in the true non-centrality parameters that add to the variance produced by random sampling error. In short, the variance is 1 in exact replication studies that also hold the sample size constant. When sample sizes and true effect sizes vary, the variance in observed z-scores is greater than 1. Thus, an unbiased set of z-scores should have a minimum variance of 1.

If the variance in z-scores is less than 1, it suggests that the set of z-scores is biased. One simple reason for insufficient variance is publication bias. If power is 50% and the non-centrality parameter matches the significance criterion of 1.96, 50% of studies that were conducted would not be significant. If these studies are omitted from the set of studies, variance decreases from 1 to .36. Another reason for insufficient variance is that researchers do not report non-significant results or used questionable research practices to inflate effect size estimates. The effect is that variance in observed z-scores is restricted. Thus, insufficient variance in observed z-scores reveals that the reported results are biased and provide an inflated estimate of effect size and replicability.

In small sets of studies, insufficient variance may be due to chance alone. It is possible to quantify how lucky a researcher was to obtain significant results with insufficient variance. This probability is a function of two parameters: (a) the ratio of the observed variance (OV) in a sample over the population variance (i.e., 1), and (b) the number of z-scores minus 1 as the degrees of freedom (k -1).

The product of these two parameters follows a chi-square distribution with k-1 degrees of freedom.

Formula 1: Chi-square = OV * (k – 1) with k-1 degrees of freedom.

Example 1:

Bem (2011) published controversial evidence that appear to demonstrate precognition. Subsequent studies failed to replicate these results (Galak et al.,, 2012) and other bias tests show evidence that the reported results are biased Schimmack (2012). For this reason, Bem’s article provides a good test case for TIVA.

The article reported results of 10 studies with 9 z-scores being significant at p < .05 (one-tailed). The observed variance in the 10 z-scores is 0.19. Using Formula 1, the chi-square value is chi^2 (df = 9) = 1.75. Importantly, chi-square tests are usually used to test whether variance is greater than expected by chance (right tail of the distribution). The reason is that variance is not expected to be less than the variance expected by chance because it is typically assumed that a set of data is unbiased. To obtain a probability of insufficient variance, it is necessary to test the left-tail of the chi-square distribution. The corresponding p-value for chi^2 (df = 9) = 1.75 is p = .005. Thus, there is only a 1 out of 200 probability that a random set of 10 studies would produce a variance as low as Var = .19.

This outcome cannot be attributed to publication bias because all studies were published in a single article. Thus, TIVA supports the hypothesis that the insufficient variance in Bem’s z-scores is the result of questionable research methods and that the reported effect size of d = .2 is inflated. The presence of bias does not imply that the true effect size is 0, but it does strongly suggest that the true effect size is smaller than the average effect size in a set of studies with insufficient variance.

Example 2:

Vohs et al. (2006) published a series of studies that he results of nine experiments in which participants were reminded of money. The results appeared to show that “money brings about a self-sufficient orientation.” Francis and colleagues suggested that the reported results are too good to be true. An R-Index analysis showed an R-Index of 21, which is consistent with a model in which the null-hypothesis is true and only significant results are reported.

Because Vohs et al. (2006) conducted multiple tests in some studies, the median p-value was used for conversion into z-scores. The p-values and z-scores for the nine studies are reported in Table 2. The Figure on top of this blog illustrates the distribution of the 9 z-scores relative to the expected standard normal distribution.

Table 2

Study p z

Study 1 .026 2.23
Study 2 .050 1.96
Study 3 .046 1.99
Study 4 .039 2.06
Study 5 .021 2.99
Study 6 .040 2.06
Study 7 .026 2.23
Study 8 .023 2.28
Study 9 .006 2.73

The variance of the 9 z-scores is .054. This is even lower than the variance in Bem’s studies. The chi^2 test shows that this variance is significantly less than expected from an unbiased set of studies, chi^2 (df = 8) = 1.12, p = .003. An unusual event like this would occur in only 1 out of 381 studies by chance alone.

In conclusion, insufficient variance in z-scores shows that it is extremely likely that the reported results overestimate the true effect size and replicability of the reported studies. This confirms earlier claims that the results in this article are too good to be true (Francis et al., 2014). However, TIVA is more powerful than the Test of Excessive Significance and can provide more conclusive evidence that questionable research practices were used to inflate effect sizes and the rate of significant results in a set of studies.

Conclusion

TIVA can be used to examine whether a set of published p-values was obtained with the help of questionable research practices. When p-values are converted into z-scores, the variance of z-scores should be greater or equal to 1. Insufficient variance suggests that questionable research practices were used to avoid publishing non-significant results; this includes simply not reporting failed studies.

At least within psychology, these questionable research practices are used frequently to compensate for low statistical power and they are not considered scientific misconduct by governing bodies of psychological science (APA, APS, SPSP). Thus, the present results do not imply scientific misconduct by Bem or Vohs, just like the use of performance enhancing drugs in sports is not illegal unless a drug is put on an anti-doping list. However, jut because a drug is not officially banned, it does not mean that the use of a drug has no negative effects on a sport and its reputation.

One limitation of TIVA is that it requires a set of studies and that variance in small sets of studies can vary considerably just by chance. Another limitation is that TIVA is not very sensitive when there is substantial heterogeneity in true non-centrality parameters. In this case, the true variance in z-scores can mask insufficient variance in random sampling error. For this reason, TIVA is best used in conjunction with other bias tests. Despite these limitations, the present examples illustrate that TIVA can be a powerful tool in the detection of questionable research practices. Hopefully, this demonstration will lead to changes in the way researchers view questionable research practices and how the scientific community evaluates results that are statistically improbable. With rejection rates at top journals of 80% or more, one would hope that in the future editors will favor articles that report results from studies with high statistical power that obtain significant results that are caused by the predicted effect.

An article in Psychological Science titled “Women Are More Likely to Wear Red or Pink at Peak Fertility” reported two studies that related women’s cycle to the color of their shirts. Study 1 (N = 100) found that women were more likely to wear red or pink shirts around the time of ovulation. Study 2 (N = 25) replicated this finding. An article in Slate magazine, “Too good to be true” questioned the credibility of the reported results. The critique led to a lively discussion about research practices, statistics, and psychological science in general.

The R-Index provides some useful information about some unresolved issues in the debate.

The main finding in Study 1 was a significant chi-square test, chi-square (1, N = 100) = 5.32, p = .021, z = 2.31, observed power 64%.

The main finding in Study 2 was a chi-square test, chi-square (1, N = 25) = 3.82, p = .051, z = 1.95, observed power 50%.

One way to look at these results is to assume that the authors planned the two studies, including sample sizes, conducted two statistical significance tests and reported the results of their planned analysis. Both tests have to produce significant results in the predicted direction at p = .05 (two-tailed) to be published in Psychological Science. The authors claim that the probability of this event to occur by chance is only 0.25% (5% * 5%). In fact, the probability is even lower because a two-tailed can be significant when the effect is opposite to the hypothesis (i.e., women are less likely to wear red at peak fertility, p < .05, two-tailed). The probability to get significant results in a theoretically predicted direction with p = .05 (two-tailed) is equivalent to a one-tailed test with p = .025 as significance criterion. The probability of this happening twice in a row is only 0.06%. According to this scenario, the significant results in the two studies are very unlikely to be a chance finding. Thus, they provide evidence that women are more likely to wear red at peak fertility.

The R-Index takes a different perspective. The focus is on replicability of the results reported in the two studies. Replicability is defined as the long-run probability to produce significant results in exact replication studies; everything but random sampling error is constant.

The first step is to estimate replicability of each study. Replicabilty is estimated by converting p-values into observed power estimates. As shown above, observed power is estimated to be 64% in Study 1 and 50% in Study 2. If these estimates were correct, the probability to replicate significant results in two exact replication studies would be 32%. This also implies that the chance of obtaining significant results in the original studies was only 32%. This raises the question of what researchers would do when a non-significant result is obtained. If reporting or publication bias prevent these results from being published, published results provide an inflated estimate of replicability (100% success rate with 32% probability to be successful).

The R-Index uses the median as the best estimate of the typical power in a set of studies. Median observed power is 57%. However, the success rate is 100% (two significant results in two reported attempts). The discrepancy between the success rate (100%) and the expected rate of significant results (57%) shows the inflated rate of significant results that is expected based on the long-run success rate of 57% (100% – 57% = 43%). This would be equivalent to getting red twice in a roulette game with a 50% chance of red or black (ignoring 0 here). Ultimately, an unbiased roulette table would produce black outcomes to get the expected rate of 50% red and 50% black numbers.

The R-Index corrects for this inflation by subtracting the inflation rate from observed power.

The R-Index is 57% – 43% = 14%.

To interpret an R-Index of 14%, the following scenarios are helpful.

When the null-hypothesis is true and non-significant results are not reported, the R-Index is 22%. Thus, the R-Index for this pair of studies is lower than the R-Index for the null-hypothesis.

With just two studies, it is possible that researchers were just lucky to get two significant results despite a low probability of this event to occur.

For other researchers it is not important why reported results are likely to be too good to be true. For science, it is more important that the reported results can be generalized to future studies and real world situations. The main reason to publish studies in scientific journals is to provide evidence that can be replicated even in studies that are not exact replication studies, but provide sufficient opportunity for the same causal process (peak fertility influences women’s clothing choices) to be observed. With this goal in mind, a low R-Index reveals that the two studies provide rather weak evidence for the hypothesis and that the generalizability to future studies and real world scenarios is uncertain.

In fact, only 28% of studies with an average R-Index of 43% replicated in a test of the R-Index (reference!). Failed replication studies consistently tend to have an R-Index below 50%.

For this reason, Psychological Science should have rejected the article and asked the authors to provide stronger evidence for their hypothesis.

Psychological Science should also have rejected the article because the second study had only a quarter of the sample size of Study 1 (N = 25 vs. 100). Given the effect size in Study 1 and observed power of only 63% in Study 1, cutting the sample sizes by 75% reduces the probability to obtain a significant effect in Study 2 to 20%. Thus, the authors were extremely lucky to produce a significant result in Study 2. It would have been better to conduct the replication study with a sample of 150 participants to have 80% power to replicate the effect in Study 1.

Conclusion

The R-Index of “Women Are More Likely to Wear Red or Pink at Peak Fertility” is 14. This is a low value and suggests that the results will not replicate in an exact replication study. It is possible that the authors were just lucky to get two significant results. However, lucky results distort the scientific evidence and these results should not be published without a powerful replication study that does not rely on luck to produce significant results. To avoid controversies like these and to increase the credibility of published results, researchers should conduct more powerful tests of hypothesis and scientific journals should favor studies that have a high R-Index.

Adendum 12/20/2020

Since this article has been published, concerns about studies that relate women’s hormone levels to their behaviors has increased. Leading evolutionary psychologists Gangstad declared most findings in this literature to be garbage (Engber, 2018). This literature is yet another example of the bad practices in psychology. Conducting low powered studies and then publishing only results that became significant with the help of inflated effect sizes is not science and does not generate useful knowledge. It is sad that 10 years after the replication crisis and six years after I posted this blog post, many psychologists still continue to pursue research following this flawed model.

“Only when the tide goes out do you discover who has been swimming naked.” Warren Buffet (Value Investor).

Francis, Tanzman, and Matthews (2014) examined the credibility of psychological articles published in the prestigious journal Science. They focused on articles that contained four or more articles because (a) the statistical test that they has insufficient power for smaller sets of studies and (b) the authors assume that it is only meaningful to focus on studies that are published within a single article.

They found 26 articles published between 2006 and 2012. Eight articles could not be analyzed with their method.

The remaining 18 articles had a 100% success rate. That is, they never reported that a statistical hypothesis test failed to produce a significant result. Francis et al. computed the probability of this outcome for each article. When the probability was less than 10%, they made the recommendation to be skeptical about the validity of the theoretical claims.

For example, a researcher may conduct five studies with 80% power. As expected, one of the five studies produced a non-significant result. It is rational to assume that this finding is a type-II error as the Type-II error should occur in 1 out of 5 studies. The scientist decides not to include the non-significant result. In this case, there is bias, the average effect size across the four significant studies is slightly inflated, but the empirical results do support empirical claims.

If, however, the null-hypothesis is true and a researcher conducts many statistical tests and reports only significant results, demonstrating excessive significant results would also reveal that the reported results provide no empirical support for the theoretical claims in this article.

The problem with Francis et al.’s approach is that it does not clearly distinguish between these two scenarios.

The R-Index addresses this problem. It provides quantitative information about the replicability of a set of studies. Like Francis et al., the R-Index is based on the observed power of individual statistical tests (see Schimmack, 2012, for details), but the next steps are different. Francis et al. multiply observed power estimates. This approach is only meaningful for sets of studies that reported only significant results. The R-Index can be computed for studies that reported significant and non-significant results. Here are the steps:

Compute median observed power for all theoretically important statistical tests from a single study; then compute the median of these medians. This median estimates the median true power of a set of studies.

Compute the rate of significant results for the same set of statistical tests; then average the rates across the same set of studies. This average estimates the reported success rate for a set of studies.

Median observed power and average success rate are both estimates of true power or replicability of a set of studies. Without bias, these two estimates should converge as the number of studies increase.

If the success rate is higher than median observed power, it suggests that the reported results provide an inflated picture of the true effect size and replicability of a phenomenon.

The R-Index uses the difference between success rate and median observed power to correct the inflated estimate of replicability by subtracting the inflation rate (success rate – median observed power) from the median observed power.

R-Index = Median Observed Power – (Success rate – Median Observed Power)

The R-Index is a quantitative index, where higher values suggest a higher probability that an exact replication study will be successful and it avoids simple dichotomous decisions. Nevertheless, it can be useful to provide some broad categories that distinguish different levels of replicability.

An R-Index of more than 80% is consistent with true power of 80%, even when some results are omitted. I chose 80% as a boundary because Jacob Cohen advised researchers that they should plan studies with 80% power. Many undergraduates learn this basic fact about power and falsely assume that researchers are following a rule that is mentioned in introductory statistics.

An R-Index between 50% and 80% suggests that the reported results support an empirical phenomenon, but that power was less than ideal. Most important, this also implies that these studies make it difficult to distinguish non-significant results and type-II errors. For example, two tests with 50% power are likely to produce one significant result and one non-significant result. Researches are tempted to interpret the significant one and to ignore the non-significant one. However, in a replication study the opposite pattern is just as likely to occur.

An R-Index between25% and 50% raises doubts about the empirical support for the conclusions. The reason is that an R-Index of 22% can be obtained when the null-hypothesis is true and all non-significant results are omitted. In this case, observed power is inflated from 5% to 61%. With a 100% success rate, the inflation rate is 39%, and the R-Index is 22% (61% – 39% = 22%).

An R-Index below 20% suggest that researchers used questionable research methods (importantly, these method are questionable but widely accepted in many research communities and not considered to be ethical misconduct) to obtain results that are statistically significant (e.g., systematically deleting outliers until p < .05).

Table 1 list Francis et al.’s results and the R-Index. Studies are arranged in order of the R-Index. Only 1 study is in the exemplary category with an R-Index greater than 80%.
4 studies have an R-Index between 50% and 80%.
8 studies have an R-Index in the range between 20% and 50%.
5 studies have an R-Index below 20%.

There are good reasons why researchers should not conduct studies with less than 50% power. However, 13 of the 18 studies have an R-Index below 50%, which suggests that the true power in these studies was less than 50%.

Conclusion

The R-Index provides an alternative approach to Francis’s TES to examine the credibility of a set of published studies. Whereas Francis concluded that 15 out of 18 articles show bias that invalidates the theoretical claims of the original article, the R-Index provides quantitative information about the replicability of reported results.

The R-Index does not provide a simple answer about the validity of published findings, but in many cases the R-Index raises concerns about the strength of the empirical evidence and reveals that editorial decisions failed to take replicability into account.

The R-Index provides a simple tool for editors and reviewers to increase the credibility of published results and to increase the replicability of published findings. Editors and reviewers can compute, or ask authors who submit manuscripts to compute, the R-Index and use this information in their editorial decision. There is no clear criterion value, but a higher R-Index is better and moderate R-values should be justified by other criteria (e.g., uniqueness of sample).

The R-Index can be used to examine whether editors continue to accept articles with low replicability or are committed to the publication of empirical results that are credible and replicable.

Original: December 5, 2014 Revised: December 28, 2020

Power Failure in Neuroscience

An article in Nature Review” Neuroscience suggested that the median power in neuroscience studies is just 21% (Katherine S. Button, John P. A. Ioannidis, Claire Mokrysz, Brian A.Nosek, Jonathan Flint, Emma S.J. Robinsonand Marcus R. Munafò, 2013).

The authors of this article examined meta-analyses of primary studies in neuroscience that were published in 2011. They analyzed 49 meta-analyses that were based on a total of 730 original studies (on average, 15 studies per meta-analysis, range 2 to 57).

For each primary study, the authors computed observed power based on the sample size and the estimated effect size in the meta-analysis.

Based on their analyses, the authors concluded that the median power in neuroscience is 21%.

There is a major problem with this estimate that the authors overlooked. The power estimate is incredibly low because a median power estimate of 21% corresponds to a p-value of p = .25. If median power were 21%, it would mean that over 50% of the original studies in the meta-analysis reported a non-significant result (p > .05). This seems rather unlikely because journals tend to publish mostly significant results.

The estimate is even less plausible because it is based on meta-analytic averages without any correction for bias. These effect sizes are likely to be inflated, which means that median power estimate is inflated. Thus, true power is even less than 21% and even more results are non-significant.

What could explain this implausible result?

A meta-analysis includes published and unpublished studies. It is possible that the published studies reported significant results with observed power greater than 50% (p < .05) and the unpublished studies reported non-significant results with power less than 50%. However, this would imply that meta-analysts were able to retrieve as many unpublished studies as published studies. The authors did not report whether power of published and unpublished studies differed.

A second possibility is that the power analyses produced false results. The authors relied on Ioannidis and Trikalinos’s (2007) approach to the estimation of power. This approach assumes that studies in a meta-analysis have the same true effect size and that the meta-analytic average (weighted mean) provides the best estimate of the true effect size. This estimate of the true effect size is then used to estimate power in individual studies based on the sample size of the study. As already noted by Ioannidis and Trikalinos (2007), this approach can produce biased results when effect sizes in a meta-analysis are heterogeneous.

Estimating power simply on the basis of effect size and sample size can be misleading when the design is not a simple comparison of two groups. Between-subject designs are common in animal studies in neuroscience. However, many fMRI studies use within-subject designs that achieve high statistical power with a few participants because participants serve as their own controls.

Schimmack (2012) proposed an alternative procedure that does not have this limitation. Power is estimated individually for each study based on the observed effect size in this study. This approach makes it possible to estimate median power for heterogeneous sets of studies with different effect sizes. Moreover, this approach makes it possible to compute power when power is not simply a function of sample size and effect size (e.g., within-subject designs).

R-Index of Nature Neuroscience: Analysis

To examine the replicability of research published in nature and neuroscience, I retrieved the most cited articles in this journal until I had a sample of 20 studies. I needed 14 articles to meet this goal. The number of studies in these articles ranged from 1 to 7.

The success rate for focal significance tests was 97%. This implies that the vast majority of significance tests reported a significant result. The median observed power was 84%. The inflation rate is 13% (97% – 84% = 13%). The R-Index is 71%. Based on these numbers, the R-Index predicts that the majority of studies in nature neuroscience would replicate in an exact replication study.

This conclusion differs dramatically from Button et al.’s (2013) conclusion. I therefore examined some of the articles that were used for Button et al.’s analyses.

A study by Davidson et al. (2003) examined treatment effects in 12 depressed patients and compared them to 5 healthy controls. The main findings in this article were three significant interactions between time of treatment and group with z-scores of 3.84, 4.60, and 4.08. Observed power for these values with p = .05 is over 95%. If a more conservative significance level of p = .001 is used, power is still over 70%. However, the meta-analysis focused on the correlation between brain activity at baseline and changes in depression over time. This correlation is shown with a scatterplot without reporting the actual correlation or testing it for significance. The text further states that a similar correlation was observed for an alternative depression measure with r = .46 and noting correctly that this correlation is not significant, t(10) = 1.64, p = .13, d = .95, obs. power = 32%. The meta-analysis found a mean effect size of .92. A power analysis with d = .92 and N = 12 yields a power estimate of 30%. Presumably, this is the value that Button et al. used to estimate power for the Davidson et al. (2003) article. However, the meta-analysis did not include the more powerful analyses that compared patients and controls over time.

Conclusion

In the current replication crisis, there is a lot of confusion about the replicability of published findings. Button et al. (2013) aimed to provide some objective information about the replicability of neuroscience research. They concluded that replicability is very low with a median estimate of 21%. In this post, I point out some problems with their statistical approach and the focus on meta-analyses as a way to make inferences about replicability of published studies. My own analysis shows a relatively high R-Index of 71%. To make sense of this index it is instructive to compare it to the following R-Indices.

In a replication project of psychological studies, I found an R-Index of 43% and 28% of studies were successfully replicated.

In the many-labs replication project, 10 out of 12 studies were successfully replicated, a replication rate of 83% and the R-Index was 72%.

Caveat

Neuroscience studies may have high observed power and still not replicate very well in exact replications. The reason is that measuring brain activity is difficult and requires many steps to convert and reduce observed data into measures of brain activity in specific regions. Actual replication studies are needed to examine the replicability of published results.

In several blog posts, Dr. Schnall made some critical comments about attempts to replicate her work and these blogs created a heated debate about replication studies. Heated debates are typically a reflection of insufficient information. Is the Earth flat? This question created heated debates hundreds of years ago. In the age of space travel it is no longer debated. In this blog, I presented some statistical information that sheds light on the debate about the replicability of Dr. Schnall’s research.

The Original Study

Dr. Schnall and colleagues conducted a study with 40 participants. A comparison of two groups on a dependent variable showed a significant difference, F(1,38) = 3.63. In these days, Psychological Science asked researchers to report P-Rep instead of p-values. P-rep was 90%. The interpretation of P-rep was that there is a 90% chance to find an effect with the SAME SIGN in an exact replication study with the same sample size. The conventional p-value for F(1,38) = 3.63 is p = .06, a finding that commonly is interpreted as marginally significant. The standardized effect size is d = .60, which is considered a moderate effect size. The 95% confidence interval is -.01 to 1.47.

The wide confidence interval makes it difficult to know the true effect size. A post-hoc power analysis, assuming the true effect size is d = .60 suggests that an exact replication study has a 46% chance to produce a significant results (p < .05, two-tailed). However, if the true effect size is lower, actual power is lower. For example, if the true effect size is small (d = .2), a study with N = 40 has only 9% power (that is a 9% chance) to produce a significant result.

The First Replication Study

Drs. Johnson, Cheung, and Donnellan conducted a replication study with 209 participants. Assuming the effect size in the original study is the true effect size, this replication study has 99% power. However, assuming the true effect size is only d = .2, the study has only 31% power to produce a significant result. The study produce a non-significant result, F(1, 206) = .004, p = .95. The effect size was d = .01 (in the same direction). Due to the larger sample, the confidence interval is smaller and ranges from -.26 to .28. The confidence interval includes d = 2. Thus, both studies are consistent with the hypothesis that the effect exists and that the effect size is small, d = .2.

The Second Replication Study

Dr. Huang conducted another replication study with N = 214 participants (Huang, 2004, Study 1). Based on the previous two studies, the true effect might be expected to be somewhere between -.01 and .28, which includes a small effect size of d = .20. A study with N = 214 participants has 31% power to produce a significant result. Not surprisingly, the study produce a non-significant result, t(212) = 1.22, p = .23. At the same time, the effect size fell within the confidence interval set by the previous two studies, d = .17.

A Third Replication Study

Dr. Hung conducted a replication study with N = 440 participants (Study 2). Maintaining the plausible effect size of d = .2 as the best estimate of the true effect size, the study has 55% power to produce a significant result, which means it is nearly as likely to produce a non-significant result as it is to produce a significant result, if the effect size is small (d = .2). The study failed to produce a significant result, t(438) = .042, p = 68. The effect size was d = .04 with a confidence interval ranging from -.14 to .23. Again, this confidence interval includes a small effect size of d = .2.

A Fourth Replication Study

Dr. Hung published a replication study in the supplementary materials to the article. The study again failed to demonstrate a main effect, t(434) = 0.42, p = .38. The effect size is d = .08 with a confidence interval of -.11 to .27. Again, the confidence interval is consistent with a small true effect size of d = .2. However, the study with 436 participants had only a 55% chance to produce a significant result.

If Dr. Huang had combined the two samples to conduct a more powerful study, a study with 878 participants would have 80% power to detect a small effect size of d = .2. However, the combined effect size of d = .06 for the combined samples is still not significant, t(876) = .89. The confidence interval ranges from -.07 to .19. It no longer includes d = .20, but the results are still consistent with a positive, yet small effect in the range between 0 and .20.

Conclusion

In sum, nobody has been able to replicate Schnall’s finding that a simple priming manipulation with cleanliness related words has a moderate to strong effect (d = .6) on moral judgments of hypothetical scenarios. However, all replication studies show a trend in the same direction. This suggests that the effect exists, but that the effect size is much smaller than in the original study; somewhere between 0 and .2 rather than .6.

Now there are three possible explanations for the much larger effect size in Schnall’s original study.

1. The replication studies were not exact replications and the true effect size in Schnall’s version of the experiment is stronger than in the other studies.

2. The true effect size is the same in all studies, but Dr. Schnall was lucky to observe an effect size that was three times as large as the true effect size and large enough to produce a marginally significant result.

3. It is possible that Dr. Schnall did not disclose all of the information about her original study. For example, she may have conducted additional studies that produced smaller and non-significant results and did not report these results. Importantly, this practice is common and legal and in an anonymous survey many researchers admitted using practices that produce inflated effect sizes in published studies. However, it is extremely rare for researchers to admit that these common practices explain one of their own findings and Dr. Schnall has attributed the discrepancy in effect sizes to problems with replication studies.

Dr. Schnall’s Replicability Index

Based on Dr. Schnall’s original study it is impossible to say which of these explanations accounts for her results. However, additional evidence makes it possible to test the third hypothesis that Dr. Schnall knows more than she was reporting in her article. The reason is that luck does not repeat itself. If Dr. Schnall was just lucky, other studies by her should have failed because Lady Luck is only on your side half the time. If, however, disconfirming evidence is systematically excluded from a manuscript, the rate of successful studies is higher than the observed statistical power in published studies (Schimmack, 2012).

To test this hypothesis, I downloaded Dr. Schnall’s 10 most cited articles (in Web of Science, July, 2014). These 10 articles contained 23 independent studies. For each study, I computed the median observed power of statistical tests that tested a theoretically important hypothesis. I also calculated the success rate for each study. The average success rate was 91% (ranging from 45% to 100%, median = 100%). The median observed power was 61%. The inflation rate is 30% (91%-61%). Importantly, observed power is an inflated estimate of replicability when the success rate is inflated. I created the replicability index (R-index) to take this inflation into account. The R-Index subtracts the inflation rate from observed median power.

Dr. Schnall’s R-Index is 31% (61% – 30%).

What does an R-Index of 31% mean? Here are some comparisons that can help to interpret the Index.

Imagine the null-hypothesis is always true, and a researcher publishes only type-I errors. In this case, observed power is 61% and the success rate is 100%. The R-Index is 22%.

Dr. Baumeister admitted that his publications select studies that report the most favorable results. His R-Index is 49%.

The Open Science Framework conducted replication studies of psychological studies published in 2008. A set of 25 completed studies in November 2014 had an R-Index of 43%. The actual rate of successful replications was 28%.

Given this comparison standards, it is hardly surprising that one of Dr. Schnall’s study did not replicate even when the sample size and power of replication studies were considerably higher.

Conclusion

Dr. Schnall’s R-Index suggests that the omission of failed studies provides the most parsimonious explanation for the discrepancy between Dr. Schnall’s original effect size and effect sizes in the replication studies.

Importantly, the selective reporting of favorable results was and still is an accepted practice in psychology. It is a statistical fact that these practices reduce the replicability of published results. So why do failed replication studies that are entirely predictable create so much heated debate? Why does Dr. Schnall fear that her reputation is tarnished when a replication study reveals that her effect sizes were inflated? The reason is that psychologists are collectively motivated to exaggerate the importance and robustness of empirical results. Replication studies break with the code to maintain an image that psychology is a successful science that produces stunning novel insights. Nobody was supposed to test whether published findings are actually true.

However, Bem (2011) let the cat out of the bag and there is no turning back. Many researchers have recognized that the public is losing trust in science. To regain trust, science has to be transparent and empirical findings have to be replicable. The R-Index can be used to show that researchers reported all the evidence and that significant results are based on true effect sizes rather than gambling with sampling error.

In this new world of transparency, researchers still need to publish significant results. Fortunately, there is a simple and honest way to do so that was proposed by Jacob Cohen over 50 years ago. Conduct a power analysis and invest resources only in studies that have high statistical power. If your expertise led you to make a correct prediction, the force of the true effect size will be with you and you do not have to rely on Lady Luck or witchcraft to get a significant result.

P.S. I nearly forgot to comment on Dr. Huang’s moderator effects. Dr. Huang claims that the effect of the cleanliness manipulation depends on how much effort participants exert on the priming task.

First, as noted above, no moderator hypothesis is needed because all studies are consistent with a true effect size in the range between 0 and .2.

Second, Dr. Huang found significant interaction effects in two studies. In Study 2, the effect was F(1,438) = 6.05, p = .014, observed power = 69%. In Study 2a, the effect was F(1,434) = 7.53, p = .006, observed power = 78%. The R-Index for these two studies is 74% – 26% = 48%. I am waiting for an open science replication with 95% power before I believe in the moderator effect.

Third, even if the moderator effect exists, it doesn’t explain Dr. Schnall’s main effect of d = .6.