It has been known for decades that published results tend to be biased (Sterling, 1959). For most of the past decades this inconvenient truth has been ignored. In the past years, there have been many suggestions and initiatives to increase the replicability of reported scientific findings (Asendorpf et al., 2013). One approach is to examine published research results for evidence of questionable research practices (see Schimmack, 2014, for a discussion of existing tests). This blog post introduces a new test of bias in reported research findings, namely the Test of Insufficient Variance (TIVA).

TIVA is applicable to any set of studies that used null-hypothesis testing to conclude that empirical data provide support for an empirical relationship and reported a significance test (p-values).

Rosenthal (1978) developed a method to combine results of several independent studies by converting p-values into z-scores. This conversion uses the well-known fact that p-values correspond to the area under the curve of a normal distribution. Rosenthal did not discuss the relation between these z-scores and power analysis. Z-scores are observed scores that should follow a normal distribution around the non-centrality parameter that determines how much power a study has to produce a significant result. In the Figure, the non-centrality parameter is 2.2. This value is slightly above a z-score of 1.96, which corresponds to a two-tailed p-value of .05. A study with a non-centrality parameter of 2.2 has 60% power. In specific studies, the observed z-scores vary as a function of random sampling error. The standardized normal distribution predicts the distribution of observed z-scores. As observed z-scores follow the standard normal distribution, the variance of an unbiased set of z-scores is 1. The Figure on top illustrates this with the nine purple lines, which are nine randomly generated z-scores with a variance of 1.

In a real data set the variance can be greater than 1 for two reasons. First, if the nine studies are exact replication studies with different sample sizes, larger samples will have a higher non-centrality parameter than smaller samples. This variance in the true non-centrality variances adds to the variance produced by random sampling error. Second, a set of studies that are not exact replication studies can have variance greater than 1 because the true effect sizes can vary across studies. Again, the variance in true effect sizes produces variance in the true non-centrality parameters that add to the variance produced by random sampling error. In short, the variance is 1 in exact replication studies that also hold the sample size constant. When sample sizes and true effect sizes vary, the variance in observed z-scores is greater than 1. Thus, an unbiased set of z-scores should have a minimum variance of 1.

If the variance in z-scores is less than 1, it suggests that the set of z-scores is biased. One simple reason for insufficient variance is publication bias. If power is 50% and the non-centrality parameter matches the significance criterion of 1.96, 50% of studies that were conducted would not be significant. If these studies are omitted from the set of studies, variance decreases from 1 to .36. Another reason for insufficient variance is that researchers do not report non-significant results or used questionable research practices to inflate effect size estimates. The effect is that variance in observed z-scores is restricted. Thus, insufficient variance in observed z-scores reveals that the reported results are biased and provide an inflated estimate of effect size and replicability.

In small sets of studies, insufficient variance may be due to chance alone. It is possible to quantify how lucky a researcher was to obtain significant results with insufficient variance. This probability is a function of two parameters: (a) the ratio of the observed variance (OV) in a sample over the population variance (i.e., 1), and (b) the number of z-scores minus 1 as the degrees of freedom (k -1).

The product of these two parameters follows a chi-square distribution with k-1 degrees of freedom.

Formula 1: Chi-square = OV * (k – 1) with k-1 degrees of freedom.

**Example 1: **

Bem (2011) published controversial evidence that appear to demonstrate precognition. Subsequent studies failed to replicate these results (Galak et al.,, 2012) and other bias tests show evidence that the reported results are biased Schimmack (2012). For this reason, Bem’s article provides a good test case for TIVA.

The article reported results of 10 studies with 9 z-scores being significant at p < .05 (one-tailed). The observed variance in the 10 z-scores is 0.19. Using Formula 1, the chi-square value is chi^2 (df = 9) = 1.75. Importantly, chi-square tests are usually used to test whether variance is greater than expected by chance (right tail of the distribution). The reason is that variance is not expected to be less than the variance expected by chance because it is typically assumed that a set of data is unbiased. To obtain a probability of insufficient variance, it is necessary to test the left-tail of the chi-square distribution. The corresponding p-value for chi^2 (df = 9) = 1.75 is p = .005. Thus, there is only a 1 out of 200 probability that a random set of 10 studies would produce a variance as low as Var = .19.

This outcome cannot be attributed to publication bias because all studies were published in a single article. Thus, TIVA supports the hypothesis that the insufficient variance in Bem’s z-scores is the result of questionable research methods and that the reported effect size of d = .2 is inflated. The presence of bias does not imply that the true effect size is 0, but it does strongly suggest that the true effect size is smaller than the average effect size in a set of studies with insufficient variance.

**Example 2: **

Vohs et al. (2006) published a series of studies that he results of nine experiments in which participants were reminded of money. The results appeared to show that “money brings about a self-sufficient orientation.” Francis and colleagues suggested that the reported results are too good to be true. An R-Index analysis showed an R-Index of 21, which is consistent with a model in which the null-hypothesis is true and only significant results are reported.

Because Vohs et al. (2006) conducted multiple tests in some studies, the median p-value was used for conversion into z-scores. The p-values and z-scores for the nine studies are reported in Table 2. The Figure on top of this blog illustrates the distribution of the 9 z-scores relative to the expected standard normal distribution.

Table 2

Study p z

Study 1 .026 2.23

Study 2 .050 1.96

Study 3 .046 1.99

Study 4 .039 2.06

Study 5 .021 2.99

Study 6 .040 2.06

Study 7 .026 2.23

Study 8 .023 2.28

Study 9 .006 2.73

The variance of the 9 z-scores is .054. This is even lower than the variance in Bem’s studies. The chi^2 test shows that this variance is significantly less than expected from an unbiased set of studies, chi^2 (df = 8) = 1.12, p = .003. An unusual event like this would occur in only 1 out of 381 studies by chance alone.

In conclusion, insufficient variance in z-scores shows that it is extremely likely that the reported results overestimate the true effect size and replicability of the reported studies. This confirms earlier claims that the results in this article are too good to be true (Francis et al., 2014). However, TIVA is more powerful than the Test of Excessive Significance and can provide more conclusive evidence that questionable research practices were used to inflate effect sizes and the rate of significant results in a set of studies.

**Conclusion**

TIVA can be used to examine whether a set of published p-values was obtained with the help of questionable research practices. When p-values are converted into z-scores, the variance of z-scores should be greater or equal to 1. Insufficient variance suggests that questionable research practices were used to avoid publishing non-significant results; this includes simply not reporting failed studies.

At least within psychology, these questionable research practices are used frequently to compensate for low statistical power and they are not considered scientific misconduct by governing bodies of psychological science (APA, APS, SPSP). Thus, the present results do not imply scientific misconduct by Bem or Vohs, just like the use of performance enhancing drugs in sports is not illegal unless a drug is put on an anti-doping list. However, jut because a drug is not officially banned, it does not mean that the use of a drug has no negative effects on a sport and its reputation.

One limitation of TIVA is that it requires a set of studies and that variance in small sets of studies can vary considerably just by chance. Another limitation is that TIVA is not very sensitive when there is substantial heterogeneity in true non-centrality parameters. In this case, the true variance in z-scores can mask insufficient variance in random sampling error. For this reason, TIVA is best used in conjunction with other bias tests. Despite these limitations, the present examples illustrate that TIVA can be a powerful tool in the detection of questionable research practices. Hopefully, this demonstration will lead to changes in the way researchers view questionable research practices and how the scientific community evaluates results that are statistically improbable. With rejection rates at top journals of 80% or more, one would hope that in the future editors will favor articles that report results from studies with high statistical power that obtain significant results that are caused by the predicted effect.

I think this is a really good idea. I would like to make a few technical comments. My main points are that the TIVA method is a valid and conservative test under conditions of heterogeneity, and that it is most likely to raise a red flag when only significant results are reported and the overall power is low.

When the test statistic has a continuous distribution and the null hypothesis is true, the p-value is uniformly distributed on the interval from zero to one, and so is 1-P. Transforming 1-P by the inverse of the cumulative distribution function of a standard normal yields a standard normal random variable Z.

When the null hypothesis is false, the quantity Z has an exactly normal distribution only when the p-value arises from a one-sided Z-test. Referring to the true value of the population mean as the “non-centrality parameter,” it is true that the Z-score has a normal distribution centered around the non-centrality parameter, as stated in the blog.

In the general case, for example when the p-values arise from an F test or chi-squared test, the distribution of the Z-score is most assuredly NOT normal. I have derived the formula for the density function of Z for the general case; it reduces to a normal density only in the case of a one-sided Z test.

On the other hand, if one plots this formula for specific non-central distributions and specific values of the non-centrality parameter, the result definitely LOOKS normal in every case. It’s unimodal, almost symmetric, and clearly closer to normal than anything we will ever encounter in nature. I can’t insert pictures in this reply to the blog, but if you paste in the following R code, you will see a plot of the sampling distribution of Z when the p-value comes from an F test with 4 and 120 degrees of freedom, and a non-centrality parameter of 10. This would correspond to a one-way ANOVA with 5 groups and n=25 per group. The power for this example is around 0.70. If you looked very carefully at the code, you’d see my formula for the density of Z as it applies to a non-central F.

# Density of the inverse normal transform

df1=4; df2=120; alpha=0.05; lambda=10

crit = qf(1-alpha,df1,df2)

# First, what is the power?

crit = qf(1-alpha,df1,df2)

pow = 1-pf(crit,df1,df2,lambda); pow # Gives power of around 0.70

z = seq(from=-2,to=6,by=0.05)

y = qf(pnorm(z),df1,df2)

Density = dnorm(z) * df(y,df1,df2,lambda) / df(y,df1,df2)

plot(z,Density,type=’l’)

title(“Distribution of Z for df1=4, df2=120, ncp = 10”)

On a mathematical level, I don’t understand why this picture looks so normal, but there is no doubt that it does. I have examined quite a few examples and they’re all more or less like this. In my judgement it is completely safe to treat the Z-score transformations of the p-values as normal for any traditional significance test, even when the null hypothesis is false.

For F-tests and chi-squared tests, the population mean of these almost-normal Z-scores does not equal the non-centrality parameter in general, but I can show that it is an increasing function of the non-centrality parameter. Bigger non-centrality parameters yield pseudo-normal curves shifted more to the right.

Apart from a typo in the blog, the TIVA test is a traditional chi-squared test of the null hypothesis that the variance of a normal random sample is greater than or equal to one versus the alterative that it is less than one. I am unable to derive a general formula for the true variance of the Z-score, but I have approximated it numerically for quite a few F and chi-squared examples. The variance appears to depend on the non-centrality parameter, but not much. Invariably, it is a bit more than one. For example, in that F-test with four numerator degrees of freedom, 120 denominator degrees of freedom and a non-centrality parameter of 10, the variance of Z is around 1.18. I suspect that the theoretical variance of the Z-score is always greater than one when the null hypothesis is false, but I can’t prove it.

Since the TIVA is designed to detect whether the variance of a sample of Z-scores is smaller than one, any tendency for the true variance to be greater than one should help make the test conservative. That is, if a sample of p-values came from a collection of legitimate tests with exactly the same true non-centrality parameter, the probability of TIVA raising a red flag would be strictly less than 0.05.

Ah, the same non-centrality parameter. The chi-squared test for a single variance assumes random sampling from a normal distribution with a single population mean and a single population variance. This means that the studies that produced the p-values would have to be exact replications with the same subject population — a rarity at best.

As indicated in the blog, any collection of p-values will be subject to heterogeneity in the non-centrality parameter and in other variables such as numerator and denominator degrees of freedom in the case of F-tests, whether it’s an F-test or a chi-squared test, and so on. Thus, the expected value of each Z-score will be shifted to the right by a different amount.

I would like to suggest an idealized model for this process, and then distill the model into a set of assumptions for the TIVA method. Making the model explicit provides a set of conditions under which the TIVA is conservative. It is worth considering whether departure from these conditions in any particular would represent a Questionable Research Practice, or just a way in which my model is unrealistic.

Here is the idealized model. There is a population of published studies. For each study, there is a sample size, study design and subject population, together with the unknown true parameter values of the subject population. Data are sampled randomly from the subject population.

For each study in the population of studies, there is sub-population of reported significance tests. Because of random sampling, or random assignment, or both, the assumptions of the reported tests are approximately satisfied. The appearance of a test in the study is unrelated to the observed p-value of the test.

A study is randomly selected from the population of studies, which induces randomness in the sample size, the population parameters, and details of the study design. The process by which a study is selected is statistically independent of the p-values reported in the study.

From the study, one significance test is randomly selected, and its p-value is recorded. The significance test is selected by a process that is statistically independent of the process for selecting the study, and also independent of the p-value of the test.

This whole process is repeated k times, yielding a random sample of k p-values. Z-scores are calculated from the p-values, and the TIVA test statistic is computed as (k-1)*OV, where OV is the usual sample variance with n-1 (in this case, k-1) in the denominator. We conclude that the variance of the Z-scores is suspiciously low at significance level alpha if the TIVA statistic is less than the critical value cutting off the bottom alpha of a chi-squared distribution with k-1 degrees of freedom.

The restriction of one p-value per study is onerous. Its purpose is to ensure that the p-values used to compute the TIVA statistic are independent. I do not know the consequences of violating this assumption. They may or may not be serious.

Now I will simplify the model a bit, producing a set of formal assumptions for the TIVA. As indicated earlier, the Z-score corresponding to a given p-value has a sampling distribution that is nearly normal, with variance that depends to a negligible degree upon details like the non-centrality parameter, but is always a bit greater than one. The expected value (population mean) of this normal distribution varies considerably as a function the non-centrality parameter, with larger values of the non-centrality parameter yielding larger expected values.

A simplification that will tend to make the TIVA conservative is to state that the variance of the Z-score is exactly equal to one. This is adopted from now on.

Because random sampling of a study and a p-value from that study produce a randomly sampled expected value, we will just say that an expected value (population mean) M is randomly selected from an unknown distribution. Then, in a manner that is independent of the process that produced M, a Z-score is generated from a normal distribution with population mean M and population variance one. The last-mentioned independence condition is critical, and would be violated if the probability of including a test in the sample were influenced by the p-value.

An equivalent and more convenient statement of the assumptions is this. M is sampled from an unknown distribution with expected value mu and variance sigma-squared (both unknown, of course). Independently of M, an error term e is sampled from a normal distribution with expected value zero and variance one. Then, Z = M + e. This process is repeated independently k times, one for each p-value.

Now it is easier to see why the test should be conservative. Since M and e are independent, the variance of their sum is the sum of variances, and the true variance of Z is sigma-squared plus one. (Actually it’s a little more than that.) This is how heterogeneity produces greater variance.

Two cases are of interest. If for some reason the distribution of M happens to be normal, then Z is normal not just conditionally upon the value of M, but normal overall. Then, the TIVA test statistic has a gamma distribution that is (sigma-squared + 1) times a (chi-square with k-1 degrees of freedom), and the probability that TIVA will reject the null hypothesis is strictly less than alpha. That is, in this case the test is conservative, period.

If the distribution of M is not normal, the distribution of the TIVA statistic is intractable, which is a fancy way of saying I can’t derive it and I think nobody else can either. Still, the true population variance of the Z-scores is at least one plus sigma-squared, which should make make the TIVA test conservative. The argument isn’t rigorous, but it is difficult to see how the addition of an independent piece of random noise that increases the population variance would also increase the probability of getting a low sample variance — something that would be necessary in order to increase the Type I error probability of the test to alpha or above. I’m convinced, but confirming this intuition with some simulations would be worth the effort.

To summarize, under conditions of heterogeneity the true variance of the Rosenthal Z-scores is greater than one, and TIVA is a conservative test of the null hypothesis that the variance is greater than or equal to one versus the alternative that it is less than one.

Now, what would make the variance small of the Z-scores smaller than one, apart from Questionable Research Practices on the part of the person analyzing the p-values? As stated in the blog, an obvious possibility is a tendency to mostly report p-values that are less than 0.05. This would restrict the range of Z-scores, and hence their variance. So at the very least, the TIVA method is a promising way to test for publication bias.

My final comment is something that is hinted by the blog, but not quite stated directly. Even in the presence of publication bias, the TIVA will be forgiving when overall power is high. The reason is simple. When power is high, most of the tests are significant anyway, the range of Z-scores is no longer restricted much, and it is unlikely that the sample variance will be conclusively less than one. Thus, the TIVA does not just detect publication bias. It detects publication bias in conjunction with low statistical power. This is a truly suspicious combination, and stinks of bad statistical analysis. By “bad,” I only mean bad if the goal of research is to advance knowledge. For the goal of furthering the careers of those who employ it, the statistical analysis in question is very good indeed.

Jerry Brunner

Department of Statistical Sciences

University of Toronto

I have been thinking more about this. The author of the blog is right to be concerned about the power of the TIVA. Conservative tests are naturally at least a bit underpowered near the parameter value specified by the null hypothesis, but in this case heterogeneity could completely mask under-dispersion of the Z-scores.

One way around this would be to stratify the p-values a priori into clusters that should be more homogeneous. For example, manipulation checks would go in one pile, analyses of data from big surveys would go in another, and so on. There would definitely be a separate pile for repeated measures analyses. Then the TIVA statistic would be computed for each pile. Because independent chi-squares are additive, the statistics would be summed and the degrees of freedom would be summed, yielding an overall TIVA statistic that would be less contaminated by heterogeneity.

Perhaps better would be to clean out the heterogeneity with a regression. My first reply above expresses the Rosenthal Z as Z = M + e. This looks like a regression equation, and our null hypothesis is about the variance of e, the error term. The variable M, the source of heterogeneity, is a latent variable (a factor, if you will) that we will never be able to observe. Still, we can get a handle on it because it is determined to an important degree by sample size, as well as effect size and other features of the test.

Sample size is observable, so let’s do a regression with Z as the dependent variable and sample size as the independent variable. For F-tests and t-tests, I highly recommend not using sample size as reported in the methods section, but denominator degrees of freedom. This allows for loss of subjects due to missing values. In the case of classical repeated measures ANOVA it’s really critical, because the denominator degrees of freedom is the effective sample size, and can vastly exceed the number of subjects.

As for effect size, I recommend not trying to approximate it (especially not as a function of the p-values!), except in the case of manipulation checks. The determination of whether a test is a manipulation check should be pretty reliable, and effect sizes for manipulation checks should be huge. So the regression will have two independent variables, effective sample size and a zero-one dummy variable for whether the test is a manipulation check. If the sample of significance tests contains no manipulation checks, the dummy variable would be omitted.

In this regression, the sample size is k, the number of significance tests. The classical theory of regression with normal errors says that the error sum of squares divided by the variance of the errors has a chi-squared distribution with k-p degrees of freedom. With two independent variables and an intercept, p=3.

So the Regression TIVA is to calculate Z-scores from a sample of k p-values, and do a regression in which the dependent variable is Z, and the independent variables are effective sample size and a dummy variable for whether the test is a manipulation check. An unbiased estimate of the variance of the error terms, which should be one or greater, is Mean Squared Error from the regression. The TIVA test statistic is Sum of Squares Error. The p-value of the test is the lower tail area of a chi-squared distribution with k-3 degrees of freedom, or k-2 if there are no manipulation checks.

This modification of the TIVA method requires further study, but I believe it has promise.

Jerry Brunner

Department of Statistical Sciences

University of Toronto

Reblogged this on LMGTFY.

Reblogged this on Replication-Index and commented:

Finally corrected and updated the Introduction to Test of Insufficient Variance showing bias in Bem (2011) and Vohs (2006).

Reblogged this on POLITICAL PSYCHOLOGY.