# Men are created equal, p-values are not.

Is there still something new to say about p-values? Yes, there is. Most discussions of p-values focus on a scenario where a researcher tests a new hypothesis computes a p-value and now has to interpret the result. The status quo follows Fisher’s – 100 year old – approach to compare the p-value to a value of .05. If the p-value is below .05 (two-sided), the inference is that the population effect size deviates from zero in the same direction as the observed effect in the sample. If the p-value is greater than .05 the results are deemed inconclusive.

This approach to the interpretation of the data assumes that we have no other information about our hypothesis or that we do not trust this information sufficiently to incorporate it in our inference about the population effect size. Over the past decade, Bayesian psychologists have argued that we should replace p-values with Bayes-Factors. The advantage of Bayes-Factors is that they can incorporate prior information to draw inferences from data. However, if no prior information is available, the use of Bayesian statistics may cause more harm than good. To use priors without prior information, Bayes-Factors are computed with generic, default priors that are not based on any information about a research question. Along with other problems of Bayes-Factors, this is not an appealing solution to the problem of p-values.

Here I introduce a new approach to the interpretation of p-values that has been called empirical Bayesian and has been successfully applied in genomics to control the field-wise false positive rate. That is, prior information does not rest on theoretical assumptions or default values, but rather on prior empirical information. The information that is used to interpret a new p-value is the distribution of prior p-values.

## P-value distributions

Every study is a new study because it relies on a new sample of participants that produces sampling error that is independent of the previous studies. However, studies are not independent in other characteristics. A researcher who conducted a study with N = 40 participants is likely to have used similar sample sizes in previous studies. And a researcher who used N = 200 is also likely to have used larger sample sizes in previous studies. Researchers are also likely to use similar designs. Social psychologists, for example, prefer between-subject designs to better deceive their participants. Cognitive psychologists care less about deception and study simple behaviors that can be repeated hundreds of times within an hour. Thus, researchers who used a between-subject design are likely to have used a between-subject design in previous studies and researchers who used a within-subject design are likely to have used a within-subject design before. Researchers may also be chasing different effect sizes. Finally, researchers can differ in their willingness to take risks. Some may only test hypotheses that are derived from prior theories that have a high probability of being correct, whereas others may be willing to shoot for the moon. All of these consistent differences between researchers (i.e., sample size, effect size, research design) influence the unconditional statistical power of their studies, which is defined as the long-run probability of obtaining significant results, p < .05.

Over the past decade, in the wake of the replication crisis, interest in the distribution of p-values has increased dramatically. For example, one approach uses the distribution of significant p-values, which is known as p-curve analysis (Simonsohn et al., 2014). If p-values were obtained with questionable research practices when the null-hypothesis is true (p-hacking), the distribution of significant p-values is flat. Thus, if the distribution is monotonically decreasing from 0 to .05, the data have evidential value. Although p-curve analyses has been extended to estimate statistical power, simulation studies show that the p-curve algorithm is systematically biased when power varies across studies (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020).

As shown in simulation studies, a better way to estimate power is z-curve (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020). Here I show how z-curve analyses of prior p-values can be used to demonstrate that p-values from one researcher are not equal to p-values of other researchers when we take their prior research practices into account. By using this prior information, we can adjust the alpha level of individual researchers to take their research practices into account. To illustrate this use of z-curve, I first start with an illustration how different research practices influence p-value distributions.

### Scenario 1: P-hacking

In the first scenario, we assume that a researcher only tests false hypotheses (i.e., the null-hypothesis is always true (Bem, 2011; Simonsohn et al., 2011). In theory, it would be easy to spot false positives because replication studies would produce produce 19 non-significant results for every significant one and significant ones would have different signs. However, questionable research practices lead to a pattern of results where only significant results in one direction are reported, which is the norm in psychology (Sterling, 1959, Sterling et al., 1995; Schimmack, 2012).

In a z-curve analysis, p-values are first converted into z-scores, z = -qnorm(p/2) with qnorm being the inverse normal function and p being a two-sided p-value. A z-curve plot shows the histogram of all z-scores, including non-significant ones (Figure 1).

Visual inspection of the z-curve plot shows that all 200 p-values are significant (on the right side of the criterion value z = 1.96). it also shows that the mode of the distribution as at the significance criterion. Most important, visual inspection shows a steep drop from the mode to the range of non-significant values. That is, while z = 1.96 is the most common value, z = 1.95 is never observed. This drop provides direct visual information that questionable research practices were used because normal sampling error cannot produce such dramatic changes in the distribution.

I am skipping the technical details how the z-curve model is fitted to the distribution of z-scores (Bartos & Schimmack, 2020). It is sufficient to know that the model is fitted to the distribution of significant z-scores with a limited number of model parameters that are equally spaced over the range of z-scores from 0 to 6 (7 parameters, z = 0, z = 1, z = 2, …. z = 6). The model gives different weights to these parameters to match the observed distribution. Based on these estimates, z-curve.2.0 computes several statistics that can be used to interpret single p-values that have been published or future p-values by the same researcher, assuming that the same research practices are used.

The most important statistic is the expected discovery rate (EDR), which corresponds to the average power of all studies that were conducted by a researcher. Importantly, the EDR is an estimate that is based on only the significant results, but makes predictions about the number of non-significant results. In this example with N = 200 participants, the EDR is 7%. Of course, we know that it really is only 5% because the expected discovery rate for true hypotheses that are tested with alpha = .05 is 5%. However, sampling error can introduce biases in our estimates. Nevertheless, even with only 200 observations, the estimate of 7% is relatively close to 5%. Thus, z-curve tells us something important about the way these p-values were obtained. They were obtained in studies with very low power that is close to the criterion value for a false positive result.

Z-curve uses bootstrap to compute confidence intervals around the point estimate of the EDR. the 95%CI ranges from 5% to 18%. As the interval includes 5%, we cannot reject the hypothesis that all tests were false positives (which in this scenario is also the correct conclusion). At the upper end we can see that mean power is low, even if some true hypotheses are being tested.

The EDR can be used for two purposes. First, it can be used to examine the extent of selection for significance by comparing the EDR to the observed discovery rate (ODR; Schimmack, 2012). The ODR is simply the percentage of significant results that was observed in the sample of p-values. In this case, this is 200 out of 200 or 100%. The discrepancy between the EDR of 7% and 100% is large and 100% is clearly outside the 95%CI of the EDR. Thus, we have strong evidence that questionable research practices were used, which we know to be true in this simulation because the 200 tests were selected from a much larger sample of 4,000 tests.

Most important for the use of z-curve to interpret p-values is the ability to estimate the maximum False Discovery Rate (Soric, 1989). The false discovery rate is the percentage of significant results that are false positives or type-I errors. The false discovery rate is often confused with alpha, the long-run probability of making a type-I error. The significance criterion ensures that no more than 5% of significant and non-significant results are false positives. When we test 4,000 false hypotheses (i.e., the null-hypothesis is true) were are not going to have more than 5% (4,000 * .05 = 200) false positive results. This is true in general and it is true in this example. However, when only significant results are published, it is easy to make the mistake to assume that no more than 5% of the published 200 results are false positives. This would be wrong because the 200 were selected to be significant and they are all false positives.

The false discovery rate is the percentage of significant results that are false positives. It no longer matters whether non-significant results are published or not. We are only concerned with the population of p-values that are below .05 (z > 1.96). In our example, the question is how many of the 200 significant results could be false positives. Soric (1989 demonstrated that the EDR limits the number of false positive discoveries. The more discoveries there are, the lower is the risk that discoveries are false. Using a simple formula, we can compute the maximum false discovery rate from the EDR.

FDR = (1/(EDR – 1)*(.05/.95), with alpha = .05

With an EDR of 7%, we obtained a maximum FDR of 68%. We know that the true FDR is 100%, thus, the estimate is too low. However, the reason is that sampling error can have dramatic effects on the FDR estimates when the EDR is low. With an EDR of 6%, the FDR estimate goes up to 82% and with an EDR estimate of 5% it is 100%. To take account of this uncertainty, we can use the 95%CI of the EDR to compute a 95%CI for the FDR estimate, 24% to 100%. Now we see that we cannot rule out that the FDR is 100%.

In short, scenario 1 introduced the use of p-value distributions to provide useful information about the risk that the published results are false discoveries. In this extreme example, we can dismiss the published p-values as inconclusive or as lacking in evidential value.

## Scenario 2: The Typical Social Psychologist

It is difficult to estimate the typical effect size in a literature. However, a meta-analysis of meta-analyses suggested that the average effect size in social psychology is Cohen’s d = .4 (Richard et al., 2003). A smaller set of replication studies that did not select for significance estimated an effect size of d = .3 for social psychology (d = .2 for JPSP, d = .4 for Psych Science; Open Science Collaboration, 2015). The later estimate may include an unknown number of hypotheses where the null-hypothesis is true and the true effect size is zero. Thus, I used d = .4 as a reasonable effect size for true hypotheses in social psychology (see also LeBel, Campbell, & Loving, 2017).

It is also known that a rule of thumb in experimental social psychology was to allocate n = 20 participants to a condition, resulting in a sample size of N = 40 in studies with two groups. In a 2 x 2 design, the main effect would be tested with N = 80. However, to keep this scenario simple, I used d = .4 and N = 40 for true effects. This affords 23% power to obtain a significant result.

Finkel, Eastwick, and Reis (2017) argued that power of 25% is optimal if 75% of the hypotheses that are being tested are true. However, the assumption that 75% of hypotheses are true may be on the optimistic side. Wilson and Wixted (2018) suggested that the false discovery risk is closer to 50%. With 23% power for true hypotheses, this implies a false discovery rate of Given uncertainty about the actual false discovery rate in social psychology, I used a scenario with 50% true and 50% false hypotheses.

I kept the number of significant results at 200. To obtain 200 significant results with an equal number of true and false hypotheses, we need 1,428 tests. The 714 true hypotheses contribute 714*.23 = 164 true positives and the 714 false hypotheses produce 714*.05 = 36 false positive results; 164 + 36 = 200. This implies a false discovery rate of 36/200 = 18%. The true EDR is (714*.23+714*.05)/(714+714) = 14%.

The z-curve plot looks very similar to the previous plot, but they are not identical. Although the EDR estimate is higher, it still includes zero. The maximum FDR is well above the actual FDR of 18%, but the 95%CI includes the actual value of 18%.

A notable difference between Figure 1 and Figure 2 is the expected replication rate (ERR), which corresponds to the average power of significant p-values. It is called the estimated replication rate (ERR) because it predicts the percentage of significant results if the studies that were selected for significance were replicated exactly (Brunner & Schimmack, 2020). When power is heterogeneous, power of the studies with significant results is higher than power of studies with non-significant results (Brunner & Schimmack, 2020). In this case, with only two power values, the reason is that false positives have a much lower chance to be significant (5%) than true positives (23%). As a result, the average power of significant studies is higher than the average power of all studies. In this simulation, the true average power of significant studies is the weighted average of true and false positives with significant results, (164*.23 +36*.05)/(164+36) = 20%. Z-curve perfectly estimated this value.

Importantly, the 95% CI of the ERR, 11% to 34%, does not include zero. Thus, we can reject the null-hypotheses that all of the significant results are false positives based on the ERR. In other words, the significant results have evidential value. However, we do not know the composition of this average. It could be a large percentage of false positives and a few true hypotheses with high power or it could be many true positives with low power. We also do not know which of the 200 significant results is a true positive or a false positive. Thus, we would need to conduct replication studies to distinguish between true and false hypotheses. And given the low power, we would only have a 23% chance of successfully replicating a true positive result. This is exactly what happened with the reproducibility project. And the inconsistent results lead to debates and require further replications. Thus, we have real-world evidence how uninformative p-values are when they are obtained this way.

Social psychologists might argue that the use of small samples is justified because most hypotheses in psychology are true. Thus, we can use prior information to assume that significant results are true positives. However, this logic fails when social psychologists test false hypotheses. In this case, the observed distribution of p-values (Figure 1) is not that different from the distribution that is observed when most significant results are true positives that were obtained with low power (Figure 2). Thus, it is doubtful that this is really an optimal use of resources (Finkel et al., 2015). However, until recently this was the way experimental social psychologists conducted their research.

## Scenario 3: Cohen’s Way

In 1962 (!), Cohen conducted a meta-analysis of statistical power in social psychology. The main finding was that studies had only a 50% chance to get significant results with a median effect size of d = .5. Cohen (1988) also recommended that researchers should plan studies to have 80% power. However, this recommendation was ignored.

To achieve 80% power with d = .4, researchers need N = 200 participants. Thus, the number of studies is reduced from 5 studies with N = 40 to one study with N = 200. As Finkel et al. (2017) point out, we can make more discoveries with many small studies than a few large ones. However, this ignores that the results of the small studies are difficult to replicate. This was not a concern when social psychologists did not bother to test whether their discoveries are false discoveries or whether they can be replicated. The replication crisis shows the problems of this approach. Now we have results from decades of research that produced significant p-values without providing any information whether these significant results are true or false discoveries.

Scenario 3 examines what social psychology would look like today, if social psychologists had listened to Cohen. The scenario is the same as in the second scenario, including publication bias. There are 50% false hypotheses and 50% true hypotheses with an effect size of d = .4. The only difference is that researchers used N = 200 to test their hypotheses to achieve 80% power.

With 80% power, we need 470 tests (compared to 1,428 in Scenario 2) to produce 200 significant results, 235*.80 + 235*.05 = 188 + 12 = 200. Thus, the EDR is 200/470 = 43%. The true false discovery rate is 6%. The expected replication rate is 188*.80 + 12*.05 = 76%. Thus, we see that higher power increases replicability from 20% to 76% and lowers the false discovery rate from 18% to 6%.

Figure 3 shows the z-curve plot. Visual inspection shows that Figure 3 looks very different from Figures 1 and 2. The estimates are also different. In this example, sampling error inflated the EDR to be 58%, but the 95%CI includes the true value of 46%. The 95%CI does not include the ODR. Thus, there is evidence for publication bias, which is also visible by the steep drop in the distribution at 1.96.

Even with a low EDR of 20%, the maximum FDR is only 21%. Thus, we can conclude with confidence that at least 79% of the significant results are true positives. Remember, in the previous scenario, we could not rule out that most results are false positives. Moreover, the estimated replication rate is 73%, which underestimates the true replication rate of 76%, but the 95%CI includes the true value, 95%CI = 61% – 84%. Thus, if these studies were replicated, we would have a high success rate for actual replication studies.

Just imagine for a moment what social psychology might look like in a parallel universe where social psychologists followed Cohen’s advice. Why didn’t they? The reason is that they did not have z-curve. All they had was p < .05, and using p < .05, all three scenarios are identical. All three scenarios produced 200 significant results. Moreover, as Finkel et al. (2015) pointed out, smaller samples produce 200 significant results quicker than large samples. An additional advantage of small samples is that they inflate point estimates of the population effect size. Thus, the social psychologists with the smallest samples could brag about the biggest (illusory) effect sizes as long as nobody was able to publish replication studies with larger samples that deflated effect sizes of d = .8 to d = .08 (Joy-Gaba & Nosek, 2010).

This game is over, but social psychology – and other social sciences – have published thousands of significant p-values, and nobody knows whether they were obtained using scenario 1, 2, or 3, or probably a combination of these. This is where z-curve can make a difference. P-values are no longer equal when they are considered as a data point from a p-value distribution. In scenario 1, a p-value of .01 and even a p-value of .001 has no meaning. In contrast, in scenario 3 even a p-value of .02 is meaningful and more likely to reflect a true positive than a false positive result. This means that we can use z-curve analyses of published p-values to distinguish between probably false and probably true positives.

I illustrate this with three concrete examples from a project that examined the p-value distributions of over 200 social psychologists (Schimmack, in preparation). The first example has the lowest EDR in the sample. The EDR is 11% and because there are only 210 tests, the 95%CI is wide and includes 5%.

The maximum EDR estimate is high with 41% and the 95%CI includes 100%. This suggests that we cannot rule out the hypothesis that most significant results are false positives. However, the replication rate is 57% and the 95%CI, 45% to 69%, does not include 5%. Thus, some tests tested true hypotheses, but we do not know which ones.

Visual inspection of the plot shows a different distribution than Figure 2. There are more just significant p-values, z = 2.0 to 2.2 and more large z-scores (z > 4). This shows more heterogeneity in power. A comparison of the ODR with the EDR shows that the ODR falls outside the 95%CI of the EDR. This is evidence of publication bias or the use of questionable research practices. One solution to the presence of publication bias is to lower the criterion for statistical significance. As a result, the large number of just significant results is no longer significant and the ODR decreases. This is a post-hoc correction for publication bias. For example, we can lower alpha to .005.

As expected, the ODR decreases considerably from 70% to 39%. In contrast, the EDR increases. The reason is that many questionable research practices produce a pile of just significant p-values. As these values are no longer used to fit the z-curve, it predicts a lot fewer non-significant p-values. The model now underestimates p-values between 2 and 2.2. However, these values do not seem to come from a sampling distribution. Rather they stick out like a tower. By excluding them, the p-values that are still significant with alpha = .005 look more credible. Thus, we can correct for the use of QRPs by lowering alpha and by examining whether these p-values produced interesting discoveries. At the same time, we can ignore the p-values between .05 and .005 and await replication studies to provide empirical evidence whether these hypotheses receive empirical support.

The second example was picked because it was close to the median EDR (33) and ERR (66) in the sample of 200 social psychologists.

The larger sample of tests (k = 1,529) helps to obtain more precise estimates. A comparison of the ODR, 76%, and the 95%CI of the EDR, 12% to 48%, shows that publication bias is present. However, with an EDR of 33%, the maximum FDR is only 11% and the upper limit of the 95%CI is 39%. Thus, we can conclude with confidence that fewer than 50% of the significant results are false positives, however numerous findings might be false positives. Only replication studies can provide this information.

In this example, lowering alpha to .005 did not align the ODR and the EDR. This suggests that these values come from a sampling distribution where non-significant results were not published. Thus, adjusting the there is no simple fix to adjust the significance criterion. In this situation, we can conclude that the published p-values are unlikely to be false positives, but that replication studies are needed to ensure that published significant results are not false positives.

The third example is the social psychologists with the highest EDR. In this case, the EDR is actually a little bit lower than the ODR, suggesting that there is no publication bias. The high EDR also means that the maximum FDR is very small and even the upper limit of the 95%CI is only 7%.

Another advantage of data without publication bias is that it is not necessary to exclude non-significant results from the analysis. Fitting the model to all p-values produces much tighter estimates of the EDR and the maximum FDR.

The upper limit of the 95%CI for the FDR is now 4%. Thus, we conclude that no more than 5% of the p-values less than .05 are false positives. Even p = .02 is unlikely to be a false positive. Finally, the estimated replication rate is 84% with a tight confidence interval ranging from 78% to 90%. Thus, most of the published p-values are expected to replicate in an exact replication study.

I hope these examples make it clear how useful it can be to evaluate single p-values with prior information about the p-values distribution of a lab. As labs differ in their research practices, significant p-values are also different. Only if we ignore the research context and focus on a single result p = .02 equals p = .02. But once we see the broader distribution, p-values of .02 can provide stronger evidence against the null-hypothesis than p-values of .002.

## Implications

Cohen tried and failed to change the research culture of social psychologists. Meta-psychological articles have puzzled why meta-analyses of power failed to increase power (Maxwell, 2004; Schimmack, 2012; Sedelmeier & Gigerenzer, 1989). Finkel et al. (2015) provided an explanation. In a game where the winner publishes as many significant results as possible, the optimal strategy is to conduct as many studies as possible with low power. This strategy continues to be rewarded in psychology, where jobs, promotions, grants, and pay raises are based on the number of publications. Cohen (1990) said less is more, but that is not true in a science that does not self-correct and treats every p-value less than .05 as a discovery.

To improve psychology as a science, we need to change the incentive structure and author-wise z-curve analyses can do this. Rather than using p < .05 (or p < .005) as a general rule to claim discoveries, claims of discoveries can be adjusted to the research practices of a researchers. As demonstrated here, this will reward researchers who follow Cohen’s rules and punish those who use questionable practices to produce p-values less than .05 (or Bayes-Factors > 3) without evidential value. And maybe, there is a badge for credible p-values one day.

## (incomplete) References

Richard, F. D., Bond, C. F., Jr., & Stokes-Zoota, J. J. (2003). One hundred years of social psychology quantitatively described. Review of General Psychology, 7, 331–363. http://dx.doi.org/10.1037/1089-2680.7.4.331 # The Replicability Index Is the Most Powerful Tool to Detect Publication Bias in Meta-Analyses

## Abstract

Methods for the detection of publication bias in meta-analyses were first introduced in the 1980s (Light & Pillemer, 1984). However, existing methods tend to have low statistical power to detect bias, especially when population effect sizes are heterogeneous (Renkewitz & Keiner, 2019). Here I show that the Replicability Index (RI) is a powerful method to detect selection for significance while controlling the type-I error risk better than the Test of Excessive Significance (TES). Unlike funnel plots and other regression methods, RI can be used without variation in sampling error across studies. Thus, it should be a default method to examine whether effect size estimates in a meta-analysis are inflated by selection for significance. However, the RI should not be used to correct effect size estimates. A significant results merely indicates that traditional effect size estimates are inflated by selection for significance or other questionable research practices that inflate the percentage of significant results.

## Evaluating the Power and Type-I Error Rate of Bias Detection Methods

Just before the end of the year, and decade, Frank Renkewitz and Melanie Keiner published an important article that evaluated the performance of six bias detection methods in meta-analyses (Renkewitz & Keiner, 2019).

The article makes several important points.

1. Bias can distort effect size estimates in meta-analyses, but the amount of bias is sometimes trivial. Thus, bias detection is most important in conditions where effect sizes are inflated to a notable degree (say more than one-tenth of a standard deviation, e.g., from d = .2 to d = .3).

2. Several bias detection tools work well when studies are homogeneous (i.e. ,the population effect sizes are very similar). However, bias detection is more difficult when effect sizes are heterogeneous.

3. The most promising tool for heterogeneous data was the Test of Excessive Significance (Francis, 2013; Ioannidis, & Trikalinos, 2013). However, simulations without bias showed that the higher power of TES was achieved by a higher false-positive rate that exceeded the nominal level. The reason is that TES relies on the assumption that all studies have the same population effect size and this assumption is violated when population effect sizes are heterogeneous.

This blog post examines two new methods to detect publication bias and compares them to the TES and the Test of Insufficient Variance (TIVA) that performed well when effect sizes were homogeneous (Renkewitz & Keiner , 2019). These methods are not entirely new. One method is the Incredibility Index, which is similar to TES (Schimmack, 2012). The second method is the Replicability Index, which corrects estimates of observed power for inflation when bias is present.

### The Basic Logic of Power-Based Bias Tests

The mathematical foundations for bias tests based on statistical power were introduced by Sterling et al. (1995). Statistical power is defined as the conditional probability of obtaining a significant result when the null-hypothesis is false. When the null-hypothesis is true, the probability of obtaining a significant result is set by the criterion for a type-I error, alpha. To simplify, we can treat cases where the null-hypothesis is true as the boundary value for power (Brunner & Schimmack, 2019). I call this unconditional power. Sterling et al. (1995) pointed out that for studies with heterogeneity in sample sizes, effect sizes or both, the discoery rate; that is the percentage of significant results, is predicted by the mean unconditional power of studies. This insight makes it possible to detect bias by comparing the observed discovery rate (the percentage of significant results) to the expected discovery rate based on the unconditional power of studies. The empirical challenge is to obtain useful estimates of unconditional mean power, which depends on the unknown population effect sizes.

Ioannidis and Trialinos (2007) were the first to propose a bias test that relied on a comparison of expected and observed discovery rates. The method is called Test of Excessive Significance (TES). They proposed a conventional meta-analysis of effect sizes to obtain an estimate of the population effect size, and then to use this effect size and information about sample sizes to compute power of individual studies. The final step was to compare the expected discovery rate (e.g., 5 out of 10 studies) with the observed discovery rate (8 out of 10 studies) with a chi-square test and to test the null-hypothesis of no bias with alpha = .10. They did point out that TES is biased when effect sizes are heterogeneous (see Renkewitz & Keiner, 2019, for a detailed discussion).

Schimmack (2012) proposed an alternative approach that does not assume a fixed effect sizes across studies, called the incredibility index. The first step is to compute observed-power for each study. The second step is to compute the average of these observed power estimates. This average effect size is then used as an estimate of the mean unconditional power. The final step is to compute the binomial probability of obtaining as many or more significant results that were observed for the estimated unconditional power. Schimmack (2012) showed that this approach avoids some of the problems of TES when effect sizes are heterogeneous. Thus, it is likely that the Incredibility Index produces fewer false positives than TES.

Like TES, the incredibility index has low power to detect bias because bias inflates observed power. Thus, the expected discovery rate is inflated, which makes it a conservative test of bias. Schimmack (2016) proposed a solution to this problem. As the inflation in the expected discovery rate is correlated with the amount of bias, the discrepancy between the observed and expected discovery rate indexes inflation. Thus, it is possible to correct the estimated discovery rate by the amount of observed inflation. For example, if the expected discovery rate is 70% and the observed discovery rate is 90%, the inflation is 20 percentage points. This inflation can be deducted from the expected discovery rate to get a less biased estimate of the unconditional mean power. In this example, this would be 70% – 20% = 50%. This inflation-adjusted estimate is called the Replicability Index. Although the Replicability Index risks a higher type-I error rate than the Incredibility Index, it may be more powerful and have a better type-I error control than TES.

To test these hypotheses, I conducted some simulation studies that compared the performance of four bias detection methods. The Test of Insufficient Variance (TIVA; Schimmack, 2015) was included because it has good power with homogeneous data (Renkewitz & Keiner, 2019). The other three tests were TES, ICI, and RI.

Selection bias was simulated with probabilities of 0, .1, .2, and 1. A selection probability of 0 implies that non-significant results are never published. A selection probability of .1 implies that there is a 10% chance that a non-significant result is published when it is observed. Finally, a selection probability of 1 implies that there is no bias and all non-significant results are published.

Effect sizes varied from 0 to .6. Heterogeneity was simulated with a normal distribution with SDs ranging from 0 to .6. Sample sizes were simulated by drawing from a uniform distribution with values between 20 and 40, 100, and 200 as maximum. The number of studies in a meta-analysis were 5, 10, 20, and 30. The focus was on small sets of studies because power to detect bias increases with the number of studies and power was often close to 100% with k = 30.

Each condition was simulated 100 times and the percentage of significant results with alpha = .10 (one-tailed) was used to compute power and type-I error rates.

## RESULTS

### Bias

Figure 1 shows a plot of the mean observed d-scores as a function of the mean population d-scores. In situations without heterogeneity, mean population d-scores corresponded to the simulated values of d = 0 to d = .6. However, with heterogeneity, mean population d-scores varied due to sampling from the normal distribution of population effect sizes.

The figure shows that bias could be negative or positive, but that overestimation is much more common than underestimation.  Underestimation was most likely when the population effect size was 0, there was no variability (SD = 0), and there was no selection for significance.  With complete selection for significance, bias always overestimated population effect sizes, because selection was simulated to be one-sided. The reason is that meta-analysis rarely show many significant results in both directions.

An Analysis of Variance (ANOVA) with number of studies (k), mean population effect size (mpd), heterogeneity of population effect sizes (SD), range of sample sizes (Nmax) and selection bias (sel.bias) showed a four-way interaction, t = 3.70.   This four-way interaction qualified main effects that showed bias decreases with effect sizes (d), heterogeneity (SD), range of sample sizes (N), and increased with severity of selection bias (sel.bias).

The effect of selection bias is obvious in that effect size estimates are unbiased when there is no selection bias and increases with severity of selection bias.  Figure 2 illustrates the three way interaction for the remaining factors with the most extreme selection bias; that is, all non-significant results are suppressed.

The most dramatic inflation of effect sizes occurs when sample sizes are small (N = 20-40), the mean population effect size is zero, and there is no heterogeneity (light blue bars). This condition simulates a meta-analysis where the null-hypothesis is true. Inflation is reduced, but still considerable (d = .42), when the population effect is large (d = .6). Heterogeneity reduces bias because it increases the mean population effect size. However, even with d = .6 and heterogeneity, small samples continue to produce inflated estimates by d = .25 (dark red). Increasing sample sizes (N = 20 to 200) reduces inflation considerably. With d = 0 and SD = 0, inflation is still considerable, d = .52, but all other conditions have negligible amounts of inflation, d < .10.

As sample sizes are known, they provide some valuable information about the presence of bias in a meta-analysis. If studies with large samples are available, it is reasonable to limit a meta-analysis to the larger and more trustworthy studies (Stanley, Jarrell, & Doucouliagos, 2010).

### Discovery Rates

If all results are published, there is no selection bias and effect size estimates are unbiased. When studies are selected for significance, the amount of bias is a function of the amount of studies with non-significant results that are suppressed. When all non-significant results are suppressed, the amount of selection bias depends on the mean power of the studies before selection for significance which is reflected in the discovery rate (i.e., the percentage of studies with significant results). Figure 3 shows the discovery rates for the same conditions that were used in Figure 2. The lowest discovery rate exists when the null-hypothesis is true. In this case, only 2.5% of studies produce significant results that are published. The percentage is 2.5% and not 5% because selection also takes the direction of the effect into account. Smaller sample sizes (left side) have lower discovery rates than larger sample sizes (right side) because larger samples have more power to produce significant results. In addition, studies with larger effect sizes have higher discovery rates than studies with small effect sizes because larger effect sizes increase power. In addition, more variability in effect sizes increases power because variability increases the mean population effect sizes, which also increases power.

In conclusion, the amount of selection bias and the amount of inflation of effect sizes varies across conditions as a function of effect sizes, sample sizes, heterogeneity, and the severity of selection bias. The factorial design covers a wide range of conditions. A good bias detection method should have high power to detect bias across all conditions with selection bias and low type-I error rates across conditions without selection bias.

### Overall Performance of Bias Detection Methods

Figure 4 shows the overall results for 235,200 simulations across a wide range of conditions. The results replicate Renkewitz and Keiner’s finding that TES produces more type-I errors than the other methods, although the average rate of type-I errors is below the nominal level of alpha = .10. The error rate of the incredibility index is practically zero, indicating that it is much more conservative than TES. The improvement for type-I errors does not come at the cost of lower power. TES and ICI have the same level of power. This finding shows that computing observed power for each individual study is superior than assuming a fixed effect size across studies. More important, the best performing method is the Replicability Index (RI), which has considerably more power because it corrects for inflation in observed power that is introduced by selection for significance. This is a promising results because one of the limitation of the bias tests examined by Renkewitz and Keiner was the low power to detect selection bias across a wide range of realistic scenarios.

Logistic regression analyses for power showed significant five-way interactions for TES, IC, and RI. For TIVA, two four-way interactions were significant. For type-I error rates no four-way interactions were significant, but at least one three-way interaction was significant. These results show that results systematic vary in a rather complex manner across the simulated conditions. The following results show the performance of the four methods in specific conditions.

### Number of Studies (k)

Detection of bias is a function of the amount of bias and the number of studies. With small sets of studies (k = 5), it is difficult to detect power. In addition, low power can suppress false-positive rates because significant results without selection bias are even less likely than significant results with selection bias. Thus, it is important to examine the influence of the number of studies on power and false positive rates.

Figure 5 shows the results for power. TIVA does not gain much power with increasing sample sizes. The other three methods clearly become more powerful as sample sizes increase. However, only the R-Index shows good power with twenty studies and still acceptable studies with just 10 studies. The R-Index with 10 studies is as powerful as TES and ICI with 10 studies.

Figure 6 shows the results for the type-I error rates. Most important, the high power of the R-Index is not achieved by inflating type-I error rates, which are still well-below the nominal level of .10. A comparison of TES and ICI shows that ICI controls type-I error much better than TES. TES even exceeds the nominal level of .10 with 30 studies and this problem is going to increase as the number of studies gets larger.

### Selection Rate

Renkewitz and Keiner noticed that power decreases when there is a small probability that non-significant results are published. To simplify the results for the amount of selection bias, I focused on the condition with n = 30 studies, which gives all methods the maximum power to detect selection bias. Figure 7 confirms that power to detect bias deteriorates when non-significant results are published. However, the influence of selection rate varies across methods. TIVA is only useful when only significant results are selected, but even TES and ICI have only modest power even if the probability of a non-significant result to be published is only 10%. Only the R-Index still has good power, and power is still higher with a 20% chance to select a non-significant result than with a 10% selection rate for TES and ICI.

### Population Mean Effect Size

With complete selection bias (no significant results), power had ceiling effects. Thus, I used k = 10 to illustrate the effect of population effect sizes on power and type-I error rates. (Figure 8)

In general, power decreased as the population mean effect sizes increased. The reason is that there is less selection because the discovery rates are higher. Power decreased quickly to unacceptable levels (< 50%) for all methods except the R-Index. The R-Index maintained good power even with the maximum effect size of d = .6.

Figure 9 shows that the good power of the R-Index is not achieved by inflating type-I error rates. The type-I error rate is well below the nominal level of .10. In contrast, TES exceeds the nominal level with d = .6.

### Variability in Population Effect Sizes

I next examined the influence of heterogeneity in population effect sizes on power and type-I error rates. The results in Figure 10 show that hetergeneity decreases power for all methods. However, the effect is much less sever for the RI than for the other methods. Even with maximum heterogeneity, it has good power to detect publication bias.

Figure 11 shows that the high power of RI is not achieved by inflating type-I error rates. The only method with a high error-rate is TES with high heterogeneity.

### Variability in Sample Sizes

With a wider range of sample sizes, average power increases. And with higher power, the discovery rate increases and there is less selection for significance. This reduces power to detect selection for significance. This trend is visible in Figure 12. Even with sample sizes ranging from 20 to 100, TIVA, TES, and IC have modest power to detect bias. However, RI maintains good levels of power even when sample sizes range from 20 to 200.

Once more, only TES shows problems with the type-I error rate when heterogeneity is high (Figure 13). Thus, the high power of RI is not achieved by inflating type-I error rates.

### Stress Test

The following analyses examined RI’s performance more closely. The effect of selection bias is self-evident. As more non-significant results are available, power to detect bias decreases. However, bias also decreases. Thus, I focus on the unfortunately still realistic scenario that only significant results are published. I focus on the scenario with the most heterogeneity in sample sizes (N = 20 to 200) because it has the lowest power to detect bias. I picked the lowest and highest levels of population effect sizes and variability to illustrate the effect of these factors on power and type-I error rates. I present results for all four set sizes.

The results for power show that with only 5 studies, bias can only be detected with good power if the null-hypothesis is true. Heterogeneity or large effect sizes produce unacceptably low power. This means that the use of bias tests for small sets of studies is lopsided. Positive results strongly indicate severe bias, but negative results are inconclusive. With 10 studies, power is acceptable for homogeneous and high effect sizes as well as for heterogeneous and low effect sizes, but not for high effect sizes and high heterogeneity. With 20 or more studies, power is good for all scenarios.

The results for the type-I error rates reveal one scenario with dramatically inflated type-I error rates, namely meta-analysis with a large population effect size and no heterogeneity in population effect sizes.

### Solutions

The high type-I error rate is limited to cases with high power. In this case, the inflation correction over-corrects. A solution to this problem is found by considering the fact that inflation is a non-linear function of power. With unconditional power of .05, selection for significance inflates observed power to .50, a 10 fold increase. However, power of .50 is inflated to .75, which is only a 50% increase. Thus, I modified the R-Index formula and made inflation contingent on the observed discovery rate.

RI2 = Mean.Observed.Power – (Observed Discovery Rate – Mean.Observed.Power)*(1-Observed.Discovery.Rate). This version of the R-Index reduces power, although power is still superior to the IC.

It also fixed the type-I error problem at least with sample sizes up to N = 30.

## Example 1: Bem (2011)

Bem’s (2011) sensational and deeply flawed article triggered the replication crisis and the search for bias-detection tools (Francis, 2012; Schimmack, 2012). Table 1 shows that all tests indicate that Bem used questionable research practices to produce significant results in 9 out of 10 tests. This is confirmed by examination of his original data (Schimmack, 2018). For example, for one study, Bem combined results from four smaller samples with non-significant results into one sample with a significant result. The results also show that both versions of the Replicability Index are more powerful than the other tests.

## Example 2: Francis (2014) Audit of Psychological Science

Francis audited multiple-study articles in the journal Psychological Science from 2009-2012. The main problem with the focus on single articles is that they often contain relatively few studies and the simulation studies showed that bias tests tend to have low power if 5 or fewer studies are available (Renkewitz & Keiner, 2019). Nevertheless, Francis found that 82% of the investigated articles showed signs of bias, p < .10. This finding seems very high given the low power of TES in the simulation studies. It would mean that selection bias in these articles was very high and power of the studies was extremely low and homogeneous, which provides the ideal conditions to detect bias. However, the high type-I error rates of TES under some conditions may have produced more false positive results than the nominal level of .10 suggests. Moreover, Francis (2014) modified TES in ways that may have further increased the risk of false positives. Thus, it is interesting to reexamine the 44 studies with other bias tests. Unlike Francis, I coded one focal hypothesis test per study.

I then applied the bias detection methods. Table 2 shows the p-values.

The Figure shows the percentage of significant results for the various methods. The results confirm that despite the small number of studies, the majority of multiple-study articles show significant evidence of bias. Although statistical significance does not speak directly to effect sizes, the fact that these tests were significant with a small set of studies implies that the amount of bias is large. This is also confirmed by a z-curve analysis that provides an estimate of the average bias across all studies (Schimmack, 2019).

A comparison of the methods shows with real data that the R-Index (RI1) is the most powerful method and even more powerful than Francis’s method that used multiple studies from a single study. The good performance of TIVA shows that population effect sizes are rather homogeneous as TIVA has low power with heterogeneous data. The Incredibility Index has the worst performance because it has an ultra-conservative type-I error rate. The most important finding is that the R-Index can be used with small sets of studies to demonstrate moderate to large bias.

## Discussion

In 2012, I introduced the Incredibility Index as a statistical tool to reveal selection bias; that is, the published results were selected for significance from a larger number of results. I compared the IC with TES and pointed out some advantages of averaging power rather than effect sizes. However, I did not present extensive simulation studies to compare the performance of the two tests. In 2014, I introduced the replicability index to predict the outcome of replication studies. The replicability index corrects for the inflation of observed power when selection for significance is present. I did not think about RI as a bias test. However, Renkewitz and Keiner (2019) demonstrated that TES has low power and inflated type-I error rates. Here I examined whether IC performed better than TES and I found it did. Most important, it has much more conservative type-I error rates even with extreme heterogeneity. The reason is that selection for significance inflates observed power which is used to compute the expected percentage of significant results. This led me to see whether the bias correction that is used to compute the Replicability Index can boost power, while maintaining acceptable type-I error rates. The present results shows that this is the case for a wide range of scenarios. The only exception are meta-analysis of studies with a high population effect size and low heterogeneity in effect sizes. To avoid this problem, I created an alternative R-Index that reduces the inflation adjustment as a function of the percentage of non-significant results that are reported. I showed that the R-Index is a powerful tool that detects bias in Bem’s (2011) article and in a large number of multiple-study articles published in Psychological Science. In conclusion, the replicability index is the most powerful test for the presence of selection bias and it should be routinely used in meta-analyses to ensure that effect sizes estimates are not inflated by selective publishing of significant results. As the use of questionable practices is no longer acceptable, the R-Index can be used by editors to triage manuscripts with questionable results or to ask for a new, pre-registered, well-powered additional study. The R-Index can also be used in tenure and promotion evaluations to reward researchers that publish credible results that are likely to replicate.

## References

Francis, G. (2013). Replication, statistical consistency, and publication bias. Journal of Mathematical Psychology, 57, 153–169. https://doi.org/10.1016/j.jmp.2013.02.003

Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials: Journal of the Society for Clinical Trials, 4, 245–253. https://doi.org/10.1177/1740774507079441

R. J. Light; D. B. Pillemer (1984). Summing up: The Science of Reviewing Research. Cambridge, Massachusetts: Harvard University Press.

Renkewitz, F., & Keiner, M. (2019). How to Detect Publication Bias in Psychological Research
A Comparative Evaluation of Six Statistical Methods. Zeitschrift für Psychologie, 227, 261-279. https://doi.org/10.1027/2151-2604/a000386.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566. doi:10.1037/a0029487

Schimmack, U. (2014, December 30). The test of insufficient variance (TIVA): A new tool for the detection of questionable research practices [Blog Post]. Retrieved from http://replicationindex.com/2014/12/30/the-test-ofinsufficient-
variance-tiva-a-new-tool-for-the-detection-ofquestionable-
research-practices/

Schimmack, U. (2016). A revised introduction to the R-Index. Retrieved
from https://replicationindex.com/2016/01/31/a-revisedintroduction-
to-the-r-index/

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112. # Baby Einstein: The Numbers Do Not Add Up

A small literature suggests that babies can add and subtract. Wynn (1992) showed 5-month olds a Mickey Mouse doll, covered this toy, and placed another doll behind the cover to imply addition (1 + 1 = 2). A second group of infants saw two Mickey Mouse dolls, that were covered and then one Mickey Mouse was removed (2 – 1 = 1). When the cover was removed, either 1 or 2 Mickeys were visible. Infants looked longer at the incongruent display, suggesting that they expected 2 Mickeys in the addition scenario and one Mickey in the subtraction scenario.

Both studies produced just significant results; Study 1, t(30) = 2.078, p = .046 (two-tailed), Study 2 , t(14) = 1.795, p = .047 (one-tailed). Post-2011, these just significant results raise a red flag about the replicability of these results.

This study produced a small literature that was meta-analyzed by Christodoulou, Lac, and Moore (2017). The headline finding was that a random-effects meta-analysis showed a significant effect, d = .34, “suggesting that the phenomenon Wynn originally reported is reliable.”

The problem with effect-size meta-analysis is that effect sizes are inflated when published results are selected for significance. Christodoulou et al. (2017) examined the presence of publication bias using a variety of statistical tests that produced inconsistent results. The Incredibility Index showed that there were just as many significant results (k = 12) as one would predict based on median observed power (k = 11). Trim-and-fill suggested some bias, but the corrected effect size estimate would still be significant, d = .24. However, PEESE showed significant evidence of publication bias, and no significant effect after correcting for bias.

Christodoulou et al. (2017) dismiss the results obtained with PEESE that would suggest the findings are not robust.

For instance, the PET-PEESE has been criticized on grounds that it severely penalizes samples with a small N (Cunningham & Baumeister, 2016), is inappropriate for syntheses involving a limited number of studies (Cunningham & Baumeister, 2016), is sometimes inferior in performance compared to estimation methods that do not correct for publication bias (Reed, Florax, & Poot, 2015), and is premised on acceptance of the assumption that large sample sizes confer unbiased effect size estimates (Inzlicht, Gervais, & Berkman, 2015). Each of the other four tests used have been criticized on various grounds as well (e.g., Cunningham & Baumeister, 2016)

These arguments are not very convincing. Studies with larger samples produce more robust results than studies with smaller samples. Thus, placing a greater emphasizes on larger samples is justified by the smaller sampling error in these studies. In fact, random effects meta-analysis gives too much weight to small samples. It is also noteworthy that Baumeister and Inzlicht are not unbiased statisticians. Their work has been criticized as unreliable using PEESE and their responses are at least partially motivated by defending their work.

I will demonstrate that the PEETSE results are credible and that the other methods failed to reveal publication bias because effect-size meta-analyses fail to reveal the selection bias in original articles. For example, Wynn’s (1992) seminal finding was only significant with a one-sided test. However, the meta-analysis used a two-sided p-value of .055, which was coded as a non-significant result. This is a coding mistake because the result was used to reject the null-hypothesis with a different alpha level. A follow-up study by McCrink and Wynn (2004) reported a significant interaction effect with opposite effects for addition and subtraction, p = .016. However, the meta-analysis coded addition and subtraction separately, which produced one significant, p = .01, and one non-significant result, p = .504. The coding by subgroups is common in meta-analysis to conduct moderator analyses. However, this practices mutes the selection bias, which makes it more difficult to detect selection bias. Thus, bias tests need to be applied to the focal tests that supported authors’ main conclusions.

I recoded all 12 articles that reported 14 independent tests of the hypothesis that babies can add and subtract. I found only two articles that reported a failure to reject the null-hypothesis. Wakeley,Rivera, and Langer’s (2000) article is a rare example of an article in a major journal that reported a series of failed replication studies before 2011. “Unlike Wynn, we found no systematic evidence of either imprecise or precise adding and subtracting in young infants” (p. 1525). Moore and Cocas (2006) published two studies. Study 2 reported a non-significant result with an effect in the opposite direction. They clearly stated that this result failed to replicate Wynn’s results. “This test failed to reveal a reliable difference between the two groups’ fixation preferences, t(87) = -1.31, p = .09” However, they continued to examine the data with an Analysis of Variance that produced a significant four-way interaction, F(1, 85) = 4.80, p = .031. If this result had been used as the focal test, there would be only 2 non-significant results. However, I coded the study as reporting a non-significant result. Thus, the success rate across 14 studies in 12 articles is 11/14 = 78.6%. Without Wakeley et al.’s exceptional report of replication failures, the success rate would have been 93%, which is the norm in psychology publications (Sterling, 1959; Sterling et al., 1995).

The mean observed power of the 14 studies was MOP = 57%. The binomial probabilty of obtaining 11 or more significant results in 14 studies with 57% power is p = .080. This shows significant bias with the typical alpha level of .10 for bias tests due to the low power of these tests in small samples.

I also developed a more powerful bias tests that corrects for the inflation in the estimate of observed mean power that is based on the replicability index (Schimmack, 2016). Simulation studies show that this method has higher power, while maintaining good type-I error rates. To correct for inflation, I subtract the difference between the success rate and observed mean power from the observed mean power (simulation studies show that the mean is superior to the median that was used in the 2016 manuscript). This yields a value of .57 – (.79 – .57) = .35. The binomial probability of obtaining 11 out of 14 significant results with just 35% power is p = .001. These results confirm the results obtained with PEESE that publication bias contributes to the evidence in favor of babies’ math abilities.

To examine the credibilty of the published literature, I submitted the 11 significant results to a z-curve analysis (Brunner & Schimmack, 2019). The z-curve analysis also confirms the presence of publication bias. Whereas the observed discovery rate is 79%, 95%CI = 57% to 100%, the expected discovery rate is only 6%, 95%CI = 5% to 31%. As the confidence intervals do not overlap, the difference is statistically significant. The expected replication rate is 15%. Thus, if the 11 studies could be replicated exactly only 2 rather than 11 are expected to be significant again. The 95%CI included a value of 5% which means that all studies could be false positives. This shows that the published studies do not provide empirical evidence to reject the null-hypothesis that babies cannot add or subtract.

Meta-analyses also have another drawback. They focus on results that are common across studies. However, subsequent studies are not mere replication studies. Several studies in this literature examined whether the effect is an artifact of the experimental procedure and showed that performance is altered by changing the experimental setup. These studies first replicate the original finding and then show that the effect can be attributed to other factors. Given the low power to replicate the effect, it is not clear how credible this evidence is. However, it does show that even if the effect were robust, it does not warrant the conclusion that infants can do math.

### Conclusion

The problems with bias tests in standard meta-analysis are by no means unique to this article. It is well known that original articles publish nearly exclusively confirmatory evidence with success rates over 90%. However, meta-analyses often include a much larger number of non-significant results. This paradox is explained by the coding of original studies that produces non-significant results that were either not published or not the focus of an original article. This coding practices mutes the signal and makes it difficult to detect publication bias. This does not mean that the bias has disappeared. Thus, most published meta-analysis are useless because effect sizes are inflated to an unknown degree by selection for significance in the primary literature. # The Black Box of Meta-Analysis: Personality Change

Psychologists treat meta-analyses as the gold standard to answer empirical questions. The idea is that meta-analyses combine all of the relevant information into a single number that reveals the answer to an empirical question. The problem with this naive interpretation of meta-analyses is that meta-analyses cannot provide more information than the original studies contained. If original studies have major limitations, a meta-analytic integration does not make these limitations disappear. Meta-analyses can only reduce random sampling error, but they cannot fix problems of original studies. However, once a meta-analysis is published, the problems are often ignored and the preliminary conclusion is treated as an ultimate truth.

In this regard meta-analyses are like collateralized debt obligations that were popular until problems with CDOs triggered the financial crisis in 2008. A collateralized debt obligation (CDO) pools together cash flow-generating assets and repackages this asset pool into discrete tranches that can be sold to investors. The problem is when a CDO is considered to be less risky than the actual debt in the CDO actually is and investors believe they get high returns with low risks, when the actual debt is much more risky than investors believe.

In psychology, the review process and publication in a top journal give the appeal that information is trustworthy and can be cited as solid evidence. However, a closer inspection of the original studies might reveal that the results of a meta-analysis rest on shaky foundations.

Roberts et al. (2006) published a highly-cited meta-analysis in the prestigious journal Psychological Bulletin. The key finding of this meta-analysis was that personality levels change with age in longitudinal studies of personality.

The strongest change was observed for conscientiousness. According to the figure, conscientiousness doesn’t change much during adolescence, when the prefrontal cortex is still developing, but increases from d ~ .4 to d ~ .9 from age 30 to age 70 by about half a standard deviation.

Like many other consumers, I bought the main finding and used the results in my Introduction to Personality lectures without carefully checking the meta-anlysis. However, when I analyzed new data from longitudinal studies with large national representative samples, I could not find the predicted pattern (Schimmack, 2019a, 2019b, 2019c). Thus, I decided to take a closer look at the meta-analysis.

Roberts and colleagues list all the studies that were used with information about sample sizes, personality dimensions, and the ages that were studied. Thus, it is easy to find the studies that examined conscientiousness with participants who were 30 years or older at the start of the study.

There are 19 studies with a total sample size of N = 5,144 participants. However, sample sizes vary dramatically across studies from a low of N = 21 to a high of N = 2,274. Table 1 shows the proportion of participants that would be used to weight effect sizes according to sample sizes. By far the largest study found no significant increase in conscientiousness. I tried to find information about effect sizes from the other studies, but the published articles didn’t contain means or the information was from an unpublished source. I did not bother to obtain information from samples with less than 100 participants, because they contribute only 8% to the total sample size. Even big effects would be washed out by the larger samples.

The main conclusion that can be drawn from this information is that there is no reliable information to make claims about personality change throughout adulthood. If we assume that conscientiousness changes by half a standard deviation over a 40 year period, the average effect size for a decade is d = .12. For studies with even shorter retest intervals, the predicted effect size is even weaker. It is therefore highly speculative to extrapolate from this patchwork of data and make claims about personality change during adulthood.

Fortunately, much better information is now available from longitudinal panels with over thousand participants who have been followed for 12 (SOEP) or 20 (MIDUS) years with three or four retests. Theories of personality stability and change need to be revisited in the light of this new evidence. Updating theories in the face of new data is at the basis of science. Citing an outdated meta-analysis as if it provided a timeless answer to a question is not. # The Hierarchy of Consistency Revisited

In 1984, James J. Conley published one of the most interesting studies of personality stability. However, this important article was published in Personality and Individual Differences and has been ignored. Even today, the article has only 184 citations in WebofScience. In contrast, the more recent meta-analyses of personality stability by Roberts and DelVeccio (2001) has 1,446 citations.

Sometimes more recent and more citations doesn’t mean better. The biggest problem in studies of stability is that random and occasion specific measurement error attenuates observed retest correlations. Thus, observed retest correlations are prone to underestimate the true stability of personality traits. With a single retest-correlation it is impossible to separate measurement error from real change. However, when more then two repeated measurements are observed, it is possible to separate random measurement error from true change, using a statistical approach that was developed by Heise (1969).

The basic idea of Heise’s model is that change accumulates over time. Thus, if traits change from T1 to T2 and from T2 to T3, the trait changed even more from T1 to T3.

Without going into mathematical details, the observed retest correlation from T1 to T3 should match the product of the retest correlations from T1 to T2 and T2 to T3.

For example, if r12 = .8 and r 23 = .8, r13 should be .8 * .8 = .64.

The same is also true if the retest correlations are not identical. Maybe more change occurred from T1 to T2 than from T2 to T3. The total stability is still a function of the product of the two partial stabilities. For example, r12 = .8 and r23 = .5 yields r13 = .8 * .5 = .4.

However, if there is random measurement error, the r13 correlation will be larger than the product of the r12 and r23 correlations. For example, using the above example and a reliability of .8, we get r12 = .8 * .8 = .64, r23 = .4 * .8 = .32 and the product is .64 * .32 = .20, while the actual r13 correlation is .32 * .8 = .256. Assuming that reliability is constant, we have three equations with three unknowns and it is possible to solve the equations to estimate reliability.

(1) r12 = rel*s1; s1 = r12/rel
(2) r23 = rel*s2; s2 = r23/rel
(3) r13 = rel*s1*s2, rel = r13/(s1*s2)

r = (r12*r23)/r13

with r12 = .64, r23 = .32, and r13 = .256, we get (.64*.32)/.256 = .8.

Heise’s model is called an autoregressive model which implies that over time, retest correlations will become smaller and smaller until they approach zero. However, if stability is high, this can take a long time. For example, Conley (1984) estimated that the annual stability of IQ tests is r = .99. With this high stability, the retest correlation over 40 years is still r = .67. Consistent with Conley’s prediction a study found that the retest correlation from age 11 to age 70 of r = .67 (ref), which is even higher than predicted by Conley.

The Figure below shows Conley’s estimate for personality traits like extraversion and neuroticism. The figure shows that reliability varies across studies and instruments from as low as .4 to as high as .9. After correcting for unreliability, the estimated annual stability of personality traits is s = .98.

The figure also shows that most studies in this meta-analysis of retest correlations covered short time-intervals from a few month up to 10 years. Studies with 10 or more years are rare. As a result, Conley’s estimates are not very precise.

To test Conley’s predictions, I used the three waves of the Midlife in the US study (MIDUS). Each wave was approximately 10 years apart with a total time span of 20 years. To analyze the data, I fitted a measurement model to the personality items in the MIDUS. The fit of the measurement model has been examined elsewhere (Schimmack, 2019). The measurement model was constrained for all three waves (see OSF for syntax). The model had acceptable overall fit, CFI = .963, RMSEA = .018, SRMR = .035 (see OSF for output).

The key finding are the retest correlations r12, r23, and r13 for the Big Five and two method factors; a factor for evaluative bias (halo) and acquiescence bias.

For all traits except acquiescence bias, the r13 correlation is lower than the r12 or r23 correlation, indicating some real change. However, for all traits, the r13 correlation is higher than the product of r12 and r23, indicating the presence of random measurement error or occasion specific variance.

The next table shows the decomposition of the retest-correlations into a reliability component and a stability component.

The reliability estimates range from .84 to .92 for the Big Five scales. Reliability of the method factor is estimated to be lower. After correcting for unreliability, 20-year stability estimates increase from observed levels of .72 to .85 to estimated levels of .83 to .1. The implied annual stability estimates are above .99, which is higher than Conley’s estimate of .98.

Unfortunately, three time points are not enough to test the assumptions of Heise’s model. Maybe reliability increases over time. Another possibility is that some of the variance in personality is influenced by stable factors that never change (e.g., genetic variance). In this case, retest correlations do not approach zero, but to a level that is set by the influence of stable factors.

Anusic and Schimmack’s meta-analysis suggested that for the oldest age group, the amount of stable variance is 80, and that this asymptote is reached very quickly (see picture). However, this model predicts that 10-year retest correlations are equivalent to 20-year retest correlations, which is not consistent with the results in Table 1. Thus, the MIDUS data suggest that the model in Figure 1 overestimates the amount of stable trait variance in personality. More data are needed to model the contribution of stable factors to stability of personality traits. However, both models predict high stability of personality over a long period of 20 years.

## Conclusion

Science can be hard. Astronomy required telescopes to study the universe. Psychologists need longitudinal studies to examine stability of personality and personality development. The first telescopes were imperfect and led to false beliefs about canals and life on Mars. Similarly, longitudinal data are messy and provide imperfect glimpses into the stability of personality. However, the accumulating evidence shows impressive stability in personality differences. Many psychologists are dismayed by this finding because they have a fixation on disorders and negative traits. However, the Big Five traits are not disorders or undesirable traits. They are part of human diversity. When it comes to normal diversity, stability is actually desirable. Imagine you train for a job and after ten years of training you don’t like it anymore. Imagine you marry a quiet introvert and five year later, he is a wild party animal. Imagine, you never know who you are because your personality is constantly changing. The grass on the other side of the fence is often greener, but self-acceptance and building on one’s true strength may be a better way to live a happy life than to try to change your personality to fit cultural norms or parental expectations. Maybe stability and predictability aren’t so bad after all.

The results also have implications for research on personality change and development. If natural variation in factors that influence personality produces only very small changes over periods of a few years, it will be difficult to study personality change. Moreover, small real changes will be contaminated with relatively large amounts of random measurement error. Good measurement models that can separate real change from noise are needed to do so.

References

Conley, J. J. (1984). The hierarchy of consistency: A review and model of longitudinal findings on adult individual differences in intelligence, personality and self-opinion. Personality and Individual Differences, 84, 11-25.

Heise D. R. (1969) Separating reliability and stability in test-retest correlation. Am. social. Rev. 34, 93-101.

Roberts, B. W., & DelVecchio, W. F. (2000). The rank-order consistency of personality traits from childhood to old age: A quantitative review of longitudinal studies. Psychological Bulletin, 126, 3–25.

# Where Do Non-Significant Results in Meta-Analysis Come From?

It is well known that focal hypothesis tests in psychology journals nearly always reject the null-hypothesis (Sterling, 1959; Sterling et al., 1995). However, meta-analyses often contain a fairly large number of non-significant results. To my knowledge, the emergence of non-significant results in meta-analysis has not been examined systematically (happy to be proven wrong). Here I used the extremely well-done meta-analysis of money priming studies to explore this issue (Lodder, Ong, Grasman, & Wicherts, 2019).

I downloaded their data and computed z-scores by (1) dividing Cohen’s d by sampling errror (2/sqrt(N)) to compute t-values, (2) convert the absolute t-values into two-sided p-values, and (3) converting the p-values into absolute z-scores. The z-scores were submitted to a z-curve analysis (Brunner & Schimmack, 2019).

The first figure shows the z-curve for all test-statistics. Out of 282 tests, only 116 (41%) are significant. This finding is surprising, given the typical discovery rates over 90% in psychology journals. The figure also shows that the observed discovery rate of 41% is higher than the expected discovery rate of 29%, although the difference is relatively small and the confidence intervals overlap. This might suggest that publication bias in the money priming literature is not a serious problem. On the other hand, meta-analysis may mask the presence of publication bias in the published literature for a number of reasons.

Published vs. Unpublished Studies

Publication bias implies that studies with non-significant results end up in the proverbial file-drawer. Meta-analysts try to correct for publication bias by soliciting unpublished studies. The money-priming meta-analysis included 113 unpublished studies.

Figure 2 shows the z-curve for these studies. The observed discovery rate is slightly lower than for the full set of studies, 29%, and more consistent with the expected discovery rate, 25%. Thus, there this set of studies appears to be unbiased.

The complementary finding for published studies (Figure 3) is that the observed discovery rate increases, 49%, while the expected discovery rate remains low, 31%. Thus, published articles report a higher percentage of significant results without more statistical power to produce significant results.

A New Type of Publications: Independent Replication Studies

In response to concerns about publication bias and questionable research practices, psychology journals have become more willing to publish null-results. An emerging format are pre-registered replication studies with the explicit aim of probing the credibility of published results. The money priming meta-analysis included 47 independent replication studies.

Figure 4 shows that independent replication studies had a very low observed discovery rate, 4%, that is matched by a very low expected discovery rate, 5%. It is remarkable that the discovery rate for replication studies is lower than the discovery rate for unpublished studies. One reason for this discrepancy is that significance alone is not sufficient to get published and authors may be selective in the sharing of unpublished results.

Removing independent replication studies from the set of published studies further increases the observed discovery rate, 66%. Given the low power of replication studies, the expected discovery rate also increases somewhat, but it is notably lower than the observed discovery rate, 35%. The difference is now large enough to be statistically significant, despite the rather wide confidence interval around the expected discovery rate estimate.

Coding of Interaction Effects

After a (true or false) effect has been established in the literature, follow up studies often examine boundary conditions and moderators of an effect. Evidence for moderation is typically demonstrated with interaction effects that are sometimes followed by contrast analysis for different groups. One way to code these studies would be to focus on the main effect and to ignore the moderator analysis. However, meta-analysts often split the sample and treat different subgroups as independent samples. This can produce a large number of non-significant results because a moderator analysis allows for the fact that the effect emerged only in one group. The resulting non-significant results may provide false evidence of honest reporting of results because bias tests rely on the focal moderator effect to examine publication bias.

The next figure is based on studies that involved an interaction hypothesis. The observed discovery rate, 42%, is slightly higher than the expected discovery rate, 25%, but bias is relatively mild and interaction effects contribute 34 non-significant results to the meta-analysis.

The analysis of the published main effect shows a dramatically different pattern. The observed discovery rate increased to 56/67 = 84%, while the expected discovery rate remained low with 27%. The 95%CI do not overlap, demonstrating that the large file-drawer of missing studies is not just a chance finding.

I also examined more closely the 7 non-significant results in this set of studies.

1. Gino and Mogliner (2014) reported results of a money priming study with cheating as the dependent variable. There were 98 participants in 3 conditions. Results were analyzed with percentage of cheating participants and extent of cheating. The percentage of cheating participants produced a significant contrast of the money priming and control condition, chi2(1, N = 65) = 3.97. However, the meta-analysis used the extent of cheating dependent variable, which should only a marginally significant effect with a one-tailed p-value of .07. “Simple contrasts revealed that participants cheated more in the money condition (M = 4.41, SD = 4.25) than in both the control condition (M = 2.76, SD = 3.96; p = .07) and the time condition (M = 1.55, SD = 2.41; p = .002).” Thus, this non-significant results was presented as supporting evidence in the original article.
2. Jin, Z., Shiomura, K., & Jiang, L. (2015) conducted a priming studies with reaction times as dependent variables. This design is different from social priming studies in the meta-analysis. Moreover, money priming effects were examined within-participants, and the study produced several significant complex interaction effects. Thus, this study also does not count as a published failure to replicate money priming effects.
3. Mukherjee, S., Nargundkar, M., & Manjaly, J. A. (2014) examined the influence of money primes on various satisfaction judgments. Study 1 used a small sample of N = 48 participants with three dependent variables. Two achieved significance, but the meta-analysis aggregated across DVs, which resulted in a non-significant outcome. Study 2 used a larger sample and replicated significance for two outcomes. It was not included in the meta-analysis. In this case, aggregation of DVs explains a non-significant result in the meta-analysis, while the original article reported significant results.
4. I was unable to retrieve this article, but the abstract suggests that the article reports a significant interaction. ” We found that although money-primed reactance in control trials in which the majority provided correct responses, this effect vanished in critical trials in which the majority provided incorrect answers.”
[https://www.sbp-journal.com/index.php/sbp/article/view/3227]
5. Wierzbicki, J., & Zawadzka, A. (2014) published two studies. Study 1 reported a significant result. Study 2 added a non-significant result to the meta-analysis. Although the effect for money priming was not significant, this study reported a significant effect for credit-card priming and a money priming x morality interaction effect. Thus, the article also did not report a money-priming failure as the key finding.
6. Gasiorowska, A. (2013) is an article in Polish.
7. is a duplication of article 5

In conclusion, none of the 7 studies with non-significant results in the meta-analysis that were published in a journal reported that money priming had no effect on a dependent variable. All articles reported some significant results as the key finding. This further confirms how dramatically publication bias distorts the evidence reported in psychology journals.

Conclusion

In this blog post, I examined the discrepancy between null-results in journal articles and in meta-analysis, using a meta-analysis of money priming. While the meta-analysis suggested that publication bias is relatively modest, published articles showed clear evidence of publication bias with an observed discovery rate of 89%, while the expected discovery rate was only 27%.

Three factors contributed to this discrepancy: (a) the inclusion of unpublished studies, (b) independent replication studies, and (c) the coding of interaction effects as separate effects for subgroups rather than coding the main effect.

After correcting for publication bias, expected discovery rates are consistently low with estimates around 30%. The main exception are the independent replication studies that found no evidence at all. Overall, these results confirm that published money priming studies and other social priming studies cannot be trusted because the published studies overestimate replicability and effect sizes.

It is not the aim of this blog post to examine whether some money priming paradigms can produce replicable effects. The main goal was to explain why publication bias in meta-analysis is often small, when publication bias in the published literature is large. The results show that several factors contribute to this discrepancy and that the inclusion of unpublished studies, independent replication studies, and coding of effects explain most of these discrepancies. # The (lacking) predictive validity of the race IAT

Good science requires valid measures. This statement is hardly controversial. Not surprisingly, all authors of some psychological measure claim that their measure is valid. However, validation research is expensive and difficult to publish in prestigious journals. As a result, psychological science has a validity crisis. Many measures are used in hundreds of articles without clear definitions of constructs and without quantitative information about their validity (Schimmack, 2010).

The Implicit Association Test (AT) is no exception. The IAT was introduced in 1998 with strong and highly replicable evidence that average attitudes towards objects pairs (e.g., flowers vs. spiders) can be measured with reaction times in a classification task (Greenwald et al., 1998). Although the title of the article promised a measure of individual differences, the main evidence in the article were mean differences between groups. Thus, the original article provided little evidence that the IAT is a valid measure of individual differences.

The use of the IAT as a measure of individual differences in attitudes requires scientific evidence that tests scores are linked to variation in attitudes. Key evidence for the validity of a test are reliability, convergent validity, discriminant validity, and incremental predictive validity (Campbell & Fiske, 1959).

The validity of the IAT as a measure of attitudes has to be examined on a case by case basis because the link between associations and attitudes can vary depending on the attitude object. For attitude objects like pop drinks, Coke vs. Pepsi, associations may be strongly related to attitudes. In fact, the IAT has good predictive validity for choices between two pop drinks (Hofmann, Gawronski, Gschwendner, & Schmitt, 2005). However, it lacks convergent validity when it is used to measure self-esteem (Bosson & Swan, & Pennebaker, 2000).

The IAT is best known as a measure of prejudice, racial bias, or attitudes of White Americans towards African Americans. On the one hand, the inventor of the IAT, Greenwald, argues that the race IAT has predictive validity (Greenwald et al., 2009). Others take issue with the evidence: “Implicit Association Test scores did not permit prediction of individual-level behaviors” (Blanton et al., 2009, p. 567); “the IAT provides little insight into who will discriminate against whom, and provides no more insight than explicit measures of bias” (Oswald et al., 2013).

Nine years later, Greenwald and colleagues present a new meta-analysis of predictive validity of the IAT (Kurdi et al., 2018) based on 217 research reports and a total sample size of N = 36,071 participants. The results of this meta-analysis are reported in the abstract.

We found significant implicit– criterion correlations (ICCs) and explicit– criterion correlations (ECCs), with unique contributions of implicit (beta = .14) and explicit measures (beta = .11) revealed by structural equation modeling.

The problem with meta-analyses is that they aggregate information with diverse methods, measures, and criterion variables, and the meta-analysis showed high variability in predictive validity. Thus, the headline finding does not provide information about the predictive validity of the race IAT. As noted by the authors, “Statistically, the high degree of heterogeneity suggests that any single point estimate of the implicit– criterion relationship would be misleading” (p. 7).

Another problem of meta-analysis is that it is difficult to find reliable moderator variables if original studies have small samples and large sampling error. As a result, a non-significant moderator effect cannot be interpreted as evidence that results are homogeneous. Thus, a better way to examine the predictive validity of the race IAT is to limit the meta-analysis to studies that used the race IAT.

Another problem of small studies is that they introduce a lot of noise because point estimates are biased by sampling error. Stanley, Jarrell, and Doucouliagos (2010) made the ingenious suggestion to limit meta-analysis to the top 10% of studies with the largest sample sizes. As these studies have small sampling error to begin with, aggregating them will produce estimates with even smaller sampling error and inclusion of many small studies with high heterogeneity is not necessary. A smaller number of studies also makes it easier to evaluate the quality of studies and to examine sources of heterogeneity across studies. I used this approach to examine the predictive validity of the race IAT using the studies included in Kurdi et al.’s (2018) meta-analysis (data).

### Description of the Data

The datafile contained the variable groupStemCat2 that coded the groups compared in the IAT. Only studies classified as groupStemCat2 == “African American and Africans” were selected, leaving 1328 entries (rows). Next, I selected only studies with an IAT-criterion correlation, leaving 1004 entries. Next, I selected only entries with a minimum sample size of N = 100, leaving 235 entries (more than 10%).

The 235 entries were based on 21 studies, indicating that the meta-analysis coded, on average, more than 10 different effects for each study.

The median IAT-criterion correlation across all 235 studies was r = .070. In comparison, the median r for the 769 studies with N < 100 was r = .044. Thus, selecting for studies with large N did not reduce the effect size estimate.

When I first computed the median for each study and then the median across studies, I obtained a similar median correlation of r = .065. There was no significant correlation between sample size and median ICC-criterion correlation across the 21 studies, r = .12. Thus, there is no evidence of publication bias.

I now review the 21 studies in decreasing order of the median IAT-criterion correlation. I evaluate the quality of the studies with 1 to 5 stars ranging from lowest to highest quality. As some studies were not intended to be validation studies, this evaluation does not reflect the quality of a study per se. The evaluation is based on the ability of a study to validate the IAT as a measure of racial bias.

#### 1. * Ma et al. (Study 2), N = 303, r = .34

Ma et al. (2012) used several IATs to predict voting intentions in the 2012 US presidential election. Importantly, Study 2 did not include the race IAT that was used in Study 1 (#15, median r = .03). Instead, the race IAT was modified to include pictures of the two candidates Obama and Romney. Although it is interesting that an IAT that requires race classifications of candidates predicted voting intentions, this study cannot be used to claim that the race IAT as a measure of racial bias has predictive validity because the IAT measures specific attitudes towards candidates rather than attitudes towards African Americans in general.

#### 2. *** Knowles et al., N = 285, r = .26

This study used the race IAT to predict voting intentions and endorsement of Obama’s health care reforms. The main finding was that the race IAT was a significant predictor of voting intentions (Odds Ratio = .61; r = .20) and that this relationship remained significant after including the Modern Racism scale as predictor (Odds Ratio = .67, effect size r = .15). The correlation is similar to the result obtained in the next study with a larger sample.

#### 3. ***** Greenwald et al. (2009), N = 1,057, r = .17

The most conclusive results come from Greenwald et al.’s (2009) study with the largest sample size of all studies. In a sample of N = 1,057 participants, the race IAT predicted voting intentions in the 2008 US election (Obama vs. McCain), r = .17. However, in a model that included political orientation as predictor of voting intentions, only explicit attitude measures added incremental predictive validity, b = .10, SE = .03, t = 3.98, but the IAT did not, b = .00, SE = .02, t = 0.18.

#### 4. * Cooper et al., N = 178, r = .12

The sample size in the meta-analysis does not match the sample size of the original study. Although 269 patients were involved, the race IAT was administered to 40 primary care clinicians. Thus, predictive validity can only be assessed on a small sample of N = 40 physicians who provided independent IAT scores. Table 3 lists seven dependent variables and shows two significant results (p = .02, p = .02) for Black patients.

#### 5. * Biernat et al. (Study 1), N = 136, r = .10

Study 1 included the race IAT and donations to a Black vs. other student organizations as the criterion variable. The negative relationship was not significant (effect size r = .05). The meta-analysis also included the shifting standard variable (effect size r = .14). Shifting standards refers to the extent to which participants shifted standards in their judgments of Black versus White targets’ academic ability. The main point of the article was that shifting standards rather than implicit attitude measures predict racial bias in actual behavior. “In three studies, the tendency to shift standards was uncorrelated with other measures of prejudice but predicted reduced allocation of funds to a Black student organization.” Thus, it seems debatable to use shifting standards as a validation criterion for the race IAT because the key criterion variable were the donations, while shifting standards were a competing indirect measure of prejudice.

#### 6. ** Zhang et al. (Study 2), N = 196, r = .10

This study examined thought listings after participants watched a crime committed by a Black offender on Law and Order. “Across two programs, no statistically significant relations between the nature of the thoughts and the scores on IAT were found, F(2, 85) = 2.4, p < .11 for program 1, and F(2, 84) = 1.98, p < .53 for program 2.” The main limitation of this study is that thought listings are not a real social behavior. As the effect size for this study is close to the median, excluding it has no notable effect on the final result.

#### 7. * Ashburn et al., N = 300, r = .09

The title of this article is “Race and the psychological health of African Americans.” The sample consists of 300 African American participants. Although it is interesting to examine racial attitudes of African Americans, this study does not address the question whether the race IAT is a valid measure of prejudice against African Americans.

#### 8. *** Eno et al. (Study 1), N = 105, r = .09

This article examines responses to a movie set during the Civil Rights Era; “Remember the Titans.” After watching the movie, participants made several ratings about interpretations of events. Only one event, attributing Emma’s actions to an accident, showed a significant correlation with the IAT, r = .20, but attributions to racism also showed a correlation in the same direction, r = .10. For the other events, attributions had the same non-significant effect size, Girls interests r = .12, Girls race, r = .07; Brick racism, r = -.10, Brick Black coach’s actions, r = -.10.

#### 9. *** Aberson & Haag, N = 153, r = .07

Abserson and Haag administered the race IAT to 153 participants and asked questions about quantity and quality of contact with African Americans. They found non-significant correlations with quantity, r = -.12 and quality, r = -.10, and a significant positive correlation with the interaction, r = .17. The positive interaction effect suggests that individuals with low contact, which implies low quality contact as well, are not different from individuals with frequent high quality contact.

#### 10. *Hagiwara et al., N = 106, r = .07

This study is another study of Black patients and non-Black physician. The main limitation is that there were only 14 physicians and only 2 were White.

#### 11. **** Bar-Anan & Nosek, N = 397, r = .06

This study used contact as a validation criterion. The race IAT showed a correlation of r = -.14 with group contact. , N in the range from 492-647. The Brief IAT showed practically the same relationship, r = -.13. The appendix reports that contact was more strongly correlated with the explicit measures; thermometer r = .27, preference r = .31. Using structural equation modeling, as recommended by Greenwald and colleagues, I found no evidence that the IAT has unique predictive validity in the prediction of contact when explicit measures were included as predictors, b = .03, SE = .07, t = 0.37.

#### 12. *** Aberson & Gaffney, N = 386, median r = .05

This study related the race IAT to measures of positive and negative contact, r = .10, r = -.01, respectively. Correlations with an explicit measure were considerably stronger, r = .38, r = -.35, respectively. These results mirror the results presented above.

#### 13. * Orey et al., N = 386, median r = .04

This study examined racial attitudes among Black respondents. Although this is an interesting question, the data cannot be used to examine the predictive validity of the race IAT as a measure of prejudice.

#### 14. * Krieger et al., N = 708, median r = .04

This study used the race IAT with 442 Black participants and criterion measures of perceived discrimination and health. Although this is a worthwhile research topic, the results cannot be used to evaluate the validity of the race IAT as a measure of prejudice.

#### 15. *** Ma et al. (Study 1), N = 335, median r = .03

This study used the race IAT to predict voter intentions in the 2012 presidential election. The study found no significant relationship. “However, neither category-level measures were related to intention to vote for Obama (rs ≤ .06, ps ≥ .26)” (p. 31). The meta-analysis recorded a correlation of r = .045, based on email correspondence with the authors. It is not clear why the race IAT would not predict voting intentions in 2012, when it did predict voting intentions in 2008. One possibility is that Obama was now seen as a an individual rather than as a member of a particular group so that general attitudes towards African Americans no longer influenced voting intentions. No matter what the reason is, this study does not provide evidence for the predictive validity of the race IAT.

#### 16. **** Oliver et al., N = 105, median r = .02

This study was on online study of 543 family and internal medicine physicians. They completed the race IAT and gave treatment recommendations for a hypothetical case. Race of the patient was experimentally manipulated. The abstract states that “physicians possessed explicit and implicit racial biases, but those biases did not predict
treatment recommendations” (p. 177). The sample size in the meta-analysis is smaller because the total sample was broken down into smaller subgroups.

#### 17. * Nosek & Hansen, N = 207, median r = .01

This study did not include a clear validation criterion. The aim was to examine the relationship between the race IAT and cultural knowledge about stereoetypes. “In seven studies (158 samples, N = 107,709), the IAT was reliably and variably related to explicit attitudes, and explicit attitudes accounted for the relationship between the IAT and cultural knowledge.” The cultural knowledge measures were used as criterion variables. A positive relation, r = .10, was obtained for the item “If given the choice, who would most employers choose to hire, a Black American or a White American? (1 definitely White to 7 definitely Black).” A negative relation, r = -.09, was obtained for the item “Who is more likely to be a target of discrimination, a Black American or a White American? (1 definitely White to 7 definitely Black).”

#### 18. *Plant et al., N = 229, median r = .00

This article examined voting intentions in a sample of 229 students. The results are not reported in the article. The meta-analysis reported a positive r = .04 and a negative r = -.04 for two separate entries with different explicit measures, which must be a coding mistake. As voting behavior has been examined in larger and more representative samples (#3, #15), these results can be ignored.

#### 19. *Krieger et al. (2011), N = 503, r = .00

This study recruited 504 African Americans and 501 White Americans. All participants completed the race IAT. However, the study did not include clear validation criteria. The meta-analysis used self-reported experiences of discrimination as validation criterion. However, the important question is whether the race IAT predicts behaviors of people who discriminate, not the experience of victims of discrimination.

#### 20. *Fiedorowicz, N = 257, r = -.01

This study is a dissertation and the validation criterion was religious fundamentalism.

#### 21. *Heider & Skowronski, N = 140, r = -.02

This study separated the measurement of prejudice with the race IAT and the measurement of the criterion variables by several weeks. The criterion was cooperative behavior in a prisoner dilemma game. The results showed that “both the IAT (b = -.21, t = -2.51, p = .013) and the Pro-Black subscore (b = .17, t = 2.10, p = .037) were significant predictors of more cooperation with the Black confederate. However, these results were false and have been corrected (see Carlsson et al., 2018, for a detailed discussion).

Heider, J. D., & Skowronski, J.J. (2011). Addendum to Heider and Skowronski (2007): Improving the predictive validity of the Implicit Association Test. North American Journal of Psychology, 13, 17-20

### Discussion

In summary, a detailed examination of the race IAT studies included in the meta-analysis shows considerable heterogeneity in the quality of the studies and their ability to examine the predictive validity of the race IAT. The best study is Greenwald et al.’s (2009) study with a large sample and voting in the Obama vs. McCain race as the criterion variable. However, another voting study failed to replicate these findings in 2012. The second best study was BarAnan and Nosek’s study with intergroup contact as a validation criterion, but it failed to show incremental predictive validity of the IAT.

Studies with physicians show no clear evidence of racial bias. This could be due to the professionalism of physicians and the results should not be generalized to the general population. The remaining studies were considered unsuitable to examine predictive validity. For example, some studies with African American participants did not use the IAT to measure prejudice.

Based on this limited evidence it is impossible to draw strong conclusions about the predictive validity of the race IAT. My assessment of the evidence is rather consistent with the authors of the meta-analysis, who found that “out of the 2,240 ICCs included in this metaanalysis, there were only 24 effect sizes from 13 studies that (a) had the relationship between implicit cognition and behavior as their primary focus” (p. 13).

This confirms my observation in the introduction that psychological science has a validation crisis because researchers rarely conduct validation studies. In fact, despite all the concerns about replicability, the lack of replication studies are much more numerous than validation studies. The consequences of the validation crisis is that psychologists routinely make theoretical claims based on measures with unknown validity. As shown here, this is also true for the IAT. At present, it is impossible to make evidence-based claims about the validity of the IAT because it is unknown what the IAT measures and how well it measures what it measures.

#### Theoretical Confusion about Implicit Measures

The lack of theoretical understanding of the IAT is evident in Greenwald and Banaji’s (2017) recent article, where they suggest that “implicit cognition influences explicit cognition that, in turn, drives behavior” (Kurdi et al., p. 13). This model would imply that implicit measures like the IAT do not have a direct link to behavior because conscious processes ultimately determine actions. This speculative model is illustrated with Bar-Anan and Nosek’s (#11) data that showed no incremental predictive validity on contact. The model can be transformed into a causal chain by changing the bidiretional path into an assumed causal relationship between implicit and explicit attitudes.

However, it is also possible to change the model into a single factor model, that considers unique variance in implicit and explicit measures as mere method variance.

Thus, any claims about implicit bias and explicit bias is premature because the existing data are consistent with various theoretical models. To make scientific claims about implicit forms of racial bias, it would be necessary to obtain data that can distinguish empirically between single construct and dual-construct models.

### Conclusion

The race IAT is 20 years old. It has been used in hundreds of articles to make empirical claims about prejudice. The confusion between measures and constructs has created a public discourse about implicit racial bias that may occur outside of awareness. However, this discourse is removed from the empirical facts. The most important finding of the recent meta-analysis is that a careful search of the literature uncovered only a handful of serious validation studies and that the results of these studies are suggestive at best. Even if future studies would provide more conclusive evidence of incremental predictive validity, this finding would be insufficient to claim that the IAT is a valid measure of implicit bias. The IAT could have incremental predictive validity even if it were just a complementary measure of consciously accessible prejudice that does not share method variance with explicit measures. A multi-method approach is needed to examine the construct validity of the IAT as a measure of implicit race bias. Such evidence simply does not exist. Greenwald and colleagues had 20 years and ample funding to conduct such validation studies, but they failed to do so. In contrast, their articles consistently confuse measures and constructs and give the impression that the IAT measures unconscious processes that are hidden from introspection (“conscious experience provides only a small window into how the mind works”, “click here to discover your hidden thoughts”).

Greenwald and Banaji are well aware that their claims matter. “Research on implicit social cognition has witnessed higher levels of attention both from the general public and from governmental and commercial entities, making regular reporting of what is known an added responsibility” (Kurdi et al., 2018, p. 3). I concur. However, I do not believe that their meta-analysis fulfills this promise. An unbiased assessment of the evidence shows no compelling evidence that the race IAT is a valid measure of implicit racial bias; and without a valid measure of implicit racial bias it is impossible to make scientific statements about implicit racial bias. I think the general public deserves to know this. Unfortunately, there is no need for scientific evidence that prejudice and discrimination still exists. Ideally, psychologists will spend more effort in developing valid measures of racism that can provide trustworthy information about variation across individuals, geographic regions, groups, and time. Many people believe that psychologists are already doing it, but this review of the literature shows that this is not the case. It is high time to actually do what the general public expects from us. # No Incremental Predictive Validity of Implicit Attitude Measures

The general public has accepted the idea of implicit bias; that is, individuals may be prejudice without awareness. For example, in 2018 Starbucks closed their stores for one day to train employees to detect and avoid implicit bias (cf. Schimmack, 2018).

However, among psychological scientists the concept of implicit bias is controversial (Blanton et al., 2009; Schimmack, 2019). The notion of implicit bias is only a scientific construct if it can be observed with scientific methods, and this requires valid measures of implicit bias.

Valid measures of implicit bias require evidence of reliability, convergent validity, discriminant validity, and incremental predictive validity. Proponents of implicit bias claim that measures of implicit bias have demonstrated these properties. Critics are not convinced.

For example, Cunningham, Preacher, and Banaji (2001) conducted a multi-method study and claimed that their results showed convergent validity among implicit measures and that implicit measures correlated more strongly with each other than with explicit measures. However, Schimmack (2019) demonstrated that a model with a single factor fit the data better and that the explicit measures loaded higher on this factor than the evaluative priming measure. This finding challenges the claim that implicit measures possess discriminant validity. That is, the are implicit measures of racial bias, but they are not measures of implicit racial bias.

A forthcoming meta-analysis claims that implicit measures have unique predictive validity (Kurdi et al., 2018). The average effect size for the correlation between an implicit measure and a criterion was r = .14. However, this estimate is based on studies across many different attitude objects and includes implicit measures of stereotypes and identity. Not surprisingly, the predictive validity was heterogeneous. Thus, the average does not provide information about the predictive validity of the race IAT as a measure of implicit bias. The most important observation was that sample sizes of many studies were too small to investigate predictive validity given the small expected effect size. Most studies had sample sizes with fewer than 100 participants (see Figure 1).

A notable exception is a study of voting intentions in the historic 2008 presidential election, where US voters had a choice to elect the first Black president, Obama, or the Republican candidate McCain. A major question at that time was how much race and prejudice would influence the vote. Greenwald, Tucker Smith, Sriram, Bar-Anan, and Nosek (2009) conducted a study to address this question. They obtained data from N = 1,057 participants who completed online implicit measures and responded to survey questions. The key outcome variable was a simple dichotomous question about voting intentions. The sample was not a national representative sample as indicated by 84.2% declared votes for Obama versus 15.8% declared votes for McCain. The predictor variables were two self-report measures of prejudice (feeling-thermometer, Likert scale), two implicit measures (Brief IAT, AMP), the Symbolic Racism Scale, and a measure of political orientation (Conservative vs. Liberal).

The correlation among all measures were reported in Table 1.

The results for the Brief IAT (BIAT) are highlighted. First, the BIAT does predict voting intentions (r = .17). Second, the BIAT shows convergent validity with the second implicit measure; the Affective Missattribution Paradigm (AMP). Third, the IAT also correlates with the explicit measures of racial bias. Most important, the correlations with the implicit AMP are weaker than the correlations with the explicit measures. This finding confirms Schimmack’s (2019) finding that implicit measures lack discriminant validity.

The correlation table does not address the question whether implicit measures have incremental predictive validity. To examine this question, I fit a structural equation model to the reproduced covariance matrix based on the reported correlations and standard deviations using MPLUS8.2. The model shown in Figure 1 had good overall fit, chi2(9, N = 1057) = 15.40, CFI = .997, RMSEA = .026, 90%CI = .000 to .047.

The model shows that explicit and implicit measures of racial bias load on a common factor (att). Whereas the explicit measures share method variance, the residuals of the two implicit measures are not correlated. This confirms the lack of discriminant validity. That is, there is no unique variance shared only by implicit measures. The strongest predictor of voting intentions is political orientation. Symbolic racism is a mixture of conservatism and racial bias, and it has no unique relationship with voting intentions. Racial bias does make a unique contribution to voting intentions, (b = .22, SE = .05, t = 4.4). The blue path shows that the BIAT does have predictive validity above and beyond political orientation, but the effect is indirect. That is, the IAT is a measure of racial bias and racial bias contributes to voter intentions. The red path shows that the BIAT has no unique relationship with voting intentions. The negative coefficient is not significant. Thus, there is no evidence that the unique variance in the BIAT reflects some form of implicit racial bias that influences voting intentions.

In short, these results provide no evidence for the claim that implicit measures tap implicit racial biases. In fact, there is no scientific evidence for the concept of implicit bias, which would require evidence of discriminant validity and incremental validity.

### Conclusion

The use of structural equation modeling (SEM) was highly recommended by the authors of the forthcoming meta-analysis (Kurdi et al., 2018). Here I applied SEM used the best data with multiple explicit and implicit measures, an important criterion variable, and a large sample size that is sufficient to detect small relationships. Contrary to the meta-analysis, the results do not support the claim that implicit measures have incremental predictive validity. In addition, the results confirmed Schimmack’s (2019) results that implicit measures lack discriminant validity. Thus, the construct of implicit racial bias lacks empirical support. Implicit measures like the IAT are best considered as implicit measures of racial bias that is also reflected in explicit measures.

With regard to the political question whether racial bias influenced voting in the 2008 election, these results suggest that racial bias did indeed matter. Using only explicit measures would have underestimated the effect of racial bias due to the substantial method variance in these measures. Thus, the IAT can make an important contribution to the measurement of racial bias because it doesn’t share method variance with explicit measures.

In the future, users of implicit measures need to be more careful in their claims about the construct validity of implicit measures. Greenwald et al. (2009) constantly conflate implicit measures of racial bias with measures of implicit racial bias. For example, the title claims “Implicit Race Attitudes Predicted Vote” , the term “Implicit race attitude measure” is ambiguous because it could mean implicit measure or implicit attitude, whereas the term “implicit measures of race attitudes” implies that the measures are implicit but the construct is racial bias; otherwise it would be “implicit measures of implicit racial bias.” The confusion arises from a long tradition in psychology to conflate measures and constructs (e.g., intelligence is whatever an IQ test measures) (Campbell & Fiske, 1959). Structural equation modeling makes it clear that measures (boxes) and constructs (circles) are distinct and that measurement theory is needed to relate measures to constructs. At present, there is clear evidence that implicit measures can measure racial bias, but there is no evidence that attitudes have an explicit and an implicit component. Thus, scientific claims about racial bias do not support the idea that racial bias is implicit. This idea is based on the confusion of measures and constructs in the social cognition literature. # Random measurement error and the replication crisis: A statistical analysis

This is a draft of a commentary on Loken and Gelman’s Science article “Measurement error and the replication crisis. Comments are welcome.

Random Measurement Error Reduces Power, Replicability, and Observed Effect Sizes After Selection for Significance

Ulrich Schimmack and Rickard Carlsson

In the article “Measurement error and the replication crisis” Loken and Gelman (LG) “caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger” (1). We agree with the overall message that it is a fallacy to interpret observed effect size estimates in small samples as accurate estimates of population effect sizes.  We think it is helpful to recognize the key role of statistical power in significance testing.  If studies have less than 50% power, effect sizes must be inflated to be significant. Thus, all observed effect sizes in these studies are inflated.  Once power is greater than 50%, it is possible to obtain significance with observed effect sizes that underestimate the population effect size. However, even with 80% power, the probability of overestimation is 62.5%. [corrected]. As studies with small samples and small effect sizes often have less than 50% power (2), we can safely assume that observed effect sizes overestimate the population effect size. The best way to make claims about effect sizes in small samples is to avoid interpreting the point estimate and to interpret the 95% confidence interval. It will often show that significant large effect sizes in small samples have wide confidence intervals that also include values close to zero, which shows that any strong claims about effect sizes in small samples are a fallacy (3).

Although we agree with Loken and Gelman’s general message, we believe that their article may have created some confusion about the effect of random measurement error in small samples with small effect sizes when they wrote “In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance” (p. 584).  We both read this sentence as suggesting that under the specified conditions random error may produce even more inflated estimates than perfectly reliable measure. We show that this interpretation of their sentence would be incorrect and that random measurement error always leads to an underestimation of observed effect sizes, even if effect sizes are selected for significance. We demonstrate this fact with a simple equation that shows that true power before selection for significance is monotonically related to observed power after selection for significance. As random measurement error always attenuates population effect sizes, the monotonic relationship implies that observed effect sizes with unreliable measures are also always attenuated.  We provide the formula and R-Code in a Supplement. Here we just give a brief description of the steps that are involved in predicting the effect of measurement error on observed effect sizes after selection for significance.

The effect of random measurement error on population effect sizes is well known. Random measurement error adds variance to the observed measures X and Y, which lowers the observable correlation between two measures. Random error also increases the sampling error. As the non-central t-value is the proportion of these two parameters, it follows that random measurement error always attenuates power. Without selection for significance, median observed effect sizes are unbiased estimates of population effect sizes and median observed power matches true power (4,5). However, with selection for significance, non-significant results with low observed power estimates are excluded and median observed power is inflated. The amount of inflation is proportional to true power. With high power, most results are significant and inflation is small. With low power, most results are non-significant and inflation is large. Schimmack developed a formula that specifies the relationship between true power and median observed power after selection for significance (6). Figure 1 shows that median observed power after selection for significant is a monotonic function of true power.  It is straightforward to transform inflated median observed power into median observed effect sizes.  We applied this approach to Locken and Gelman’s simulation with a true population correlation of r = .15. We changed the range of sample sizes from 50 to 3050 to 25 to 1000 because this range provides a better picture of the effect of small samples on the results. We also increased the range of reliabilities to show that the results hold across a wide range of reliabilities. Figure 2 shows that random error always attenuates observed effect sizes, even after selection for significance in small samples. However, the effect is non-linear and in small samples with small effects, observed effect sizes are nearly identical for different levels of unreliability. The reason is that in studies with low power, most of the observed effect is driven by the noise in the data and it is irrelevant whether the noise is due to measurement error or unexplained reliable variance. In conclusion, we believe that our commentary clarifies how random measurement error contributes to the replication crisis.  Consistent with classic test theory, random measurement error always attenuates population effect sizes. This reduces statistical power to obtain significant results. These non-significant results typically remain unreported. The selective reporting of significant results leads to the publication of inflated effect size estimates. It would be a fallacy to consider these effect size estimates reliable and unbiased estimates of population effect sizes and to expect that an exact replication study would also produce a significant result.  The reason is that replicability is determined by true power and observed power is systematically inflated by selection for significance.  Our commentary also provides researchers with a tool to correct for the inflation by selection for significance. The function in Figure 1 can be used to deflate observed effect sizes. These deflated observed effect sizes provide more realistic estimates of population effect sizes when selection bias is present. The same approach can also be used to correct effect size estimates in meta-analyses (7).

References

1. Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science,

355 (6325), 584-585. [doi: 10.1126/science.aal3618]

2. Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153, http://dx.doi.org/10.1037/h004518

3. Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003. http://dx.doi.org/10.1037/0003-066X.49.12.99

4. Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487

5. Schimmack, U. (2016). A revised introduction to the R-Index. https://replicationindex.com/2016/01/31/a-revised-introduction-to-the-r-index

6. Schimmack, U. (2017). How selection for significance influences observed power. https://replicationindex.com/2017/02/21/how-selection-for-significance-influences-observed-power/

7. van Assen, M.A., van Aert, R.C., Wicherts, J.M. (2015). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 293-309. doi: 10.1037/met0000025.

################################################################

#### R-CODE ###

################################################################

### sample sizes

N = seq(25,500,5)

### true population correlation

true.pop.r = .15

### reliability

rel = 1-seq(0,.9,.20)

### create matrix of population correlations between measures X and Y.

obs.pop.r = matrix(rep(true.pop.r*rel),length(N),length(rel),byrow=TRUE)

### create a matching matrix of sample sizes

N = matrix(rep(N),length(N),length(rel))

### compute non-central t-values

ncp.t = obs.pop.r / ( (1-obs.pop.r^2)/(sqrt(N – 2)))

### compute true power

true.power = pt(ncp.t,N-2,qt(.975,N-2))

###  Get Inflated Observed Power After Selection for Significance

inf.obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,qnorm(.975))),qnorm(.975))

### Transform Into Inflated Observed t-values

inf.obs.t = qt(inf.obs.pow,N-2,qt(.975,N-2))

### Transform inflated observed t-values into inflated observed effect sizes

inf.obs.es = (sqrt(N + 4*inf.obs.t^2 -2) – sqrt(N – 2))/(2*inf.obs.t)

### Set parameters for Figure

x.min = 0

x.max = 500

y.min = 0.10

y.max = 0.45

ylab = “Inflated Observed Effect Size”

title = “Effect of Selection for Significance on Observed Effect Size”

### Create Figure

for (i in 1:length(rel)) {

print(i)

plot(N[,1],inf.obs.es[,i],type=”l”,xlim=c(x.min,x.max),ylim=c(y.min,y.max),col=col[i],xlab=”Sample Size”,ylab=”Median Observed Effect Size After Selection for Significance”,lwd=3,main=title)

segments(x0 = 600,y0 = y.max-.05-i*.02, x1 = 650,col=col[i], lwd=5)

text(730,y.max-.05-i*.02,paste0(“Rel = “,format(rel[i],nsmall=1)))

par(new=TRUE)

}

abline(h = .15,lty=2)

##################### THE END #################################

# Bayesian Meta-Analysis: The Wrong Way and The Right Way

Carlsson, R., Schimmack, U., Williams, D.R., & Bürkner, P. C. (in press). Bayesian Evidence Synthesis is no substitute for meta-analysis: a re-analysis of Scheibehenne, Jamil and Wagenmakers (2016). Psychological Science.

In short, we show that the reported Bayes-Factor of 36 in the original article is inflated by pooling across a heterogeneous set of studies, using a one-sided prior, and assuming a fixed effect size.  We present an alternative Bayesian multi-level approach that avoids the pitfalls of Bayesian Evidence Synthesis, and show that the original set of studies produced at best weak evidence for an effect of social norms on reusing of towels.