Klaus Fiedler’s Response to the Replication Crisis: In/actions speak louder than words

Klaus Fiedler is a prominent experimental social psychologist. Aside from his empirical articles, he has also contributed to meta-psychological articles. He is one of several authors of a highly cited article that suggested numerous improvements in response to the replication crisis: Recommendations for Increasing Replicability in Psychology (Asendorpf, Conner, De Fruyt, De Houwer, Denissen, K. Fiedler, S. Fiedler, Funder, Kliegl, Nosek, Perugini, Roberts, Schmitt, van Aken, Weber, & Wicherts, 2013).

The article makes several important contributions. First, it recognizes that success rates (p < .05) in psychology journals are too high (although a reference to Sterling, 1959, is missing). Second, it carefully distinguishes reproducibility, replicability, and generalizability. Third, it recognizes that future studies need to reduce sampling error to increase replicability. Fourth, it points out that reducing sampling error increases replicability because studies with less sampling error have more statistical power, which reduces the risk of false-negative results that often remain unpublished. The article also points out problems with articles that present results from multiple underpowered studies.
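To illustrate the link between sample size, sampling error, and power, here is a minimal R sketch (the effect size of d = .4 and the two sample sizes are hypothetical, chosen only for illustration):

# Power of a two-sample t-test for a hypothetical effect of d = .4
power.t.test(n = 20, delta = 0.4, sd = 1, sig.level = .05)$power    # ~ .23 with 20 participants per cell
power.t.test(n = 100, delta = 0.4, sd = 1, sig.level = .05)$power   # ~ .80 with 100 participants per cell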

“It is commonly believed that one way to increase replicability is to present multiple studies. If an effect can be shown in different studies, even though each one may be underpowered, many readers, reviewers, and editors conclude that it is robust and replicable. Schimmack (2012), however, has noted that the opposite can be true. A study with low power is, by definition, unlikely to obtain a significant result with a given effect size.” (p. 111)

If we assume that co-authorship implies knowledge of the content of an article, we can infer that Klaus Fiedler was aware of the problem of multiple-study articles in 2013. It is therefore disconcerting to see that Klaus Fiedler is the senior author of an article published in 2014 that illustrates this very problem (T. Krüger, K. Fiedler, Koch, & Alves, 2014).

I came across this article in a response by Jens Forster to a failed replication of Study 1 in Forster, Liberman, and Kuschel (2008). Forster cites the Krüger et al. (2014) article as evidence that their findings have been replicated, in order to discredit the failed replication in the Open Science Collaboration replication project (Open Science Collaboration, 2015, Science). However, a bias analysis suggests that Krüger et al.’s five studies had low power, which makes their success rate of 100% surprising.

Study      N     Test             p.val   z      OP
Study 1    44    t(41) = 2.79     0.009   2.61   0.74
Study 2    80    t(78) = 2.81     0.006   2.73   0.78
Study 3    65    t(63) = 2.06     0.044   2.02   0.52
Study 4    66    t(64) = 2.30     0.025   2.25   0.61
Study 5    170   t(168) = 2.23    0.027   2.21   0.60

z  <- -qnorm(p.val / 2)       # convert two-sided p-value into z-score
OP <- pnorm(z, mean = 1.96)   # observed power: probability of obtaining z > 1.96 if the true effect equals the observed z

Median observed power is only 61%, but the success rate (p < .05) is 100%. Using the incredibility index from Schimmack (2012), we find that the binomial probability of obtaining at least one non-significant result in five studies with a median power of 61% is 92%. Thus, the absence of any non-significant result in this set of five studies is unlikely.
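This calculation can be reproduced directly from the reported p-values (a minimal R sketch; small rounding differences are possible because the reported p-values are rounded):

p.val <- c(0.009, 0.006, 0.044, 0.025, 0.027)   # reported p-values of Studies 1-5
z     <- -qnorm(p.val / 2)                      # z-scores
OP    <- pnorm(z, mean = 1.96)                  # observed power for each study
median(OP)                                      # ~ .61
1 - median(OP)^5                                # ~ .92: probability of at least one non-significant result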

As Klaus Fiedler was aware of the incredibility index by the time this article was published, the authors could have computed the incredibility of their results before publishing them (as Micky Inzlicht blogged, “check yourself, before you wreck yourself”).

Meanwhile, other bias tests have been developed. The Test of Insufficient Variance (TIVA) compares the observed variance of p-values converted into z-scores with the expected variance of independent z-scores, which is 1. The observed variance is much smaller, var(z) = 0.089, and the probability of obtaining such a small variance or less by chance is p = .014. Thus, TIVA corroborates the conclusion based on the incredibility index that the reported results are too good to be true.
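TIVA can also be computed from the reported p-values. The scaled sample variance of the z-scores is treated as a chi-square statistic with k - 1 degrees of freedom under the assumption that the true variance is 1 (again a minimal sketch; the exact variance depends on rounding of the reported p-values):

z <- -qnorm(c(0.009, 0.006, 0.044, 0.025, 0.027) / 2)
var(z)                                                  # ~ .09, far below the expected value of 1
pchisq((length(z) - 1) * var(z), df = length(z) - 1)    # ~ .014: probability of a variance this small or smaller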

Another new method is z-curve. Z-curve fits a model to the density distribution of significant z-scores. The aim is not to show bias, but to estimate the true average power after correcting for bias. The figure shows that while the point estimate of 53% may look reassuring, the 95% confidence interval ranges from 5% (all five significant results are false positives) to 100% (all five results are perfectly replicable). In other words, the data provide no empirical evidence despite five significant results. The reason is that selection bias introduces uncertainty about the true values, and the data are too weak to reduce this uncertainty.

[Figure: z-curve plot of the five significant z-scores reported by Krüger et al. (2014)]

The plot also shows visually how unlikely the pile of z-scores between 2 and 2.8 is. Given normal sampling error there should be some non-significant results and some highly significant (p < .005, z > 2.8) results.
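This expectation can be quantified with a simple sketch that assumes, for simplicity, a single true effect with 53% power (the z-curve point estimate); the actual z-curve model allows for heterogeneity, so these numbers are only illustrative:

ncp <- 1.96 + qnorm(.53)      # expected z-score (noncentrality) of a test with 53% power
pnorm(1.96, mean = ncp)       # ~ .47: expected share of non-significant results
1 - pnorm(2.8, mean = ncp)    # ~ .22: expected share of highly significant results (z > 2.8, p < .005)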

In conclusion, Krüger et al.’s multiple-study article cannot be used by Forster et al. as credible evidence that their findings have been replicated by independent researchers, because the article itself provides no empirical evidence.

The evidence of low power in a multiple-study article also shows a dissociation between Klaus Fiedler’s verbal endorsement of the need to improve replicability as co-author of the Asendorpf et al. (2013) article and his actions as author of an incredible multiple-study article.

There is little excuse for the use of small samples in Krüger et al.’s set of five studies. Participants in all five studies were recruited from MTurk, and it would have been easy to conduct more powerful and credible tests of the key hypotheses in the article. Whether these tests would have supported the predictions remains an open question.

Automated Analysis of Time Trends

It is very time-consuming to carefully analyze individual articles. However, it is possible to use automated extraction of test statistics to examine time trends. I extracted test statistics from social psychology articles that included Klaus Fiedler as an author. All test statistics were converted into absolute z-scores as a common metric of the strength of evidence against the null hypothesis. Because only significant results can be used as empirical support for predicted effects, I limited the analysis to significant results (z > 1.96). I computed the median z-score for each year and plotted it as a function of publication year.
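The conversion from reported test statistics to z-scores follows the same logic as above. As a minimal sketch for t-tests (t_to_z is a hypothetical helper name, not part of any published script):

t_to_z <- function(t, df) {     # convert a t-statistic into an absolute z-score
  p <- 2 * pt(-abs(t), df)      # two-sided p-value of the t-test
  -qnorm(p / 2)                 # corresponding absolute z-score
}
t_to_z(2.79, 41)                # ~ 2.65, close to the 2.61 in the table above (derived from the rounded p = .009)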

The plot shows a slight increase in the strength of evidence (annual increase = 0.009 standard deviations), which is not statistically significant, t(16) = 0.30. Visual inspection shows no notable increase after 2011, when the replication crisis started, or after 2013, when Klaus Fiedler co-authored the article on ways to improve psychological science.
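The trend test is a simple linear regression of the annual median z-scores on publication year (a sketch, assuming a hypothetical data frame dat with columns year and median_z):

fit <- lm(median_z ~ year, data = dat)   # dat is hypothetical: one median z-score per publication year
summary(fit)                             # reported slope ~ 0.009, t(16) = 0.30, not significant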

Given the lack of evidence for improvement,  I collapsed the data across years to examine the general replicability of Klaus Fiedler’s work.

The estimate of 73% replicability suggests that a randomly drawn published result from one of Klaus Fiedler’s articles has a 73% chance of being replicated if the study and analysis were repeated exactly. The 95% confidence interval ranges from 68% to 77%, showing relatively high precision of this estimate. This is a respectable estimate that is consistent with the overall average for psychology and higher than the average for social psychology (Replicability Rankings). The average for some social psychologists is below 50%.

Despite this somewhat positive result, the graph also shows clear evidence of publication bias. The vertical red line at z = 1.96 marks the boundary between significant results on the right and non-significant results on the left. Values between 1.65 and 1.96 are often published as marginally significant (p < .10) and interpreted as weak support for a hypothesis, so reporting them is not an indication of honest reporting of non-significant results. Given the distribution of significant results, we would expect more non-significant results (grey line) than are actually reported. The aim of reforms such as those recommended by Fiedler himself in the 2013 article is to reduce this bias in favor of significant results.
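The two cutoffs correspond to the standard normal quantiles for two-sided p-values of .10 and .05:

qnorm(1 - .10 / 2)   # 1.645: cutoff for marginal significance (p < .10, two-sided)
qnorm(1 - .05 / 2)   # 1.960: cutoff for conventional significance (p < .05, two-sided)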

There is also clear evidence of heterogeneity in the strength of evidence across studies. This is reflected in the average power estimates for different segments of z-scores. Average power for z-scores between 2 and 2.5 is estimated to be only 45%, which implies that after bias correction the corresponding p-values are no longer significant, because 50% power corresponds to p = .05. Even z-scores between 2.5 and 3 average only 53% power. All of the z-scores from the 2014 article fall in the range between 2 and 2.8 (p < .05 and p > .005). These results are unlikely to replicate. However, other results show strong evidence and are likely to replicate. In fact, a study by Klaus Fiedler was successfully replicated in the OSC replication project. This was a cognitive study with a within-subject design and a z-score of 3.54.
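The correspondence between 50% power and p = .05 follows directly from the observed-power formula used above: if the expected z-score equals the significance threshold, exactly half of the sampling distribution falls above it.

pnorm(1.96, mean = 1.96)   # = 0.5: a study whose expected z-score equals 1.96 (p = .05) has exactly 50% power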

The next Figure shows the model fit for models with a fixed percentage of false positive results.

Model fit starts to deteriorate notably at false-positive rates of 40% or more. This suggests that the majority of published results by Klaus Fiedler are true positives. However, selection for significance can inflate effect sizes. Thus, observed effect-size estimates should be adjusted for this bias.

Conclusion

In conclusion, it is easier to talk about improving replicability in psychological science, particularly in experimental social psychology, than to actually implement good practices. Even prominent researchers like Klaus Fiedler have responsibilities to their students to publish as much as possible. As long as reputation is measured in terms of the number of publications and citations, this will not change.

Fortunately, it is now possible to quantify replicability and to use these measures to reward research that requires more resources to provide replicable and credible evidence without the use of questionable research practices. Based on these metrics, the article by Krüger et al. is not the norm for publications by Klaus Fiedler, and his replicability index of 73 is higher than that of other prominent experimental social psychologists.

An easy way to improve it further would be to retract the weak T. Krüger et al. article. This would not be a costly retraction because the article has not been cited in Web of Science so far (no harm, no foul). In contrast, the Asendorpf et al. (2013) article has been cited 245 times and is Klaus Fiedler’s second most cited article in Web of Science.

The message is clear.  Psychology is not in the year 2010 anymore. The replicability revolution is changing psychology as we speak.  Before 2010, the norm was to treat all published significant results as credible evidence and nobody asked how stars were able to report predicted results in hundreds of studies. Those days are over. Nobody can look at a series of p-values of .02, .03, .049, .01, and .05 and be impressed by this string of statistically significant results.  Time to change the saying “publish or perish” to “publish real results or perish.”

 
