## Abstract

I reinvestigate the performance of prediction markets for the Open Science Collaboration replicability project. I show that performance of prediction markets varied considerably across the two markets, with the second market failing to replicate the excellent performance of the first market. I also show that the markets did not perform significantly better than a “burn everything to the ground” rule that bets on failure every time. Finally, I suggest a simple rule that can be easily applied to published studies that only treats results with p-values below .005 as significant. Finally, I discuss betting on future studies as a way to optimize resource allocation for future studies.

## Introduction

For decades, psychologists failed to properly test their hypotheses. Statistically significant results in journals are meaningless because published results are selected for significance. A replication project with 100 studies from three journals that reported significant results found that only 37% (36/97) of published significant results could be replicated (Open Science Collaboration, 2015).

Unfortunately, it is impossible to rely on actual replication studies to examine the credibility of thousands of findings that have been reported over the years. Dreber, Pfeiffer, Almenberg, Isakssona, Wilson, Chen, Nosek, and Johannesson (2015) proposed prediction markets as a solution to this problem. Prediction markets rely on a small number of traders to bet on the outcome of replication studies. They can earn a small amount of real money for betting on studies that actually replicate.

To examine the forecasting abilities of prediction markets, Dreber et al. (2015) conducted two studies. The first study with 23 studies started in November 2012 and lasted two month (N = 47 participants). The second study with 21 studies started in October 2014 (N = 45 participants). The studies are very similar to each other. Thus, we can consider Study 2 a close replication of Study 1.

Studies were selected from the set of 100 studies based on time of completion. To pay participants, studies were chosen that were scheduled to be completed within two month after the completion of the prediction market. It is unclear how completion time may influence the type of study that was included or the outcome of the included studies.

The published article reports the aggregated results across the two studies. A market price above 50% was considered to be a prediction of a successful replication and a market price below 50% was considered to be a prediction of a replication failure. The key finding was that “the prediction markets correctly predict the outcome of 71% of the replications (29 of 41 studies” (p. 15344). The authors compare this finding to a proverbial coin flip which implies a replication rate of 50% and find that 71% is [statistically] significantly higher than than 50%.

Below I am conducting some additional statistical analyses of the open data. First, we can compare the performance of the prediction market with a different prediction rule. Given the higher prevalence of replication failures than successes, a simple rule is to use the higher base rate of failures to predict that all studies will fail to replicate. As the failure rate for the total set of 97 studies was 37%, this prediction rule has a success rate of 1-.37 = 63%. For the 43 studies with significant results, the success rate of replication studies was also 37% (15/41). Somewhat surprisingly, the success rates were also close to 37% for Prediction Market 1, 32% (7/22) and Prediction Market 2, 42% (8 / 19).

In comparison to a simple prediction rule that everything in psychology journals does not replicate, prediction markets are no longer statistically significantly better, chi2(1) = 1.82, p = .177.

Closer inspection of the original data also revealed a notable difference in the performance of the two prediction markets. Table 1 shows the results separately for prediction markets 1 and 2. Whereas the performance of the first prediction market is nearly perfect, 91% (20/22), the replication market performed only at chance levels (flip a coin), 53% (10/19). Despite the small sample size, the success rates in the two studies are statistically significantly different, chi2(1) = 5.78, p = .016.

There is no explanation for these discrepancies and the average result reported in the article can still be considered the best estimate of prediction markets’ performance, but trust in their ability is diminished by the fact that a close replication of excellent performance failed to replicate. Not reporting the different outcomes for two separate studies could be considered a questionable decision.

The main appeal of prediction markets over the nihilistic trash-everything rule is that decades of research would have produced some successes. However, the disadvantage of prediction markets is that they take a long time, cost money, and the success rates are currently uncertain. A better solution might be to find rules that can be applied easily to large sets of studies (Yang, Wu, & Uzzi, 2020). One simple rule is suggested by the simple relationship between strength of evidence and replicability. The stronger the evidence against the null-hypothesis is (i.e., lower p-values), the more likely it is that the original study had high power and that the results will replicate in a replication study. There is no clear criterion for a cut-off point to optimize prediction, but the results of the replication project can be used to validate cut-off points empirically.

One suggests has been to consider only p-values below .005 as statistically significant (Benjamin et al., 2017). This rule is especially useful when selection bias is present. Selection bias can produce many results with p-values between .05 and .005 that have low power. However, p-values below .005 are more difficult to produce with statistical tricks.

The overall success rate for the 41 studies included in the Prediction Markets was 63% (26/41), a difference of 4 studies. The rule also did better for the first market, 81% (18.22) than for the second market, 42% (8/19).

Table 2 shows that the main difference between the two markets was that the first market contained more studies with questionable p-values between .05 and .005 (15 vs. 6). For the second market, the rule overpredicts successes and there are more false (8) than correct (5) predictions. This finding is consistent with examinations of the total set of replication studies in the replicability project (Schimmack, 2015). Based on this observation, I recommended a 4-sigma rule, p < .00006. The overall success rate increases to 68% (28/41) and improvement by 2 studies. However, an inspection of correct predictions of successes shows that the rule only correctly predicts 5 of the 15 successes (33%), whereas the p < .005 rule correctly predicted 10 of the 15 successes (67%). Thus, the p < .005 rule has the advantage that it salvages more studies.

## Conclusion

Meta-scientists are still scientists and scientists are motivated to present their results in the best possible light. This is also true for Derber et al.’s (2015) article. The authors enthusiastically recommend prediction markets as a tool “to quickly identify findings

that are unlikely to replicate” Based on their meta-analysis of two prediction markets with a total of just 41 studies, the authors conclude that “prediction markets are well suited” to assess the replicability of published results in psychology. I am not the only one to point out that this conclusion is exaggerated (Yang, Wu, & Uzzi, 2020). First, prediction markets are not quick at identifying replicable results, especially when we compare the method to a simple computation of the exact p-values to decide whether the p-value is below .005 or not. It is telling that nobody has used prediction markets to forecast the outcome of new replication studies. One problem is that a market requires a set of studies, which makes it difficult to use them to predict outcomes of single studies. It is also unclear how well prediction markets really work. The original article omitted the fact that it worked extremely well in the first market and not at all in the second market, a statistically significant difference. The outcome seems to depend a lot on the selection of studies in the market. Finally, I showed that a simple statistical rule alone can predict replication outcomes nearly as well as prediction markets.

There is no reason to use markets for multiple studies. One could also set up betting for individual studies, just like individuals can bet on the outcome of a single match in sports or a single election outcome. Betting might be more usefully employed for the prediction of original studies than to vet the outcome of replication studies. For example, if everybody bets that a study will produce a significant result, there appears to be little uncertainty about the outcome, and the actual study may be a waste of resources. One concerns in psychology is that many studies merely produce significant p-values for obvious predictions. Betting on effect sizes would help to make effect sizes more important. If everybody bets on a very small effect size a study might not be useful to run because the expected effect size is trivial, even if the effect is greater than zero. Betting on effect sizes could also be used for informal power analyses to determine the sample size of the actual study.

## References

Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., Nosek, B. A., & Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research. PNAS Proceedings of the National Academy of Sciences of the United States of America, 112(50), 15343–15347. https://doi.org/10.1073/pnas.1516179112 1 commen

Nice discussion, but update for last month’s article in PLoS One (Gordon et al. 2021, below) that summarizes four previous replication prediction markets. Average 73% accuracy. Variation among studies, full data available in an R package. Worth a look!

Quick thoughts on your post:

* Quick rules are good baselines. “Just say no” is a great nil model, and does well here. I suspect less well on the larger dataset? Regardless, it has no signal (eg AUC). Markets do have signal, though need a little recalibrating: they were optimistic and under confident.

* P-value models are a better baseline. You’d expect a market to start with statistical info, and improve using outside info. These *do* have signal. Gordon et al. use a simple one.

* Interim results for Replication Markets project (part of SCORE) do not look promising there. Replications still arriving, but so far our starting prices (p-value) are doing as well or better.

Ideally any crowd platform will provide feedback so the crowd can learn. That hasn’t happened yet. We’re left with a puzzle: crowds, P-values, and stupid bag-of-words models all get ~70% accuracy, clearly using different signals. But so far no ensemble has beat this.

Links:

* Article: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0248780

* Data: https://github.com/MichaelbGordon/pooledmaRket

thank you for the additional information. It would be interesting to compare prediction markets with other models for the other studies, but I haven’t had the time to do that. This post is really just a commentary on the 2015 article. Thanks again.

thanks for sharing post.