
The Gino-Colada Affair

There is no doubt that social psychology and its applied fields, like behavioral economics and consumer psychology, have a credibility problem. Many findings cannot be replicated because they were obtained with questionable research practices (QRPs), also known as p-hacking. QRPs are statistical tricks that help researchers obtain p-values below the threshold needed to claim a discovery (p < .05). To be clear, although lay people and undergraduate students consider these practices deceptive, fraudulent, and unscientific, they are not considered fraudulent by researchers, professional organizations, funding agencies, or universities. Demonstrating that a researcher used QRPs to obtain significant results is easy-peasy and undermines the credibility of their work, but they can keep their jobs because it is not (yet) illegal to use these practices.

The Gino-Harvard scandal is different because the DataColada team claimed that they found “four studies for which we had accumulated the strongest evidence of fraud” and that they “believe that many more Gino-authored papers contain fake data.” To lay people, it can be hard to understand the difference between allowed QRPs and forbidden fraud or data manipulation. An example of a QRP would be selectively removing extreme values so that the difference between two groups becomes larger (e.g., removing extremely low depression scores from a control group to show a bigger treatment effect). Outright data manipulation would be switching participants with low scores from the control group to the treatment group and vice versa.

DataColada used features of the Excel spreadsheet that contained the data to claim that the data were manually manipulated.

The focus is on six rows that have a strong influence on the results for all three dependent variables reported in the article, namely whether participants cheated, how much they overreported their performance, and deductions.

Based on the datasheet, participants in the sign-at-the-top condition (1) in rows 67, 68, and 69 did not cheat, therefore also did not overreport performance, and had very low deductions, an independent measure of cheating. In contrast, participants in rows 70, 71, and 72 all cheated, showed moderate amounts of overreporting, and had very high deductions.

Yadi, yadi, yada, yesterday Gino posted a blog post that responded to these accusations. To me, the most interesting rebuttal was the claim that there was no need to switch rows because the study results hold even without the flagged rows.

“Finally, recall the lack of motive for the supposed manipulation: If you re-run the entire study excluding all of the red observations (the ones that should be considered “suspicious” using Data Colada’s lens), the findings of the study still hold. Why would I manipulate data, if not to change the results of a study?”

This argument makes sense to me because fraud appears to be the last resort for researchers who are eager to present a statistically significant result. After all, nobody claims that no data were collected at all, as in some of the cases of Diederik Stapel, who committed blatant fraud around the time the article in question was published, when the use of questionable research practices was rampant. When researchers conduct an actual study, they probably hope to get the desired result without QRPs or fraud. As significance requires luck, they may just hope to get lucky. When this does not work, they can use a few QRPs. When this does not work either, they can shelve the study and try again. All of this would be perfectly legal by current standards of research ethics. However, if the results are close to significance and it is not easy to collect more data in the hope of better results, it may be tempting to change a few condition labels to reach p < .05. The accusation here (there are other studies) is that only six rows (or a couple more) were switched to get significance. However, Gino claims that the results were already significant, and I agree that it makes no sense for somebody to tamper with data if the p-value is already below .05.

However, Gino did not present evidence that the results hold without the contested cases. So, I downloaded the data and took a look.

First, I was able to reproduce the published result of an ANOVA with the three conditions as a categorical predictor variable and deductions as the outcome variable.
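To make this step concrete, here is a minimal sketch of the reanalysis in Python (a sketch only; the file name, column names, and condition coding are hypothetical placeholders, not the actual labels in the posted spreadsheet):

```python
# Minimal sketch: omnibus ANOVA of deductions on the three signing conditions.
# File name, column names, and condition codes are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_excel("gino_study1_data.xlsx")  # hypothetical file name

# assumed coding: 1 = signature-on-top, 2 = signature-on-bottom, 3 = no signature
model = smf.ols("deductions ~ C(condition)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # omnibus F-test across the three conditions
```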

In addition, the original article reported that the differences between the experimental “signature-on-top” and each of the two control conditions (“signature-on-bottom”, “no signature”) were significant. I also confirmed these results.

Now I repeated the analysis without rows 67 to 72. Without the six contested cases, the results are no longer statistically significant, F(2, 92) = 2.96, p = .057.
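Excluding the contested cases is a one-line change to the sketch above (spreadsheet row numbers include a header row, so the zero-based indices below are only illustrative):

```python
# Drop the six contested cases and rerun the same omnibus ANOVA.
# Indices 65-70 would correspond to spreadsheet rows 67-72 if the first
# spreadsheet row is a header; adjust to the actual layout of the file.
df_excl = df.drop(index=range(65, 71))
model_excl = smf.ols("deductions ~ C(condition)", data=df_excl).fit()
print(sm.stats.anova_lm(model_excl, typ=2))
```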

Interestingly, the pairwise comparisons of the experimental group with each of the two control groups were still statistically significant.

Combining the two control groups, comparing them to the experimental group, and presenting the results as a planned contrast would also have produced a significant result.

However, these results do not support Gino’s implication that the same analysis that was reported in the article would have produced a statistically significant result, p < .05, without the six contested cases. Moreover, the accusation is that she switched rows with low values to the experimental condition and rows with high values to the control condition. To simulate this scenario, I recoded the contested rows 67-69 as signature-at-the-bottom and rows 70-72 as signature-at-the-top and repeated the analysis. In this case, there was no evidence that the group means differed from each other, F(2, 98) = 0.45, p = .637.
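The switching scenario can be simulated in the same sketch by relabeling the contested rows instead of dropping them (again with hypothetical condition codes and index positions):

```python
# Simulate the alleged switch in reverse: move rows 67-69 to signature-on-bottom
# and rows 70-72 to signature-on-top, then rerun the omnibus ANOVA.
df_swap = df.copy()
df_swap.loc[65:67, "condition"] = 2  # rows 67-69 -> signature-on-bottom (assumed coding)
df_swap.loc[68:70, "condition"] = 1  # rows 70-72 -> signature-on-top (assumed coding)
model_swap = smf.ols("deductions ~ C(condition)", data=df_swap).fit()
print(sm.stats.anova_lm(model_swap, typ=2))
```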

Conclusion

Experimental social psychology has a credibility crisis because researchers were (and still are) allowed to use many statistical tricks to get significant results or to hide studies that didn’t produce the desired results. The Gino scandal is only remarkable because outright manipulation of data is the only ethics violation that has personal consequences for researchers when it can be proven. Lack of evidence that fraud was committed, or lack of fraud, does not imply that results are credible. For example, the results in Study 2 are meaningless even without fraud because the null hypothesis was rejected with a confidence interval that included values close to zero as plausible values. While the article claims to show evidence of mediation, the published data alone show that there is no empirical evidence for this claim even if p < .05 was obtained without p-hacking or fraud. Misleading claims based on weak data, however, do not violate any ethics guidelines and are a common, if not essential, part of a game called social psychology.

This blog post only examined one minor question. Gino claimed that she did not have to manipulate data because the results were already significant.

“Finally, recall the lack of motive for the supposed manipulation: If you re-run the entire study excluding all of the red observations (the ones that should be considered “suspicious” using Data Colada’s lens), the findings of the study still hold. Why would I manipulate data, if not to change the results of a study?”

My results suggest that this claim lacks empirical support. A key result was only significant with the contested rows of data included. Of course, this finding does not warrant the conclusion that the data were tampered with to get statistical significance. We have to wait to get the answer to this 25 million dollar question.

Replicability Report 2023: Aggressive Behavior

This report was created in collaboration with Anas Alsayed Hasan.
Citation: Alsayed Hasan, A. & Schimmack, U. (2023). Replicability Report 2023: Aggressive Behavior. Replicationindex.com

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability-Reports (RR) use z-curve to provide information about the research and publication practices of psychological journals. This information can be used by authors to choose journals they want to publish in, provides feedback to journal editors who have influence on selection bias and the replicability of results published in their journals, and, most importantly, informs readers of these journals.

Aggressive Behavior

Aggressive Behavior is the official journal of the International Society for Research on Aggression.  Founded in 1974, this journal provides a multidisciplinary view of aggressive behavior and its physiological and behavioral consequences on subjects.  Published articles use theories and methods from psychology, psychiatry, anthropology, ethology, and more. So far, Aggressive Behavior has published close to 2,000 articles. Nowadays, it publishes about 60 articles a year in 6 annual issues. The journal has been cited by close to 5000 articles in the literature and has an H-Index of 104 (i.e., 104 articles have received 104 or more citations). The journal also has a moderate impact factor of 3. This journal is run by an editorial board containing over 40 members. The Editor-In-Chief is Craig Anderson. The associate editors are Christopher Barlett, Thomas Denson, Ann Farrell, Jane Ireland, and Barbara Krahé.

Report

Replication reports are based on automatically extracted test statistics (F-tests, t-tests, z-tests) from the text portion of articles. The reports do not include results reported as effect sizes (r), confidence intervals, or results reported in tables or figures.

Figure 1

Figure 1 shows a z-curve plot for all articles from 2000-2022 (see Schimmack, 2023, for a detailed description of z-curve plots). The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the null-hypothesis that there is no statistical relationship between two variables (i.e., the effect size is zero and the expected z-score is zero).
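The conversion from reported test statistics to absolute z-scores goes through the two-sided p-value of each test. A minimal sketch of this step in Python (not the extraction code used for these reports; the example values are made up):

```python
import numpy as np
from scipy import stats

def t_to_z(t, df):
    """Absolute z-score for a t-test, via its two-sided p-value."""
    p = 2 * stats.t.sf(abs(t), df)
    return stats.norm.isf(p / 2)

def f_to_z(f, df1, df2):
    """Absolute z-score for an F-test (its p-value already covers both directions)."""
    p = stats.f.sf(f, df1, df2)
    return stats.norm.isf(p / 2)

# Example: two made-up test statistics converted to absolute z-scores
z_scores = np.array([t_to_z(2.30, 48), f_to_z(5.10, 2, 92)])
odr = np.mean(z_scores > 1.96)  # observed discovery rate: share of significant results
```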

Z-curve fits a statistical model to the distribution of these z-scores. The predicted distribution is shown as a grey curve. Importantly, the model is fitted to the significant z-scores, but the model makes a prediction about the distribution of non-significant results. This makes it possible to examine publication bias (i.e., selective publishing of significant results).

Selection for Significance

Visual inspection of Figure 1 shows that there are fewer observed non-significant results (z-scores between 0 and 1.65) than predicted z-scores. This is evidence of selection for significance. It is possible to quantify the bias in favor of significant results by comparing the proportion of observed significant results (i.e., the observed discovery rate, ODR) with the expected discovery rate (EDR) based on the grey curve. While the observed discovery rate is 71%, the expected discovery rate is 45%. The 95% confidence intervals of the ODR and EDR do not overlap. Thus, there is clear evidence of selection bias in this journal.

False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero). The false positive risk is inversely related to the expected discovery rate, and z-curve uses the EDR to estimate the risk that a published significant result is a false positive. An EDR of 45% implies that no more than 7% of the significant results are false positives. The 95%CI puts the upper limit on false positive results at 12%. Thus, concerns that most published results are false are overblown. However, a focus on false positives is misleading because it ignores effect sizes. Even if an effect is not exactly zero, it may be too small to be relevant (i.e., practically significant). Readers of original articles need to focus on confidence intervals of effect size estimates and take into account that selection for significance inflates effect size estimates. Thus, published results are likely to show the correct direction of a relationship, but may not provide enough information to determine whether a statistically significant result is theoretically or practically important.
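The arithmetic behind this bound is the worst-case conversion of a discovery rate into a maximum false discovery rate (Soric, 1989); a quick check of the numbers reported above:

```python
def soric_fdr(edr, alpha=0.05):
    """Maximum false discovery rate implied by a discovery rate (Soric, 1989)."""
    return (1 / edr - 1) * alpha / (1 - alpha)

print(soric_fdr(0.45))  # ~0.064, consistent with the "no more than 7%" bound above
```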

Expected Replication Rate

The expected replication rate estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. A comparison of the ERR with the outcome of actual replication studies shows that the ERR is higher than the actual replication rate. There are several factors that can explain this discrepancy, such as the difficulty of conducting exact replication studies. Thus, the ERR is an optimistic estimate. A conservative estimate is the EDR. The EDR predicts replication outcomes if significance testing does not favor studies with higher power (larger effects and smaller sampling error) because statistical tricks make it just as likely that studies with low power are published. The ERR of 69% suggests that the majority of results published in Aggressive Behavior are replicable, but the EDR allows for a replication rate as low as 45%. Thus, replicability is estimated to range from 45% to 69%. There are currently no large replication studies in this field, making it difficult to compare these estimates to outcomes of empirical replication studies. However, the ERR for the OSC reproducibility project that produced 36% successful actual replications was around 60%, suggesting that roughly 50% of actual replication studies of articles in this journal would be significant. It is unlikely that the success rate would be lower than the EDR of 45%. Given the relatively low risk of type-I errors, most of these replication failures are likely to occur because studies in this journal tend to be underpowered. Thus, replication studies should use larger samples.

Time Trends

To examine changes in credibility over time, z-curves were fitted to test statistics for each year from 2000 to 2022. The ODR, EDR, and ERR were regressed on time and time-squared to allow for non-linear relationships. The ODR showed a significant linear trend, indicating more publications of non-significant results, b = -.52 percentage points per year (SE = .22). The EDR showed no significant trends, p > .30. There were no linear or quadratic time trends for the ERR, p > .10. Figure 2 shows the ODR and EDR to examine selection bias.

Figure 2
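The time-trend estimates above come from regressing the annual z-curve estimates on a linear and a quadratic time term. A sketch of that step (the data frame and column names are placeholders for the annual estimates):

```python
# Regress annual ODR estimates (in percentage points) on time and time-squared.
# 'annual' is a placeholder data frame with columns 'year' and 'odr'.
import pandas as pd
import statsmodels.formula.api as smf

annual = pd.read_csv("annual_zcurve_estimates.csv")  # hypothetical file
annual["time"] = annual["year"] - 2000               # count years from 2000
fit = smf.ols("odr ~ time + I(time ** 2)", data=annual).fit()
print(fit.params, fit.bse)  # linear and quadratic coefficients with standard errors
```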

The decrease in the ODR implies that selection bias is decreasing over time. In recent years, the confidence intervals for the ODR and EDR overlap, indicating that there are no longer statistically reliable differences. However, this does not imply that all results are being reported. The main reason for the overlap is the low certainty about the annual EDR. Given the lack of a significant time trend for the EDR, the average EDR across all years implies that there is still selection bias. Finally, automatically extracted test statistics make it impossible to say whether researchers are reporting more focal or non-focal results as non-significant. To investigate this question, it is necessary to hand-code focal tests (see Limitations section).

Figure 3 shows the false discovery risk (FDR) and the estimated replication rate (ERR). It also shows the expected replication failure rate (EFR = 1 – ERR). A comparison of the EFR with the FDR provides information for the interpretation of replication failures. If the FDR is close to the EFR, many replication failures may be due to false positive results in original studies. In contrast, if the FDR is low, most replication failures are likely to be false negative results in underpowered replication studies.

Figure 3

The FDR is based on the EDR that also showed no time trends. Thus, the estimates for all years can be used to obtain more precise estimates than the annual ones. Based on the results in Figure 1, the expected failure rate is 31% and the FDR is 7%. This suggests that replication failures are more likely to be false negatives due to modest power rather than false positive results in original studies. To avoid false negative results in replication studies, these studies should use larger samples.

Retrospective Improvement of Credibility

The criterion of alpha = .05 is an arbitrary threshold for making decisions about a hypothesis. It was used by authors to conclude that an effect is present, and editors accepted articles on the basis of this evidence. However, readers can demand stronger evidence. A rational way to decide which alpha criterion to use is the false positive risk. A lower alpha, say alpha = .005, reduces the false positive risk, but also increases the percentage of false negatives (i.e., there is an effect even though the p-value is above alpha).

Figure 4 shows the implications of using different significance criteria for the observed discovery rate (lower alpha implies fewer significant results).

Figure 4

Using alpha = .01 lowers the discovery rate by about 15 percentage points. The stringent criterion of alpha = .001 lowers it by another 10 percentage points to around 40% discoveries. This would mean that many published results that were used to make claims no longer have empirical support.
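Recomputing the observed discovery rate under a stricter alpha only requires counting how many of the extracted absolute z-scores clear the corresponding critical value (a sketch, reusing the z_scores array from the conversion example above):

```python
from scipy import stats

# Observed discovery rate under progressively stricter alpha levels.
for alpha in (0.05, 0.01, 0.001):
    crit = stats.norm.isf(alpha / 2)      # two-sided critical z: 1.96, 2.58, 3.29
    odr_alpha = float((z_scores > crit).mean())
    print(f"alpha = {alpha}: ODR = {odr_alpha:.1%}")
```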

Figure 5 shows the effects of alpha on the false positive risk. Even alpha = .01 is sufficient to ensure a false positive risk of 5% or less. Thus, alpha = .01 seems a reasonable criterion to avoid too many false positive results without discarding too many true positive results. Authors may want to increase statistical power to increase their chances of obtaining a p-value below .01 when their hypotheses are true to produce credible evidence for their hypotheses.

Figure 5

Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported, but do not test focal hypotheses (e.g., testing statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).

Hand-coding of other journals shows that publications of non-significant focal hypothesis tests are still rare. As a result, the ODR for focal hypothesis tests in Aggressive Behavior is likely to be higher and selection bias larger than the present results suggest. Hand-coding of a representative sample of articles in this journal is needed.

Conclusion

The replicability report for Aggressive Behavior shows clear evidence of selection bias, although there is a trend suggesting that selection bias may be decreasing in recent years. The results also suggest that replicability is in a range from 40% to 70%. This replication rate does not deserve to be called a crisis, but it does suggest that many studies are underpowered and require luck to get a significant result. The false positive risk is modest and can be controlled by setting alpha to .01. Finally, time trend analyses show no important changes in response to the open science movement. An important goal is to reduce the selective publishing of studies that worked (p < .05) and the hiding of studies that did not work (p > .05). Preregistration or registered reports can help to address this problem. Given concerns that most published results in psychology are false positives, the present results are reassuring and suggest that most results with p-values below .01 are true positive results.

Replicability Report 2023: Cognition & Emotion

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability-Reports (RR) use z-curve to provide information about the research and publication practices of psychological journals. This information can be used by authors to choose journals they want to publish in, provides feedback to journal editors who have influence on selection bias and the replicability of results published in their journals, and, most importantly, informs readers of these journals.

Cognition & Emotion

The study of emotions largely disappeared from psychology after the Second World War and during the reign of behaviorism, or was limited to facial expressions. The study of emotional experiences reemerged in the 1980s. Cognition & Emotion was established in 1987 as an outlet for this research.

So far, the journal has published close to 3,000 articles. The average number of citations per article is 46. The journal has an H-Index of 155 (i.e., 155 articles have 155 or more citations). These statistics show that Cognition & Emotion is an influential journal for research on emotions.

Nine articles have more than 1,000 citations. The most highly cited article is a theoretical article by Paul Ekman arguing for basic emotions (Ekman, 1992).

Report

Replication reports are based on automatically extracted test statistics (F-tests, t-tests, z-tests) from the text portion of articles. The reports do not include results reported as effect sizes (r), confidence intervals, or results reported in tables or figures.

Figure 1

Figure 1 shows a z-curve plot for all articles from 2000-2022 (see Schimmack, 2023, for a detailed description of z-curve plots). The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the null-hypothesis that there is no statistical relationship between two variables (i.e., the effect size is zero and the expected z-score is zero).

A z-curve plot shows the standard criterion of statistical significance (alpha = .05, z = 1.97) as a vertical red line. It also shows a dotted vertical red line at z = 1.65 because results with z-scores between 1.65 and 1.97 are often interpreted as evidence for an effect using a more liberal alpha criterion, alpha = .10, a one-sided test, or with qualifiers (e.g., marginally significant). Thus, values in this range cannot be interpreted as reporting of results that failed to support a hypothesis.

Z-curve plots are limited to values less than z = 6. The reason is that values greater than 6 are so extreme that a successful replication is all but certain, unless the value is a computational error or based on fraudulent data. The extreme values are still used for the computation of z-curve statistics, but omitted from the plot to highlight the shape of the distribution for diagnostic z-scores in the range from 2 to 6.

Z-curve fits a statistical model to the distribution of these z-scores. The predicted distribution is shown as a grey curve. Importantly, the model is fitted to the significant z-scores, but the model makes a prediction about the distribution of non-significant results. This makes it possible to examine publication bias (i.e., selective publishing of significant results).

Selection for Significance

Visual inspection of Figure 1 shows that there are fewer observed non-significant results (z-scores between 0 and 1.65) than predicted z-scores. This is evidence of selection for significance. It is possible to quantify the bias in favor of significant results by comparing the proportion of observed significant results (i.e., the observed discovery rate, ODR) with the expected discovery rate (EDR) based on the grey curve. While the observed discovery rate is 68%, the expected discovery rate is 34%. The 95% confidence intervals of the ODR and EDR do not overlap. Thus, there is clear evidence of selection bias in this journal.

False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero). The false positive risk is inversely related to the expected discovery rate and z-curve uses the EDR to estimate the risk that a published significant result is a false positive result.

An EDR of 34% implies that up to 10% of the significant results could be false positives. Thus, concerns that most published results are false are overblown. However, a focus on false positives is misleading because it ignores effect sizes. Even if an effect is not exactly zero, it may be too small to be relevant (i.e., practically significant). Readers of statistical results in this journal need to examine the range of plausible effect sizes (confidence intervals) to see whether results have practical significance. Unfortunately, these estimates are inflated by selection bias, especially when the evidence is weak and the confidence interval already includes effect sizes close to zero.

Expected Replication Rate

The expected replication rate estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. A comparison of the ERR with the outcome of actual replication studies shows that the ERR is higher than the actual replication rate. There are several factors that can explain this discrepancy, such as the difficulty of conducting exact replication studies. Thus, the ERR is an optimistic estimate. A conservative estimate is the EDR. The EDR predicts replication outcomes if significance testing does not favor studies with higher power (larger effects and smaller sampling error) because statistical tricks make it just as likely that studies with low power are published.

The ERR of 70% suggests that most results published in this journal are replicable, but the EDR allows for a replication rate as low as 34%. Thus, replicability is estimated to range from 34% to 70%. There is no representative sample of replication studies from this journal to compare this estimate with the outcome of actual replication studies. However, a journal with lower ERR and EDR estimates, Psychological Science, had an actual replication rate of 41%. Thus, it is plausible to predict an actual replication rate above 41% for Cognition & Emotion.

Time Trends

To examine changes in credibility over time, z-curves were fitted to test statistics for each year from 2000 to 2022. Confidence intervals were created by regressing the estimates on time and time-squared to examine non-linear relationships.

Figure 2 shows the ODR and EDR to examine selection bias. The ODR showed a significant linear trend, indicating more publications of non-significant results, b = -.79 percentage points per year (SE = .10). The EDR showed no significant linear, b = .23, SE = .41, or non-linear, b = -.10, SE = .07, trends.

Figure 2

The decreasing ODR implies that selection bias is decreasing, but it is not clear whether this trend also applies to focal hypothesis tests (see limitations section). The lack of an increase in the EDR implies that researchers continue to conduct studies with low statistical power and that the non-significant results often remain unpublished. To improve credibility of this journal, editors could focus on power rather than statistical significance in the review process.

Figure 3 shows the false discovery risk (FDR) and the estimated replication rate (ERR). It also shows the expected replication failure rate (EFR = 1 – ERR). A comparison of the EFR with the FDR provides information for the interpretation of replication failures. If the FDR is close to the EFR, many replication failures may be due to false positive results in original studies. In contrast, if the FDR is low, most replication failures are likely to be false negative results in underpowered replication studies.

Figure 3

There was a significant linear trend for the ERR, b = .24, SE = .11, indicating an increase in the ERR. The increase in the ERR implies fewer replication failures in the later years. However, because the FDR is not decreasing, a larger portion of these replication failures could be false positives.

Retrospective Improvement of Credibility

The criterion of alpha = .05 is an arbitrary threshold for making decisions about a hypothesis. It was used by authors to conclude that an effect is present, and editors accepted articles on the basis of this evidence. However, readers can demand stronger evidence. A rational way to decide which alpha criterion to use is the false positive risk. A lower alpha, say alpha = .005, reduces the false positive risk, but also increases the percentage of false negatives (i.e., there is an effect even though the p-value is above alpha).

Figure 4 shows the implications of using different significance criteria for the observed discovery rate (lower alpha implies fewer significant results).

Figure 4

Lowering alpha to .01 reduces the observed discovery rate by 20 to 30 percentage points. It is also interesting that the ODR decreases more with alpha = .05 than for other alpha levels. This suggests that changes in the ODR are in part caused by fewer p-values between .05 and .01. These significant results are more likely to result from unscientific methods and often do not replicate.

Figure 5 shows the effects of alpha on the false positive risk. Lowering alpha to .01 reduces the false positive risk to less than 5%. Thus, readers can use this criterion to reduce the false positive risk to an acceptable level.

Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported, but do not test focal hypotheses (e.g., testing statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).

For the journal Cognition & Emotion a small set of articles were hand-coded as part of a study on the effects of open science reforms on the credibility of psychological science. Figure 6 shows the z-curve plot and results for 117 focal hypothesis tests.

Figure 6

The main difference between manually and automatically coded data is a much higher ODR (95%) for manually coded data. This finding shows that selection bias for focal hypothesis tests is much more severe than the automatically extracted data suggest.

The point estimate of the EDR, 37%, is similar to the EDR for automatically extracted data, 34%. However, due to the small sample size, the 95%CI for manually coded data is wide, and it is impossible to draw firm conclusions about the EDR, although estimates from other journals based on larger hand-coded samples are similar.

The ERR estimates are also similar and the 95%CI for hand-coded data suggests that the majority of results are replicable.

Overall, these results suggest that automatically extracted results are informative, but underestimate selection bias for focal hypothesis tests.

Conclusion

The replicability report for Cognition & Emotion shows clear evidence of selection bias, but also a relatively low risk of false positive results that can be further reduced by using alpha = .01 as a criterion to reject the null-hypothesis. There are no notable changes in credibility over time. Editors of this journal could improve credibility by reducing selection bias. The best way to do so would be to evaluate the strength of evidence rather than using alpha = .05 as a dichotomous criterion for acceptance. Moreover, the journal needs to publish more articles that fail to support theoretical predictions. The best way to do so is to accept articles that preregistered predictions and failed to confirm them, or to invite registered reports that are published independent of the outcome of a study. Readers can set their own level of alpha depending on their appetite for risk, but alpha = .01 is a reasonable criterion because it (a) maintains a false positive risk below 5%, and (b) eliminates p-values between .01 and .05 that are often obtained with unscientific practices and fail to replicate.

Link to replicability reports for other journals.

Replicability Report 2023: Acta Psychologica

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability-Reports (RR) use z-curve to provide information about the research and publication practices of psychological journals. This information can be used by authors to choose journals they want to publish in, provides feedback to journal editors who have influence on selection bias and the replicability of results published in their journals, and, most importantly, informs readers of these journals.

Acta Psychologica

Acta Psychologica is an old psychological journal that was founded in 1936. The journal publishes articles from various areas of psychology, but cognitive psychological research seems to be the most common area.

So far, Acta Psychologica has published close to 6,000 articles. Nowadays, it publishes about 150 articles a year in 10 annual issues. Over the past 30 years, articles have an average citation rate of 24.48 citations, and the journal has an H-Index of 116 (i.e., 116 articles have received 116 or more citations). The journal has an impact factor of 2 which is typical of most empirical psychology journals.

So far, the journal has published 4 articles with more than 1,000 citations, but all of these articles were published in the 1960s and 1970s. The most highly cited article in the 2000s examined the influence of response categories on the psychometric properties of survey items (Preston & Colman, 2000; 947 citations).

Given the multidisciplinary nature of the journal, the journal has a team of editors. The current editors are Mohamed Alansari, Martha Arterberry, Colin Cooper, Martin Dempster, Tobias Greitemeyer, Matthieu Guitton, and Nhung T Hendy.

Report

Replication reports are based on automatically extracted test statistics (F-tests, t-tests, z-tests) from the text portion of articles. The reports do not include results reported as effect sizes (r), confidence intervals, or results reported in tables or figures.

Figure 1

Figure 1 shows a z-curve plot for all articles from 2000-2022 (see Schimmack, 2023, for a detailed description of z-curve plots). The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the null-hypothesis that there is no statistical relationship between two variables (i.e., the effect size is zero and the expected z-score is zero).

A z-curve plot shows the standard criterion of statistical significance (alpha = .05, z = 1.97) as a vertical red line. It also shows a dotted vertical red line at z = 1.65 because results with z-scores between 1.65 and 1.97 are often interpreted as evidence for an effect using a more liberal alpha criterion, alpha = .10, a one-sided test, or with qualifiers (e.g., marginally significant). Thus, values in this range cannot be interpreted as reporting of results that failed to support a hypothesis.

Z-curve plots are limited to values less than z = 6. The reason is that values greater than 6 are so extreme that a successful replication is all but certain, unless the value is a computational error or based on fraudulent data. The extreme values are still used for the computation of z-curve statistics, but omitted from the plot to highlight the shape of the distribution for diagnostic z-scores in the range from 2 to 6.

Z-curve fits a statistical model to the distribution of these z-scores. The predicted distribution is shown as a grey curve. Importantly, the model is fitted to the significant z-scores, but the model makes a prediction about the distribution of non-significant results. This makes it possible to examine publication bias (i.e., selective publishing of significant results).

Selection for Significance

Visual inspection of Figure 1 shows that there are fewer observed non-significant results (z-scores between 0 and 1.65) than predicted z-scores. This is evidence of selection for significance. It is possible to quantify the bias in favor of significant results by comparing the proportion of observed significant results (i.e., the observed discovery rate, ODR) with the expected discovery rate (EDR) based on the grey curve. While the observed discovery rate is 70%, the expected discovery rate is 46%. The 95% confidence intervals of the ODR and EDR do not overlap. Thus, there is clear evidence of selection bias in this journal.

False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero). The false positive risk is inversely related to the expected discovery rate and z-curve uses the EDR to estimate the risk that a published significant result is a false positive result.

An EDR of 46% implies that no more than 6% of the significant results are false positives. The 95%CI puts the upper limit on false positive results at 8%. Thus, concerns that most published results are false are overblown. However, a focus on false positives is misleading because it ignores effect sizes. Even if an effect is not exactly zero, it may be too small to be relevant (i.e., practically significant). Readers of original articles need to focus on confidence intervals of effect size estimates and take into account that selection for significance inflates effect size estimates. Thus, published results are likely to show the correct direction of a relationship, but may not provide enough information to determine whether a statistically significant result is theoretically or practically important.

Expected Replication Rate

The expected replication rate estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. A comparison of the ERR with the outcome of actual replication studies shows that the ERR is higher than the actual replication rate. There are several factors that can explain this discrepancy, such as the difficulty of conducting exact replication studies. Thus, the ERR is an optimistic estimate. A conservative estimate is the EDR. The EDR predicts replication outcomes if significance testing does not favor studies with higher power (larger effects and smaller sampling error) because statistical tricks make it just as likely that studies with low power are published.

The ERR of 72% suggests that the majority of results published in Acta Psychologica are replicable, but the EDR allows for a replication rate as low as 46%. Thus, replicability is estimated to range from 46% to 72%. Actual replications of cognitive research suggest that 50% of results produce a significant result again (Open Science Collaboration, 2015). This finding is consistent with the present results. Taking the low false positive risk into account, most replication failures are likely to be false negatives due to insufficient power in the original and replication studies. This suggests that replication studies should increase sample sizes to have sufficient statistical power to replicate true positive effects.

Time Trends

To examine changes in credibility over time, z-curves were fitted to test statistics for each year from 2000 to 2022. The results were regressed on time and time-squared to allow for non-linear relationships.

Figure 2 shows the ODR and EDR to examine selection bias. The ODR showed a significant linear trend, indicating more publications of non-significant results, b = -.43 percentage points per year (SE = .13). The EDR showed no significant trends, p > .20.

Figure 2

The decrease in the ODR implies that selection bias is decreasing over time. However, a low EDR still implies that many studies that produced non-significant results remain unpublished. Moreover, it is unclear whether researchers are reporting more focal results as non-significant. To investigate this question, it is necessary to hand-code focal tests (see Limitations section).

Figure 3 shows the false discovery risk (FDR) and the estimated replication rate (ERR). It also shows the expected replication failure rate (EFR = 1 – ERR). A comparison of the EFR with the FDR provides information for the interpretation of replication failures. If the FDR is close to the EFR, many replication failures may be due to false positive results in original studies. In contrast, if the FDR is low, most replication failures are likely to be false negative results in underpowered replication studies.

Figure 3

There were no linear or quadratic time trends for the ERR, p > .2. The FDR is based on the EDR that also showed no time trends. Thus, the estimates for all years can be used to obtain more precise estimates than the annual ones. Based on the results in Figure 1, the expected failure rate is 28% and the FDR is 5%. This suggests that replication failures are more likely to be false negatives due to modest power rather than false positive results in original studies. To avoid false negative results in replication studies, these studies should use larger samples.

Retrospective Improvement of Credibility

The criterion of alpha = .05 is an arbitrary threshold for making decisions about a hypothesis. It was used by authors to conclude that an effect is present, and editors accepted articles on the basis of this evidence. However, readers can demand stronger evidence. A rational way to decide which alpha criterion to use is the false positive risk. A lower alpha, say alpha = .005, reduces the false positive risk, but also increases the percentage of false negatives (i.e., there is an effect even though the p-value is above alpha).

Figure 4 shows the implications of using different significance criteria for the observed discovery rate (lower alpha implies fewer significant results).

Figure 4

Using alpha = .01 lowers the discovery rate by about 20 percentage points. The stringent criterion of alpha = .001 lowers it by another 20 percentage points to around 40% discoveries. This would mean that many published results that were used to make claims no longer have empirical support.

Figure 5 shows the effects of alpha on the false positive risk. Even alpha = .01 is sufficient to ensure a false positive risk of 5% or less. Thus, alpha = .01 seems a reasonable criterion to avoid too many false positive results without discarding too many true positive results. Authors may want to increase statistical power to increase their chances of obtaining a p-value below .01 when their hypotheses are true to produce credible evidence for their hypotheses.

Figure 5

Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported, but do not test focal hypotheses (e.g., testing statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).

For the journal Acta Psychologica, hand-coded data are available for the years 2010 and 2020 from a study that examines changes in replicability from 2010 to 2020. Figure 6 shows the results.

Figure 6

The most notable difference is the higher observed discovery rate for hand-coding of focal hypothesis tests (94%) than for automatically extracted test statistics (70%). Thus, results based on automatically extracted data underestimate selection bias.

In contrast, the expected discovery rates are similar in hand-coded (46%) and automatically extracted (46%) data. Given the small set of hand-coded tests, the 95% confidence interval around the 46% estimate is wide, but there is no evidence that automatically extracted data overestimate the expected discovery rate and by implication underestimate the false discovery rate.

The ERR for hand-coded focal tests (70%) is also similar to the ERR for automatically extracted tests (72%).

This comparison suggests that the main limitation of automatic extraction of test statistics is that this method underestimates the amount of selection bias because authors are more likely to report non-focal tests than focal results that are not significant. Thus, selection bias remains a pervasive problem in this journal.

Conclusion

The replicability report for Acta Psychologica shows clear evidence of selection bias, although there is a trend suggesting that selection bias may be decreasing in recent years. The results also suggest that replicability is in a range from 40% to 70%. This replication rate does not deserve to be called a crisis, but it does suggest that many studies are underpowered and require luck to get a significant result. The false positive risk is modest and can be controlled by setting alpha to .01. Finally, time trend analyses show no important changes in response to the open science movement. An important goal is to reduce the selective publishing of studies that worked (p < .05) and the hiding of studies that did not work (p > .05). Preregistration or registered reports can help to address this problem. Given concerns that most published results in psychology are false positives, the present results are reassuring and suggest that most results with p-values below .01 are true positive results.

Replicability Reports of Psychology Journals

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability reports use z-curve to provide information about psychological journals. This information can be used by authors to choose journals they want to publish in, provides feedback to journal editors who have influence on selection bias and the replicability of results published in their journals, and, most importantly, informs readers of these journals.

List of Journals with Links to Replicability Report

Psychological Science (2000-2022)

Acta Psychologica (2000-2022)

Replicability Report 2023: Psychological Science

updated 7/11/23
[slightly different results due to changes in the extraction code and a mistake in the formula for the false discovery risk with different levels of alpha]

In the 2010s, it became apparent that empirical psychology has a replication problem. When psychologists tested the replicability of 100 results, they found that only 36% of the 97 significant results in original studies could be reproduced (Open Science Collaboration, 2015). In addition, several prominent cases of research fraud further undermined trust in published results. Over the past decade, several proposals were made to improve the credibility of psychology as a science. Replicability reports are the results of one of these initiatives.

The main problem in psychological science is the selective publishing of statistically significant results and the blind trust in statistically significant results as evidence for researchers’ theoretical claims. Unfortunately, psychologists have been unable to self-regulate their behavior and continue to use unscientific practices to hide evidence that disconfirms their predictions. Moreover, ethical researchers who do not use unscientific practices are at a disadvantage in a game that rewards publishing many articles without any concern about the replicability of these findings.

My colleagues and I have developed a statistical tool that can reveal the use of unscientific practices and predict the outcome of replication studies (Brunner & Schimmack, 2021; Bartos & Schimmack, 2022). This method is called z-curve. Z-curve cannot be used to evaluate the credibility of a single study. However, it can provide valuable information about the research practices in a particular research domain.

Replicability-Reports (RR) use z-curve to provide information about the research and publication practices of psychological journals. This information can be used by authors to choose journals they want to publish in, provides feedback to journal editors who have influence on selection bias and the replicability of results published in their journals, and, most importantly, informs readers of these journals.

Psychological Science

Psychological Science is often called the flagship journal of the Association for Psychological Science. It publishes articles from all areas of psychology, but most articles are experimental studies.

The journal started in 1990. So far, it has published over 5,000 articles with an average citation rate of 90 citations per article. The journal currently has an H-Index of 300 (i.e., 300 articles have received 300 or more citations).

Ironically, the most cited article (3,800 citations) is a theoretical article that illustrated how easy it is to produce statistically significant results with statistical tricks that capitalize on chance, increase the risk of a false discovery, and inflate effect size estimates (Simmons, Nelson, & Simonsohn, 2011). This article is often cited as evidence that published results lack credibility. The impact of this article also suggests that most researchers are now aware that selective publishing of significant results is harmful.

After concerns about the replicability of psychological science emerged in the early 2010s, Eric Eich initiated changes to increase the credibility of published results. Further changes were made by Stephen Lindsay during his editorship from 2015 to 2019. Replicability reports provide an opportunity to examine the effect of these changes on the credibility of published results.

Report

Replication reports are based on automatically extracted test statistics (F-tests, t-tests, z-tests) from the text portion of articles. The reports do not include results reported as effect sizes (r), confidence intervals, or results reported in tables or figures.

Figure 1

Figure 1 shows a z-curve plot for all articles from 2000-2022 (see Schimmack, 2023, for a detailed description of z-curve plots). The plot is essentially a histogram of all test statistics converted into absolute z-scores (i.e., the direction of an effect is ignored). Z-scores can be interpreted as the strength of evidence against the null-hypothesis that there is no statistical relationship between two variables (i.e., the effect size is zero and the expected z-score is zero).

A z-curve plot shows the standard criterion of statistical significance (alpha = .05, z = 1.97) as a vertical red line. It also shows a dotted vertical red line at z = 1.65 because results with z-scores between 1.65 and 1.97 are often interpreted as evidence for an effect using a more liberal alpha criterion, alpha = .10, a one-sided test, or with qualifiers (e.g., marginally significant). Thus, values in this range cannot be interpreted as reporting of results that failed to support a hypothesis.

Z-curve plots are limited to values less than z = 6. The reason is that values greater than 6 are so extreme that a successful replication is all but certain, unless the value is a computational error or based on fraudulent data. The extreme values are still used for the computation of z-curve statistics, but omitted from the plot to highlight the shape of the distribution for diagnostic z-scores in the range from 2 to 6.

Z-curve fits a statistical model to the distribution of these z-scores. The predicted distribution is shown as a grey curve. Importantly, the model is fitted to the significant z-scores, but the model makes a prediction about the distribution of non-significant results. This makes it possible to examine publication bias (i.e., selective publishing of significant results).

Selection for Significance

Visual inspection of Figure 1 shows that there are fewer observed non-significant results (z-scores between 0 and 1.65) than the model predicts. This is evidence of selection for significance. The bias in favor of significant results can be quantified by comparing the proportion of observed significant results (i.e., the observed discovery rate, ODR) with the expected discovery rate (EDR) based on the grey curve. While the observed discovery rate is 71%, the expected discovery rate is 25%. The 95% confidence intervals of the ODR and EDR do not overlap. Thus, there is clear evidence of selection bias in this journal.

False Positive Risk

The replication crisis has led to concerns that many or even most published results are false positives (i.e., the true effect size is zero). The false positive risk is inversely related to the expected discovery rate and z-curve uses the EDR to estimate the risk that a published significant result is a false positive result.

An EDR of 25% implies that up to 16% of the significant results could be false positives. Thus, concerns that most published results are false are overblown. However, a focus on false positives is misleading because it ignores effect sizes. Even if an effect is not exactly zero, it may be too small to be of practical significance. Readers of statistical results in Psychological Science need to examine the range of plausible effect sizes (i.e., confidence intervals) to see whether results have practical significance. Unfortunately, these estimates are inflated by selection bias, especially when the evidence is weak and the confidence interval already includes effect sizes close to zero.
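The step from an EDR of 25% to a false positive risk of up to 16% follows from a simple upper-bound formula. The sketch below reproduces the numbers in the text, but it should be read as an illustration of the logic rather than as the exact z-curve implementation.

```python
# A minimal sketch of the upper bound that links the expected discovery rate
# (EDR) to the false positive risk; it reproduces the numbers in the text
# (EDR = 25% -> roughly 16%), but is an illustration of the logic, not the
# exact z-curve implementation.
def max_false_positive_risk(edr, alpha=0.05):
    """Upper bound on the share of significant results that are false positives."""
    return (1 / edr - 1) * alpha / (1 - alpha)

print(f"EDR = 25% -> false positive risk up to {max_false_positive_risk(0.25):.0%}")
print(f"EDR = 50% -> false positive risk up to {max_false_positive_risk(0.50):.0%}")
```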

Expected Replication Rate

The expected replication rate estimates the percentage of studies that would produce a significant result again if exact replications with the same sample size were conducted. A comparison of the ERR with the outcome of actual replication studies shows that the ERR is higher than the actual replication rate. There are several factors that can explain this discrepancy, such as the difficulty of conducting exact replication studies. Thus, the ERR is an optimistic estimate. A conservative estimate is the EDR. The EDR predicts replication outcomes under the assumption that selection for significance does not favor studies with higher power (larger effects and smaller sampling error), because statistical tricks make it just as likely that studies with low power are published.

The ERR of 67% suggests that most results published in this journal are replicable, but the EDR allows for a replication rate as low as 25%. Thus, replicability is estimated to range from 25% to 67%. Actual replications of results in this journal suggest a replication rate of 41% (Open Science Collaboration, 2015). This finding is consistent with the present results. Thus, replicability of results in Psychological Science is much lower than trusting readers might suspect.

Time Trends

To examine changes in credibility over time, z-curves were fitted to test statistics for each year from 2000 to 2022. Confidence intervals were created by regressing the estimates on time and time-squared to examine non-linear relationships.

Figure 2 shows the ODR and EDR to examine selection bias. The ODR showed a significant linear trend, indicating more publications of non-significant results, b = -.53 percentage points per year (SE = .09). The EDR showed significant linear, b = .70, SE = .31, and non-linear, b = .24, SE = .05, trends.

Figure 2

The decreasing ODR implies that selection bias is decreasing, but it is not clear whether this trend also applies to focal hypothesis tests (see limitations section). The curvilinear trend for the EDR is notable because it suggests that concerns about the credibility of published results were triggered by a negative trend in the EDR from 2000 to 2010. Since then, the EDR has been moving up. The positive trend can be attributed to the reforms initiated by Eric Eich and Stephen Lindsay that have been maintained by the current editor, Patricia J. Bauer.
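For readers who want to see the mechanics, here is a sketch of the kind of quadratic time-trend regression described above, applied to placeholder yearly EDR values rather than the actual yearly z-curve estimates.

```python
# Sketch of the quadratic time-trend regression described above, applied to
# placeholder yearly EDR values (the actual analysis uses the yearly z-curve
# estimates, which are not reproduced here).
import numpy as np
import statsmodels.api as sm

years = np.arange(2000, 2023)
rng = np.random.default_rng(1)
edr = 0.30 - 0.005 * (years - 2011) + 0.002 * (years - 2011) ** 2 + rng.normal(0, 0.03, years.size)

time = years - years.mean()                      # center to reduce collinearity
X = sm.add_constant(np.column_stack([time, time ** 2]))
fit = sm.OLS(edr, X).fit()
print(fit.params)  # intercept, linear trend (b), quadratic trend (b)
print(fit.bse)     # the corresponding standard errors (SE)
```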

Figure 3 shows the false discovery risk (FDR) and the estimated replication rate (ERR). It also shows the expected replication failure rate (EFR = 1 – ERR). A comparison of the EFR with the FDR provides information for the interpretation of replication failures. If the FDR is close to the EFR, many replication failures may be due to false positive results in original studies. In contrast, if the FDR is low, most replication failures are likely to be false negative results in underpowered replication studies.

Figure 3

There were linear, b = .40, SE = .11, and quadratic, b = .14, SE = .02, time trends for the ERR. The FDR is based on the EDR that also showed linear and quadratic trends. The non-linear trends imply that credibility was lowest from 2005 to 2015. During this time up to 40% of published results might not be replicable and up to 50% of these results might be false positive results. The Open Science replication project replicated studies from 2008. Given the present findings, this result cannot be generalized to other years.

Retrospective Improvement of Credibility

The criterion of alpha = .05 is an arbitrary criterion for making decisions about a hypothesis. It was used by authors to conclude that an effect is present, and editors accepted articles on the basis of this evidence. However, readers can demand stronger evidence. A rational way to decide which alpha criterion to use is the false positive risk. A lower alpha, say alpha = .005, reduces the false positive risk, but it also increases the percentage of false negatives (i.e., cases in which an effect exists but the p-value is above alpha).

Figure 4 shows the implications of using different significance criteria for the observed discovery rate (lower alpha implies fewer significant results).

Figure 4

Lowering alpha to .01 reduces the observed discovery rate by 20 to 30 percentage points. The effect is stronger during the dark period from 2005 to 2015 because more results during this period had p-values between .05 and .01. These results often do not replicate and are more likely to be the result of unscientific research practices.
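The mechanics behind this figure are simple: the observed discovery rate is just the share of reported p-values below the chosen alpha, so lowering alpha mechanically lowers the ODR. The snippet below illustrates this with made-up p-values.

```python
# The observed discovery rate is simply the share of reported p-values below
# the chosen alpha, so lowering alpha mechanically lowers the ODR. The
# p-values below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
p_values = rng.uniform(0, 1, 2000) ** 3   # skewed toward small values, as in journals

for alpha in (0.05, 0.01, 0.005):
    odr = np.mean(p_values < alpha)
    print(f"alpha = {alpha}: observed discovery rate = {odr:.0%}")
```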

Figure 5 shows the effects of alpha on the false positive risk. Lowering alpha to .01 reduces the false positive risk considerably, but it remains above 5% during the dark period from 2005 to 2015. These results suggest that readers could use alpha = .005 from 2005 to 2015 and alpha = .01 during other years to achieve a false positive risk below 5%.

Figure 5

Limitations

The main limitation of these results is the use of automatically extracted test statistics. This approach cannot distinguish between theoretically important statistical results and other results that are often reported, but do not test focal hypotheses (e.g., testing statistical significance of a manipulation check, reporting a non-significant result for a factor in a complex statistical design that was not expected to produce a significant result).

For the journal Psychological Science, hand-coded data are available from coding by Motyl et al. (2017) and my own lab. The datasets were combined and analyzed with z-curve (Figure 4).

The ODR of 84% is higher than the ODR of 68% for automatic extraction. The EDR of 34% is identical to the estimate for automatic extraction. The ERR of 61% is 8 percentage points lower than the ERR for automatic extraction. Given the period effects on z-curve estimates, I also conducted a z-curve analysis of automatically extracted tests for the matching years (2003, 2004, 2010, 2016, 2020). The results were similar, ODR = 73%, EDR = 25%, and ERR = 64%. Thus, automatic extraction produces results that are similar to those based on hand-coded data. The main difference is that automatically extracted non-significant results are less likely to be focal tests.

Conclusion

The replicability report for Psychological Science shows (a) clear evidence of selection bias, (b) unacceptably high false positive risks at the conventional criterion for statistical significance, and (c) modest replicability. However, time trend analyses show that the credibility of published results decreased at the beginning of this century but has improved since 2015. Further improvements are needed to eliminate selection bias and to increase the expected discovery rate by increasing power (reducing sampling error). Reducing sampling error is also needed to produce strong evidence against theoretical predictions, which is important for theory development. The present results can be used as a benchmark for further improvements that can increase the credibility of results in psychological science (e.g., more Registered Reports that publish results independent of outcomes). The results can also help readers of Psychological Science to choose significance criteria that match their personal preferences for risk and their willingness to “err on the side of discovery” (Bem, 2004).

The Relationship between Positive Affect and Negative Affect: It’s Complicated

About 20 years ago, I was an emotion or affect researcher. I was interested in structural models of affect, which was a hot research topic in the 1980s (Russell, 1980; Watson & Tellegen, 1985; Diener & Iran-Nejad, 1986; Shaver et al., 1987). In the 1990s, a consensus emerged that the structure of affect has a two-dimensional core, but a controversy remained about the basic dimensions that create the two-dimensional space. One model assumed that Positive Affect and Negative Affect are opposite ends of a single dimension (like hot and cold are opposite ends of a bipolar temperature dimension). The other model assumed that Positive Affect and Negative Affect are independent dimensions. This controversy was never resolved, probably because neither model is accurate (Schimmack & Grob, 2000).

When Seligman was pushing positive psychology as a new discipline in psychology, I was asked to write a chapter for a Handbook of Methods in Positive Psychology. This was a strange request because it is questionable whether Positive Psychology is really a distinct discipline and there are no distinct methods to study topics under the umbrella term positive psychology. Nevertheless, I obliged and wrote a chapter about the relationship between Positive Affect and Negative Affect that questions the assumption that positive emotions are a new and previously neglected topic and the assumption that Positive Affect can be studied separately from Negative Affect. The chapter basically summarized the literature on the relationship between PA and NA up to that point, including some mini meta-analyses that shed light on moderators of the relationship between PA and NA.

As with many handbooks that are expensive and not easily available as electronic documents, the chapter had very little impact on the literature. WebofScience shows only 25 citations. As the topic is still unresolved, I thought I would make the chapter available as a free text in addition to the Google Book option, which is a bit harder to navigate.

Here is a PDF version of the chapter.

Key points

  • The correlation between PA and NA varies as a function of items, response formats, and other method factors.
  • Pleasure and displeasure are not opposite ends of a single bipolar dimension.
  • Pleasure and displeasure are not independent.

Psychological Science and Real World Racism

The prompt for this essay is my personal experience with accusations of racism in response to my collaboration with my colleague Judith Andersen and her research team, who investigated the influence of race on shooting errors in police officers’ annual certification (Andersen, Di Nota, Boychuk, Schimmack, & Collins, 2023a). Our article was heavily criticized as racially insensitive and racially biased (Williams et al., 2023). We responded to the specific criticisms of our article (Andersen, Di Nota, Boychuk, Schimmack, & Collins, 2023b). This essay takes a broader perspective on the study of race-related topics in psychological science. It is based entirely on my own experiences and views, and I do not speak for my colleagues.

Science

The term science is used to distinguish claims that are backed up by scientific facts from claims that are based on other evidence or belief systems. For people who believe in science, these claims have a stronger influence on their personal belief systems than other claims. Take “flat-earth theorists” as an example. Most educated people these days believe that the Earth is round and point to modern astronomy as a science that supports this claim. However, some people seriously maintain the belief that the earth is flat (https://en.wikipedia.org/wiki/Behind_the_Curve). Debates between individuals or groups who “follow the science” and those who do not are futile. In this regard, believing in science is like a religion. This article is addressed to readers who “believe in science.”

What does it mean to believe in science? A fundamental criterion that distinguishes science from other belief systems is falsifiability. At some point, empirical evidence has to be able to correct pre-existing beliefs. For this to happen, the evidence has to be strong. For example, there should be little doubt about the validity of the measures (e.g., thermometers are good measures of temperature) and the replicability of the results (different research teams obtain the same results). When these preconditions are fulfilled, scientific discoveries are made and knowledge is gained (e.g., better telescopes produce new discoveries in astronomy, microscopes showed the influence of bacteria on diseases, etc.). The success of Covid-19 vaccines (if you believe in science) was possible due to advances in microbiology. The modern world we live in would not exist without actions by individuals who believe in science.

Psychological Science

Psychological science emerged in the late 19th century as an attempt to use the scientific method to study human experiences and behavior. The biggest success stories in psychological science can be found in areas that make it possible to conduct tightly controlled laboratory studies. For example, asking people to name the ink color of color words printed in matching or mismatching colors reveals a robust effect: it is harder to name the ink color when it does not match the word (e.g., when the word purple is printed in green).

Psychological science of basic phenomena like perception and learning has produced many robust scientific findings. Many of these findings are so robust because they are universal; that is, shared by all humans. This is consistent with other evidence that humans are more alike than different from each other and that peripheral differences like height, hair texture, and pigmentation are superficial differences and not symptoms of clearly distinguishable groups of humans (different races).

Social Psychology

Social psychology emerged as a sub-discipline of psychological science in the 1950s. A major goal of social psychology was to use the methods of psychological science to study social behaviors with bigger social implications than the naming of colors. The most famous studies from the 1950s and 1960s tried to explain the behavior of Germans during World War II who were involved in the Holocaust. The famous Milgram experiments, for example, showed that social pressure can have a strong influence on behavior. Asch showed that conformity pressure can make people say things that are objectively false. These studies are still powerful today because they used actual behaviors as the outcome. In Milgram’s studies, participants were led to believe that they gave electric shocks to another person who screamed in pain.

From the beginning, social psychologists were also interested in prejudice (Allport, 1954), at a time when the United States was segregated and blatantly racist. White Americans’ racial attitudes were easy to study because White Americans openly admitted that they did not consider White and Black Americans to be equal. For example, in the 1950s, nearly 100% of Americans disapproved of interracial marriages, which were also illegal in some states at that time.

It was more difficult to study the influence of racism on behavior. To ensure that behavior is influenced by an individual’s race and not some other factor (psychology jargon for cause), it is necessary to keep all other causes constant and then randomly assign participants to the two conditions and show a difference in outcome. My search for studies of this type revealed only a handful of studies with small student samples that showed no evidence of prejudice (e.g., Genthner & Taylor, 1973). There are many reasons why these studies may have failed to produce evidence of prejudice. For example, participants knew that they were in a study and that their behaviors were observed, which may have influenced how they behaved. Most important is the fact that the influence of prejudice on behavior was not a salient topic in social psychology.

This changed in the late 1980s (at a time when I became a student of psychology), when social psychologists became interested in unconscious processes that were called implicit processes (Devine, 1989). The novel idea was that racial biases can influence behavior outside of conscious awareness. Thus, some individuals might claim that they have no prejudices, but their behaviors show otherwise. Twenty years later, this work led to the claim that most White people have racial biases that influence their behavior even if they do not want to (Banaji & Greenwald, 2013).

Notably, in the late 1980s, 40% of US Americans still opposed interracial marriages, showing that consciously accessible, old-fashioned racism was still prevalent in the United States. However, the primary focus of social psychologists was not the study of prejudice but the study of unconscious/implicit processes; implicit prejudice was just one of many implicit topics under investigation.

While the implicit revolution led to hundreds of studies that examined White people’s behaviors in response to Black and White persons, the field also made an important methodological change. Rather than studying real behaviors toward real people, most studies examined how fast participants can press a button in response to a stimulus (e.g., a name, a face, or simply the words Black/White) on a computer screen. The key problem with this research is that button presses on computer screens are not the same as button presses on dating profiles or pressing the trigger on a gun during a use of force situation.

This does not mean that these studies are useless, but it is evident that they cannot produce scientific evidence about the influence of race on behavior in the real world. In the jargon of psychological science, these studies lack external validity (i.e., the results cannot be generalized from button presses in computer tasks to real world behaviors).

Psychological Science Lacks Credibility

Psychology faces many challenges to be recognized as a science equal to physics, chemistry, or biology. One major challenge is that the behaviors of humans vary a lot more than the behaviors of electrons, atoms, or cells. As a result, many findings in social psychology are general trends that explain only a small portion of the variability in behavior (e.g., some White people are in interracial relationships). To deal with this large amount of variability (noise, randomness), psychologists rely on statistical methods that aim to detect small effects on the variability in behavior. Since the beginning of psychological science, the method used to find these effects has been null-hypothesis significance testing, or simply significance testing (Is p < .05?). Although this method has been criticized for decades, it continues to be taught to undergraduate students and is used to make substantive claims in research articles.

The problem with significance testing is that it is designed to confirm researchers’ hypotheses, but it cannot falsify them. Thus, the statistical tool cannot serve the key function of science to inform researchers that their ideas are wrong. As researchers are human and humans already have a bias to find evidence that supports their beliefs, significance testing is an ideal tool for scientists to delude themselves that their claims are supported by scientific evidence (p < .05), when their beliefs are wrong.

Awareness of this problem increased after a famous social psychologist, Daryl Bem, used NHST to convince readers that humans have extrasensory perception and can foresee future events (Bem, 2011). Attesting to the power of confirmation bias, Bem still believes in ESP, but the broader community has realized that the statistical practices in social psychology are unscientific and that decades of published research lacks scientific credibility. It did not help that a replication project found that only 25% of published results in the most prestigious journals of social psychology could be replicated.

Despite growing awareness about the lack of credible scientific evidence, claims about prejudice and racism in textbooks, popular books, and media articles continue to draw on this literature because there is no better evidence (yet). The general public and undergraduate students make the false assumption that social psychologists are like astronomers who are interpreting the latest pictures from the new space telescope. Social psychologists are mainly presenting their own views as if they were based on scientific evidence, when there is no scientific evidence to support these claims. This explains why social psychologists often vehemently disagree about important issues. There is simply no shared empirical evidence that resolves these conflicts.

Thus, the disappointing and honest answer is that social psychology simply cannot provide scientific answers to real world questions about racial biases in behavior. Few studies actually examined real behavior, studies of button presses on computers have little ecological validity, and published results are often not replicable.

The Politicization of Psychological Science

In the absence of strong and unambiguous scientific evidence, scientists are no different from other humans, and confirmation biases will influence scientists’ beliefs. The problem is that the general public confuses their status as university professors and researchers with expertise that is based on superior knowledge. As a result, claims by professors and researchers in journal articles or in books, talks, or newspaper interviews are treated as if they deserve more weight than other views. Moreover, other people may refer to the views of professors or their work to claim that their own views are scientific because they echo those printed in scientific articles. When these claims are not backed by strong scientific evidence, scientific articles become weaponized in political conflicts.

A scientific article on racial biases in use of force errors provides an instructive example. In 2019, social psychologist Joseph Cesario and four graduate students published an article on racial disparities in use of force errors by police (a.k.a., unnecessary killings of US civilians). The article passed peer-review at a prestigious scientific journal, the Proceedings of the National Academy of Sciences (PNAS). Like many journals these days, PNAS asks authors to provide a Public Significance Statement.

The key claim in the significance statement is that the authors found “no evidence of anti-Black or anti-Hispanic disparities across shootings.” Scientists may look at this statement and realize that it is not equivalent to the claim that “there is no racial bias in use of force errors.” First of all, the authors clearly say that they did not find evidence. This leaves the possibility that other people looking at the same data might have found evidence. Among scientists it is well known that different analyses can produce different results. Scientists also know the important distinction between the absence of evidence and evidence of the absence of an effect. The significance statement does not say that the results show that there are no racial biases, only that the authors did not find evidence for biases. However, significance statements are not written for scientists, and it is easy to see how these statements could be (unintentionally or intentionally) misinterpreted as saying that science shows that there are no racial biases in police killings of innocent civilians.

And this is exactly what happened. Black-Lives-Anti-Matter Heather Mac Donald used this research as “scientific evidence” to support the claim that the liberal left is fighting an unjustified “War on Cops.” Her bio on Wikipedia shows that she received degrees in English, without any indication that she has a background in science. Yet, the Wall Street Journal allowed her to summarize the evidence in an opinion article with the title “The myth of systemic police racism.” Thus, a racially biased and politically motivated non-scientist was able to elevate her opinion by pointing to the PNAS article as evidence that her opinion is the truth.

In this particular case, the journal was forced to retract the article after post-publication peer review revealed statistical errors in the paper and it became clear that the significance statement was misleading. An editorial reviewed this case study of politicized science in great detail (Massey & Waters, 2020).

Although this editorial makes it clear that mistakes were made, it doesn’t go far enough in admitting the mistakes that were made by the journal editors. Most important, even if the authors had not made mistakes, it would be wrong to allow for any generalized conclusions in a significance statement. The clearest significance statement would be that “This is only one study of the issue with limitations and the evidence is insufficient to draw conclusions based on this study alone.” But journals are also motivated to exaggerate the importance of articles to increase their prestige.

The editorial also fails to acknowledge that the authors, reviewers, and editor were White and that it is unlikely that the article would have made misleading statements if African American researchers were involved in the research, peer-review, or the editorial decision process. To African Americans the conclusion that there is no racial bias in policing is preposterous, while it seemed plausible to the White researchers who gave the work the stamp of approval. Thus, this case study also illustrates the problems of systemic racism in psychology that African Americans are underrepresented and often not involved in research that directly affects them and their community.

My Colleague’s Research with Police Officers

My colleague, Judith Andersen, is a trained health psychologist with a focus on stress and health. One area of her research is how police officers cope with stress and traumatic experiences they encounter in their work. This research put her in a unique position to study racial biases in the use of force with actual police officers (in contrast, many social psychologists have studied shooting games with undergraduate students). Getting the cooperation of police departments and individual officers to study such a highly politicized topic is not easy, and without cooperation there are no participants, no data, and no scientific evidence. A radical response to this reality would be to reject any data that require police officers’ consent. That is a principled response, but it is not a valid criticism of researchers who conduct such studies, note the requirement of consent as a potential limitation, and refrain from making bold statements that their data settle a political issue.

The actual study is seemingly simple. Officers have to pass a use of force test for certification to keep their service weapon on duty. To do so, officers go through a series of three realistic scenarios with their actual service weapon and do not know whether “shoot” or “don’t shoot” is the right response. Thus, they may fail the test if they fail to shoot in scenarios where shooting is the right response. The novel part of the study was to create two matched scenarios with a White or Black suspect and randomly assign participating officers to these scenarios. Holding all other possible causes constant makes it possible to see whether shooting errors are influenced by the race of a suspect.

After several journals, including PNAS, showed no interest in this work, it was eventually accepted for publication by the editor of The Canadian Journal of Behavioural Science. The journal also requires a Significance statement and we provided one.

Scientists might notice that our significance statement is essentially identical to Johnson et al.’s fateful significance statement. In plain English, we did not find evidence of racial biases in shooting errors. The problem is that significance testing often leads to the confusion of a lack of evidence with evidence of no bias. To avoid this misinterpretation, we made it clear that our results cannot be interpreted as evidence that there are no biases. To do so, we emphasized that the shooting errors in the sample did show a racial bias. However, we could not rule out that this bias was unique to this sample and that the next sample might show no bias or even the opposite bias. We also pointed out that the bias in this sample might be smaller than the actual bias and that the actual bias might fully account for the real world disparities. In short, our significance statement is an elaborate, jargony way of saying “our results are inconclusive and have no real-world significance.”

It is remarkable that the editor published our article because 95% of articles in psychology present a statistically significant result that justifies a conclusion. This high rate of successful studies, however, is a problem because selective publishing of only significant results undermines the credibility of published results. Even crazy claims like mental time travel are supported by statistically significant results. Only the publication of studies that failed to replicate these results helped us to see that the original results were wrong. It follows that journals have to publish articles with inconclusive results to be credible, and researchers have to be allowed to present inconclusive results to ensure that conclusive results are trustworthy. It also follows that not all scientific articles are in need of media attention and publicity. The primary aim of scientific journals is communication among scientists and to maintain a record of scientific results. Even Nadal or Federer did not win every tournament. So, scientists should be allowed to publish articles that are not winners, and nobody should trust scientists who only publish articles that confirm their predictions.

It is also noteworthy that our results were inconclusive because the sample size was too small to draw stronger conclusions. However, it was the first study of its kind and it was already a lot of effort to get even these data. The primary purpose of publishing a study like this is to stimulate interest and to provide an example for future studies. Eventually, the evidence base grows and more conclusive results can be obtained. Ultimately, it is up to the general public and policy makers to fund this research and to require participation of police departments in studies of racial bias. It would be foolish to criticize our study because it didn’t produce conclusive results in the first investigation. Even if the study had produced statistically significant results, replication studies would be needed before any conclusions can be drawn.

Social Activism in Science

Williams et al. (2023) wrote a critical commentary on our article with the title “Performative Shooting Exercises Do Not Predict Real-World Racial Bias in Police Officers.” We were rather surprised by this criticism because our main finding was basically a non-significant, inconclusive result. Apparently, this was not the result that we were supposed to get, or we should not have reported these results that contradict Williams et al.’s beliefs. Williams et al. start with the strong belief that any well-designed scientific study must find evidence for racial biases in shooting errors; otherwise there must be a methodological flaw. They are not shy to communicate this belief in their title. Our study of shooting errors during certification is called performative, and these errors “do not predict real world racial biases in police officers.” The question is how Williams et al. (2023) know the real-world racial biases of police officers to make this claim.

The answer is that they do not know anything more than anybody else about the real racial biases of police officers (you are invited to read the commentary and see whether I missed that crucial piece of information). Their main criticism is that we made unjustified assumptions about the external validity of the certification task: “The principal flaw with Andersen et al.’s (2023) paper is unscientific assumptions around the validity of the recertification shooting test.” That is, the bias that we observed in the certification task is taken at face value as information about the bias in real-world shooting situations.

The main problem with this criticism is that we never made the claim that biases in the certification task can be used to draw firm conclusions about biases in the real world. We even pointed out that we observed biases and that our results are consistent with the assumption that all of the racial disparities in real-world shootings are caused by racial biases in the immediate shooting decisions. As it turns out, Williams et al.’s critique is unscientific because it makes unscientific claims in the title and misrepresents our work. Our real sin was to be scientific and to publish inconclusive results that do not fit into the narrative of anti-police leftwing propaganda.

It is not clear why the authors were allowed to make many false and hurtful statements in their commentary, but personally I think it is better to have this example of politicization in the open to show that left-wing and right-wing political activists are trying to weaponize science to elevate their beliefs to the status of truth and knowledge.

Blatant examples of innocent African Americans killed by police officers (wikipedia) are a reason to conduct scientific studies, but these incidents cannot be used to evaluate the scientific evidence. And without good science, resources might be wasted on performative implicit bias training sessions that only benefit the trainers and do not protect the African American community.

Conclusion

The simple truth remains that psychological science has done little to answer real-world questions around race. Although prejudice and intergroup relations are core topics of social psychology, the research is often too removed from the real world to be meaningful. Unfortunately, incentives reward professors to use the inconclusive evidence selectively to confirm their own beliefs and then to present these beliefs as scientific claims. These pseudo-scientific claims are then weaponized by like-minded ideologues. This creates the illusion that we have scientific evidence, which is contradicted by the fact that opposing camps both cite science to believe they are right, just like opposing armies can pray to the same God for victory.

To change this, stakeholders in science, like government funding organizations, need to change the way money is allocated. Rather than giving grants to White researchers at elite universities to do basic (a.k.a. irrelevant) research on button-presses of undergraduate students, money should be given to diverse research teams with a mandate to answer practical, real-world questions. The reward structure at universities also has to change. To collect real world data from 150 police officers is 100 times more difficult than collecting 20 brain measures from undergraduate students. Yet, a publication in a neuroscience journal is seen as more scientific and prestigious than an article in a journal that addresses real-world problems, which are by their nature of interest to smaller communities.

Finally, it is important to recognize that a single study cannot produce conclusive answers to important and complex questions. All the major modern discoveries in the natural (real) sciences are made by teams. Funders need to provide money for teams that work together on a single important question rather than funding separate labs who work against each other. This is not new and has been said many times before, but so far there is little evidence of change. As a result, we have more information about galaxies millions of years ago than about our own behaviors and the persistent problem of racism in modern society. Don’t look to the scientists to provide a solution. Real progress has and will come from social activists and political engagement. And with every generation, more old racists will be replaced by a more open new generation. This is the way.

Who Holds Meta-Scientists Accountable?

“Instead of a scientist entering the environment in search of the truth, we find the rather unflattering picture of a charlatan trying to make the data come out in a manner most advantageous to his or her already-held theories” (p. 88). (Fiske & Taylor, 1984)

A critical examination of Miller and Ulrich’s article “Optimizing Research Output: How Can Psychological Research Methods Be Improved?” (https://doi.org/10.1146/annurev-psych-020821-094927)

Introduction

Meta-science has become a big business over the past decade because science has become a big business. With the increase in scientific production and the minting of Ph.D. students, competition has grown and researchers are pressured to produce ever more publications to compete with each other. At the same time, academia still pretends that it plays by the rules of English lords with “peer”-review and a code of honor. Even outright fraud is often treated like jaywalking.

The field of meta-science puts researchers’ behaviors under the microscope and often reveals shady practices and shoddy results. However, meta-scientists are subject to the same pressures as the scientists they examine. They get paid, promoted, and funded based on the quantity of their publications and citations. It is therefore reasonable to ask whether meta-scientists are any more trustworthy than other scientists. Sadly, that is not the case. Maybe this is not surprising because they are human like everybody else. Maybe the solution to human biases will be artificial intelligence programs. For now, the only way to reduce human biases is to call them out whenever you see them. Meta-scientists do not need meta-meta-scientists to hold them accountable, just like meta-scientists are not needed to hold scientists accountable. In the end, scientists hold each other accountable by voicing scientific criticism and responding to these criticisms. The key problem is that open exchange of arguments and critical discourse is often lacking because insiders use peer-review and other hidden power structures to silence criticism.

Here I want to use the chapter “Optimizing Research Output: How Can Psychological Research Methods Be Improved?” by Jeff Miller and Rolf Ulrich as an example of biased and unscientific meta-science. The article was published in the Annual Review of Psychology, a series that publishes invited review articles. One of the editors is Susan Fiske, a social psychologist who once called critical meta-scientists like me “method terrorists” because they make her field look bad. So far, this series has published several articles on the replication crisis in psychology with titles like “Psychology’s Renaissance.” I was never asked to write or review any of these articles, although I have been invited to review articles on this topic by several editors of other journals. However, Miller and Ulrich did cite some of my work and I was curious to see how they cited it.

Consistent with the purpose of the series, Miller and Ulrich claim that their article provides “a (mostly) nontechnical overview of this ongoing metascientific work.” (p. 692). They start with a discussion of possible reasons for low replicability.

2. WHY IS REPLICABILITY SO POOR?

They state that “there is growing consensus that the main reason for low replication rates is that many original published findings are spurious” (p. 693).

To support this claim they point out that psychology journals mostly publish statistically significant results (Sterling, 1959; Sterling et al., 1995), and then conclude that “current evidence of low replication rates tends to suggest that many published findings are FPs [false positives] rather than TPs [true positives].” This claim is simply wrong because it is very difficult to distinguish false positives from true positives that had very low power to produce a significant result. They do not mention attempts to estimate the false positive rate (Jager & Leek, 2014; Gronau et al., 2016; Schimmack & Bartos, 2021). These methods typically show low to moderate estimates of the false positive rate and do not justify the claim that most replication failures occur when an article reported a false positive result.

Miller and Ulrich now have to explain how false positive results can enter the literature in large numbers when the alpha criterion of .05 is supposed to keep most of these results out of publications. They propose that many “FPs [false positives] may reflect honest research errors at many points during the research process” (p. 694). This argument ignores the fact that concerns about shady research practices first emerged when Bem (2011) published a 9-study article that seemed to provide evidence for pre-cognition. According to Miller and Ulrich, we have to believe that Bem made 9 honest errors in a row that miraculously produced evidence for his cherished hypothesis that pre-cognition is real. If you believe this is possible, you do not have to read further and I wish you a good life. However, if you share my skepticism, you might feel relieved that there is actually meta-scientific evidence that Bem used shady practices to produce his evidence (Schimmack, 2018).

3. STATISTICAL CAUSES OF FALSE POSITIVES

Honest mistakes alone cannot explain a high percentage of false positive results in psychology journals. Another contributing factor has to be that psychologists test a lot more false hypotheses than true hypotheses. Miller and Ulrich suggest that only 1 out of 10 hypotheses tested by social psychologists is true. Research programs with such a high rate of false hypotheses are called high-risk. However, this description does not fit the format of typical social psychology articles that have lengthy theory sections and often state “as predicted” in the results section, often repeatedly for similar studies. Thus, there is a paradox. Either social psychology is risky and results are surprising, or it is theory-driven and results are predicted. It cannot be both.

Miller and Ulrich ignore the power of replication studies to reveal false positive results. This is not only true in articles with multiple replication studies, but across different articles that publish conceptual replication studies of the same theoretical hypothesis. How is it possible that all of these conceptual replication studies produced significant results, when the hypothesis is false? The answer is that researchers simply ignored replication studies that failed to produce the desired results. This selection bias, also called publication bias, is well-known and never called an honest mistake.

All of this gaslighting serves the purpose of presenting social psychologists as honest and competent researchers. High false positive rates and low replication rates happen “for purely statistical reasons, even if researchers use only the most appropriate scientific methods.” This is bullshit. Competent researchers would not hide non-significant results and continue to repeatedly test false hypotheses while writing articles that claim all of the evidence supports their theories. Replication failures are not an inevitable statistical phenomenon. They are man-made in the service of self-preservation during early career stages and ego-preservation during later ones.

4. SUGGESTIONS FOR REDUCING FALSE POSITIVES

Conflating false positives and replication failures, Miller and Ulrich review suggestions to improve replication rates.

4.1. Reduce the α Level

One solution to reducing false positive results is to lower the significance threshold. An influential article called for alpha to be set to .005 (so that at most 1 out of 200 tests of a true null hypothesis produces a false positive result). However, Miller and Ulrich falsely cite my 2012 article in support of this suggestion. This ignores that my article made a rather different recommendation, namely to conduct fewer studies with a higher probability of providing evidence for a true hypothesis. This would also reduce the false positive rate without having to lower the alpha criterion. Apparently, they didn’t really read or understand my article.

4.2 Eliminate Questionable Research Practices

A naive reader might think that eliminating shady research practices should help to increase replication rates and to reduce false positive rates. For example, if all results have to be published, researchers would think twice about the probability of obtaining a significant result. Which sane researcher would test their cherished hypothesis twice with 50% power, that is, a 50% probability of finding evidence for it? Just like flipping a coin twice, the chance of getting at least one embarrassing non-significant result would be 75%. Moreover, if they had to publish all of their results, it would be easy to detect hypotheses with low replication rates and either give up on them or increase sample sizes to detect small effect sizes. Not surprisingly, consumers of scientific research (e.g., undergraduate students) assume that results are reported honestly, and scientific integrity statements often imply that this is the norm.

However, Miller and Ulrich try to spin this topic in a way that suggests shady practices are not a problem. They argue that shady practices are not as harmful as some researchers have suggested, citing my 2020 article, because “QRPs also increase power by making it easier to reject null hypotheses that are false as well as those that are true (e.g., Ulrich & Miller 2020).” Let’s unpack this nonsense in more detail.

Yes, questionable research practices increase the chances of obtaining a significant result independent of the truth of the hypothesis. However, if researchers test only 1 true hypothesis for every 9 false hypotheses, QRPs have a much more severe effect on the rate of significant results for the false hypotheses (i.e., when the null-hypothesis is true). A false hypothesis starts with a low probability of a significant result when researchers are honest, namely 5% with the standard criterion of significance. In contrast, a true hypothesis can have anywhere between 5% and 100% power, limiting the room for shady practices to inflate the rate of significant results when the hypothesis is true. In short, the effects of shady practices are not equal, and false hypotheses benefit more from shady practices than true hypotheses.
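A small simulation makes this asymmetry concrete. The sketch below is my own illustration (not an analysis from Miller and Ulrich) of one common QRP, namely measuring two outcome variables and reporting a success if either one is significant; the relative inflation is much larger when the hypothesis is false than when it is true and power is already high.

```python
# A small simulation (my own illustration, not from Miller and Ulrich) of one
# QRP: measure two outcome variables and report a "success" if either one is
# significant. The relative inflation of the significance rate is much larger
# for a false hypothesis (d = 0) than for a true hypothesis with decent power.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def qrp_success_rate(effect_size, n=50, n_sims=10_000, alpha=0.05):
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, size=(n, 2))
        treatment = rng.normal(effect_size, 1.0, size=(n, 2))  # same effect on both DVs
        p1 = stats.ttest_ind(treatment[:, 0], control[:, 0]).pvalue
        p2 = stats.ttest_ind(treatment[:, 1], control[:, 1]).pvalue
        hits += (p1 < alpha) or (p2 < alpha)  # report whichever test "worked"
    return hits / n_sims

print("false hypothesis (d = 0.0):", qrp_success_rate(0.0))  # ~10% instead of the nominal 5%
print("true hypothesis  (d = 0.6):", qrp_success_rate(0.6))  # already high, inflates much less
```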

The second problem is that Miller and Ulrich conflate false positives and replication failures. Shady practices in original studies will also produce replication failures when the hypothesis is true. The reason is that shady practices lead to inflated effect size estimates, while the outcome of the honest replication study is based on the true population effect size. As this is often 50% smaller than the inflated estimates in published articles, replication studies with similar sample sizes are bound to produce non-significant results (Open Science Collaboration, 2015). Again, this is true even if the hypothesis is true (i.e., the effect size is not zero).

4.3 Increase Power

As Miller and Ulrich point out, increasing power has been a recommendation to improve psychological science (or a recommendation for psychology to become a science) for a long time (Cohen, 1962). However, they argue that this recommendation is not very practical because “it is very difficult to say what sample sizes are needed to attain specific target power levels, because true effect sizes are unknown” (p. 698). This argument against proper planning of sample sizes is false for several reasons.

First, I advocated for higher power in the context of multi-study papers. Rather than conducting 5 studies with 20% power, researchers should use their resources to conduct one study with 80% power. The main reason researchers do not do this is that the single study might still not produce a significant result and they are allowed to hide underpowered studies that failed to produce a significant result. Thus, the incentive structure that rewards publication of significant results rewards researchers who conduct many underpowered studies and only report those that worked. Of course, Miller and Ulrich avoid discussing this reason for the lack of proper power analysis to maintain the image that psychologists are honest researchers with the best intentions.

Second, researchers do not need to know the population effect size to plan sample sizes. One way to plan future studies is to base the sample size on previous studies. This is of course what researchers have been doing, only to find out that results do not replicate because the original studies used shady practices to produce significant results. Many graduate students who left academia spent years of their Ph.D. trying to replicate published findings and failed to do so. However, all of these failures remain hidden, so that power analyses based on published effect sizes lead to more underpowered studies that do not work. Thus, the main reason why it is difficult to plan sample sizes is that the published literature reports inflated effect sizes that imply small samples are sufficient to have adequate power.

Finally, it is possible to plan studies around the minimal effect size of interest. These studies are useful because a non-significant result implies that the hypothesis is not important even if the strict nil-hypothesis is false; the effect size is just so small that it does not really matter and would require extremely large samples to study. Nobody would be interested in studying these irrelevant effects that require large resources. However, to know that the population (true) effect size is too small to matter, it is important to conduct studies that are able to estimate small effect sizes precisely. In contrast, Miller and Ulrich warn that sample sizes could be too large because large samples “provide high power to detect effects that are too small to be of practical interest” (p. 698). This argument is rooted in the old statistical approach of ignoring effect sizes and being satisfied with the conclusion that the effect size is not zero, p < .05, what Cohen called nil-hypothesis testing and others have called a statistical ritual. Sample sizes are never too large because larger samples provide more precision in the estimation of effect sizes, which is the only way to establish that a true effect size is too small to be important. A study that defines the minimum effect size of interest and uses this effect size as the null-hypothesis can determine whether the effect is relevant or not.
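To illustrate how planning with a minimal effect size of interest works, here is a short sketch using standard power-analysis routines; the minimal effect size of d = 0.2 is an arbitrary choice for illustration.

```python
# Sketch of sample-size planning based on a minimal effect size of interest;
# the value d = 0.2 is an arbitrary choice for illustration. No knowledge of
# the true effect size is needed, only a decision about the smallest effect
# that would still matter.
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

n_minimal = power_analysis.solve_power(effect_size=0.2, power=0.80, alpha=0.05)
print(f"n per group for 80% power to detect d = 0.2: {n_minimal:.0f}")  # ~394

# For comparison: the sample size implied by a typical 'medium' published effect
n_medium = power_analysis.solve_power(effect_size=0.5, power=0.80, alpha=0.05)
print(f"n per group for 80% power to detect d = 0.5: {n_medium:.0f}")   # ~64
```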

4.4. Increase the Base Rate

Increasing the base rate means testing more true hypotheses. Of course, researchers do not know a priori which hypotheses are true or not. Otherwise, the study would not be necessary (actually, many studies in psychology test hypotheses where the null-hypothesis is false a priori, but that is a different issue). However, hypotheses can be more or less likely to be true based on existing knowledge. For example, exercise is likely to reduce weight, but counting backwards from 100 to 1 every morning is not likely to reduce weight. Many psychological studies are at least presented as tests of theoretically derived hypotheses. The better the theory, the more often a hypothesis is true and a properly powered study will produce a true positive result. Thus, theoretical progress should increase the percentage of true hypotheses that are tested. Moreover, good theories would even make quantitative predictions about effect sizes that can be used to plan sample sizes (see previous section).

Yet, Miller and Ulrich conclude that “researchers have little direct control over their base rates” (p. 698). This statement is not only inconsistent with the role of theory in the scientific process, it is also inconsistent with the nearly 100% success rate in published articles that always show the predicted results, if only because the prediction was made after the results were observed rather than from an a priori theory (Kerr, 1998).

In conclusion, Miller and Ulrich’s review of recommendations is abysmal and only serves the purpose of exonerating psychologists from the justified accusation that they are playing a game that looks like science but is not science: researchers are rewarded for publishing significant results that fail to provide evidence for their hypotheses, because even false hypotheses produce significant results with the shady practices that psychologists use.

5. OBJECTIONS TO PROPOSED CHANGES

Miller and Ulrich start this section with the statement that “although the above suggestions for reducing FPs all seem sensible, there are several reasonable objections to them” (p. 698). Remember that one of the proposed changes was to curb the use of shady practices. According to Miller and Ulrich, there is a reasonable objection to this recommendation. However, what would be a reasonable objection to the request that researchers should publish all of their data, even those that do not support their cherished theory? Every undergraduate student immediately recognizes that selective reporting of results undermines the essential purpose of science. Yet, Miller and Ulrich want readers to believe that there are reasonable objections to everything.

“Although to our knowledge there have been no published objections to the idea that QRPs should be eliminated to hold the actual Type 1 error rate at the nominal α level, even this suggestion comes with a potential cost. QRPs increase power by providing multiple opportunities to reject false null hypotheses as well as true ones” (p. 699).

Apparently, academic integrity only applies to students, but not to their professors when they go into the lab. Dropping participants, removing conditions, dependent variables, or entire studies, or presenting exploratory results as if they were predicted a priori are all ok because these practices can help to produce a significant result even when the nil-hypothesis is false (i.e., there is an effect).

This absurd objection has several flaws. First, it is based on the old and outdated assumption that the only goal of studies is to decide whether there is an effect or not. However, even Miller and Ulrich earlier acknowledged that effect sizes are important. Sometimes effect sizes are too small to be practically important. What they do not tell their readers is that shady practices produce significant results by inflating effect sizes, which can lead to the false impression that the true effect size is large when it is actually tiny. For example, the effect size of an intervention to reduce implicit bias on the Implicit Association Test was d = .8 in a sample of 30 participants and shrank to d = .08 in a sample of 3,000 participants (cf. Schimmack, 2012). What looked like a promising intervention when shady practices were used turned out to be a negligible effect in an honest attempt to investigate the effect size.

The other problem is of course that shady practices can produce significant results when a hypothesis is true and when a hypothesis is false. If all studies are statistically significant, statistical significance no longer distinguishes between true and false hypotheses (Sterling, 1959). It is therefore absurd to suggest that shady practices can be beneficial because they can produce true positive results. The problem with shady practices is the same as the problem with a liar: liars sometimes say something true and sometimes they lie, but you don’t know when they are honest and when they are lying.

9. CONCLUSIONS

The conclusion merely solidifies Miller and Ulrich’s main point that there are no simple recommendations to improve psychological science. Even the value of replications can be debated.

“In a research scenario with a 20% base rate of small effects (i.e., d = 0.2), for example, a researcher would have the choice between either running a certain number of large studies with α = 0.005 and 80% power, obtaining results that are 97.5% replicable, or running six times as many small studies with α = 0.05 and 40% power, obtaining results that are 67% replicable. It is debatable whether choosing the option producing higher replicability would necessarily result in the fastest scientific progress.”
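The quoted replicability figures can be reproduced if “replicable” is read as the positive predictive value, that is, the probability that a significant result reflects a true effect. This reading is my assumption about the quoted numbers, not a definition Miller and Ulrich give in the passage.

def ppv(base_rate, power, alpha):
    """Positive predictive value: share of significant results that reflect true effects."""
    true_pos = base_rate * power
    false_pos = (1 - base_rate) * alpha
    return true_pos / (true_pos + false_pos)

print(f"alpha = .005, 80% power: {ppv(0.20, 0.80, 0.005):.1%}")  # approx. 97.6%
print(f"alpha = .05,  40% power: {ppv(0.20, 0.40, 0.05):.1%}")   # approx. 66.7%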

Fortunately, we have a real example of scientific progress to counter Miller and Ulrich’s claim that fast science leads to faster scientific progress. The lesson comes from molecular genetics. When it became possible to measure variability in the human genome, researchers were quick to link variation in single candidate genes to variation in phenotypes. This candidate-gene research produced many significant results. However, unlike psychology journals, journals in this area of research also published replication failures, and it became clear that the discoveries often could not be replicated. The entire candidate-gene approach has since been replaced by collaborative projects that rely on very large data sets and many genetic predictors to find relationships. Most important, the field reduced the criterion for significance from .05 to .00000005 (5 × 10^-8) to increase the ratio of true positives to false positives. The need for large samples slows down this research, but at least this approach has produced some solid findings.
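A rough calculation shows why the stricter criterion matters when the base rate of true effects is tiny, as it is when millions of genetic variants are tested. The base rate and power used below are illustrative assumptions, not estimates from the genetics literature.

def positives_per_million(base_rate, power, alpha):
    """Expected true and false positives per one million tested associations."""
    true_pos = 1_000_000 * base_rate * power
    false_pos = 1_000_000 * (1 - base_rate) * alpha
    return true_pos, false_pos

for alpha in (0.05, 5e-8):
    tp, fp = positives_per_million(base_rate=0.001, power=0.8, alpha=alpha)
    print(f"alpha = {alpha:g}: ~{tp:,.0f} true vs ~{fp:,.2f} false positives per million tests")

Under these assumptions, false positives swamp true positives at alpha = .05, whereas at 5 × 10^-8 the expected number of false positives per million tests drops to a small fraction of one.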

In conclusion, Miller and Ulrich pretend to engage in a scientific investigation of scientific practices and a reasonable discussion of their advantages and disadvantages. In reality, they are gaslighting their readers and fail to point out a simple truth about science: science is built on trust, and trust requires honest and trustworthy behavior. The replication crisis in psychology has revealed that psychological science is not trustworthy because researchers use shady practices to support their cherished theories. While they pretend to subject their theories to empirical tests, the tests are a sham, rigged in their favor. The researcher always wins because researchers control which results are published. As long as these shady practices persist, psychology is not a science.

Miller and Ulrich disguise this fact in a seemingly scientific discussion of trade-offs, but there is no trade-off between honesty and lying in science. Only scientists who report all of their data and analysis decisions can be trusted. This seems obvious to most consumers of science, but it is not. Psychological scientists who are fed up with the dishonest reporting of results in psychology journals created the term Open Science to call for transparent reporting and open sharing of data, but these aspects of science are integral to the scientific method. There is no such thing as closed science, where researchers go to their lab, present a gold nugget, and claim to have created it there. Without open and transparent sharing of the method, nobody should believe them. The same is true for contemporary psychology. Given the widespread use of shady practices, it is necessary to be skeptical and to demand evidence that shady practices were not used.

It is also important to question the claims of meta-psychologists. Do you really think it is OK to use shady practices because they can produce significant results when the nil-hypothesis is false? This is what Miller and Ulrich want you to believe. If you see a problem with this claim, you may wonder what other claims are questionable and not in the best interest of science and of consumers of psychological research. In my opinion, there is no trade-off between honest and dishonest reporting of results. One is science, the other is pseudo-science. But hey, that is just my opinion and the way the real sciences work. Maybe psychological science is special.