Category Archives: Estimating Reproducibility

Estimating the Replicability of Psychological Science

Over the past several years, psychologists have become increasingly concerned about the credibility of published results. The credibility crisis started in 2011, when Bem published incredible results that seemed to suggest that humans can foresee random future events. Bem’s article revealed fundamental flaws in the way psychologists conduct research. The main problem is that psychology journals only publish statistically significant results (Sterling, 1959). If only significant results are published, all hypotheses will receive empirical support as long as they are tested. This is akin to claiming that everybody has a 100% free-throw percentage, or that nobody ever makes a mistake, if we simply do not count the failures.

The main problem of selection for significance is that we do not know the real strength of evidence that empirical studies provide. Maybe the selection effect is small and most studies would replicate. However, it is also possible that many studies might fail a replication test. Thus, the crisis of confidence is a crisis of uncertainty.

The Open Science Collaboration conducted actual replication studies to estimate the replicability of psychological science. They attempted to replicate 97 studies with statistically significant results and were able to reproduce 35 significant results (a 36% success rate). This is a shockingly low success rate. Based on this finding, most published results cannot be trusted, especially because there is heterogeneity across studies. Some studies would have an even lower chance of replication, and several studies might even be outright false positives (there is actually no real effect).

As important as this project was to reveal major problems with the research culture in psychological science, there are also some limitations that cast doubt on the 36% estimate as a valid estimate of the replicability of psychological science. First, the sample size is small, and sampling error alone might have led to an underestimation of replicability in the population of studies. However, sampling error could also have produced a positive bias. Another problem is that most of the studies focused on social psychology and that replicability in social psychology could be lower than in other fields. In fact, a moderator analysis suggested that the replication rate in cognitive psychology is 50%, while the replication rate in social psychology is only 25%. The replicated studies were also limited to a single year (2008) and three journals. It is possible that the replication rate has increased since 2008 or could be higher in other journals. Finally, there have been concerns about the quality of some of the replication studies. These limitations do not undermine the importance of the project, but they do imply that the 36% figure is an estimate and that it may underestimate the replicability of psychological science.

Over the past years, I have been working on an alternative approach to estimate the replicability of psychological science. This approach starts with the simple fact that replicability is tightly connected to the statistical power of a study because statistical power determines the long-run probability of producing significant results (Cohen, 1988). Thus, estimating statistical power provides valuable information about replicability. Cohen (1962) conducted a seminal study of statistical power in social psychology. He found that the average power to detect an average effect size was around 50%. This is the first estimate of the replicability of psychological science, although it was based on only one journal and limited to social psychology. However, subsequent studies replicated Cohen’s findings and found similar results over time and across journals (Sedlmeier & Gigerenzer, 1989). It is noteworthy that the 36% estimate from the OSC project is not statistically different from Cohen’s estimate of 50%. Thus, there is convergent evidence that replicability in social psychology is around 50%.

In collaboration with Jerry Brunner, I have developed a new method that can estimate mean power for a set of studies that are selected for significance and that vary in effect sizes and sample sizes, which produces heterogeneity in power (Brunner & Schimmack, 2018). The inputs for this method are the actual test statistics of significance tests (e.g., t-tests, F-tests). These test statistics are first converted into two-tailed p-values and then into absolute z-scores. The magnitude of these absolute z-scores provides information about the strength of evidence against the null-hypotheses. The histogram of these z-scores, called a z-curve, is then used to fit a finite mixture model to the data that estimates mean power, while taking selection for significance into account. Extensive simulation studies demonstrate that z-curve performs well and provides better estimates than alternative methods. Thus, z-curve is the method of choice for estimating the replicability of psychological science on the basis of the test statistics that are reported in original articles.
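
For readers who want to see what the conversion step looks like, here is a minimal R sketch under simple assumptions; the example test statistics are hypothetical, and the mixture-model estimation itself is done by the z-curve software and is not shown.

# Minimal sketch of the input conversion for z-curve (example values are hypothetical)
t_vals <- c(2.22, 2.46); t_df <- c(56, 56)             # example t-tests
F_vals <- c(11.00);      F_df2 <- c(56)                # example F-tests with 1 numerator df
p <- c(2 * pt(abs(t_vals), t_df, lower.tail = FALSE),  # two-tailed p-values for the t-tests
       pf(F_vals, 1, F_df2, lower.tail = FALSE))       # p-values for the F-tests
z <- qnorm(1 - p / 2)                                  # absolute z-scores (strength of evidence)
z_sig <- z[z > qnorm(.975)]                            # only significant results enter the z-curve model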

For this blog post, I am reporting preliminary results from a large project that extracts focal hypotheses from a broad range of journals covering all areas of psychology for the years 2010 to 2017. The hand-coding of these articles complements a similar project that relies on automatic extraction of test statistics (Schimmack, 2018).

Table 1 shows the journals that have been coded so far. It also shows the replicability estimates based on the automated method and on hand-coding of focal hypotheses.

Journal                                           Hand   Automated
Psychophysiology                                    84      75
Journal of Abnormal Psychology                      76      68
Journal of Cross-Cultural Psychology                73      77
Journal of Research in Personality                  68      75
J. Exp. Psych: Learning, Memory, & Cognition        58      77
Journal of Experimental Social Psychology           55      62
Infancy                                             53      68
Behavioral Neuroscience                             53      68
Psychological Science                               52      66
JPSP-Interpersonal Relations & Group Processes      33      63
JPSP-Attitudes and Social Cognition                 30      65
Mean                                                58      69

Hand-coding of focal hypotheses produces lower estimates than the automated method because the automated analysis also codes manipulation checks and other highly significant results that are not theoretically important. The correlation between the two methods, r = .67, shows that they are nevertheless consistent. Finally, the mean for the automated method, 69%, is close to the mean of 72% for over 100 journals, suggesting that the journals coded so far are an unbiased sample.

The hand-coding results also confirm the finding from the automated method that social psychology has lower replicability than some other disciplines. Thus, the OSC reproducibility results, which are largely based on social psychology, should not be used to make claims about psychological science in general.

The figure below shows the output of the latest version of z-curve. The first finding is that the replicability estimate for all 1,671 focal tests is 56%, with a relatively tight confidence interval ranging from 45% to 56%. The next finding is that the discovery rate or success rate is 92%, using p < .05 as the criterion. This confirms that psychology journals continue to publish results that are selected for significance (Sterling, 1959). The histogram further shows that even more results would be significant if p-values below .10 were included as evidence for “marginal significance.”

Z-Curve.19.1 also provides an estimate of the size of the file drawer. It does so by projecting the distribution of observed significant results into the range of non-significant results (grey curve). The file-drawer ratio shows that for every published result, we would expect roughly two unpublished studies with non-significant results. However, z-curve cannot distinguish between different questionable research practices. Rather than withholding entire failed studies, researchers may withhold alternative statistical analyses within a published study in order to report significant results.
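
As a rough illustration of how a file-drawer ratio of about two follows from a low discovery rate, here is a small R sketch; the discovery-rate value of one third is a hypothetical illustration, not an estimate reported in this post.

# Hypothetical illustration: file-drawer ratio implied by an estimated discovery rate (EDR)
file_drawer_ratio <- function(edr) (1 - edr) / edr   # unpublished non-significant studies per significant one
file_drawer_ratio(1/3)                               # an EDR of about 33% implies a ratio of about 2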

Z-Curve.19.1 also provides an estimate of the false discovery rate (FDR). The FDR is the percentage of significant results that may arise from testing a true nil-hypothesis, where the population effect size is zero. For a long time, the consensus has been that false positives are rare because the nil-hypothesis is rarely true (Cohen, 1994). Consistent with this view, Soric’s estimate of the maximum false discovery rate is only 10%, with a tight CI ranging from 8% to 16%.
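
For readers who want to see the logic behind Soric’s bound, here is a minimal R sketch; it assumes alpha = .05, and the discovery-rate value is again a hypothetical illustration rather than an estimate reported in this post.

# Soric's maximum false discovery rate for a given estimated discovery rate (alpha = .05 assumed)
soric_fdr <- function(edr, alpha = .05) (1 / edr - 1) * alpha / (1 - alpha)
soric_fdr(.35)   # an EDR around 35% would imply a maximum FDR of roughly 10%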

However, the focus on the nil-hypothesis is misguided because it treats tiny deviations from zero as true hypotheses even if the effect size has no practical or theoretical significance. These effect sizes also lead to low power and replication failures. Therefore, Z-Curve.19.1 also provides an estimate of the FDR that treats studies with very low power as false positives. This broader definition of false positives raises the FDR estimate slightly, but 15% is still a low percentage. Thus, the modest replicability of results in psychological science is mostly due to low statistical power to detect true effects rather than a high number of false positive discoveries.

The reproducibility project showed that studies with low p-values were more likely to replicate. This relationship follows from the influence of statistical power on p-values and replication rates. To achieve a replication rate of 80%, p-values had to be less than .00005 or the z-score had to exceed 4 standard deviations. However, this estimate was based on a very small sample of studies. Z-Curve.19.1 also provides estimates of replicability for different levels of evidence. These values are shown below the x-axis. Consistent with the OSC results, a replication rate over 80% is only expected once z-scores are greater than 4.

The results also provide information about the choice of the alpha criterion for drawing inferences from significance tests in psychology. To do so, it is important to distinguish observed p-values from type-I error probabilities. For a single unbiased test, we can infer from an observed p-value less than .05 that the risk of a false positive result is less than 5%. However, when multiple comparisons are made or results are selected for significance, an observed p-value less than .05 does not imply that the type-I error risk is below .05. To claim a type-I error risk of 5% or less, we have to correct the observed p-values, much like a Bonferroni correction. As 50% power corresponds to the boundary of statistical significance, we see that z-scores between 2 and 3 are not statistically significant in this sense; that is, the type-I error risk is greater than 5%. Thus, the criterion to claim significance with alpha = .05 becomes a p-value of .003. Given the popularity of .005, I suggest using p = .005 as the criterion for statistical significance. However, this is not the same as lowering the criterion for statistical significance, because p < .005 still only allows the claim that the type-I error probability is less than 5%. The need for a lower criterion value stems from the inflation of the type-I error rate due to selection for significance. This is a novel argument that has been overlooked in the significance wars, which ignored the influence of publication bias on false positive risks.
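
To make the correspondence between these p-value thresholds and z-scores concrete, the two-tailed conversions in R are:

# Two-tailed p-value thresholds expressed as z-scores
qnorm(1 - .05 / 2)    # ~1.96, the conventional significance criterion
qnorm(1 - .005 / 2)   # ~2.81
qnorm(1 - .003 / 2)   # ~2.97, i.e., roughly z = 3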

Finally, z-curve.19.1 makes it possible to examine the robustness of the estimates by using different selection criteria. One problem with selection models is that p-values just below .05, say in the .01 to .05 range, can arise from various questionable research practices that have different effects on replicability estimates. To address this problem, it is possible to estimate the density with a different selection criterion, while still estimating replicability with alpha = .05 as the criterion. Figure 2 shows the results of using only z-scores greater than 2.5 (p = .012) to fit the observed z-curve.

The blue dashed line at z = 2.5 shows the selection criterion. The grey curve between 1.96 and 2.5 is projected from the distribution of z-scores greater than 2.5. The projection shows a close fit with the observed distribution. As a result, the parameter estimates are also very similar. Thus, the results are robust and the selection model seems reasonable.

Conclusion

Psychology is in a crisis of confidence about the credibility of published results. The fundamental problems are as old as psychology itself. For decades, psychologists have conducted low-powered studies and selected only the studies that worked (Cohen, 1962; Sterling, 1959). However, awareness of these problems has increased in recent years. Like many crises, the confidence crisis in psychology has created confusion. Psychologists are aware that there is a problem, but they do not know how large the problem is. Some psychologists believe that there is no crisis and pretend that most published results can be trusted. Others worry that most published results are false positives. Meta-psychologists aim to reduce the confusion among psychologists by applying the scientific method to psychological science itself.

This blog post provided the most comprehensive assessment of the replicability of psychological science so far. The evidence is largely consistent with previous meta-psychological investigations. First, replicability is estimated to be slightly above 50%. However, replicability varies across disciplines, and the replicability of social psychology is below 50%. The fear that most published results are false positives is not supported by the data. Replicability increases with the strength of evidence against the null-hypothesis. If the p-value is below .00001, studies are likely to replicate. However, significant results with p-values above .005 should not be considered statistically significant with an alpha level of 5%, because selection for significance inflates the type-I error risk. Only studies with p < .005 can claim statistical significance with alpha = .05.

The correction for publication bias implies that researchers have to increase sample sizes to meet the more stringent p < .005 criterion. However, a better strategy is to preregister studies to ensure that reported results can be trusted. In this case, p-values below .05 are sufficient to demonstrate statistical significance with alpha = .05. Given the low prevalence of false positives in psychology, I see no need to lower the alpha criterion.

Future Directions

This blog post is just an interim report. The final project requires hand-coding of a broader range of journals. Readers who think that estimating the replicability of psychological science is beneficial and who want information about a particular journal are invited to collaborate on this project and can obtain authorship if their contribution is substantial enough to warrant it. Although coding is a substantial time commitment, it does not require the participants or materials that are needed for actual replication studies. Please contact me if you are interested and want to know how you can get involved.

Estimating Reproducibility of Psychology (No. 43): An open, post-publication review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology. One key finding was that out of 97 attempts to reproduce a significant result, only 36% succeeded.

This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies to understand why a particular result could or could not be replicated. The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation. This makes it important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original study. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Summary of Original Article

This post examines the reproducibility of article No. 43, “The rejection of moral rebels: Resenting those who do the right thing,” by Monin, Sawyer, and Marquez (JPSP, 2008).

Abstract:

Four studies document the rejection of moral rebels. In Study 1, participants who made a counterattitudinal speech disliked a person who refused on principle to do so, but uninvolved observers preferred this rebel to an obedient other. In Study 2, participants taking part in a racist task disliked a rebel who refused to go along, but mere observers did not. This rejection was mediated by the perception that rebels would reject obedient participants (Study 3), but did not occur when participants described an important trait or value beforehand (Study 4). Together, these studies suggest that rebels are resented when their implicit reproach threatens the positive self-image of individuals who did not rebel.

The main conclusion of the article is that moral rebels are resented by those who hold the same moral values, but did not rebel.

Four social psychological experiments are used to provide empirical support for this hypothesis.  Thus, the original article already contains one original study and three replication studies.

Study 1

Study 1 used the induced compliance paradigm (Galinsky, Stone, & Cooper, 2000; Zanna & Cooper, 1974). Presumably, earlier studies that used this paradigm showed that it is effective in inducing dissonance by having participants write an essay that goes against their own beliefs. Importantly, participants are not forced to do so, but merely comply with a request.

An analogy could be a request by an editor to remove a study with imperfect results from an article. The scientist has the option to reject this request because it violates her sense of scientific standards, but she may also comply to get a publication. This internal conflict is called cognitive dissonance.

After the experimental manipulation of cognitive dissonance, participants were asked to make personality ratings of another participant who refused to comply with the request (rebel) or who did not (compliant). There was also a group of control participants without the dissonance induction. This creates a 2 x 2 between-subject (BS) design with induction (yes/no) and rebel target (yes/no) as predictor variables. The outcomes were personality and liking ratings of the targets.

There were 70 participants, but 10 were eliminated because they were suspicious.  So, there were 60 participants for 4 experimental conditions (average cell size n = 15).

The data analysis revealed a significant cross-over interaction, F(1, 56) = 11.00.   Follow-up analysis showed that observers preferred the rebel (M = 0.90, SD = 0.87) to the obedient target (M = 0.07, SD = 1.51), t(56) = 2.22, p = .03, d = 0.72.

Most important, actors preferred the obedient other (M = 0.59, SD = 0.70) to the rebel (M = 0.50, SD = 1.21), t(56) = 2.46, p = .02, d = 1.19.

Although this reported result appears to support the main hypothesis, it also raises concerns about the way the results in Study 1 were obtained. The reason is that a 2 x 2 BS design essentially combines two experiments in one. One experiment is conducted without dissonance induction; the other is conducted with dissonance induction. For each experiment, the data are analyzed separately, and the results were p = .03 and p = .02. The problem is that it is unlikely to obtain two p-values that are so similar in two independent studies. We can quantify the probability of this event under the null-hypothesis that a researcher simply conducted two studies and found two similar p-values just by chance using the Test of Insufficient Variance (TIVA).

TIVA converts p-values into z-values. The sampling distribution of z-values has a variance of 1. So, we can compare the observed variance of the z-values corresponding to the p-values to an expected variance of 1. For p = .03 and .02, the observed variance is Var(z) = 0.025. With one degree of freedom, the probability of observing a variance of 0.025 or less is pchisq(0.025, 1) = .125. This means we would expect two just-significant results in independent post-hoc tests in about 1 out of 8 attempts. Thus, the results in Study 1 are unusual and surprising, but they could still be just a chance finding.
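
A minimal R sketch of this TIVA computation, assuming the two t-statistics reported above and converting them to two-tailed p-values first:

# TIVA sketch for the two focal tests of Study 1
t_vals <- c(2.22, 2.46); df <- 56
p <- 2 * pt(abs(t_vals), df, lower.tail = FALSE)   # two-tailed p-values (~.03 and ~.02)
z <- qnorm(1 - p / 2)                              # convert to absolute z-scores
v <- var(z)                                        # observed variance; expected variance is 1
pchisq((length(z) - 1) * v, df = length(z) - 1)    # probability of a variance this low or lower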

Study 2

Study 2 tested whether the results from Study 1 generalize to racism. Participants were asked to imagine that they were detectives who were given three suspects who might have committed a burglary. The information implicated the African American suspect, and all participants chose him. For the rebel manipulation, participants were then asked to rate either somebody who had made the same choice or a rebel who refused to identify the African American suspect: “I refuse to make a choice here—this task is obviously biased. . . . Offensive to make black man the obvious suspect. I refuse to play this game.”

56 participants took part in the study, but some had to be excluded (e.g., because they picked a White suspect), leaving 49 participants for the data analysis (n = 12 per cell).

The interaction effect in the ANOVA was again significant, F(1, 45) = 4.38, p = .04.

The test of the key hypothesis showed the expected mean difference, but it was not significant (obedient: M = 0.50, SD = 1.34; rebel: M = -0.67, SD = 2.03), t(45) = 1.68, p = .10, d = 0.71. However, p-values in the range between .05 and .10 are often called marginally significant and interpreted as support for a hypothesis. Thus, Study 2 appears to provide weak support for the main claim of the article.

The comparison of rebels and obedient targets in the observer condition was also not significant (obedient: M = 0.12, SD = 1.71; rebel: M = 0.98, SD = 1.64), t(45) = 1.27, p = .21, d = 0.53.

TIVA for all four p-values from Study 1 and Study 2 shows more variance, var(z) = 0.39, but with 3 degrees of freedom it is even less likely to observe such similar p-values in four independent studies, p = .033.

Most important, the OSC replication study used the detective task, and the original Study 2 failed to show a significant effect for this task. Thus, obtaining a non-significant result in the OSC replication study is not entirely surprising, because the original study also reported a non-significant result.

Study 3 

Study 3 is a replication and extension of Study 2. It replicated the detective task and it extended Study 2 by testing mediation.   The benefit of testing mediation was that the sample size increased to 132 participants (all male).

Study 3 modified four aspects of Study 2.

1. Study 2 asked observers to do the task themselves after rating the target. This serves no purpose and was dropped from Study 3.

2. Participants were explicitly informed that the target they were rating was a White male.

3. The study used only male participants (the reason for this is not clear).

4. The study included additional questions to test mediation. These questions were asked after the ratings of the target and therefore also do not change the experiment from Study 2.

So the only substantive differences were that all participants were male and that they were told they were rating the personality of a White male rebel or conformist.

Two participants expressed suspicion and 13 picked a White suspect, reducing the final sample size to N = 117.

The results were as expected. The interaction was significant, F(1, 113) = 5.58, p = .02. More important, the follow-up test showed that participants in the dissonance condition preferred the conformist  (M = 1.63, SD = 1.15) to a rebel (M = 0.53, SD = 2.27), t(113) = 2.33, p = .02, d = 0.61.  There was no significant difference for observers, t(113) = .98, p = .33, d = 0.27.

Even with the non-significant p-value of .33, the variance of p-values across the three studies remains unusually low, var(z) = 0.35, p = .049. It is also surprising that the much larger sample size in Study 3 did not produce stronger evidence for the main hypothesis.

Study 4

Study 4 is crucial because this is the study that the OSC project attempted to replicate. Study 4 was chosen because the OSC sampling plan called for replicating the last study of each article. One problem with this sampling approach is that the last study may differ from the other studies in a multiple-study article.

Like Study 2, Study 4 used male (N = 52) and female (N = 27) participants.  The novel contribution of Study 4 was the addition of a third condition called affirmation.  The study did not include a control condition.  For the replication part of the study, 19 participants judged an obedient target, and 29 judged a rebel.

The results showed a significant interaction effect, F(2,64) = 10.17, p = .0001.  The difference in ratings of obedient and rebel targets was significant and large, t(48) = 3.24, p = .001, d = .96.   The difference was even larger in comparison to the self-affirmation condition, t(48) = 4.39, p = .00003, d = 1.30.

REPLICATION STUDY

The replication study was carried out by Taylor Holubar. The authors used the strong results of Study 4 to conduct a power analysis and concluded that they needed only n = 18 participants per cell (N = 54 in total) to have 95% power to replicate the results of Study 4. This power analysis overlooks that the replication part of Study 4 produced larger effect sizes than the previous two studies. Even without this concern, it is questionable to use observed results to plan replication studies because observed effects are influenced by sampling error. A between-subject study should have a minimum of n = 20 participants per condition (Simmons et al., 2011). There is also no reason to reduce sample sizes when the replication study is conducted on Mturk, which makes it possible to recruit large samples quickly.
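
To illustrate how strongly the planned sample size depends on the assumed effect size, here is a hedged R sketch with power.t.test; the d values are taken from the studies discussed above, and I do not know the exact inputs the replication authors used.

# Per-cell n required for 95% power at alpha = .05 (two-sided, two-sample t-test)
power.t.test(delta = 1.30, sd = 1, sig.level = .05, power = .95)   # d = 1.30 (Study 4) -> roughly 16-17 per cell
power.t.test(delta = 0.71, sd = 1, sig.level = .05, power = .95)   # d = 0.71 (Study 2) -> roughly 50+ per cell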

Another concern is that a replication study on Mturk may produce different results than a study with face to face contact between an experimenter and a participant.

The initial Mturk sample consisted of 117 participants. After exclusion of participants for various reasons, the final sample size was N = 75, higher than the power analysis suggested. Nevertheless, the study failed to replicate the significant ANOVA result of the original study, F(2, 72) = 1.97, p = .147. This result was used to classify the replication study as a failure.

However, the comparison of the obedient and rebel conditions showed the difference that was observed in the original article, and the effect size was similar to the effect sizes in Studies 2 and 3 (obedient: M = 0.98, SD = 1.20; rebel: M = 0.27, SD = 1.72), t(48) = 1.69, p = .097.

The result falls short of the criterion for statistical significance, but the problem is that the replication study had low power because its power analysis relied on the unusually large effect size from Study 4.

RESPONSE TO REPLICATION ATTEMPT

Monin wrote a response to the replication failure. He pointed out that he was not consulted and never approved the replication design, and that consultation would have been easy because he and the replication author were both at Stanford.

Monin also expressed serious concerns about the self-affirmation manipulation in the replication study: “The methods differed in important ways from the original lab study (starting with transferring it online), yet the replicators describe their methods as ‘virtually identical to the original’… The self affirmation manipulation was changed from an 8-minute-in-lab-paper-and-pencil essay to a short online question.”

Given this criticism, it seems problematic to consider the failure to produce a self-affirmation effect as crucial for a successful replication. The key finding in the article was that moral rebels are rated less favorably by individuals in the same situation who comply. While the replication study failed to show a significant effect for this test as well, this was partially due to the reliance on the unusually strong effect size in Study 4 to plan the sample size of the replication study. At the very least, it has to be noted that the replication study did not have the 95% power that the authors assumed.

Prediction of Replication Outcome

The strong result in Study 4 alone leads to the prediction of a successful replication outcome (Replicability Index = 0.74; values above .50 imply that success is more likely than failure). However, taking the results of Studies 2 and 3 into account, as shown in the table below, leads to the prediction that an exact replication with a similar sample size would fail.

            Obs. Power   Success Rate   Inflation   R-Index
Study 4        0.87          1.00          0.13        0.74
Study 3        0.63          1.00          0.37        0.26
Study 2        0.50*         1.00*         0.50        0.00
Combined       0.63          1.00          0.37        0.26

* Using p < .10 as the criterion because the marginally significant result was treated as a success.
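
For transparency, here is a minimal R sketch of how the R-Index values in this table can be reproduced, assuming the R-Index is defined as observed power minus the inflation (the difference between the success rate and observed power):

# R-Index sketch: observed power minus inflation
r_index <- function(obs_power, success_rate) {
  inflation <- success_rate - obs_power   # success rates above observed power indicate selection
  obs_power - inflation
}
r_index(0.87, 1.00)   # Study 4 -> 0.74
r_index(0.63, 1.00)   # Study 3 and combined -> 0.26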

Conclusion

Neither the original results nor the replication study are flawless. The original article reported results that are unlikely without the use of some questionable research practices to produce just-significant results. The replication study failed to replicate the effect, but a slightly larger sample might have produced a significant result. It would not be surprising if another replication study with N = 100 (n = 50 per cell; 80% power with d = .4) produced a significant result.

At the same time, the key hypothesis of the article remains to be demonstrated.  Moreover, the results are limited to a single paradigm with a hypothetical decision in a detective game. Hopefully, future studies can learn from this discussion of the original and replication study to plan better studies that can produce more conclusive evidence.

Another interesting question could be how moral rebels evaluate targets who are obedient or rebels. My prediction is that rebels will show a preference for rebels and a dislike of obedient individuals.