Estimating Reproducibility of Psychology (No. 43): An open, post-publication review


In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies to determine why a particular result could or could not be replicated. The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors, so each study may have a different explanation. This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original study. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Summary of Original Article

This post examines the reproducibility of article No. 43 “The rejection of moral rebels: Resenting those who do the right thing” by Monin, Sawyer, and Marquez (JPSP, 2008).


Four studies document the rejection of moral rebels. In Study 1, participants who made a counterattitudinal speech disliked a person who refused on principle to do so, but uninvolved observers preferred this rebel to an obedient other. In Study 2, participants taking part in a racist task disliked a rebel who refused to go along, but mere observers did not. This rejection was mediated by the perception that rebels would reject obedient participants (Study 3), but did not occur when participants described an important trait or value beforehand (Study 4). Together, these studies suggest that rebels are resented when their implicit reproach threatens the positive self-image of individuals who did not rebel.

The main conclusion of the article is that moral rebels are resented by those who hold the same moral values, but did not rebel.

Four social psychological experiments are used to provide empirical support for this hypothesis.  Thus, the original article already contains one original study and three replication studies.

Study 1

Study 1 used the induced compliance paradigm (Galinsky, Stone, & Cooper, 2000; Zanna & Cooper, 1974). Presumably, earlier studies that used this paradigm showed that it is effective in inducing dissonance by having participants write an essay that goes against their own beliefs. Importantly, participants are not forced to do so, but merely comply with a request.

An analogy could be a request by an editor to remove a study with imperfect results from an article. The scientist has the option to reject this request because it violates her sense of scientific standards, but she may also comply to get a publication. This internal conflict is called cognitive dissonance.

After the experimental manipulation of cognitive dissonance, participants were asked to make personality ratings of another participant who either refused to comply with the request (rebel) or complied (compliant). There was also a group of control participants without the dissonance induction. This creates a 2 x 2 between-subject (BS) design with induction (yes/no) and rebel target (yes/no) as predictor variables. The outcomes were personality and liking ratings of the targets.

There were 70 participants, but 10 were eliminated because they were suspicious.  So, there were 60 participants for 4 experimental conditions (average cell size n = 15).

The data analysis revealed a significant cross-over interaction, F(1, 56) = 11.00.   Follow-up analysis showed that observers preferred the rebel (M = 0.90, SD = 0.87) to the obedient target (M = 0.07, SD = 1.51), t(56) = 2.22, p = .03, d = 0.72.

Most important, actors preferred the obedient other (M = 0.59, SD = 0.70) to the rebel (M = 0.50, SD = 1.21), t(56) = 2.46, p = .02, d = 1.19.

Although this reported result appears to support the main hypothesis, it raises concerns about the way the results in Study 1 were obtained. The reason is that a 2 x 2 BS design essentially combines two experiments in one. One experiment is conducted without dissonance induction. The other experiment is conducted with dissonance induction. Each experiment is analyzed separately, and the two tests produced p = .03 and p = .02. The problem is that it is unlikely to obtain two p-values that are so similar in two independent studies. Using the Test of Insufficient Variance (TIVA), we can quantify the probability of this event under the null-hypothesis that a researcher simply conducted two studies and found two similar p-values just by chance.

TIVA converts p-values into z-values. The sampling distribution of z-values has a variance of 1. So, we can compare the observed variance of the z-values corresponding to the p-values to an expected variance of 1. For p = .03 and .02, the observed variance is Var(z) = 0.025. With one degree of freedom, the probability of observing a variance of 0.025 or less is pchisq(0.025, 1) = .125. This means we would expect two such similar, just significant results in independent post-hoc tests in about 1 out of 8 attempts. The results in Study 1 are therefore somewhat unusual or surprising, but this pattern is not rare enough to rule out a chance finding.
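The TIVA calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for this post: the function names (t_two_sided_p, chi2_cdf, tiva) are made up for the example, the p-values are recomputed from the reported t-values and degrees of freedom rather than taken from the rounded values, and the chi-square test is assumed to use the sample variance multiplied by k - 1 with k - 1 degrees of freedom. The chi2_cdf helper plays the role of R's pchisq and uses a series that is adequate for the small values that arise here.

```python
import math
from statistics import NormalDist, variance


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value of a t statistic, by Simpson integration of the t density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    h = abs(t) / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(abs(t))
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 1.0 - 2.0 * (total * h / 3)  # 1 - P(-|t| < T < |t|)


def chi2_cdf(x, df, terms=200):
    """Left-tail chi-square probability (series for the regularized incomplete gamma)."""
    if x <= 0:
        return 0.0
    a, y = df / 2, x / 2
    term = total = 1.0 / a
    for n in range(1, terms):
        term *= y / (a + n)
        total += term
    return total * math.exp(a * math.log(y) - y - math.lgamma(a))


def tiva(p_values):
    """Return (variance of z-values, left-tail probability of that variance)."""
    z = [NormalDist().inv_cdf(1 - p / 2) for p in p_values]
    k = len(z)
    var_z = variance(z)  # sample variance, k - 1 in the denominator
    return var_z, chi2_cdf(var_z * (k - 1), k - 1)


# Study 1 follow-up tests as reported: t(56) = 2.22 and t(56) = 2.46
p_study1 = [t_two_sided_p(2.22, 56), t_two_sided_p(2.46, 56)]
var_z, prob = tiva(p_study1)
print(f"Var(z) = {var_z:.3f}, left-tail probability = {prob:.3f}")
```

Run on the two Study 1 tests, the sketch should land close to the Var(z) and probability reported above. Feeding it the rounded p-values (.03 and .02) instead gives a somewhat smaller variance, so the result is sensitive to rounding.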

Study 2

Study 2 tested whether the results from Study 1 generalize to racism. Participants were asked to imagine that they are a detective and they were given three suspects who might have committed a burglary. The information implicated the African American suspect. Participants all chose the African American target. For the rebel manipulation participants were asked to make ratings of somebody who had made the same choice or a rebel. The rebel refused to identify the African American suspect. “I refuse to make a choice here—this task is obviously biased. . . . Offensive to make black man the obvious suspect. I refuse to play this game.”

56 participants took part in the study, but some had to be excluded (e.g. picked a White suspect), leaving  49 participants for the data analysis (n = 12 per cell).

The interaction effect in the ANOVA was again significant, F(1, 45) = 4.38, p = .04.

The test of the key hypothesis showed the expected mean differences, but it was not significant, (obedient: M = 0.50, SD = 1.34; rebel M = -0.67, SD = 2.03), t(45) = 1.68, p = .10, d = 0.71.  However, p-values in the range between .10 and .05 are often called marginally significant and interpreted as support for a hypothesis.  Thus, Study 2 appears to provide weak support for the main claim of the article.

The comparison of rebels and obedient targets in the observer condition was also not significant (obedient target: M = 0.12, SD = 1.71; rebel: M = 0.98, SD = 1.64), t(45) = 1.27, p = .21, d = 0.53.

TIVA for all four p-values from Study 1 and Study 2 shows more variance, var(z) = 0.39, but with 3 df, it is even less likely to observe such similar p-values in four independent studies, p = .033.

Most important, the OSC replication study used the detective task, and the original Study 2 already failed to show a significant effect for the key comparison with this task. Thus, obtaining a non-significant result in the OSC replication study is not entirely surprising because the original article also reported a non-significant result.

Study 3 

Study 3 is a replication and extension of Study 2. It replicated the detective task and it extended Study 2 by testing mediation.   The benefit of testing mediation was that the sample size increased to 132 participants (all male).

Study 3 modified four aspects of Study 2.

1. Study 2 asked observers to do the task themselves after rating the target. This serves no purpose and was dropped from Study 3.

2. Participants were explicitly informed that the target they were rating was a White male.

3. The study used only male participants (not clear why).

4. The study included additional questions to test mediation. These questions were asked after the ratings of the target and therefore also do not change the experiment from Study 2.

So the only difference was that participants were only males and they were told that they were rating the personality of a White male rebel or conformist.

Two participants expressed suspicion and 13 picked a White suspect, reducing the final sample size to N = 117.

The results were as expected. The interaction was significant, F(1, 113) = 5.58, p = .02. More important, the follow-up test showed that participants in the dissonance condition preferred the conformist  (M = 1.63, SD = 1.15) to a rebel (M = 0.53, SD = 2.27), t(113) = 2.33, p = .02, d = 0.61.  There was no significant difference for observers, t(113) = .98, p = .33, d = 0.27.

Even with the non-significant p-value of .33, the variance of the z-values across the three studies remains unusually low, var(z) = 0.35, p = .049. It is also surprising that the much larger sample size in Study 3 did not produce stronger evidence for the main hypothesis.

Study 4

Study 4 is crucial because this is the study that the OSC project attempted to replicate. The focus is on Study 4 because the OSC sampling plan asked replicators to focus on the last study. One problem with this sampling approach is that the last study may differ from the other studies in a multiple-study article.

Like Study 2, Study 4 used male (N = 52) and female (N = 27) participants.  The novel contribution of Study 4 was the addition of a third condition called affirmation.  The study did not include a control condition.  For the replication part of the study, 19 participants judged an obedient target, and 29 judged a rebel.

The results showed a significant interaction effect, F(2,64) = 10.17, p = .0001.  The difference in ratings of obedient and rebel targets was significant and large, t(48) = 3.24, p = .001, d = .96.   The difference was even larger in comparison to the self-affirmation condition, t(48) = 4.39, p = .00003, d = 1.30.


The replication study was carried out by Taylor Holubar. The replication author used the strong results of Study 4 to conduct a power analysis and concluded that only n = 18 participants per cell (N = 54 in total) were needed to have 95% power to replicate the results of Study 4. This power analysis overlooks that the replication part of Study 4 produced larger effect sizes than the previous two studies. Even without this concern, it is questionable to use observed results to plan replication studies because observed effects are influenced by sampling error. A between-subject study should have a minimum of n = 20 participants per condition (Simmons et al., 2011). There is also no reason to reduce sample sizes when the replication study is conducted on Mturk, which makes it possible to recruit large samples quickly.

Another concern is that a replication study on Mturk may produce different results than a study with face to face contact between an experimenter and a participant.

The initial Mturk sample consisted of 117 participants. After exclusion of participants for various reasons, the final sample size was N = 75, thus higher than the power analysis suggested. Nevertheless, the study failed to replicate the significant ANOVA result of the original study, F(2,72) = 1.97, p = .147. On this basis, the replication was classified as a failure.

However, the comparison of the obedient and rebel condition showed the difference that was observed in the original article and the effect size was similar to the effect size in Study 2 and 3 (obedient M = 0.98, SD = 1.20; rebel M = 0.27, SD = 1.72), t(48) = 1.69, p = .097.

The result falls short of the criterion for statistical significance, but the problem is that the replication study had low power. The power analysis of the replication study used an unusually large effect size in Study 4.
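To illustrate how strongly the planned power depends on which original effect size is used, here is a rough sketch. It assumes a simple two-cell comparison with equal cell sizes and uses a normal approximation to the t-test, which tends to be slightly optimistic in small samples. It is not the power analysis from the replication report, which was based on the ANOVA result; the function name approx_power is made up for this example, and the effect sizes are the d values reported above for the actor comparisons in Studies 2 to 4.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_cell, alpha=0.05):
    """Approximate two-sided power for comparing two independent cells of equal size."""
    norm = NormalDist()
    ncp = d * sqrt(n_per_cell / 2)        # expected z-value of the mean difference
    z_crit = norm.inv_cdf(1 - alpha / 2)  # two-sided significance criterion
    # probability of a significant result in either direction
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)


# d values as reported in the post for obedient vs. rebel ratings by actors
for label, d in [("Study 4", 0.96), ("Study 2", 0.71), ("Study 3", 0.61)]:
    power = approx_power(d, 18)
    print(f"{label}: d = {d:.2f}, approximate power with n = 18 per cell = {power:.2f}")
```

Under these assumptions, 18 participants per cell fall well short of 95% power for the obedient versus rebel comparison, even with the Study 4 effect size, and the shortfall is larger for the effect sizes from Studies 2 and 3.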


Monin wrote a response to the replication failure. Monin pointed out that he was not consulted and never approved the replication design. He also pointed out that consultation would have been easy because the replication author and he were both at Stanford.

Monin expresses serious concerns about the self-affirmation manipulation in the replication study. “The methods differed in important ways from the original lab study (starting with transferring it online), yet the replicators describe their methods as “virtually identical to the original”… The self affirmation manipulation was changed from an 8-minute-in-lab-paper-and-pencil essay to a short online question.”

Given this criticism, it seems problematic to consider the failure to produce a self-affirmation effect as crucial for a successful replication. The key finding in the article was that moral rebels are rated less favorably by individuals in the same situation who comply. While the replication study failed to show a significant effect for this test as well, this was partially due to the reliance on the unusually strong effect size in Study 4 to plan the sample size for the replication study. At the very least, it has to be noted that the replication study did not have 95% power as the authors assumed.

Prediction of Replication Outcome

The strong result in Study 4 alone leads to the prediction of a successful replication outcome (Replicability Index = 0.74; a value > .50 means success is more likely than failure). However, taking the results of Studies 2 and 3 into account leads to the prediction that an exact replication with similar sample size would not replicate.

Study      Obs.Power  Success  Inflation  R-Index
Study 4    0.87       1.00     0.13       0.74
Study 3    0.63       1.00     0.37       0.26
Study 2    0.50*      1.00*    0.50       0.00
Combined   0.63       1.00     0.37       0.26

* using p < .10 as criterion because the marginally significant result was treated as a success.
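The rows of the table are consistent with a simple reading of the R-Index: inflation is the success rate minus observed power, and the R-Index is observed power minus inflation. The sketch below spells this out in Python. The function names are invented for this illustration, and observed_power is a one-tailed normal approximation that ignores the negligible probability of a significant result in the wrong direction; it is included mainly to show why a result that sits exactly on the criterion has an observed power of .50.

```python
from statistics import NormalDist


def observed_power(p, alpha=0.05):
    """Observed power implied by a two-sided p-value (normal approximation)."""
    norm = NormalDist()
    z_obs = norm.inv_cdf(1 - p / 2)       # z-value matching the observed p-value
    z_crit = norm.inv_cdf(1 - alpha / 2)  # z-value matching the criterion
    return norm.cdf(z_obs - z_crit)


def r_index(obs_power, success_rate):
    """R-Index = observed power minus inflation (success rate - observed power)."""
    inflation = success_rate - obs_power
    return obs_power - inflation


# observed power values copied from the table; success rate is 1.00 in every row
table_power = {"Study 4": 0.87, "Study 3": 0.63, "Study 2": 0.50}
for study, power in table_power.items():
    print(f"{study}: inflation = {1.0 - power:.2f}, R-Index = {r_index(power, 1.0):.2f}")

# a p-value exactly at the criterion (here .10) implies 50% observed power
print(f"{observed_power(0.10, alpha=0.10):.2f}")
```

Because the success rate is 1.00 in every row, the R-Index here reduces to twice the observed power minus one, which is why an observed power of .50 yields an R-Index of zero.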


Neither the original results nor the replication study are flawless. The original article reported results that are unlikely without the use of some questionable research practices to produce just significant results. The replication study failed to replicate the effect, but a slightly larger sample might have produced a significant result. It would not be surprising if another replication study with N = 100 (n = 50 per cell; 80% power with d = .4) would produce a significant result.

At the same time, the key hypothesis of the article remains to be demonstrated.  Moreover, the results are limited to a single paradigm with a hypothetical decision in a detective game. Hopefully, future studies can learn from this discussion of the original and replication study to plan better studies that can produce more conclusive evidence.

Another interesting question could be how moral rebels evaluate targets who are obedient or rebels. My prediction is that rebels will show a preference for rebels and a dislike of obedient individuals.

2 thoughts on “Estimating Reproducibility of Psychology (No. 43): An open, post-publication review”

  1. “Another concern is that a replication study on Mturk may produce different results than a study with face to face contact between an experimenter and a participant.”

    Although this may be true and might be an interesting difference between studies that may inform follow-up studies, the original theory—as stated in the article—most likely does not define “face to face contact” as a necessary pre-condition for the effect. I therefore do not think that this is a good argument against the method used in the replication.
