Estimating Reproducibility of Psychology (No. 64): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology.  However, few detailed examinations of individual studies have asked why a particular result could or could not be replicated.  The main reason is probably that examining every study in detail is a daunting task. Another reason is that replication failures can have several causes, and each study may have a different explanation.  This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article.  These predictions will only be accurate if the replication studies were close replications of the original study.  Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Special Introduction

Article 68, “The Effect of Global Versus Local Processing Styles on Assimilation Versus Contrast in Social Judgment,” is no ordinary article.  The first author, Jens Forster, has been under investigation for scientific misconduct, and it is not clear whether the published results in some articles are based on real or fabricated data.  Some articles that build on the same theory and use methods similar to those in this article have been retracted.  Scientific fraud would be one reason why an original study cannot be replicated.

Summary of Original Article

The article uses the first author’s global/local processing style model (GLOMO) to examine assimilation and contrast effects in social judgment. The article reports five experiments that showed that processing styles elicited in one task can carry over to other tasks and influence social judgments.

Study 1 

This study was chosen for the replication project.

Participants were 88 students.  Processing styles were manipulated by projecting a city map on a screen and asking participants to focus either (a) on the broader shape of the city or (b) on specific details on the map. The study also included a control condition.  This task was followed by a scrambled sentence task with neutral or aggressive words.  The main dependent variables were aggression ratings in a person perception task.

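With 88 participants and six conditions, there are roughly 13 to 15 participants per condition (88 / 6 ≈ 14.7; the error degrees of freedom of the ANOVA reported below, 76, imply 82 analyzed participants, or about 13 to 14 per condition).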

The ANOVA results showed a highly significant interaction between the processing style and priming manipulations, F(2, 76) = 21.57, p < .0001.

We can think about the 2 x 3 design as three priming experiments for each of the three processing style conditions.

The global condition shows a strong assimilation effect (prime M = 6.53, SD = 1.21; no prime M = 4.15, SD = 1.25), t(26) = 5.10, p = .000007, d = 1.94.

In the control processing condition, priming also shows an assimilation effect: aggression ratings were higher after aggression priming (M = 5.63, SD = 1.25) than after nonaggression priming (M = 4.29, SD = 1.23), t(25) = 2.79, p = .007, d = 1.08.

The local processing condition shows a significant contrast effect: aggression ratings were lower after aggression priming (M = 2.86, SD = 1.15) than after nonaggression priming (M = 4.62, SD = 1.16), t(25) = 3.96, p = .0005, d = -1.52.
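
The effect sizes for the three simple effects can be checked against the reported means and standard deviations. The following is a minimal sketch, assuming equal group sizes so that the pooled SD is the root mean square of the two SDs; the function name cohens_d is illustrative and does not come from the article.

```python
import math

def cohens_d(m1, sd1, m2, sd2):
    """Standardized mean difference, assuming equal group sizes."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m1 - m2) / pooled_sd

# Means and SDs as reported for Study 1
# (aggression prime first, nonaggression prime second).
conditions = {
    "global": (6.53, 1.21, 4.15, 1.25),
    "control": (5.63, 1.25, 4.29, 1.23),
    "local": (2.86, 1.15, 4.62, 1.16),
}
for name, stats in conditions.items():
    print(f"{name}: d = {cohens_d(*stats):.2f}")
```

Because the reported means and SDs are rounded and the actual cell sizes were not exactly equal, the recomputed values may differ from the values in the text in the second decimal.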

Although the reported results appear to provide strong evidence, the extremely large effect sizes raise concern about the reported results.  After all, these are not the first studies that have examined priming effects on person perception.  The novel contribution was to demonstrate that these effects change (are moderated) as a function of processing styles.  What is surprising is that processing styles also appear to have magnified the typical effects without any theoretical explanation for this magnification.

The article was cited by Isbell, Rovenpor, and Lair (2016) because they used the map manipulation in combination with a mood manipulation. Their article reports a significant interaction between processing and mood, F(1,73) = 6.33, p = .014.  In the global condition, more abstract statements in an open-ended task were recorded in the angry mood condition, but the effect was not significant and much smaller than in Forster’s studies, F(1,73) = 3.21, p = .077, d = .55.  In the local condition, sad participants listed more abstract statements, but again the effect was not significant and smaller than in Forster et al.’s studies, F(1,73) = 3.20, p = .078, d = .67.  As noted before, these results are also questionable because it is unlikely to get p = .077 and p = .078 in two independent statistical tests.

In conclusion, the effect sizes reported by Forster et al. in Study 1 are unbelievable because they are much larger than can be expected.

Study 2

Study 2 was a replication and extension of a study by Mussweiler and Strack (2000). Participants were 124 students from the same population.  This study used a standard processing style manipulation (Navon, 1977) that presented global letters composed of several smaller, different letters (e.g., the letter E made up of several n’s).  The main dependent variables were judgments of drug use.  The design had two between-subject factors: 3 (processing styles) x 2 (high vs. low comparison standard). Thus, there were about 20 to 21 participants per condition.  The study also had a within-subject factor (subjective vs. objective rating).

The ANOVA shows a 3-way interaction, F(2, 118) = 5.51, p = .005.

Once more, the 3 x 2 design can be treated as 3 independent studies of comparison standards. Because subjective and objective ratings are not independent, I focus on the objective ratings that produced stronger effects.

In the global condition, the high standard produced higher reports of drug use than the low standard (M = 0.66, SD = 1.13 vs. M = -0.47, SD = 0.57), t(39) = 4.04, p = .0004, d = 1.26.

In the control condition, a similar pattern was observed but it was not significant (M = 0.07, SD = 0.79 vs. M = -0.45, SD = 0.98), t(39) = 1.87, p = .07, d = 0.58.

In the local condition, the pattern is reversed (M = -0.41, SD = 0.83 vs. M = 0.60, SD = 0.99), t(39) = 3.54, p = .001, d = -1.11.

As the basic paradigm was a replication of Mussweiler and Strack’s (2000) Study 4, it is possible to compare the effect sizes in this study with the effect size in the original study.   The effect size in the original study was d = .31; 95%CI = -0.24, 1.01.  The effect is not significant, but the interaction effect for objective and subjective judgments was, F(1,30) = 4.49, p = .04.  The effect size is comparable to the control condition, but the  effect sizes for the global and local processing conditions are unusually large.

Study 3

132 students from the same population took part in Study 3.  This study was another replication and extension of Mussweiler and Strack (2000).  In this study, participants made ratings of their athletic abilities.  The extension was to add a manipulation of time (imagine being in an athletic competition today or in one year).  The design was a 3 (temporal distance: distant future vs. near future vs. control) by 2 (high vs. low standard) between-subject design with objective vs. subjective ratings as a within-subject factor.

The three-way interaction was significant, F(2, 120) = 4.51, p = .013.

In the distant future condition, objective ratings were higher with the high standard than with the low standard (high M = 0.56, SD = 1.04; low M = -0.58, SD = 0.51), t(41) = 4.56, p = .0001, d = 1.39.

In the control condition,  objective ratings of athletic ability were higher after the high standard than after the low standard (high M = 0.36, SD = 1.08; low M = -0.36, SD = 0.77), t(38) = 2.44, p = .02, d = 0.77.

In the near condition, the opposite pattern was reported (high M = -0.35, SD = 0.33, vs. low M = 0.36, SD = 1.29), t(41) = 2.53; p = .02,  d = -.75.

In the original study by Mussweiler and Strack the effect size was smaller and not significant (high M = 5.92, SD = 1.88; low M = 4.89, SD = 2.37),  t(34) =  1.44, p = .15, d = 0.48.
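
The t-value and effect size for the original Mussweiler and Strack result can be recovered from the summary statistics quoted above. This is a rough sketch, assuming two equal groups of 18 participants (an assumption derived from df = 34, not a figure stated in the text); the function name t_from_summary is illustrative.

```python
import math

def t_from_summary(m1, sd1, m2, sd2, n_per_group):
    """Independent-samples t statistic from summary statistics (equal n)."""
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    standard_error = pooled_sd * math.sqrt(2 / n_per_group)
    return (m1 - m2) / standard_error

# Values as quoted above; n = 18 per group is an assumption
# derived from df = 34 (N = 36 split evenly).
t_value = t_from_summary(5.92, 1.88, 4.89, 2.37, n_per_group=18)

# With equal groups, d = t * sqrt(2 / n).
d_value = t_value * math.sqrt(2 / 18)
print(f"t(34) = {t_value:.2f}, d = {d_value:.2f}")
```

Under these assumptions the recomputed values agree with the reported t and d, which illustrates how small this effect is relative to the effects in the global and distant future conditions.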

Once more the reported effect sizes by Forster et al. are surprisingly large.

Study 4

120 students from the same population participated in Study 4.  The main novel feature of Study 4 was the inclusion of a lexical decision task and the use of reaction times as the dependent variable.   It is important to realize that most of the variance in lexical decision tasks is random noise and fixed individual differences in reaction times.  This makes it difficult to observe large effects in between-subject comparisons and it is common to use within-subject designs to increase statistical power.  However, this study used a between-subject design.  The ANOVA showed the predicted four-way interaction, F(1,108) = 26.17.

The four-way interaction was explained by a three-way interaction for self-primes, F(1, 108) = 39.65, and no significant effects with control primes.

For moderately high standards, reaction times to athletic words were slower after local processing than after global processing (local M = 695, SD = 163, global M = 589, SD = 77), t(28) = 2.28, p = .031, d = 0.83.

For moderately low standards, reaction times to athletic words were faster after local processing than after global processing (local M = 516, SD = 61, global M = 643, SD = 172), t(28) = 2.70, p = .012, d = -0.98.

For unathletic words, the reverse pattern was observed.

For moderately high standards, reaction times were faster after local processing than after global processing (local M = 695, SD = 163, global M = 589, SD = 77), t(28) = 2.28, p = .031, d = 0.83.

For moderately low standards, reaction times to athletic words were faster after local processing than after global processing (local M = 516, SD = 61, global M = 643, SD = 172), t(28) = 2.70, p = .012, d = -0.98.

In sum, Study 4 reported reaction time differences as a function of global versus local processing styles that were surprisingly large.

Study 5

Participants in Study 5 were 128 students.  The main novel contribution of Study 5 was the inclusion of a line-bisection task that is supposed to measure asymmetries in brain activation.  The authors predicted that local processing induces more activation of the left side of the brain and global processing induces more activation of the right side of the brain.  The comparisons of the local and global condition with the control condition showed the predicted mean differences, t(120) = 1.95, p = .053 (reported as p = .05) and t(120) = 2.60, p = .010.   Also as predicted, the line-bisection measure was a significant mediator, z = 2.24, p = .03.
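
The p-value for the mediation test follows from the z statistic alone. The snippet below is a minimal sketch using only the Python standard library; the helper name two_tailed_p is illustrative.

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Two-tailed p-value for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = 2.24 -> p = {two_tailed_p(2.24):.3f}")
```

The result is about .025, which is consistent with the reported p = .03 after rounding to two decimals.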

The Replication Study 

The replication project called for replication of the last study, but the replication team in the US found that it was impossible to do so because the main outcome measure of Study 5 was alcohol consumption and drug use (just like Study 2) and pilot studies showed that incidence rates were much lower than in the German sample.  Therefore the authors replicated the aggression priming study of Study 1.

The focal test of the replication study was the interaction between processing condition and priming condition. As noted earlier, this interaction was very strong,  F(2, 76) = 21.57, p < .0001, and therefore seemingly easy to replicate.

Fortunately, the replication team dismissed the outcome of a post-hoc power analysis, which suggested that only 32 participants would be needed, and instead aimed for the same sample size as the original study.

The processing manipulation was changed from a map of the German city of Oldenburg to a state map of South Carolina.  This map was provided by the original authors. The replication report emphasizes that “all changes were endorsed by the first author of the original study.”

The actual sample size was a bit smaller (N = 74), and after exclusion of 3 suspicious participants, data analyses were based on 71 participants (vs. 80 in the original study).

The ANOVA failed to replicate a significant interaction effect, F(2, 65) = .865, p = .426.
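
To compare the original and replication interaction on a common scale, each F statistic can be converted to partial eta squared with the standard identity F * df_effect / (F * df_effect + df_error). The sketch below uses the F-values and degrees of freedom quoted in this review; the function name partial_eta_squared is illustrative, and neither report is assumed to have presented this statistic.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its df."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Interaction of processing style and priming, as quoted in this review.
original = partial_eta_squared(21.57, 2, 76)
replication = partial_eta_squared(0.865, 2, 65)
print(f"original: {original:.3f}, replication: {replication:.3f}")
```

On this scale the original interaction accounts for roughly a third of the variance not explained by other effects, whereas the replication interaction accounts for only a few percent.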

The replication study also included questions about the effectiveness of the processing style manipulation.  Only 32 participants indicated that they followed instructions.  Thus, one possible explanation for the replication failure is that the replication study did not successfully manipulate processing styles. However, the original study did not include a similar question and it is not clear why participants in the original study were more compliant.

More troublesome is that the replication study did not replicate the simple priming effect in the control condition or the global condition, which should have produced the effect with or without successful manipulation of processing styles.

In the control condition, the mean was lower in the aggression prime condition than in the neutral prime condition (aggression M = 6.27, SD = 1.29, neutral M = 7.00, SD = 1.30), t(22) = 1.38, p = .179, d = -.56.

In the global condition, the mean was also lower in the aggression prime condition than in the neutral prime condition (aggression M = 6.38, SD = 1.75, neutral M = 7.23, SD = 1.46), t(22) = 1.29, p = .207, d = -.53.

In the local condition, the means were nearly identical (aggression M = 7.77, SD = 1.16, neutral M = 7.67, SD = 1.27), t(22) = 0.20, p = .842, d = .08.

The replication report points out that the priming task was introduced by Higgins, Rholes, and Jones (1977).   Careful reading of this article shows that the original article also did not show immediate effects of priming.  The study obtained target ratings immediately and 10 to 14 days later.  The ANOVA showed a significant interaction with time, F(1,36) = 4.04, p = .052 (reported as p < .05).

“A further analysis of the above Valence x Time interaction indicated that the difference in evaluation under positive and negative conditions was small and nonsignificant on the immediate measure (M = .8 and .3 under positive and negative conditions, respectively), t(38)= 0.72, p > .25 two-tailed; but was substantial and significant on the delayed measure.” (Higgins et al., 1977).

Conclusion

There are serious concerns about the strong effects in the original article by Forster et al. (2008).   Similar results have raised concerns about data collected by Jens Forster. Although investigations have yielded no clear answers about the research practices, some articles have been retracted (Retraction Watch).  Social priming effects have also proven to be difficult to replicate (R-Index).

The strong effects reported by Forster et al. are not the result of typical questionable research practices that result in just significant results.  Thus, statistical methods that predict replicability falsely predict that Forster’s results would be easy to replicate and only actual replication studies or forensic analysis of original data might be able to reveal that reported results are not trustworthy.  Thus, statistical predictions of replicability are likely to overestimate replicability because they do not detect all questionable practices or fraud.
