The Replicability of Cognitive Psychology in the OSF-Reproducibility-Project

The OSF-Reproducibility Project (Psychology) aimed to replicate 100 results published in original research articles in three psychology journals in 2008. The selected journals focus on publishing results from experimental psychology. The main paradigm of experimental psychology is to recruit samples of participants and to study their behaviors in controlled laboratory conditions. The results are then generalized to the typical behavior of the average person.

An important methodological distinction in experimental psychology is the research design. In a within-subject design, participants are exposed to several (a minimum of two) situations and the question of interest is whether responses to one situation differ from behavior in other situations. The advantage of this design is that individuals serve as their own controls and variation due to unobserved causes (mood, personality, etc.) does not influence the results. This design can produce high statistical power to study even small effects. The design is often used by cognitive psychologists because the actual behaviors are often simple behaviors (e.g., pressing a button) that can be repeated many times (e.g., to demonstrate interference in the Stroop paradigm).

In a between-subject design, participants are randomly assigned to different conditions. A mean difference between conditions reveals that the experimental manipulation influenced behavior. The advantage of this design is that behavior is not influenced by previous behaviors in the experiment (carry-over effects). The disadvantage is that many uncontrolled factors (e.g., mood, personality) also influence behavior. As a result, it can be difficult to detect small effects of an experimental manipulation among all of the other variance that is caused by uncontrolled factors. Between-subject designs therefore require large samples to study small effects; otherwise they can only be used to study large effects.

One of the main findings of the OSF-Reproducibility Project was that results from within-subject designs used by cognitive psychologists were more likely to replicate than results from between-subject designs used by social psychologists. There were too few between-subject studies by cognitive psychologists or within-subject designs by social psychologists to separate these factors. This result of the OSF-reproducibility project was predicted by PHP-curves of the actual articles as well as PHP-curves of cognitive and social journals (Replicability-Rankings).

Given the reliable difference between disciplines within psychology, it seems problematic to generalize the results of the OSF-reproducibility project across all areas of psychology. For this reason, I conducted separate analyses for social psychology and for cognitive psychology. This post examines the replicability of results in cognitive psychology. The results for social psychology are posted here.

The master data file of the OSF-reproducibility project contained 167 studies, with replication results for 99 studies. 42 replications were classified as cognitive studies. I excluded Reynolds and Bresner because the original finding was not significant. I excluded C Janiszewski, D Uy (doi:10.1111/j.1467-9280.2008.02057.x) because it examined the anchor effect, which I consider to be social psychology. Finally, I excluded two studies with children as participants because this research falls into developmental psychology (E Nurmsoo, P Bloom; V Lobue, JS DeLoache). This left 38 studies for the analysis.

I first conducted a post-hoc-power analysis of the reported original results. Test statistics were converted into two-tailed p-values, and the two-tailed p-values were then converted into absolute z-scores using the formula z = norm.inverse(1 – p/2). Post-hoc power was estimated by fitting the observed z-scores to predicted z-scores with a mixed-power model with three parameters (Brunner & Schimmack, in preparation).
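The conversion from two-tailed p-values to absolute z-scores can be sketched in a few lines. This is only an illustration that assumes the standard inverse-normal transformation (the quantile of the standard normal distribution at 1 – p/2); the function name p_to_abs_z is made up for this example, and the mixed-power model itself is not reproduced here.

```python
from statistics import NormalDist


def p_to_abs_z(p_two_tailed: float) -> float:
    """Convert a two-tailed p-value into an absolute z-score.

    Assumption: the standard inverse-normal transformation,
    z = quantile of the standard normal at 1 - p/2.
    """
    if not 0.0 < p_two_tailed < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - p_two_tailed / 2.0)


# A few familiar significance levels as a sanity check.
for p in (0.05, 0.01, 0.001):
    print(f"p = {p} -> |z| = {p_to_abs_z(p):.2f}")
```

A two-tailed p-value of .05 maps onto the familiar significance criterion of z = 1.96.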

Estimated power was 75%. This finding reveals the presence of publication bias because the actual success rate of 100% is too high given the power of the studies. Based on this estimate, one would expect that only 75% of the 38 findings (k = 29) would produce a significant result in a set of 38 exact replication studies with the same design and sample size.
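To see why a 100% success rate is too high for studies with 75% power, a rough binomial calculation helps. The sketch below makes a simplifying assumption that is not part of the analysis above: all 38 studies are treated as independent and as having exactly the same power of 75%, whereas the fitted model allows power to vary across studies.

```python
from math import comb


def binomial_pmf(successes: int, trials: int, prob: float) -> float:
    """Probability of exactly `successes` significant results in `trials` studies."""
    return comb(trials, successes) * prob**successes * (1.0 - prob) ** (trials - successes)


N_STUDIES = 38        # cognitive studies retained in the analysis
ASSUMED_POWER = 0.75  # simplifying assumption: identical power in every study

# Expected number of significant results (28.5, reported as k = 29 in the text).
expected_significant = N_STUDIES * ASSUMED_POWER

# Chance that every single study is significant, as in the published record.
prob_all_significant = binomial_pmf(N_STUDIES, N_STUDIES, ASSUMED_POWER)

print(f"expected significant results: {expected_significant}")
print(f"P(all {N_STUDIES} significant): {prob_all_significant:.2e}")
```

Under these assumptions, a clean sweep of 38 significant results would be a very rare event without selection for significance.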

PHP-Curve OSF-REP Cognitive Original Data

The Figure visualizes the discrepancy between observed z-scores and the success rate in the original studies. Evidently, the distribution is truncated and suggests a file-drawer of missing studies with non-significant results. However, the mode of the curve (its highest point) is projected to be on the right side of the significance criterion (z = 1.96, p = .05 (two-tailed)), which suggests that more than 50% of results should replicate. Given the absence of reliable data in the range from 0 to 1.96, the data make it impossible to estimate the exact distribution in this region, but the gentle decline of z-scores on the right side of the significance criterion suggests that the file-drawer is relatively small.

Sample sizes of the replication studies were based on power analysis with the reported effect sizes. The problem with this approach is that the reported effect sizes are inflated and provide an inflated estimate of true power. With a true power estimate of 75%, the inflated power estimates were above 80% and often over 90%. As a result, many replication studies used the same sample size and some even used a smaller sample size because the original study appeared to be overpowered (the sample size was much larger than needed). The median sample size for the original studies was N = 32. The median sample size for the replication studies was also N = 32. Changes in sample sizes make it difficult to compare the success rate of the original studies with that of the replication studies. Therefore, I adjusted the z-scores of the replication studies to match the z-scores that would have been obtained with the original sample size. Based on the post-hoc-power analysis above, I predicted that 75% of the replication studies would produce a significant result (k = 29). I had also posted predictions for individual studies based on a more comprehensive assessment of each article. The success rate for my a priori predictions was 69% (k = 27).
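The exact adjustment formula is not spelled out above. One common way to rescale a z-score to a different sample size, shown here purely as an assumption, is to use the fact that for a fixed effect size the expected z-score grows with the square root of N. The function name and the example numbers below are hypothetical and are not taken from the OSF data.

```python
from math import sqrt


def adjust_z_to_original_n(z_replication: float, n_replication: int, n_original: int) -> float:
    """Rescale a replication z-score to the original study's sample size.

    Assumption (not stated in the post): for a fixed effect size the
    z-score is proportional to the square root of the sample size.
    """
    if n_replication <= 0 or n_original <= 0:
        raise ValueError("sample sizes must be positive")
    return z_replication * sqrt(n_original / n_replication)


# Hypothetical example: the replication doubled the sample size (32 -> 64)
# and obtained z = 2.5; rescaled to N = 32 the evidence is weaker.
z_adjusted = adjust_z_to_original_n(z_replication=2.5, n_replication=64, n_original=32)
print(f"adjusted z = {z_adjusted:.2f}")
```

With this rescaling, a replication that used a larger sample than the original gets a smaller adjusted z-score, and a replication that used a smaller sample gets a larger one.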

The actual replication rate based on adjusted z-scores was 58% (22/38), although 3 studies produced only p-values between .05 and .06 after the adjustment was applied. If these studies were not counted, the success rate would have been 50% (19/38). This finding suggests that post-hoc power analysis overestimates true power by roughly 17 to 25 percentage points. However, it is also possible that some of the replication studies failed to reproduce the exact experimental conditions of the original studies, which would lower the probability of obtaining a significant result. Moreover, the number of studies is very small and the discrepancy may simply be due to random sampling error. The important result is that post-hoc power curves correctly predict that the success rate in replication studies will be lower than the success rate of the original studies because they correct for the effect of publication bias. They also correctly predict that a substantial number of studies will be successfully replicated, which they were. In comparison, post-hoc power analysis of social psychology predicted only 35% successful replications, and only 8% of studies successfully replicated. Thus, post-hoc power analysis correctly predicts that results in cognitive psychology are more replicable than results in social psychology.

The next figure shows the post-hoc-power curve for the sample-size corrected z-scores of the replication studies.

PHP-Curve OSF-REP Cognitive Adj. Rep. Data

The PHP-Curve estimate of power for z-scores in the range from 0 to 4 is 53% for the heterogeneous model that fits the data better than a homogeneous model. The shape of the distribution suggests that several of the non-significant results are type-II errors; that is, the studies had insufficient statistical power to demonstrate a real effect.

I also conducted a power analysis that was limited to the non-significant results. The estimated average power was 22%. This power is a mixture of true power in different studies and may contain some cases of true false positives (power = .05), but the existing data are insufficient to determine whether results are true false positives or whether a small effect is present and sample sizes were too small to detect it. Again, it is noteworthy that the same analysis for social psychology produced an estimate of 5%, which suggests that most of the non-significant results in social psychology are true false positives (the null-effect is true).

Below I discuss my predictions of individual studies.

Eight studies reported an effect with a z-score greater than 4 (4 sigma), and I predicted that all of the 4-sigma effects would replicate. 7 out of 8 effects were successfully replicated (D Ganor-Stern, J Tzelgov; JI Campbell, ND Robert; M Bassok, SF Pedigo, AT Oskarsson; PA White; E Vul, H Pashler; E Vul, M Nieuwenstein, N Kanwisher; J Winawer, AC Huk, L Boroditsky). The only exception was CP Beaman, I Neath, AM Surprenant (DOI: 10.1037/0278-7393.34.1.219). It is noteworthy that the sample size of the original study was N = 99 and the sample size of the replication study was N = 14. Even with an adjusted z-score the study produced a non-significant result (p = .19). However, small samples produce less reliable results and it would be interesting to examine whether the result would become significant with an actual sample of 99 participants.

Based on more detailed analysis of individual articles, I predicted that an additional 19 studies would replicate. However, 9 out of these 19 studies were not successfully replicated. Thus, my predictions of additional successful replications are just at chance level, given the overall success rate of 50%.

Based on more detailed analysis of individual articles, I predicted that 11 studies would not replicate. However, 5 out of these 11 studies were successfully replicated. Thus, my predictions of failed replications are just at chance level, given the overall success rate of 50%.

In short, my only rule that successfully predicted replicability of individual studies was the 4-sigma rule that predicts that all findings with a z-score greater than 4 will replicate.

In conclusion, a replicability of 50-60% is consistent with Cohen’s (1962) suggestion that typical studies in psychology have 60% power. Post-hoc power analysis slightly overestimated the replicability of published findings despite its ability to correct for publication bias. Future research needs to examine the sources that lead to a discrepancy between predicted and realized success rate. It is possible that some of this discrepancy is due to moderating factors. Although a replicability of 50-60% is not as catastrophic as the results for social psychology with estimates in the range from 8-35%, cognitive psychologists should aim to increase the replicability of published results. Given the widespread use of powerful within-subject designs, this is easily achieved by a modest increase in sample sizes from currently 30 participants to 50 participants, which would increase power from 60% to 80%.
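The claim that moving from 30 to 50 participants raises power from 60% to about 80% can be checked with a back-of-the-envelope calculation. The sketch below rests on assumptions of its own: a two-tailed z-test at alpha = .05, a non-centrality parameter that grows with the square root of N, and the negligible probability of a significant result in the wrong direction being ignored. All function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)  # two-tailed alpha = .05, about 1.96


def power_to_ncp(power: float) -> float:
    """Non-centrality (expected z-score) implied by a given power."""
    return Z_CRIT + STD_NORMAL.inv_cdf(power)


def ncp_to_power(ncp: float) -> float:
    """Power of the z-test for a given non-centrality (upper tail only)."""
    return 1.0 - STD_NORMAL.cdf(Z_CRIT - ncp)


def power_after_resize(power: float, n_old: int, n_new: int) -> float:
    """Power after changing the sample size, holding the effect size fixed.

    Assumption: the expected z-score scales with sqrt(N).
    """
    return ncp_to_power(power_to_ncp(power) * sqrt(n_new / n_old))


new_power = power_after_resize(power=0.60, n_old=30, n_new=50)
print(f"power with N = 50: {new_power:.2f}")
```

Under these assumptions the result lands close to the 80% figure mentioned above.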

Which Social Psychology Results Were Successfully Replicated in the OSF-Reproducibility Project? Recommending a 4-Sigma Rule

After several years and many hours of hard work by hundreds of psychologists, the results of the OSF-Reproducibility project are in. The project aimed to replicate a representative set of 100 studies from top journals in social and cognitive psychology. The replication studies aimed to reproduce the original studies as closely as possible, while increasing sample sizes somewhat to reduce the risk of type-II errors (failure to replicate a true effect).

The results have been widely publicized in the media. On average, only 36% of studies were successfully replicated; that is, the replication study reproduced a significant result. More detailed analysis shows that results from cognitive psychology had a higher success rate (50%) than results from social psychology (25%).

This post describes the 9 results from social psychology that were successfully replicated. 6 out of the 9 successfully replicated studies reported highly significant results with a z-score greater than 4 sigma (standard deviations) from 0 (p < .00003). Particle physics uses a 5-sigma rule to avoid false positives and industry has adopted a 6-sigma rule in quality control.

Based on my analysis of the OSF-results, I recommend a 4-sigma rule for textbook writers, journalists, and other consumers of scientific findings in social psychology to avoid dissemination of false information.

List of Studies in Decreasing Order of Strength of Evidence

1. Single Study, Self-Report, Between-Subject Analysis, Extremely large sample (N = 230,047), Highly Significant Result (z > 4 sigma)

CJ Soto, OP John, SD Gosling, J Potter (2008). The developmental psychometrics of big five self-reports: Acquiescence, factor structure, coherence, and differentiation from ages 10 to 20. JPSP:PPID.

This article reported results of a psychometric analysis of self-reports of personality traits in a very large sample (N = 230,047). The replication study used the exact same method with participants from the same population (N = 455,326). Not surprisingly, the results were replicated. Unfortunately, it is not an option to conduct all studies with huge samples like this one.

2. Four Studies, Self-Report, Large Sample (N = 211), One-Sample Test, Highly Significant Result (z > 4 sigma)

JL Tracy, RW Robins (2008). The nonverbal expression of pride: Evidence for cross-cultural recognition. JPSP:PPID.

The replication project focussed on the main effect in Study 4. The main effect in question was whether raters (N = 211) would accurately recognize non-verbal displays of pride in six pictures that displayed pride. The recognition rates were high (range 70%–87%) and highly significant. The sample size of N = 211 is large for a one-sample test that compares a sample mean against a fixed value.

3. Five Studies, Self-Report, Moderate Sample Size (N = 153), Correlation, Highly Significant Result (z > 4 sigma)

EP Lemay, MS Clark (2008). How the head liberates the heart: Projection of communal responsiveness guides relationship promotion. JPSP:IRGP.

Study 5 examined accuracy and biases in perceptions of responsiveness (caring and support for a partner). Participants (N = 153) rated their own responsiveness and how responsive their partner was. Ratings of perceived responsiveness were regressed on self-ratings of responsiveness and targets’ self-ratings of responsiveness. The results revealed a highly significant projection effect; that is, perceptions of responsiveness were predicted by self-ratings of responsiveness. This study produced a highly significant result despite a moderate sample size because the effect size was large.

4. Single Study, Behavior, Moderate Sample (N = 240), Highly Significant Result (z > 4 sigma)

N Halevy, G Bornstein, L Sagiv (2008). In-Group-Love and Out-Group-Hate as Motives for Individual Participation in Intergroup Conflict: A New Game Paradigm, Psychological Science.

This study had a sample size of N = 240. Participants were recruited in groups of six. The experiment had four conditions. The main dependent variable was how a monetary reward was allocated. One manipulation was that some groups had the option to allocate money to the in-group whereas others did not have this option. Naturally, the percentages of allocation to the in-group differed across these conditions. Another manipulation allowed some group-members to communicate whereas in the other condition players had to make decisions on their own. This study produced a highly significant interaction between the two experimental manipulations that was successfully replicated.

5. Single Study, Self-Report, Large Sample (N = 82), Within-Subject Analysis, Highly Significant Result (z > 4 sigma)

M Tamir, C Mitchell, JJ Gross (2008). Hedonic and instrumental motives in anger regulation. Psychological Science.

In this study, 82 participants were asked to imagine being in two types of situations: scenarios with a hypothetical confrontation and scenarios without a confrontation. They also listened to music that was designed to elicit an excited, angry, or neutral mood. Afterwards participants rated how much they would like to listen to the music they heard if they were in the hypothetical situation. Each participant listened to all pairings of situation and music and the data were analyzed within-subject. A highly significant interaction revealed a preference for angry music in confrontations and a dislike of angry music without a confrontation; this interaction was successfully replicated. A sample of 82 participants is large for a within-subject comparison of means for different conditions.

6. Single Study, Self-Report, Large Sample (N = 124), One-Sample Test, Highly Significant Result (z > 4 sigma)

DA Armor, C Massey, AM Sackett (2008). Prescribed optimism: Is it right to be wrong about the future? Psychological Science.

In this study, participants (N = 124) were asked to read 8 vignettes that involved making decisions. Participants were asked to judge whether they would recommend making pessimistic, realistic, or optimistic predictions. The main finding was that the average recommendation was to be optimistic. The effect was highly significant. A sample of N = 124 is very large for a design that compares a sample mean to a fixed value.

7. Four Studies, Self-Report, Small Sample (N = 71), Experiment, Moderate Support (z = 2.97)

BK Payne, MA Burkley, MB Stokes (2008). Why do implicit and explicit attitude tests diverge? The role of structural fit. JPSP:ASC.

In this study, participants worked on the standard Affect Misattribution Paradigm (AMP). In the AMP, two stimuli are presented in brief succession. In this study, the first stimulus was a picture of a European or African American face. The second stimulus was a picture of a Chinese pictogram. In the standard paradigm, participants are asked to report how much they like the second stimulus (Chinese pictogram) and to ignore the first stimulus (Black or White face). The AMP is typically used to measure racial attitudes because racial attitudes can influence responses to the Chinese characters.

In this study, the standard AMP was modified by giving two different sets of instructions. One instruction was the standard instruction to respond to the Chinese pictograms. The other instruction was to respond directly to the faces. All participants (N = 71) completed both tasks. The participants were randomly assigned to two conditions. One condition made it easier to honestly report prejudice (low social pressure). The other condition emphasized that prejudice is socially undesirable (high social pressure). The results showed a significantly stronger correlation between the two tasks (ratings of Chinese pictograms and faces) in the low social pressure condition than in the high social pressure condition, which was replicated in the replication study.

8. Two Studies, Self-Report, Moderate Sample (N = 119), Correlation, Weak Support (z = 2.27)

JT Larsen, AR McKibban (2008). Is happiness having what you want, wanting what you have, or both? Psychological Science.

In this study, participants (N = 124) received a list of 62 material items and were asked to check whether they had the item or not (e.g., a cell phone). They then rated how much they wanted each item. Based on these responses, the authors computed measures of (a) how much participants wanted what they had and (b) how much they had what they wanted. The main finding was that life-satisfaction was significantly predicted by wanting what one has while controlling for having what one wants. This finding was also obtained in Study 1 (N = 124) and successfully replicated in the OSF-project with a larger sample (N = 238).

9. Five Studies, Behavior, Small Sample (N = 28), Main Effect, Very Weak Support (z = 1.80)

SM McCrea (2008). Self-handicapping, excuse making, and counterfactual thinking: Consequences for self-esteem and future motivation. JPSP:ASC.

In this study, all participants (N = 28) first worked on a very difficult math task and received failure feedback. Participants were then randomly assigned to two groups. One group was given feedback that they had insufficient practice (self-handicap). The control group was not given an explanation for their failure. All participants then worked on a second math task. The main effect showed that performance on the second task was better (higher percentage of correct answers) in the control group than in the self-handicap condition. Although this difference was only marginally significant (p < .05, one-tailed) in the original study, it was significant in the replication study with a larger sample (N = 61).

Although the percentage of correct answers showed only a marginally significant effect, the number of attempted answers and the absolute number of correct answers showed significant effects. Thus, this study does not count as a publication of a null-result. Moreover, these results suggest that participants in the control group were more motivated to do well because they worked on more problems and got more correct answers.