
Dr. R’s comment on the Official Statement by the Board of the German Psychological Association (DGPs) about the Results of the OSF-Reproducibility Project published in Science.

Thanks to social media, geography is no longer a barrier to scientific discourse. However, language is still a barrier. Fortunately, I understand German, so I can respond to the official statement of the board of the German Psychological Association (DGPs), which was posted on the DGPs website (in German).

BACKGROUND

On September 1, 2015, Prof. Dr. Andrea Abele-Brehm, Prof. Dr. Mario Gollwitzer, and Prof. Dr. Fritz Strack published an official response to the results of the OSF-Replication Project – Psychology (in German) that was distributed to public media in order to correct potentially negative impressions about psychology as a science.

Numerous members of DGPs felt that this official statement did not express their views and noted that members had not been consulted about the official response of their organization. In response to this criticism, DGPs opened a moderated discussion page, where members could post their personal views (mostly in German).

On October 6, 2015, the board closed the discussion page and posted some final words (Schlussbeitrag). In this blog, I provide a critical commentary on these final words.

BOARD’S RESPONSE TO COMMENTS

The board members provide a summary of the core insights and arguments of the discussion from their (personal/official) perspective.

"In the following, we would like to summarize what we see as the central insights and arguments of the various forum contributions and to make clear which preliminary conclusions we on the board draw from them."

1. 68% success rate?

The first official statement suggested that the replication project showed that 68% of studies were successfully replicated. This number is based on statistical significance in a meta-analysis that combines the original and replication study. Critics pointed out that this approach is problematic because the replication project clearly showed that the original effect sizes were inflated (on average by 100%). Thus, the meta-analysis is biased and the 68% number is inflated.

In response to this criticism, the DGPs board states that “68% is the maximum [größtmöglich] optimistic estimate.” I think the term “biased and statistically flawed estimate” is a more accurate description of this estimate.   It is common practice to consider fail-safe-N or to correct meta-analysis for publication bias. When there is clear evidence of bias, it is unscientific to report the biased estimate. This would be like saying that the maximum optimistic estimate of global warming is that global warming does not exist. This is probably a true statement about the most optimistic estimate, but not a scientific estimate of the actual global warming that has been taking place. There is no place for optimism in science. Optimism is a bias and the aim of science is to remove bias. If DGPs wants to represent scientific psychology, the board should post what they consider the most accurate estimate of replicability in the OSF-project.

2. Is the widely cited 36% estimate too negative?

The board members then justify the publication of the maximally optimistic estimate as a strategy to counteract negative perceptions of psychology as a science in response to the finding that only 36% of results were replicated. The board members felt that these negative responses misrepresent the OSF-project and psychology as a scientific discipline.

"This does justice neither to the project of the Open Science Collaboration nor to our discipline as a whole. We should, however, be pioneers among the affected sciences in dealing with this crisis constructively."

However, reporting the dismal 36% replication rate of the OSF-replication project is not a criticism of the OSF-project. Rather, it assumes that the OSF-replication project was a rigorous and successful attempt to provide an estimate of the typical replicability of results published in top psychology journals. The outcome could have been 70% or 35%. The quality of the project does not depend on the result. The result is also not a negatively biased perception of psychology as a science. It is an objective scientific estimate of the probability that a reported significant result in a journal would produce a significant result again in a replication study.   Whether 36% is acceptable or not can be debated, but it seems problematic to post a maximally optimistic estimate to counteract negative implications of an objective estimate.

3. Is 36% replicability good or bad?

Next, the board ponders the implications of the 36% success rate. "How should we evaluate this number?" The board members do not know. According to their official conclusion, this question is complex, as the divergent contributions on the discussion page suggest.

"In the Science article, the relative frequency of statistically significant effects in the replication studies was reported as 36%. How should this number be evaluated? The forum contributions by Roland Deutsch, Klaus Fiedler, Moritz Heene (see also Heene & Schimmack), and Frank Renkewitz make clear how complex the answer to this question is."

To help the board members understand this number, I can give a brief explanation of replicability. Although there are several ways to define replicability, one plausible definition is to equate it with statistical power. Statistical power is the probability that a study will produce a significant result. A study with 80% power has an 80% probability of producing a significant result. For a set of 100 studies, one would expect roughly 80 significant results and 20 non-significant results. For 100 studies with 36% power, one would expect roughly 36 significant results and 64 non-significant results. If researchers published all studies, the percentage of published significant results would provide an unbiased estimate of the typical power of studies. However, it is well known that significant results are more likely to be written up, submitted for publication, and accepted for publication. These reporting biases explain why psychology journals report over 90% significant results, although the actual power of studies is less than 90%.
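A small simulation makes this concrete. The sketch below uses illustrative numbers only (the effect size and group sizes are my assumptions, not values from the OSF data): a set of studies with roughly 35% power yields roughly 35 significant results per 100 attempts.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def share_significant(n_studies, n_per_group, d, alpha=0.05):
    """Simulate two-group t-tests and return the share of significant results."""
    hits = 0
    for _ in range(n_studies):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(d, 1.0, n_per_group)
        hits += stats.ttest_ind(treatment, control).pvalue < alpha
    return hits / n_studies

# d = 0.5 with n = 21 per group gives roughly 35% power;
# n = 64 per group gives the 80% power that Cohen recommended.
print(share_significant(10_000, 21, 0.5))  # ~0.35
print(share_significant(10_000, 64, 0.5))  # ~0.80
```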

In 1962, Jacob Cohen made the first attempt to estimate the replicability of psychological results. His analysis suggested that psychological studies have approximately 50% power. He recommended that psychologists increase power to 80% to provide robust evidence for effects and to avoid wasting resources on studies that cannot detect small, but practically important, effects. For the next 50 years, psychologists ignored Cohen's warning that most studies are underpowered, despite repeated reminders that there were no signs of improvement, including reminders by prominent German psychologists like Gerd Gigerenzer, director of a Max Planck Institute (Sedlmeier & Gigerenzer, 1989; Maxwell, 2004; Schimmack, 2012).

The 36% success rate for an unbiased set of 100 replication studies suggests that the actual power of published studies in psychology journals is 36%. The power of all studies conducted is even lower, because the p < .05 selection criterion favors studies with higher power. Does the board think 36% power is an acceptable amount of power?

4. Psychologists should improve replicability in the future

On a positive note, the board members suggest that, after careful deliberation, psychologists need to improve replicability so that it can be demonstrated in a few years that replicability has increased.

"After careful discussion among our members, we must take measures (at journals, in departments, at funding organizations, etc.) that can increase the replication rate over time."

The board members do not mention a simple solution to the replicability problem that was advocated over 50 years ago by Jacob Cohen. To increase replicability, psychologists have to think about the strength of the effects that they are investigating, and they have to conduct studies that have a realistic chance of distinguishing these effects from variation due to random error. This often means investing more resources (larger samples, repeated trials, etc.) in a single study. Unfortunately, the leaders of German psychology appear to be unaware of this important and simple solution to the replication crisis. They neither mention power as a cause of the problem, nor do they recommend increasing power to improve replicability in the future.
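For illustration, a prospective power analysis of the kind Cohen advocated takes only a few lines. This is a minimal sketch with assumed numbers (a medium effect of d = .5), using the statsmodels power module:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed for 80% power to detect a medium effect (d = .5)
n_needed = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(round(n_needed))  # ~64 per group

# Power of a typical small study: d = .5 with only 20 participants per group
power = analysis.solve_power(effect_size=0.5, nobs1=20, alpha=0.05)
print(round(power, 2))  # ~0.33
```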

5. Do the Results Reveal Fraud?

The DGPs board members then discuss the possibility that the OSF-reproducibility results reveal fraud, like the fraud committed by Stapel. The board points out that the OSF-results do not imply that psychologists commit fraud, because failed replications can occur for various reasons.

"Many media outlets (and also some colleagues from our field) mention the findings of the Science study in the same breath as the fraud scandals that have shaken our field in recent years. In our opinion, this association is problematic: it suggests that the low replication rate is due to methodologically questionable behavior by the authors of the original studies."

It is true that the OSF-results do not reveal fraud. However, the board members confuse fraud with questionable research practices. Fraud is defined as fabricating data that were never collected. Only one of the 100 studies in the OSF-replication project (by Jens Förster, a former student of Fritz Strack, one of the board members) is currently being investigated for fraud by the University of Amsterdam.  Despite very strong results in the original study, it failed to replicate.

The more relevant question is how much questionable research practices contributed to the results. Questionable research practices are practices where data are collected, but statistical results are only reported if they produce a significant result (studies, conditions, dependent variables, and data points that do not produce significant results are excluded from the results that are submitted for publication). It has been known for over 50 years that these practices produce a discrepancy between the actual power of studies and the rate of significant results that are published in psychology journals (Sterling, 1959).

Recent statistical developments have made it possible to estimate the true power of studies after correcting for publication bias.   Based on these calculations, the true power of the original studies in the OSF-project was only 50%.   Thus a large portion of the discrepancy between nearly 100% reported significant results and a replication success rate of 36% is explained by publication bias (see R-Index blogs for social psychology and cognitive psychology).
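The logic of this argument can be illustrated with a toy simulation (the parameters below are assumptions chosen to produce roughly 50% power, not the actual R-Index computations): when only significant results are published, the published record shows a 100% success rate, but exact replications succeed only at the rate of the true power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
d, n, alpha = 0.5, 32, 0.05   # d = 0.5 with n = 32 per group gives ~50% power

def study_is_significant():
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(d, 1.0, n)
    return stats.ttest_ind(treatment, control).pvalue < alpha

# Run many studies; the file drawer hides all non-significant results.
published = [s for s in (study_is_significant() for _ in range(5_000)) if s]
print(len(published) / 5_000)  # ~0.50 of all studies get published ...

# ... the published record shows 100% significant results, yet exact
# replications of the published studies succeed only at the true power rate.
replicated = sum(study_is_significant() for _ in published)
print(replicated / len(published))  # ~0.50, not 1.00
```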

Other factors may contribute to the discrepancy between the statistical prediction that the replication success rate would be 50% and the actual success rate of 36%. Nevertheless, the lion's share of the discrepancy can be explained by the questionable practice of reporting only evidence that supports a hypothesis that a researcher wants to support. This motivated bias undermines the very foundations of science. Unfortunately, the board ignores this implication of the OSF results.

6. What can we do?

The board members have no answer to this important question. In the past four years, numerous articles have been published that make suggestions for how psychology can improve its credibility as a science. Yet, the DGPs board seems to be unaware of these suggestions or unable to comment on these proposals.

"This brings us to the question that occupies us most as a professional society and will continue to occupy us. For one thing, we need careful self-reflection about the significance of replications in our field, about the significance of the latest Science study as well as of the further Center for Open Science projects that are currently in press or still being analyzed (such as the Many Labs studies), and about the limits of our methods and paradigms."

The time for more discussion has passed. After 50 years of ignoring Jacob Cohen’s recommendation to increase statistical power it is time for action. If psychologists are serious about replicability, they have to increase the power of their studies.

The board then discusses the possibility of measuring and publishing replication rates at the level of departments or individual scientists. They are not in favor of such initiatives, but they provide no argument for their position.

"Databases of successful and failed replications can of course also be analyzed at the level of departments or even individual persons (who has the highest replication rate, who the lowest?). More sensible than such analyses are initiatives like those that are currently being implemented (among other places) at LMU München (see the contribution by Schönbrodt and colleagues)."

The question is why replicability should not be measured and used to evaluate researchers. If the board really valued replicability and wanted to increase replicability in a few years, wouldn’t it be helpful to have a measure of replicability and to reward departments or researchers who invest more resources in high powered studies that can produce significant results without the need to hide disconfirming evidence in file-drawers?   A measure of replicability is also needed because current quantitative measures of scientific success are one of the reasons for the replicability crisis. The most successful researchers are those who publish the most significant results, no matter how these results were obtained (with the exception of fraud). To change this unscientific practice of significance chasing, it is necessary to have an alternative indicator of scientific quality that reflects how significant results were obtained.

Conclusion

The board makes some vague concluding remarks that are not worthwhile repeating here. So let me conclude with my own remarks.

The response of the DGPs board is superficial and does not engage with the actual arguments that were exchanged on the discussion page. Moreover, it ignores some solid scientific insights into the causes of the replicability crisis and it makes no concrete suggestions how German psychologists should change their behaviors to improve the credibility of psychology as a science. Not once do they point out that the results of the OSF-project were predictable based on the well-known fact that psychological studies are underpowered and that failed studies are hidden in file-drawers.

I received my education in Germany all the way to the Ph.D. at the Free University in Berlin. I had several important professors and mentors who educated me about philosophy of science and research methods (Rainer Reisenzein, Hubert Feger, Hans Westmeyer, Wolfgang Schönpflug). I was a member of DGPs for many years. I do not believe that the opinions of the board members represent a general consensus among German psychologists. I hope that many German psychologists recognize the importance of replicability and are motivated to change the way psychologists conduct research. As I am no longer a member of DGPs, I have no direct influence on it, but I hope that the next election will produce a board that promotes open science, transparency, and, above all, scientific integrity.

The Replicability of Social Psychology in the OSF-Reproducibility Project

Abstract: I predicted the replicability of 38 social psychology results in the OSF-Reproducibility Project. Based on post-hoc-power analysis, I predicted a success rate of 35%. The actual success rate was 8% (3 out of 38), and post-hoc power was estimated to be 3% for 36 out of 38 studies (an estimate at the 5% type-I error rate would mean the null-hypothesis is true).

The OSF-Reproducibility Project aimed to replicate 100 results published in original research articles in three psychology journals in 2008. The selected journals focus on publishing results from experimental psychology. The main paradigm of experimental psychology is to recruit samples of participants and to study their behaviors in controlled laboratory conditions. The results are then generalized to the typical behavior of the average person.

An important methodological distinction in experimental psychology is the research design. In a within-subject design, participants are exposed to several (a minimum of two) situations and the question of interest is whether responses to one situation differ from behavior in other situations. The advantage of this design is that individuals serve as their own controls and variation due to unobserved causes (mood, personality, etc.) does not influence the results. This design can produce high statistical power to study even small effects. The design is often used by cognitive psychologists because the actual behaviors are often simple behaviors (e.g., pressing a button) that can be repeated many times (e.g., to demonstrate interference in the Stroop paradigm).

In a between-subject design, participants are randomly assigned to different conditions. A mean difference between conditions reveals that the experimental manipulation influenced behavior. The advantage of this design is that behavior is not influenced by previous behaviors in the experiment (carry-over effects). The disadvantage is that many uncontrolled factors (e.g., mood, personality) also influence behavior. As a result, it can be difficult to detect small effects of an experimental manipulation among all of the other variance that is caused by uncontrolled factors. Consequently, between-subject designs require large samples to study small effects, or they can only be used to study large effects.
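The power difference between the two designs is easy to demonstrate by simulation. In this sketch (all numbers are assumed for illustration), the same mean difference is tested with a paired t-test on correlated repeated measures and with an independent-groups t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, effect, r = 40, 0.3, 0.8   # r = correlation between repeated measures

def within_significant():
    person = rng.normal(0, np.sqrt(r), n)       # stable individual differences
    noise = np.sqrt(1 - r)
    cond1 = person + rng.normal(0, noise, n)
    cond2 = person + effect + rng.normal(0, noise, n)
    return stats.ttest_rel(cond2, cond1).pvalue < 0.05

def between_significant():
    group1 = rng.normal(0, 1, n)
    group2 = rng.normal(effect, 1, n)
    return stats.ttest_ind(group2, group1).pvalue < 0.05

sims = 5_000
print(np.mean([within_significant() for _ in range(sims)]))   # ~0.85
print(np.mean([between_significant() for _ in range(sims)]))  # ~0.25
```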

One of the main findings of the OSF-Reproducibility Project was that results from within-subject designs used by cognitive psychologists were more likely to replicate than results from between-subject designs used by social psychologists. There were too few between-subject studies by cognitive psychologists or within-subject studies by social psychologists to separate these two factors. This result of the OSF-reproducibility project was predicted by PHP-curves of the actual articles as well as PHP-curves of cognitive and social journals (Replicability-Rankings).

Given the reliable difference between disciplines within psychology, it seems problematic to generalize the results of the OSF-reproducibility project to all areas of psychology. The Replicability-Rankings suggest that social psychology has lower replicability than other areas of psychology. For this reason, I conducted separate analyses for social psychology and for cognitive psychology. Other areas of psychology had too few studies to conduct a meaningful analysis. Thus, the OSF-reproducibility results should not be generalized to all areas of psychology.

The master data file of the OSF-reproducibility project contained 167 studies with replication results for 99 studies. 57 studies were classified as social studies. However, this classification used a broad definition of social psychology that included personality psychology and developmental psychology. It included six articles published in the personality section of the Journal of Personality and Social Psychology. As each section functions essentially like an independent journal, I excluded all studies from this section.

The file also contained two independent replications of two experiments (Experiments 5 and 7) in Albarracín et al. (2008; DOI: 10.1037/a0012833). As the main sampling strategy was to select the last study of each article, I only included Study 7 in the analysis (Study 5 did not replicate, p = .77). Thus, my selection did not lower the rate of successful replications. There were also two independent replications of the same result in Bressan and Stranieri (2008). Both replications produced non-significant results (p = .63, p = .75). I selected the replication study with the larger sample (N = 318 vs. 259).

I also excluded two studies that were not independent replications. Rule and Ambady (2008) examined the correlation between facial features and the success of CEOs; the replication study had new raters rate the faces, but used the same faces. Heine, Buchtel, and Norenzayan (2008) examined correlates of conscientiousness across nations, and the replication study examined the same relationship across the same set of nations.

Finally, I excluded replications of non-significant results, because non-significant results provide ambiguous information and cannot be interpreted as evidence for the null-hypothesis. It is therefore not clear how the results of such a replication study should be interpreted: two underpowered studies could easily produce consistent results that are both type-II errors. For this reason, I excluded Ranganath and Nosek (2008) and Eastwick and Finkel (2008). The final sample consisted of 38 articles.

I first conducted a post-hoc-power analysis of the reported original results. Test statistics were first converted into two-tailed p-values, and two-tailed p-values were converted into absolute z-scores using the formula z = norm.inverse(1 − p/2). Post-hoc power was estimated by fitting the observed z-scores to predicted z-scores with a mixed-power model with three parameters (Brunner & Schimmack, in preparation).
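In code, the conversion from two-tailed p-values to absolute z-scores is a direct translation of the formula above:

```python
from scipy.stats import norm

def p_to_z(p):
    """Absolute z-score corresponding to a two-tailed p-value."""
    return norm.ppf(1 - p / 2)

print(p_to_z(0.05))     # 1.96, the conventional significance criterion
print(p_to_z(0.00006))  # ~4.0, the 4-sigma range discussed below
```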

Estimated power was 35%. This finding reflects the typical finding that reported results are a biased sample of studies that produced significant results, whereas non-significant results are not submitted for publication. Based on this estimate, one would expect that only 35% of the 38 findings (k = 13) would produce a significant result in an exact replication study with the same design and sample size.

[Figure: PHP-curve for the original results of the social psychology studies in the OSF-Reproducibility Project]

The figure visualizes the discrepancy between observed z-scores and the success rate in the original studies. Evidently, the distribution is truncated, and the mode of the curve (its highest point) is projected to be on the left side of the significance criterion (z = 1.96, p = .05, two-tailed). Given the absence of reliable data in the range from 0 to 1.96, the data make it impossible to estimate the exact distribution in this region, but the steep decline of z-scores on the right side of the significance criterion suggests that many of the significant results achieved significance only with the help of inflated observed effect sizes. As sampling error is random, these results will not replicate in a replication study.

The replication studies had different sample sizes than the original studies. This makes it difficult to compare the prediction to the actual success rate because the actual success rate could be much higher if the replication studies had much larger samples and more power to replicate effects. For example, if all replication studies had sample sizes of N = 1,000, we would expect a much higher replication rate than 35%. The median sample size of the original studies was N = 86. This is representative of studies in social psychology. The median sample size of the replication studies was N = 120. Given this increase in power, the predicted success rate would increase to 50%. However, the increase in power was not uniform across studies. Therefore, I used the p-values and sample size of the replication study to compute the z-score that would have been obtained with the original sample size and I used these results to compare the predicted success rate to the actual success rate in the OSF-reproducibility project.
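A plausible implementation of this adjustment (my sketch of the approach, not the authors' published code) uses the fact that, for a constant effect size, z-scores scale with the square root of the sample size:

```python
import numpy as np
from scipy.stats import norm

def adjusted_z(p_rep, n_rep, n_orig):
    """Project a replication z-score onto the original sample size."""
    z_rep = norm.ppf(1 - p_rep / 2)       # two-tailed p to absolute z
    return z_rep * np.sqrt(n_orig / n_rep)

# Example with the median sample sizes above: p = .01 with N = 120
# corresponds to a smaller z-score at the original N = 86.
print(adjusted_z(0.01, n_rep=120, n_orig=86))  # ~2.18 instead of 2.58
```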

The depressing finding was that the actual success rate was much lower than the predicted success rate. Only 3 out of 38 results (8%) produced a significant result (without the correction for sample size, 5 findings would have been significant). Even more depressing is the fact that a 5% criterion implies that 1 out of every 20 studies is expected to produce a significant result just by chance. Thus, the actual success rate is close to the success rate that would be expected if all of the original results were false positives. A success rate of 8% would imply that the actual power of the replication studies was only 8%, compared to the predicted power of 35%.

The next figure shows the post-hoc-power curve for the sample-size corrected z-scores.

[Figure: PHP-curve for the sample-size-adjusted z-scores of the replication studies]

The PHP-Curve estimate of power for z-scores in the range from 0 to 4 is 3% for the homogeneous case. This finding means that the distribution of z-scores for 36 of the 38 results is consistent with the null-hypothesis that the true effect size for these effects is zero. Only two z-scores greater than 4 (one shown, the other greater than 6 not shown) appear to be replicable and robust effects.

One replicable finding was obtained in a study by Halevy, Bornstein, and Sagiv. The authors demonstrated that allocation of money to in-group and out-group members is influenced much more by favoring the in-group than by punishing the out-group. Given the strong effect in the original study (z > 4), I had predicted that this finding would replicate.

The other successful replication was a study by Lemay and Clark (DOI: 10.1037/0022-3514.94.4.647). The replicated finding was that participants projected their own responsiveness in a romantic relationship onto their partners' responsiveness, while controlling for partners' actual responsiveness. Given the strong effect in the original study (z > 4), I had predicted that this finding would replicate.

Based on weak statistical evidence in the original studies, I had predicted failures of replication for 25 studies. Given the low success rate, it is not surprising that my success rate for these predictions was 100%.

I made the wrong prediction for 11 results. In all cases, I predicted a successful replication when the outcome was a failed replication. Thus, my overall success rate was 27/38 = 71%. Unfortunately, this success rate is easily beaten by a simple prediction rule that nothing in social psychology replicates, which is wrong in only 3 out of 38 predictions (89% success rate).

Below I briefly comment on the 11 failed predictions.

1   Based on strong statistics (z > 4), I had predicted a successful replication for Förster, Liberman, and Kuschel (DOI: 10.1037/0022-3514.94.4.579). However, even when I made this prediction based on the reported statistics, I had my doubts about this study, because statisticians had discovered anomalies in Jens Förster's studies that cast doubt on the validity of these reported results. Post-hoc power analysis can correct for publication bias, but it cannot correct for other sources of bias that lead to vastly inflated effect sizes.

2   I predicted a successful replication of Payne, MA Burkley, MB Stokes. The replication study actually produced a significant result, but it was no longer significant after correcting for the larger sample size in the replication study (180 vs. 70, p = .045 vs. .21). Although the p-value in the replication study is not very reassuring, it is possible that this is a real effect. However, the original result was probably still inflated by sampling error to produce a z-score of 2.97.

3   I predicted a successful replication of McCrea (DOI: 10.1037/0022-3514.95.2.274). This prediction was based on a transcription error: whereas the z-score for the target effect was 1.80, I posted a z-score of 3.5. Ironically, the study did successfully replicate with a larger sample size, but the effect was no longer significant after adjusting the result for sample size (N = 61 vs. N = 28). This study demonstrates that marginally significant effects can reveal real effects, but it also shows that larger samples are needed in replication studies to demonstrate this.

4   I predicted a successful replication for EP Lemay, MS Clark (DOI: 10.1037/0022-3514.95.2.420). This prediction was based on a transcription error because EP Lemay and MS Clark had another study in the project. With the correct z-score of the original result (z = 2.27), I would have predicted correctly that the result would not replicate.

5  I predicted a successful replication of Monin, Sawyer, and Marquez (DOI: 10.1037/0022-3514.95.1.76) based on a strong result for the target effect (z = 3.8). The replication study produced a z-score of 1.45 with a sample size that was not much larger than the original study (N = 75 vs. 67).

6  I predicted a successful replication for Shnabel and Nadler (DOI: 10.1037/0022-3514.94.1.116). The replication study increased sample size by 50% (Ns = 141 vs. 94), but the effect in the replication study was modest (z = 1.19).

7  I predicted a successful replication for van Dijk, van Kleef, Steinel, van Beest (DOI: 10.1037/0022-3514.94.4.600). The sample size in the replication study was slightly smaller than in the original study (N = 83 vs. 103), but even with adjustment the effect was close to zero (z = 0.28).

8   I predicted a successful replication of V Purdie-Vaughns, CM Steele, PG Davies, R Ditlmann, JR Crosby (DOI: 10.1037/0022-3514.94.4.615). The original study had rather strong evidence (z = 3.35). In this case, the replication study had a much larger sample than the original study (N = 1,490 vs. 90) and still did not produce a significant result.

9  I predicted a successful replication of C Farris, TA Treat, RJ Viken, RM McFall (doi:10.1111/j.1467-9280.2008.02092.x). The replication study had a somewhat smaller sample (N = 144 vs. 280), but even with adjustment of sample size the effect in the replication study was close to zero (z = 0.03).

10   I predicted a successful replication of KD Vohs and JW Schooler (doi:10.1111/j.1467-9280.2008.02045.x). I made this prediction based on generally strong statistics, although the strength of the target effect was below 3 (z = 2.8) and the sample size was small (N = 30). The replication study doubled the sample size (N = 58), but produced weak evidence (z = 1.08). However, even the sample size of the replication study is modest and does not allow strong conclusions about the existence of the effect.

11   I predicted a successful replication of Blankenship and Wegener (DOI: 10.1037/0022-3514.94.2.196). The article reported strong statistics, and the z-score for the target effect was greater than 3 (z = 3.36). The study also had a large sample size (N = 261). The replication study had a similarly large sample size (N = 251), but the effect was much smaller than in the original study (z = 3.36 vs. 0.70).

In some of these failed predictions, it is possible that the replication study failed to reproduce the same experimental conditions or that the population of the replication study differed from the population of the original study. However, there are twice as many studies where the failure of replication was predicted based on weak statistical evidence and the presence of publication bias in social psychology journals.

In conclusion, this set of results from a representative sample of articles in social psychology reported a 100% success rate. It is well known that this success rate can only be achieved with selective reporting of significant results. Even the inflated estimate of median observed power is only 71%, which shows that the success rate of 100% is inflated. A power estimate that corrects for inflation suggested that only 35% of results would replicate, and the actual success rate is only 8%. While mistakes by the replication experimenters may contribute to the discrepancy between the prediction of 35% and the actual success rate of 8%, it was predictable based on the results in the original studies that the majority of results would not replicate in replication studies with the same sample size as the original studies.

This low success rate is not characteristic of other sciences and other disciplines in psychology. As mentioned earlier, the success rate for cognitive psychology is higher and comparisons of psychological journals show that social psychology journals have lower replicability than other journals. Moreover, an analysis of time trends shows that replicability of social psychology journals has been low for decades and some journals even show a negative trend in the past decade.

The low replicability of social psychology has been known for over 50 years, when Cohen examined the replicability of results published in the Journal of Social and Abnormal Psychology (now Journal of Personality and Social Psychology), the flagship journal of social psychology. Cohen estimated a replicability of 60%. Social psychologists would rejoice if the reproducibility project had shown a replication rate of 60%. The depressing result is that the actual replication rate was 8%.

The main implication of this finding is that it is virtually impossible to trust any results that are being published in social psychology journals. Yes, two articles that posted strong statistics (z > 4) replicated, but several results with equally strong statistics did not replicate. Thus, it is reasonable to distrust all results with z-scores below 4 (4 sigma rule), but not all results with z-scores greater than 4 will replicate.

Given the low credibility of original research findings, it will be important to raise the quality of social psychology by increasing statistical power. It will also be important to allow publication of non-significant results to reduce the distortion that is created by a file-drawer filled with failed studies. Finally, it will be important to use stronger methods of bias-correction in meta-analysis because traditional meta-analysis seemed to show strong evidence even for incredible effects like premonition for erotic stimuli (Bem, 2011).

In conclusion, the OSF-project demonstrated convincingly that many published results in social psychology cannot be replicated. If social psychology wants to be taken seriously as a science, it has to change the way data are collected, analyzed, and reported and demonstrate replicability in a new test of reproducibility.

The silver lining is that a replication rate of 8% is likely to be an underestimation and that regression to the mean alone might lead to some improvement in the next evaluation of social psychology.

Which Social Psychology Results Were Successfully Replicated in the OSF-Reproducibility Project? Recommending a 4-Sigma Rule

After several years and many hours of hard work by hundreds of psychologists, the results of the OSF-Reproducibility project are in. The project aimed to replicate a representative set of 100 studies from top journals in social and cognitive psychology. The replication studies aimed to reproduce the original studies as closely as possible, while increasing sample sizes somewhat to reduce the risk of type-II errors (failure to replicate a true effect).

The results have been widely publicized in the media. On average, only 36% of studies were successfully replicated; that is, the replication study reproduced a significant result. More detailed analysis shows that results from cognitive psychology had a higher success rate (50%) than results from social psychology (25%).

This post describes the 9 results from social psychology that were successfully replicated. 6 out of the 9 successfully replicated studies reported highly significant results with a z-score greater than 4 sigma (standard deviations) from 0 (p < .0001). Particle physics uses a 5-sigma rule to avoid false positives, and industry has adopted a 6-sigma rule in quality control.

Based on my analysis of the OSF-results, I recommend a 4-sigma rule for textbook writers, journalists, and other consumers of scientific findings in social psychology to avoid dissemination of false information.
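For reference, the tail probabilities behind these sigma rules can be computed from the standard normal distribution:

```python
from scipy.stats import norm

for sigma in (1.96, 4, 5, 6):
    p_one = norm.sf(sigma)                 # one-tailed tail probability
    print(f"z = {sigma}: one-tailed p = {p_one:.1e}, "
          f"two-tailed p = {2 * p_one:.1e}")
```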

List of Studies in Decreasing Order of Strength of Evidence

1. Single Study, Self-Report, Between-Subject Analysis, Extremely large sample (N = 230,047), Highly Significant Result (z > 4 sigma)

CJ Soto, OP John, SD Gosling, J Potter (2008). The developmental psychometrics of big five self-reports: Acquiescence, factor structure, coherence, and differentiation from ages 10 to 20. JPSP:PPID.

This article reported results of a psychometric analysis of self-reports of personality traits in a very large sample (N = 230,047). The replication study used the exact same method with participants from the same population (N = 455,326). Not surprisingly, the results were replicated. Unfortunately, it is not an option to conduct all studies with huge samples like this one.

2.  4 Studies, Self-Report, Large Sample (N = 211), One-Sample Test, Highly Significant Result (z > 4 sigma)

JL Tracy, RW Robins (2008). The nonverbal expression of pride: Evidence for cross-cultural recognition. JPSP:PPID.

The replication project focused on the main effect in Study 4. The main effect in question was whether raters (N = 211) would accurately recognize non-verbal displays of pride in six pictures that displayed pride. The recognition rates were high (range 70%–87%) and highly significant. The sample size of N = 211 is large for a one-sample test that compares a sample mean against a fixed value.

3. Five Studies, Self-Report, Moderate Sample Size (N = 153), Correlation, Highly Significant Result (z > 4 sigma)

EP Lemay, MS Clark (2008). How the head liberates the heart: Projection of communal responsiveness guides relationship promotion. JPSP:IRGP.

Study 5 examined accuracy and biases in perceptions of responsiveness (caring and support for a partner). Participants (N = 153) rated their own responsiveness and how responsive their partner was. Ratings of perceived responsiveness were regressed on self-ratings of responsiveness and targets’ self-ratings of responsiveness. The results revealed a highly significant projection effect; that is, perceptions of responsiveness were predicted by self-ratings of responsiveness. This study produced a highly significant result despite a moderate sample size because the effect size was large.

4. Single Study, Behavior, Moderate Sample (N = 240), Highly Significant Result (z > 4 sigma)

N Halevy, G Bornstein, L Sagiv (2008). In-Group-Love and Out-Group-Hate as Motives for Individual Participation in Intergroup Conflict: A New Game Paradigm, Psychological Science.

This study had a sample size of N = 240. Participants were recruited in groups of six. The experiment had four conditions. The main dependent variable was how a monetary reward was allocated. One manipulation was that some groups had the option to allocate money to the in-group whereas others did not have this option. Naturally, the percentages of allocation to the in-group differed across these conditions. Another manipulation allowed some group-members to communicate whereas in the other condition players had to make decisions on their own. This study produced a highly significant interaction between the two experimental manipulations that was successfully replicated.

5. Single Study, Self-Report, Large Sample (N = 82), Within-Subject Analysis, Highly Significant Result (z > 4 sigma)

M Tamir, C Mitchell, JJ Gross (2008). Hedonic and instrumental motives in anger regulation. Psychological Science.

In this study, 82 participants were asked to imagine being in two types of situations: either scenarios with a hypothetical confrontation or scenarios without a confrontation. They also listened to music that was designed to elicit an excited, angry, or neutral mood. Afterwards, participants rated how much they would like to listen to the music they heard if they were in the hypothetical situation. Each participant listened to all pairings of situation and music, and the data were analyzed within-subject. A sample size of 82 is large for a within-subject comparison of means across conditions. A highly significant interaction revealed a preference for angry music in confrontations and a dislike of angry music without a confrontation; this interaction was successfully replicated.

6. Single Study, Self-Report, Large Sample (N = 124), One-Sample Test, Highly-Significant Result (z > 4 sigma)

DA Armor, C Massey, AM Sackett (2008). Prescribed optimism: Is it right to be wrong about the future? Psychological Science.

In this study, participants (N = 124) were asked to read 8 vignettes that involved making decisions. Participants were asked to judge whether they would recommend making pessimistic, realistic, or optimistic predictions. The main finding was that the average recommendation was to be optimistic. The effect was highly significant. A sample of N = 124 is very large for a design that compares a sample mean to a fixed value.

7. Four Studies, Self-Report, Small Sample (N = 71), Experiment, Moderate Support (z = 2.97)

BK Payne, MA Burkley, MB Stokes (2008). Why do implicit and explicit attitude tests diverge? The role of structural fit. JPSP:ASC.

In this study, participants worked on the standard Affect Misattribution Paradigm (AMP). In the AMP, two stimuli are presented in brief succession. In this study, the first stimulus was a picture of a European American or African American face. The second stimulus was a picture of a Chinese pictogram. In the standard paradigm, participants are asked to report how much they like the second stimulus (the Chinese pictogram) and to ignore the first stimulus (the Black or White face). The AMP is typically used to measure racial attitudes because racial attitudes can influence responses to the Chinese characters.

In this study, the standard AMP was modified by giving two different sets of instructions. One instruction was the standard instruction to respond to the Chinese pictograms. The other instruction was to respond directly to the faces. All participants (N = 71) completed both tasks. The participants were randomly assigned to two conditions. One condition made it easier to honestly report prejudice (low social pressure). The other condition emphasized that prejudice is socially undesirable (high social pressure). The results showed a significantly stronger correlation between the two tasks (ratings of Chinese pictograms & faces) in the low social pressure condition than in the high social pressure condition, and this difference was replicated in the replication study.

8. Two Studies, Self-Report, moderate sample (N = 119), Correlation, Weak Support (z = 2.27)

JT Larsen, AR McKibban (2008). Is happiness having what you want, wanting what you have, or both? Psychological Science.

In this study, participants (N = 124) received a list of 62 material items and were asked to check whether they had each item or not (e.g., a cell phone). They then rated how much they wanted each item. Based on these responses, the authors computed measures of (a) how much participants wanted what they have and (b) have what they wanted. The main finding was that life-satisfaction was significantly predicted by wanting what one has while controlling for having what one wants. This finding was also found in Study 1 (N = 124) and successfully replicated in the OSF-project with a larger sample (N = 238).

9. Five Studies, Behavior, Small Sample (N = 28), Main Effect, Very Weak Support (z = 1.80)

SM McCrea (2008). Self-handicapping, excuse making, and counterfactual thinking: Consequences for self-esteem and future motivation. JPSP:ASC.

In this study, all participants (N = 28) first worked on a math task that was very difficult and participants received failure feedback.   Participants were then randomly assigned to two groups. One group was given feedback that they had insufficient practice (self-handicap). The control group was not given an explanation for their failure. All participants then worked again on a second math task. The main effect showed that performance on the second task was better (higher percentage of correct answers) in the control group than in the self-handicap condition. Although this difference was only marginally significant (p < .05, one-tailed) in the original study, it was significant in the replication study with a larger sample (N = 61).

Although the percentage of correct answers showed only a marginally significant effect, the number of attempted answers and the absolute number of correct answers showed significant effects. Thus, this study does not count as a publication of a null-result. Moreover, these results suggest that participants in the control group were more motivated to do well because they worked on more problems and got more correct answers.