Category Archives: r-index

2016 Blogs

DECEMBER

12/31 ****
Review of an “eventful” 2016 (“Method Terrorists”)

12/6
A Forensic Analysis of Stapel: Fabrication or Falsification?

12/3
Replicability Analysis of Dijksterhuis’s “Enhancing Implicit Self-Esteem by Subliminal Evaluative Conditioning”

SEPTEMBER

9/13 ***
Critique of Finkel, Eastwick, & Reis’s article “Replicability and Other Features of a High-Quality Science: Toward a Balanced and Empirical Approach”

JUNE

6/30 ***
Wagenmakers’ Default Prior is Unrealistic

6/25 ****
A Principled Approach to Setting the Prior of the Null-Hypothesis

6/18 ***
What is the Difference between the Test of Excessive Significance and the Incredibility Index?

6/16 ****
The A Priori Probability of the Point Null-Hypothesis is not 50%

MAY

5/21
Replicability Report on Social Priming Studies with Mating Primes

5/18
Critique of Jeffrey N. Rouder, Richard D. Morey, and Eric-Jan Wagenmakers’s article “The Interplay between Subjectivity, Statistical Practice, and Psychological Science” 

5/9 ***
Questionable Research Practices Invalidate Bayes-Factors Just As Much as P-Values

APRIL

4/18 *****
Replicability Report of the Ego Depletion Literature

FEBRUARY

2/16 ****
Discussion of Sterling et al.’s (1995) Seminal Article on Inflated Success Rates in Psychological Science [also recommend reading the original article]

2/10
Replicability AudiT of a 10 Study Article by Adam D. Galinsky

2/9
A Replicability AudiT of Yaacov Trope’s Publications

2/3 ***
A Critique of Finkel, Eastwick, & Reis’s Views on the Replication Crisis

JANUARY

1/31 *****
Introduction to the R-Index
[The R-Index builds on the Incredibility Index, Schimmack (2012)]

1/31
Replicability Analysis of Damisch, Stoberock, & Mussweiler (2010)
[Anonymous Submission to R-Index Blog]

1/31
Replicability Analysis of Williams & Bargh (2008)

1/14 ***
Discussion of Hoenig and Heisey’s Critique of Observed Power Calculations

How replicable are statistically significant results in social psychology? A replication and extension of Motyl et al. (in press). 

Forthcoming article: 
Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., Prims, J., Sun, J., Washburn, A. N., Wong, K., Yantis, C. A., & Skitka, L. J. (in press). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology. (preprint)

Brief Introduction

Since JPSP published incredible evidence for mental time travel (Bem, 2011), the credibility of social psychological research has been questioned.  There is talk of a crisis of confidence, a replication crisis, or a credibility crisis.  However, hard data on the credibility of empirical findings published in social psychology journals are scarce.

There have been two approaches to examining the credibility of social psychology.  One approach relies on replication studies.  Authors attempt to replicate original studies as closely as possible.  The most ambitious replication project was carried out by the Open Science Collaboration (Science, 2015), which replicated one study from each of 100 articles; 54 articles were classified as social psychology.  For original articles that reported a significant result, only a quarter produced a significant result again in the replication studies.  This estimate of replicability suggests that researchers conduct many more studies than are published and that effect sizes in published articles are inflated by sampling error, which makes them difficult to replicate.  One concern about the OSC results is that replicating original studies can be difficult.  For example, a bilingual study in California may not produce the same results as a bilingual study in Canada.  It is therefore possible that the poor outcome is partially due to problems of reproducing the exact conditions of original studies.

A second approach is to estimate replicability of published results using statistical methods.  The advantage of this approach is that replicability estimates are predictions for exact replication studies of the original studies because the original studies provide the data for the replicability estimates.  This is the approach used by Motyl et al.

The authors sampled 30% of articles published in 2003-2004 (pre-crisis) and 2013-2014 (post-crisis) from four major social psychology journals (JPSP, PSPB, JESP, and PS).  For each study, coders identified one focal hypothesis and recorded the statistical result.  The bulk of the statistics were t-values from t-tests or regression analyses and F-tests from ANOVAs.  Only 19 statistics were z-tests.   The authors applied various statistical tests to the data that test for the presence of publication bias or whether the studies have evidential value (i.e., reject the null-hypothesis that all published results are false positives).  For the purpose of estimating replicability, the most important statistic is the R-Index.

The R-Index has two components.  First, it uses the median observed power of studies as an estimate of replicability (i.e., the percentage of studies that should produce a significant result if all studies were replicated exactly).  Second, it computes the percentage of studies with a significant result.  In an unbiased set of studies, median observed power and percentage of significant results should match.  Publication bias and questionable research practices will produce more significant results than predicted by median observed power.  The discrepancy is called the inflation rate.  The R-Index subtracts the inflation rate from median observed power because median observed power is an inflated estimate of replicability when bias is present.  The R-Index is not a replicability estimate.  That is, an R-Index of 30 does not mean that 30% of studies will produce a significant result.  However, a set of studies with an R-Index of 30 will have fewer successful replications than a set of studies with an R-Index of 80.  An exception is an R-Index of 50, which is equivalent to a replicability estimate of 50%.  If the R-Index is below 50, one would expect more replication failures than successes.
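The two components can be sketched in a few lines of R.  This is only an illustration of the description above, not the code used by Motyl et al.: the function name r_index is made up, observed power is approximated from absolute z-scores while ignoring the negligible opposite tail, and the example z-scores are hypothetical.

```r
# R-Index = median observed power - inflation,
# where inflation = success rate - median observed power.
r_index <- function(z, alpha = 0.05) {
  z.crit <- qnorm(1 - alpha / 2)        # 1.96 for alpha = .05 (two-tailed)
  obs.pow <- pnorm(abs(z) - z.crit)     # observed power of each study (approximation)
  mop <- median(obs.pow)                # median observed power
  success <- mean(abs(z) > z.crit)      # share of significant results
  inflation <- success - mop
  c(median.obs.power = mop, success.rate = success,
    inflation = inflation, r.index = mop - inflation)
}

# Hypothetical z-scores, made up for illustration only
z.example <- c(2.1, 2.3, 2.6, 1.2, 3.4)
round(r_index(z.example), 2)
```

Because inflation is subtracted from median observed power, the R-Index equals twice the median observed power minus the success rate, so it drops quickly when nearly all results are significant but observed power is modest.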

Motyl et al. computed the R-Index separately for the 2003/2004 and the 2013/2014 results and found “the R-index decreased numerically, but not statistically over time, from .62 [CI95% = .54, .68] in 2003-2004 to .52 [CI95% = .47, .56] in 2013-2014. This metric suggests that the field is not getting better and that it may consistently be rotten to the core.”

I think this interpretation of the R-Index results is too harsh.  I consider an R-Index below 50 an F (fail).  An R-Index in the 50s is a D, and an R-Index in the 60s is a C.  An R-Index greater than 80 is considered an A.  So, clearly there is a replication crisis, but social psychology is not rotten to the core.

The R-Index is a simple tool, but it is not designed to estimate replicability.  Jerry Brunner and I developed a method that can estimate replicability, called z-curve.  All test-statistics are converted into absolute z-scores and a kernel density distribution is fitted to the histogram of z-scores.  Then a mixture model of normal distributions is fitted to the density distribution and the means of the normal distributions are converted into power values. The weights of the components are used to compute the weighted average power. When this method is applied only to significant results, the weighted average power is the replicability estimate;  that is, the percentage of significant results that one would expect if the set of significant studies were replicated exactly.   Motyl et al. did not have access to this statistical tool.  They kindly shared their data and I was able to estimate replicability with z-curve.  For this analysis, I used all t-tests, F-tests, and z-tests (k = 1,163).   The Figure shows two results.  The left figure uses all z-scores greater than 2 for estimation (all values on the right side of the vertical blue line). The right figure uses only z-scores greater than 2.4.  The reason is that just-significant results may be compromised by questionable research methods that may bias estimates.
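The first step, converting test statistics into absolute z-scores, can be sketched as follows.  This is a minimal illustration that assumes the conversion goes through the two-sided p-value; the helper names t_to_z and f_to_z and the example values are hypothetical, and the density fitting and mixture model of z-curve are not shown.

```r
# Convert a test statistic to an absolute z-score via its two-sided p-value.
# Working on the log scale keeps very small p-values from underflowing.
t_to_z <- function(t, df) {
  log.p.half <- pt(abs(t), df, lower.tail = FALSE, log.p = TRUE)  # log(p / 2)
  -qnorm(log.p.half, log.p = TRUE)
}

f_to_z <- function(f, df1, df2) {
  log.p <- pf(f, df1, df2, lower.tail = FALSE, log.p = TRUE)      # log(p)
  -qnorm(log.p - log(2), log.p = TRUE)
}

# Hypothetical example: t(48) = 2.50 and the equivalent F(1, 48) = 6.25
t_to_z(2.50, 48)
f_to_z(6.25, 1, 48)
```

An F-test with one numerator degree of freedom is the square of a t-test, so both calls should return the same z-score.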

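[Figure: z-curve for Motyl et al.’s data, estimated with z-scores greater than 2.0 (left) and greater than 2.4 (right)]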

The key finding is the replicability estimate.  Both estimations produce similar results (48% vs. 49%).  Even with over 1,000 observations there is uncertainty in these estimates and the 95%CI can range from 45 to 54% using all significant results.   Based on this finding, it is predicted that about half of these results would produce a significant result again in a replication study.

However, it is important to note that there is considerable heterogeneity in replicability across studies.  As z-scores increase, the strength of evidence becomes stronger, and results are more likely to replicate.  This is shown with average power estimates for bands of z-scores at the bottom of the figure.  In the left figure, z-scores between 2 and 2.5 (~ .01 < p < .05) have a replicability of only 31%, and even z-scores between 2.5 and 3 have a replicability below 50%.  It requires z-scores greater than 4 to reach a replicability of 80% or more.  Similar results are obtained for actual replication studies in the OSC reproducibility project.  Thus, researchers should take the strength of evidence of a particular study into account.  Studies with p-values in the .01 to .05 range are unlikely to replicate without boosting sample sizes.  Studies with p-values less than .001 are likely to replicate even with the same sample size.
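For readers who think in p-values rather than z-scores, the correspondence between the two can be checked with a short sketch.  It only assumes two-sided p-values from a standard normal distribution.

```r
z <- c(2, 2.5, 3, 4)
p <- 2 * pnorm(-z)       # two-sided p-value for each z-score
round(p, 5)

qnorm(1 - .001 / 2)      # z-score that corresponds to p = .001
```

A z-score of 2 corresponds to p = .046 and a z-score of 2.5 to p = .012, which is why the 2 to 2.5 band roughly covers .01 < p < .05; p = .001 corresponds to a z-score of about 3.29.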

Independent Replication Study 

Schimmack and Brunner (2016) applied z-curve to the original studies in the OSC reproducibility project.  For this purpose, I coded all studies in the OSC reproducibility project.  The actual replication project often picked one study from articles with multiple studies.  54 social psychology articles reported 173 studies.   The focal hypothesis test of each study was used to compute absolute z-scores that were analyzed with z-curve.

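[Figure: z-curve for the 173 focal tests from the 54 social psychology articles in the OSC reproducibility project]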

The two estimation methods (using z > 2.0 or z > 2.4) produced very similar replicability estimates (53% vs. 52%).  The estimates are only slightly higher than those for Motyl et al.’s data (48% & 49%) and the confidence intervals overlap.  Thus, this independent replication study closely replicates the estimates obtained with Motyl et al.’s data.

Automated Extraction Estimates

Hand-coding of focal hypothesis tests is labor intensive and subject to coding biases.  Studies often report more than one hypothesis test, and it is not trivial to pick one of the tests for further analysis.  An alternative approach is to automatically extract all test statistics from articles.  This also makes it possible to base estimates on a much larger sample of test results.  The downside of automated extraction is that articles also report statistical analyses for trivial or non-critical tests (e.g., manipulation checks).  The extraction of non-significant results is irrelevant because they are not used by z-curve to estimate replicability.  I have reported the results of this method for various social psychology journals covering the years from 2010 to 2016 and posted powergraphs for all journals and years (2016 Replicability Rankings).  Further analyses replicated the finding from the OSC reproducibility project that results published in cognitive journals are more replicable than those published in social journals.  The Figure below shows that the average replicability estimate for social psychology is 61%, with an encouraging trend in 2016.  This estimate is about 10 percentage points above the estimates based on hand-coded focal hypothesis tests in the two datasets above.  This discrepancy may be due to the inclusion of less original and trivial statistical tests in the automated analysis.  However, a difference of 10 percentage points is not dramatic.  Neither 50% nor 60% replicability justifies the claim that social psychology is rotten to the core, nor does either meet the expectation that researchers should plan studies with 80% power to detect a predicted effect.

[Figure: replicability estimates for cognitive versus social psychology journals, 2010 to 2016]

Moderator Analyses

Motyl et al. (in press) did extensive coding of the studies.  This makes it possible to examine potential moderators (predictors) of higher or lower replicability.  As noted earlier, the strength of evidence is an important predictor.  Studies with higher z-scores (smaller p-values) are, on average, more replicable.  The strength of evidence is a direct function of statistical power.  Thus, studies with larger population effect sizes and smaller sampling error are more likely to replicate.

It is well known that larger samples have less sampling error.  Not surprisingly, there is a correlation between sample size and the absolute z-scores (r = .3).  I also examined the R-Index for different ranges of sample sizes.  The R-Index was the lowest for sample sizes between N = 40 and 80 (R-Index = 43), increased for N = 80 to 200 (R-Index = 52) and further for sample sizes between 200 and 1,000 (R-Index = 69).  Interestingly, the R-Index for small samples with N < 40 was 70.  This is explained by the fact that research designs also influence replicability and that small samples often use more powerful within-subject designs.

A moderator analysis with design as moderator confirms this.  The R-Index is lowest for between-subject designs (R-Index = 48), followed by mixed designs (R-Index = 61) and then within-subject designs (R-Index = 75).  This pattern is also found in the OSC reproducibility project and partially accounts for the higher replicability of cognitive studies, which often employ within-subject designs.

Another possibility is that articles with more studies package smaller and less replicable studies.  However,  number of studies in an article was not a notable moderator:  1 study R-Index = 53, 2 studies R-Index = 51, 3 studies R-Index = 60, 4 studies R-Index = 52, 5 studies R-Index = 53.

Conclusion 

Motyl et al. (in press) coded a large and representative sample of results published in social psychology journals.  Their article complements results from the OSC reproducibility project that used actual replications, but a much smaller number of studies.  The two approaches produce different results.  Actual replication studies produced only 25% successful replications.  Statistical estimates of replicability are around 50%.   Due to the small number of actual replications in the OSC reproducibility project, it is important to be cautious in interpreting the differences.  However, one plausible explanation for lower success rates in actual replication studies is that it is practically impossible to redo a study exactly.  This may even be true when researchers conduct three similar studies in their own lab and only one of these studies produces a significant result.  Some non-random, but also not reproducible, factor may have helped to produce a significant result in this study.  Statistical models assume that we can redo a study exactly and may therefore overestimate the success rate for actual replication studies.  Thus, the 50% estimate is an optimistic estimate for the unlikely scenario that a study can be replicated exactly.  This means that even though optimists may see the 50% estimate as “the glass half full,” social psychologists need to increase statistical power and pay more attention to the strength of evidence of published results to build a robust and credible science of social behavior.

 

 

Hidden Figures: Replication Failures in the Stereotype Threat Literature

In the past five years, it has become apparent that many classic and important findings in social psychology fail to replicate (Schimmack, 2016).  The replication crisis is often considered a new phenomenon, but failed replications are not entirely new.  Sometimes these studies have simply been ignored.  These studies deserve more attention and need to be reevaluated in the context of the replication crisis in social psychology.

In the past, failed replications were often dismissed because seminal articles were assumed to provide robust empirical support for a phenomenon, especially if an article presented multiple studies.  The chance of reporting a false-positive result in a multiple-study article is low because the risk of a false positive decreases exponentially with the number of studies (Schimmack, 2012).  However, the low risk of a false positive is illusory if authors only publish studies that worked.  In this case, even false positives can be supported by significant results in multiple studies, as demonstrated in the infamous ESP study by Bem (2011).  As a result, publication bias undermines the reporting of statistical significance as diagnostic information about the risk of false positives (Sterling, 1959), and many important theories in social psychology rest on shaky empirical foundations that need to be reexamined.
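The arithmetic behind both claims can be sketched in R.  The sketch assumes independent studies of a true null effect, each tested at alpha = .05; the variable names are made up for illustration.

```r
alpha <- 0.05
k <- 1:5                            # number of studies in an article

# Chance that k independent studies of a null effect are all significant
p.all.sig <- alpha^k

# Average number of null studies that must be run to obtain k significant ones
expected.attempts <- k / alpha

data.frame(k, p.all.sig, expected.attempts)
```

Without selection, five significant results for a null effect are practically impossible.  With selection, they only require running about 100 studies on average and reporting the five that worked.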

Research on stereotype threat and women’s performance on math tests is one example where publication bias undermines the findings in a seminal study that produced a large literature of studies on gender differences in math performance. After correcting for publication bias, this literature shows very little evidence that stereotype threat has a notable and practically significant effect on women’s math performance (Flore & Wicherts, 2014).

Another important line of research has examined the contribution of stereotype threat to differences between racial groups on academic performance tests.  This blog post examines the strength of the empirical evidence for stereotype threat effects in the seminal article by Steele and Aronson (1995). This article is currently the 12th most cited article in the top journal for social psychology, Journal of Personality and Social Psychology (2,278 citations so far).

According to the abstract, “stereotype threat is being at risk of confirming, as self-characteristic, a negative stereotype about one’s group.” Studies 1 and 2 showed that “reflecting the pressure of this vulnerability, Blacks underperformed in relation to Whites in the ability-diagnostic condition but not in the nondiagnostic condition (with Scholastic Aptitude Tests controlled).”  “Study 3 validated that ability-diagnosticity cognitively activated the racial stereotype in these participants and motivated them not to conform to it, or to be judged by it.”  “Study 4 showed that mere salience of the stereotype could impair Blacks’ performance even when the test was not ability diagnostic.”

The results of Study 4 motivated Stricker and colleagues to examine the influence of stereotype threat on test performance in a real-world testing situation.  These studies had large samples and were not limited to students at Stanford.  One study was reported in a College Board Report (Stricker and Ward, 1998).  Another two studies were published in the Journal of Applied Social Psychology (Stricker & Ward, 2004).  This article received only 52 citations, although it reported two studies with an experimental manipulation of stereotype threat in a real assessment context.  One group of participants was asked about their gender or ethnicity before the test; the other group did not receive these questions.  As noted in the abstract, neither the inquiry about race nor the inquiry about gender had a significant effect on test performance.  In short, this study failed to replicate Study 4 of the classic and widely cited article by Steele and Aronson.

Stricker and Ward’s Abstract
Steele and Aronson (1995) found that the performance of Black research participants on ability test items portrayed as a problem-solving task, in laboratory experiments, was affected adversely when they were asked about their ethnicity. This outcome was attributed to stereotype threat: Performance was disrupted by participants’ concerns about fulfilling the negative stereotype concerning Black people’s intellectual ability. The present field experiments extended that research to other ethnic groups and to males and females taking operational tests. The experiments evaluated the effects of inquiring about ethnicity and gender on the performance of students taking 2 standardized tests-the Advanced Placement Calculus AB Examination, and the Computerized Placement Tests-in actual test administrations. This inquiry did not have any effects on the test performance of Black, female, or other subgroups of students that were both statistically and practically significant.

The article also mentions a personal communication with Steele, in which Steele mentions an unpublished study that also failed to demonstrate the effect under similar conditions.

“In fact, Steele found in an unpublished pilot study that inquiring about ethnicity did not affect Black participants’ performance when the task was described as diagnostic of their ability (C. M. Steele, personal communication, May 21, 1997), in contrast to the substantial effect of inquiring when the task was described as nondiagnostic.”

A substantive interpretation of this finding is that inquiries about race or gender do not produce stereotype threat effects when a test is diagnostic because a diagnostic test already activates stereotype threat.  However, if this were a real moderator, it would be important to document this fact, and it is not clear why this finding obtained in an earlier study by Steele remained unpublished.  Moreover, it is premature to interpret the significant result in the published study with a non-diagnostic task and the non-significant result in an unpublished study with a diagnostic task as evidence that diagnosticity moderates the effect of the stereotype-threat manipulation.  A proper test of this moderator hypothesis would require the demonstration of a three-way interaction between race, inquiry about race, and diagnosticity.  Absent this evidence, it remains possible that diagnosticity is not a moderator and that the published result is a false positive (or a positive result with an inflated effect size estimate).  In contrast, there appears to be consistent evidence that inquiries about race or gender before a real assessment of academic performance do not influence performance.  This finding is not widely publicized, but it is important for a better understanding of performance differences in real-world settings.

The best way to examine the replicability of Steele and Aronson’s seminal finding with non-diagnostic tasks would be to conduct an exact replication study.  However, exact replication studies are difficult and costly.  An alternative is to examine the robustness of the published results by taking a closer look at the strength of the statistical results reported by Steele and Aronson, using modern statistical tests of publication bias and statistical power like the R-Index (Schimmack, 2014) and the Test of Insufficient Variance (TIVA, Schimmack, 2014).

Replicability Analysis of Steele and Aronson’s four studies

Study 1. The first study had a relatively large sample of N = 114 participants, but it is not clear how many of the participants were White or Black.  The study also had a 2 x 3 design, which leaves fewer than 20 participants per condition.  The study produced a significant main effect of condition, F(2, 107) = 4.74, and race, F(1,107) = 5.22, but the critical condition x race interaction was not significant (reported as p > .19).  However, a specific contrast showed significant differences between Black participants in the diagnostic condition and the non-diagnostic condition, t(107) = 2.88, p = .005, z = 2.82.  The authors concluded “in sum, then, the hypothesis was supported by the pattern of contrasts, but when tested over the whole design, reached only marginal significance” (p. 800).  In other words, Study 1 provided only weak support for the stereotype threat hypothesis.

Study 2. Study 2 eliminated one of the three experimental conditions. Participants were 20 Black and 20 White participants. This means there were only 10 participants in each condition of a 2 x 2 design. The degrees of freedom further indicate that the actual sample size was only 38 participants. Given the weak evidence in Study 1, there is no justification for a reduction in the number of participants per cell, although the difficulty of recruiting Black participants at Stanford may explain this inadequate sample size. Nevertheless, the study showed a significant interaction between race and test description, F(1,35) = 8.07, p = .007. The study also replicated the contrast from Study 1 that Black participants in the diagnostic condition performed significantly worse than Black participants in the non-diagnostic group, t(35) = 2.38, p = .023, z = 2.28.

Studies 1 and 2 are close replications of each other.  The consistent finding across the two studies that supports stereotype threat theory is that merely changing the description of an assessment task changes Black participants’ performance, as revealed by significant differences between the diagnostic and non-diagnostic conditions in both studies.  The problem is that both studies had small numbers of Black participants and that small samples have low power to produce significant results.  As a result, it is unlikely that a pair of such studies would both produce significant results.

Observed power in the two studies is .81 and .62, with median observed power of .71.  Thus, the actual success rate of 100% (2 out of 2 significant results) is 29 percentage points higher than the expected success rate.  Moreover, when inflation is evident, median observed power is also inflated.  To correct for this inflation, the Replicability-Index (R-Index) subtracts inflation from median observed power, which yields an R-Index of 71 - 29 = 42.  Any value below 50 is considered unacceptably low and I give it a letter grade F, just like students at American universities receive an F for exams with less than 50% correct answers.  This does not mean that stereotype threat is not a valid theory or that there was no real effect in this pair of studies.  It simply means that the evidence in this highly cited article is insufficient to make strong claims about the causes of Black students’ performance on academic tests.

The Test of Insufficient Variance (TIVA) provides another way to examine published results.  Test statistics like t-values vary considerably from study to study even if the exact same study is conducted twice (or if one larger sample is randomly split into two sub-samples).  When test statistics are converted into z-scores, sampling error (the random variability from sample to sample) follows approximately a standard normal distribution with a variance of 1.  If the variance is considerably smaller than 1, it suggests that the reported results represent a selected sample.  Often the selection is a result of publication bias.  Applying TIVA to the pair of studies yields a variance of Var(z) = 0.15.  As there are only two studies, it is possible that this outcome occurred by chance, p = .300, and it does not imply intentional selection for significance or other questionable research practices.  Nevertheless, it suggests that future replication studies will be more variable and produce some non-significant results.
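A minimal sketch of this test in R, assuming that (k - 1) times the observed variance of k independent z-scores follows a chi-square distribution with k - 1 degrees of freedom when the expected variance is 1.  The function name tiva is made up for illustration.

```r
tiva <- function(z) {
  k <- length(z)
  v <- var(z)                             # observed variance of the z-scores
  # Left tail of the chi-square distribution: how likely is a variance
  # this small or smaller if the true variance is 1?
  p <- pchisq((k - 1) * v, df = k - 1)
  c(variance = v, p.value = p)
}

# z-scores of the focal contrasts in Studies 1 and 2
round(tiva(c(2.82, 2.28)), 2)
```

With the two z-scores reported above, the sketch reproduces a variance of about 0.15 and a p-value of about .30.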

In conclusion, the evidence presented in the first two studies is weaker than we might assume if we focused only on the fact that both studies produced significant contrasts. Given publication bias, the fact that both studies reported significant results provides no empirical evidence because virtually all published studies report significant results. The R-Index quantifies the strength of evidence for an effect while taking the influence of publication bias into account and it shows that the two studies with small samples provide only weak evidence for an effect.

Study 3.  This study did not examine performance. The aim was to demonstrate activation of stereotype threat with a sentence completion task.  The sample size of 68 participants (35 Black, 33 White) implied that only 11 or 12 participants were assigned to each of the six cells in a 2 (race) by 3 (task description) design. The study produced main effects for race and condition, but most importantly it produced a significant interaction effect, F(2,61) = 3.30, p = .044.  In addition, Black participants in the diagnostic condition had more stereotype-related associations than Black participants in the non-diagnostic condition, t(61) = 3.53,

Study 4.  This study used inquiry about race to induce stereotype threat. Importantly, the task was described as non-diagnostic (as noted earlier, a similar study produced no significant results when the task was described as diagnostic).  The design was a 2 x 2 design with 47 participants, which means only 11 or 12 participants were allocated to each of the four conditions.  The degrees of freedom indicated that cell frequencies were even lower. The study produced a significant interaction effect, F(1,39) = 7.82, p = .008.  The study also produced a significant contrast between Blacks in the race-prime condition and the no-prime condition, t(39) = 2.43, p = .020.

The contrast effect in Study 3 is strong, but it is not a performance measure.  If stereotype threat mediates the effect of task characteristics on performance, we would expect a stronger effect on the measure of the mediator than on the actual outcome of interest, task performance.  The key aim of stereotype threat theory is to explain differences in performance.  With a focus on performance outcomes, it is possible to examine the R-Index and TIVA of Studies 1, 2, and 4.  All three studies reported significant contrasts between Black students randomly assigned to two groups that were expected to show performance differences (Table 1).

Table 1

Study     Test Statistic   p-value   z-score   obs.pow
Study 1   t(107) = 2.88    0.005     2.82      0.81
Study 2   t(35) = 2.38     0.023     2.28      0.62
Study 4   t(39) = 2.43     0.020     2.33      0.64

Median observed power is 64% and the inflation rate is 100 - 64 = 36 percentage points, so the R-Index is well below 50: 64 - 36 = 28 (F).  The variance in z-scores is Var(z) = 0.09, p = .086.  These results cast doubt on the replicability of the performance effects reported in Steele and Aronson’s seminal stereotype threat article.
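These summary numbers can be checked directly from the columns of Table 1.  The following lines are a sketch of the arithmetic only; they take the rounded table values as input, so the TIVA p-value may differ from the reported value in the third decimal.

```r
obs.pow <- c(0.81, 0.62, 0.64)   # obs.pow column of Table 1
z <- c(2.82, 2.28, 2.33)         # z-score column of Table 1

mop <- median(obs.pow)           # median observed power
inflation <- 1 - mop             # success rate is 100%: all three contrasts were significant
r.index <- mop - inflation

var.z <- var(z)                  # variance of the z-scores
p.tiva <- pchisq((length(z) - 1) * var.z, df = length(z) - 1)

round(c(mop = mop, inflation = inflation, r.index = r.index,
        var.z = var.z, p.tiva = p.tiva), 3)
```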

Conclusion

Racial stereotypes and racial disparities are an important social issue.  Social psychology aims and promises to contribute to the understanding of this issue by conducting objective, scientific studies that can inform our understanding of these issues.  In order to live up to these expectations, social psychology has to follow the rules of science and listen to the data.  Just like it is important to get the numbers right to send men and women into space (and bring them back), it is important to get the numbers right when we use science to understand women and men on earth.  Unfortunately, social psychologists have not followed the examples of astronomers and the numbers do not add up.

The three African American women featured in this year’s movie “Hidden Figures”***, Katherine Johnson, Dorothy Vaughan, and Mary Jackson, might not approve of the casual way social psychologists use numbers in their research, especially the widespread practice of hiding numbers that do not match expectations.  No science that wants to make a real-world contribution can condone this practice.  It is also not acceptable to simply ignore published results from well-conducted studies with large samples that challenge a prominent theory.

Surely, the movie Hidden Figures dramatized some of the experiences of Black women at NASA, but there is little doubt that Katherine Johnson, Dorothy Vaughan, and Mary Jackson encountered many obstacles that might be considered stereotype-threatening situations.  Yet, they prevailed and they paved the way for future generations of stereotyped groups.  Understanding racial and gender bias and performance differences remains an important issue, and that is the reason why it is important to shed light on hidden numbers and put simplistic theories under the microscope.  Stereotype threat is too often used as a simple explanation that avoids tackling deeper and more difficult issues that cannot be easily studied in a quick laboratory experiment with undergraduate students at top research universities.  It is time for social psychologists to live up to their promises by tackling real-world issues with research designs that have real-world significance and that produce real evidence using open and transparent research practices.

————————————————————————————————————————————

*** If you haven’t seen the movie, I highly recommend it.

 

How Selection for Significance Influences Observed Power

Two years ago, I posted an Excel spreadsheet to help people to understand the concept of true power, observed power, and how selection for significance inflates observed power. Two years have gone by and I have learned R. It is time to update the post.

There is no mathematical formula to correct observed power for inflation to solve for true power. This was partially the reason why I created the R-Index, which is an index of true power, but not an estimate of true power.  This has led to some confusion and misinterpretation of the R-Index (Disjointed Thought blog post).

However, it is possible to predict median observed power given true power and selection for statistical significance.  To use this method for real data with observed median power of only significant results, one can simply generate a range of true power values, generate the predicted median observed power and then pick the true power value with the smallest discrepancy between median observed power and simulated inflated power estimates. This approach is essentially the same as the approach used by pcurve and puniform, which only differ in the criterion that is being minimized.

Here is the R code for the conversion of true.power into the predicted observed power after selection for significance.

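z.crit = qnorm(.975)   # critical z-value for alpha = .05 (two-tailed)
true.power = seq(.01,.99,.01)
# predicted median observed power of the significant results, given true power
obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,z.crit)),z.crit)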

And here is a pretty picture of the relationship between true power and inflated observed power.  As we can see, there is more inflation for low true power because observed power after selection for significance has to be greater than 50%.  With alpha = .05 (two-tailed), when the null-hypothesis is true, inflated observed power is 61%.   Thus, an observed median power of 61% for only significant results supports the null-hypothesis.  With true power of 50%, observed power is inflated to 75%.  For high true power, the inflation is relatively small. With the recommended true power of 80%, median observed power for only significant results is 86%.

[Figure: median observed power after selection for significance plotted against true power]

Observed power is easy to calculate from reported test statistics. The first step is to compute the exact two-tailed p-value.  These p-values can then be converted into observed power estimates using the standard normal distribution.

z.crit = qnorm(.975)
Obs.power = pnorm(qnorm(1-p/2),z.crit)
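As a hedged illustration of these two steps, the sketch below converts a t-value and an F-value into exact two-tailed p-values, z-scores, and observed power. The helper name obs.power.from.p and the two example statistics, t(28) = 2.86 and F(1,28) = 4.47, are illustrative choices; alpha = .05 (two-tailed) is assumed.

```r
# Sketch: convert reported test statistics into observed power.
# obs.power.from.p is a hypothetical helper name, not an existing function.
z.crit = qnorm(.975)

obs.power.from.p = function(p) {
  z = qnorm(1 - p/2)   # two-tailed p-value -> absolute z-score
  pnorm(z, z.crit)     # probability that a z-score centered on z exceeds z.crit
}

p.t = 2 * pt(-abs(2.86), df = 28)                      # t(28) = 2.86
p.F = pf(4.47, df1 = 1, df2 = 28, lower.tail = FALSE)  # F(1,28) = 4.47
p = c(p.t, p.F)
op = obs.power.from.p(p)

round(cbind(p = p, z = qnorm(1 - p/2), obs.power = op), 3)
```

Because F(1, df) is the square of t(df), the F-test p-value is already two-tailed and needs no doubling.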

If there is selection for significance, you can use the previous formula to convert this observed power estimate into an estimate of true power.

This method assumes that (a) significant results are representative of the distribution and there are no additional biases (no p-hacking) and (b) all studies have the same or similar power.  This method does not work for heterogeneous sets of studies.
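The grid search described above can be sketched as follows. This is a minimal illustration under the same assumptions (homogeneous power, selection for significance only); the helper name estimate.true.power and the grid step of .01 are assumptions made for the example.

```r
# Sketch: invert the true power -> inflated observed power formula by grid search.
z.crit = qnorm(.975)
true.power = seq(.01, .99, .01)
pred.obs.pow = pnorm(qnorm(true.power/2 + (1 - true.power), qnorm(true.power, z.crit)), z.crit)

# Hypothetical helper: return the true power value whose predicted median
# observed power is closest to the observed median power of significant results.
estimate.true.power = function(obs.median.power) {
  true.power[which.min(abs(pred.obs.pow - obs.median.power))]
}

est.75 = estimate.true.power(.75)   # 75% observed corresponds to 50% true power
est.86 = estimate.true.power(.86)   # close to the recommended 80% true power
c(est.75, est.86)
```

Observed medians near or below the value predicted under the null-hypothesis map onto the bottom of the grid, so such estimates should be read as "no evidence of power above alpha" rather than as precise values.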

P.S.  It is possible to prove the formula that transforms true power into median observed power.  Another way to verify that the formula is correct is to confirm the predicted values with a simulation study.

Here is the code to run the simulation study:

n.sim = 100000
z.crit = qnorm(.975)
true.power = seq(.01,.99,.01)
obs.pow.sim = c()
for (i in 1:length(true.power)) {
  z.sim = rnorm(n.sim,qnorm(true.power[i],z.crit))
  med.z.sig = median(z.sim[z.sim > z.crit])
  obs.pow.sim = c(obs.pow.sim,pnorm(med.z.sig,z.crit))
}
obs.pow.sim

obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,z.crit)),z.crit)
obs.pow
cbind(true.power,obs.pow.sim,obs.pow)
plot(obs.pow.sim,obs.pow)

 

 

Reconstruction of a Train Wreck: How Priming Research Went off the Rails

Authors:  Ulrich Schimmack, Moritz Heene, and Kamini Kesavan

 

Abstract:
We computed the R-Index for studies cited in Chapter 4 of Kahneman’s book “Thinking Fast and Slow.” This chapter focuses on priming studies, starting with John Bargh’s study that led to Kahneman’s open email.  The results are eye-opening and jaw-dropping.  The chapter cites 12 articles and 11 of the 12 articles have an R-Index below 50.  The combined analysis of 31 studies reported in the 12 articles shows 100% significant results with average (median) observed power of 57% and an inflation rate of 43%.  The R-Index is 14. This result confirms Kahneman’s prediction that priming research is a train wreck and readers of his book “Thinking Fast and Slow” should not consider the presented studies as scientific evidence that subtle cues in their environment can have strong effects on their behavior outside their awareness.

Introduction

In 2011, Nobel Laureate Daniel Kahneman published a popular book, “Thinking Fast and Slow”, about important findings in social psychology.

In the same year, questions about the trustworthiness of social psychology were raised.  A Dutch social psychologist had fabricated data. Eventually over 50 of his articles would be retracted.  Another social psychologist published results that appeared to demonstrate the ability to foresee random future events (Bem, 2011). Few researchers believed these results and statistical analysis suggested that the results were not trustworthy (Francis, 2012; Schimmack, 2012).  Psychologists started to openly question the credibility of published results.

In the beginning of 2012, Doyen and colleagues published a failure to replicate a prominent study by John Bargh that was featured in Daniel Kahneman’s book.  A few months later, Daniel Kahneman distanced himself from Bargh’s research in an open email addressed to John Bargh (Young, 2012):

“As all of you know, of course, questions have been raised about the robustness of priming results…. your field is now the poster child for doubts about the integrity of psychological research… people have now attached a question mark to the field, and it is your responsibility to remove it… all I have personally at stake is that I recently wrote a book that emphasizes priming research as a new approach to the study of associative memory…Count me as a general believer… My reason for writing this letter is that I see a train wreck looming.”

Five years later, Kahneman’s concerns have been largely confirmed. Major studies in social priming research have failed to replicate and the replicability of results in social psychology is estimated to be only 25% (OSC, 2015).

Looking back, it is difficult to understand the uncritical acceptance of social priming as a fact.  In “Thinking Fast and Slow” Kahneman wrote “disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.”

Yet, Kahneman could have seen the train wreck coming. In 1971, he co-authored an article about scientists’ “exaggerated confidence in the validity of conclusions based on small samples” (Tversky & Kahneman, 1971, p. 105).  Yet, many of the studies described in Kahneman’s book had small samples.  For example, Bargh’s priming study used only 30 undergraduate students to demonstrate the effect.

Replicability Index

Small samples can be sufficient to detect large effects. However, small effects require large samples.  The probability of replicating a published finding is a function of sample size and effect size.  The Replicability Index (R-Index) makes it possible to use information from published results to predict how replicable published results are.

Every reported test-statistic can be converted into an estimate of power, called observed power. For a single study, this estimate is useless because it is not very precise. However, for sets of studies, the estimate becomes more precise.  If we have 10 studies and the average power is 55%, we would expect approximately 5 to 6 studies with significant results and 4 to 5 studies with non-significant results.

If we observe 100% significant results with an average power of 55%, it is likely that studies with non-significant results are missing (Schimmack, 2012).  There are too many significant results.  This is especially true because average power is also inflated when researchers report only significant results. Consequently, the true power is even lower than average observed power.  If we observe 100% significant results with 55% average observed power, true power is likely to be less than 50%.
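The arithmetic behind this argument can be sketched in a few lines of R. The sketch assumes, for simplicity, that the 10 studies are independent and that each has exactly 55% power; the variable names are illustrative.

```r
k = 10          # number of studies
power = .55     # assumed true power of every study

expected.sig = k * power           # expected number of significant results
p.all.sig = dbinom(k, k, power)    # probability that all k studies are significant

expected.sig   # 5.5
p.all.sig      # about .0025
```

Under these assumptions a perfect record of 10 out of 10 significant results would be expected in roughly 1 out of 400 such sets of studies.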

This is unacceptable. Tversky and Kahneman (1971) wrote “we refuse to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis.”

To correct for the inflation in power, the R-Index uses the inflation rate. For example, if all studies are significant and average power is 75%, the inflation rate is 25 percentage points.  The R-Index subtracts the inflation rate from average power.  So, with 100% significant results and average observed power of 75%, the R-Index is 50% (75% – 25% = 50%).  The R-Index is not a direct estimate of true power. It is actually a conservative estimate of true power if the R-Index is below 50%.  Thus, an R-Index below 50% suggests that a significant result was obtained only by capitalizing on chance, although it is difficult to quantify by how much.
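The computation just described can be sketched as a small R function. The name r.index and its arguments are hypothetical, and the sketch assumes that the median is used to summarize observed power, as in the analyses below.

```r
# Sketch of the R-Index computation.
# obs.power: observed power of each reported test
# sig: logical vector, TRUE for every test reported as significant
r.index = function(obs.power, sig = rep(TRUE, length(obs.power))) {
  mop = median(obs.power)          # median observed power (MOP)
  success.rate = mean(sig)         # proportion of significant results
  inflation = success.rate - mop   # inflation rate
  c(MOP = mop, Inflation = inflation, R.Index = mop - inflation)
}

example = r.index(c(.75, .75))   # all significant, 75% observed power
example                          # MOP = .75, Inflation = .25, R.Index = .50
```

With a 100% success rate the R-Index reduces to 2 * MOP - 1, which is why a median observed power of 50% yields an R-Index of 0.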

How Replicable are the Social Priming Studies in “Thinking Fast and Slow”?

Chapter 4: The Associative Machine

4.1.  Cognitive priming effect

In the 1980s, psychologists discovered that exposure to a word causes immediate and measurable changes in the ease with which many related words can be evoked.

[no reference provided]

4.2.  Priming of behavior without awareness

Another major advance in our understanding of memory was the discovery that priming is not restricted to concepts and words. You cannot know this from conscious experience, of course, but you must accept the alien idea that your actions and your emotions can be primed by events of which you are not even aware.

“In an experiment that became an instant classic, the psychologist John Bargh and his collaborators asked students at New York University—most aged eighteen to twenty-two—to assemble four-word sentences from a set of five words (for example, “finds he it yellow instantly”). For one group of students, half the scrambled sentences contained words associated with the elderly, such as Florida, forgetful, bald, gray, or wrinkle. When they had completed that task, the young participants were sent out to do another experiment in an office down the hall. That short walk was what the experiment was about. The researchers unobtrusively measured the time it took people to get from one end of the corridor to the other.”

“As Bargh had predicted, the young people who had fashioned a sentence from words with an elderly theme walked down the hallway significantly more slowly than the others. … walking slowly, which is associated with old age.”

“All this happens without any awareness. When they were questioned afterward, none of the students reported noticing that the words had had a common theme, and they all insisted that nothing they did after the first experiment could have been influenced by the words they had encountered. The idea of old age had not come to their conscious awareness, but their actions had changed nevertheless.”

[John A. Bargh, Mark Chen, and Lara Burrows, “Automaticity of Social Behavior: Direct Effects of Trait Construct and Stereotype Activation on Action,” Journal of Personality and Social Psychology 71 (1996): 230–44.]

Statistic p z OP
t(28)=2.86 0.008 2.66 0.76
t(28)=2.16 0.039 2.06 0.54

MOP = .65, Inflation = .35, R-Index = .30

4.3.  Reversed priming: Behavior primes cognitions

“The ideomotor link also works in reverse. A study conducted in a German university was the mirror image of the early experiment that Bargh and his colleagues had carried out in New York.”

“Students were asked to walk around a room for 5 minutes at a rate of 30 steps per minute, which was about one-third their normal pace. After this brief experience, the participants were much quicker to recognize words related to old age, such as forgetful, old, and lonely.”

“Reciprocal priming effects tend to produce a coherent reaction: if you were primed to think of old age, you would tend to act old, and acting old would reinforce the thought of old age.”

t(18)=2.10 0.050 1.96 0.50
t(35)=2.10 0.043 2.02 0.53
t(31)=2.50 0.018 2.37 0.66

MOP = .53, Inflation = .47, R-Index = .06

4.4.  Facial-feedback hypothesis (smiling makes you happy)

“Reciprocal links are common in the associative network. For example, being amused tends to make you smile, and smiling tends to make you feel amused….”

“College students were asked to rate the humor of cartoons from Gary Larson’s The Far Side while holding a pencil in their mouth. Those who were “smiling” (without any awareness of doing so) found the cartoons funnier than did those who were “frowning.”

[“Inhibiting and Facilitating Conditions of the Human Smile: A Nonobtrusive Test of the Facial Feedback Hypothesis,” Journal of Personality and Social Psychology 54 (1988): 768–77.]

The authors used the more liberal and unconventional criterion of p < .05 (one-tailed), z = 1.65, to claim significance. Accordingly, we adjusted the R-Index analysis and used 1.65 as the criterion value.
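The effect of this adjustment on observed power can be illustrated with a short sketch. It uses the z-score of 1.83 listed for the first test in the table that follows; the variable names are illustrative.

```r
z.crit.one = qnorm(.95)    # about 1.645, criterion for p < .05 (one-tailed)
z.crit.two = qnorm(.975)   # about 1.96, criterion for p < .05 (two-tailed)

z = 1.83                         # z-score of the reported test
op.one = pnorm(z, z.crit.one)    # observed power with the one-tailed criterion
op.two = pnorm(z, z.crit.two)    # observed power with the conventional criterion

round(c(one.tailed = op.one, two.tailed = op.two), 2)
```

With the conventional two-tailed criterion the same result would not count as significant, and its observed power would fall below 50%.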

t(89)=1.85 0.034 1.83 0.57
t(75)=1.78 0.034 1.83 0.57

MOP = .57, Inflation = .43, R-Index = .14

These results could not be replicated in a large replication effort with 17 independent labs. Not a single lab produced a significant result and even a combined analysis failed to show any evidence for the effect.

4.5. Automatic Facial Responses

In another experiment, people whose face was shaped into a frown (by squeezing their eyebrows together) reported an enhanced emotional response to upsetting pictures—starving children, people arguing, maimed accident victims.

[Ulf Dimberg, Monika Thunberg, and Sara Grunedal, “Facial Reactions to Emotional Stimuli: Automatically Controlled Emotional Responses,” Cognition and Emotion, 16 (2002): 449–71.]

The description in the book does not match any of the three studies reported in this article. The first two studies examined facial muscle movements in response to pictures of facial expressions (smiling or frowning faces).  The third study used emotional pictures of snakes and flowers. We might consider the snake pictures as being equivalent to pictures of starving children or maimed accident victims.  Participants were also asked to frown or to smile while looking at the pictures. However, the dependent variable was not how they felt in response to pictures of snakes, but rather how their facial muscles changed.  Aside from a strong effect of instructions, the study also found that the emotional picture had an automatic effect on facial muscles.  Participants frowned more when instructed to frown and looking at a snake picture than when instructed to frown and looking at a picture of a flower. “This response, however, was larger to snakes than to flowers as indicated by both the Stimulus factor, F(1, 47) = 6.66, p < .02, and the Stimulus × Interval factor, F(1, 47) = 4.30, p < .05.”  (p. 463). The evidence for smiling was stronger. “The zygomatic major muscle response was larger to flowers than to snakes, which was indicated by both the Stimulus factor, F(1, 47) = 18.03, p < .001, and the Stimulus × Interval factor, F(1, 47) = 16.78, p < .001.”  No measures of subjective experiences were included in this study.  Therefore, the results of this study provide no evidence for Kahneman’s claim in the book and the results of this study are not included in our analysis.

4.6.  Effects of Head-Movements on Persuasion

“Simple, common gestures can also unconsciously influence our thoughts and feelings.”

“In one demonstration, people were asked to listen to messages through new headphones. They were told that the purpose of the experiment was to test the quality of the audio equipment and were instructed to move their heads repeatedly to check for any distortions of sound. Half the participants were told to nod their head up and down while others were told to shake it side to side. The messages they heard were radio editorials.”

“Those who nodded (a yes gesture) tended to accept the message they heard, but those who shook their head tended to reject it. Again, there was no awareness, just a habitual connection between an attitude of rejection or acceptance and its common physical expression.”

F(2,66)=44.70 0.000 7.22 1.00

MOP = 1.00, Inflation = .00,  R-Index = 1.00

[Gary L. Wells and Richard E. Petty, “The Effects of Overt Head Movements on Persuasion: Compatibility and Incompatibility of Responses,” Basic and Applied Social Psychology, 1, (1980): 219–30.]

4.7   Location as Prime

“Our vote should not be affected by the location of the polling station, for example, but it is.”

“A study of voting patterns in precincts of Arizona in 2000 showed that the support for propositions to increase the funding of schools was significantly greater when the polling station was in a school than when it was in a nearby location.”

“A separate experiment showed that exposing people to images of classrooms and school lockers also increased the tendency of participants to support a school initiative. The effect of the images was larger than the difference between parents and other voters!”

[Jonah Berger, Marc Meredith, and S. Christian Wheeler, “Contextual Priming: Where People Vote Affects How They Vote,” PNAS 105 (2008): 8846–49.]

z = 2.10 0.036 2.10 0.56
p = .05 0.050 1.96 0.50

MOP = .53, Inflation = .47, R-Index = .06

4.8  Money Priming

“Reminders of money produce some troubling effects.”

“Participants in one experiment were shown a list of five words from which they were required to construct a four-word phrase that had a money theme (“high a salary desk paying” became “a high-paying salary”).”

“Other primes were much more subtle, including the presence of an irrelevant money-related object in the background, such as a stack of Monopoly money on a table, or a computer with a screen saver of dollar bills floating in water.”

“Money-primed people become more independent than they would be without the associative trigger. They persevered almost twice as long in trying to solve a very difficult problem before they asked the experimenter for help, a crisp demonstration of increased self-reliance.”

“Money-primed people are also more selfish: they were much less willing to spend time helping another student who pretended to be confused about an experimental task. When an experimenter clumsily dropped a bunch of pencils on the floor, the participants with money (unconsciously) on their mind picked up fewer pencils.”

“In another experiment in the series, participants were told that they would shortly have a get-acquainted conversation with another person and were asked to set up two chairs while the experimenter left to retrieve that person. Participants primed by money chose to stay much farther apart than their nonprimed peers (118 vs. 80 centimeters).”

“Money-primed undergraduates also showed a greater preference for being alone.”

[Kathleen D. Vohs, “The Psychological Consequences of Money,” Science 314 (2006): 1154–56.]

F(2,49)=3.73 0.031 2.16 0.58
t(35)=2.03 0.050 1.96 0.50
t(37)=2.06 0.046 1.99 0.51
t(42)=2.13 0.039 2.06 0.54
F(2,32)=4.34 0.021 2.30 0.63
t(38)=2.13 0.040 2.06 0.54
t(33)=2.37 0.024 2.26 0.62
F(2,58)=4.04 0.023 2.28 0.62
chi^2(2)=10.10 0.006 2.73 0.78

MOP = .58, Inflation = .42, R-Index = .16

4.9  Death Priming

“The evidence of priming studies suggests that reminding people of their mortality increases the appeal of authoritarian ideas, which may become reassuring in the context of the terror of death.”

The cited article does not directly examine this question.  The abstract states that “three experiments were conducted to test the hypothesis, derived from terror management theory, that reminding people of their mortality increases attraction to those who consensually validate their beliefs and decreases attraction to those who threaten their beliefs” (p. 308).  Study 2 found no general effect of death priming. Rather, the effect was qualified by authoritarianism. “Mortality salience enhanced the rejection of dissimilar others in Study 2 only among high authoritarian subjects” (p. 314), based on a three-way interaction with F(1,145) = 4.08, p = .045.  We used the three-way interaction for the computation of the R-Index.  Study 1 reported opposite effects for ratings of Christian targets, t(44) = 2.18, p = .034 and Jewish targets, t(44)= 2.08, p = .043. As these tests are dependent, only one test could be used, and we chose the slightly stronger result.  Similarly, Study 3 reported significantly more liking of a positive interviewee and less liking of a negative interviewee, t(51) = 2.02, p = .049 and t(49) = 2.42, p = .019, respectively. We chose the stronger effect.

[Jeff Greenberg et al., “Evidence for Terror Management Theory II: The Effect of Mortality Salience on Reactions to Those Who Threaten or Bolster the Cultural Worldview,” Journal of Personality and Social Psychology]

t(44)=2.18 0.035 2.11 0.56
F(1,145)=4.08 0.045 2.00 0.52
t(49)=2.42 0.019 2.34 0.65

MOP = .56, Inflation = .44, R-Index = .12

4.10  The “Lady Macbeth Effect”

“For example, consider the ambiguous word fragments W_ _ H and S_ _ P. People who were recently asked to think of an action of which they are ashamed are more likely to complete those fragments as WASH and SOAP and less likely to see WISH and SOUP.”

“Furthermore, merely thinking about stabbing a coworker in the back leaves people more inclined to buy soap, disinfectant, or detergent than batteries, juice, or candy bars. Feeling that one’s soul is stained appears to trigger a desire to cleanse one’s body, an impulse that has been dubbed the “Lady Macbeth effect.”

[Chen-Bo Zhong and Katie Liljenquist, “Washing Away Your Sins: Threatened Morality and Physical Cleansing,” Science 313 (2006): 1451–52.]

F(1,58)=4.26 0.044 2.02 0.52
F(1,25)=6.99 0.014 2.46 0.69

MOP = .61, Inflation = .39, R-Index = .22

The article reports two more studies that are not explicitly mentioned, but are used as empirical support for the Lady Macbeth effect. As the results of these studies were similar to those in the mentioned studies, including these tests in our analysis does not alter the conclusions.

chi^2(1)=4.57 0.033 2.14 0.57
chi^2(1)=5.02 0.025 2.24 0.61

MOP = .59, Inflation = .41, R-Index = .18

4.11  Modality Specificity of the “Lady Macbeth Effect”

“Participants in an experiment were induced to “lie” to an imaginary person, either on the phone or in e-mail. In a subsequent test of the desirability of various products, people who had lied on the phone preferred mouthwash over soap, and those who had lied in e-mail preferred soap to mouthwash.”

[Spike Lee and Norbert Schwarz, “Dirty Hands and Dirty Mouths: Embodiment of the Moral-Purity Metaphor Is Specific to the Motor Modality Involved in Moral Transgression,” Psychological Science 21 (2010): 1423–25.]

The results are presented as significant with a one-sided t-test. “As shown in Figure 1a, participants evaluated mouthwash more positively after lying in a voice mail (M = 0.21, SD = 0.72) than after lying in an e-mail (M = –0.26, SD = 0.94), F(1, 81) = 2.93, p = .03 (one-tailed), d = 0.55 (simple main effect), but evaluated hand sanitizer more positively after lying in an e-mail (M = 0.31, SD = 0.76) than after lying in a voice mail (M = –0.12, SD = 0.86), F(1, 81) = 3.25, p = .04 (one-tailed), d = 0.53 (simple main effect).”  We adjusted the significance criterion for the R-Index accordingly.

F(1,81)=2.93 0.045 1.69 0.52
F(1,81)=3.25 0.038 1.78 0.55

MOP = .54, Inflation = .46, R-Index = .08

4.12   Eyes on You

“On the first week of the experiment (which you can see at the bottom of the figure), two wide-open eyes stare at the coffee or tea drinkers, whose average contribution was 70 pence per liter of milk. On week 2, the poster shows flowers and average contributions drop to about 15 pence. The trend continues. On average, the users of the kitchen contributed almost three times as much in ’eye weeks’ as they did in ’flower weeks.’ ”

[Melissa Bateson, Daniel Nettle, and Gilbert Roberts, “Cues of Being Watched Enhance Cooperation in a Real-World Setting,” Biology Letters 2 (2006): 412–14.]

F(1,7)=11.55 0.011 2.53 0.72

MOP = .72, Inflation = .28, R-Index = .44

Combined Analysis

We then combined the results from the 31 studies mentioned above.  While the R-Index for small sets of studies may underestimate replicability, the R-Index for a large set of studies is more accurate.  Median observed power for all 31 studies is only 57%. It is incredible that 31 studies with 57% power could produce 100% significant results (Schimmack, 2012). Thus, there is strong evidence that the studies provide an overly optimistic image of the robustness of social priming effects.  Moreover, median observed power overestimates true power if studies were selected to be significant. After correcting for inflation, the R-Index is well below 50%.  This suggests that the studies have low replicability. Moreover, it is possible that some of the reported results are actually false positive results.  Just like the large-scale replication of the facial feedback studies failed to provide any support for the original findings, other studies may fail to show any effects in large replication projects. As a result, readers of “Thinking Fast and Slow” should be skeptical about the reported results and they should disregard Kahneman’s statement that “you have no choice but to accept that the major conclusions of these studies are true.”  Our analysis actually leads to the opposite conclusion. “You should not accept any of the conclusions of these studies as true.”

k = 31,  MOP = .57, Inflation = .43, R-Index = .14,  Grade: F for Fail

[Figure: Powergraph of the studies cited in Chapter 4]

Schimmack and Brunner (2015) developed an alternative method for the estimation of replicability.  This method takes into account that power can vary across studies. It also provides 95% confidence intervals for the replicability estimate.  The results of this method are presented in the Figure above. The replicability estimate is similar to the R-Index, with 14% replicability.  However, due to the small set of studies, the 95% confidence interval is wide and includes values above 50%. This does not mean that we can trust the published results, but it does suggest that some of the published results might be replicable in larger replication studies with more power to detect small effects.  At the same time, the graph shows clear evidence for a selection effect.  That is, published studies in these articles do not provide a representative picture of all the studies that were conducted.  The powergraph shows that there should have been a lot more non-significant results than were reported in the published articles.  The selective reporting of studies that worked is at the core of the replicability crisis in social psychology (Sterling, 1959; Sterling et al., 1995; Schimmack, 2012).  To clean up their act and to regain trust in published results, social psychologists have to conduct studies with larger samples that have more than 50% power (Tversky & Kahneman, 1971) and they have to stop reporting only significant results.  We can only hope that social psychologists will learn from the train wreck of social priming research and improve their research practices.

How did Diederik Stapel Create Fake Results? A forensic analysis of “From Seeing to Being: Subliminal Social Comparisons Affect Implicit and Explicit Self-Evaluations”

Diederik Stapel represents everything that has gone wrong in experimental social psychology.  Until 2011, he was seen as a successful scientist who made important contributions to the literature on social priming.  In the article “From Seeing to Being: Subliminal Social Comparisons Affect Implicit and Explicit Self-Evaluations” he presented 8 studies that showed that social comparisons can occur in response to stimuli that were presented without awareness (subliminally).  The results were published in the top journal of social psychology published by the American Psychological Association (APA) and APA published a press-release for the general public about this work.
In 2011, an investigation into Diederik Stapel’s research practices revealed scientific fraud, which resulted in over 50 retractions (Retraction Watch), including the article on unconscious social comparisons (Retraction Notice).  In a book, Diederik Stapel told his story about his motives and practices, but the book is not detailed enough to explain how particular datasets were fabricated.  All we know is that he used a number of different methods that range from making up datasets to the use of questionable research practices that increase the chance of producing a significant result.  These practices are widely used and are not considered scientific fraud, although the end result is the same. Published results no longer provide credible empirical evidence for the claims made in a published article.
I had two hypotheses. First, the data could be entirely made up. When researchers make up fake data they are likely to overestimate the real effect sizes and produce data that show the predicted pattern much more clearly than real data would. In this case, bias tests would not show a problem with the data.  The only evidence that the data are fake would be that the evidence is stronger than in other studies that relied on real data.
In contrast, a researcher who starts with real data and then uses questionable practices is likely to use as few dishonest practices as possible because this makes it easier to justify the questionable decisions.  For example, removing 10% of data may seem justified, especially if some rationale for exclusion can be found.  However, removing 60% of data cannot be justified.  The researcher will need to use these practices to produce the desired outcome, namely a p-value below .05 (or at least very close to .05).  Because further use of questionable practices is unnecessary and harder to justify, the researcher will stop before producing stronger evidence.  As a result, we would expect a large number of just significant results.
There are two bias tests that detect the latter form of fabricating significant results by means of questionable statistical methods: the Replicability-Index (R-Index) and the Test of Insufficient Variance (TIVA).   If Stapel used questionable statistical practices to produce just significant results, R-Index and TIVA would show evidence of bias.
The article reported 8 studies. The table shows the key finding of each study.
Study Statistic p z OP
1 F(1,28)=4.47 0.044 2.02 0.52
2A F(1,38)=4.51 0.040 2.05 0.54
2B F(1,32)=4.20 0.049 1.97 0.50
2C F(1,38)=4.13 0.049 1.97 0.50
3 F(1,42)=4.46 0.041 2.05 0.53
4 F(2,49)=3.61 0.034 2.11 0.56
5 F(1,29)=7.04 0.013 2.49 0.70
6 F(1,55)=3.90 0.053 1.93 0.49
All results were interpreted as evidence for an effect and the p-value for Study 6 was reported as p = .05.
All p-values are below .053 but greater than .01.  This is an unlikely outcome because sampling error should produce more variability in p-values.  TIVA examines whether there is insufficient variability.  First, p-values are converted into z-scores.  The variance of z-scores due to sampling error alone is expected to be approximately 1.  However, the observed variance is only Var(z) = 0.032.  A chi-square test shows that this observed variance is unlikely to occur by chance alone,  p = .00035. We would expect such an extremely small variability or even less variability in only 1 out of 2857 sets of studies by chance alone.
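The TIVA steps above can be sketched in R as follows. This is a hedged illustration, not the original analysis script: it assumes the chi-square form of the test, in which the observed variance multiplied by k - 1 is compared with a chi-square distribution with k - 1 degrees of freedom, and it uses the rounded z-scores from the table, so the exact p-value need not reproduce the value reported in the text.

```r
# Sketch of the Test of Insufficient Variance (TIVA).
# z-scores are the rounded values from the table above.
z = c(2.02, 2.05, 1.97, 1.97, 2.05, 2.11, 2.49, 1.93)
k = length(z)

var.z = var(z)                      # observed variance of the z-scores
chi2 = var.z * (k - 1)              # test statistic; expected variance is 1
p.tiva = pchisq(chi2, df = k - 1)   # left tail: this little variance or less

round(var.z, 3)   # 0.032
p.tiva
```

The left tail of the chi-square distribution is used because the question is whether the variance is too small, not too large.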
The last column transforms z-scores into a measure of observed power. Observed power is an estimate of the probability of obtaining a significant result under the assumption that the observed effect size matches the population effect size.  These estimates are influenced by sampling error.  To get a more reliable estimate of the probability of a successful outcome, the R-Index uses the median. The median is 53%.  It is unlikely that a set of 8 studies with a 53% chance of obtaining a significant result produced significant results in all studies.  This finding shows that the reported success rate is not credible. To make matters worse, the probability of obtaining a significant result is inflated when a set of studies contains too many significant results.  To correct for this bias, the R-Index computes the inflation rate.  With 53% probability of success and 100% success rate, the inflation rate is 47%.  To correct for inflation, the inflation rate is subtracted from median observed probability, which yields an R-Index of 53% – 47% = 6%.  Based on this value, it is extremely unlikely that a researcher would obtain a significant result if they actually replicated the original studies exactly.  The published results show that Stapel could not have produced these results without the help of questionable methods, which also means nobody else can reproduce these results.
In conclusion, bias tests suggest that Stapel actually collected data and failed to find supporting evidence for his hypotheses.  He then used questionable practices until the results were statistically significant.  It seems unlikely that he outright faked these data and intentionally produced a p-value of .053 and reported it as p = .05.  However, statistical analysis can only provide suggestive evidence and only Stapel knows what he did to get these results.