Estimating the Replicability of Results in ‘Infancy’

A common belief is that the first two years of life are the most important years of development (Cohn, 2011). This makes research on infants very important. At the same time, studying infants is difficult. One problem is that it is hard to recruit infants for research, which makes it difficult to reduce sampling error and to obtain replicable results. Noisy data also create an incentive to use questionable research practices that inflate effect sizes, because journals hardly ever publish non-significant results (Peterson, 2016). Even disciplines that are able to recruit larger samples of undergraduate students, like social psychology, have encountered replication failures, and a major replication effort suggested that only about a quarter of published results can be replicated (Open Science Collaboration, 2015). This raises concerns about the replicability of results published in Infancy.

Despite much talk about a replication crisis in psychology, infancy researchers seem to be unaware of problems with research practices. Editorials by Bell (2009), Colombo (2014), and Bremner (2019) celebrate quantitative indicators like submission rates and impact factors, but do not comment on the practices that are used to produce significant results. In a special editorial, Colombo (2017) introduces registered reports, which accept study ideas before data are collected and publish the results regardless of the outcome. However, he does not mention why such an initiative would be necessary (e.g., that standard articles rely on questionable research practices and that studies are only submitted when they produce a significant result). Bremner (2019) makes the interesting observation that “it is really rather easy to fail to obtain an effect with infants.” If this is the case and results are reported without selection for significance, articles in Infancy should report many non-significant results. This seems unlikely, given the general bias against non-significant results in psychology.

To examine the replicability of results published in Infancy, I conducted a z-curve analysis (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020). Z-curve uses the test statistics (t-values, F-values) reported in articles to estimate how replicable significant results are, how often researchers obtain significant results, how many false positive results are reported, and whether researchers use questionable research practices to inflate the percentage of significant results that are being reported.

I downloaded articles from 2000 to 2019 and used an R program to automatically extract test statistics. Figure 1 shows the z-curve of the 9,109 test statistics.
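For readers who want to try this on their own data, here is a minimal sketch in R. It assumes the CRAN zcurve package and uses a handful of made-up t-values; the actual step of scraping test statistics from article text is not shown, and in practice thousands of values are needed for stable estimates.

    # Minimal sketch: convert extracted t-values to z-scores and fit a z-curve model.
    # Assumes the CRAN 'zcurve' package; the t-values and degrees of freedom below
    # are made up for illustration only.
    library(zcurve)

    t_values <- c(2.10, 3.45, 1.98, 4.02, 0.85, 2.61)
    t_df     <- c(24, 30, 18, 40, 22, 35)

    # Two-tailed p-values from the t-distribution, then the corresponding z-scores
    p_values <- 2 * pt(abs(t_values), df = t_df, lower.tail = FALSE)
    z_scores <- qnorm(p_values / 2, lower.tail = FALSE)

    fit <- zcurve(z = z_scores)   # by default the model is fit to the significant z-scores
    summary(fit)                  # reports the ERR and EDR with confidence intervals
    plot(fit)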

First, visual inspection shows a steep cliff around z = 1.96, which corresponds to a two-tailed p-value of .05. The fact that there are many more just-significant results than just-non-significant results reveals that questionable research practices inflate the percentage of significant results. This impression is confirmed by comparing the observed discovery rate (ODR; the percentage of reported tests that are significant) of 64%, 95% CI = 63% to 65%, with the expected discovery rate (EDR; the percentage of significant results the z-curve model estimates researchers actually obtain) of 37%, 95% CI = 19% to 46%. The confidence intervals clearly do not overlap, indicating that questionable research practices inflate the observed discovery rate.
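The observed discovery rate itself requires no model; continuing the hypothetical example above, it is simply the share of extracted z-scores that cross the significance threshold:

    # Observed discovery rate: proportion of reported tests that are significant
    odr <- mean(z_scores > qnorm(0.975))   # qnorm(0.975) is roughly 1.96
    odr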

The expected replication rate (ERR) is 60%, 95% CI = 55% to 65%. This estimate implies that exact replications of the studies with significant results would produce 60% significant results. This is not a terrible success rate, but the estimate comes with several caveats. First, it is an average across all reported statistical tests. Some of these tests are manipulation checks that are expected to have strong effects; other tests are novel predictions that may have weaker effects. The replicability estimate for studies with just-significant results (z = 2 to 2.5) is only 35% (see the values below the x-axis in Figure 1).

These results are similar to estimates for social psychology, which has witnessed a string of replication failures in actual replication attempts. Based on the present results, I predict similar replication failures in infancy research when studies are actually replicated.

Given the questionable status of just-significant results, it is possible to exclude them from the z-curve analysis. Figure 2 shows the results when z-curve is fitted only to z-values greater than 2.8, which corresponds to a two-tailed p-value of .005.
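The correspondence between the two thresholds is easy to verify in R:

    # z-value for a two-tailed p-value of .005
    qnorm(1 - 0.005 / 2)                  # about 2.81
    # two-tailed p-value for z = 2.8
    2 * pnorm(2.8, lower.tail = FALSE)    # about .0051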

Questionable research practices are now revealed by the excess of just-significant results over what the model predicts. Given the uncertainty about these results, readers may want to focus on results with p-values below .005, which are more likely to be replicable.

The next figure shows the results when the ERR (black) and EDR (grey) are estimated separately for each year, using either all significant results (solid lines) or only z-scores greater than 2.8.

As the number of tests per year is relatively small, estimates are fairly noisy. Tests of time trends did not reveal any significant changes over time. Thus, there is no evidence that infancy researchers have changed their research practices in response to concerns about a replication crisis in psychology.
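For readers who want to see what such a trend test looks like, the sketch below regresses yearly ERR estimates on publication year; the numbers are placeholders, not the values plotted in the figure.

    # Hypothetical yearly ERR estimates (placeholders, not the values in the figure)
    years <- 2000:2019
    err   <- c(0.58, 0.62, 0.55, 0.60, 0.63, 0.57, 0.59, 0.61, 0.56, 0.60,
               0.62, 0.58, 0.57, 0.61, 0.59, 0.60, 0.63, 0.58, 0.62, 0.61)

    # Simple linear trend test: does the ERR change over time?
    trend <- lm(err ~ years)
    summary(trend)   # the coefficient for 'years' tests the time trend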

The results in Figures 1 and 2 also suggest that infancy research produces many false negative results; that is, the hypothesis is true, but studies have insufficient power to produce a significant result. This is consistent with long-standing concerns about low power in psychology (Cohen, 1962) and with Bremner’s (2019) observation that non-significant results are common, even if they are rarely reported. False negative results are a problem because they are sometimes falsely interpreted as evidence against a hypothesis. For infancy research to gain credibility, researchers need to change their research practices. First, they need to improve power by increasing the reliability of their measures, using within-subject designs whenever possible, or collaborating across labs to increase sample sizes (a quick illustration of the within-subject advantage follows below). Second, they need to report all results honestly, not only when studies are pre-registered or published as registered reports. Honest reporting of results is a fundamental aspect of science, and evidence of questionable research practices undermines the credibility of infancy research.
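As a footnote to the first recommendation, base R’s power.t.test can illustrate the sample-size advantage of within-subject designs, assuming a medium effect size of d = 0.5; the paired calculation treats d as the standardized difference score, so the actual advantage depends on how strongly the repeated measures are correlated.

    # Sample sizes for 80% power at alpha = .05, assuming d = 0.5
    power.t.test(delta = 0.5, sd = 1, power = 0.80, type = "two.sample")  # ~64 infants per group
    power.t.test(delta = 0.5, sd = 1, power = 0.80, type = "paired")      # ~34 infants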

2 thoughts on “Estimating the Replicability of Results in ‘Infancy’”

  1. First, thanks for your wonderful blog and your excellent work on replication.

    I am a mid-career psychology researcher with a broad publication history. I am sure that if I measured my own replication index it wouldn’t be terrible, but it wouldn’t be good either. I can see things in my publication history that, in hindsight, I shouldn’t have submitted for publication without first replicating the experiment: I can think of at least one published student project that revealed predicted, yet still exciting, findings but was probably underpowered. The pressure to maintain output was too high to resist publication at that stage of my career.

    Now I see the error of my ways. What I wanted to ask is: what do you think I should do about those papers that are unlikely to replicate? Obviously, if the result is important, then I should now try to replicate it with a higher-powered, pre-registered experiment. However, the field has moved on and no one is interested in that topic now (probably due to poor replication within the whole topic). Should I ask the journal to retract that paper due to questionable research practices (albeit done in good faith at the time)? Or should I leave it as it is and just assume that anyone reading it will see the poor power of the experiment and judge for themselves that it has poor replicability?

    I am sure that I am not the only researcher in this position. Maybe you could write a blog post on what mid-career researchers who have seen the errors of their ways should do, not just about their future research, but also about their past research.

    1. Thank you for sharing your thoughts. I don’t think it is worthwhile to use resources for a high-powered replication study if the field has moved on. It is more important to use resources wisely for new, more promising research. If you are mid-career (after tenure), it is possible to focus on quality, but depending on your institution this could still mean a pay cut (or a lower raise); knowing that you are doing real science may be worth the trade-off.
