Bill von Hippel and Ulrich Schimmack discuss Bill’s Replicability Index

Background:

We have never met in person or interacted professionally before we embarked on this joint project. This blog emerged from a correspondence that was sparked by one of Uli’s posts in the Replicability-Index blog. We have carried the conversational nature of that correspondence over to the blog itself.

Bill: Uli, before we dive into the topic at hand, can you provide a super brief explanation of how the Replicability Index works? Just a few sentences for those who might be new to your blog and don’t understand how one can assess the replicability of the findings from a single paper.

Uli: The replicability of a finding is determined by the true power of a study, and true power depends on the sample size and the population effect size. We have to estimate replicability because the population effect size is unknown, but studies with higher power are more likely to produce smaller p-values. We can convert the p-values into a measure of observed power. For a single statistical test this estimate is extremely noisy, but it is the best guess we can make. So, a result with a p-value of .05 (50% observed power) is less likely to replicate than a p-value of .005 (80% power). A value of 50% may still look good, but observed power is inflated if we condition on significance. To correct for this inflation, the R-Index takes the difference between the success rate and observed power. For a single test, the success rate is 1 (100%) because a significant result was observed. This means that observed power of 50%, produces an R-Index of 50% – (100% – 50%) = 0. In contrast, 80% power, still produces an R-Index of 80% – (100% – 80%) = 60%. The main problem is that observed p-values are very variable. It is therefore better to use several p-values and to compute the R-Index based on average power. For larger sets of studies, a more sophisticated method called z-curve can produce actual estimates of power. It also can be used to estimate the false positive risk. Sorry, if this is not super brief.

Bill: That wasn’t super brief, but it was super informative. It does raise a worry in me, however. Essentially, your formula stipulates that every p value of .05 is inherently unreliable. Do we have any empirical evidence that the true replicability of p = .05 is functionally zero?

Uli:  The inference is probabilistic. Sometimes p = .05 will occur with high power (80%). However, empirically we know that p = .05 more often occurs with low power. The Open Science Collaboration project showed that results with p > .01 rarely replicated, whereas results with p < .005 replicated more frequently. Thus, in the absence of other information, it is rational to bet on replication failures when the p-value is just or marginally significant.

Bill: Good to know. I’ll return to this “absence of other information” issue later in our blog, but in the meantime, back to our story…I was familiar with your work prior to this year, as Steve Lindsay kept the editorial staff at Psychological Science up to date with your evaluations of the journal. But your work was made much more personally relevant on January 19, when you wrote a blog on “Personalized p-values for social/personality psychologists.”

Initially, I was curious if I would be on the list, hoping you had taken the time to evaluate my work so that I could see how I was doing. Your list was ordered from the highest replicability to lowest, so when I hadn’t seen my name by the half-way point, my curiosity changed to trepidation. Alas, there I was – sitting very near the bottom of your list with one of the lowest replicability indices of all the social psychologist’s you evaluated.

I was aghast. I had always thought we maintained good data practices in our lab: We set a desired N at the outset and never analyzed the data until we were done (having reached our N or occasionally run out of participants); we followed up any unexpected findings to be sure they replicated before reporting them, etc. But then I thought about the way we used to run the lab before the replication crisis:

  1. We never reported experiments that failed to support our hypotheses, but rather tossed them in the file drawer and tried again.
  2. When an effect had a p value between .10 and the .05 cut-off, we tried various covariates/control variables to see if they would push our effect over that magical line. Of course we reported the covariates, but we never reported their ad-hoc nature – we simply noted that we included them.
  3. We typically ran studies that were underpowered by today’s standards, which meant that the effects we found were bouncy and could easily be false positives.
  4. When we found an effect on one set of measures and not another, sometimes we didn’t report the measures that didn’t work.

The upshot of these reflections was that I emailed you to get the details on my individual papers to see where the problems were (which led to my next realization; I doubt I would have bothered contacting you if my numbers had been better). So here’s my first question: How many people have contacted you about this particular blog and is there any evidence that people are taking it seriously?

Uli:  I have been working on replicability for 10 years now. The general response to my work is to ignore it with the justification that it is not peer-reviewed. I also recall only two requests. One to evaluate a department and one to evaluate an individual. However, the R-Index analysis is easy to do and Mickey Inzlicht published a blog post about his self-analysis. I don’t know how many researchers have evaluated their work in private. It is harder to evaluate how many people take my work seriously. The main indicator of impact is the number of views of my blogs which has increased from 50,000 in 2015 to 250,000 in 2020. The publication of the z-curve package for R also has generated interest among researchers to conduct their own analyses.

Bill: That’s super impressive. I don’t think my entire body of research has been viewed anywhere near 250K times.

OK, once again, back to our story. When you sent me the data file on my papers, initially I was unhappy that you only used a subset of my empirical articles (24 of the 70 or so empirical papers I’ve published) and that your machine coding had introduced a bit of error into the process. But we decided to turn this into a study of sorts, so we focused on those 24 papers and the differences that would emerge as a function of machine vs. hand-coding and as a function of how many stats we pulled out of each experiment (all the focal statistics vs. just one stat for each experiment). Was that process useful for you? If so, what did you learn from it?

Uli: First of all, I was pleased that you engaged with the results. More important, I was also curious how the results would compare to hand-coding. I had done some comparisons for other social psychologists and I had some confidence in the results to post them, but I am aware that the method is not flawless and can produce misleading results in individual cases. I am also aware that my own hand-coding can be biased. So, for you to offer to do your own coding was a fantastic opportunity to examine the validity of my results.

Bill: Great. I’m still a little unsure what I’ve learned from this particular proctology exam, so let’s see what we can figure out here. If you’ll humor me, let’s start with our paper that has the lowest replicability index in your subsample – no matter which we way we calculate it, we find less than a 50% chance that it will replicate. It was published in 2005 in PSPB and took an evolutionary approach on grandparenting. Setting aside the hypothesis, the relevant methods were as follows:

  1. We recruited all of the participants who were available that year in introductory psychology, so our N was large, but determined by external constraints.
  2. We replicated prior findings that served as the basis of our proposed effect.
  3. The test of our new hypothesis yielded a marginally significant interaction (F(1, 412) = 2.85, p < .10). In decomposing the interaction, we found a simple effect where it was predicted (F(1, 276) = 5.92, p < .02) and no simple effect where it wasn’t predicted (F < 1, ns. – *apologies for the imprecise reporting practices).

Given that: 1) we didn’t exclude any data (a poor practice we sometimes engaged in by reporting some measures and not others, but not in this paper), 2) we didn’t include any ad-hoc control variables (a poor practice we sometimes engaged in, but not in this paper), 3) we didn’t run any failed studies that were tossed out (a poor practice we regularly engaged in, but not in this paper), and 4) we reported the a priori test of our hypothesis exactly as planned…what are we to conclude from the low replicability index? Is the only lesson here that marginally significant interactions are highly unlikely to replicate? What advice would you have given me in 2004, if I had shown you these data and said I wanted to write them up?

Uli: There is a lot of confusion about research methods, the need for preregistration, and the proper interpretation of results. First, there is nothing wrong with the way you conducted the study. The problems arise when the results are interpreted as a successful replication of prior studies. Here is why. First, we do not know whether prior studies used questionable research practices and reported inflated effect sizes. Second, the new findings are reported without information about effect sizes. What we really would like to know is the confidence interval around the predicted interaction effect, which would be the difference in the effect sizes between the two conditions. With a p-value greater than .05, we know that the 95%CI includes a value of 0. So, we cannot reject the hypotheses that the two conditions differ at that level of confidence. We can increase uncertainty in the conclusion by using a 90% or 80% confidence interval, but we still would want to know what effect sizes we can reject. It would also be important to specify what effect sizes would be considered too small to warrant a theory that predicts this interaction effect. Finally, the results suggest that the sample size of about 400 participants was still too small to have good power to detect and replicate the effect. A conclusive study would require a larger sample.

Bill: Hmm, very interesting. But let me clarify one thing before we go on. In this study, the replication of prior effects that I mentioned wasn’t the marginal interaction that yielded the really low replicability index. Rather, it was a separate main effect, whereby participants felt closest to their mother’s mother, next to their mother’s father, next to their father’s mother, and last to their father’s father. The pairwise comparisons were as follows: “participants felt closer to mothers’ mothers than mothers’ fathers, F(1,464) = 35.88, p < .001, closer to mothers’ fathers than fathers’ mothers, F(1, 424) = 3.96, p < .05, and closer to fathers’ mothers than fathers’ fathers, F(1, 417) = 4.88, p < .03.”

We were trying to build on that prior effect by explaining the difference in feelings toward father’s mothers and mothers’ fathers, and that’s where we found the marginal interaction (which emerged as a function of a third factor that we had hypothesized would moderate the main effect).

I know it’ll slow things down a bit, but I’m inclined to take your advice and rerun the study with a larger sample, as you’ve got me wondering whether this marginal interaction and simple effect are just random junk or meaningful. We could run the study with one or two thousand people on Prolific pretty cheaply, as it only involves a few questions.

Shall we give it a try before we go on? In the spirit of both of us trying to learn something from this conversation, you could let me know what sample size would satisfy you as giving us adequate power to attempt a replication of the simple effect that I’ve highlighted in red above. I suspect that the sample size required to have adequate power for a replication of the marginal interaction would be too expensive, but I believe a study that is sufficiently powered to detect that simple effect will reveal an interaction of at least the magnitude we found in that paper (as I still believe in the hypothesis we were testing).

If that suits you, I’m happy to post this opener on your blog and then return in a few weeks with the results of the replication effort and the goal of completing our conversation.

Uli:  Testing our different intuitions about this particular finding with empirical data is definitely interesting, but I am a bit puzzled about the direction this discussion has taken. It is surely interesting to see whether this particular finding is real and can be replicated. Let’s assume for the moment that it does. This unfortunately, increases the chances that some of the other studies in the z-curve are even less likely to be replicated because there is clear evidence of selection bias and a low probability of replication. Think about it as an urn with 9 red and 1 green marble. Red ones do not replicate and green ones do replicate. After we pick the green marble on the first try, there are only red marbles left.

One of the biggest open questions is what researchers actually did to get too many significant results. We have a few accounts of studies with non-significant results that were dropped and anonymous surveys show that a variety of questionable research practices were used. Even though these practices are different from fraud and may have occurred without intent, researchers have been very reluctant to talk about the mistakes they made in the past. Carney walked away from power posing by admitting to the use of statistical shortcuts. I wonder whether you can tell us a bit more about the practices that led to the low EDR estimate for your focal tests. I know it is a big ask, but I also know that young social psychologists would welcome open disclosure of past practices. As Mickey Inzlicht always tells me “Hate the sin. Love the sinner.” As my own z-curve shows, I also have a mediocre z-curve and I am currently in the process of examining my past articles to see which ones I still believe and which ones I no longer believe.

Bill: Fair question Uli – I’ve made more mistakes than I care to remember! But (at least until your blog post) I’ve comforted myself in the belief that peer-review corrected most of them and that the work I’ve published is pretty solid. So forgive me for banging on for so long, but I have a two-part answer to your question. Part 1 refers back to your statement above, about your work being “in the absence of other information”, and also incorporates your statement above about red and green marbles in an urn. And Part 2 builds on Part 1 by digging through my studies with low Replicability indices and letting you know whether (and if so where) I think they were problematic.

Part 1: “In the absence of other information” is a really important caveat. I understand that it’s the basis of your statistical approach, but of course research isn’t conducted in the absence of other information. In my own case, some of my hypotheses were just hunches about the world, based on observations or possible links between other ideas. I have relatively little faith in these hypotheses and have abandoned them frequently in the face of contrary or inconsistent evidence. But some of my hypotheses are grounded in a substantial literature or prior theorizing that strike me as rock solid. The Darwinian Grandparenting paper is just such an example, and thus it seems like a perfect starting point. The logic is so straightforward and sensible that I’d be very surprised if it’s not true. As a consequence, despite the weak statistical support for it, I’m putting my money on it to replicate (and it’s just self-report, so super easy to conduct a replication online).

And this line of reasoning leads me to dispute your “red and green marbles in the urn” metaphor. Your procedure doesn’t really tell us how many marbles are in the urn of these two colors. Rather, your procedure makes a guess about the contents of the urn, and that guess intentionally ignores all other information. Thus, I’d argue that a successful or failed replication of the grandparenting paper tells us nothing at all about the probability of replicating other papers I’ve published, as I’m bringing additional information to bear on the problem by including the theoretical strength of the claims being made in the paper. In other words, I believe your procedure has grossly underestimated the replicability of this paper by focusing only on the relevant statistics and ignoring the underlying theory. That doesn’t mean your procedure has no value, but it does mean that it’s going to make predictable mistakes.

Part 2: HereI’m going to focus on papers that I first authored, as I don’t think it’s appropriate for me to raise concerns about work that other people led without involving them in this conversation. With that plan in mind, let’s start at the bottom of the replication list you made for me in your collection of 24 papers and work our way up.

  1. Darwinian Grandparenting – discussed above and currently in motion to replicate (Laham, S. M., Gonsalkorale, K., & von Hippel, W. (2005). Darwinian grandparenting: Preferential investment in more certain kin. Personality and Social Psychology Bulletin, 31, 63-72.)
  2. The Chicken-Foot paper – I love this paper but would never conduct it that way now. The sample was way too small and the paper only allowed for a single behavioral DV, which was how strongly participants reacted when they were offered a chicken foot to eat. As a consequence, it was very under-powered. Although we ran that study twice, first as a pilot study in an undergraduate class with casual measurement and then in the lab with hidden cameras, and both studies “worked”, the first one was too informal and the second one was too small and would never be published today. Do I believe it would replicate? The effect itself is consistent with so many other findings that I continue to believe in it, but I would never place my money on replicating this particular empirical demonstration without a huge sample to beat down the inevitable noise (which must have worked in our favor the first time).

(von Hippel, W., & Gonsalkorale, K. (2005). “That is bloody revolting!” Inhibitory control of thoughts better left unsaid. Psychological Science, 16, 497-500.)

  • Stereotyping Against your Will – this paper was wildly underpowered, but I think it’s low R-index reflects the fact that in our final data set you asked me to choose just a single statistic for each experiment. In this study there were a few key findings with different measures and they all lined up as predicted, which gave me a lot more faith in it. Since its publication 20 years ago, we (and others) have found evidence consistent with it in a variety of different types of studies. I think we’ve failed to find the predicted effect in one or maybe two attempts (which ended up in the circular file, as all my failed studies did prior to the replication crisis), but all other efforts have been successful and are published. When we included all the key statistics from this paper in our Replicability analysis, it has an R-index of .79, which may be a better reflection of the reliability of the results.

Important caveat: With all that said, the original data collection included three or four different measures of stereotyping, only one of which showed the predicted age effect. I never reported the other measures, as the goal of the paper was to see if inhibition would mediate age differences in stereotyping and prejudice. In retrospect that’s clearly problematic, but at the time it seemed perfectly sensible, as I couldn’t mediate an effect that didn’t exist. On the positive side, the experiment included only two measures of prejudice, and both are reported in the paper.

(von Hippel, W., Silver, L. A., & Lynch, M. E. (2000). Stereotyping against your will: The role of inhibitory ability in stereotyping and prejudice among the elderly. Personality and Social Psychology Bulletin, 26, 523-532.)

  • Inhibitory effect of schematic processing on perceptual encoding – given my argument above that your R-index makes more sense when we include all the focal stats from each experiment, I’ve now shifted over to the analysis you conducted on all of my papers, including all of the key stats that we pulled out by hand (ignoring only results with control variables, etc.). That analysis yields much stronger R-indices for most of my papers, but there are still quite a few that are problematic. Sadly, this paper is the second from the bottom on my larger list. I say sadly because it’s my dissertation. But…when I reflect back on it, I remember numerous experiments that failed. I probably ran two failed studies for each successful one. At the time, no one was interested in them, and it didn’t occur to me that I was engaging in poor practices when I threw them in the bin. The main conclusion I came to when I finished the project was that I didn’t want to work on it anymore as it seemed like I spent all my time struggling with methodological details trying to get the experiments to work. Maybe each successful study was the one that found just the right methods and materials (as I thought at the time), but in hindsight I suspect not. And clearly the evidentiary value for the effect is functionally zero if we collapse across all the studies I ran. With that said, the key finding followed from prior theory in a pretty straightforward manner and we later found evidence for the proposed mechanism (which we published in a follow-up paper*). I guess I’d conclude from all this that if other people have found the effect since then, I’d believe in it, but I can’t put any stock in my original empirical demonstration.

(von Hippel, W., Jonides, J., Hilton, J. L., & Narayan, S. (1993). Inhibitory effect of schematic processing on perceptual encoding. Journal of Personality and Social Psychology, 64, 921-935.

*von Hippel, W., & Hawkins, C. (1994). Stimulus exposure time and perceptual memory. Perception and Psychophysics, 56, 525-535.)

  • The Linguistic Intergroup Bias (LIB) as an Indicator of Prejudice. This is the only other paper on which I was first author that gets an R-index of less than .5 when you include all the focal stats in the analysis. I have no doubt it’s because, like all of my work at the time, it was wildly under-powered and the effects weren’t very strong. Nonetheless, we’ve used the LIB many times since, and although we haven’t found the predicted results every time, I believe it works pretty reliably. Of course, I could easily be wrong here, so I’d be very interested if any readers of this blog have conducted studies using the LIB as an indicator of prejudice, and if so, whether it yielded the predicted results.

(von Hippel, W., Sekaquaptewa, D., & Vargas, P. (1997). The Linguistic Intergroup Bias as an implicit indicator of prejudice. Journal of Experimental Social Psychology, 33, 490-509.)

  • All my articles published in the last ten years with a low Rindex – Rather than continuing to torture readers with the details of each study, in this last section I’ve gone back and looked at all my papers with an R-index less than .70 based on all the key statistics (not just a single stat for each experiment) that were published in the last 10 years. This exercise yields 9 out of 25 empirical papers with an R-index ranging from .30 to .62 (with five other papers in which an R-index apparently couldn’t be calculated). The evidentiary value of these 9 papers is clearly in doubt, despite the fact that they were published at a time when we should have known better. So what’s going on here? Six of them were conducted on special samples that are expensive to run or incredibly difficult to recruit (e.g., people who have suffered a stroke, people who inject drugs, studies in an fMRI scanner), and as a result they are all underpowered. Perhaps we shouldn’t be doing that work, as we don’t have enough funding in our lab to run the kind of sample sizes necessary to have confidence in the small effects that often emerge. Or perhaps we should publish the papers anyway, and let readers decide if the effects are sufficiently meaningful to be worthy of further investigation. I’d be curious to hear your thoughts on this Uli. Of the remaining three papers, one of them reports all four experiments we ran prior to publication, but since then has proven difficult to replicate and I have my doubts about it (let’s call that a clear predictive win for the R-Index). Another is well powered and largely descriptive without much hypothesis testing and I’m not sure an R-index makes sense for it. And the last one is underpowered (despite being run on undergraduates), so we clearly should have done better.

What do I conclude from this exercise? A consistent theme in our findings that have a low R-index is that they have small sample sizes and report small effects. Some of those probably reflect real findings, but others probably don’t. I suspect the single greatest threat to their validity (beyond the small samples sizes) was the fact that until very recently we never reported experiments that failed. In addition, sometimes we didn’t report measures we had gathered if they didn’t work out as planned and sometimes we added control variables into our equations in an ad-hoc manner. Failed experiments, measures that don’t work, and impactful ad-hoc controls are all common in science and reflect the fact that we learn what we’re doing as we go. But the capacity for other people to evaluate the work and its evidentiary value is heavily constrained when we don’t report those decisions. In retrospect, I deeply regret placing a greater emphasis on telling a clear story than on telling a transparent and complete story.

Has this been a wildly self-serving tour through the bowels of a social psychology lab whose R-index is in the toilet? Research on self-deception suggests you should probably decide for yourself.

Uli:  Thank you for your candid response. I think for researchers our age (not sure really how old you are) it will be easier to walk away from some articles published in the anything-goes days of psychological science because we still have time to publish some new and better work. As time goes on, it may become easier for everybody to acknowledge mistakes and become less defensive. I hope that your courage and our collaboration encourage more people to realize that the value of a researcher is not measured in terms of number of publications or citations. Research is like art and not every van Gogh is a masterpiece. We are lucky if we make at least one notable contribution to our science. So, losing a few papers to replication failures is normal. Let’s see what the results of the replication study will show.

Bill: I couldn’t agree with you more! (Except for the ‘courage’ part; I’m trepidatious as hell that my work won’t be taken seriously [or funded] anymore. But so be it.) I’ll be in touch as soon as we get ethics approval and run our replication study…

Leave a Reply