Raw! First Draft! Manuscript in Preparation for Meta-Psychology
Open Comments are welcome.
The f/utility of psychological research has been debated since psychology became an established discipline after the Second World War (Cohen, 1962, 1994; Lykken, 1968; Sterling, 1959; lots of Meehl). There have also been many proposals to improve psychological science. However, most articles published today follow the same old recipe that was established decades ago, a procedure that Gigerenzer (2018) called the significance-testing ritual.
Step 1 is to assign participants to experimental conditions.
Step 2 is to expose groups to different stimuli or interventions.
Step 3 is to examine whether the differences between means of the groups are statistically significant.
Step 4a: If Step 3 produces a p-value below .05, write up the results and submit to a journal.
Step 4b: If Step 3 produces a p-value above .05, forget about the study, and go back to Step 1.
This recipe produces a literature in which the empirical content of journal articles consists only of significant results that suggest the manipulation had an effect. As Sterling (1959) pointed out, this selective publishing of significant results essentially renders significance testing meaningless. The problem with this recipe became apparent when Bem (2011) published 9 successful demonstrations of a phenomenon that does not exist: mental time travel, where feelings about random future events seemed to cause behavior. If only successes are reported, significant results only show how motivated researchers are to collect data that support their beliefs.
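To make the consequence of Steps 1 to 4b concrete, here is a minimal simulation sketch. It is not taken from any of the cited papers. It assumes a true effect of zero, 20 participants per cell, and a known population SD of 1, so that a z-test can stand in for the usual t-test. The function name `simulate_ritual` and all parameter values are my own illustrative choices.

```python
import random
from statistics import NormalDist, mean

def simulate_ritual(true_d=0.0, n_per_cell=20, n_studies=2000, alpha=0.05, seed=1):
    """Run many two-group studies and 'publish' only the significant ones.

    Simplifying assumption: the population SD is known to be 1, so a z-test
    replaces the usual t-test.
    """
    rng = random.Random(seed)
    se = (2 / n_per_cell) ** 0.5                   # standard error of the mean difference
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    all_d, published_d = [], []
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_cell)]
        treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_cell)]
        d = mean(treated) - mean(control)          # SD = 1, so this is already standardized
        all_d.append(d)
        if abs(d) / se > z_crit:                   # Step 4a: write up; otherwise Step 4b
            published_d.append(d)
    return all_d, published_d, z_crit * se

all_d, published_d, min_publishable = simulate_ritual()
print(f"share of studies published:    {len(published_d) / len(all_d):.3f}")
print(f"mean |d| across all studies:   {mean(abs(d) for d in all_d):.2f}")
print(f"mean |d| in published studies: {mean(abs(d) for d in published_d):.2f}")
print(f"smallest publishable |d|:      {min_publishable:.2f}")
```

Under these assumptions no study can be published with an observed |d| below about .62, even though the true effect is zero. The published record therefore shows only large effects.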
I argue that the key problem in psychology is the specification of the null-hypothesis. The most common approach is to specify the null-hypothesis as the absence of an effect, that is, an effect size of zero. Cohen called this the nil-hypothesis. Even after an original study rejects the nil-hypothesis, follow-up studies (direct or conceptual replication studies) again specify the nil-hypothesis as the hypothesis that has to be rejected, although the original study already rejected it. I propose to abandon nil-hypothesis testing and to replace it with null-hypothesis testing where the null-hypothesis specifies effect sizes. Contrary to the common practice of starting with a test of the nil-hypothesis, I argue that original studies should start with testing large effect sizes. Subsequent studies should use information from the earlier studies to modify the null-hypothesis. This recipe can be considered a stepwise process of parameter estimation. The advantage of a stepwise approach is that direct parameter estimation requires large samples that are often impossible to obtain during the early stages of a research program. Moreover, parameter estimation may be wasteful when the ultimate conclusion is that an effect size is too small to be meaningful. I illustrate the approach with a simple between-subjects design that compares two groups. For mean differences, the most common effect size is the standardized mean difference (the mean difference when the dependent variable is standardized). Cohen suggested values of d = .2, .5, and .8 as values for small, medium, and large effect sizes, respectively.
The first test of a novel hypothesis (e.g., taking Daniel Lakens’ course on statistics improves understanding of statistics), starts with the assumption that the effect size is large (H0: |d| = .8).
The next step is to specify what value should be considered a meaningful deviation from this effect size. A reasonable value would be d = .5, which is only a moderate effect size. Another reasonable approach is to halve the starting effect size, d = .8/2 = .4. I use d = .4.
The third step is to conduct a power analysis for a mean difference of d = .4. This power analysis is not identical to a typical power analysis with H0: d = 0 and an effect size of d = .4 because the t-distribution is no longer symmetrical when it is centered over values other than zero (this may be a statistical reason for the choice of the nil-hypothesis). However, conceptually the power analysis does not differ. We are postulating a null-hypothesis of d = .8 and are willing to reject it when the population effect size is a meaningfully smaller effect size of d = .4 or less. With some trial and error, we find a sample size of N = 68 (n = 34 per cell). With this sample size, d-values below .4 occur only 5% of the time. Thus, we can reject the null-hypothesis of d = .8, if the study produces an effect size below .4.
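The trial-and-error search can be sketched with a normal approximation. This is not the exact noncentral-t calculation. It is a rough check that assumes the common large-sample standard error of d for two equal cells. The helper names `se_of_d` and `lower_critical_d` are illustrative.

```python
from statistics import NormalDist

def se_of_d(d, n_per_cell):
    """Large-sample standard error of Cohen's d for two equal-sized groups."""
    n_total = 2 * n_per_cell
    return (2 / n_per_cell + d ** 2 / (2 * n_total)) ** 0.5

def lower_critical_d(d_null, n_per_cell, alpha=0.05):
    """Observed d below this value rejects H0: d = d_null (one-sided test)."""
    z = NormalDist().inv_cdf(1 - alpha)
    return d_null - z * se_of_d(d_null, n_per_cell)

# Try a few cell sizes, as in the trial-and-error search described above.
for n in (20, 34, 50):
    print(f"n per cell = {n}: reject H0 (d = .8) if observed d < {lower_critical_d(0.8, n):.2f}")
```

With n = 34 per cell the approximation puts the critical value at about .39, close to the d = .4 criterion.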
The next step depends on the outcome of the first study. If the first study produced a result with an effect size estimate greater than .4, the null-hypothesis lives another day. Thus, the replication study is conducted with the same sample size as the original study (N = 68). The rationale is that we have good reason to believe that the effect size is large and it would be wasteful to conduct replication studies with much larger samples (e.g., 2.5 times larger than the original study, N = 170). It is also not necessary to use much larger samples to demonstrate that the original finding was obtained with questionable research practices. An honest replication study has high power to reject the null-hypothesis of d = .8, if the true effect size is only d = .2 or even closer to zero. This makes it easier to reveal the use of questionable research practices with actual replication studies. The benefits are obtained because the original study makes a strong claim that the effect size is large rather than merely claiming that the effect size is positive or negative without specifying an effect size.
If the original study produces a significant result with an effect size less than d = .4, the null-hypothesis is rejected. The new null-hypothesis is the point estimate of the study. Given a significant result, we know that this value is somewhere between 0 and .4. Let’s assume it is d = .25. This estimate comes with a two-sided 95% confidence interval ranging from d = -.23 to d = .74. The wide confidence interval shows that we can reject d = .8, but not a medium effect size of d = .5 or even a small effect in the opposite direction, d = -.2. Thus, we need to increase sample sizes in the next study to provide a meaningful test of the new null-hypothesis that the effect size is positive, but small (d = .25). We want to ensure that the effect size is indeed positive, d > 0, but weaker than a medium effect size, d = .5. Thus, we need to power the study to be able to reject the null-hypothesis (H0: d = .25) in both directions. This is achieved with a sample size of N = 256 (n = 128 per cell) and sampling error of .125. The 95% confidence interval centered over d = .25 ranges from 0 to .5. Thus, any observed d-value greater than .25 rejects the hypothesis that there is no effect and any value below .25 rejects the hypothesis of a medium effect size, d = .5.
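The two intervals in this step can be approximated in a few lines. The sketch assumes equal cell sizes and the simple standard error 2 / sqrt(N), which gives the sampling error of .125 at N = 256. It is a normal approximation, so the limits can differ from the quoted values in the second decimal. The function name `ci_for_d` is illustrative.

```python
from statistics import NormalDist

def ci_for_d(d, n_total, level=0.95):
    """Approximate confidence interval for d with equal cells, using SE = 2 / sqrt(N)."""
    se = 2 / n_total ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return d - z * se, d + z * se

for n_total in (68, 256):
    low, high = ci_for_d(0.25, n_total)
    print(f"N = {n_total}: d = .25, 95% CI from {low:.2f} to {high:.2f}")
```

At N = 68 the interval still contains zero and d = .5. At N = 256 its limits fall almost exactly on those two values, which is why that sample size allows a test in both directions.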
The next step depends again on the outcome of the study. If the observed effect size is d = .35 with a 95% confidence interval ranging from d = .11 to d = .60, the new question is whether the effect size is at least small, d = .2, or whether it is even moderate. We could meta-analyze the results of both studies, but as the second study is larger, it will have a stronger influence on the weighted average. In this case, the weighted average of d = .33 is very close to the estimate of the larger second study. Thus, I am using the estimate of Study 2 for the planning of the next study. With the null-hypothesis of d = .35, a sample size of N = 484 (n = 242 per cell) is required to have 95% power to find a significant result if the population effect size is d = .2 or less (the 90% confidence interval around d = .35 ranges from d = .20 to d = .50). Thus, if an effect size less than d = .2 is observed, it is possible to reject the hypothesis that there is at least a statistically small effect size of d = .2. In this case, researchers have to decide whether they want to invest in a much larger study to see whether there is a positive effect at all or whether they would rather abandon this line of research because the effect size is too small to be theoretically or practically meaningful. The estimation of the effect size makes it at least clear that any further studies with small samples are meaningless because they have insufficient power to demonstrate that a small effect exists. This can be a meaningful result in itself because researchers currently waste resources on studies that test small effects with small samples.
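The weighted average mentioned above is easy to verify. The sketch weights each estimate by its total sample size. With equal cells and SE = 2 / sqrt(N) this is equivalent to inverse-variance weighting. The function name `weighted_mean_d` is illustrative.

```python
def weighted_mean_d(studies):
    """Sample-size-weighted average of effect sizes; studies = [(d, n_total), ...]."""
    total_n = sum(n for _, n in studies)
    return sum(d * n for d, n in studies) / total_n

# Study 1: d = .25 with N = 68; Study 2: d = .35 with N = 256
studies = [(0.25, 68), (0.35, 256)]
print(round(weighted_mean_d(studies), 2))  # 0.33
```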
If the effect size in Study 2 is less than d = .25, researchers know (with a 5% error probability) that the effect size is less than d = .5. However, it is not clear whether there is a positive effect or not. Say, the observed effect size was d = .10 with a 95%CI ranging from d = -.08 to d = .28. This leaves open the possibility of no effect, but also a statistically small effect of d = .2. Researchers may find it worthwhile to pursue this research in the hope that the effect size is at least greater than d = .10, assuming a population effect size of d = .2. Adequate power is achieved with a sample size of N = 1,100 (n = 550 per cell). In this case, the 90% confidence interval around d = .2 ranges from d = .10 to d = .30. Thus, any value less than d = .10 rejects the hypothesis that the effect size is statistically small, d = .2, while any value greater than d = .30 would confirm that the effect size is at least a small effect size of d = .2.
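The sample sizes in the last three steps all follow the same logic: pick the distance between the null value and the value to be rejected, then find the N at which the confidence interval half-width equals that distance. Here is a rough sketch under the same simplifying assumption as before (equal cells, SE = 2 / sqrt(N), normal approximation). The function name `n_total_for_margin` is illustrative.

```python
import math
from statistics import NormalDist

def n_total_for_margin(margin, level):
    """Smallest even total N whose CI half-width is at most `margin` (SE = 2 / sqrt(N))."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    n = math.ceil((2 * z / margin) ** 2)
    return n + (n % 2)  # round up to an even number so both cells are equal

plans = [
    ("H0: d = .25, reject 0 or .5 (95% CI)", 0.25, 0.95),
    ("H0: d = .35, reject .20 (90% CI)", 0.15, 0.90),
    ("H0: d = .20, reject .10 (90% CI)", 0.10, 0.90),
]
for label, margin, level in plans:
    print(f"{label}: N = {n_total_for_margin(margin, level)}")
```

The approximation gives N = 246, 482, and 1,084. These are slightly below the rounded figures used in the text (256, 484, and 1,100), which err on the side of a little extra power.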
This new way of thinking about null-hypothesis testing requires some mental effort (it is still difficult for me). To illustrate it further, I used open data from the Many Labs project (Klein et al., 2014). I start with a strong and well-replicated effect, anchoring.
The first sample in the ML dataset is from Penn State U – Abington (‘abington’) with N = 84. Thus, the sample has good power to test the first hypothesis that d > .4, assuming an effect size of d = .8. The statistical test of the first anchoring effect (distance from New York to LA with 1,500 mile vs. 6,000 mile anchor) produced a standardized effect size of d = .98 with a 95%CI ranging from d = .52 to 1.44. The confidence interval includes a value of d = .8. Therefore the null-hypothesis cannot be rejected. Contrary to nil-hypothesis testing, however, this finding is highly informative and significant. It does suggest that anchoring is a strong effect.
As Study 1 was consistent with the null-hypothesis of a strong effect, Study 2 replicates the effect with the same sample size. To make this a conceptual replication study, I used the second anchoring question (anchoring2, population of Chicago with 200,000 vs. 6 million as anchor). The sample from Charles University, Prague, Czech Republic provided an equal sample size of N = 84. The study replicated the finding of Study 1, that the 95%CI includes a value of d = .8, 95%CI = .72 to 1.41.
To further examine the robustness of the effect, Study 3 used a different anchoring problem (height of Mt. Everest with 2,000 vs. 45,500 feet as anchors). To keep sample sizes similar, I used the UVA sample (N = 81). This time, the null-hypothesis was rejected with an even larger effect size, d = 1.47, 95%CI = 1.19 to 1.76.
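The decision rule applied to the three anchoring studies can be written down explicitly. The sketch only re-uses the confidence intervals reported above. The function name `compare_with_null` and the wording of the labels are my own.

```python
def compare_with_null(ci_low, ci_high, d_null=0.8):
    """Compare a confidence interval for d with the null value d_null."""
    if ci_low <= d_null <= ci_high:
        return "retain H0"
    return "reject H0 (smaller)" if ci_high < d_null else "reject H0 (larger)"

# 95% confidence intervals as reported for the three anchoring studies
anchoring = {
    "Study 1 (New York to LA)": (0.52, 1.44),
    "Study 2 (Chicago)": (0.72, 1.41),
    "Study 3 (Mt. Everest)": (1.19, 1.76),
}
for name, (low, high) in anchoring.items():
    print(f"{name}: {compare_with_null(low, high)}")
```

Studies 1 and 2 retain the null-hypothesis of a large effect. Study 3 rejects it in the direction of an even larger effect.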
Although additional replication studies can further examine the generalizability of the main effect, the three studies alone are sufficient to provide robust evidence for anchor effects, even with a modest total sample size of N = 249 participants. Researchers could therefore examine replicability and generalizability in the context of new research questions that explore boundary conditions, mediators, or moderators. More replication studies or replication studies with larger samples would be unnecessary.
For the second example, flag priming, I start again with the Penn State U – Abington sample (N = 84) to maintain good comparability. The effect size estimate for the flag prime is close to zero, d = .05. More important, the 95% confidence interval does not include d = .8, 95%CI = -.28 to .39. Thus, the null-hypothesis that flag priming is a strong effect is rejected. The results are so disappointing that even a moderate effect size is not included in the confidence interval. Thus, the only question is whether there could be a small effect size. If this is theoretically interesting, the study would have to be sufficiently powered to distinguish a small effect size from zero. Thus, the study could be powered to examine whether the effect size is at least d = .1, assuming an effect size of d = .2. The previous power analysis suggested that a sample of N = 1,100 participants is needed to test this hypothesis. I used the MTurk (N = 1,000) and osu (N = 107) samples to get this sample size.
The results showed a positive effect size of d = .12. Using traditional NHST, this finding rejects the nil-hypothesis, but allows for extremely small effect sizes close to zero, 95%CI = .0003 to .25. More important, the results do not reject the actual null-hypothesis that there is a small effect size d = .2, but also do not ensure that the effect size is greater than d = .10. Thus, the results remain inconclusive.
To make use of the large sample of Study 2, it is not necessary to increase the sample size again. Rather, a third study can be conducted with the same sample size, and the results of the two studies can be combined to test the null-hypothesis that d is at least d = .10. I used the Project Implicit sample, although it is a bit bigger (N = 1329).
Study 3 alone produced an effect size of d = .03, 95%CI = -.09 to .14. An analysis that combines data from all three samples produces an estimate of d = .02, 95%CI = -.06 to .10. These results clearly reject the null-hypothesis that d = .2, and they even suggest that d = .10 is unlikely. At this point, it seems reasonable to stop further study of this phenomenon, at least using the same paradigm. Although this program required over 2,000 participants, the results are conclusive and publishable with the conclusion that flag priming has negligible effects on ratings of political values. The ability to provide meaningful results arises from the specification of the null-hypothesis with an effect size rather than the nil-hypothesis that can only test direction of effects without making claims about effect sizes.
The comparison of the two examples shows why it is important to think about effect sizes, even when these effect sizes do not generalize to the real world. Effect sizes are needed to calibrate sample sizes so that resources are not wasted on overpowered studies (studying anchoring with N = 1,000) or on underpowered studies (studying flag priming with N = 100). Using a simple recipe that starts with the assumption that effect sizes are large, it is possible to use few resources first and then increase sample sizes as needed, if effect sizes turn out to be small.
Low vs. High Category Scales
To illustrate the recipe with a small-to-medium effect size, I picked Schwarz et al.’s (1985) manipulation of high versus low frequencies as labels for a response category. I started again with the Penn State U – Abington sample (N = 84). The effect size was d = .33, but the 95% confidence interval ranged from d = -.17 to d = .84. Although the interval does not exclude d = .8, it seems unlikely that the effect size is large, but it is not unreasonable to assume that the effect size could be moderate rather than small. Thus, the next study used d = .5 as the null-hypothesis and examined whether the effect size is at least d = .2. A power analysis shows that N = 120 (n = 60 per cell) participants are needed. I picked the sample from Brasilia (N = 120) for this purpose. The results showed a strong effect size, d = .88. The 95% confidence interval even excluded a medium effect size, d = .51 to d = 1.23, but given the results of Study 1, it is reasonable to conclude that the effect size is not small, but could be medium or even large. A sample size of N = 120 seems reasonable for replication studies that examine the generalizability of results across populations (or conceptual replication studies, but they were not available in this dataset).
To further examine generalizability, I picked the sample from Istanbul (N = 113). Surprisingly, the 95% confidence interval, d = -.31 to d = .14, did not include d = .5. The confidence interval also does not overlap with the confidence interval in Study 2. Thus, there is some uncertainty about the effect and under what conditions it can be produced. However, a meta-analysis across all three studies shows a 95%CI that includes a medium effect size, 95%CI = .21 to .65.
Thus, it seems reasonable to examine replicability in other samples with the same sample size. The next sample with a similar sample size is Laurier (N = 112). The results show an effect size of d = .43 and the 95%CI includes d = .5, 95%CI = .17 to d = .69. The meta-analytic confidence interval, 95%CI = .27 to .61, excludes small effect sizes of d = .2 and large effect sizes of d = .8.
Thus, a research program with four samples and a total sample size of N = 429 participants helped to establish a medium effect size for the effect of low versus high scale labels on ratings. The effect size estimate based on the full ML dataset is d = .48.
At this point, it may seem as if I cherry-picked samples to make the recipe look good. I didn’t, but I don’t have a preregistered analysis plan to show that I did not. I suggest others try it out with other open data where we have a credible estimate of the real effect based on a large sample and then try to approach this effect size using the recipe I proposed here.
The main original contribution of this blog post is to move away from nil-hypothesis significance testing. I am not aware of any other suggestions that are similar to the proposed recipe, but the ideas are firmly based on Neyman-Pearson’s approach to significance testing and Cohen’s recommendation to think about effect sizes in the planning of studies. The use of confidence intervals makes the proposal similar to Cumming’s suggestion to focus more on estimation than hypothesis testing. However, I am not aware of a recipe for the systematic planning of sample sizes that vary as a function of effect sizes. Too often confidence intervals are presented as if the main goal is to provide precise effect size estimates, although the meaning of these precise effect sizes in psychological research is unclear. What a medium effect size for category labels means in practice is not clear, but knowing that it is medium allows researchers to plan studies with adequate power. Finally, the proposal is akin to sequential testing, where researchers look at their data to avoid collecting too many data. However, sequential testing still suffers from the problem that it tests the nil-hypothesis and that a non-significant result is inconclusive. In contrast, this recipe provides valuable information even if the first study produces a non-significant result. If the first study fails to produce a significant result, it suggests that the effect size is large. This is valuable and publishable information. Significant results are also meaningful because they suggest that the effect size is not large. Thus, results are informative with significant and non-significant results, removing the asymmetry of nil-hypothesis testing where non-significant results are uninformative. The only studies that are not informative are studies where confidence intervals are too wide to be meaningful or replication studies that are underpowered. The recipe helps researchers to avoid these mistakes.
The proposal also addresses the main reason why researchers do not use power analysis to plan sample sizes. The mistaken belief is that it is necessary to guess the population effect size. Here I showed that this is absolutely not necessary. Rather, researchers can start with the most optimistic assumptions and test the hypothesis that their effect is large. More often than not, the result will be disappointing, but not useless. The results of the first study provide valuable information for the planning of future studies.
I would be foolish to believe that my proposal can actually change research practices in psychology. Yet, I cannot help thinking that it is a novel proposal that may appeal to some researchers who are struggling in the planning of sample sizes for their studies. The present proposal allows them to shoot for the moon and fail, as long as they document this failure and then replicate with a larger sample. It may not solve all problems, but it is better than p-rep or Bayes-Factors and several other proposals that failed to fix psychological science.