Link to open access PlosOne article:

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0109019

Information about typical sample sizes is informative for a number of reasons. Most important, sampling error is related to sample size. Everything else being equal, larger samples have less sampling error. Studies with less sampling error (a) are more likely to produce statistically significant evidence for an effect when an effect is present, (b) can produce more precise estimates of effect sizes, and (c) are more likely to produce replicable results.

Fraley and Vazire (2014) proposed that typical sample sizes (median N) in journals can be used to evaluate the replicability of results published in these journals. They called this measure the N-pact Factor (NF) and propose that it is a reasonable proxy for statistical power; that is, the probability that a study will produce a statistically significant result when a real effect is present.

“The authors evaluate the quality of research reported in major journals in social-personality psychology by ranking those journals with respect to their N-pact Factors (NF)—the statistical power of the empirical studies they publish to detect” (Abstract, p. 1).

The article also contains information about the typical sample size in six psychology journals for the years 2006 to 2010. The numbers are fairly consistent across years and the authors present a combined NF for the total time period.


| Journal Name | NF (median N) | Power (d = .41) | Power (d = .50) |
|---|---|---|---|
| Journal of Personality | 178 | .78 | .91 |
| Journal of Research in Personality | 129 | .64 | .80 |
| Personality and Social Psychology Bulletin | 95 | .50 | .67 |
| Journal of Personality and Social Psychology | 90 | .49 | .65 |
| Journal of Experimental Social Psychology | 87 | .47 | .64 |
| Psychological Science | 73 | .40 | .56 |

The results show that median sample sizes range from 73 to 178. The authors also examined the relationship between NF and the impact factor of a journal. They found a negative correlation of r = -.48, 95% CI [-.93, +.54]. Based on this non-significant correlation in a study with a rather low NF of 6 (the six journals), the authors suggest that “journals that have the highest impact also tend to publish studies that have smaller samples” (p. 8).

In their conclusions, the authors suggest that “journals that have a tendency to publish higher power studies should be held in higher regard than journals that publish lower powered studies—a quality we indexed using the N-pact Factor.” (p. 8). According to their NF-Index, the Journal of Personality should be considered the best of the six journals. In contrast, the journal with the highest impact factor, Psychological Science, should be considered the worst journal because the typical sample size is the smallest.

The authors also make some more direct claims about statistical power. To make inferences about the typical power of statistical tests in a journal, the authors assume that “Statistical power is a function of three ingredients: α, N, and the population effect size” (p. 6).

Consistent with previous post-hoc power analyses, the authors set the significance criterion to α = .05 (two-tailed). The sample size is given by the median sample size in a journal, which is equivalent to the N-pact Factor (NF). The only missing information is the median population effect size. The authors rely on a meta-analysis by Richard et al. (2003) to estimate the population effect size as d = .41. This value is the median effect size in a meta-analysis of over 300 meta-analyses in social psychology that covers the entire history of the field. Alternatively, they could have used d = .50, which is a moderate effect size according to Cohen. This value has been used in previous studies of the statistical power of journals (Sedlmeier & Gigerenzer, 1989). The table above shows both power estimates. Accordingly, the Journal of Personality has good power (Cohen recommended 80% power) and the journal Psychological Science would have low power to produce significant results.
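The power values in the table above can be reproduced, to a close approximation, with a standard power calculation for a two-sample t-test with equal group sizes. The sketch below uses a normal approximation rather than the exact noncentral t distribution, so the last digit may differ slightly from the published values; the function name and code are my own illustration, not taken from the paper.

```python
from math import erf, sqrt

def power_two_sample(total_n, d):
    """Approximate power of a two-tailed two-sample t-test with
    alpha = .05 and equal group sizes (normal approximation)."""
    ncp = d * sqrt(total_n) / 2   # noncentrality for n = N/2 per group
    z_crit = 1.96                 # critical z for two-tailed alpha = .05
    # Phi(ncp - z_crit), with Phi computed via the error function
    return 0.5 * (1 + erf((ncp - z_crit) / sqrt(2)))

# Median N per journal, 2006-2010 (Fraley & Vazire, 2014)
journals = {
    "Journal of Personality": 178,
    "Journal of Research in Personality": 129,
    "Personality and Social Psychology Bulletin": 95,
    "Journal of Personality and Social Psychology": 90,
    "Journal of Experimental Social Psychology": 87,
    "Psychological Science": 73,
}
for name, n in journals.items():
    print(f"{name}: {power_two_sample(n, 0.41):.2f} (d=.41), "
          f"{power_two_sample(n, 0.50):.2f} (d=.50)")
```

For the Journal of Personality (N = 178), this yields .78 for d = .41 and roughly .91 for d = .50, matching the table above.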

In the end, the authors suggest that the N-pact factor can be used to evaluate journals and that journal editors should strive towards a high NF. “One of our goals is to encourage journals (and their editors, publishers, and societies that sponsor them) to pay attention to and strive to improve their NFs” (p. 11). The authors further suggest that NF provides “an additional heuristic to use when deciding which journal to submit to, what to read, what to believe, or where to look to find studies to publicize” (p. 11).

Before I present my criticism of the N-pact Factor, I want to emphasize that I agree on several points with the authors. First, I believe that statistical power is important (Schimmack, 2012). Second, I believe that quantitative indicators that provide information about the typical statistical power of studies in a journal are valuable. Third, I agree with the authors that everything else being equal, statistical power increases with sample size.

My first concern is that sample sizes can provide misleading information about power because researchers often conduct analyses on subsamples of their data. For example, with 20 participants per cell, a typical 2 x 2 ANOVA design has a total sample size of N = 80. The ANOVA with all participants is often followed by post-hoc tests that aim to test differences between two theoretically important means. For example, after demonstrating an interaction between gender and the experimental manipulation, post-hoc tests are used to show that there is a significant increase for men and a significant decrease for women. Although the interaction effect can have high power because the pattern in the two groups goes in opposite directions (cross-over interaction), the comparisons within each gender with N = 40 have considerably less power. A comparison of sample sizes and degrees of freedom in Psychological Science shows that many tests have smaller df than N (e.g., 37/76, 65/131, 62/130, 66/155, 57/182 for the first five articles in 2010 in alphabetical order). This problem could be addressed by using information about df to compute the median N of statistical tests.
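The loss of power from testing simple effects in subsamples can be quantified with a normal-approximation power function for a two-sample test (a sketch; the Ns are the hypothetical ones from the 2 x 2 example, and d = .41 is the assumed typical effect size):

```python
from math import erf, sqrt

def power_two_sample(total_n, d):
    """Approximate power of a two-tailed two-sample test
    (alpha = .05, equal groups, normal approximation)."""
    ncp = d * sqrt(total_n) / 2
    return 0.5 * (1 + erf((ncp - 1.96) / sqrt(2)))

d = 0.41  # assumed typical effect size in social psychology
print(power_two_sample(80, d))  # effect tested in the full sample (N = 80)
print(power_two_sample(40, d))  # same effect tested within one gender (N = 40)
```

With d = .41, power drops from roughly .45 in the full sample of 80 to roughly .25 in a subsample of 40. A median df of the reported tests would capture this drop; a median N would not.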

A more fundamental concern is the use of sample size as a proxy for statistical power. This is only valid if all studies had the same effect size and used the same research design. These restrictive conditions are clearly violated when the goal is to provide information about the typical statistical power of diverse articles in a scientific journal. Some research areas have larger effects than others. For example, animal studies make it easier to control variables, which reduces sampling error. Perception studies can often gather hundreds of observations in a one-hour session, whereas social psychologists may end up with a single behavior in a carefully staged deception study. The use of a single effect size for all journals benefits journals that use large samples to study small effects and punishes journals that publish carefully controlled studies that produce large effects. At a minimum, one would expect the information about sample sizes to be complemented with information about the median effect size in a journal. The authors did not consider this option, presumably because this information is much harder to obtain than sample sizes, but it is essential for power estimation.

A related concern is that sample size can only be used to estimate power for a simple between-subject design. Estimating statistical power for more complex designs is more difficult and often not possible without information that is not reported. Applying the simple formula for between-subject designs to these studies can severely underestimate statistical power. A within-subject design with many repeated trials can produce more power than a between-subject design with 200 participants. If the NF were used to evaluate journals or researchers, it would favor researchers who use inefficient between-subject designs rather than efficient designs, which would incentivize waste of research funds. It would be like evaluating cars based on their gasoline consumption rather than on their mileage.
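To illustrate how misleading N alone can be, the sketch below compares a between-subject design with N = 200 to a within-subject (paired) design with only 30 participants. The repeated-measures correlation r = .9 is a hypothetical value standing in for the high reliability of designs with many repeated trials, and the normal approximation is mine, not the authors':

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

Z = 1.96  # two-tailed critical value for alpha = .05

def power_between(total_n, d):
    """Two independent groups of total_n / 2 each (normal approximation)."""
    return phi(d * sqrt(total_n) / 2 - Z)

def power_within(n, d, r):
    """Paired design: d_z = d / sqrt(2 * (1 - r)), where r is the
    correlation between the repeated measures."""
    d_z = d / sqrt(2 * (1 - r))
    return phi(d_z * sqrt(n) - Z)

print(power_between(200, 0.41))     # N = 200, between-subject
print(power_within(30, 0.41, 0.9))  # n = 30, within-subject, r = .9
```

Under these assumptions the 30-person within-subject study has power above .99, while the 200-person between-subject study sits around .83, even though its NF is almost seven times larger.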

AN EMPIRICAL TEST OF NF AS A MEASURE OF POWER

The problem of equating sample size with statistical power is apparent in the results of the OSF-reproducibility project. In this project, a team of researchers conducted exact replication studies of 97 statistically significant results published in three prominent psychology journals. Only 36% of the replication studies were significant. The authors examined several predictors of replication success (p < .05 in the replication study), including sample size.

Importantly, they found a negative relationship between sample size of the original studies and replication success (r = -.15). One might argue that a more appropriate measure of power would be the sample size of the replication studies, but even this measure failed to predict replication success (r = -.09).

The reason for this failure of NF is that the OSF-reproducibility project mixed studies from the cognitive literature, which often uses powerful within-subject designs with small samples, with studies from social psychology, which often relies on less powerful between-subject designs. Although the social psychology studies have larger samples, the small-sample within-subject studies in cognitive psychology are more powerful and tended to replicate at a higher rate.

This example illustrates that the focus on sample size is misleading and that the N-pact Factor would have led to the wrong conclusion about the replicability of research in social versus cognitive psychology.

CONCLUSION

Everything else being equal, studies with larger samples have more statistical power to demonstrate real effects, because statistical power is monotonically related to sample size. In this sense, larger samples are better because more statistical power is better. However, in real life everything else is not equal, and rewarding sample size without taking effect sizes and design features of a study into account creates a false incentive structure. In other words, bigger samples are not necessarily better.

To increase replicability and to reward journals for publishing replicable results it would be better to measure the typical statistical power of studies than to use sample size as a simple, but questionable proxy.

_____________________________________________________________

P.S. The authors briefly discuss the possibility of using observed power, but reject this option based on a common misinterpretation of Hoenig and Heisey (2001). Hoenig and Heisey (2001) pointed out that observed power is a useless statistic when an observed effect size is used to estimate the power of that particular study. Their critique does not invalidate the use of observed power for a set of studies or a meta-analysis of studies. In fact, the authors used a meta-analytically derived effect size to compute observed power for their median sample sizes. They could also have computed a meta-analytic effect size for each journal and used this effect size for a power analysis. One may be concerned about the effect of publication bias on effect sizes published in journals, but this concern applies equally to the meta-analytic results by Richard et al. (2003).

P.P.S. Conflict of Interest. I am working on a statistical method that provides estimates of power. I am personally motivated to find reasons to like my method better than the N-pact Factor, which may have influenced my reasoning and my presentation of the facts.

Hi Uli,

Thanks for your review! I agree with most of the points you raise.

The main point I would emphasize about our paper (speaking just for myself) is that we explicitly were only interested in comparing social/personality studies and journals (for psych science, we coded only social/personality papers). This somewhat mitigates the concern about variations in effect sizes across sub-areas of psych (the Richard et al. meta-analysis didn’t find much variation across topics within social/personality psych if I remember correctly). The focus on social/personality research also somewhat mitigates the concern that our conclusions don’t apply to purely within-subjects design (which is true), because we believe it’s very rare for social/personality research to have, as a key research question, a purely within-subjects question/design. I might be wrong about that, and it might be changing (which would be great), but that was the idea.

This also could speak to why sample size didn’t predict higher replicability in the RP:P – as you point out, this could be driven by the fact that the RP:P included both social/personality and cognitive studies, and if the cognitive studies tended to study larger effects and/or use more within-subjects designs, sample size wouldn’t be a good proxy for power. I’d be curious to know if sample size predicted replicability just within the social/personality studies.

All that said, the NF is absolutely flawed and crude. I just think the world with NF is better than the world with just IF, but I would be delighted if it were replaced with a better index. I think you might be just the right person to do that, and I hope you do!

-simine

Hi Simine,

Thank you for your positive review of my review, and thank you for pointing out that you only coded social/personality articles in Psychological Science.

This precludes within-subject designs by cognitive psychologists as an explanation for the variation in sample size.

Another factor could be that some journals publish personality research (JP, JRP), some publish a mix of both (PSPB, JPSP), and others publish mostly social psychology (JESP, PS).

Personality psychologists conduct correlational studies and use N = 100 as a rule of thumb for sample size (I think the new rule of thumb is N = 200).

Social psychologists use n=20 per cell as a rule of thumb, and the total sample size is sometimes N = 40.

This might explain the small sample size in Psychological Science.

In any case, we probably agree that the main problem is between-subject experiments with n = 20 per cell. This rule of thumb leads to severely underpowered studies with low replicability (see OSF results).

I wonder how n=20 became a rule of thumb in social psychology.