A tutorial about effect sizes, power, z-curve analysis, and personalized p-values.

I recorded a meeting with my research assistants, who are coding articles to estimate the replicability of psychological research. The recording is unedited and raw, but you might find it interesting to listen to. Below I give a short description of the topics that were discussed, starting with an explanation of effect sizes and ending with a discussion of how to choose a graduate supervisor.

Link to video

The meeting is based on two blog posts that introduce personalized p-values.
1. https://replicationindex.com/2021/01/15/men-are-created-equal-p-values-are-not/
2. https://replicationindex.com/2021/01/19/personalized-p-values/

1. Rant about Fisher’s approach to statistics that ignores effect sizes.
– look for p < .05 and do a happy dance if you find it; now you can publish.
– still the way statistics is taught to undergraduate students.

2. Explaining statistics starting with effect sizes.
– unstandardized effect size (height difference between men and women in cm)
– unstandardized effect sizes depend on the unit of measurement
– to standardize effect sizes we divide by standard deviation (Cohen’s d)
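
To make the difference between unstandardized and standardized effect sizes concrete, here is a minimal Python sketch. The height values are invented for illustration and are not real data. The helper name `cohens_d` is hypothetical, and it divides by the pooled standard deviation, which is one common convention for Cohen's d.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Invented heights for illustration only (not real data).
men_cm = [178.0, 183.0, 171.0, 190.0, 176.0, 181.0]
women_cm = [165.0, 170.0, 158.0, 174.0, 162.0, 168.0]

# The same heights expressed in inches.
men_in = [h / 2.54 for h in men_cm]
women_in = [h / 2.54 for h in women_cm]

# The unstandardized difference changes with the unit of measurement ...
raw_diff_cm = mean(men_cm) - mean(women_cm)
raw_diff_in = mean(men_in) - mean(women_in)

# ... but the standardized effect size does not.
d_cm = cohens_d(men_cm, women_cm)
d_in = cohens_d(men_in, women_in)

print(f"difference: {raw_diff_cm:.2f} cm = {raw_diff_in:.2f} in")
print(f"Cohen's d:  {d_cm:.2f} (cm) vs {d_in:.2f} (in)")
```

The raw difference shrinks by a factor of 2.54 when the unit changes from centimeters to inches, while d stays the same because the standard deviation is rescaled by the same factor.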

3. Why do/did social psychologists run studies with n = 20 per condition?
– limited resources, small subject pools, and the statistical tests can be run with n = 20 to 30.
– it was obvious that these sample sizes are too small after Cohen (1962) introduced power analysis to psychology
– but some argued that low power is OK because it is a more efficient way to get significant results.

4. Simulation of social psychology: 50% of hypotheses are true, 50% are false, the effect size of true hypotheses is d = .4, and the sample size of studies is N = 20.
– Analyzing the simulated results (with k = 200 studies) with z-curve 2.0. In this simulation, the true discovery rate is 14%. That is, 14% of the 200 studies produced a significant result.
– Z-curve correctly estimates this discovery rate based on the distribution of the significant p-values, converted into z-scores.
– If only significant results are published, the observed discovery rate is 100%, but the true discovery rate is only 14%.
– Publication bias leads to false confidence in published results.
– Publication bias is wasteful because we are discarding useful information.
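
The data-generating side of this simulation can be sketched in a few lines of Python. This is not z-curve itself, which fits a model to the distribution of significant z-scores; the sketch only shows where a discovery rate of this size comes from. It assumes n = 20 per condition (as in point 3), a two-sided test with alpha = .05, and a normal approximation instead of a t-test. All function names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
CRITICAL_Z = STD_NORMAL.inv_cdf(1 - 0.05 / 2)  # two-sided alpha = .05


def p_to_z(p_two_sided):
    """Convert a two-sided p-value into an absolute z-score."""
    return STD_NORMAL.inv_cdf(1 - p_two_sided / 2)


def power(d, n_per_group):
    """Power of a two-sided, two-group z-test (normal approximation)."""
    ncp = d * (n_per_group / 2) ** 0.5  # expected z-score of a true effect
    upper_tail = 1 - STD_NORMAL.cdf(CRITICAL_Z - ncp)
    lower_tail = STD_NORMAL.cdf(-CRITICAL_Z - ncp)
    return upper_tail + lower_tail


def simulate_discovery_rate(k, d, n_per_group, share_true, rng):
    """Share of k simulated studies that produce a significant result."""
    ncp = d * (n_per_group / 2) ** 0.5
    significant = 0
    for _ in range(k):
        hypothesis_is_true = rng.random() < share_true
        z = rng.gauss(ncp if hypothesis_is_true else 0.0, 1.0)
        significant += abs(z) > CRITICAL_Z
    return significant / k


# Long-run discovery rate: false hypotheses are significant 5% of the
# time, true hypotheses are significant at the rate given by power.
expected_rate = 0.5 * 0.05 + 0.5 * power(0.4, 20)

rng = random.Random(2021)  # fixed seed so the run is repeatable
simulated_rate = simulate_discovery_rate(200_000, 0.4, 20, 0.5, rng)
small_run = simulate_discovery_rate(200, 0.4, 20, 0.5, rng)  # k = 200

print(f"expected:  {expected_rate:.3f}")
print(f"simulated: {simulated_rate:.3f} (k = 200,000)")
print(f"small run: {small_run:.3f} (k = 200)")
print(f"p = .05 corresponds to z = {p_to_z(0.05):.2f}")
```

Under these assumptions the long-run discovery rate is about 15%, close to the 14% reported above; a t-test would give a slightly lower value. A single run with only k = 200 studies can deviate from the long-run rate by a few percentage points. If only the significant studies were published, every published study would be significant, which is why the observed discovery rate would be 100%.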

5. Power analysis.
– Fisher did not have power analysis.
– Neyman and Pearson invented power analysis, but Fisher wrote the textbook for researchers.
– We had 100 years to introduce students to power analysis, but it hasn’t happened.
– Cohen wrote books about power analysis, but he was ignored.
– Cohen suggested we should aim for 80% power (more is not efficient).
– Think a priori about effect size to plan sample sizes.
– Power analysis was ignored because it often implied very large samples.
(very hard to get participants in Germany with small subject pools).
– no change because all p-values were treated as equal. p < .05 = truth.
– Literature reviews and textbooks treat every published significant result as truth.
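
A rough a priori power analysis for a two-group comparison can be sketched as follows. The sketch assumes a two-sided test with alpha = .05 and uses the common normal-approximation formula n = 2 * ((z_alpha + z_power) / d)^2 per group; the function name `n_per_group` is hypothetical, and an exact t-test calculation gives slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

STD_NORMAL = NormalDist()


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate n per group for a two-sided, two-group comparison."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)  # critical value
    z_power = STD_NORMAL.inv_cdf(power)          # quantile for target power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Required sample sizes for a range of assumed effect sizes.
for d in (0.2, 0.4, 0.5, 0.8):
    n = n_per_group(d)
    print(f"d = {d}: about {n} per group, {2 * n} in total")
```

For d = .4 the approximation asks for about 100 participants per group, or roughly 200 in total, which is consistent with the N = 200 used for 80% power in point 6. For d = .2 it asks for close to 400 per group, which illustrates why power analysis so often implied samples that researchers considered too large.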

6. Repeating simulation (50% true hypotheses, effect size d = .4) with 80% power, N = 200.
– much higher discovery rate (58%)
– much more credible evidence
– z-curve makes it possible to distinguish between p-values from research with low or high discovery rate.
– Will this change the way psychologists look at p-values? Maybe, but Cohen and others have tried to change psychology without success. Will z-curve be a game-changer?

7. Personalized p-values
– P-values are being created by scientists.
– Scientists have some control over the type of p-values they publish.
– There are systemic pressures to publish more p-values based on low-powered studies.
– But at some point, researchers get tenure.
– nobody can fire you if you stop publishing
– social media allow researchers to publish without censure from peers.
– tenure also means you have a responsibility to do good research.
– The researchers who are listed on the post with personalized p-values all have tenure.
– Some researchers, like David Matsumoto, have a good z-curve.
– Other researchers have way too many just significant results.
– The observed discovery rates between good and bad researchers are the same.
– Z-curve shows that the significant results were produced very differently and differ in credibility and replicability; this could be a game changer if people care about it.
– My own z-curve doesn’t look so good. 🙁
– How can researchers improve their z-curve?
– publish better research now
– distance yourself from bad old research
– So far, few people have distanced themselves from bad old work because there was no incentive to do so.
– Now there is an incentive to do so, because researchers can increase credibility of their good work.
– some people may move up when we add the 2020 data.
– hand-coding of articles will further improve the work.

8. Conclusion and Discussion
– not all p-values are created equal.
– working with undergraduates is easy because they are unbiased.
– once you are in grad school, you have to produce significant results.
– z-curve can help to avoid getting into labs that use questionable practices.
– I was lucky to work in labs that cared about the science.

3 thoughts on “A tutorial about effect sizes, power, z-curve analysis, and personalized p-values.”

Robin Vickery says:

February 9, 2021 at 9:47 am

Hi, thanks for posting this. The link to the video requires a login/password. Is there any other way to watch the video?

1. Ulrich Schimmack says:
  
  February 9, 2021 at 9:56 am
  
  I fixed it. It should work now.
  
Replicability-Index

Improving the replicability of empirical research
