Authors: Ulrich Schimmack & Yue Chen
“Any man whose errors take ten years to correct is quite a man.” (J. Robert Oppenheimer)
More than a century ago, Charles Darwin proposed that facial expressions of emotions not only communicate emotional experiences to others, but also play an integral role in the experience of emotions themselves (Darwin, 1872). This hypothesis later became known as the facial feedback hypothesis (FFH).
Nearly a century later, a review article concluded that empirical evidence for the facial feedback hypothesis was inconclusive and suffered from some methodological problems (Buck, 1980). Most importantly, positive results may have been due to demand effects. That is, participants may have been aware that the manipulation of their facial muscles was intended to induce a specific emotion and responded accordingly.
Strack, Martin, and Stepper (1988) invented the pen-in-mouth paradigm to overcome these limitations of prior studies. In this paradigm, participants are instructed to hold a pen in their mouth either with their lips or with their teeth. Holding the pen with the teeth is supposed to activate the muscles involved in smiling (zygomaticus major). Holding the pen with the lips prevents smiling. To ensure that participants are not aware of the purpose of the manipulation, the study is conducted as a between-subject study with participants being randomly assigned to either the teeth or the lips condition. Furthermore, they are given a cover story for holding the pen in the mouth.
“The study you are participating in has to do with psychomotoric coordination. More specifically, we are interested in people’s ability to perform various tasks with parts of their body that they would normally not use for such tasks…The tasks we would like you to perform are actually part of a pilot study for a more complicated experiment we are planning to do next semester to better understand this substitution process.” (p. 770).
In Study 1, participants were shown several cartoons and asked to rate how funny each cartoon was. According to FFH, inducing smiling by holding a pen with teeth should induce amusement and amplify the funniness of cartoons. The average rating of funniness was consistent with this prediction (teeth M = 5.14 vs. lips M = 4.33 on a 0 to 9 scale). A second study replicated the pen-in-mouth paradigm with amusement ratings as the dependent variable (M = 6.43 vs. 5.40).
Strack et al.’s article has been widely cited as conclusive evidence for FFH (cf. Wagenmakers et al., 2016) and the article has been featured prominently in textbooks (cf. Coles, Larsen, & Lench, 2017) and popular psychology books (cf. Schimmack, 2017). However, in 2011 psychologists encountered a crisis of confidence after some classic findings could not be replicated and Nobel Laureate Daniel Kahneman asked for replications of classic studies (Kahneman, 2012).
Wagenmakers et al. (2016) answered this call using the newly established format of a Registered Replication Report (Simons & Holcombe, 2014). In this format, original authors, replication authors, and editors work together to design the replication study and the original study is replicated across several labs. Wagenmakers et al. (2016) reported the results of 17 preregistered replications of Strack et al.’s Study 1. The minimum sample size for each study was N = 50. Actual sample sizes ranged from N = 87 to 139. These sample sizes do not provide sufficient statistical power to replicate the effect in each study. However, a meta-analysis of all 17 studies ensures a high probability of replicating the original finding even with a statistically small effect size. Nevertheless, the replication study failed to provide evidence for FFH.
Some psychologists interpreted these results as challenging Darwin’s century old hypothesis that facial expressions play an important role in emotional experiences. After all, results based on the best test of the theory that were widely used to support FFH could not be replicated. However, some psychologists raised concerns about the replication study. Reber (2016) compared psychology to chemistry. For an experiment in chemistry to work as predicted, chemists need to use pure chemicals. Even small impurities may cause failures to demonstrate chemical processes. Reber suggested that the replication failure of the FFH could have been caused by “impurities” in the replication study. This line of argumentation is dangerous because it can lead to circular reasoning. That is, if a study provides evidence for a theoretically predicted effect, the study was pure, but if a study fails to provide evidence for the effect, the study was impure. Accordingly, a theoretical prediction can never be falsified.
It is also possible to question the results of the original study. Schimmack (2017) pointed out that both studies failed to reach the standard criterion of statistical significance in a two-tailed test and were only significant in a one-tailed test. These results are often called marginally significant. Two marginally significant results are suggestive, but do not provide conclusive evidence for an effect. Thus, these results were prematurely accepted as evidence for FFH, when additional evidence was needed.
It would be surprising if nobody had ever tried to replicate the pen-in-mouth paradigm given its prominence and theoretical importance. In fact, numerous published articles have used the paradigm to replicate and extend the original findings (see Appendix). We conducted a replicability analysis of studies that used the pen-in-mouth paradigm prior to the controversial registered replication report. If previous studies consistently found evidence for FFH, it suggests that the replication report studies were impure. However, if previous studies also had difficulties demonstrating the effect, it suggests that the pen-in-mouth paradigm does not reliably produce facial feedback effects.
A replicability analysis differs from conventional meta-analyses in two ways. First, the goal of a replicability analysis is not to estimate an effect size. Instead, the goal is to estimate the average replicability of a set of studies, where replicability is defined as the probability of obtaining a statistically significant result (Schimmack, 2014). Second, a replicability analysis examines whether a set of studies shows signs of publication bias by comparing the percentage of significant results to the average statistical power of studies. In an unbiased set of studies, the success rate should match median observed power. However, if publication bias is present, the success rate is higher than median observed power justifies (Schimmack, 2012, 2014).
Median observed power is only an estimate of average power and the estimate is imprecise with small sets of studies. However, precision increases as the number of studies increases. For this replicability analysis, we conducted a cumulative replicability analysis where studies are added in chronological order. The cumulative analysis shows how strong the evidence for FFH was in the beginning and how it changed over time.
We used three search strategies to retrieve original articles that used the pen-in-mouth paradigm. First, we conducted full-text searches of social psychology journals looking for the word pen. Second, we searched for articles that cited Strack et al.’s seminal study that introduced the pen-in-mouth paradigm. Third, articles that were found using the first two strategies were searched for references to additional studies. We found 12 published articles with 19 independent studies that used the pen-in-mouth paradigm, including the original pair of studies.
For each independent study, we converted reported test statistics into a z-score as a standardized measure of the strength of evidence for FFH. If the means were not in the predicted direction, the z-scores were negative. We then computed observed power with z = 1.96 (p < .05, two-tailed) as criterion, unless the authors interpreted a marginally significant result as evidence for FFH; in this case, we used z = 1.65 as criterion for significance. The formula to convert z-scores into observed power is simply
1 - pnorm(criterion.z, obs.z), where obs.z = observed z-score and criterion.z = 1.96 or 1.65
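As a hedged illustration, the R expression above can be rendered in Python with the standard library. The function name `observed_power` and its argument names are hypothetical and chosen here only for readability; the calculation itself is the one stated in the text, namely the area above the criterion under a normal distribution centered on the observed z-score with a standard deviation of 1.

```python
from statistics import NormalDist


def observed_power(obs_z: float, criterion_z: float = 1.96) -> float:
    """Python rendering of the R expression 1 - pnorm(criterion.z, obs.z).

    pnorm(q, mean) with the default sd = 1 is the CDF of Normal(mean, 1)
    evaluated at q, so the result is the upper-tail area above the criterion.
    """
    return 1.0 - NormalDist(mu=obs_z, sigma=1.0).cdf(criterion_z)


# An observed z-score exactly at the criterion gives an observed power of 0.5.
print(observed_power(1.96, 1.96))
# The weaker criterion used for marginally significant results.
print(observed_power(1.65, 1.65))
```

Note that this is a one-sided calculation, as in the formula above: it ignores the negligible probability of a significant result in the opposite direction.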
The outcome of each study is dichotomous with 0 = not significant and 1 = significant. Averaging this outcome across studies yields the success rate.
We then computed an inflation index as the difference between the success rate and median observed power. In the long run, these two values should be equivalent if there is no publication bias. If there is publication bias, the inflation index is positive and reflects the amount of publication bias.
Finally, we computed the Replicability Index (R-Index; Schimmack, 2014). As publication bias also inflates median observed power, we subtracted the inflation index from median observed power. The result is called the R-Index. An R-Index of 50% or less suggests that it would be difficult to replicate a finding with the typical sample sizes in the set of studies.
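The steps described above (success rate, median observed power, inflation index, R-Index, and the cumulative version in chronological order) can be sketched in Python as follows. This is a minimal sketch, not the worksheet used for the analysis: the function names are hypothetical, and the z-scores in the usage example are made up for illustration and are NOT the values from Table 1.

```python
from statistics import NormalDist, median


def observed_power(obs_z, criterion_z):
    # Same quantity as the R expression 1 - pnorm(criterion.z, obs.z).
    return 1.0 - NormalDist(mu=obs_z, sigma=1.0).cdf(criterion_z)


def r_index(z_scores, criteria):
    """Return success rate, median observed power, inflation, and R-Index."""
    powers = [observed_power(z, c) for z, c in zip(z_scores, criteria)]
    # A study counts as significant when its z-score reaches its criterion.
    significant = [1 if z >= c else 0 for z, c in zip(z_scores, criteria)]
    success_rate = sum(significant) / len(significant)
    mop = median(powers)
    inflation = success_rate - mop
    return {
        "success_rate": success_rate,
        "median_observed_power": mop,
        "inflation": inflation,
        "r_index": mop - inflation,
    }


def cumulative_r_index(z_scores, criteria):
    """Recompute the indices after each study is added in chronological order."""
    return [r_index(z_scores[:k], criteria[:k]) for k in range(1, len(z_scores) + 1)]


# Hypothetical z-scores and criteria, NOT the values from Table 1.
example_z = [1.70, 1.80, 0.50, 2.20]
example_criteria = [1.65, 1.65, 1.96, 1.96]
for step, row in enumerate(cumulative_r_index(example_z, example_criteria), start=1):
    print(step, {name: round(value, 3) for name, value in row.items()})
```

Because the R-Index equals median observed power minus inflation, it can also be written as twice the median observed power minus the success rate, which is why it can become negative when power is very low and the success rate is high.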
Table 1 shows the results. The original studies were both successful with the weaker criterion value of z = 1.65 that was used by the authors. However, both studies barely met this criterion, which leads to a high inflation index and a low replicability index. As predicted by the low R-Index, the next study produced a non-significant result, which brought the success rate more in line with median observed power. However, the next three studies also failed to demonstrate the effect, and median observed power dropped to .07. From Study 9 to Study 19, median observed power stayed at this level, while the success rate remained above 30%, indicating the influence of publication bias.
For the total set of 19 studies, 10 (53%) produced non-significant results. If each study had a 1 - .07 = 93% probability of producing a non-significant result, the probability of obtaining more than 10 non-significant results in 19 studies would be greater than 99.99% (Schimmack, 2012). Thus, there is strong evidence of publication bias, even though the estimated median power is only 7%. Combining the very low estimate of median power with a positive inflation index yields a negative R-Index. Thus, it is not surprising that a set of studies without publication bias failed to replicate the original effect. This finding is entirely consistent with the cumulative evidence from previous studies, once publication bias is taken into account. In fact, the cumulative analysis shows that there was never convincing evidence for the effect (R-Index < 50).
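The binomial reasoning in this paragraph can be checked with a short Python sketch. It rests on two assumptions that are a reading of the figures above rather than a reproduction of the original calculation: that 10 of the 19 studies were non-significant (so 9 were significant), and that the median observed power of .07 can be treated as each study's probability of a significant result. The helper name `prob_at_least` is hypothetical.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Binomial probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_studies = 19
n_significant = 9  # assumed: 19 studies minus the 10 non-significant results
power = 0.07       # median observed power, used as the per-study success probability

# Probability of seeing 9 or more significant results if power were really .07.
p_too_many_successes = prob_at_least(n_significant, n_studies, power)
# Complement: probability of more than 10 non-significant results.
p_more_failures = 1 - p_too_many_successes

print(p_too_many_successes)
print(p_more_failures)
```

Under these assumptions the first probability is far below .0001, which is consistent with the figure of more than 99.99% reported in the text.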
Note. No. = Number of Study in Chronological Order, Year = Year, A# = Article Number (see Appendix), S# = Study Number in Article, z = strength of evidence for or against FFH, OP = observed power, Sig = Significant (0 = No, 1 = Yes), MOP = Median Observed Power, SR = Success Rate, Inf. = Inflation, R-Index = Replicability Index (MOP – Inf).
Darwin was a great scientist. Since he published his influential theory of evolution in 1859, biology has made tremendous progress in understanding the process of evolution. The same cannot be said about Darwin’s theory of emotion. More than a hundred years later, psychologists are still debating the influence of facial feedback on emotional experiences. One reason for the slow progress in some areas of psychology is that original studies were often accepted as conclusive evidence without rigorous replication efforts. In addition, meta-analyses provided misleading results because they failed to take publication bias into account. The present replicability analysis showed that the pen-in-mouth paradigm never provided convincing evidence for facial feedback effects. Nevertheless, the original study was often cited as evidence for facial feedback effects. To make progress like other sciences, psychology needs to take empirical studies more seriously and ensure that important findings can be replicated before they become cornerstones of theories and textbook findings.
This replicability analysis is limited to the pen-in-mouth paradigm. Other paradigms may produce replicable results. However, the pen-in-mouth paradigm has been used because it addressed limitations of these paradigms such as demand effects. Thus, even if these paradigms were more successful, the underlying mechanism would be less clear. At present, the replicability analysis simply shows a lack of evidence for FFH, but it would be premature to conclude that facial feedback effects do not exist.
Buck, R. (1980). Nonverbal Behavior and the Theory of Emotion: The Facial Feedback Hypothesis. Journal of Personality and Social Psychology, 38, 811-824.
Coles, N. A., Larsen, J. T., & Lench, H. C. (2017). A meta-analysis of the facial feedback hypothesis literature. OSF-Preprint.
Darwin, C. (1872). The expression of emotions in man and animals. London: John Murray.
Kahneman, D. (2012). A proposal to deal with questions about priming effects.
Strack, F., Martin, L. L., & Stepper, S. (1988). Inhibiting and Facilitating Conditions of the Human Smile: A Nonobtrusive Test of the Facial Feedback Hypothesis. Journal of Personality and Social Psychology, 54, 768–777.
Reber, R. (2016). Impure replications.
Schimmack, U. (2012). The Ironic Effect of Significant Results on the Credibility of Multiple-Study Articles. Psychological Methods, 17, 551–566.
Schimmack, U. (2014). A revised introduction to the R-Index.
Schimmack, U. (2017). Reconstruction of a Train Wreck: How Priming Research Went off the Rails. https://replicationindex.com/category/thinking-fast-and-slow/
Simons, D. J., & Holcombe, A. O. (2014). Registered Replication Reports.
Wagenmakers, E.-J., et al. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11, 917-928.
Appendix: Articles used for Meta-Analysis
A1. Strack, F., Martin, L. L., & Stepper, S. (1988). Inhibiting and Facilitating Conditions of the Human Smile: A Nonobtrusive Test of the Facial Feedback Hypothesis. Journal of Personality and Social Psychology, 54, 768–777.
A2. Soussignan, R. (2002). Duchenne Smile, Emotional Experience, and Autonomic Reactivity: A Test of the Facial Feedback Hypothesis. Emotion, 2, 52-74.
A3. Ito, T., Chiao, K. W., Devine, P. G., Lorig, T. S., & Cacioppo, J. T. (2006). The Influence of Facial Feedback on Race Bias. Psychological Science, 17, 256-261.
A4. Andreasson, P., & Dimberg, U. (2008). Emotional Empathy and Facial Feedback. Journal of Nonverbal Behavior, 32, 215-224.
A5. Wiswede, D., Münte, T. F., Krämer, U. M., & Rüsseler, J. (2009). Embodied Emotion Modulates Neural Signature of Performance Monitoring. PLoS ONE, 4, e5754, 1-6.
A6. Kraft, T. L., & Pressman, S. D. (2012). Grin and Bear It: The Influence of Manipulated Facial Expression on the Stress Response. Psychological Science, 23, 1372-1378.
A7. Paredes, B., Stavraki, M., Briñol, P., & Petty, R. E. (2013). Social Psychology, 44, 349-353.
A8. Marmolejo-Ramos, F. & Dunn, J. (2013). On the activation of sensorimotor systems during the processing of emotionally-laden stimuli. Universitas Psychologica, 12, 1511-1542.
A9. Rummer, R., Schweppe, J., Schlegelmilch, R., & Grice, M. (2014). Mood Is Linked to Vowel Type: The Role of Articulatory Movements. Emotion, 14, 246-250.
A10. Dzokoto, V., Wallace, D. S., Peters, L., & Bentsi-Enchill, E. (2014). Attention to Emotion and Non-Western Faces: Revisiting the Facial Feedback Hypothesis. The Journal of General Psychology, 141(2), 151–168.
A11. Arminjon, M., Preissmann, D., Chmetz, F., Duraku, A., Ansermet, F., & Magistretti, P. J. (2015). Embodied memory: Unconscious smiling modulates emotional evaluation of episodic memories, Frontiers in Psychology, 6, 650, 1-7.
A12. Epstein, N., Brendel, T., Hege, I., Ouellette, D. L., Schmidmaier, R., & Kiesewetter, J. (2016). The power of the pen: how to make physicians more friendly and patients more attractive. Medical Education, 50, 1214–1218.
12 thoughts on “The Power of the Pen-in-Mouth Paradigm (PIMP): A Replicability Analysis”
This is an interesting analysis, but I don’t think you should call it a “replicability analysis” given that you’re simply meta-analyzing studies that are so methodologically heterogeneous that they could never falsify an original hypothesis. A true (& valid) “replicability analysis” needs to ensure that each replication study uses a methodology that is indeed sufficiently methodologically similar to an original study, which is what Curate Science does, as outlined in our latest unified framework paper: https://osf.io/preprints/psyarxiv/uwmr8
Also, you should stop (mis)defining replicability as statistical power (“…goal is to estimate the average replicability of a set of studies, where replicability is defined as the probability of obtaining a statistically significant result”) because this just confuses everything. Replicability is the extent to which an effect is independently replicable in sufficiently methodologically similar replication studies (which themselves are also each sufficiently transparently reported & exhibit sufficient analytic reproducibility): see https://osf.io/pxe5h/
Finally, I disagree with your statement (“The same cannot be said about Darwin’s theory of emotion”) because his theory might still have merit, but has not **yet** been tested correctly (e.g., testing the facial feedback hypothesis using a very sophisticated camouflaged highly-repeated within-person design).
1. How do you define sufficiently similar?
2. The core definition of replication is that the replication study produces the same result. If the result of the first study is statistical significance, this implies that the criterion for the replication study is also statistical significance.
3. There may be other ways to do replicability analysis and to define replicability, but that does not mean that my use of the term is wrong, just different from your approach.
4. Finally, it is a statistical fact that the probability of replicating an original finding is a function of power. Of course, power is not the only factor because replications are never exact, but power is by definition the probability of obtaining a significant result in the original study and in the replication study.
What these analyses show is that all studies had very low power to detect a facial feedback effect and publication bias inflated the impression of the replicability of the effect.
It would be interesting to see what conclusions we can draw from your analysis of pen-in-mouth studies.
When you first posted this on facebook, you asked me if I thought some papers were missing. I was in the midst of heavy teaching, so I didn’t have much time to go through things. I did note that a paper by Niedenthal (where I am also a co-author) was missing, but in that paper the “pen in the mouth” is cast as blocking mimicry rather than as a method for inducing emotions via facial feedback. (It was originally thought to do that, but it didn’t work, so, well, some late hypothesizing – but it was a long time ago I guess).
I didn’t think that the “blocking of mimicry” was the focus of this collection, and when I finally had time to look closer, I also ignored papers that attempt to block mimicry rather than enhance it.
The list I used was Strack’s compilation of papers that have investigated facial feedback, which he shared along with his commentary to the failed RRR. This is also the list I started using to delve deeper into facial feedback (and which I didn’t get very far with).
I compared his list where the pen-manipulation was used, with the list here, and although there are overlaps, there are several papers on his list that aren’t here (and also, several papers here that weren’t on his list). – I copy them in here, in order of publication year.
2007 Havas, D. A., Glenberg, A. M., & Rinck, M. (2007). Emotion simulation during language comprehension. Psychonomic Bulletin & Review, 14(3), 436-441.
2008 Stel, M., van den Heuvel, C., & Smeets, R. C. (2008). Facial feedback mechanisms in autistic spectrum disorders. Journal of autism and developmental disorders, 38(7), 1250-1258.
2009 Ashton-James, C., Maddux, W. W., Galinsky, A. D., & Chartrand, T. L. Feeling Badly Makes Us More Who We Are: Negative Affect Strengthens Culturally Consistent Self-Construals.
2009 Topolinski, S., & Strack, F. (2009). The architecture of intuition: Fluency and affect determine intuitive judgments of semantic and visual coherence and judgments of grammaticality in artificial grammar learning. Journal of Experimental Psychology: General, 138(1), 39.
2010 Blaesi, S., & Wilson, M. (2010). The mirror reflects both ways: Action influences perception of others. Brain and cognition, 72(2), 306-309.
2013 Fernández-Abascal, E. G., & Díaz, M. D. M. (2013). Affective induction and creative thinking. Creativity Research Journal, 25(2), 213-221.
2013 Topolinski, S., & Deutsch, R. (2013). Phasic affective modulation of semantic priming. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39(2), 414.
2014 Bilewicz, M., & Kogan, A. (2014). Embodying imagined contact: Facial feedback moderates the intergroup consequences of mental simulation. British Journal of Social Psychology, 53(2), 387-395.
2014 Chang, J., Zhang, M., Hitchman, G., Qiu, J., & Liu, Y. (2014). When you smile, you become happy: Evidence from resting state task-based fMRI. Biological psychology, 103, 100-106.
2015 Lobmaier, J. S., & Fischer, M. H. (2015). Facial feedback affects perceived intensity but not quality of emotional expressions. Brain sciences, 5(3), 357-368.
2015 Sel, A., Calvo-Merino, B., Tuettenberg, S., & Forster, B. (2015). When you smile, the world smiles at you: ERP evidence for self-expression effects on face processing. Social cognitive and affective neuroscience, 10(10), 1316-1322.
I have done a rough coding of the papers, entering them into the R-index work-sheet. It is very rough (considering that I have spent no more than 3 interrupted half-days on it), and should probably be looked over. But, this is roughly what they add:
Success Rate | Obs. Power | Inflation Rate | R-Index
0.6875 | 0.587337408 | 0.100162592 | 0.487174815
It is a rather heterogeneous set of studies – although all of them include the smiling pen manipulation (ok, so one uses chopsticks). Outcomes vary.
It would be nice to update the analysis with these, one way or another.
Thank you very much. I will add these to the dataset and redo the analysis.
Hi Ase, I looked through these studies. All except one were actually in our database. I did not include them in the analysis for a number of reasons; mainly within-subject designs and studies with DVs that are only tentatively related to experienced affect. I can do a sensitivity analysis to see whether results change for this broader set of studies. The conclusion shouldn’t depend on the selection of studies.
I figured it would be something like this. It wasn’t quite straightforward going through them (and, yes I noticed the repeated measures).
But it leads me to another thought that I have had for a while, since I started looking through these last year, and that is that there is possibly a need to chart the methods and measures (and the varieties of methods and measures), because, as they say, the devil is in the details. It isn’t enough to extract effect sizes, because how we manipulate and measure matters.
I really like Malte Elson’s effort with the flexible measures site because it really clarifies the varieties of manipulations and measures.