2026-04-29
This work was originally created by Felix Schönbrodt under a Creative Commons Attribution 4.0 International (CC-BY 4.0) License. The current work by Sarah von Grebmer zu Wolfsthurn, Malika Ihle and Felix Schönbrodt is licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC-BY-SA 4.0) License. It permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
Creator: Von Grebmer zu Wolfsthurn, Sarah (0000-0002-6413-3895)
Reviewer: Schönbrodt, Felix (0000-0002-8282-3910)
Consultant: Ihle, Malika (0000-0002-3242-5981)
Based on your experience so far, how would you currently rate your trust in published scientific findings on a scale from 1 - 5? (1 = not trusting any of the findings, 2 = trusting only some findings, 3 = trusting about half of the findings, 4 = trusting the majority of the findings, 5 = trusting all findings)
Based on your experience so far, which challenges to research trustworthiness do you see? Name a few keywords.
Display word cloud of answers.
What is your level of familiarity with Open Research practices in general (e.g., basic concepts, terminology, or tools)?
I am unfamiliar with the concept of Open Research practices.
I have heard of them but I would not know how they apply to my work.
I have basic understanding and experience with Open Research practices in my own work/research/studies.
I am very familiar with Open Research practices and routinely apply them in my daily work/research/study routines.
What do we see in the results?
Note - this is not a test!
These terms probably have not been covered in your courses yet. We just want to see what your current understanding of these terms is.
By the end of this session, learners will be able to:
“Society trusts that scientific research results are an honest and accurate reflection of a researcher’s work.” (Committee on Science, Engineering and Public Policy 2009: ix)
“The public must be able to trust the science and scientific process informing public policy decisions.” (Obama 2009)
Resnik, D. B. (2011). Scientific research and the public trust. Science and engineering ethics, 17(3), 399-409.
Probe your prejudices: Which profession do you trust most to tell the truth? And which the least?
Data from 1,015 UK adults, 2024
Sources: Veracity Index 2024, Ipsos; Seyd, B. (2025). What is trust (in science and scientists) and is it in crisis?. Current Opinion in Psychology, 67. https://doi.org/10.1016/j.copsyc.2025.102201
“Wie sehr vertrauen Sie in Wissenschaft und Forschung?” (“How much do you trust science and research?”)
Source: Wissenschaft im Dialog/Verian: www.sciencebarometer.com. Minimum of 1,000 respondents in each survey wave.
Seyd, B. (2025). What is trust (in science and scientists) and is it in crisis?. Current Opinion in Psychology, 67. https://doi.org/10.1016/j.copsyc.2025.102201
Inspired by Vazire, S. (2017). Quality Uncertainty Erodes Trust in Science. Collabra: Psychology, 3(1), 1. https://doi.org/10.1525/collabra.74. Images from ChatGPT.
“When there are asymmetries in the information that the seller and the buyer have, the buyers cannot be certain about the quality of the products”
“by keeping vital information private – the raw data, the original design and analysis plan, the exploratory analyses that were conducted along the way to the final analysis – authors are hiding valuable information and preventing consumers of their manuscript from being certain about its quality. Just like sellers of used cars keep things […] from the buyers.”
“For many non-scientists, learning that transparency is not the norm in science comes as a surprise. […] it must seem obvious that scientists should be held to a higher standard than used car salespeople.”
Quotes from Vazire, S. (2017).
Nullius in verba (“take nobody’s word for it”): the motto of the Royal Society, the oldest scientific society in the world
Royal Society logo by MostEpic - Own work, CC BY-SA 4.0, Link; Car dealer by ChatGPT.
Definitions of Replicability and Replication are from the iRISE glossary v2 by Voelkl et al. (2025). Note that no consensus about the meaning of these words exists - they are used differently in different disciplines.
Note: The terminology around replication varies between fields, and the boundaries between these types are blurry.
A prototypical direct replication:
A realistic example:
The qualifiers “direct”, “conceptual”, “close”, etc. denote prototypical configurations of this matrix.
TODO: Add new slide with results. Remove the callout box.
What did they find?
A large portion of replications produced weaker evidence for the original findings, despite using materials provided by the original authors.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
Begley, C. G., & Ellis, L. M. (2012); Camerer et al. (2016); Chang & Li (2015); Cova et al. (2018); Open Science Collaboration (2015); Social Science: Combined sample of systematically sampled projects (RPP, SSRP, EERP); Prinz, F., Schlange, T., & Asadullah, K. (2011); Protzko et al. (2023)
Tyner et al. (2026):
| Discipline | Replication attempts (successful / total) | Percentage successful |
|---|---|---|
| Business | 17 / 36 | 47.2% |
| Economics | 10.2 / 24 | 42.5% |
| Education | 8.2 / 13 | 63.1% |
| Political science | 7.8 / 15 | 52.0% |
| Psychology | 28.4 / 58 | 49.0% |
| Sociology | 9.2 / 18 | 51.1% |
Tyner, S. K., Abatayo, A. L., Dayley, M., et al. (2026). Investigating the replicability of the social and behavioural sciences. Nature. https://doi.org/10.1038/s41586-025-10078-y
Adapted from Dr. Malika Ihle: https://osf.io/u3znx
Given that a study finds \(p < .05\): What is the probability that a real effect exists in the population ➙ \(p(H_1|D)\)?
Nuzzo, R. (2014). Statistical errors. Nature. Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1(3), 140216. https://doi.org/10.1098/rsos.140216
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci, 14(5), 365–376. doi:10.1038/nrn3475
“When a study is underpowered it most likely provides only weak inference. Even before a single participant is assessed, it is highly unlikely that an underpowered study provides an informative result.”
“Consequently, research unlikely to produce diagnostic outcomes is inefficient and can even be considered unethical. Why sacrifice people’s time, animals’ lives, and societies’ resources on an experiment that is highly unlikely to be informative?”
Schönbrodt, F. D. & Wagenmakers, E.-J. (2018). Bayes Factor Design Analysis: Planning for compelling evidence. Psychonomic Bulletin & Review, 25, 128-142. doi:10.3758/s13423-017-1230-y. Open Access
Given that a study finds \(p < .05\): What is the probability that a real effect exists in the population ➙ \(p(H_1|D)\)?
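A minimal worked example (with assumed, illustrative numbers rather than values from any specific study): by Bayes’ theorem, the probability that a “significant” result reflects a real effect depends on the prior plausibility of \(H_1\), the statistical power \(1-\beta\), and the significance threshold \(\alpha\).

\[
p(H_1 \mid p < .05) = \frac{(1-\beta)\, p(H_1)}{(1-\beta)\, p(H_1) + \alpha\,\bigl(1 - p(H_1)\bigr)} = \frac{0.5 \times 0.1}{0.5 \times 0.1 + 0.05 \times 0.9} \approx 0.53
\]

With these assumed inputs (prior \(p(H_1) = .10\), power \(= .50\), \(\alpha = .05\)), only about half of all “significant” findings would reflect a real effect, which is the intuition behind Colquhoun’s (2014) false discovery rate argument.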
| Criterion (actual, not desired, relevance in professorship hiring committees) | Rank |
|---|---|
| Number of peer-reviewed publications | 1 |
| Fit of research profile to the hiring department | 2 |
| Quality of research talks | 3 |
| Number of publications | 4 |
| Volume of acquired third party funding | 5 |
| Number of first authorships | 6 |
Abele-Brehm, A. E., & Bühner, M. (2016). Wer soll die Professur bekommen? Psychologische Rundschau, 67(4), 250–261. http://doi.org/10.1026/0033-3042/a000335
Publication Pressure
“Researchers are not rewarded for being right, but rather for publishing a lot.”
Consequence: “Publish or Perish” culture
Researchers try to publish as much as they can and to outperform their peers (Schmidt et al., 2021).
Nelson, Simmons, & Simonsohn (2012); Nosek, Spies, & Motyl (2012); Munafò (2016)
Fanelli, D. (2010). “Positive” results increase down the hierarchy of the sciences. PLOS ONE, 5, e10068. https://doi.org/10.1371/journal.pone.0010068
“If my study works, I can publish it. If it does not, let’s hide it in the drawer.”
Song, F., Hooper, L., & Loke, Y. K. (2013). Publication bias: what is it? How do we measure it? How do we avoid it?. Open Access Journal of Clinical Trials, 71-81. https://doi.org/10.2147/OAJCT.S34419
Turner, E. H., Matthews, A. M., Linardatos, E., Tell, R. A., & Rosenthal, R. (2008). Selective publication of antidepressant trials and its influence on apparent efficacy. New England Journal of Medicine, 358(3), 252-260. https://doi.org/10.1056/NEJMsa065779
De Vries, Y. A., Roest, A. M., de Jonge, P., Cuijpers, P., Munafò, M. R., & Bastiaansen, J. A. (2018). The cumulative effect of reporting and citation biases on the apparent efficacy of treatments: the case of depression. Psychological Medicine, 48(15), 2453–2455. https://doi.org/10.1017/S0033291718001873
Task: Match the bias to its description.
| Bias | Description |
|---|---|
| 1. Confirmation bias | A. Non-publication of negative outcomes within a paper, or switching non-significant primary outcomes with significant secondary ones. |
| 2. Spin | B. Studies with positive results receive more citations than negative studies. |
| 3. Study publication bias | C. After learning the outcome, believing “I knew it all along.” |
| 4. Hindsight bias | D. Non-publication of an entire study (e.g., trials with null results never submitted). |
| 5. Outcome reporting bias | E. Tendency to seek or interpret information in ways that confirm existing beliefs. |
| 6. Citation bias | F. Authors conclude the treatment is effective despite non-significant primary outcomes. |
Based on what you learnt so far, how would you now rate your trust in published scientific findings on a scale from 1 - 5? (1 = not trusting any of the findings, 2 = trusting only some findings, 3 = trusting about half of the findings, 4 = trusting the majority of the findings, 5 = trusting all findings)
1
2
3
4
5
What does replicability in research mean?
Obtaining the same results using the original dataset and code
Obtaining consistent results when a new study collects new data using the same methods
Publishing results in more than one journal
Repeating the statistical analysis multiple times
What is publication bias in research?
The tendency for journals to publish studies only from well-known researchers
The requirement that all published studies must be peer-reviewed
The practice of publishing the same study in multiple journals
The tendency for studies with positive results to be published more often than studies with non-significant or negative results
What do we see in the results?
Warning
Do not try the following at home.
O’Boyle, E. H., Banks, G. C., & Gonzalez-Mulé, E. (2017). The Chrysalis Effect: How ugly initial results metamorphosize into beautiful articles. Journal of Management, 43(2), 376–399. https://doi.org/10.1177/0149206314527133
Research question: Do aggressive primes trigger aggressive behavior?
“A second study in Turner, Layton, and Simons (1975) collects a larger sample of men and women driving vehicles of all years. The design was a 2 (Rifle: present, absent) × 2 (Bumper Sticker: ‘Vengeance’, absent) design with 200 subjects.”
➙ presumably, no effect … (yet! Do not give up so easily)
Cited from Joe Hilgard’s excellent blog post on the Weapon Priming Effect; Turner et al. (1975). Naturalistic studies of aggressive behavior: aggressive stimuli, victim visibility, and horn honking. JPSP, 31(6), 1098–1107.
“They divide this further by driver’s sex and by a median split on vehicle year. They find that the Rifle/Vengeance condition increased honking relative to the other three, but only among newer-vehicle male drivers, F(1, 129) = 4.03, p = .047. But then they report that the Rifle/Vengeance condition decreased honking among older-vehicle male drivers, F(1, 129) = 5.23, p = .024! No results were found among female drivers.”
Cited from Joe Hilgard’s excellent blog post on the Weapon Priming Effect; Turner et al. (1975). Naturalistic studies of aggressive behavior: aggressive stimuli, victim visibility, and horn honking. JPSP, 31(6), 1098–1107.
Armitage, P., McPherson, C. K., & Rowe, B. C. (1969). Repeated significance tests on accumulating data. Journal of the Royal Statistical Society. Series A (General), 132, 235–244.
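Below is a minimal simulation sketch (Python, with assumed illustrative parameters; not part of the original slides or of Armitage et al.) of the problem they quantified: testing accumulating data after every batch of participants and stopping at the first \(p < .05\) inflates the false positive rate well beyond the nominal 5%, even when the null hypothesis is true.

```python
# Sketch: how "optional stopping" (peeking at accumulating data) inflates
# the Type I error rate when the null hypothesis is true.
# All parameters (batch size, maximum n, number of simulations) are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2026)

def significant_with_peeking(n_max=100, step=10, alpha=0.05):
    """Test after every batch of `step` participants; stop at the first p < alpha."""
    group_a = rng.normal(0, 1, n_max)  # both groups come from the same population,
    group_b = rng.normal(0, 1, n_max)  # i.e., there is no true effect
    for n in range(step, n_max + 1, step):
        p_value = stats.ttest_ind(group_a[:n], group_b[:n]).pvalue
        if p_value < alpha:
            return True  # the researcher stops and reports a "significant" result
    return False

n_sim = 2000
false_positive_rate = sum(significant_with_peeking() for _ in range(n_sim)) / n_sim
print(f"False positive rate with repeated interim tests: {false_positive_rate:.1%}")
# A single test at the fixed final sample size would keep this near 5%;
# with ten interim looks it typically rises well above 10%.
```

This matches the pattern Armitage et al. report analytically: the more interim looks at the data, the higher the overall Type I error rate.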
Warning
The so-called “hacks” on the past few slides represent questionable research practices. Do not try at home.
http://shinyapps.org/apps/p-hacker/
Scenario: You are reviewing a study examining whether drinking white tea improves short-term memory, where the researchers report:
Hypothesis: White tea improves memory test scores.
Sample size: 28 participants per group (tea-drinkers vs. water-only-drinkers)
Results:
Conclusion: The results show that white tea reliably improves cognitive performance.
Which potential p-hacking strategies are at play here?
Come on. Surely not..? 😳
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological science, 23(5), 524-532.
Gopalakrishna, G., ter Riet, G., Vink, G., Stoop, I., Wicherts, J. M., & Bouter, L. M. (2022). Prevalence of questionable research practices, research misconduct and their potential explanatory factors: A survey among academic researchers in The Netherlands. PLOS ONE, 17(2), e0263023. https://doi.org/10.1371/journal.pone.0263023
Prof. Brian Wansink:
With your new background knowledge on p-hacking, read the blog post “The Grad Student Who Never Said ‘No’” by Brian Wansink.
(Read the blog post first, then some of the comments below it, and then Addendum I and II. That gives you the historical unfolding of the case.)
Important
The most important point of the story: The original authors shared their raw data, which made it possible to correct the honest mistake!
Eubank, N. (2016). Lessons from a decade of replications at the Quarterly Journal of Political Science. PS: Political Science & Politics, 49(2), 273-276. https://doi.org/10.1017/S1049096516000196
Nuijten et al. (2016). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 48(4), 1205-1226. https://doi.org/10.3758/s13428-015-0664-2. Bruns et al. (2023). Statistical reporting errors in economics [Preprint]. MetaArXiv. https://doi.org/10.31222/osf.io/mbx62
Bias + (maybe unintentional) p-hacking + human (honest) mistakes = untrustworthy research findings?
Note
Emsley, R. (2023). ChatGPT: These are not hallucinations - they’re fabrications and falsifications. Schizophrenia, 9(1), 52. Lautrup et al. (2023). Heart-to-heart with ChatGPT: the impact of patients consulting AI for cardiovascular health advice. Open Heart, 10(2). Shekar et al. (2025). People Overtrust AI-Generated Medical Advice despite Low Accuracy. NEJM AI, 2(6), AIoa2300015.
This image was taken from the Geograph project collection. The copyright on this image is owned by Chris Martin and is licensed for reuse under the Creative Commons Attribution-ShareAlike 2.0 license.
Open Research
aka
A scientific framework for the 21st century
One-minute paper: Imagine you had to explain the current challenges in research that you heard about today to a friend. Write down what you would say to them.
What are you taking away from today?
Remember: There are solutions!
Research is not “doomed” - on the contrary. More on this in the next session!
See you next class :)
Decide whether each scenario in the following slides is an example of reproducibility or replicability.
A computational neuroscientist reruns a published fMRI analysis using the original dataset and Python scripts to verify the reported brain activation patterns.
An environmental scientist repeats a field experiment on soil nutrient levels using the same sampling protocol at a different site.
A linguist reanalyzes a corpus of historical texts using the same annotation guidelines and code to verify reported patterns of syntactic structures.
A psychology lab replicates a social behavior experiment using new participants from a different cultural background.
FORRT Replication Database (https://forrt-replications.shinyapps.io/fred_explorer/)