How to minimize mistakes

Good coding practices

Felix Schönbrodt

felix.schoenbrodt@psy.lmu.de

Ludwig-Maximilians-Universität München

Caroline Zygar-Hoffmann

caroline.zygar@psy.lmu.de

Ludwig-Maximilians-Universität München

2024-04-27

The Ideal? Reality?

Surrounded from a flood of fake news, science is one of (the last?) sources of credible information.
Trust in science: Scientists are impartial, meticulous, and check their results rigorously in peer review before publication
Although error-freeness cannot be guaranteed, science provides the most reliable source of information, and if errors happen, they soon get detected and corrected.

Replication/credibility crisis
Number of retractions rises exponentially
Pressure to „publish or perish“ leads to hurried manuscript, less error checking
Reviewer overload leads to superficial reviews
statcheck (Nuijten et al., 2016): 50% of psychology papers contain at least one inconsistent statistic

Mistakes lurk everywhere

Errors in data collection software/scripts
Error in manual data transcription
Analyzing the wrong data set (e.g., an old version, a filter has been unknowingly applied)
Coding errors
- Wrong group assignment (control/ experimental group)
- NAs coded as 99?
Faulty analysis software; version changes in R packages
Mistyping numbers when copying them from R to manuscript
Send the wrong file to the journal submission system
Asymmetry: Mistakes tend to go in the preferred direction (Gould, 1996), because we check more vigorously when results (unexpectedly) go into the wrong direction.
- See also „garden of forking paths“ (Gelman & Loken, 2013): A lot of p-hacking is unintentional

Mistakes lurk everywhere

Examples: Miscoding Missing Values

Jasso (1985) ASR: coital frequency increases with wife’s age
- Kahn/Udry (1986): 4 observations had value 88; Jasso treated these as valid values; treating these as missing, the effect of wife’s age becomes non-significant
Herring (2009) ASR: diversity increases firm revenue
- Stojmenovska/Bol/Leopold (2017): 206 firms had values 88,888,888,888; Herring treated these as valid values; treating these as missing, the effect of diversity becomes non-significant
Munsch (2015) ASR: marital infidelity is at 10 %
- Munsch (2018): 246 missings were miscoded as “infidelity” (Stata .-problem); with correct coding marital infidelity is at 6 %

How to prevent coding errors?

Unit tests / sanity checks

Always look at the descriptive statistics (min, max, NAs, mean) of every variable (also transformed/computed variables)
- Know the scales of your variables: Is the mean plausible? Is the minimum and maximum value theoretically possible and plausible?
- Does a scale value from multiple items have only discrete values?
- Do z-standardized variables really have mean=0 and SD=1?
Plot all variables (scatterplot, histograms)

Unit tests / sanity checks

The summarytools package makes this really easy:

library(summarytools)
view(dfSummary(data))

alternatively: codebook package by Ruben Arslan (also has an RStudio plugin). Has some nice convenience functions, e.g. when importing SPSS or other data files.

Technical reproducibility

When you rely on random numbers (e.g., in simulation studies): Set a seed for reproducibility
- Take care when doing parallel computing, this sometimes requires special treatment of seeds
Document exact versions of all packages/ external programs at each completed stage of data analysis
- e.g., in R: Save the sessionInfo() of the analytical pipeline in an accompanying file when you submit a paper, and for every revision
Save a snapshot of the current version/state of the statistical software
- e.g., in R: checkpoint package, packrat package, renv package
The safest way: Make a docker container which contains the full computational environment (including OS)

Coding style

Meaningful variable names, meaningful directory structure
Never copy and paste code
Never write a coding block longer than your screen
Literate Programming (Knuth): Computer code is for humans, not just for computers.

File organization

(The following guidelines apply to many, but not all possible types of projects)
All files that are necessary for reproducing the results should be bundled in one directory
- Including: primary unprocessed data files, scripts for data preprocessing, cached intermediate data objects, scripts for hypothesis testing, generated result plots, …
- Goal: You zip the directory, send it to somebody, and the person can reproduce the full analytical pipeline.
Use a consistent, self-explanatory subdirectory structure
Number script files in execution order (see next slide)

Directory organization

Subdirectory organization

(Again, this is one possible organization scheme)
/raw_data: Contains all raw primary data files, as you received them (or exported them from the experimental software), without any manual preprocessing. Primary data files are sacrosanct – set them to read-only, never change anything in them. Any transformation, data exclusions, or error corrections must be done in scripts in order to be reproducible.
/processed_data: Stores intermediate data objects. For example, after you did all your outlier exclusions, data transformations, etc., store a file „data_cleaned.csv“ in /cache. If you refer to this data file in subsequent scripts, you do not have to run the preprocessing script every time. All files in /processed_data can be safely deleted, as they can be reconstructed by running all script files in the correct order.
/doc: Documents (such as PDFs, manuals, etc.) related to the project.
/plots: Store result plots that you created in your scripts.
/export: If you export (sub)data sets or summaries for dissemination, store them here.
/archive: Old scripts and other files which are not used for your final analytical pipeline, but which you want to keep for some reasons.

Subdirectory organization

The README file

Nowadays, it is typically in markdown format (i.e., README.md) as it is rendered in Github and other repositories
Located at the top level of the project directory, it is the central entry point for other users: What do they need to get started? How can you make their life easier?
Consider to add the following information:
- Project title, authors, date, contact information
- Instructions on how to cite the project (or a link to an existing preprint or publication)
- A brief description of the project
- An overview of the directory structure
- Instructions on how to reproduce the results
- A license statement

Variable name conventions 1

snake_case, camelCase, PascalCase, dot.style, sTUdLy_cAPs?
- Do not use dots in variables (clashes with some functions)
- Some consensus that snake_case is the best choice, but respect language specific coding conventions
- Be consistent!
Prefer short variable names (less typing while coding). But:
understandability >> brevity

average_heart_rate
averageHeartRate
AverageHeartRate
average.heart.rate
aVErAgeHeARTraTe

average_heart_rate
avg_heart_rate
avg_HR
aHR
var001
x

Variable name conventions 2

Before: Inconsistent mixture of naming styles

After: Consistent naming style

Code commenting 1

At the start of each script file: Author, license, purpose of the file.
Load add-on packages all at once at the beginning of the file
Rule of thumb: At least 1 comment per 3 lines of code
Separate chunks of code with comment lines
Use English variable names and comments from the beginning (you don‘t want to translate everything before dissemination)

Code commenting 2

Link code comments to paper structure
- „Table 3: Descriptives of…“
- „Hypothesis 2: Does metafilin increase […]?“
Bonus: Link code comments to IDs of the preregistration
If the file gets too long (e.g., > 500 lines), split into multiple files

Collaboration

Pair Programming
- Pair programming is an agile software development technique in which two programmers work together at one workstation. One, the driver, writes code while the other, the observer or navigator, reviews each line of code as it is typed in. The two programmers switch roles frequently. (Wikipedia)
Golden route: Independent implementation

References

Bishop, D. V. M. (2018). Fallibility in Science: Responding to Errors in the Work of Oneself and Others. Advances in Methods and Practices in Psychological Science, 1(3), 432–438. http://doi.org/10.1177/2515245918776632
Rouder, J., Haaf, J. M., & Snyder, H. K. (2018, March 25). Minimizing Mistakes In Psychological Science. https://doi.org/10.31234/osf.io/gxcy5
Martin, R. (2008). Clean Code: A Handbook of Agile Software Craftsmanship (1st ed.). Upper Saddle River, NJ: Prentice Hall.
Karthik Ram: How To Make Your Data Analysis Notebooks More Reproducible
- https://github.com/karthik/rstudio2019
- https://inundata.org/talks/rstd19/#/

How to minimize mistakes

The Ideal? Reality?

Mistakes lurk everywhere

Mistakes lurk everywhere

Examples: Miscoding Missing Values

How to prevent coding errors?

Unit tests / sanity checks

Unit tests / sanity checks

Technical reproducibility

Coding style

File organization

Directory organization

Subdirectory organization

Subdirectory organization

The README file

Variable name conventions 1

Variable name conventions 2

Code commenting 1

Code commenting 2

Collaboration

References

End

Contact