How to minimize mistakes

Good coding practices

Felix Schönbrodt

Ludwig-Maximilians-Universität München

Caroline Zygar-Hoffmann

Ludwig-Maximilians-Universität München

2024-04-27

The Ideal? Reality?

  • The ideal:
    • Surrounded by a flood of fake news, science is one of (the last?) sources of credible information.
    • Trust in science: Scientists are impartial, meticulous, and check their results rigorously in peer review before publication.
    • Although freedom from error cannot be guaranteed, science provides the most reliable source of information, and if errors happen, they soon get detected and corrected.
  • Reality:
    • Replication/credibility crisis
    • Number of retractions rises exponentially
    • Pressure to “publish or perish” leads to hurried manuscripts and less error checking
    • Reviewer overload leads to superficial reviews
    • statcheck (Nuijten et al., 2016): 50% of psychology papers contain at least one inconsistent statistic

Mistakes lurk everywhere

  • Errors in data collection software/scripts
  • Error in manual data transcription
  • Analyzing the wrong data set (e.g., an old version, a filter has been unknowingly applied)
  • Coding errors
    • Wrong group assignment (control/experimental group)
    • NAs coded as 99?
  • Faulty analysis software; version changes in R packages
  • Mistyping numbers when copying them from R to manuscript
  • Sending the wrong file to the journal submission system
  • Asymmetry: Mistakes tend to go in the preferred direction (Gould, 1996), because we check more vigorously when results (unexpectedly) go in the wrong direction.
    • See also “garden of forking paths” (Gelman & Loken, 2013): A lot of p-hacking is unintentional

Examples: Miscoding Missing Values

  • Jasso (1985) ASR: coital frequency increases with wife’s age
    • Kahn/Udry (1986): 4 observations had value 88; Jasso treated these as valid values; treating these as missing, the effect of wife’s age becomes non-significant
  • Herring (2009) ASR: diversity increases firm revenue
    • Stojmenovska/Bol/Leopold (2017): 206 firms had values 88,888,888,888; Herring treated these as valid values; treating these as missing, the effect of diversity becomes non-significant
  • Munsch (2015) ASR: marital infidelity is at 10 %
    • Munsch (2018): 246 missings were miscoded as “infidelity” (Stata .-problem); with correct coding marital infidelity is at 6 %
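
A minimal sketch in R of how such miscodings can be caught before analysis. The data frame `survey`, the missing-value codes 88 and 99, and the plausible range of 0 to 30 are all hypothetical assumptions chosen for illustration; they are not taken from the studies cited above.

```r
# Hypothetical example data: 88 and 99 are missing-value codes, not real answers
survey <- data.frame(
  id        = 1:6,
  frequency = c(4, 7, 88, 2, 99, 5)
)

# Treating the codes as valid values distorts the mean
mean_naive <- mean(survey$frequency)

# Recode the documented missing-value codes to NA before any analysis
missing_codes <- c(88, 99)
survey$frequency[survey$frequency %in% missing_codes] <- NA

# Mean of the valid answers only
mean_clean <- mean(survey$frequency, na.rm = TRUE)

# Sanity check: no remaining value lies outside the assumed possible range
stopifnot(all(survey$frequency >= 0 & survey$frequency <= 30, na.rm = TRUE))
```

Comparing `mean_naive` with `mean_clean` shows how strongly a handful of miscoded observations can move a summary statistic.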

How to prevent coding errors?

Unit tests / sanity checks

  • Always look at the descriptive statistics (min, max, NAs, mean) of every variable (also transformed/computed variables)
    • Know the scales of your variables: Is the mean plausible? Are the minimum and maximum values theoretically possible and plausible?
    • Does a scale value from multiple items have only discrete values?
    • Do z-standardized variables really have mean=0 and SD=1?
  • Plot all variables (scatterplot, histograms)
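
The checks listed above could be scripted roughly as follows. This is a sketch under assumptions: the simulated data frame `items` (three items on a 1 to 5 scale) and all variable names are hypothetical.

```r
# Hypothetical data: three items answered on a 1-5 scale
set.seed(1)
items <- data.frame(
  item1 = sample(1:5, 50, replace = TRUE),
  item2 = sample(1:5, 50, replace = TRUE),
  item3 = sample(1:5, 50, replace = TRUE)
)

# Descriptive statistics and number of NAs for every variable
item_summary <- summary(items)
na_counts <- colSums(is.na(items))

# Range check: all answers must lie within the theoretically possible 1-5
stopifnot(all(sapply(items, min) >= 1), all(sapply(items, max) <= 5))

# Scale score as the mean of the three items
scale_score <- rowMeans(items)

# With three integer items, 3 * mean must be a whole number (discrete values only)
stopifnot(all(abs(scale_score * 3 - round(scale_score * 3)) < 1e-8))

# z-standardized scores should really have mean 0 and SD 1
z_score <- as.numeric(scale(scale_score))
stopifnot(abs(mean(z_score)) < 1e-8, abs(sd(z_score) - 1) < 1e-8)
```

Because the checks are written with `stopifnot()`, the script halts as soon as an assumption about the data is violated.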

The summarytools package makes this really easy:

library(summarytools)
view(dfSummary(data))

Alternatively, the codebook package by Ruben Arslan (which also has an RStudio plugin) offers some nice convenience functions, e.g., when importing SPSS or other data files.

Technical reproducibility

  1. When you rely on random numbers (e.g., in simulation studies): Set a seed for reproducibility
    • Take care with parallel computing: it sometimes requires special treatment of seeds
  2. Document exact versions of all packages/external programs at each completed stage of data analysis
    • e.g., in R: Save the sessionInfo() of the analytical pipeline in an accompanying file when you submit a paper, and for every revision
  3. Save a snapshot of the current version/state of the statistical software
    • e.g., in R: checkpoint package, packrat package, renv package (the successor to packrat)
  4. The safest way: Build a Docker container which contains the full computational environment (including the OS)
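
Steps 1 and 2 above can be sketched in a few lines of R. The seed value and the file name `sessionInfo.txt` are arbitrary assumptions, and the file is written to a temporary directory here only to keep the sketch free of side effects; in a real project it would sit next to the analysis scripts.

```r
# 1. Set a seed so that random numbers can be reproduced
set.seed(20240427)  # arbitrary value; any fixed integer works
draw_1 <- rnorm(5)

set.seed(20240427)  # same seed again ...
draw_2 <- rnorm(5)  # ... yields exactly the same numbers

stopifnot(identical(draw_1, draw_2))

# 2. Save the exact R and package versions in an accompanying file
info_file <- file.path(tempdir(), "sessionInfo.txt")  # hypothetical location
writeLines(capture.output(sessionInfo()), info_file)
```

For parallel code, `clusterSetRNGStream()` from the parallel package can give each worker its own reproducible random-number stream.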

Coding style

  • Meaningful variable names, meaningful directory structure
  • Never copy and paste code
  • Never write a code block longer than your screen
  • Literate Programming (Knuth): Computer code is for humans, not just for computers.
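
One way to avoid copy and paste is to wrap the repeated steps in a function. The following is only a sketch: the data frame `df`, the missing-value code 9, and the helper `score_scale()` are hypothetical names introduced for illustration.

```r
# Hypothetical data: two questionnaires with two items each; 9 codes a missing answer
df <- data.frame(
  a1 = c(1, 5, 9), a2 = c(2, 4, 3),
  b1 = c(9, 2, 3), b2 = c(4, 4, 5)
)

# One function instead of one copy-pasted block per scale:
# recode the missing-value code to NA, then average the items per person
score_scale <- function(data, items, missing_code = 9) {
  x <- data[, items, drop = FALSE]
  x[x == missing_code] <- NA
  rowMeans(x, na.rm = TRUE)
}

# The same, tested logic is reused for every scale
df$scale_a <- score_scale(df, c("a1", "a2"))
df$scale_b <- score_scale(df, c("b1", "b2"))
```

If the scoring rule changes, it has to be changed in one place only, which removes a common source of inconsistencies between copies.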

File organization


  • (The following guidelines apply to many, but not all possible types of projects)
  • All files that are necessary for reproducing the results should be bundled in one directory
    • Including: primary unprocessed data files, scripts for data preprocessing, cached intermediate data objects, scripts for hypothesis testing, generated result plots, …
    • Goal: You zip the directory, send it to somebody, and the person can reproduce the full analytical pipeline.
  • Use a consistent, self-explanatory subdirectory structure
  • Number script files in execution order (see next slide)

Directory organization

Subdirectory organization

  • (Again, this is one possible organization scheme)
  • /raw_data: Contains all raw primary data files, as you received them (or exported them from the experimental software), without any manual preprocessing. Primary data files are sacrosanct – set them to read-only, never change anything in them. Any transformation, data exclusions, or error corrections must be done in scripts in order to be reproducible.
  • /processed_data: Stores intermediate data objects. For example, after you did all your outlier exclusions, data transformations, etc., store a file “data_cleaned.csv” in /processed_data. If you refer to this data file in subsequent scripts, you do not have to run the preprocessing script every time. All files in /processed_data can be safely deleted, as they can be reconstructed by running all script files in the correct order.
  • /doc: Documents (such as PDFs, manuals, etc.) related to the project.
  • /plots: Store result plots that you created in your scripts.
  • /export: If you export (sub)data sets or summaries for dissemination, store them here.
  • /archive: Old scripts and other files which are not used for your final analytical pipeline, but which you want to keep for some reason.
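
The directory scheme above could be set up from R roughly like this. The project name `my_project`, the file `study1.csv`, and the exclusion rule are hypothetical; the project is created in a temporary directory only so that the sketch runs anywhere. Note that `Sys.chmod()` has limited effect on Windows.

```r
# Create the project skeleton (hypothetical project name, temporary location)
project_dir <- file.path(tempdir(), "my_project")
subdirs <- c("raw_data", "processed_data", "doc", "plots", "export", "archive")
for (d in subdirs) {
  dir.create(file.path(project_dir, d), recursive = TRUE, showWarnings = FALSE)
}

# Stand-in for a raw data file as received from the experimental software
raw_file <- file.path(project_dir, "raw_data", "study1.csv")
write.csv(data.frame(id = 1:3, score = c(4, 88, 2)), raw_file, row.names = FALSE)

# Raw data are sacrosanct: set the file to read-only
Sys.chmod(raw_file, mode = "0444")

# All exclusions happen in a script; the result goes to /processed_data
raw <- read.csv(raw_file)
cleaned <- raw[raw$score != 88, ]  # hypothetical rule: 88 is a missing-value code
cleaned_file <- file.path(project_dir, "processed_data", "data_cleaned.csv")
write.csv(cleaned, cleaned_file, row.names = FALSE)
```

Since `data_cleaned.csv` is produced by code alone, it can be deleted and regenerated at any time.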

The README file

  • Nowadays, it is typically in Markdown format (i.e., README.md), as it is rendered on GitHub and in other repositories
  • Located at the top level of the project directory, it is the central entry point for other users: What do they need to get started? How can you make their life easier?
  • Consider adding the following information:
    • Project title, authors, date, contact information
    • Instructions on how to cite the project (or a link to an existing preprint or publication)
    • A brief description of the project
    • An overview of the directory structure
    • Instructions on how to reproduce the results
    • A license statement

Variable name conventions 1

  • snake_case, camelCase, PascalCase, dot.style, sTUdLy_cAPs?
    • Examples: average_heart_rate, averageHeartRate, AverageHeartRate, average.heart.rate, aVErAgeHeARTraTe
    • Do not use dots in variables (clashes with some functions)
    • Some consensus that snake_case is the best choice, but respect language-specific coding conventions
    • Be consistent!
  • Prefer short variable names (less typing while coding). But: understandability >> brevity
    • From descriptive to cryptic: average_heart_rate, avg_heart_rate, avg_HR, aHR, var001, x

Variable name conventions 2

  • Before: Inconsistent mixture of naming styles

  • After: Consistent naming style
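
A before/after illustration in R, as a sketch only: the data frame `df` and the helper `to_snake_case()` are hypothetical and cover just the simple cases shown here (camelCase, PascalCase, dots, and spaces).

```r
# Before: hypothetical data frame with an inconsistent mixture of naming styles
df <- data.frame(AverageHeartRate = 70, body.weight = 80, ageYears = 30, avg_HR = 72)

# Small helper that converts simple names to snake_case
to_snake_case <- function(x) {
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)  # insert "_" at lower-to-upper boundaries
  x <- gsub("[. ]+", "_", x)                    # replace dots and spaces by "_"
  tolower(x)
}

# After: one consistent naming style
names(df) <- to_snake_case(names(df))
```

After the renaming, the columns are called `average_heart_rate`, `body_weight`, `age_years`, and `avg_hr`.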

Code commenting 1


  • At the start of each script file: Author, license, purpose of the file.
  • Load add-on packages all at once at the beginning of the file
  • Rule of thumb: At least 1 comment per 3 lines of code
  • Separate chunks of code with comment lines
  • Use English variable names and comments from the beginning (you don't want to translate everything before dissemination)

Code commenting 2


  • Link code comments to paper structure
    • “Table 3: Descriptives of…”
    • “Hypothesis 2: Does metafilin increase […]?”
  • Bonus: Link code comments to IDs of the preregistration
  • If the file gets too long (e.g., > 500 lines), split into multiple files
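
The commenting guidelines above might translate into a script skeleton like the following. Everything in it is a hypothetical placeholder: the header fields, the data frame `heart`, and the table and hypothesis labels. The stats package stands in for whatever add-on packages a real script would load.

```r
# ------------------------------------------------------------------
# Purpose: Descriptives and hypothesis test for heart rate (example)
# Author:  <your name>
# License: <license of your choice>
# ------------------------------------------------------------------

# Load all packages at once at the beginning of the file
library(stats)  # stand-in; list your add-on packages here

# ---- Load data ---------------------------------------------------
# Hypothetical data; a real script would read a file from /processed_data
heart <- data.frame(
  group      = c("control", "control", "treatment", "treatment"),
  heart_rate = c(72, 75, 68, 66)
)

# ---- Table 1: Descriptives of heart rate by group -----------------
descriptives <- aggregate(heart_rate ~ group, data = heart, FUN = mean)

# ---- Hypothesis 1: Does the treatment lower the heart rate? -------
# Difference in group means (treatment minus control)
mean_diff <- diff(descriptives$heart_rate)
```

Section comments that mirror the table and hypothesis numbers of the paper make it easy to find the code behind each reported result.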

Collaboration

  • Pair Programming
    • Pair programming is an agile software development technique in which two programmers work together at one workstation. One, the driver, writes code while the other, the observer or navigator, reviews each line of code as it is typed in. The two programmers switch roles frequently. (Wikipedia)
  • Gold standard: Independent implementation (a second person codes the analysis separately, and the results are compared)
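
A minimal sketch of how the results of two independent implementations could be compared in R. The data and both implementations are hypothetical; in practice, each analyst would work from the same raw data in a separate script.

```r
# Hypothetical data: three items per person
df <- data.frame(item1 = c(1, 4, 3), item2 = c(2, 5, 3), item3 = c(3, 3, 5))

# Implementation by analyst A: row means via rowMeans()
score_a <- rowMeans(df[, c("item1", "item2", "item3")])

# Implementation by analyst B: explicit sum divided by the number of items
score_b <- (df$item1 + df$item2 + df$item3) / 3

# Compare with a numerical tolerance; any disagreement stops the script
stopifnot(isTRUE(all.equal(unname(score_a), score_b)))
```

`all.equal()` is used instead of `==` so that tiny floating-point differences between the two implementations do not count as disagreement.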

CC-BY-SA 4.0