Good code practice

Now you write your own code: how to write it so you can still trust it in six months

Published

June 6, 2026

You have built your cohort (Phase 10), assembled your extracts (Phase 11), and are now about to write the code that actually produces your results: descriptive tables, models, sensitivity analyses.

The most important thing in register-based research is that your results can be reproduced by a reviewer, a colleague or yourself in six months. That places demands on how you organise and write your code. The habits you adopt now will save you hours later.

Tip

In short: the habits that matter most are one script per step, running each script top to bottom (never across), meaningful object names, and comments that explain the why rather than the what. The rest is polish.

Tip

You do not have to do everything at once. Take a couple of these tips at a time and let them become habits. The most important thing is consistency: pick one style and stick to it.


1. Structure your scripts logically

One script per step. Put each step of the analysis in its own .R file, named with a number that tells you the order to run them in:

01_build_cohort.R       # build the cohort (pnr + index date)
02_extract_outcomes.R   # extract outcomes
03_extract_covariates.R # extract covariates
04_data_management.R    # assemble, clean, derive variables
05_descriptive.R        # descriptive analyses (Table 1)
06_analysis.R           # main models
07_sensitivity.R        # sensitivity analyses

The script numbers tell anyone who sees the folder the order to run them in. Use subfolders as the project grows, e.g. R/, output/, datasets/.
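If you want the run order to be executable rather than just readable, a small driver script is one option. The sketch below is an assumption, not part of the project described here: the file name run_all.R and the helper run_scripts_in_order() are hypothetical. It lists the numbered .R files in a folder, sorts them by name, and sources each one in turn.

```r
# run_all.R (hypothetical): source every numbered script in order.
run_scripts_in_order <- function(script_dir = "R") {
  # Pick up files such as 01_build_cohort.R, 02_extract_outcomes.R, ...
  scripts <- sort(list.files(
    script_dir,
    pattern    = "^[0-9]{2}_.*\\.R$",
    full.names = TRUE
  ))

  for (script in scripts) {
    message("Running ", basename(script))
    # Each script is evaluated in its own environment, so objects it
    # creates are not visible to the next script; anything that must be
    # handed over has to be saved to disk and read back in
    source(script, local = new.env(parent = globalenv()))
  }

  # Return the names of the scripts that were run, in run order
  invisible(basename(scripts))
}
```

Calling run_scripts_in_order("R") from a fresh session then reruns the whole project in the numbered order, and it stops at the first script that throws an error.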

A predictable order within each script. Whatever the script does, the same frame recurs: a header, load packages, import data, then the actual work, and finally save the result. Only the middle part changes from script to script:

# ==================================================
# Project:   Dementia and surgery (DARTER 708421)
# Author:    Your Name
# Date:      2026-06-05
# Purpose:   Main analysis: Cox model for dementia
# ==================================================

# 1. Load packages ----
library(tidyverse)
library(survival)

# 2. Import data ----
analysis_data <- readRDS("datasets/analysis_data.rds")

# 3. The actual work ----
# ... varies from script to script, e.g. Table 1, Cox model, sensitivity analyses ...

# 4. Save output ----
saveRDS(cox_model, "output/cox_model.rds")

The frame (header, packages, data, save) recurs in every script; only section 3, the actual work, changes. The header at the top tells you in five seconds what the script does, who wrote it and what it needs. More on good section headings and comments in section 5.

Warning

Avoid jumping back and forth between cleaning, modelling and plotting. A script should read and run from top to bottom. If your code jumps around, it becomes impossible to follow, even for yourself.


2. Run scripts top to bottom, never across

A script should run from line 1 to the end without interruption and give the same result every time. Avoid running two lines from one file, jumping to another and back.

Important

Never run code manually across scripts. If your result depends on you having run line 14 in script 02 before line 7 in script 03, it is not reproducible. Instead, do a saveRDS() at the end of script 02 and a readRDS() at the start of script 03.

# End of 03_extract_covariates.R: save the result
saveRDS(covariates, "datasets/covariates.rds")

# Start of 04_data_management.R: load it again
covariates <- readRDS("datasets/covariates.rds")
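To see why an .rds file works well as the handover between scripts, the sketch below writes a small data frame and reads it back. The data and the temporary file path are made up for illustration. The point is that column classes such as Date and factor come back unchanged, which a plain-text format such as CSV does not store.

```r
# Illustrative data only: a tiny covariate table with a Date and a factor
covariates <- data.frame(
  pnr        = c(101, 102, 103),
  index_date = as.Date(c("2020-01-15", "2020-03-02", "2021-07-30")),
  smoking    = factor(
    c("never", "current", "former"),
    levels = c("never", "former", "current")
  )
)

# A temporary file stands in for "datasets/covariates.rds"
rds_path <- tempfile(fileext = ".rds")
saveRDS(covariates, rds_path)

# This is what the next script would do on its first lines
covariates_reloaded <- readRDS(rds_path)

# TRUE when values, classes and factor levels all survived the round trip
same_object <- identical(covariates, covariates_reloaded)
```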

When you think you are done: restart R (Session → Restart R) and run the whole script from line 1 again. Does it run clean? Then it is reproducible.
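The same check can be scripted. The sketch below is one possible approach, not something this guide prescribes: run_in_fresh_session() is a hypothetical helper that starts a new R process with Rscript, so nothing in your current session can influence the run. It assumes Rscript sits in the bin folder of your R installation. Note that --vanilla also skips your .Rprofile and .Renviron, so drop that flag if your setup relies on them to find packages.

```r
# Hypothetical helper: run one script in a brand-new R process.
run_in_fresh_session <- function(script) {
  rscript <- file.path(R.home("bin"), "Rscript")

  # --vanilla: no saved workspace is restored and no startup files are read
  exit_status <- system2(rscript, c("--vanilla", shQuote(script)))

  # A non-zero exit status means the script stopped with an error
  if (exit_status != 0) {
    stop("Script did not run clean in a fresh session: ", script)
  }
  invisible(TRUE)
}
```

For example, run_in_fresh_session("06_analysis.R") only succeeds if the script finds everything it needs on disk, which is exactly the property you want before sharing it.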


3. Use meaningful object names

The name should describe what the object contains.

# Bad: what are a and b?
a <- read_csv("data.csv")
b <- lm(bmi ~ age, data = a)
# Good: the name speaks for itself
participant_data <- read_csv("data.csv")
bmi_model        <- lm(bmi ~ age, data = participant_data)

In six months you will not remember what a and b were. participant_data and bmi_model explain themselves.


4. Use snake_case consistently

Consistent naming makes code far easier to read. Pick snake_case (lowercase with underscores) and stick to it:

# Good: snake_case
body_mass_index
participant_age
sweetener_intake
# Avoid mixing styles
BodyMassIndex     # PascalCase
bodyMassIndex     # camelCase
BMI_Data          # mixed

What matters is not which style, but that you are consistent.
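Imported data often arrives with column names in mixed styles. As an illustration only, the hypothetical helper to_snake_case() below converts such names with two regular expressions in base R. It is a sketch with known limits: it does not split runs of capitals (so "BMIData" becomes "bmidata") and it does not trim leading or trailing underscores.

```r
# Hypothetical helper: convert mixed-style names to snake_case
to_snake_case <- function(names_in) {
  # Insert an underscore between a lowercase letter or digit and a capital
  names_out <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", names_in)
  # Collapse any run of other characters (spaces, dots, dashes) to one underscore
  names_out <- gsub("[^A-Za-z0-9]+", "_", names_out)
  tolower(names_out)
}

mixed_names <- c("BodyMassIndex", "bodyMassIndex", "BMI_Data", "participant.age")
clean_names <- to_snake_case(mixed_names)
# "body_mass_index" "body_mass_index" "bmi_data" "participant_age"
```

On a data frame you would apply it as names(my_data) <- to_snake_case(names(my_data)), where my_data is whatever table you have just imported.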


5. Headings and comments make the code readable

Headings and comments are what make a script navigable for a reviewer, a colleague or yourself in six months. Three things to get into the habit of:

  • A short description at the top of each script: what it does, what it needs as input, and what it produces (the header from section 1).
  • Section headings in .R scripts with Ctrl+Shift+R, which inserts a line like # Import data ----. They appear in the outline panel at the top right of the editor and in the dropdown menu at the bottom left; click to jump straight to a section. (In Quarto documents you use markdown headings with ## instead.)
  • A comment on each substantial line of code, but explain why, not what.

Comment the “why”, not the “what”. The code already shows what happens. A good comment explains why: the decision behind it.

# Bad: the comment just repeats the code
# Calculate BMI
data$bmi <- data$weight / data$height^2
# Better: the comment explains the decision
# BMI used as an adjustment variable in the primary models
data$bmi <- data$weight / data$height^2

Explain choices, assumptions and sources, not the obvious. It takes five minutes to write a good comment now and an hour to understand the code again in three months.


6. Avoid hard-coded “magic numbers”

A “magic number” is a value in the middle of your code whose meaning is unclear. Give it a name instead:

# Bad: why 18? what if the cutoff changes?
data <- data %>%
  filter(age >= 18)
# Better: the cutoff has a name and is defined in one place
adult_age_cutoff <- 18

data <- data %>%
  filter(age >= adult_age_cutoff)

This is especially important when a cutoff is used in several places or may change: then you only fix it once.
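To make the "fix it once" point concrete, the sketch below uses one named cutoff in three places: a filter, a derived variable and a table caption. The data frame and the object names are invented for illustration, and base R is used so the snippet runs on its own.

```r
# Defined once, near the top of the script
adult_age_cutoff <- 18

# Illustrative data only
participants <- data.frame(
  pnr = 1:5,
  age = c(12, 17, 18, 40, 65)
)

# Use 1: restrict to adults
adults <- participants[participants$age >= adult_age_cutoff, ]

# Use 2: derive an age group with the same cutoff
participants$age_group <- ifelse(
  participants$age >= adult_age_cutoff, "adult", "minor"
)

# Use 3: the caption follows the cutoff automatically
table_caption <- paste0(
  "Participants aged ", adult_age_cutoff,
  " or older (n = ", nrow(adults), ")"
)
```

If the protocol later changes the cutoff to 16, you edit one line and the filter, the grouping and the caption all stay consistent.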


7. Keep lines reasonably short

Long lines are hard to read and to see changes in. Break long calls up so each argument stands out clearly:

# Bad: one long line, hard to take in
model <- glm(outcome ~ age + sex + bmi + smoking + education + income + physical_activity + energy_intake + alcohol, data = data, family = binomial())
# Better: one argument/block per line
model <- glm(
  outcome ~ age + sex + bmi +
    smoking + education +
    income + physical_activity +
    energy_intake + alcohol,
  data = data,
  family = binomial()
)

8. Write functions for repeated tasks

If you copy the same code more than a few times (e.g. a Table 1 for each exposure group), write a function. Functions reduce errors: fix something once, and it is fixed everywhere.

library(tableone)   # provides CreateTableOne()

# Bad: the same call repeated, easy to make a mistake in one of them
table1_a <- CreateTableOne(vars = baseline_vars, strata = "operated", data = data_a)
table1_b <- CreateTableOne(vars = baseline_vars, strata = "operated", data = data_b)
# ... repeated 10 times ...
# Better: write the function once
create_table1 <- function(data, exposure) {
  CreateTableOne(
    vars   = baseline_vars,
    strata = exposure,
    data   = data
  )
}

table1_a <- create_table1(data_a, "operated")
table1_b <- create_table1(data_b, "operated")
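Once the work lives in a function, you can also drop the repeated calls by keeping the datasets in a named list and applying the function to each element. The sketch below illustrates the pattern in base R. To stay self-contained it uses a hypothetical summarise_age() in place of create_table1(), and two invented data frames.

```r
# Hypothetical stand-in for create_table1(): mean age per exposure level
summarise_age <- function(data, exposure) {
  tapply(data$age, data[[exposure]], mean)
}

# Illustrative data only: one data frame per cohort, kept in a named list
cohorts <- list(
  a = data.frame(age = c(60, 70, 80, 90), operated = c(0, 0, 1, 1)),
  b = data.frame(age = c(50, 60, 70, 80), operated = c(0, 1, 0, 1))
)

# One call covers every cohort; the result keeps the list names
age_by_cohort <- lapply(cohorts, summarise_age, exposure = "operated")

age_by_cohort$a   # operated = 0: 65, operated = 1: 85
```

Adding an eleventh dataset then means adding one element to the list, not one more copied line.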

How to write your own function

A function has three parts: a name, some arguments (the input in the parentheses), and a body (the code between { }). Whatever the last line produces is what the function returns.

name <- function(argument1, argument2) {
  # body: do something with the arguments
  result <- argument1 + argument2
  result        # last line = what is returned
}

A concrete example: a function that computes age at a given date.

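# Function: age in whole years at a given date
# Note: dividing by 365.25 is an approximation; it can come out one year
# short when index_date falls exactly on a birthday
compute_age <- function(birth_date, index_date) {
  as.numeric(difftime(index_date, birth_date, units = "days")) %/% 365.25
}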

# Use it
compute_age(as.Date("1950-03-01"), as.Date("2020-01-01"))   # 69
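Counting days and dividing by 365.25 is a close approximation, but it can be one year short when the index date falls exactly on a birthday. If you need exact calendar ages, one option is to compare year, month and day directly. The sketch below is an illustration in base R. compute_age_exact() is a hypothetical name, and treating a 29 February birthday as reached on 1 March in non-leap years is an assumption you may want to change.

```r
# Hypothetical alternative: exact age in whole calendar years
compute_age_exact <- function(birth_date, index_date) {
  birth <- as.POSIXlt(birth_date)
  index <- as.POSIXlt(index_date)

  # Difference in calendar years, ignoring month and day for now
  age <- index$year - birth$year

  # Subtract one year if the birthday has not yet occurred in the index year
  before_birthday <- (index$mon < birth$mon) |
    (index$mon == birth$mon & index$mday < birth$mday)

  age - as.integer(before_birthday)
}

# Edge case: the index date is exactly the first birthday (365 days later)
birth_date     <- as.Date("2001-01-01")
first_birthday <- as.Date("2002-01-01")

approx_age <- as.numeric(
  difftime(first_birthday, birth_date, units = "days")
) %/% 365.25                                                  # 0
exact_age  <- compute_age_exact(birth_date, first_birthday)   # 1
```

Both approaches are vectorised, so either can be used on whole columns, for example on a birth date and an index date column in your cohort.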

You can read more about functions (arguments, default values and when they pay off) in Phase 15: Guide to functions.


9. Fail early: check your data before the analysis

It is cheaper to catch an error straight away than to discover it in a finished result. Insert explicit checks of your assumptions:

# Stop immediately if an assumption is broken
stopifnot(
  all(data$age >= 0),
  all(data$age <= 120)
)
# Alternative with clearer error messages (the assertthat package)
assertthat::assert_that(
  nrow(data) > 0,
  msg = "data is empty: check your extract"
)

If the check fails, the script stops immediately instead of carrying a hidden error forward into your models.
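Two checks that often pay off in register data are uniqueness of the person identifier and a row count that does not grow after a join. The sketch below shows both in base R, assuming R 4.0.0 or later, where stopifnot() uses argument names as error messages. The data frames are invented, and covariates deliberately contains a duplicated pnr so that the second check fails. The tryCatch() is there only so the example can show the message; in a real script you would let the error stop the run.

```r
# Illustrative data only; covariates has pnr 2 twice on purpose
cohort     <- data.frame(pnr = c(1, 2, 3), age = c(34, 51, 78))
covariates <- data.frame(
  pnr    = c(1, 2, 2, 3),
  smoker = c(TRUE, FALSE, TRUE, FALSE)
)

# Check 1: passes, because every pnr appears once and no age is missing
stopifnot(
  "pnr must be unique in cohort" = anyDuplicated(cohort$pnr) == 0,
  "age must not be missing"      = !anyNA(cohort$age)
)

# Left join: keeps every cohort row, but duplicates in covariates add rows
merged <- merge(cohort, covariates, by = "pnr", all.x = TRUE)

# Check 2: fails here, because the join turned 3 rows into 4
join_check <- tryCatch(
  stopifnot(
    "join added rows: covariates has duplicate pnr" =
      nrow(merged) == nrow(cohort)
  ),
  error = function(e) conditionMessage(e)
)
```

Here join_check holds the error message, which names the broken assumption and points you straight at the extract that needs fixing.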


10. One object = one purpose

Avoid overwriting the same object again and again. It makes debugging hard, because data means something different depending on how far you have got:

# Bad: the same name overwritten all the way down
data <- read_csv("data.csv")
data <- filter(data, age >= 18)
data <- mutate(data, bmi = weight / height^2)
data <- left_join(data, covariates, by = "pnr")
# Better: each step has its own name
raw_data       <- read_csv("data.csv")

clean_data     <- raw_data %>%
  filter(age >= 18)

analysis_data  <- clean_data %>%
  mutate(bmi = weight / height^2) %>%
  left_join(covariates, by = "pnr")

Now you can inspect each intermediate step (raw_data, clean_data, analysis_data) separately, which is invaluable when something looks wrong.
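One practical payoff of separate names is that you can count rows at every step and see where observations disappear. The sketch below uses invented data and base R so it runs on its own. step_counts is a hypothetical name for the resulting summary.

```r
# Illustrative data only
raw_data <- data.frame(
  pnr    = 1:6,
  age    = c(15, 22, 37, 17, 64, 80),
  weight = c(55, 70, 82, 60, 90, 68),          # kg
  height = c(1.60, 1.75, 1.80, 1.70, 1.85, 1.65) # m
)

# Step 1: keep adults only
clean_data <- raw_data[raw_data$age >= 18, ]

# Step 2: derive BMI; this should not change the number of rows
analysis_data <- transform(clean_data, bmi = weight / height^2)

# Because every step still exists, the row counts can be compared directly
step_counts <- c(
  raw      = nrow(raw_data),
  clean    = nrow(clean_data),
  analysis = nrow(analysis_data)
)
step_counts   # raw = 6, clean = 4, analysis = 4
```

With a single overwritten data object, the earlier counts would already be gone by the time you noticed that something looked wrong.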

