Good code practice
Now you write your own code: how to write it so you can trust it yourself in six months
You have built your cohort (Phase 10), assembled your extracts (Phase 11), and are now about to write the code that actually produces your results: descriptive tables, models, sensitivity analyses.
The most important thing in register-based research is that your results can be reproduced by a reviewer, a colleague, or yourself in six months. That places demands on how you organise and write your code. The habits you adopt now will save you hours later.
In short: the habits that matter most are one script per step, running each script top to bottom (never across), giving objects meaningful names, and commenting the why rather than the what. The rest is polish.
You do not have to do everything at once. Take a couple of these tips at a time and let them become habits. The most important thing is consistency: pick one style and stick to it.
1. Structure your scripts logically
One script per step. Put each step of the analysis in its own .R file, named with a number that tells you the order to run them in:
01_build_cohort.R # build the cohort (pnr + index date)
02_extract_outcomes.R # extract outcomes
03_extract_covariates.R # extract covariates
04_data_management.R # assemble, clean, derive variables
05_descriptive.R # descriptive analyses (Table 1)
06_analysis.R # main models
07_sensitivity.R # sensitivity analyses
The script numbers tell anyone who sees the folder the order to run them in. Use subfolders as the project grows, e.g. R/, output/, datasets/.
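Because the numeric prefixes encode the run order, a small driver script can run the whole pipeline in sequence. The sketch below is one possible way to do that, not something this guide prescribes: the file name run_all.R and the helpers list_numbered_scripts() and run_all() are assumptions made for illustration.

```r
# run_all.R (hypothetical file name): run every numbered script in order.

# Return the numbered .R scripts in a folder, sorted by their prefix.
list_numbered_scripts <- function(dir = ".") {
  scripts <- list.files(dir, pattern = "^[0-9]{2}_.*\\.R$", full.names = TRUE)
  sort(scripts)
}

# Run each script in its own environment so objects do not leak between steps;
# scripts then have to hand data over via saveRDS()/readRDS().
run_all <- function(dir = ".") {
  for (script in list_numbered_scripts(dir)) {
    message("Running ", basename(script))
    source(script, local = new.env())
  }
  invisible(TRUE)
}
```

Calling run_all("R") would then run 01_build_cohort.R through 07_sensitivity.R in order, assuming the scripts live in an R/ subfolder.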
A predictable order within each script. Whatever the script does, the same frame recurs: a header, load packages, import data, then the actual work, and finally save the result. Only the middle part changes from script to script:
# ==================================================
# Project: Dementia and surgery (DARTER 708421)
# Author: Your Name
# Date: 2026-06-05
# Purpose: Main analysis: Cox model for dementia
# ==================================================
# 1. Load packages ----
library(tidyverse)
library(survival)
# 2. Import data ----
analysis_data <- readRDS("datasets/analysis_data.rds")
# 3. The actual work ----
# ... varies from script to script; here e.g. Table 1, Cox model, sensitivity analyses ...
# 4. Save output ----
saveRDS(cox_model, "output/cox_model.rds")
The four steps (header, packages, data, save) recur in every script; only step 3 (the actual work) changes. The header at the top tells you in five seconds what the script does, who wrote it and what it needs. More on good section headings and comments in section 5.
Avoid jumping back and forth between cleaning, modelling and plotting. A script should read and run from top to bottom. If your code jumps around, it becomes impossible to follow β even for yourself.
2. Run scripts top to bottom, never across
A script should run from line 1 to the end without interruption and give the same result every time. Avoid running two lines from one file, jumping to another and back.
Never run code manually across scripts. If your result depends on you having run line 14 in script 02 before line 7 in script 03, it is not reproducible. Instead, do a saveRDS() at the end of script 02 and a readRDS() at the start of script 03.
# End of 03_extract_covariates.R: save the result
saveRDS(covariates, "datasets/covariates.rds")
# Start of 04_data_management.R: load it again
covariates <- readRDS("datasets/covariates.rds")
When you think you are done: restart R (Session > Restart R) and run the whole script from line 1 again. Does it run clean? Then it is reproducible.
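The hand-off between scripts can also be made to fail loudly when the earlier script has not been run yet. The sketch below assumes a hypothetical helper, read_step_output(), and uses a temporary folder and toy data in place of the real datasets/ folder.

```r
# Hypothetical helper: read the output of an earlier script, or stop with a
# clear message if that script has not been run yet.
read_step_output <- function(path) {
  if (!file.exists(path)) {
    stop("Missing input file: ", path, " (run the earlier script first)")
  }
  readRDS(path)
}

# Toy stand-in for the datasets/ folder and the covariates object.
datasets_dir <- tempfile("datasets")
dir.create(datasets_dir)
covariates <- data.frame(pnr = c("001", "002"), smoking = c(TRUE, FALSE))

# End of the earlier script: save the result.
saveRDS(covariates, file.path(datasets_dir, "covariates.rds"))

# Start of the next script: load it again and confirm the round trip
# preserved the object.
covariates_loaded <- read_step_output(file.path(datasets_dir, "covariates.rds"))
stopifnot(identical(covariates, covariates_loaded))
```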
3. Use meaningful object names
The name should describe what the object contains.
# Bad: what are a and b?
a <- read_csv("data.csv")
b <- lm(bmi ~ age, data = a)
# Good: the name speaks for itself
participant_data <- read_csv("data.csv")
bmi_model <- lm(bmi ~ age, data = participant_data)
In six months you will not remember what a and b were. participant_data and bmi_model explain themselves.
4. Use snake_case consistently
Consistent naming makes code far easier to read. Pick snake_case (lowercase with underscores) and stick to it:
# Good: snake_case
body_mass_index
participant_age
sweetener_intake
# Avoid mixing styles
BodyMassIndex # PascalCase
bodyMassIndex # camelCase
BMI_Data # mixed
What matters most is not which style you pick, but that you are consistent.
5. Headings and comments make the code readable
Headings and comments are what make a script navigable for a reviewer, a colleague or yourself in six months. Three things to get into the habit of:
- A short description at the top of each script: what it does, what it needs as input, and what it produces (the header from section 1).
- Section headings in .R scripts with CTRL+SHIFT+R, which inserts a line like # Import data ----. They appear in the outline panel in the top right of the editor and in the dropdown menu at the bottom left; click to jump straight to a section. (In Quarto documents you use markdown headings with ## instead.)
- A comment on each substantial line of code, but explain why, not what.
Comment the "why", not the "what". The code already shows what happens. A good comment explains why: the decision behind it.
# Bad: the comment just repeats the code
# Calculate BMI
data$bmi <- data$weight / data$height^2
# Better: the comment explains the decision
# BMI used as an adjustment variable in the primary models
data$bmi <- data$weight / data$height^2
Explain choices, assumptions and sources, not the obvious. It takes five minutes to write a good comment now and an hour to understand the code again in three months.
6. Avoid hard-coded "magic numbers"
A "magic number" is a value in the middle of your code whose meaning is unclear. Give it a name instead:
# Bad: why 18? what if the cutoff changes?
data <- data %>%
  filter(age >= 18)
# Better: the cutoff has a name and is defined in one place
adult_age_cutoff <- 18
data <- data %>%
  filter(age >= adult_age_cutoff)
This is especially important when a cutoff is used in several places or may change: then you only fix it once.
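When several cutoffs are in play, one option is to collect them at the top of the script so that each is defined in a single place. A minimal sketch in base R with toy data; max_followup_years and the column names are invented for illustration.

```r
# Analysis constants, defined once at the top of the script.
adult_age_cutoff <- 18
max_followup_years <- 10  # hypothetical second constant

# Toy data standing in for the real cohort.
cohort <- data.frame(
  pnr = c("001", "002", "003"),
  age = c(17, 18, 45),
  followup_years = c(3, 12, 8)
)

# The same named cutoff is reused wherever it is needed.
adults <- cohort[cohort$age >= adult_age_cutoff, ]
adults$followup_years <- pmin(adults$followup_years, max_followup_years)

# Report the exclusion using the constant, not a repeated literal.
n_excluded <- sum(cohort$age < adult_age_cutoff)
message(n_excluded, " participant(s) excluded: age below ", adult_age_cutoff)
```

If the cutoff changes, only the first line needs editing, and the filter and the message stay in agreement.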
7. Keep lines reasonably short
Long lines are hard to read and to see changes in. Break long calls up so each argument stands out clearly:
# Bad: one long line, hard to take in
model <- glm(outcome ~ age + sex + bmi + smoking + education + income + physical_activity + energy_intake + alcohol, data = data, family = binomial())
# Better: one argument/block per line
model <- glm(
  outcome ~ age + sex + bmi +
    smoking + education +
    income + physical_activity +
    energy_intake + alcohol,
  data = data,
  family = binomial()
)
8. Write functions for repeated tasks
If you copy the same code more than a few times (e.g. a Table 1 for each exposure group), write a function. Functions reduce errors: fix something once, and it is fixed everywhere.
# Bad: the same call repeated, easy to make a mistake in one of them
table1_a <- CreateTableOne(vars = baseline_vars, strata = "operated", data = data_a)
table1_b <- CreateTableOne(vars = baseline_vars, strata = "operated", data = data_b)
# ... repeated 10 times ...
# Better: write the function once
create_table1 <- function(data, exposure) {
  CreateTableOne(
    vars = baseline_vars,
    strata = exposure,
    data = data
  )
}
table1_a <- create_table1(data_a, "operated")
table1_b <- create_table1(data_b, "operated")
How to write your own function
A function has three parts: a name, some arguments (the input in the parentheses), and a body (the code between { }). Whatever the last line produces is what the function returns.
name <- function(argument1, argument2) {
# body: do something with the arguments
result <- argument1 + argument2
result # last line = what is returned
}
A concrete example is a function that computes age at a given date:
# Function: age in whole years at a given date
# Note: 365.25 days per year is an approximation; the result can be
# one year low on or just after a birthday.
compute_age <- function(birth_date, index_date) {
  as.numeric(difftime(index_date, birth_date, units = "days")) %/% 365.25
}
# Use it
compute_age(as.Date("1950-03-01"), as.Date("2020-01-01")) # 69
You can read more about functions (arguments, default values and when they pay off) in Phase 15: Guide to functions.
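If exact completed years matter, for example when age feeds into a cutoff, one alternative to dividing by 365.25 is to compare the calendar components of the two dates directly. The sketch below is such an alternative in base R, not this guide's own method; the name compute_age_exact is made up here.

```r
# Hypothetical alternative: age in completed years, compared by calendar date.
compute_age_exact <- function(birth_date, index_date) {
  birth <- as.POSIXlt(birth_date)
  index <- as.POSIXlt(index_date)

  # Has the birthday already occurred in the index year?
  had_birthday <- (index$mon > birth$mon) |
    (index$mon == birth$mon & index$mday >= birth$mday)

  # Year difference, minus one if the birthday is still to come.
  (index$year - birth$year) - ifelse(had_birthday, 0, 1)
}

compute_age_exact(as.Date("1950-03-01"), as.Date("2020-01-01"))  # 69
compute_age_exact(as.Date("2001-01-01"), as.Date("2019-01-01"))  # 18 (on the 18th birthday)
```

For the second call, compute_age() returns 17, because the 6574 days between the two dates fall just short of 18 × 365.25 = 6574.5. The function is vectorised, so it also works on whole columns of dates.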
9. Fail early: check your data before the analysis
It is cheaper to catch an error straight away than to discover it in a finished result. Insert explicit checks of your assumptions:
# Stop immediately if an assumption is broken
stopifnot(
  all(data$age >= 0),
  all(data$age <= 120)
)
# Alternative with clearer error messages (the assertthat package)
assertthat::assert_that(
  nrow(data) > 0,
  msg = "data is empty: check your extract"
)
If the check fails, the script stops immediately instead of carrying a hidden error forward into your models.
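Base R's stopifnot() also accepts named conditions (R 4.0.0 and later), where the name becomes the error message. The sketch below applies that to checks that are typical for a cohort table. The toy data, the helper name check_cohort() and the column names pnr and index_date are assumptions for illustration.

```r
# Toy cohort standing in for the real data (column names are assumptions).
cohort <- data.frame(
  pnr = c("001", "002", "003"),
  age = c(34, 58, 71),
  index_date = as.Date(c("2015-02-01", "2016-07-15", "2018-11-30"))
)

# Hypothetical helper: stop with a readable message if an assumption is broken.
check_cohort <- function(data) {
  stopifnot(
    "data is empty: check your extract" = nrow(data) > 0,
    "pnr must be unique: one row per person" = !anyDuplicated(data$pnr),
    "index_date must not be missing" = !anyNA(data$index_date),
    "age must be between 0 and 120" = all(data$age >= 0 & data$age <= 120)
  )
  invisible(data)
}

check_cohort(cohort)  # passes silently

# A duplicated pnr is caught immediately; capture the message to inspect it.
duplicated_cohort <- rbind(cohort, cohort[1, ])
result <- tryCatch(
  check_cohort(duplicated_cohort),
  error = function(e) conditionMessage(e)
)
result
```

A missing age also trips the last check, because all() returns NA rather than TRUE when it meets an NA.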
10. One object = one purpose
Avoid overwriting the same object again and again. It makes debugging hard, because data means something different depending on how far you have got:
# Bad: the same name overwritten all the way down
data <- read_csv("data.csv")
data <- filter(data, age >= 18)
data <- mutate(data, bmi = weight / height^2)
data <- left_join(data, covariates, by = "pnr")
# Better: each step has its own name
raw_data <- read_csv("data.csv")
clean_data <- raw_data %>%
  filter(age >= 18)
analysis_data <- clean_data %>%
  mutate(bmi = weight / height^2) %>%
  left_join(covariates, by = "pnr")
Now you can inspect each intermediate step (raw_data, clean_data, analysis_data) separately, which is invaluable when something looks wrong.
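Separate names also make it easy to compare steps. A common trap is a join that silently adds rows because the right-hand table has more than one row per pnr. A minimal sketch in base R with toy data, where merge() with all.x = TRUE stands in for left_join():

```r
# Toy data standing in for the real objects.
raw_data <- data.frame(
  pnr = c("001", "002", "003", "004"),
  age = c(17, 25, 40, 63)
)
covariates <- data.frame(
  pnr = c("002", "003", "004"),
  smoking = c("never", "current", "former")
)

clean_data <- raw_data[raw_data$age >= 18, ]
analysis_data <- merge(clean_data, covariates, by = "pnr", all.x = TRUE)

# Row counts per step: each drop or gain should be explainable.
step_counts <- c(
  raw_data = nrow(raw_data),
  clean_data = nrow(clean_data),
  analysis_data = nrow(analysis_data)
)
print(step_counts)

# A left join on a unique key must not change the number of rows.
stopifnot(nrow(analysis_data) == nrow(clean_data))
```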
See also
- Phase 11: Assemble your extracts (joins and pivots)
- Phase 15: Guide to functions (functions in depth)
- Phase 5: Extracting data step by step (the fundamental pattern)
- Inspiration for formatting code: Stack Overflow's formatting guide