Packages, algorithms and help
Which tool for which task — and what is available on DST
This is a reference hub: which tool solves which task, whether it is available on DST, and where to find more help. The guide takes you to an analysis-ready dataset and points you towards the right tools for the actual analysis — but does not teach the statistical methods. For that you need a statistician, a course, or the packages’ own documentation.
1. Validated algorithms
Published, peer-reviewed methods you can use directly to construct covariates. Code for NMI and the SEPLINE algorithm is under development.
| Algorithm | What it does | Article |
|---|---|---|
| Open Source Diabetes Classifier (OSDC) | Classifies everyone in Denmark as T1D, T2D or no diabetes | doi:10.2147/CLEP.S407019 |
| Nordic Multimorbidity Index (NMI) | Weighted comorbidity score (50 predictors) | doi:10.2147/CLEP.S353398 |
| SEPLINE — Socioeconomic Position in Epidemiological Research | National guideline for socioeconomic variables | doi:10.2147/CLEP.S520772 |
- OSDC — define a diabetes population, type and onset date → Phase 14 — OSDC
- NMI — adjust for comorbidity in a model or describe comorbidity burden → Phase 14 — NMI
- SEPLINE — education, income, employment as covariates → Phase 13 — Socioeconomic variables
Charlson as an alternative to NMI: both heaven::charlsonIndex() and EpiForsk (see below) calculate Charlson comorbidity score from LPR diagnoses — both are on DST.
2. Existing functions for data management and extraction
heaven — Danish register data toolkit
(pre-installed on DST; not on CRAN — github.com/tagteam/heaven)
Functions built specifically for Danish register data by Thomas Gerds (University of Copenhagen, Biostatistics): Lexis-splitting for time-varying covariates, drug exposure windows from LMDB, risk-set matching and Charlson score.
What heaven can do — and when to use it
Lexis-splitting — split follow-up time across multiple time axes when a covariate changes over time (age, calendar period, time since exposure).
library(heaven)
df_split <- lexisTwo(
indat = kohort,
invars = c("pnr", "inn", "out", "event"), # id, start date, end date, outcome
splitvars = list(age = c(40, 50, 60, 70), # age boundaries
cal = c(2010, 2015, 2020)) # calendar years
)
# Returns: one data frame with MORE rows per person — one per time interval.
# New columns are added for the current bands (age group, calendar period etc.).
# Ready for a Cox model with time-varying covariates.Drug exposure from LMDB — calculate exposure periods from prescription data (e.g. “was the person on metformin at index date?”).
exp_table <- medicinMacro(
indat = lmdb_data, id = "pnr", atc = "atc", eksd = "eksd",
ddds = "ddds", maxdepot = 3
)
# Returns: one row per exposure period per person with start and end date
# for the period. Use subsequent joins to determine exposure at index date.Charlson comorbidity score from LPR diagnoses:
charlson_data <- charlsonIndex(indat = lpr_diag_data, id = "pnr",
code = "c_diag", code_type = "icd10")
# Returns: one row per person with total Charlson score and binary indicators
# per Charlson disease group. The function matches ICD-10 codes to categories itself.Risk-set matching — match cases to controls at index date:
matched <- riskSetMatch(ptid = "pnr", event = "event",
terms = c("koen", "birth_year"), dat = kohort, ratio = 5)
# Returns: matched dataset with matchID linking cases and controls.heaven is confirmed available on DST — use library(heaven) directly. It is not on CRAN, so outside DST it must be fetched from GitHub.
EpiForsk — register data helper functions from SSI
(CRAN; on DST — github.com/Laksafoss/EpiForsk)
Code-sharing package from the Statens Serum Institut. Particularly useful: flatten_date_intervals() which merges overlapping hospital contacts into continuous periods — necessary before comorbidity lookback. Danish Charlson score included.
Most important functions and when to use them
flatten_date_intervals() — overlapping contacts
A person can have two overlapping admissions in LPR (e.g. a transfer). Before comorbidity lookback, overlapping periods must be merged:
library(EpiForsk)
lpr_flat <- flatten_date_intervals(
data = lpr_contacts, id = "pnr",
start_date = "d_inddto", end_date = "d_uddto"
)
# Returns: a data frame with merged start/end per person —
# overlapping periods are reduced to the total covered period.charlson_score() — Danish-calibrated Charlson comorbidity score from ICD-10 (incl. D-prefix). See the callout below for what it requires.
charlson <- charlson_score(data = lpr_diag_data, id = "pnr", icd_codes = "c_diag")
# Returns: one row per person with Charlson score.What does charlson_score() require? You must extract a lookback period from LPR yourself (typically 5 years before index date) and give the function all diagnoses in that window — A, B and G type. You do not need to filter on specific ICD codes in advance; the function matches the codes to Charlson categories itself. The D-prefix (DG30 etc.) is handled by EpiForsk — it does not need to be stripped first.
# Example: extract lookback and pass all diagnoses
# pnr and d_inddto are in lpr_adm — not lpr_diag
# Step 1: broadest possible time window for the entire cohort
min_lookback <- min(kohort$index_date) - 365.25 * 5 # earliest lookback start
# Step 2: fetch contacts and join to diagnoses
lpr2_lookback <- lpr_adm %>%
filter(pnr %in% !!kohort$pnr,
d_inddto >= !!min_lookback) %>% # broad pre-filter
select(pnr, recnum, d_inddto) %>%
inner_join(
lpr_diag %>%
filter(c_diagtype %in% c("A", "B", "G")) %>% # A/B/G for comorbidity
select(recnum, c_diag),
by = "recnum"
) %>%
collect() %>%
left_join(kohort %>% select(pnr, index_date), by = "pnr") %>%
filter(d_inddto >= index_date - 365.25 * 5, # precise per-person lookback
d_inddto < index_date)
charlson <- charlson_score(data = lpr2_lookback, id = "pnr", icd_codes = "c_diag")dataReporter — automatic data quality report
(CRAN; confirmed on DST — github.com/ekstroem/dataReporter)
Automatically generates an HTML/PDF report of all columns in a dataset: distribution, missing values, outliers and type checks. Useful for validating a new register extraction before starting analysis. Published in Journal of Statistical Software.
library(dataReporter)
makeDataReport(my_extract) # opens report in browserSee also Phase 7 — Inspect your data for manual inspection commands.
fakeregs — synthetic Danish register data
(GitHub only — local use only — github.com/steno-aarhus/fakeregs)
Fictitious but structurally realistic data for nine DST registers. For practising and testing code locally without DST access. Not used on the server. See Phase 6 — First extraction.
fastreg — SAS → parquet
(GitHub only — github.com/dp-next/fastreg)
Structured pipeline for converting an entire SAS workspace to parquet. Only relevant if you are building a parquet workspace from scratch. See Phase 4 — SAS → parquet.
3. Analysis tools
All packages in this section are confirmed on DST (June 2026). The guide does not teach the statistical methods — see package documentation for methodological choices.
survival — the foundation for survival analysis
(CRAN; always available — cran.r-project.org/package=survival)
Provides Surv(time, status), coxph() and survfit() — the fundamental building blocks on which all survival packages below are built. You rarely call {survival} directly for plots and tables, but library(survival) belongs at the top of your analysis scripts.
library(survival)
cox <- coxph(Surv(time, event) ~ age + sex + nmi_score, data = cohort)
summary(cox) # HR, 95% CI, p-values — pass to Publish or gtsummary for formatted outputgtsummary — Table 1 and manuscript tables
(CRAN — cran.r-project.org/package=gtsummary)
Standard for manuscript tables: tbl_summary() for Table 1 stratified by exposure, tbl_regression() for Cox and logistic regression, tbl_survfit() for survival summaries. Published in R Journal 2021.
Example: Table 1
library(gtsummary)
table1 <- cohort %>%
select(group, age, sex, nmi_score, occupation_cat) %>%
tbl_summary(
by = group, # stratify by exposure
statistic = list(
all_continuous() ~ "{mean} ({sd})", # continuous: mean (SD)
all_categorical() ~ "{n} ({p}%)" # categorical: n (%)
),
missing = "no"
) %>%
add_overall() %>%
add_p()Output control: Counts of 0–4 may not be repatriated. Manual rounding: round(x / 5) * 5. Z.gtsummary.addons::round_5_gtsummary() automates this — not on DST, import via data manager.
carpenter — pipe-based table structure
(CRAN — github.com/lwjohnst86/carpenter)
Older alternative to gtsummary with a simpler pipe-based API: start with outline_table(), add rows with add_rows(), finish with build_table(). By Luke Johnston.
library(carpenter)
outline_table(cohort, "group") %>%
add_rows("age", stat_meanSD) %>% # continuous: mean (SD)
add_rows("sex", stat_nPct) %>% # categorical: n (%)
add_rows("nmi_score", stat_medianIQR) %>% # continuous: median (IQR)
build_table()prodlim — Kaplan–Meier and Aalen–Johansen cumulative incidence
(CRAN — github.com/tagteam/prodlim)
Estimates and plots survival curves and cumulative incidence. Hist() handles competing risks with a single status variable: 0 = censored, 1 = primary event, 2 = competing event (e.g. death). Use prodlim for descriptive curves — riskRegression for regression.
Example: KM curve and Aalen–Johansen
library(prodlim)
# Kaplan–Meier:
fit_km <- prodlim(Hist(time, event) ~ group, data = cohort)
plot(fit_km)
# Aalen–Johansen cumulative incidence (competing risks):
# status: 0 = censored, 1 = primary event, 2 = competing event (e.g. death)
fit_aj <- prodlim(Hist(time, status) ~ group, data = cohort)
plot(fit_aj, cause = 1) # cumulative incidence for primary eventKM alone overestimates cumulative incidence in the presence of competing risks. Use Aalen–Johansen (cause = 1) instead.
riskRegression — Cause-Specific Cox, Fine-Gray and absolute risk
(CRAN — github.com/tagteam/riskRegression)
For regression with competing risks. CSC() fits Cause-Specific Cox (one model per cause), FGR() fits Fine-Gray subdistribution hazard. predictRisk() computes absolute risk at specific time points. Score() validates and compares models (Brier score, time-dependent AUC).
Example: CSC model and absolute risk
library(riskRegression)
library(survival)
# Cause-Specific Cox:
csc_model <- CSC(Hist(time, status) ~ age + sex + nmi_score, data = cohort)
summary(csc_model)
# Absolute risk at 1, 3 and 5 years for primary event:
ar <- predictRisk(csc_model, newdata = cohort,
times = c(1, 3, 5), cause = 1)
# ar: matrix — one column per time point, one row per personPublish — regression results ready for manuscript
(CRAN — github.com/tagteam/Publish)
regressionTable() takes a fitted model (Cox, logistic, linear) and returns a table with estimates, 95% CI and p-values. plotConfidence() produces forest plots from the same output.
library(Publish)
library(survival)
cox_model <- coxph(Surv(time, event) ~ age + sex + nmi_score, data = cohort)
regressionTable(cox_model) # HR, 95% CI, p-values
plotConfidence(regressionTable(cox_model)) # forest plotMatchIt — propensity score and exact matching
(CRAN — cran.r-project.org/package=MatchIt)
Matches exposed to unexposed on observed covariates. Supports nearest-neighbour, exact and optimal matching. summary() gives balance statistics (SMD), match.data() returns the matched dataset. Alternative to heaven::riskSetMatch() with better balance diagnostics.
Example: 1:1 nearest-neighbour matching
library(MatchIt)
m <- matchit(exposed ~ age + sex + nmi_score,
data = cohort,
method = "nearest", # nearest-neighbour matching
ratio = 1) # 1:1
# Balance check — SMD < 0.1 is a common rule of thumb:
summary(m, standardize = TRUE)
# Matched dataset:
matched <- match.data(m)pec — prediction error curves
(CRAN — cran.r-project.org/package=pec)
Computes and plots Brier score over time for comparing and validating prediction models. Niche — only relevant for prediction models; riskRegression::Score() usually covers the need.
BuyseTest — Generalised Pairwise Comparisons
(CRAN, confirmed on DST v3.3.3 — github.com/bozenne/BuyseTest)
Implements Generalised Pairwise Comparisons (GPC): compares two groups via win ratio, net benefit and win probability on prioritised endpoints. Niche — relevant when you have a hierarchy of endpoints and want to avoid classical event-time assumptions.
library(BuyseTest)
# One time-to-event endpoint:
result <- BuyseTest(group ~ tte(time, event, threshold = 0),
data = cohort,
method.inference = "u-statistic")
summary(result) # win ratio, net benefit, win probabilitypowerBuyseTest() calculates the required sample size given desired power and expected effect — useful before data collection or grant applications.
kinship2 — pedigrees and kinship matrices
(CRAN — cran.r-project.org/web/packages/kinship2)
Creates pedigree objects from parent–child relationships and computes kinship coefficients (probability of shared allele: 0.5 = parent/child, 0.25 = siblings). Useful for multi-generational family studies. By Terry Therneau (Mayo Clinic).
Example: create pedigree and compute kinship
library(kinship2)
# Create pedigree:
# DST's koen variable (1 = male, 2 = female) matches kinship2's expectation directly
ped <- pedigree(
id = family$pnr,
dadid = family$pnr_father, # 0 or NA if father unknown
momid = family$pnr_mother, # 0 or NA if mother unknown
sex = family$koen # 1 = male, 2 = female
)
# Kinship coefficients:
km <- kinship(ped)
# Visualise family tree:
plot(ped)tame — cluster medication exposure from LMDB
(CRAN — github.com/Laksafoss/tame)
Agglomerative hierarchical clustering of medication data using a custom distance measure combining ATC classification, timing and dose. medic() is the primary function. Relevant for characterising polypharmacy patterns in cohorts. By Anders Laksafoss (SSI).
Example: cluster medication patterns
library(tame)
# Cluster on ATC code alone:
clust <- medic(lmdb_data, id = pnr, atc = atc, k = 5)
# k = desired number of clusters — choose with scree plot or domain knowledge
# With timing information (column range):
clust <- medic(lmdb_data, id = pnr, atc = atc,
timing = trimester_1:trimester_3, # columns defining time windows
k = 5)
summary(clust) # summarise clustersid, atc and timing use bare column names (NSE — no quotes). lmdb_data must have one row per prescription dispensing.
Possible inspiration — not confirmed on DST
famnet (GitHub, experimental) — identifies family relationships from person IDs: given a dataset with PersonID, FatherID and MotherID, fmn_siblings() finds all sibling pairs with relationship type. Relevant for family linkage across generations in Danish register data. By Luke Johnston.
plotsurv (GitHub, not on DST) — publication-ready Kaplan–Meier and Aalen–Johansen competing-risks plots with at-risk table, CI bands and censoring ticks. plotsurv(survfit_object).
Z.gtsummary.addons (GitHub, not on DST) — extends gtsummary with add_SMD() (standardised mean differences for Table 1) and round_5_gtsummary() (rounds counts to the nearest 5 — relevant for DST output control).
4. Where to get more help?
Community and courses — see the prioritised help list and course links in Phase 2 — R: the bare essentials (Stack Overflow, Zheers R Coding Café, DDEA, R for Data Science, Epi R Handbook).
Institutional reference material:
| Source | What you find |
|---|---|
| ctpteam/RDST | “R on DST” manual: project setup, data access, good coding practice on the DST server |
| ctpteam/DST | Guide to LPR3, DST manuals, slides and a heaven PDF |
| steno-aarhus/registers-project-database | SDCA’s own database: variable lists, register descriptions and application templates for the Danish Health Data Authority |
| github.com/tagteam | riskRegression, prodlim, Publish, pec, heaven — Thomas Gerds and Brice Ozenne, KU Biostatistics |