Packages, algorithms and help

Which tool for which task — and what is available on DST

Published

June 6, 2026

This is a reference hub: which tool solves which task, whether it is available on DST, and where to find more help. The guide takes you to an analysis-ready dataset and points you towards the right tools for the actual analysis — but does not teach the statistical methods. For that you need a statistician, a course, or the packages’ own documentation.


1. Validated algorithms

Published, peer-reviewed methods you can use directly to construct covariates. Code for NMI and the SEPLINE algorithm is under development.

Algorithm What it does Article
Open Source Diabetes Classifier (OSDC) Classifies everyone in Denmark as T1D, T2D or no diabetes doi:10.2147/CLEP.S407019
Nordic Multimorbidity Index (NMI) Weighted comorbidity score (50 predictors) doi:10.2147/CLEP.S353398
SEPLINE — Socioeconomic Position in Epidemiological Research National guideline for socioeconomic variables doi:10.2147/CLEP.S520772
Tip

Charlson as an alternative to NMI: both heaven::charlsonIndex() and EpiForsk (see below) calculate Charlson comorbidity score from LPR diagnoses — both are on DST.


2. Existing functions for data management and extraction

heaven — Danish register data toolkit

(pre-installed on DST; not on CRAN — github.com/tagteam/heaven)

Functions built specifically for Danish register data by Thomas Gerds (University of Copenhagen, Biostatistics): Lexis-splitting for time-varying covariates, drug exposure windows from LMDB, risk-set matching and Charlson score.

What heaven can do — and when to use it

Lexis-splitting — split follow-up time across multiple time axes when a covariate changes over time (age, calendar period, time since exposure).

library(heaven)

df_split <- lexisTwo(
  indat      = kohort,
  invars     = c("pnr", "inn", "out", "event"),   # id, start date, end date, outcome
  splitvars  = list(age = c(40, 50, 60, 70),       # age boundaries
                    cal = c(2010, 2015, 2020))      # calendar years
)
# Returns: one data frame with MORE rows per person — one per time interval.
# New columns are added for the current bands (age group, calendar period etc.).
# Ready for a Cox model with time-varying covariates.

Drug exposure from LMDB — calculate exposure periods from prescription data (e.g. “was the person on metformin at index date?”).

exp_table <- medicinMacro(
  indat = lmdb_data, id = "pnr", atc = "atc", eksd = "eksd",
  ddds = "ddds", maxdepot = 3
)
# Returns: one row per exposure period per person with start and end date
# for the period. Use subsequent joins to determine exposure at index date.

Charlson comorbidity score from LPR diagnoses:

charlson_data <- charlsonIndex(indat = lpr_diag_data, id = "pnr",
                               code = "c_diag", code_type = "icd10")
# Returns: one row per person with total Charlson score and binary indicators
# per Charlson disease group. The function matches ICD-10 codes to categories itself.

Risk-set matching — match cases to controls at index date:

matched <- riskSetMatch(ptid = "pnr", event = "event",
                        terms = c("koen", "birth_year"), dat = kohort, ratio = 5)
# Returns: matched dataset with matchID linking cases and controls.
Note

heaven is confirmed available on DST — use library(heaven) directly. It is not on CRAN, so outside DST it must be fetched from GitHub.


EpiForsk — register data helper functions from SSI

(CRAN; on DST — github.com/Laksafoss/EpiForsk)

Code-sharing package from the Statens Serum Institut. Particularly useful: flatten_date_intervals() which merges overlapping hospital contacts into continuous periods — necessary before comorbidity lookback. Danish Charlson score included.

Most important functions and when to use them

flatten_date_intervals() — overlapping contacts

A person can have two overlapping admissions in LPR (e.g. a transfer). Before comorbidity lookback, overlapping periods must be merged:

library(EpiForsk)

lpr_flat <- flatten_date_intervals(
  data = lpr_contacts, id = "pnr",
  start_date = "d_inddto", end_date = "d_uddto"
)
# Returns: a data frame with merged start/end per person —
# overlapping periods are reduced to the total covered period.

charlson_score() — Danish-calibrated Charlson comorbidity score from ICD-10 (incl. D-prefix). See the callout below for what it requires.

charlson <- charlson_score(data = lpr_diag_data, id = "pnr", icd_codes = "c_diag")
# Returns: one row per person with Charlson score.
Important

What does charlson_score() require? You must extract a lookback period from LPR yourself (typically 5 years before index date) and give the function all diagnoses in that window — A, B and G type. You do not need to filter on specific ICD codes in advance; the function matches the codes to Charlson categories itself. The D-prefix (DG30 etc.) is handled by EpiForsk — it does not need to be stripped first.

# Example: extract lookback and pass all diagnoses
# pnr and d_inddto are in lpr_adm — not lpr_diag

# Step 1: broadest possible time window for the entire cohort
min_lookback <- min(kohort$index_date) - 365.25 * 5   # earliest lookback start

# Step 2: fetch contacts and join to diagnoses
lpr2_lookback <- lpr_adm %>%
  filter(pnr %in% !!kohort$pnr,
         d_inddto >= !!min_lookback) %>%              # broad pre-filter
  select(pnr, recnum, d_inddto) %>%
  inner_join(
    lpr_diag %>%
      filter(c_diagtype %in% c("A", "B", "G")) %>%   # A/B/G for comorbidity
      select(recnum, c_diag),
    by = "recnum"
  ) %>%
  collect() %>%
  left_join(kohort %>% select(pnr, index_date), by = "pnr") %>%
  filter(d_inddto >= index_date - 365.25 * 5,        # precise per-person lookback
         d_inddto < index_date)

charlson <- charlson_score(data = lpr2_lookback, id = "pnr", icd_codes = "c_diag")

dataReporter — automatic data quality report

(CRAN; confirmed on DST — github.com/ekstroem/dataReporter)

Automatically generates an HTML/PDF report of all columns in a dataset: distribution, missing values, outliers and type checks. Useful for validating a new register extraction before starting analysis. Published in Journal of Statistical Software.

library(dataReporter)
makeDataReport(my_extract)   # opens report in browser

See also Phase 7 — Inspect your data for manual inspection commands.


fakeregs — synthetic Danish register data

(GitHub only — local use only — github.com/steno-aarhus/fakeregs)

Fictitious but structurally realistic data for nine DST registers. For practising and testing code locally without DST access. Not used on the server. See Phase 6 — First extraction.


fastreg — SAS → parquet

(GitHub only — github.com/dp-next/fastreg)

Structured pipeline for converting an entire SAS workspace to parquet. Only relevant if you are building a parquet workspace from scratch. See Phase 4 — SAS → parquet.


3. Analysis tools

All packages in this section are confirmed on DST (June 2026). The guide does not teach the statistical methods — see package documentation for methodological choices.

survival — the foundation for survival analysis

(CRAN; always available — cran.r-project.org/package=survival)

Provides Surv(time, status), coxph() and survfit() — the fundamental building blocks on which all survival packages below are built. You rarely call {survival} directly for plots and tables, but library(survival) belongs at the top of your analysis scripts.

library(survival)
cox <- coxph(Surv(time, event) ~ age + sex + nmi_score, data = cohort)
summary(cox)   # HR, 95% CI, p-values — pass to Publish or gtsummary for formatted output

gtsummary — Table 1 and manuscript tables

(CRAN — cran.r-project.org/package=gtsummary)

Standard for manuscript tables: tbl_summary() for Table 1 stratified by exposure, tbl_regression() for Cox and logistic regression, tbl_survfit() for survival summaries. Published in R Journal 2021.

Example: Table 1
library(gtsummary)

table1 <- cohort %>%
  select(group, age, sex, nmi_score, occupation_cat) %>%
  tbl_summary(
    by = group,                                                # stratify by exposure
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",                    # continuous: mean (SD)
      all_categorical() ~ "{n} ({p}%)"                        # categorical: n (%)
    ),
    missing = "no"
  ) %>%
  add_overall() %>%
  add_p()
Warning

Output control: Counts of 0–4 may not be repatriated. Manual rounding: round(x / 5) * 5. Z.gtsummary.addons::round_5_gtsummary() automates this — not on DST, import via data manager.


carpenter — pipe-based table structure

(CRAN — github.com/lwjohnst86/carpenter)

Older alternative to gtsummary with a simpler pipe-based API: start with outline_table(), add rows with add_rows(), finish with build_table(). By Luke Johnston.

library(carpenter)

outline_table(cohort, "group") %>%
  add_rows("age",       stat_meanSD) %>%     # continuous: mean (SD)
  add_rows("sex",       stat_nPct) %>%       # categorical: n (%)
  add_rows("nmi_score", stat_medianIQR) %>%  # continuous: median (IQR)
  build_table()

prodlim — Kaplan–Meier and Aalen–Johansen cumulative incidence

(CRAN — github.com/tagteam/prodlim)

Estimates and plots survival curves and cumulative incidence. Hist() handles competing risks with a single status variable: 0 = censored, 1 = primary event, 2 = competing event (e.g. death). Use prodlim for descriptive curves — riskRegression for regression.

Example: KM curve and Aalen–Johansen
library(prodlim)

# Kaplan–Meier:
fit_km <- prodlim(Hist(time, event) ~ group, data = cohort)
plot(fit_km)

# Aalen–Johansen cumulative incidence (competing risks):
# status: 0 = censored, 1 = primary event, 2 = competing event (e.g. death)
fit_aj <- prodlim(Hist(time, status) ~ group, data = cohort)
plot(fit_aj, cause = 1)   # cumulative incidence for primary event
Note

KM alone overestimates cumulative incidence in the presence of competing risks. Use Aalen–Johansen (cause = 1) instead.


riskRegression — Cause-Specific Cox, Fine-Gray and absolute risk

(CRAN — github.com/tagteam/riskRegression)

For regression with competing risks. CSC() fits Cause-Specific Cox (one model per cause), FGR() fits Fine-Gray subdistribution hazard. predictRisk() computes absolute risk at specific time points. Score() validates and compares models (Brier score, time-dependent AUC).

Example: CSC model and absolute risk
library(riskRegression)
library(survival)

# Cause-Specific Cox:
csc_model <- CSC(Hist(time, status) ~ age + sex + nmi_score, data = cohort)
summary(csc_model)

# Absolute risk at 1, 3 and 5 years for primary event:
ar <- predictRisk(csc_model, newdata = cohort,
                  times = c(1, 3, 5), cause = 1)
# ar: matrix — one column per time point, one row per person

Publish — regression results ready for manuscript

(CRAN — github.com/tagteam/Publish)

regressionTable() takes a fitted model (Cox, logistic, linear) and returns a table with estimates, 95% CI and p-values. plotConfidence() produces forest plots from the same output.

library(Publish)
library(survival)

cox_model <- coxph(Surv(time, event) ~ age + sex + nmi_score, data = cohort)

regressionTable(cox_model)                    # HR, 95% CI, p-values
plotConfidence(regressionTable(cox_model))    # forest plot

MatchIt — propensity score and exact matching

(CRAN — cran.r-project.org/package=MatchIt)

Matches exposed to unexposed on observed covariates. Supports nearest-neighbour, exact and optimal matching. summary() gives balance statistics (SMD), match.data() returns the matched dataset. Alternative to heaven::riskSetMatch() with better balance diagnostics.

Example: 1:1 nearest-neighbour matching
library(MatchIt)

m <- matchit(exposed ~ age + sex + nmi_score,
             data   = cohort,
             method = "nearest",   # nearest-neighbour matching
             ratio  = 1)           # 1:1

# Balance check — SMD < 0.1 is a common rule of thumb:
summary(m, standardize = TRUE)

# Matched dataset:
matched <- match.data(m)

pec — prediction error curves

(CRAN — cran.r-project.org/package=pec)

Computes and plots Brier score over time for comparing and validating prediction models. Niche — only relevant for prediction models; riskRegression::Score() usually covers the need.


BuyseTest — Generalised Pairwise Comparisons

(CRAN, confirmed on DST v3.3.3 — github.com/bozenne/BuyseTest)

Implements Generalised Pairwise Comparisons (GPC): compares two groups via win ratio, net benefit and win probability on prioritised endpoints. Niche — relevant when you have a hierarchy of endpoints and want to avoid classical event-time assumptions.

library(BuyseTest)

# One time-to-event endpoint:
result <- BuyseTest(group ~ tte(time, event, threshold = 0),
                    data = cohort,
                    method.inference = "u-statistic")
summary(result)   # win ratio, net benefit, win probability

powerBuyseTest() calculates the required sample size given desired power and expected effect — useful before data collection or grant applications.


kinship2 — pedigrees and kinship matrices

(CRAN — cran.r-project.org/web/packages/kinship2)

Creates pedigree objects from parent–child relationships and computes kinship coefficients (probability of shared allele: 0.5 = parent/child, 0.25 = siblings). Useful for multi-generational family studies. By Terry Therneau (Mayo Clinic).

Example: create pedigree and compute kinship
library(kinship2)

# Create pedigree:
# DST's koen variable (1 = male, 2 = female) matches kinship2's expectation directly
ped <- pedigree(
  id    = family$pnr,
  dadid = family$pnr_father,   # 0 or NA if father unknown
  momid = family$pnr_mother,   # 0 or NA if mother unknown
  sex   = family$koen           # 1 = male, 2 = female
)

# Kinship coefficients:
km <- kinship(ped)

# Visualise family tree:
plot(ped)

tame — cluster medication exposure from LMDB

(CRAN — github.com/Laksafoss/tame)

Agglomerative hierarchical clustering of medication data using a custom distance measure combining ATC classification, timing and dose. medic() is the primary function. Relevant for characterising polypharmacy patterns in cohorts. By Anders Laksafoss (SSI).

Example: cluster medication patterns
library(tame)

# Cluster on ATC code alone:
clust <- medic(lmdb_data, id = pnr, atc = atc, k = 5)
# k = desired number of clusters — choose with scree plot or domain knowledge

# With timing information (column range):
clust <- medic(lmdb_data, id = pnr, atc = atc,
               timing = trimester_1:trimester_3,   # columns defining time windows
               k = 5)

summary(clust)   # summarise clusters
Note

id, atc and timing use bare column names (NSE — no quotes). lmdb_data must have one row per prescription dispensing.


Possible inspiration — not confirmed on DST

famnet (GitHub, experimental) — identifies family relationships from person IDs: given a dataset with PersonID, FatherID and MotherID, fmn_siblings() finds all sibling pairs with relationship type. Relevant for family linkage across generations in Danish register data. By Luke Johnston.

plotsurv (GitHub, not on DST) — publication-ready Kaplan–Meier and Aalen–Johansen competing-risks plots with at-risk table, CI bands and censoring ticks. plotsurv(survfit_object).

Z.gtsummary.addons (GitHub, not on DST) — extends gtsummary with add_SMD() (standardised mean differences for Table 1) and round_5_gtsummary() (rounds counts to the nearest 5 — relevant for DST output control).


4. Where to get more help?

Community and courses — see the prioritised help list and course links in Phase 2 — R: the bare essentials (Stack Overflow, Zheers R Coding Café, DDEA, R for Data Science, Epi R Handbook).

Institutional reference material:

Source What you find
ctpteam/RDST “R on DST” manual: project setup, data access, good coding practice on the DST server
ctpteam/DST Guide to LPR3, DST manuals, slides and a heaven PDF
steno-aarhus/registers-project-database SDCA’s own database: variable lists, register descriptions and application templates for the Danish Health Data Authority
github.com/tagteam riskRegression, prodlim, Publish, pec, heaven — Thomas Gerds and Brice Ozenne, KU Biostatistics

Next steps

Phase 16 — Export and repatriation

Back to top