Extract from LPR
Practical recipes: the two approaches, helper functions and integration with the cohort
This page shows how to extract diagnoses from LPR with code. It builds on the structure from Phase 9a (Understand LPR): the periods (LPR2/LPR3), the D-prefix, the diagnosis types (A/B/G) and the filter for retracted diagnoses. Read that page first if you haven't already.
You use the same extraction pattern as in Phase 5 and Phase 6, just applied to LPR's two generations. This is the most important and probably the most complex part of the guide.
This page and Phase 10 are circularly dependent, and this is deliberate. To build your cohort (Phase 10) you need to know how to extract diagnosis codes and procedures from LPR. To extract diagnosis codes from LPR (this page) you need a cohort. We resolve it as follows: this page teaches you the extraction pattern and assumes you already have a cohort. Phase 10 shows you how to build that cohort using exactly the pattern you just learned. Read the phases in order, and come back to the code here when your cohort is ready.
The code examples use the column names that DST's parquet registers typically have. Your columns may be named differently: check with names(your_data) or look them up in Phase 15 (Register reference).
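A minimal sketch of that column check, using a toy data.frame in place of a real register. The values and the `expected_cols` vector are illustrative assumptions, not confirmed DST names for your project:

```r
library(dplyr)

# Toy stand-in for a register extract (synthetic values; real names may differ)
your_data <- data.frame(
  PNR      = c("p1", "p2"),
  RECNUM   = c(1L, 2L),
  D_INDDTO = as.Date(c("2015-01-01", "2016-02-02")),
  stringsAsFactors = FALSE
)

names(your_data)  # inspect the names as delivered

# Lower-case the names, as the examples on this page do
your_data <- your_data %>% rename_with(tolower)

# Compare against the columns an example expects (hypothetical list)
expected_cols <- c("pnr", "recnum", "d_inddto")
missing_cols  <- setdiff(expected_cols, names(your_data))
if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```

Running a check like this before the extraction gives a readable error about a missing column instead of a failure deep inside a join.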
The code uses inner_join() and bind_rows(). If these are new concepts, they are explained in detail in Phase 11 (Joins and pivots). You can follow the code here and return to Phase 11 afterwards.
Fetch diagnoses from LPR: choose your approach
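If you want a quick feel for the two functions before reading Phase 11, the toy sketch below shows what they do on small synthetic tables. The table contents are made up, and left_join() is included only for contrast:

```r
library(dplyr)

# Toy contact and diagnosis tables (synthetic values, not DST data)
contacts <- data.frame(
  pnr    = c("p1", "p2", "p3"),
  recnum = c(1L, 2L, 3L),
  stringsAsFactors = FALSE
)
diagnoses <- data.frame(
  recnum = c(1L, 2L),
  c_diag = c("DE110", "DI210"),
  stringsAsFactors = FALSE
)

# inner_join() keeps only contacts that have a matching diagnosis row
matched <- inner_join(contacts, diagnoses, by = "recnum")

# left_join() keeps every contact; unmatched ones get NA in c_diag
all_contacts <- left_join(contacts, diagnoses, by = "recnum")

# bind_rows() stacks tables and fills columns missing on one side with NA
old_gen <- data.frame(pnr = "p1", recnum = 1L, c_diag = "DE110",
                      stringsAsFactors = FALSE)
new_gen <- data.frame(pnr = "p2", dw_ek_kontakt = 99L, c_diag = "DI210",
                      stringsAsFactors = FALSE)
stacked <- bind_rows(old_gen, new_gen)
```

Here `matched` has two rows, `all_contacts` has three, and `stacked` carries both join keys with NA where a key does not apply.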
Choose one of two approaches depending on your study:
| | Approach 1: direct extraction | Approach 2: alle_dx |
|---|---|---|
| Best when | You have one or a few outcomes | You have multiple outcomes from LPR |
| Workflow | Fetch specific codes → Exclude → done | Fetch all → Exclude → filter per outcome |
| Advantage | Simpler and faster for single-outcome studies | LPR queried only once; reused for all outcomes |
Approach 1 is best for a small number of outcomes. Filter on specific ICD codes directly in the filter() step before collect(). DuckDB/Arrow pushes the filter down to the storage layer, so only matching rows are loaded into RAM.
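The sketch below illustrates the filter-before-collect idea on a toy parquet folder written to a temporary directory. The data and the folder name `toy_lpr_diag` are assumptions for illustration, not DST data:

```r
library(arrow)
library(dplyr)

# Toy diagnosis table (synthetic values) saved as parquet in a temp folder
toy_diag <- data.frame(
  recnum     = 1:5,
  c_diag     = c("DE110", "DI210", "DE149", "DS720", "DE105"),
  c_diagtype = c("A", "B", "A", "A", "G"),
  stringsAsFactors = FALSE
)
toy_dir <- file.path(tempdir(), "toy_lpr_diag")
dir.create(toy_dir, showWarnings = FALSE)
write_parquet(toy_diag, file.path(toy_dir, "part-0.parquet"))

ds <- open_dataset(toy_dir)  # lazy: no rows are read into R yet

dm_rows <- ds %>%
  filter(c_diagtype %in% c("A", "B"),
         grepl("^DE1[0-4]", c_diag)) %>%  # evaluated by Arrow, before collect()
  collect()                               # only matching rows reach R
```

Only the two diabetes rows with type A or B end up in `dm_rows`. The row with type G is dropped by the diagnosis-type filter.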
Approach 2 is best when your study has multiple outcomes. You query LPR once and build alle_dx: a shared table with all A and B diagnoses. For each new outcome, filter alle_dx on the relevant codes; the only line you change is the code list.
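As a rough illustration of reusing one table for several outcomes, the sketch below filters a toy alle_dx once per code list. The `outcome_codes` list and the toy rows are hypothetical:

```r
library(dplyr)

# Toy alle_dx (synthetic values) with the columns used on this page
alle_dx <- data.frame(
  pnr          = c("p1", "p1", "p2", "p3", "p3"),
  date_contact = as.Date(c("2015-03-01", "2016-07-12", "2018-01-20",
                           "2014-05-05", "2019-09-30")),
  icd3         = c("E11", "I21", "G30", "F00", "E10"),
  stringsAsFactors = FALSE
)

# One named code list per outcome; only this list changes between outcomes
outcome_codes <- list(
  diabetes = c("E10", "E11", "E12", "E13", "E14"),
  dementia = c("G30", "F00", "F01", "F02", "F03")
)

# Filter the shared table once per outcome, without querying LPR again
outcomes <- lapply(outcome_codes, function(codes) {
  alle_dx %>% filter(icd3 %in% codes)
})

sapply(outcomes, nrow)  # number of diagnosis rows per outcome
```

Each element of `outcomes` is an ordinary data.frame that can go through the same first-event logic shown later on this page.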
The examples require parquet files and a completed study population. kohort is the data.frame with pnr and index_date per person (see Phase 10); the later examples on this page call the same object cohort. Adjust paths to your project.
Approach 1: fetch specific diagnoses directly (start here for one outcome)
Filter on specific codes before collect(). The example fetches diabetes mellitus (E10-E14); replace CODES_REGEX with your own codes.
library(arrow)
library(dplyr)

cohort_pnrs <- unique(kohort$pnr)
CODES_REGEX <- "^DE1[0-4]"  # diabetes mellitus (E10-E14), with D-prefix

# -- LPR2 somatic ----------------------------------------------------------
lpr_adm  <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_adm/") %>% rename_with(tolower)
lpr_diag <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_diag/") %>% rename_with(tolower)

lpr2_dm <- lpr_adm %>%
  filter(pnr %in% !!cohort_pnrs) %>%
  select(pnr, recnum, date_contact = d_inddto) %>%
  inner_join(
    lpr_diag %>%
      filter(c_diagtype %in% c("A", "B"),
             grepl(CODES_REGEX, c_diag)) %>%  # filter BEFORE collect, D-prefix in regex
      select(recnum, c_diag, c_diagtype),
    by = "recnum"
  ) %>%
  collect() %>%
  mutate(icd3 = substr(c_diag, 2, 4))

# -- LPR3 ------------------------------------------------------------------
lpr3_k <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_a_kontakt/") %>% rename_with(tolower)
lpr3_d <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_a_diagnose/") %>% rename_with(tolower)

lpr3_dm <- lpr3_k %>%
  filter(pnr %in% !!cohort_pnrs) %>%
  select(pnr, dw_ek_kontakt, date_contact = kont_starttidspunkt) %>%
  inner_join(
    lpr3_d %>%
      filter(diag_kode_type %in% c("A", "B"),
             is.na(senere_afkraeftet) | senere_afkraeftet != "Ja",
             grepl(CODES_REGEX, diag_kode)) %>%
      select(dw_ek_kontakt, c_diag = diag_kode, c_diagtype = diag_kode_type),
    by = "dw_ek_kontakt"
  ) %>%
  collect() %>%
  mutate(date_contact = as.Date(date_contact), icd3 = substr(c_diag, 2, 4))

dm_dx <- bind_rows(lpr2_dm, lpr3_dm)
# Columns: pnr | date_contact | c_diag | c_diagtype | icd3 (plus the join keys recnum and dw_ek_kontakt)

Many specific codes? Build the regex programmatically:

codes <- c("E10", "E11", "E12", "E13", "E14")
CODES_REGEX <- paste0("^D(", paste(codes, collapse = "|"), ")")

Do you have F-codes (e.g. dementia, depression)? Extend the regex to include them, e.g. "^DE1[0-4]|^DF0[0-3]|^DG30", and add psychiatric LPR2; see Approach 2 below for the code.
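A small sketch for sanity-checking a programmatically built regex on a handful of toy codes before running it against LPR. The helper name `build_codes_regex()` and the test codes are hypothetical:

```r
# Hypothetical helper: builds a D-prefixed regex from plain ICD-10 codes
build_codes_regex <- function(codes) {
  paste0("^D(", paste(codes, collapse = "|"), ")")
}

codes_regex <- build_codes_regex(c("E10", "E11", "F00", "G30"))
codes_regex  # "^D(E10|E11|F00|G30)"

# Toy codes: four that should match, one other diagnosis, one without D-prefix
test_codes <- c("DE109", "DE119", "DF009", "DG309", "DI210", "E109")
matches <- grepl(codes_regex, test_codes)
test_codes[matches]

# A regex without the D-prefix silently matches nothing on D-prefixed codes
no_prefix_hits <- sum(grepl("^E10", c("DE109", "DE119")))
```

Checking the regex on a few known codes first is cheap, and it catches a missing D-prefix before it shows up as an empty result.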
Alternative: compact extraction (single-table approach)
A colleague may have shown you this shorter approach:
lpr <- left_join(lpr_adm, lpr_diag, by = "RECNUM") |>
  filter(C_DIAGTYPE == "A",
         grepl("^S72", C_DIAG)) |>
  group_by(PNR) |>
  filter(D_INDDTO == min(D_INDDTO)) |>
  slice(1) |>
  ungroup()

It is shorter but has three pitfalls on DST data:
- D-prefix error: "^S72" does NOT match "DS72..." in DST data and returns zero rows with no error message. Use "^DS72" (with D) or strip the prefix first.
- left_join instead of inner_join: keeps all admissions from lpr_adm, including those with no matching diagnosis. Unnecessarily heavy on national registers.
- No pnr filter: loads the entire population's data. Correct when building a cohort (Phase 10), not when extracting from an existing one.
Approach 2: fetch all diagnoses + filter outcome (for multiple outcomes)
Part 1: build alle_dx
library(arrow)
library(dplyr)

cohort_pnrs <- unique(kohort$pnr)

# -- LPR2 somatic (up to March 2019) ---------------------------------------
lpr_adm  <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_adm/") %>% rename_with(tolower)
lpr_diag <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_diag/") %>% rename_with(tolower)

lpr2_dx <- lpr_adm %>%
  filter(pnr %in% !!cohort_pnrs) %>%
  select(pnr, recnum, date_contact = d_inddto) %>%
  inner_join(
    lpr_diag %>%
      filter(c_diagtype %in% c("A", "B")) %>%
      select(recnum, c_diag, c_diagtype),
    by = "recnum"
  ) %>%
  collect() %>%
  mutate(icd3 = substr(c_diag, 2, 4))

# -- LPR3 (March 2019 and onwards) -----------------------------------------
lpr3_k <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_a_kontakt/") %>% rename_with(tolower)
lpr3_d <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_a_diagnose/") %>% rename_with(tolower)

lpr3_dx <- lpr3_k %>%
  filter(pnr %in% !!cohort_pnrs) %>%
  select(pnr, dw_ek_kontakt, date_contact = kont_starttidspunkt) %>%
  inner_join(
    lpr3_d %>%
      filter(diag_kode_type %in% c("A", "B"),
             is.na(senere_afkraeftet) | senere_afkraeftet != "Ja") %>%
      select(dw_ek_kontakt, c_diag = diag_kode, c_diagtype = diag_kode_type),
    by = "dw_ek_kontakt"
  ) %>%
  collect() %>%
  mutate(date_contact = as.Date(date_contact), icd3 = substr(c_diag, 2, 4))

alle_dx <- bind_rows(lpr2_dx, lpr3_dx)
# Columns: pnr | date_contact | c_diag | c_diagtype | icd3 (plus the join keys recnum and dw_ek_kontakt)

Do you have F-codes (e.g. dementia, depression)? Psychiatric diagnoses recorded before March 2019 are in separate registers. Add them before bind_rows():

psyk_adm <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/t_psyk_adm/") %>%
  rename_with(tolower) %>% rename(pnr = v_cpr, recnum = k_recnum)
psyk_diag <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/t_psyk_diag/") %>%
  rename_with(tolower) %>% rename(recnum = v_recnum)

lpr2_psyk_dx <- psyk_adm %>%
  filter(pnr %in% !!cohort_pnrs) %>%
  select(pnr, recnum, date_contact = d_inddto) %>%
  inner_join(psyk_diag %>% filter(c_diagtype %in% c("A", "B")) %>%
               select(recnum, c_diag, c_diagtype), by = "recnum") %>%
  collect() %>% mutate(icd3 = substr(c_diag, 2, 4))

alle_dx <- bind_rows(lpr2_dx, lpr2_psyk_dx, lpr3_dx)

Using duckplyr? union_all() combines tables before collect() and requires identical column names and types. Rename LPR3 columns to match the LPR2 format before combining; see the onboarding document for an example.
Filter your extracted table for specific outcomes
CODES <- c("G30", "F00", "F01", "F02", "F03")  # dementia; change to your outcome

outcome <- alle_dx %>%
  filter(icd3 %in% CODES) %>%
  inner_join(cohort %>% select(pnr, index_date), by = "pnr") %>%  # use cohort_clean after exclusion (Phase 10 Step 2)
  filter(date_contact > index_date) %>%                           # post-index; use < for baseline covariate
  group_by(pnr) %>%
  arrange(date_contact) %>%
  slice(1) %>%
  ungroup() %>%
  select(pnr, event_date = date_contact)

# Join to cohort: NA = no event (censored at end of study)
result <- cohort %>%
  select(pnr) %>%
  left_join(outcome, by = "pnr")

saveRDS(result, "datasets/extract_dementia.rds")  # change filename for each new outcome

Exclusion of prevalent cases (persons who already had the diagnosis before the index date) happens in Phase 10, Step 2. Use cohort_clean instead of cohort in the code above after completing that step.
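As a rough illustration of what excluding prevalent cases can look like (Phase 10 has the authoritative version), the sketch below uses toy data. Treating a diagnosis on the index date itself as prevalent (`<=`) is an assumption that depends on your protocol, and the `prevalent` table name is hypothetical:

```r
library(dplyr)

# Toy cohort and diagnosis table (synthetic values)
cohort <- data.frame(
  pnr        = c("p1", "p2", "p3"),
  index_date = as.Date(rep("2015-01-01", 3)),
  stringsAsFactors = FALSE
)
alle_dx <- data.frame(
  pnr          = c("p1", "p2", "p2", "p3"),
  date_contact = as.Date(c("2016-05-01", "2012-03-01", "2017-01-01", "2010-01-01")),
  icd3         = c("G30", "F00", "G30", "E11"),
  stringsAsFactors = FALSE
)
CODES <- c("G30", "F00", "F01", "F02", "F03")  # dementia

# Persons with an outcome diagnosis on or before index (assumed definition)
prevalent <- alle_dx %>%
  filter(icd3 %in% CODES) %>%
  inner_join(cohort, by = "pnr") %>%
  filter(date_contact <= index_date) %>%
  distinct(pnr)

# Drop them from the cohort
cohort_clean <- cohort %>% anti_join(prevalent, by = "pnr")
```

In this toy data only p2 has a dementia code before index, so `cohort_clean` keeps p1 and p3. The pre-index diabetes code for p3 does not count.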
Try it yourself: runnable example with synthetic data (Approach 1)
This example requires RStudio installed locally on your computer, not on the DST server. The synthetic dataset (fakeregs) is not available on DST. Download R: cran.r-project.org · Download RStudio: posit.co/download/rstudio-desktop
The example extracts CVD diagnoses (ischaemic heart disease, ICD-10 I20-I25) from LPR2 and LPR3 combined: the complete pattern from the theory section above, but runnable locally with synthetic data. It follows Approach 1: rows are filtered on the specific codes before collect().
The synthetic LPR data is generated with the fakeregs package, which you already know from Phase 6 (First extraction). If you have already generated and saved data there, synth_data/lpr_adm/ is ready and you can skip the preparation block.
Adapted from Anders Aasted Isaksen's dev/common_tasks_datatable.qmd in fakeregs (MIT licence, Steno Diabetes Center Aarhus). Rewritten to dplyr + arrow and adapted to this guide's pattern.
# Install fakeregs for the first time:
# install.packages("pak"); pak::pak("steno-aarhus/fakeregs")
library(fakeregs)  # synthetic DST register data
library(dplyr)     # filter, select, mutate, inner_join, bind_rows
library(arrow)     # open_dataset, write_parquet

# -- Preparation: generate synthetic data and save as parquet (done only once) --
bp             <- generate_background_pop()
lpr_adm_synth  <- generate_lpr_adm(background_df = bp)
lpr_diag_synth <- generate_lpr_diag(background_df = lpr_adm_synth)
lpr_a_k_synth  <- generate_lpr_a_kontakt(background_df = bp)
lpr_a_d_synth  <- generate_lpr_a_diagnose(background_df = lpr_a_k_synth)

dir.create("synth_data/lpr_adm", recursive = TRUE, showWarnings = FALSE)
dir.create("synth_data/lpr_diag", recursive = TRUE, showWarnings = FALSE)
dir.create("synth_data/lpr_a_kontakt", recursive = TRUE, showWarnings = FALSE)
dir.create("synth_data/lpr_a_diagnose", recursive = TRUE, showWarnings = FALSE)

write_parquet(lpr_adm_synth, "synth_data/lpr_adm/lpr_adm.parquet")
write_parquet(lpr_diag_synth, "synth_data/lpr_diag/lpr_diag.parquet")
write_parquet(lpr_a_k_synth, "synth_data/lpr_a_kontakt/lpr_a_kontakt.parquet")
write_parquet(lpr_a_d_synth, "synth_data/lpr_a_diagnose/lpr_a_diagnose.parquet")

The path is relative to your working directory; check with getwd(). If you have already run the preparation block in Phase 6, synth_data/lpr_adm/ is already saved.
# The ICD codes we are looking for; change these to your own outcome
CVD_CODES <- c("I20", "I21", "I22", "I23", "I24", "I25")  # ischaemic heart disease

# -- LPR2 somatic (up to March 2019) ---------------------------------------
lpr_adm  <- open_dataset("synth_data/lpr_adm/") %>% rename_with(tolower)   # LPR2 contact table (synthetic)
lpr_diag <- open_dataset("synth_data/lpr_diag/") %>% rename_with(tolower)  # LPR2 diagnosis table (synthetic)

lpr2_cvd <- lpr_adm %>%
  select(pnr, recnum, date_contact = d_inddto) %>%    # select only necessary columns
  inner_join(
    lpr_diag %>%
      filter(c_diagtype %in% c("A", "B"),             # only primary (action) and secondary diagnoses
             substr(c_diag, 2, 4) %in% !!CVD_CODES) %>%  # !! inlines the local R vector into the query
      select(recnum, c_diag),                         # only join key and diagnosis code
    by = "recnum"                                     # join key in LPR2
  ) %>%
  collect() %>%                                       # HERE data is fetched into R
  mutate(icd3 = substr(c_diag, 2, 4))                 # save cleaned code as new column

# -- LPR3 (March 2019 and onwards) -----------------------------------------
lpr3_k <- open_dataset("synth_data/lpr_a_kontakt/") %>% rename_with(tolower)   # LPR3 contact table (synthetic)
lpr3_d <- open_dataset("synth_data/lpr_a_diagnose/") %>% rename_with(tolower)  # LPR3 diagnosis table (synthetic)

lpr3_cvd <- lpr3_k %>%
  select(pnr, dw_ek_kontakt, date_contact = kont_starttidspunkt) %>%  # dw_ek_kontakt is join key to lpr_a_diagnose
  inner_join(
    lpr3_d %>%
      filter(diag_kode_type %in% c("A", "B"),
             is.na(senere_afkraeftet) | senere_afkraeftet != "Ja",  # exclude retracted diagnoses
             substr(diag_kode, 2, 4) %in% !!CVD_CODES) %>%          # !! inlines the local R vector into the query
      select(dw_ek_kontakt, c_diag = diag_kode),                    # rename to c_diag for consistency with LPR2
    by = "dw_ek_kontakt"                                            # join key in LPR3
  ) %>%
  collect() %>%                                                     # fetch into R
  mutate(
    date_contact = as.Date(date_contact),  # datetime -> date
    icd3 = substr(c_diag, 2, 4)            # strip D-prefix: "DI21" -> "I21"
  )

# -- Combine and save --------------------------------------------------------
alle_cvd <- bind_rows(lpr2_cvd, lpr3_cvd)  # stack LPR2 and LPR3
nrow(alle_cvd)                             # check: number of diagnosis rows
length(unique(alle_cvd$pnr))               # check: number of unique individuals
table(alle_cvd$icd3)                       # distribution across codes

dir.create("datasets", showWarnings = FALSE)   # make sure the output folder exists
saveRDS(alle_cvd, "datasets/extract_cvd.rds")  # save; change path to your own folder

Wrap the pattern in a reusable function (for multiple outcomes)
If you extract diagnoses for several outcomes, it pays off to encapsulate the Approach 2 pattern in one reusable function rather than copying ~40 lines for each new outcome. Define it at the top of your script or in a separate functions.R file.
Advantages:
- One place to fix if something changes (e.g. a new register or a new column)
- The code block for each outcome is reduced from ~40 lines to one function call
- Mistakes can only creep in at one place instead of in every copy
Working on DARTER (or another project with dstDataPrep)? Swap open_dataset("E:/workdata/.../<register>/") for load_database("<register>"), which resolves the path automatically. See the DARTER page (overview and pipeline) for the fully adapted variant; it is kept up to date with the current, confirmed register names (as of June 2026).
See the full get_lpr_diagnoses() function and usage
library(arrow)
library(dplyr)

get_lpr_diagnoses <- function(pnr_vector, diagtypes = c("A", "B"), inpatient_only = FALSE) {
  base <- "E:/workdata/[projectnumber]/cleaned-data/parquet-registers/"

  # Open registers
  lpr_adm   <- open_dataset(paste0(base, "lpr_adm/")) %>% rename_with(tolower)   # LPR2 somatic contacts
  lpr_diag  <- open_dataset(paste0(base, "lpr_diag/")) %>% rename_with(tolower)  # LPR2 somatic diagnoses
  psyk_adm  <- open_dataset(paste0(base, "t_psyk_adm/")) %>% rename_with(tolower) %>%
    rename(pnr = v_cpr, recnum = k_recnum)                                        # LPR2 psychiatric contacts
  psyk_diag <- open_dataset(paste0(base, "t_psyk_diag/")) %>% rename_with(tolower) %>%
    rename(recnum = v_recnum)                                                     # LPR2 psychiatric diagnoses
  lpr3_k    <- open_dataset(paste0(base, "lpr_a_kontakt/")) %>% rename_with(tolower) %>%
    filter(lprindberetningssystem == "LPR3")  # CRITICAL: avoid duplicated rows from LPR_F format
  lpr3_d    <- open_dataset(paste0(base, "lpr_a_diagnose/")) %>% rename_with(tolower)  # LPR3 diagnoses

  # Filter on admission type if desired
  if (inpatient_only) {
    lpr_adm <- lpr_adm %>% filter(c_pattype == "0")     # "0" = inpatient in LPR2
    lpr3_k  <- lpr3_k %>% filter(kont_type == "ALCA00") # "ALCA00" = inpatient in LPR3
  }

  # LPR2 somatic
  lpr2_dx <- lpr_adm %>%
    filter(pnr %in% !!pnr_vector) %>%
    select(pnr, recnum, date_contact = d_inddto) %>%
    inner_join(
      lpr_diag %>% filter(c_diagtype %in% !!diagtypes) %>% select(recnum, c_diag),
      by = "recnum"
    ) %>%
    collect() %>%
    mutate(icd3 = substr(c_diag, 2, 4))  # strip D-prefix

  # LPR2 psychiatric
  lpr2_psyk_dx <- psyk_adm %>%
    filter(pnr %in% !!pnr_vector) %>%
    select(pnr, recnum, date_contact = d_inddto) %>%
    inner_join(
      psyk_diag %>% filter(c_diagtype %in% !!diagtypes) %>% select(recnum, c_diag),
      by = "recnum"
    ) %>%
    collect() %>%
    mutate(icd3 = substr(c_diag, 2, 4))

  # LPR3
  lpr3_dx <- lpr3_k %>%
    filter(pnr %in% !!pnr_vector) %>%
    select(pnr, dw_ek_kontakt, date_contact = kont_starttidspunkt) %>%
    inner_join(
      lpr3_d %>%
        filter(diag_kode_type %in% !!diagtypes,
               is.na(senere_afkraeftet) | senere_afkraeftet != "Ja") %>%
        select(dw_ek_kontakt, c_diag = diag_kode),
      by = "dw_ek_kontakt"
    ) %>%
    collect() %>%
    mutate(date_contact = as.Date(date_contact),  # datetime -> date
           icd3 = substr(c_diag, 2, 4))

  bind_rows(lpr2_dx, lpr2_psyk_dx, lpr3_dx)  # return combined table
}

Use the function: one call per extraction, only change CODES:
cohort   <- readRDS("datasets/full_cohort.rds")
pnr_list <- unique(cohort$pnr)

# Fetch all diagnoses for the cohort (Phase 1, see hospital contacts page)
alle_dx <- get_lpr_diagnoses(
  pnr_vector     = pnr_list,
  diagtypes      = c("A", "B"),
  inpatient_only = FALSE
)
# Returns: pnr | date_contact | c_diag | icd3 (plus the join keys recnum and dw_ek_kontakt)

# Extract one outcome: only change CODES (Phase 2)
CODES <- c("F00", "F01", "F02", "F03", "G30", "G31")  # dementia

dementia <- alle_dx %>%
  filter(icd3 %in% CODES) %>%
  inner_join(cohort %>% select(pnr, index_date), by = "pnr") %>%
  filter(date_contact > index_date) %>%
  group_by(pnr) %>% arrange(date_contact) %>% slice(1) %>% ungroup() %>%
  select(pnr, dementia_date = date_contact)

result <- cohort %>% select(pnr) %>% left_join(dementia, by = "pnr")
saveRDS(result, "datasets/extract_dementia.rds")

Next steps
You have now extracted diagnoses from two LPR generations. The next steps are to shape and combine your extracts.
See also
- Phase 9a (Understand LPR): structure, periods, D-prefix and diagnosis types
- Phase 6 (First extraction): step-by-step introduction to open_dataset, collect and saveRDS
- Phase 15 (Register reference): confirmed column names for all LPR registers
- Phase 15 (Pitfalls): known issues with LPR on DST