Extract from LPR

Practical recipes — the two approaches, helper functions and integration with the cohort

Published

June 6, 2026

This page shows how to extract diagnoses from LPR with code. It builds on the structure from Phase 9a — Understand LPR: the periods (LPR2/LPR3), the D-prefix, the diagnosis types (A/B/G) and the filter for retracted diagnoses. Read that page first if you haven’t already.

You use the same extraction pattern as in Phase 5 and Phase 6, now applied to LPR’s two generations. This extraction is the most important and probably the most complex part of the guide.

Note

This page and Phase 10 are circularly dependent — this is deliberate. To build your cohort (Phase 10) you need to know how to extract diagnosis codes and procedures from LPR. To extract diagnosis codes from LPR (this page) you need a cohort. We resolve it as follows: this page teaches you the extraction pattern — we assume you already have a cohort. Phase 10 shows you how to build that cohort using exactly the pattern you just learned. Read the phases in order, and come back to the code here when your cohort is ready.

Note

The code examples use the column names DST’s parquet registers typically have. Your columns may be named differently — check with names(your_data) or look them up in Phase 15 — Register reference.
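
A minimal sketch of that column check, assuming a made-up three-column table (toy_adm and the expected vector are illustrative names, not real DST data). It lower-cases the names the same way the examples on this page do and then lists any expected column that is missing:

```r
library(dplyr)

# Toy stand-in for a register table with upper-case column names (made-up values)
toy_adm <- tibble(
  PNR      = c("p1", "p2"),
  RECNUM   = c(101L, 102L),
  D_INDDTO = as.Date(c("2015-03-01", "2016-01-05"))
)

# Same normalisation as in the examples on this page
toy_adm <- toy_adm %>% rename_with(tolower)

# Columns the extraction code expects (hypothetical list for illustration)
expected <- c("pnr", "recnum", "d_inddto", "c_pattype")

# Anything listed here must be renamed or looked up before running the code
missing_cols <- setdiff(expected, names(toy_adm))
missing_cols
```

An empty result means every expected column is present under the expected name.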

Note

The code uses inner_join() and bind_rows(). If these are new concepts, they are explained in detail in Phase 11 — Joins and pivots — you can follow the code here and return to Phase 11 afterwards.

Fetch diagnoses from LPR — choose your approach

Choose one of two approaches depending on your study:

	Approach 1 — direct extraction	Approach 2 — `alle_dx`
Best when	You have one or a few outcomes	You have multiple outcomes from LPR
Workflow	Fetch specific codes → Exclude → done	Fetch all → Exclude → filter per outcome
Advantage	Simpler and faster for single-outcome studies	LPR queried only once; reused for all outcomes

Approach 1 is best for a smaller number of outcomes. Filter on specific ICD codes directly in the filter() step before collect(). DuckDB/Arrow pushes the filter down to the storage layer — only matching rows are loaded into RAM.

Approach 2 is best when your study has multiple outcomes. You query LPR once and build alle_dx: a shared table with all A and B diagnoses. For each new outcome, filter alle_dx on the relevant codes — the only line you change is the code list.

Note

The examples require parquet files and a completed study population. kohort (called cohort in the later examples) is the data.frame with pnr and index_date per person; see Phase 10. Adjust paths to your project.

Approach 1 — fetch specific diagnoses directly (start here for one outcome)

Filter on specific codes before collect(). The example fetches diabetes mellitus (E10–E14) — replace CODES_REGEX with your own codes.

library(arrow)
library(dplyr)

cohort_pnrs <- unique(kohort$pnr)
CODES_REGEX <- "^DE1[0-4]"   # diabetes mellitus (E10–E14) — with D-prefix

# ── LPR2 somatic ─────────────────────────────────────────────────────────
lpr_adm  <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_adm/")  %>% rename_with(tolower)
lpr_diag <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_diag/") %>% rename_with(tolower)

lpr2_dm <- lpr_adm %>%
  filter(pnr %in% !!cohort_pnrs) %>%
  select(pnr, recnum, date_contact = d_inddto) %>%
  inner_join(
    lpr_diag %>%
      filter(c_diagtype %in% c("A", "B"),
             grepl(CODES_REGEX, c_diag)) %>%   # filter BEFORE collect — D-prefix in regex
      select(recnum, c_diag, c_diagtype),
    by = "recnum"
  ) %>%
  collect() %>%
  mutate(icd3 = substr(c_diag, 2, 4))

# ── LPR3 ─────────────────────────────────────────────────────────────────
lpr3_k <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_a_kontakt/")  %>% rename_with(tolower)
lpr3_d <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_a_diagnose/") %>% rename_with(tolower)

lpr3_dm <- lpr3_k %>%
  filter(pnr %in% !!cohort_pnrs) %>%
  select(pnr, dw_ek_kontakt, date_contact = kont_starttidspunkt) %>%
  inner_join(
    lpr3_d %>%
      filter(diag_kode_type %in% c("A", "B"),
             is.na(senere_afkraeftet) | senere_afkraeftet != "Ja",
             grepl(CODES_REGEX, diag_kode)) %>%
      select(dw_ek_kontakt, c_diag = diag_kode, c_diagtype = diag_kode_type),
    by = "dw_ek_kontakt"
  ) %>%
  collect() %>%
  mutate(date_contact = as.Date(date_contact), icd3 = substr(c_diag, 2, 4))

dm_dx <- bind_rows(lpr2_dm, lpr3_dm)
# Columns: pnr | date_contact | c_diag | c_diagtype | icd3 (plus join keys recnum and dw_ek_kontakt, NA where not applicable)
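
A small sketch of what bind_rows() does when the two generations carry different join keys, using made-up one-row tables (lpr2_toy and lpr3_toy are illustrative names). It assumes you want only the five shared columns in the final table:

```r
library(dplyr)

# Toy one-row extracts shaped like lpr2_dm and lpr3_dm (made-up values)
lpr2_toy <- tibble(
  pnr = "p1", recnum = 101L, date_contact = as.Date("2015-03-01"),
  c_diag = "DE110", c_diagtype = "A", icd3 = "E11"
)
lpr3_toy <- tibble(
  pnr = "p2", dw_ek_kontakt = 9001L, date_contact = as.Date("2020-05-05"),
  c_diag = "DE119", c_diagtype = "B", icd3 = "E11"
)

# bind_rows() keeps the union of columns and fills the gaps with NA
combined <- bind_rows(lpr2_toy, lpr3_toy)
names(combined)

# Drop the join keys if you only need the shared columns
slim <- combined %>% select(pnr, date_contact, c_diag, c_diagtype, icd3)
```

Keeping the join keys does no harm, but dropping them makes the combined table match the column list in the comment.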

Tip

Many specific codes? Build the regex programmatically:

codes <- c("E10", "E11", "E12", "E13", "E14")
CODES_REGEX <- paste0("^D(", paste(codes, collapse = "|"), ")")
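
Before running the regex against the register, it can help to test it on a handful of codes. A minimal sketch, assuming made-up sample codes (sample_diag is an illustrative vector, not register data):

```r
codes <- c("E10", "E11", "E12", "E13", "E14")
CODES_REGEX <- paste0("^D(", paste(codes, collapse = "|"), ")")

# Made-up codes: two that should match, three that should not
sample_diag <- c("DE110", "DE149", "DE15", "E110", "DI219")

# TRUE where the code starts with D followed by one of the listed codes
matches <- grepl(CODES_REGEX, sample_diag)
data.frame(sample_diag, matches)
```

The fourth sample code lacks the D-prefix and is therefore not matched, which is the behaviour you want on DST data.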

Note

Do you have F-codes (e.g. dementia, depression)? Extend the regex to include them, e.g. "^DE1[0-4]|^DF0[0-3]|^DG30", and add psychiatric LPR2 — see Approach 2 below for the code.

Alternative: compact extraction (single-table approach)

A colleague may have shown you this shorter approach:

lpr <- left_join(lpr_adm, lpr_diag, by = "RECNUM") |>
  filter(C_DIAGTYPE == "A",
         grepl("^S72", C_DIAG)) |>
  group_by(PNR) |>
  filter(D_INDDTO == min(D_INDDTO)) |>
  slice(1) |>
  ungroup()

It is shorter but has three pitfalls on DST data:

D-prefix error: "^S72" does NOT match "DS72..." in DST data — returns zero rows with no error message. Use "^DS72" (with D) or strip the prefix first.
left_join instead of inner_join: Keeps all admissions from lpr_adm — including those with no matching diagnosis. Unnecessarily heavy on national registers.
No pnr filter: Loads the entire population’s data. Correct when building a cohort (Phase 10), not when extracting from an existing one.
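
The D-prefix pitfall is easy to reproduce on a toy table. The sketch below assumes made-up diagnosis codes (toy_diag is an illustrative name) and compares the pattern without the prefix, the pattern with the prefix, and stripping the prefix first:

```r
library(dplyr)

# Toy diagnosis table with DST-style D-prefixed codes (made-up values)
toy_diag <- tibble(
  recnum = 1:4,
  c_diag = c("DS720", "DS721", "DI219", "DE110")
)

# Without the D: silently returns zero rows, no error and no warning
n_without_d <- toy_diag %>% filter(grepl("^S72", c_diag)) %>% nrow()

# With the D in the pattern: finds both hip fracture codes
n_with_d <- toy_diag %>% filter(grepl("^DS72", c_diag)) %>% nrow()

# Alternative: strip the prefix first, then use the plain ICD-10 pattern
n_stripped <- toy_diag %>%
  mutate(icd = sub("^D", "", c_diag)) %>%
  filter(grepl("^S72", icd)) %>%
  nrow()

c(without_d = n_without_d, with_d = n_with_d, stripped = n_stripped)
```

A quick nrow() check like this after every extraction is a cheap way to catch an empty result before it propagates.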

Approach 2 — fetch all diagnoses + filter outcome (for multiple outcomes)

Part 1 — build alle_dx

library(arrow)
library(dplyr)

cohort_pnrs <- unique(kohort$pnr)

# ── LPR2 somatic (up to March 2019) ──────────────────────────────────────
lpr_adm  <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_adm/")  %>% rename_with(tolower)
lpr_diag <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_diag/") %>% rename_with(tolower)

lpr2_dx <- lpr_adm %>%
  filter(pnr %in% !!cohort_pnrs) %>%
  select(pnr, recnum, date_contact = d_inddto) %>%
  inner_join(
    lpr_diag %>%
      filter(c_diagtype %in% c("A", "B")) %>%
      select(recnum, c_diag, c_diagtype),
    by = "recnum"
  ) %>%
  collect() %>%
  mutate(icd3 = substr(c_diag, 2, 4))

# ── LPR3 (March 2019 and onwards) ────────────────────────────────────────
lpr3_k <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_a_kontakt/")  %>% rename_with(tolower)
lpr3_d <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_a_diagnose/") %>% rename_with(tolower)

lpr3_dx <- lpr3_k %>%
  filter(pnr %in% !!cohort_pnrs) %>%
  select(pnr, dw_ek_kontakt, date_contact = kont_starttidspunkt) %>%
  inner_join(
    lpr3_d %>%
      filter(diag_kode_type %in% c("A", "B"),
             is.na(senere_afkraeftet) | senere_afkraeftet != "Ja") %>%
      select(dw_ek_kontakt, c_diag = diag_kode, c_diagtype = diag_kode_type),
    by = "dw_ek_kontakt"
  ) %>%
  collect() %>%
  mutate(date_contact = as.Date(date_contact), icd3 = substr(c_diag, 2, 4))

alle_dx <- bind_rows(lpr2_dx, lpr3_dx)
# Columns: pnr | date_contact | c_diag | c_diagtype | icd3 (plus join keys recnum and dw_ek_kontakt, NA where not applicable)
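
The icd3 column relies on every code carrying the D-prefix, because substr(c_diag, 2, 4) simply drops the first character. A minimal sketch of what that does, with made-up codes and a hypothetical sanity check (the has_prefix pattern is an assumption for illustration, not a DST rule):

```r
# Made-up DST-style codes, including one from ICD-10 chapter D (DD500)
c_diag <- c("DE110", "DI21", "DF03", "DD500")

# Characters 2 to 4 give the three-character ICD-10 code
icd3 <- substr(c_diag, 2, 4)
icd3

# If a code arrives WITHOUT the prefix, the same call cuts into the code itself
substr("E110", 2, 4)

# Hypothetical check: D, then a letter, then two digits
has_prefix <- grepl("^D[A-Z][0-9]{2}", c_diag)
all(has_prefix)

# Unprefixed codes fail the check, including codes from chapter D
grepl("^D[A-Z][0-9]{2}", c("E110", "D500"))
```

Running all(has_prefix) on your own c_diag column before creating icd3 shows whether the assumption holds in your data.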

Note

Do you have F-codes (e.g. dementia, depression)? Psychiatric diagnoses recorded before March 2019 are in separate registers. Add them before bind_rows():

psyk_adm  <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/t_psyk_adm/") %>%
  rename_with(tolower) %>% rename(pnr = v_cpr, recnum = k_recnum)
psyk_diag <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/t_psyk_diag/") %>%
  rename_with(tolower) %>% rename(recnum = v_recnum)

lpr2_psyk_dx <- psyk_adm %>%
  filter(pnr %in% !!cohort_pnrs) %>%
  select(pnr, recnum, date_contact = d_inddto) %>%
  inner_join(psyk_diag %>% filter(c_diagtype %in% c("A", "B")) %>%
               select(recnum, c_diag, c_diagtype), by = "recnum") %>%
  collect() %>% mutate(icd3 = substr(c_diag, 2, 4))

alle_dx <- bind_rows(lpr2_dx, lpr2_psyk_dx, lpr3_dx)

Tip

Using duckplyr? union_all() combines tables before collect() and requires identical column names and types. Rename LPR3 columns to match the LPR2 format before combining — see the onboarding document for an example.

Filter your extracted table for specific outcomes

CODES <- c("G30", "F00", "F01", "F02", "F03")   # dementia — change to your outcome

outcome <- alle_dx %>%
  filter(icd3 %in% CODES) %>%
  inner_join(cohort %>% select(pnr, index_date), by = "pnr") %>%   # use cohort_clean after exclusion (Phase 10 Step 2)
  filter(date_contact > index_date) %>%   # post-index; use < for baseline covariate
  group_by(pnr) %>%
  arrange(date_contact) %>%
  slice(1) %>%
  ungroup() %>%
  select(pnr, event_date = date_contact)

# Join to cohort — NA = no event (censored at end of study)
result <- cohort %>%
  select(pnr) %>%
  left_join(outcome, by = "pnr")

saveRDS(result, "datasets/extract_dementia.rds")   # change filename for each new outcome

Note

Exclusion of prevalent cases — persons who already had the diagnosis before index date — happens in Phase 10, Step 2. Use cohort_clean instead of cohort in the code above after completing that step.
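
The same filter with < instead of > gives a baseline flag for diagnoses recorded before index date. A minimal sketch, assuming toy versions of alle_dx and cohort with made-up values (dm_before_index is an illustrative column name):

```r
library(dplyr)

# Toy stand-ins for alle_dx and cohort (made-up values)
alle_dx <- tibble(
  pnr          = c("p1", "p1", "p2", "p3"),
  date_contact = as.Date(c("2012-05-01", "2016-03-01", "2015-07-15", "2013-11-30")),
  icd3         = c("E11", "E11", "E11", "I21")
)
cohort <- tibble(
  pnr        = c("p1", "p2", "p3"),
  index_date = as.Date(rep("2014-01-01", 3))
)

CODES <- c("E10", "E11", "E12", "E13", "E14")   # diabetes mellitus

# Persons with at least one matching diagnosis BEFORE index date
prevalent <- alle_dx %>%
  filter(icd3 %in% CODES) %>%
  inner_join(cohort %>% select(pnr, index_date), by = "pnr") %>%
  filter(date_contact < index_date) %>%
  distinct(pnr) %>%
  mutate(dm_before_index = TRUE)

# One row per person; no match means FALSE rather than NA
baseline <- cohort %>%
  left_join(prevalent, by = "pnr") %>%
  mutate(dm_before_index = coalesce(dm_before_index, FALSE))
```

Here only p1 is flagged: p2 has the diagnosis after index date, and p3 has a different code.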

Try it yourself — runnable example with synthetic data (Approach 1)

Important

This example requires RStudio installed locally on your computer — not the DST server. The synthetic dataset (fakeregs) is not available on DST. Download R: cran.r-project.org · Download RStudio: posit.co/download/rstudio-desktop

The example extracts CVD diagnoses (ischaemic heart disease, ICD-10 I20–I25) from LPR2 and LPR3 combined. It is the complete pattern from the theory section above, but runnable locally with synthetic data. It follows Approach 1: the specific codes are selected in filter() before collect().

The synthetic LPR data is generated with the fakeregs package, which you already know from Phase 6 — First extraction. If you have already generated and saved data there, synth_data/lpr_adm/ is ready and you can skip the preparation block.

Adapted from Anders Aasted Isaksen’s dev/common_tasks_datatable.qmd in fakeregs (MIT licence, Steno Diabetes Center Aarhus). Rewritten to dplyr + arrow and adapted to this guide’s pattern.

# Install fakeregs for the first time:
# install.packages("pak"); pak::pak("steno-aarhus/fakeregs")

library(fakeregs)   # synthetic DST register data
library(dplyr)      # filter, select, mutate, inner_join, bind_rows
library(arrow)      # open_dataset, write_parquet

# ── Preparation: generate synthetic data and save as parquet (done only once) ────
bp             <- generate_background_pop()
lpr_adm_synth  <- generate_lpr_adm(background_df = bp)
lpr_diag_synth <- generate_lpr_diag(background_df = lpr_adm_synth)
lpr_a_k_synth  <- generate_lpr_a_kontakt(background_df = bp)
lpr_a_d_synth  <- generate_lpr_a_diagnose(background_df = lpr_a_k_synth)

dir.create("synth_data/lpr_adm",        recursive = TRUE, showWarnings = FALSE)
dir.create("synth_data/lpr_diag",       recursive = TRUE, showWarnings = FALSE)
dir.create("synth_data/lpr_a_kontakt",  recursive = TRUE, showWarnings = FALSE)
dir.create("synth_data/lpr_a_diagnose", recursive = TRUE, showWarnings = FALSE)
write_parquet(lpr_adm_synth,  "synth_data/lpr_adm/lpr_adm.parquet")
write_parquet(lpr_diag_synth, "synth_data/lpr_diag/lpr_diag.parquet")
write_parquet(lpr_a_k_synth,  "synth_data/lpr_a_kontakt/lpr_a_kontakt.parquet")
write_parquet(lpr_a_d_synth,  "synth_data/lpr_a_diagnose/lpr_a_diagnose.parquet")

Tip

The path is relative to your working directory — check with getwd(). If you have already run the preparation block in Phase 6, synth_data/lpr_adm/ is already saved.

# The ICD codes we are looking for — change these to your own outcome
CVD_CODES <- c("I20", "I21", "I22", "I23", "I24", "I25")   # ischaemic heart disease

# ── LPR2 somatic (up to March 2019) ──────────────────────────────────────
lpr_adm  <- open_dataset("synth_data/lpr_adm/")  %>% rename_with(tolower)   # LPR2 contact table — synthetic
lpr_diag <- open_dataset("synth_data/lpr_diag/") %>% rename_with(tolower)   # LPR2 diagnosis table — synthetic

lpr2_cvd <- lpr_adm %>%
  select(pnr, recnum, date_contact = d_inddto) %>%           # select only necessary columns
  inner_join(
    lpr_diag %>%
      filter(c_diagtype %in% c("A", "B"),                    # only action and secondary diagnoses
             substr(c_diag, 2, 4) %in% !!CVD_CODES) %>%       # !! inlines the local R vector into the Arrow query
      select(recnum, c_diag),                    # only join key and diagnosis code
    by = "recnum"                                             # join key in LPR2
  ) %>%
  collect() %>%                                              # HERE data is fetched into R
  mutate(icd3 = substr(c_diag, 2, 4))                        # save cleaned code as new column

# ── LPR3 (March 2019 and onwards) ─────────────────────────────────────────
lpr3_k <- open_dataset("synth_data/lpr_a_kontakt/")  %>% rename_with(tolower)   # LPR3 contact table — synthetic
lpr3_d <- open_dataset("synth_data/lpr_a_diagnose/") %>% rename_with(tolower)   # LPR3 diagnosis table — synthetic

lpr3_cvd <- lpr3_k %>%
  select(pnr, dw_ek_kontakt, date_contact = kont_starttidspunkt) %>%   # dw_ek_kontakt is join key to lpr_a_diagnose
  inner_join(
    lpr3_d %>%
      filter(diag_kode_type %in% c("A", "B"),
             is.na(senere_afkraeftet) | senere_afkraeftet != "Ja",  # exclude retracted diagnoses
             substr(diag_kode, 2, 4) %in% !!CVD_CODES) %>%   # !! inlines the local R vector into the Arrow query
      select(dw_ek_kontakt, c_diag = diag_kode),             # rename to c_diag for consistency with LPR2
    by = "dw_ek_kontakt"                                     # join key in LPR3
  ) %>%
  collect() %>%                                              # fetch into R
  mutate(
    date_contact = as.Date(date_contact),                    # datetime → date
    icd3         = substr(c_diag, 2, 4)                      # strip D-prefix: "DI21" → "I21"
  )

# ── Combine and save ────────────────────────────────────────────────────────
alle_cvd <- bind_rows(lpr2_cvd, lpr3_cvd)                   # stack LPR2 and LPR3

nrow(alle_cvd)                                               # check: number of diagnosis rows
length(unique(alle_cvd$pnr))                                 # check: number of unique individuals
table(alle_cvd$icd3)                                         # distribution across codes

dir.create("datasets", showWarnings = FALSE)                 # make sure the output folder exists
saveRDS(alle_cvd, "datasets/extract_cvd.rds")                # save; change path to your own folder

Wrap the pattern in a reusable function (for multiple outcomes)

If you extract diagnoses for several outcomes, it pays off to encapsulate the Approach 2 pattern in one reusable function rather than copying ~40 lines for each new outcome. Define it at the top of your script or in a separate functions.R file.

Advantages:
- One place to fix if something changes (e.g. a new register or a new column)
- The code block for each outcome is reduced from ~40 lines to one function call
- Errors can be introduced in only one place instead of in each copy

Tip

Working on DARTER (or another project with dstDataPrep)? Swap open_dataset("E:/workdata/.../<register>/") for load_database("<register>") — it resolves the path automatically. See DARTER — overview and pipeline for the fully adapted variant — it is kept up to date with the current, confirmed register names (as of June 2026).

See the full get_lpr_diagnoses() function and usage

library(arrow)
library(dplyr)

get_lpr_diagnoses <- function(pnr_vector, diagtypes = c("A", "B"), inpatient_only = FALSE) {
  base <- "E:/workdata/[projectnumber]/cleaned-data/parquet-registers/"

  # Open registers
  lpr_adm   <- open_dataset(paste0(base, "lpr_adm/"))   %>% rename_with(tolower)   # LPR2 somatic contacts
  lpr_diag  <- open_dataset(paste0(base, "lpr_diag/"))  %>% rename_with(tolower)   # LPR2 somatic diagnoses
  psyk_adm  <- open_dataset(paste0(base, "t_psyk_adm/"))  %>% rename_with(tolower) %>%
    rename(pnr = v_cpr, recnum = k_recnum)                            # LPR2 psychiatric contacts
  psyk_diag <- open_dataset(paste0(base, "t_psyk_diag/")) %>% rename_with(tolower) %>%
    rename(recnum = v_recnum)                                          # LPR2 psychiatric diagnoses
  lpr3_k    <- open_dataset(paste0(base, "lpr_a_kontakt/"))  %>% rename_with(tolower) %>%
    filter(lprindberetningssystem == "LPR3")                               # CRITICAL: avoid duplicated rows from LPR_F format
  lpr3_d    <- open_dataset(paste0(base, "lpr_a_diagnose/")) %>% rename_with(tolower)  # LPR3 diagnoses

  # Filter on admission type if desired (LPR2 somatic and LPR3 only; psyk_adm is not filtered here)
  if (inpatient_only) {
    lpr_adm <- lpr_adm %>% filter(c_pattype == "0")          # "0" = inpatient in LPR2
    lpr3_k  <- lpr3_k  %>% filter(kont_type == "ALCA00")     # "ALCA00" = inpatient in LPR3
  }

  # LPR2 somatic
  lpr2_dx <- lpr_adm %>%
    filter(pnr %in% !!pnr_vector) %>%
    select(pnr, recnum, date_contact = d_inddto) %>%
    inner_join(
      lpr_diag %>% filter(c_diagtype %in% !!diagtypes) %>% select(recnum, c_diag),
      by = "recnum"
    ) %>%
    collect() %>%
    mutate(icd3 = substr(c_diag, 2, 4))                       # strip D-prefix

  # LPR2 psychiatric
  lpr2_psyk_dx <- psyk_adm %>%
    filter(pnr %in% !!pnr_vector) %>%
    select(pnr, recnum, date_contact = d_inddto) %>%
    inner_join(
      psyk_diag %>% filter(c_diagtype %in% !!diagtypes) %>% select(recnum, c_diag),
      by = "recnum"
    ) %>%
    collect() %>%
    mutate(icd3 = substr(c_diag, 2, 4))

  # LPR3
  lpr3_dx <- lpr3_k %>%
    filter(pnr %in% !!pnr_vector) %>%
    select(pnr, dw_ek_kontakt, date_contact = kont_starttidspunkt) %>%
    inner_join(
      lpr3_d %>%
        filter(diag_kode_type %in% !!diagtypes,
               is.na(senere_afkraeftet) | senere_afkraeftet != "Ja") %>%
        select(dw_ek_kontakt, c_diag = diag_kode),
      by = "dw_ek_kontakt"
    ) %>%
    collect() %>%
    mutate(date_contact = as.Date(date_contact),               # datetime → date
           icd3 = substr(c_diag, 2, 4))

  bind_rows(lpr2_dx, lpr2_psyk_dx, lpr3_dx)                   # return combined table
}

Use the function — one call per extraction, only change CODES:

cohort    <- readRDS("datasets/full_cohort.rds")
pnr_list  <- unique(cohort$pnr)

# Part 1: fetch all diagnoses for the cohort (see hospital contacts page)
alle_dx <- get_lpr_diagnoses(
  pnr_vector    = pnr_list,
  diagtypes     = c("A", "B"),
  inpatient_only = FALSE
)
# Returns: pnr | date_contact | c_diag | icd3 (plus join keys recnum and dw_ek_kontakt)

# Part 2: extract one outcome; only change CODES
CODES <- c("F00", "F01", "F02", "F03", "G30", "G31")   # dementia

dementia <- alle_dx %>%
  filter(icd3 %in% CODES) %>%
  inner_join(cohort %>% select(pnr, index_date), by = "pnr") %>%
  filter(date_contact > index_date) %>%
  group_by(pnr) %>% arrange(date_contact) %>% slice(1) %>% ungroup() %>%
  select(pnr, dementia_date = date_contact)

result <- cohort %>% select(pnr) %>% left_join(dementia, by = "pnr")
saveRDS(result, "datasets/extract_dementia.rds")
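
With several outcomes, the per-outcome block can itself be looped over a named list of code vectors. The sketch below is one possible way to do that, assuming toy versions of alle_dx and cohort with made-up values; first_event() and outcome_codes are hypothetical names introduced here, not part of the guide's function:

```r
library(dplyr)

# Toy stand-ins for alle_dx and cohort (made-up values)
alle_dx <- tibble(
  pnr          = c("p1", "p1", "p2", "p3", "p3"),
  date_contact = as.Date(c("2015-03-01", "2018-06-10", "2016-01-05",
                           "2012-02-02", "2020-09-09")),
  icd3         = c("G30", "I21", "F03", "I21", "E11")
)
cohort <- tibble(
  pnr        = c("p1", "p2", "p3", "p4"),
  index_date = as.Date(rep("2014-01-01", 4))
)

# One entry per outcome (hypothetical list; adjust to your own definitions)
outcome_codes <- list(
  dementia = c("G30", "F00", "F01", "F02", "F03"),
  ihd      = c("I20", "I21", "I22", "I23", "I24", "I25")
)

# Hypothetical helper: first matching diagnosis after index date per person
first_event <- function(dx, cohort, codes) {
  dx %>%
    filter(icd3 %in% codes) %>%
    inner_join(cohort %>% select(pnr, index_date), by = "pnr") %>%
    filter(date_contact > index_date) %>%
    group_by(pnr) %>%
    slice_min(date_contact, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    select(pnr, event_date = date_contact)
}

# One result table per outcome; NA in event_date means no event
results <- lapply(outcome_codes, function(codes) {
  cohort %>%
    select(pnr) %>%
    left_join(first_event(alle_dx, cohort, codes), by = "pnr")
})

results$dementia
results$ihd
```

Each element of results has one row per person in the cohort, so it can be saved with saveRDS() under its own filename, for example by looping over names(results).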

Next steps

You have now extracted diagnoses from two LPR generations. Next steps are to shape and combine your extracts:

→ Phase 11 — Joins and pivots
