First extraction
From register to analysis-ready data, step by step
This page shows the complete process from a raw register to a saved dataset ready for analysis. You will see the pattern that recurs in almost every register extraction.
These examples cannot be run on the DST server. The synthetic dataset (fakeregs) is not available there. You need RStudio installed locally on your computer:
- Download R: cran.r-project.org
- Download RStudio: posit.co/download/rstudio-desktop
- Open RStudio, create a new script (File → New File → R Script), and copy the code there.
When you are ready to work with real register data, you use the same pattern, but on the DST server and with your project's register data.
The examples use synthetic data from the package fakeregs (MIT licence, Anders Aasted Isaksen, Steno Diabetes Center Aarhus): fictitious persons with the structure and column names of DST registers. The package uses generate_*() functions to create synthetic data, which you save as parquet and practise open_dataset() on.
Next step: hospital diagnoses (LPR)
This example uses lpr_adm (hospital contacts: contact dates only, no diagnosis join). When working with diagnosis data from LPR (ICD codes), additional rules apply: the register is split into two periods (LPR2 and LPR3), codes have a D-prefix that must be removed, and you must choose diagnosis types. All of this is covered in Phase 9 - Hospital contacts (LPR).
Preparation: install fakeregs and generate synthetic data
fakeregs is not on CRAN, so install it directly from GitHub:
install.packages("pak") # only the first time
pak::pak("steno-aarhus/fakeregs") # install fakeregs
library(fakeregs) # synthetic DST register data
library(dplyr) # filter, select, mutate, collect
library(arrow) # open_dataset, write_parquet
# Generate synthetic data and save as parquet (done only once)
bp <- generate_background_pop() # synthetic background population
bef_synth <- generate_bef(background_df = bp) # synthetic BEF register
lpr_synth <- generate_lpr_adm(background_df = bp) # synthetic LPR contact register
dir.create("synth_data/bef", recursive = TRUE, showWarnings = FALSE) # create folders
dir.create("synth_data/lpr_adm", recursive = TRUE, showWarnings = FALSE)
write_parquet(bef_synth, "synth_data/bef/bef.parquet") # save as parquet
write_parquet(lpr_synth, "synth_data/lpr_adm/lpr_adm.parquet")
The path is relative to your working directory. "synth_data/bef/" is created in the folder R is set to work in; check which one with getwd(). To save elsewhere, use a full path: "C:/Users/yourname/projects/synth_data/bef/".
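To see where a relative path will end up before writing anything, a minimal base-R sketch such as the following may help. It only builds and prints paths using the folder name from this guide; nothing is created on disk.

```r
# Sketch: resolve a relative path against the working directory (base R only).
wd <- getwd()                               # folder R is currently working in
rel_path <- file.path("synth_data", "bef")  # relative path used in this guide
abs_path <- file.path(wd, rel_path)         # where the folder would be created

print(abs_path)

# dir.exists() tells you whether the folder is already there
dir.exists(abs_path)
```

If the printed path is not where you want the data, either change the working directory or pass a full path to dir.create() and write_parquet().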
You can now practise open_dataset() exactly as on the DST server.
Step 1 - Define your study population
Every extraction starts with a list of pnrs, the people you want data for. In practice this comes from your cohort script. Here we build a small practice cohort directly from the BEF register.
step1_cohort.R
# Open BEF - lazy connection, no data in R yet
bef_data <- open_dataset("synth_data/bef/") %>%
  rename_with(tolower)
# Fetch 200 random pnrs as your cohort
cohort_pnrs <- bef_data %>%
  filter(year == 2015) %>% # take one snapshot year
  select(pnr) %>%
  collect() %>% # HERE data is fetched into R
  slice_sample(n = 200) %>%
  pull(pnr)
length(cohort_pnrs) # check: should return 200
In a real project, cohort_pnrs is a vector you built in a previous script and reload with readRDS("datasets/full_cohort.rds") %>% pull(pnr).
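Because slice_sample() draws at random, the practice cohort changes every time the script runs. If you want the same 200 pnrs on each run, one option is to fix the random seed first. The sketch below shows the idea on a small made-up population (toy_pop and the seed value 2015 are hypothetical, chosen only for illustration):

```r
library(dplyr)

# Hypothetical toy population: 50 fake identifiers
toy_pop <- data.frame(pnr = sprintf("P%03d", 1:50))

set.seed(2015)                       # fix the random number generator
first_draw <- toy_pop %>%
  slice_sample(n = 5) %>%
  pull(pnr)

set.seed(2015)                       # same seed again
second_draw <- toy_pop %>%
  slice_sample(n = 5) %>%
  pull(pnr)

identical(first_draw, second_draw)   # TRUE: same seed, same sample
```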
Recode BEF variables: what do koen, civst and reg mean?
The code below was written by Anders Aasted Isaksen (Steno Diabetes Center Aarhus) and is taken directly from the vignette common_tasks_dplyr.qmd in the package fakeregs (MIT licence). The code is reproduced unchanged.
The BEF register stores koen, civst and reg as codes β not as text. This recoding translates them into analysis-ready variables:
# Continuing from Step 1 - bef_data is already opened with open_dataset()
bef_clean <- bef_data %>%
  filter(year == 2015, alder >= 18) %>%
  select(pnr, year, foed_dag, koen, civst, reg, opr_land) %>%
  collect() %>% # fetch into R before mutate
  mutate(
    foed_dato = as.Date(foed_dag), # date format
    # koen: 1 = male, 2 = female (Anders Aasted Isaksen, fakeregs)
    koen_text = if_else(koen == "1", "Male", "Female"),
    # civst: marital status (Anders Aasted Isaksen, fakeregs)
    civil_status = case_when(
      civst %in% c("G", "P") ~ "Married/partner",
      civst %in% c("F", "O", "E", "L") ~ "Divorced/widowed",
      civst == "U" ~ "Single",
      TRUE ~ NA_character_
    ),
    # reg: region codes (Anders Aasted Isaksen, fakeregs)
    region = case_when(
      reg == 81 ~ "Region Nordjylland",
      reg == 82 ~ "Region Midtjylland",
      reg == 83 ~ "Region Syddanmark",
      reg == 84 ~ "Region Hovedstaden",
      reg == 85 ~ "Region Sjælland",
      TRUE ~ NA_character_
    ),
    # opr_land: 5100 = Denmark (Anders Aasted Isaksen, fakeregs)
    immigrant = opr_land != 5100
  )
head(bef_clean)
Source: fakeregs/vignettes/common_tasks_dplyr.qmd, Anders Aasted Isaksen, Steno Diabetes Center Aarhus (MIT licence).
This is only an excerpt. Isaksen's full vignette is more thorough and covers additional variables and patterns; see it directly here: steno-aarhus.github.io/fakeregs/articles/common_tasks_dplyr.html
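After any recoding it can be worth cross-tabulating the original code against the new label, so that unexpected codes become visible. The sketch below is not part of the vignette; it applies the same koen and civst rules to a made-up data frame (toy_bef) that deliberately contains the hypothetical odd codes "9" and "X":

```r
library(dplyr)

# Hypothetical toy data with two deliberately unexpected codes ("9" and "X")
toy_bef <- data.frame(
  koen  = c("1", "2", "2", "9"),
  civst = c("G", "U", "X", "F")
)

toy_clean <- toy_bef %>%
  mutate(
    koen_text = if_else(koen == "1", "Male", "Female"),
    civil_status = case_when(
      civst %in% c("G", "P") ~ "Married/partner",
      civst %in% c("F", "O", "E", "L") ~ "Divorced/widowed",
      civst == "U" ~ "Single",
      TRUE ~ NA_character_
    )
  )

# One row per (code, label) combination: check each pairing by eye
count(toy_clean, koen, koen_text)
count(toy_clean, civst, civil_status)
```

In this toy example the code "9" ends up labelled "Female", because if_else() sends everything that is not "1" to the second label, while the unknown civst code "X" becomes NA. Whether such codes occur in your own data is something only the cross-table can tell you.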
Step 2 - Extract data from a register
Now we extract hospital contacts from lpr_adm for our cohort. The pattern is always the same: open → filter → select columns → collect.
step2_extraction.R
# Open lpr_adm - lazy connection
lpr_adm <- open_dataset("synth_data/lpr_adm/") %>%
  rename_with(tolower)
# Extract: filter BEFORE collect - otherwise the whole register is read into memory and the session may crash
contacts <- lpr_adm %>%
  filter(pnr %in% !!cohort_pnrs) %>% # only our cohort
  select(pnr, recnum, d_inddto) %>% # only the columns we use
  collect() # HERE data is moved into R
nrow(contacts) # how many contact rows?
head(contacts) # the first six rows
What happened?
- open_dataset() opened a lazy connection: no data in R yet
- filter() and select() sent instructions to Arrow/DuckDB: still no data in R
- collect() executed the query and fetched only the necessary rows into R
See Extracting data step by step for a detailed explanation of lazy evaluation.
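The lazy behaviour described above can be observed directly. The following is a minimal sketch on a tiny made-up parquet file written to a temporary folder (toy, toy_ids and the folder name are hypothetical); it assumes the arrow and dplyr packages are installed:

```r
library(dplyr)
library(arrow)

# Write a small hypothetical register to a temporary folder
tmp_dir <- file.path(tempdir(), "toy_lpr_adm")
dir.create(tmp_dir, recursive = TRUE, showWarnings = FALSE)
toy <- data.frame(
  pnr      = c("A", "A", "B", "C"),
  recnum   = 1:4,
  d_inddto = as.Date(c("2015-01-10", "2016-03-02", "2015-07-21", "2017-11-05"))
)
write_parquet(toy, file.path(tmp_dir, "toy.parquet"))

toy_ids <- c("A", "C")                 # hypothetical cohort

# Build the query: nothing is read into R at this point
query <- open_dataset(tmp_dir) %>%
  filter(pnr %in% !!toy_ids) %>%
  select(pnr, d_inddto)

class(query)                           # a query object, not a data frame
is.data.frame(query)                   # FALSE

# collect() runs the query and returns the matching rows
result <- collect(query)
is.data.frame(result)                  # TRUE
nrow(result)                           # 3 rows: two for "A", one for "C"
```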
Test on a small sample first. Before running a heavy extraction on the full cohort, test the code on a few people or rows; this catches errors quickly without waiting. E.g. filter(pnr %in% !!head(cohort_pnrs, 10)), or head(100) %>% collect() while building the code, so that the row limit is applied before the data is fetched into R.
Step 3 - Build analysis variables
Add variables with mutate() after collect() β now you are in R and can use all functions.
contacts <- contacts %>%
  mutate(
    date = as.Date(d_inddto), # explicit date class
    year = as.integer(format(date, "%Y")) # year from contact date
  )
Step 4 - Save and reload
Save with saveRDS() so the next script can reload it without re-running all the extractions.
step4_save.R
dir.create("datasets", showWarnings = FALSE) # create the folder if it does not exist yet
saveRDS(contacts, "datasets/extract_contacts.rds") # save to disk - change path to your own folder
# Reload in the next script:
contacts <- readRDS("datasets/extract_contacts.rds")
If you do not write a full path, the file is saved in your working directory. Run getwd() to see which folder that is.
The datasets/ folder is stored locally on the DST server only. Intermediate results are repatriated via output control; see 16 - Export and repatriation.
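One reason to prefer saveRDS() over a text format for intermediate files is that column classes such as Date survive the round trip. A minimal base-R sketch, using a made-up data frame (toy_contacts) and a temporary folder rather than your project folder:

```r
# Hypothetical toy extract with a Date column
toy_contacts <- data.frame(
  pnr  = c("A", "B"),
  date = as.Date(c("2015-01-10", "2016-03-02"))
)

# Save to a temporary folder (stand-in for datasets/)
out_dir <- file.path(tempdir(), "datasets")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
out_file <- file.path(out_dir, "toy_contacts.rds")
saveRDS(toy_contacts, out_file)

# Reload and compare with the original
reloaded <- readRDS(out_file)
all.equal(toy_contacts, reloaded)   # TRUE: same values
class(reloaded$date)                # "Date": the class is preserved
```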
Inspect the result
Right after an extraction you should check that you got what you expected:
head(contacts) # the first six rows - does it look right?
nrow(contacts) # number of rows - as expected?
length(unique(contacts$pnr)) # how many unique individuals?
colSums(is.na(contacts)) # missing values per column
class(contacts$d_inddto) # is the date column Date? (not character)
If your extraction includes exclusion steps, e.g. "remove persons with an early diagnosis", it is good practice to count N for each step. Replace raw and clean with your own variable names:
# Template - replace raw and clean with your own variable names:
cat("Raw extraction: ", nrow(raw), "\n") # all rows before exclusion
cat("After exclusions: ", nrow(clean), "\n") # after each step
cat("Excluded in total: ", nrow(raw) - nrow(clean), "\n") # difference
These lines cannot be run with the synthetic practice data; they are a template for use when working with your own data and an exclusion sequence. The pattern is repeated for each exclusion step and forms the basis for a CONSORT flow diagram (a standardised flow diagram showing how many were excluded at each step and why).
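To make the counting pattern concrete, here is a runnable sketch on a made-up data frame. The column early_diagnosis and the two exclusion rules are hypothetical and serve only to illustrate how N is recorded after each step:

```r
library(dplyr)

# Hypothetical toy data: one row per person
raw <- data.frame(
  pnr             = c("A", "B", "C", "D", "E"),
  alder           = c(17, 25, 40, 63, 71),
  early_diagnosis = c(FALSE, TRUE, FALSE, FALSE, TRUE)
)

step1 <- raw   %>% filter(alder >= 18)        # exclusion 1: under 18
step2 <- step1 %>% filter(!early_diagnosis)   # exclusion 2: early diagnosis

# Collect N after each step in one small table
flow <- data.frame(
  step = c("Raw extraction", "Aged 18 or older", "No early diagnosis"),
  n    = c(nrow(raw), nrow(step1), nrow(step2))
)
flow$excluded <- c(NA, -diff(flow$n))          # persons removed at each step
print(flow)
```

Keeping the counts in a table rather than in separate cat() calls makes it easier to reuse them later, for example as input to a flow diagram.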
This is a quick sanity check. The full toolkit for exploring data (summary(), table(), cross-tables and NA handling) is covered in Phase 7 - Inspect your data.
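If you would rather have the script stop when a check fails than inspect printed output by eye, the same checks can be written as assertions. This is one possible sketch in base R; toy_contacts and expected_ids are made-up stand-ins for your extract and your cohort vector:

```r
# Hypothetical toy extract and cohort
toy_contacts <- data.frame(
  pnr      = c("A", "A", "B"),
  d_inddto = as.Date(c("2015-01-10", "2016-03-02", "2015-07-21"))
)
expected_ids <- c("A", "B")

# stopifnot() raises an error at the first condition that is not TRUE
stopifnot(
  nrow(toy_contacts) > 0,                    # the extract is not empty
  all(toy_contacts$pnr %in% expected_ids),   # only cohort members are present
  !anyNA(toy_contacts$d_inddto),             # no missing contact dates
  inherits(toy_contacts$d_inddto, "Date")    # the date column has class Date
)
```

Which conditions are appropriate depends on the register; missing dates, for instance, may be legitimate in some extracts.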
Next steps
You have now made a complete extraction and saved it. The next steps are to learn to explore data thoroughly and to extend the pattern:
- Phase 7 - Inspect your data: the full toolkit for understanding what you have
- Phase 9 - Hospital contacts (LPR): the complete pattern for diagnosis extractions
- Phase 11 - Joins and pivots: combine two extracts
Source and adaptation
Step 1 (cohort construction from BEF) is adapted from Anders Aasted Isaksen's vignette common_tasks_dplyr.qmd in the fakeregs package (MIT licence, Steno Diabetes Center Aarhus). Steps 2 to 4 and the checklist are written for this guide.