erDiagram
full_cohort ||--o| extract_dementia : pnr
full_cohort ||--o| extract_bef : pnr
full_cohort ||--o| extract_nmi : pnr
full_cohort {
string pnr PK
date index_date
int exposed
}
extract_dementia {
string pnr FK
date event_date
}
extract_bef {
string pnr FK
int age
int sex
}
extract_nmi {
string pnr FK
int nmi_score
}
Assemble your extracts
Joins and pivots β prepare your analysis dataset
You arrive at this phase with separate RDS files: your cohort from Phase 10, event dates from Phase 9, covariates from Phase 6 and socioeconomic variables from Phase 13. You leave with one analysis-ready dataset ready for Phase 14.
In short: Reshape each extract to one row per person (with pivot_wider() or group_by() + slice()), then join them on pnr with left_join() into one analysis-ready dataset.
Before you can analyse, you need to:
- Shape each extract into one row per person (most registers are in long format with many rows per person)
- Assemble them into one analysis-ready dataset with
left_join()
That is what this phase is about. The technique for step 1 is either pivot_wider() or group_by() + slice() β the choice depends on what you want out of the register. Step 2 is always left_join().
The code examples use generic paths and variable names. Adapt to your projectβs folder structure and column names.
What is an analysis-ready dataset?
An analysis-ready dataset has one row per person and one column per variable β what we call wide format:
pnr |
index_date |
censor_date |
followup_years |
event |
age |
sex |
nmi_score |
|---|---|---|---|---|---|---|---|
| 001 | 2015-03-01 | 2022-07-14 | 7.4 | 0 | 58 | 2 | 12 |
| 002 | 2016-11-20 | 2019-11-20 | 3.0 | 1 | 63 | 1 | 7 |
The example uses a cohort study with exposed, event and followup_years. For a prevalence study the column structure will look different β typically without follow-up time, but with a cross-sectional date and covariates measured at that point. The techniques in this phase (joins, pivots, group_by() + slice()) are the same regardless of study design.
The registers you extract from β LPR, LMDB, BEF β are almost always in long format with many rows per person. This phase shows how to reshape and assemble them.
What have you already saved?
Each extraction from Phases 6, 9 and 12 etc. should be saved as its own RDS file β one register, one topic. Example of what you typically have:
| File | Contents | Key columns |
|---|---|---|
full_cohort.rds |
Cohort with index date | pnr, index_date, exposed |
extract_dementia.rds |
First dementia diagnosis | pnr, event_date (NA = none) |
extract_deaths.rds |
Date of death | pnr, death_date (NA = alive) |
extract_emigration.rds |
First emigration | pnr, emigration_date (NA = not emigrated) |
extract_bef.rds |
Demographics at index | pnr, age, sex |
extract_nmi.rds |
Comorbidity at baseline | pnr, nmi_score |
extract_ses.rds |
Socioeconomic variables | pnr, education, income, occupation |
The goal of this phase is to assemble them into one dataset with left_join() β one new column per file. The cohort is the backbone: each extract is joined on via pnr and adds its columns β all cohort members are kept, and people with no match get NA.
Assemble only what you need. It is tempting to keep all columns βjust in caseβ. Do not. Always end with select() and explicitly choose the columns the analysis requires. A dataset with only the necessary columns is far easier to debug and understand.
Two tools for shaping data
The two central operations in this phase are pivots and joins β they sound similar but do fundamentally different things:
| Pivots | Joins | |
|---|---|---|
| What it does | Reshapes one table β changes the form, not the content | Combines two tables β adds columns from another table |
| Input | One table | Two tables with a shared key (typically pnr) |
| Output | Same data, new shape | Wider table with columns from both |
| Used for | Long β wide format (pivot_wider) or wide β long (pivot_longer) |
Assemble cohort + outcomes + covariates into one dataset |
Rule of thumb: Use pivot when you want to change the shape of one table. Use join when you want to combine two tables.
Wide vs. long format
Imagine three persons with weight measurements at two time points.
Wide format β one row per person, one column per time point:
| pnr | weight_baseline | weight_6mo |
|---|---|---|
| 001 | 120 | 95 |
| 002 | 105 | 88 |
| 003 | 98 | 80 |
Long format β one row per observation, one column for time point and one for value:
| pnr | timepoint | weight |
|---|---|---|
| 001 | baseline | 120 |
| 001 | 6mo | 95 |
| 002 | baseline | 105 |
| 002 | 6mo | 88 |
| 003 | baseline | 98 |
| 003 | 6mo | 80 |
Both formats contain the same information β they are just arranged differently.
Which formats are where?
| Register | Format | Why |
|---|---|---|
| BEF | Long | One row per person per snapshot β quarterly since 2008, otherwise annual |
| LPR (lpr_adm + lpr_diag) | Long | One row per diagnosis per contact β one admission can have 10 diagnoses |
| LMDB | Long | One row per dispensed prescription β the same person can have hundreds |
| DODSAARS | Wide | One row per deceased person β already one-per-person |
| Your analysis dataset | Wide | One row per person β one column per covariate |
Registers are usually in long format. Your analysis dataset should typically be in wide format.
Pivots β switch between long and wide
Long β wide with pivot_wider()
library(tidyr) # pivot_wider, pivot_longer
df_wide <- df_long %>%
pivot_wider(
names_from = "timepoint", # which column should become column names?
values_from = "weight" # which column contains the values?
)You CANNOT pivot_wider() an entire LPR table. LPR potentially has hundreds of unique ICD codes per person β pivot_wider() would give you one column per code, i.e. thousands of columns and a dataset that is impossible to work with.
Instead use the pattern from Phase 9: extract the first date for one specific diagnosis group, save as extract_dementia.rds, and join it as one column with left_join(). One outcome variable = one column = one RDS file.
Practical example: employment status from AKM β long to wide
AKM (the Labour Classification Module) stores one row per person per year β long format. If you want to compare employment status at baseline and follow-up side by side, you can pivot:
library(dplyr) # filter, mutate
library(tidyr) # pivot_wider
# AKM in long format β two years per person:
akm_long <- data.frame(
pnr = c("001", "001", "002", "002", "003", "003"),
aar = c(2015, 2020, 2015, 2020, 2015, 2020),
socio = c(110, 110, 210, 310, 150, 150) # employment categories from AKM (socio13)
)
# Convert to wide β one column per year:
akm_wide <- akm_long %>%
pivot_wider(
names_from = aar,
values_from = socio,
names_prefix = "socio_" # gives columns: socio_2015, socio_2020
)
# Result:
# pnr socio_2015 socio_2020
# 001 110 110
# 002 210 310
# 003 150 150In practice you would typically not pivot AKM β you filter to one specific year with filter(aar == index_year) and use group_by() + slice() below. Pivot is relevant when you explicitly want two time points side by side in the analysis.
Wide β long with pivot_longer()
df_long <- df_wide %>%
pivot_longer(
cols = c(weight_baseline, weight_6mo), # columns to "fold in"
names_to = "timepoint", # name for the new timepoint column
values_to = "weight" # name for the new value column
)Prepare for join β group_by() + slice()
left_join() behaves predictably when both tables have one row per pnr. Many register extracts have multiple rows per person (long format). group_by() + slice() is the primary way to reduce a dataset to one representative row per person before joining β it is preparation for the join, not the join itself.
The pattern is always the same: group by pnr, sort so the desired row is on top, take the first row per group.
# BEF: keep only the newest record per person (BEF has one snapshot per year)
bef_one_per_person <- bef %>%
group_by(pnr) %>% # group: all rows with the same pnr are treated together
arrange(desc(aar)) %>% # sort within group: newest year first (desc = descending)
slice(1) %>% # take only the first row per group β i.e. the newest year
ungroup() # IMPORTANT: release grouping when doneAlways remember ungroup(). group_by() puts a βgrouped_dfβ tag on your data.frame, which all subsequent operations inherit. That means: - mutate() calculates per group, not across the entire dataset - n() counts rows per group, not in total - summarize() gives one result per group
These errors are hard to detect because the code runs without error messages and produces seemingly reasonable numbers. ungroup() removes the tag and makes the data.frame normal again. Rule of thumb: always end a group_by() chain with ungroup().
# LPR / alle_dx: find first diagnosis per person (for outcome definition)
first_dementia <- alle_dx %>%
filter(icd3 %in% c("G30", "F00", "F01", "F02", "F03")) %>% # only dementia codes
group_by(pnr) %>% # group per person
arrange(date_contact) %>% # oldest date first (ascending β no desc)
slice(1) %>% # first diagnosis per person
ungroup() # release grouping
# β ready for: cohort %>% left_join(first_dementia, by = "pnr")# VNDS: find first emigration per person (a person may have emigrated and returned)
first_emigration <- vnds %>%
filter(indud_kode == "U") %>% # only emigration events
group_by(pnr) %>%
arrange(haend_dato) %>%
slice(1) %>%
ungroup()When do you use group_by() + slice() vs pivot_wider()? If you want one representative value per person (newest snapshot, first event, earliest date) β use group_by() + slice(). If you want two or more time points side by side as separate columns in the analysis β use pivot_wider(). In register-based research, group_by() + slice() is by far the most common choice.
Joins β link two tables
In register-based research you almost always work with data from two or more tables that need to be linked. All joins in dplyr take two tables and one key column they share (e.g. pnr).
Your extracts typically all share pnr as the key β that is what you join on. A simple ER diagram of the keys:
pnr is the primary key (PK) in the cohort and a foreign key (FK) in each extract. ||--o| means: each person in the cohort has zero or one matching row in the extract.
The joins you will use
library(dplyr) # left_join, inner_join, semi_join, anti_join
# left_join: keep ALL rows from x, add columns from y
# Persons in x without a match in y get NA.
result <- cohort %>%
left_join(outcome, by = "pnr") # all cohort members retained; outcome = NA if no event
# inner_join: keep only rows with a match in BOTH tables
result <- lpr_adm %>%
inner_join(lpr_diag, by = "recnum") # only contacts with at least one diagnosis
# right_join: keep all rows from y (rarely used)
# full_join: keep all rows from both (rarely used)Filter joins β change rows only, not columns
# semi_join: keep rows in x THAT HAVE a match in y
bef_in_cohort <- bef %>%
semi_join(cohort, by = "pnr") # only BEF rows for cohort members
# anti_join: keep rows in x THAT DO NOT HAVE a match in y
bef_not_in_cohort <- bef %>%
anti_join(cohort, by = "pnr") # only BEF rows for non-cohort members
# also used for diagnostics β see "Validate your join" belowOverview: all joins in one table
| Function | Keeps rows from | Used for |
|---|---|---|
left_join(x, y) |
All from x | Add outcomes/covariates to cohort β NA if no match |
inner_join(x, y) |
Only matches in both | Join contacts and diagnoses β only want rows with both |
right_join(x, y) |
All from y | Rarely |
full_join(x, y) |
All from both | Rarely |
semi_join(x, y) |
Only x with match in y | Filter a register to only cohort members |
anti_join(x, y) |
Only x without match in y | Find everyone in BEF NOT in the cohort |
Join with multiple keys
When two tables share more than one column as key:
# Join on pnr AND year
bef_akm <- bef %>%
left_join(akm, by = c("pnr", "aar"))Validate your join
Joins rarely fail with an error message β but they can produce silently wrong results. Always check these three things after a left_join():
# Before join: what is nrow?
nrow(cohort) # e.g. 4,823
# Join:
cohort2 <- cohort %>% left_join(outcome, by = "pnr")
# Check 1: did the row count increase? Duplicated keys in outcome give extra rows.
nrow(cohort2) # should still be 4,823
# Check 2: how many got a match?
sum(!is.na(cohort2$event_date)) # number with outcome
sum(is.na(cohort2$event_date)) # number without β expected censored?
# Check 3: who did NOT match? (diagnostic β not necessarily an error)
missing <- anti_join(cohort, outcome, by = "pnr") # pnr's not found in outcome
nrow(missing) # 0 = all matched; > 0 = investigateanti_join() as diagnostics anti_join(cohort, outcome, by = "pnr") returns the cohort members that are not found in the outcome table. Use it for βwho in my cohort has no death date record?β (all alive β expected) or βwho is missing from register X?β (unexpected β investigate).
Stack tables vertically β bind_rows()
bind_rows() and left_join() solve two fundamentally different problems:
bind_rows() |
left_join() |
|
|---|---|---|
| What it does | Stacks tables vertically β adds rows | Combines tables horizontally β adds columns |
| Requires | That the columns have the same names | That the tables share one common key column |
| Result | More rows, same number of columns | Same number of rows, more columns |
| Matches on key? | No β all rows from both tables | Yes β match on pnr or other key |
| Typical use | LPR2 + LPR3 β one combined register | Link outcome or covariate to cohort table |
bind_rows() is not a join. It does not look at pnr and does not match. It simply places table 2βs rows below table 1βs rows β like pasting two Excel sheets together vertically.
# bind_rows: LPR2 and LPR3 have the same columns (pnr, date_contact, icd3)
# β stack them vertically into one combined diagnosis register
lpr2_dx # 45,000 rows β diagnoses up to March 2019
lpr3_dx # 32,000 rows β diagnoses from March 2019 onwards
alle_dx <- bind_rows(lpr2_dx, lpr3_dx) # 77,000 rows β both periods combinedColumns missing in one table (e.g. a column that only exists in LPR3) are automatically filled with NA for rows from the other table.
When do you use which? bind_rows() β when you have two versions of the same register (LPR2 + LPR3, exposed + comparators) and want to merge them into one. left_join() β when you want to attach a new variable (outcome, date of death, age) to your cohort table.
Practical examples: join date of death and emigration to the cohort
Date of death (DODSAARS)
# Cohort data (one row per person):
kohort <- readRDS("datasets/full_cohort.rds")
# Death dates from DODSAARS:
dodsaars <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/dodsaars/") %>%
rename_with(tolower) %>% # standardise column names
filter(pnr %in% !!kohort$pnr) %>% # only the cohort's pnr's
select(pnr, death_date = d_dodsdto) %>% # rename d_dodsdto to death_date
collect() # fetch into R
# Join: all cohort members retained; the living get death_date = NA
cohort_with_death <- kohort %>%
left_join(dodsaars, by = "pnr") # attach death date β NA = still aliveEmigration date (VNDS)
Emigration censors just like death β the person leaves the study on the day they emigrate. VNDS contains one row per migration event; use only "U" (emigration) and take the first date per person.
# Emigration dates from VNDS:
vnds <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/vnds/") %>%
rename_with(tolower) %>% # standardise column names
filter(pnr %in% !!kohort$pnr,
indud_kode == "U") %>% # only emigration events
select(pnr, emigration_date = haend_dato) %>% # rename haend_dato to emigration_date
collect() %>% # fetch into R
group_by(pnr) %>% # group to find first emigration
arrange(emigration_date) %>% # oldest date first
slice(1) %>% # first emigration per person
ungroup() # release grouping
# Join: all cohort members retained; non-emigrants get emigration_date = NA
cohort_with_emigration <- kohort %>%
left_join(vnds, by = "pnr") # NA = never emigrated during the study periodCalculate follow-up time and event variable
Before you can analyse, each cohort member needs a censoring date (when follow-up ends) and an event variable (did they experience the outcome?).
The censoring date is the earliest of: event date, date of death, emigration date and end of study period.
study_end <- as.Date("2024-12-31") # replace with your actual study end date
# format: "yyyy-mm-dd" (ISO 8601 β R's standard)
kohort <- kohort %>%
mutate(
# Censoring date = the earliest of all possible stopping reasons
censor_date = pmin(event_date, death_date, emigration_date,
study_end, na.rm = TRUE),
# Follow-up time in years
followup_years = as.numeric(censor_date - index_date) / 365.25,
# Event variable: 1 = outcome occurred before censoring, 0 = censored
event = as.integer(!is.na(event_date) & event_date <= censor_date)
)pmin() compares vectors position by position and returns the smallest value per person β it is the vectorised version of min(). na.rm = TRUE ensures that a missing death date (= alive) does not make the censoring date NA.
Build your study population
You now have all the building blocks β extractions (Phase 6), registers (Phase 8), LPR/SKS (Phase 9) and joins (above) β to assemble your study population. It happens in four steps:
- Define the exposed cohort. Filter the register that defines the exposure β e.g. bariatric surgery via SKS codes in
lpr_sksopr/procedurer_kirurgi(Phase 9). Keeppnrand index date (the surgery date). - Build the comparator cohort. Match controls to the exposed on e.g. age, sex and calendar year. Controls are assigned the matched exposed personβs index date.
- Apply the same inclusion/exclusion criteria to both groups, and assemble them (
bind_rows()). - Link outcomes and covariates with
left_join()(one row per person), calculate follow-up time, and end withselect().
End with select() β keep only the columns the analysis requires.
cohort_final <- kohort %>%
select(pnr, index_date, censor_date, followup_years, event,
alder, koen, nmi_score, occupation_cat, education_cat, income_cat)
saveRDS(cohort_final, "datasets/analysis_dataset.rds") # save the final datasetDesign choices (who is the comparator cohort, immortal time, matching ratio) belong to the planning stage β see Phase 1 β The comparator cohort. The actual cohort building and matching with heaven::riskSetMatch() β including pitfalls such as immortal time β are in Phase 10 β Build your study population.
Complete recipe β from cohort to analysis-ready dataset
# Complete recipe: all steps in order
# 0. Load your cohort (pnr + index_date, built in Phase 6/9)
kohort <- readRDS("datasets/full_cohort.rds") # one row per person
# 1. Exclude prevalent cases BEFORE outcome extraction (see Phase 9)
# ... cohort_clean <- kohort %>% anti_join(prevalent, by = "pnr")
# 2. Link outcome (e.g. first dementia diagnosis after index date)
outcome <- readRDS("datasets/extract_dementia.rds") # pnr + event_date (NA = no event)
kohort <- kohort_clean %>% left_join(outcome, by = "pnr")
# 3. Link censoring: death date and emigration
deaths <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/dodsaars/") %>%
rename_with(tolower) %>%
filter(pnr %in% !!kohort$pnr) %>%
select(pnr, death_date = d_dodsdto) %>% collect()
vnds_data <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/vnds/") %>%
rename_with(tolower) %>%
filter(pnr %in% !!kohort$pnr, indud_kode == "U") %>%
select(pnr, emigration_date = haend_dato) %>% collect() %>%
group_by(pnr) %>% arrange(emigration_date) %>% slice(1) %>% ungroup()
kohort <- kohort %>%
left_join(deaths, by = "pnr") %>%
left_join(vnds_data, by = "pnr")
# 4. Calculate follow-up time and event variable
study_end <- as.Date("2024-12-31")
kohort <- kohort %>%
mutate(
censor_date = pmin(event_date, death_date, emigration_date, study_end, na.rm = TRUE),
followup_years = as.numeric(censor_date - index_date) / 365.25,
event = as.integer(!is.na(event_date) & event_date <= censor_date)
)
# 5. Link covariates (demographic, SES, comorbidity)
bef_data <- readRDS("datasets/extract_bef.rds") # age, sex from BEF
ses_data <- readRDS("datasets/extract_ses.rds") # education, income, employment
nmi_data <- readRDS("datasets/extract_nmi.rds") # NMI score
kohort <- kohort %>%
left_join(bef_data, by = "pnr") %>%
left_join(ses_data, by = "pnr") %>%
left_join(nmi_data, by = "pnr")
# 6. Keep ONLY the columns the analysis requires
cohort_final <- kohort %>%
select(pnr, index_date, censor_date, followup_years, event,
alder, koen, nmi_score, occupation_cat, education_cat, income_cat)
saveRDS(cohort_final, "datasets/analysis_dataset.rds") # save the final dataset
nrow(cohort_final) # verify number of persons
names(cohort_final) # verify column namesNext steps
You can now shape and combine your extracts. Next step is the socioeconomic variables:
See also
- Phase 15 β Functions: overview β all functions for transforming data
- Phase 9 β Hospital contacts (LPR) β practical example of long register data and inner_join