Assemble your extracts

Joins and pivots β€” prepare your analysis dataset

Published

June 6, 2026

You arrive at this phase with separate RDS files: your cohort from Phase 10, event dates from Phase 9, covariates from Phase 6 and socioeconomic variables from Phase 13. You leave with one analysis-ready dataset ready for Phase 14.

Tip

In short: Reshape each extract to one row per person (with pivot_wider() or group_by() + slice()), then join them on pnr with left_join() into one analysis-ready dataset.

Before you can analyse, you need to:

  1. Shape each extract into one row per person (most registers are in long format with many rows per person)
  2. Assemble them into one analysis-ready dataset with left_join()

That is what this phase is about. The technique for step 1 is either pivot_wider() or group_by() + slice() β€” the choice depends on what you want out of the register. Step 2 is always left_join().

Note

The code examples use generic paths and variable names. Adapt to your project’s folder structure and column names.


What is an analysis-ready dataset?

An analysis-ready dataset has one row per person and one column per variable β€” what we call wide format:

pnr index_date censor_date followup_years event age sex nmi_score
001 2015-03-01 2022-07-14 7.4 0 58 2 12
002 2016-11-20 2019-11-20 3.0 1 63 1 7
Note

The example uses a cohort study with exposed, event and followup_years. For a prevalence study the column structure will look different β€” typically without follow-up time, but with a cross-sectional date and covariates measured at that point. The techniques in this phase (joins, pivots, group_by() + slice()) are the same regardless of study design.

The registers you extract from β€” LPR, LMDB, BEF β€” are almost always in long format with many rows per person. This phase shows how to reshape and assemble them.

What have you already saved?

Each extraction from Phases 6, 9 and 12 etc. should be saved as its own RDS file β€” one register, one topic. Example of what you typically have:

File Contents Key columns
full_cohort.rds Cohort with index date pnr, index_date, exposed
extract_dementia.rds First dementia diagnosis pnr, event_date (NA = none)
extract_deaths.rds Date of death pnr, death_date (NA = alive)
extract_emigration.rds First emigration pnr, emigration_date (NA = not emigrated)
extract_bef.rds Demographics at index pnr, age, sex
extract_nmi.rds Comorbidity at baseline pnr, nmi_score
extract_ses.rds Socioeconomic variables pnr, education, income, occupation

The goal of this phase is to assemble them into one dataset with left_join() β€” one new column per file. The cohort is the backbone: each extract is joined on via pnr and adds its columns β€” all cohort members are kept, and people with no match get NA.

Important

Assemble only what you need. It is tempting to keep all columns β€œjust in case”. Do not. Always end with select() and explicitly choose the columns the analysis requires. A dataset with only the necessary columns is far easier to debug and understand.


Two tools for shaping data

The two central operations in this phase are pivots and joins β€” they sound similar but do fundamentally different things:

Pivots Joins
What it does Reshapes one table β€” changes the form, not the content Combines two tables β€” adds columns from another table
Input One table Two tables with a shared key (typically pnr)
Output Same data, new shape Wider table with columns from both
Used for Long β†’ wide format (pivot_wider) or wide β†’ long (pivot_longer) Assemble cohort + outcomes + covariates into one dataset

Rule of thumb: Use pivot when you want to change the shape of one table. Use join when you want to combine two tables.

β†’ Pivots Β· Joins


Wide vs. long format

Imagine three persons with weight measurements at two time points.

Wide format β€” one row per person, one column per time point:

pnr weight_baseline weight_6mo
001 120 95
002 105 88
003 98 80

Long format β€” one row per observation, one column for time point and one for value:

pnr timepoint weight
001 baseline 120
001 6mo 95
002 baseline 105
002 6mo 88
003 baseline 98
003 6mo 80

Both formats contain the same information β€” they are just arranged differently.


Which formats are where?

Register Format Why
BEF Long One row per person per snapshot β€” quarterly since 2008, otherwise annual
LPR (lpr_adm + lpr_diag) Long One row per diagnosis per contact β€” one admission can have 10 diagnoses
LMDB Long One row per dispensed prescription β€” the same person can have hundreds
DODSAARS Wide One row per deceased person β€” already one-per-person
Your analysis dataset Wide One row per person β€” one column per covariate

Registers are usually in long format. Your analysis dataset should typically be in wide format.


Pivots β€” switch between long and wide

Long β†’ wide with pivot_wider()

library(tidyr)   # pivot_wider, pivot_longer

df_wide <- df_long %>%
  pivot_wider(
    names_from  = "timepoint",  # which column should become column names?
    values_from = "weight"      # which column contains the values?
  )
Warning

You CANNOT pivot_wider() an entire LPR table. LPR potentially has hundreds of unique ICD codes per person β€” pivot_wider() would give you one column per code, i.e. thousands of columns and a dataset that is impossible to work with.

Instead use the pattern from Phase 9: extract the first date for one specific diagnosis group, save as extract_dementia.rds, and join it as one column with left_join(). One outcome variable = one column = one RDS file.

Practical example: employment status from AKM β€” long to wide

AKM (the Labour Classification Module) stores one row per person per year β€” long format. If you want to compare employment status at baseline and follow-up side by side, you can pivot:

library(dplyr)   # filter, mutate
library(tidyr)   # pivot_wider

# AKM in long format β€” two years per person:
akm_long <- data.frame(
  pnr   = c("001", "001", "002", "002", "003", "003"),
  aar   = c(2015, 2020, 2015, 2020, 2015, 2020),
  socio = c(110, 110, 210, 310, 150, 150)   # employment categories from AKM (socio13)
)

# Convert to wide β€” one column per year:
akm_wide <- akm_long %>%
  pivot_wider(
    names_from   = aar,
    values_from  = socio,
    names_prefix = "socio_"   # gives columns: socio_2015, socio_2020
  )

# Result:
#   pnr   socio_2015  socio_2020
#   001          110         110
#   002          210         310
#   003          150         150

In practice you would typically not pivot AKM β€” you filter to one specific year with filter(aar == index_year) and use group_by() + slice() below. Pivot is relevant when you explicitly want two time points side by side in the analysis.

Wide β†’ long with pivot_longer()

df_long <- df_wide %>%
  pivot_longer(
    cols      = c(weight_baseline, weight_6mo),  # columns to "fold in"
    names_to  = "timepoint",                     # name for the new timepoint column
    values_to = "weight"                         # name for the new value column
  )

Prepare for join β€” group_by() + slice()

left_join() behaves predictably when both tables have one row per pnr. Many register extracts have multiple rows per person (long format). group_by() + slice() is the primary way to reduce a dataset to one representative row per person before joining β€” it is preparation for the join, not the join itself.

The pattern is always the same: group by pnr, sort so the desired row is on top, take the first row per group.

# BEF: keep only the newest record per person (BEF has one snapshot per year)
bef_one_per_person <- bef %>%
  group_by(pnr) %>%          # group: all rows with the same pnr are treated together
  arrange(desc(aar)) %>%     # sort within group: newest year first (desc = descending)
  slice(1) %>%               # take only the first row per group β€” i.e. the newest year
  ungroup()                  # IMPORTANT: release grouping when done
Important

Always remember ungroup(). group_by() puts a β€œgrouped_df” tag on your data.frame, which all subsequent operations inherit. That means: - mutate() calculates per group, not across the entire dataset - n() counts rows per group, not in total - summarize() gives one result per group

These errors are hard to detect because the code runs without error messages and produces seemingly reasonable numbers. ungroup() removes the tag and makes the data.frame normal again. Rule of thumb: always end a group_by() chain with ungroup().

# LPR / alle_dx: find first diagnosis per person (for outcome definition)
first_dementia <- alle_dx %>%
  filter(icd3 %in% c("G30", "F00", "F01", "F02", "F03")) %>%   # only dementia codes
  group_by(pnr) %>%              # group per person
  arrange(date_contact) %>%      # oldest date first (ascending β€” no desc)
  slice(1) %>%                   # first diagnosis per person
  ungroup()                      # release grouping
# β†’ ready for: cohort %>% left_join(first_dementia, by = "pnr")
# VNDS: find first emigration per person (a person may have emigrated and returned)
first_emigration <- vnds %>%
  filter(indud_kode == "U") %>%   # only emigration events
  group_by(pnr) %>%
  arrange(haend_dato) %>%
  slice(1) %>%
  ungroup()
Tip

When do you use group_by() + slice() vs pivot_wider()? If you want one representative value per person (newest snapshot, first event, earliest date) β€” use group_by() + slice(). If you want two or more time points side by side as separate columns in the analysis β€” use pivot_wider(). In register-based research, group_by() + slice() is by far the most common choice.


Validate your join

Joins rarely fail with an error message β€” but they can produce silently wrong results. Always check these three things after a left_join():

# Before join: what is nrow?
nrow(cohort)                                   # e.g. 4,823

# Join:
cohort2 <- cohort %>% left_join(outcome, by = "pnr")

# Check 1: did the row count increase? Duplicated keys in outcome give extra rows.
nrow(cohort2)                                  # should still be 4,823

# Check 2: how many got a match?
sum(!is.na(cohort2$event_date))                # number with outcome
sum(is.na(cohort2$event_date))                 # number without β€” expected censored?

# Check 3: who did NOT match? (diagnostic β€” not necessarily an error)
missing <- anti_join(cohort, outcome, by = "pnr")   # pnr's not found in outcome
nrow(missing)                                  # 0 = all matched; > 0 = investigate
Tip

anti_join() as diagnostics anti_join(cohort, outcome, by = "pnr") returns the cohort members that are not found in the outcome table. Use it for β€œwho in my cohort has no death date record?” (all alive β€” expected) or β€œwho is missing from register X?” (unexpected β€” investigate).


Stack tables vertically β€” bind_rows()

bind_rows() and left_join() solve two fundamentally different problems:

bind_rows() left_join()
What it does Stacks tables vertically β€” adds rows Combines tables horizontally β€” adds columns
Requires That the columns have the same names That the tables share one common key column
Result More rows, same number of columns Same number of rows, more columns
Matches on key? No β€” all rows from both tables Yes β€” match on pnr or other key
Typical use LPR2 + LPR3 β†’ one combined register Link outcome or covariate to cohort table

bind_rows() is not a join. It does not look at pnr and does not match. It simply places table 2’s rows below table 1’s rows β€” like pasting two Excel sheets together vertically.

# bind_rows: LPR2 and LPR3 have the same columns (pnr, date_contact, icd3)
# β†’ stack them vertically into one combined diagnosis register
lpr2_dx    # 45,000 rows β€” diagnoses up to March 2019
lpr3_dx    # 32,000 rows β€” diagnoses from March 2019 onwards

alle_dx <- bind_rows(lpr2_dx, lpr3_dx)   # 77,000 rows β€” both periods combined

Columns missing in one table (e.g. a column that only exists in LPR3) are automatically filled with NA for rows from the other table.

Note

When do you use which? bind_rows() β€” when you have two versions of the same register (LPR2 + LPR3, exposed + comparators) and want to merge them into one. left_join() β€” when you want to attach a new variable (outcome, date of death, age) to your cohort table.


Practical examples: join date of death and emigration to the cohort

Date of death (DODSAARS)

# Cohort data (one row per person):
kohort <- readRDS("datasets/full_cohort.rds")

# Death dates from DODSAARS:
dodsaars <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/dodsaars/") %>%
  rename_with(tolower) %>%               # standardise column names
  filter(pnr %in% !!kohort$pnr) %>%     # only the cohort's pnr's
  select(pnr, death_date = d_dodsdto) %>%   # rename d_dodsdto to death_date
  collect()                              # fetch into R

# Join: all cohort members retained; the living get death_date = NA
cohort_with_death <- kohort %>%
  left_join(dodsaars, by = "pnr")       # attach death date β€” NA = still alive

Emigration date (VNDS)

Emigration censors just like death β€” the person leaves the study on the day they emigrate. VNDS contains one row per migration event; use only "U" (emigration) and take the first date per person.

# Emigration dates from VNDS:
vnds <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/vnds/") %>%
  rename_with(tolower) %>%                       # standardise column names
  filter(pnr %in% !!kohort$pnr,
         indud_kode == "U") %>%                  # only emigration events
  select(pnr, emigration_date = haend_dato) %>%  # rename haend_dato to emigration_date
  collect() %>%                                  # fetch into R
  group_by(pnr) %>%                              # group to find first emigration
  arrange(emigration_date) %>%                   # oldest date first
  slice(1) %>%                                   # first emigration per person
  ungroup()                                      # release grouping

# Join: all cohort members retained; non-emigrants get emigration_date = NA
cohort_with_emigration <- kohort %>%
  left_join(vnds, by = "pnr")                    # NA = never emigrated during the study period

Calculate follow-up time and event variable

Before you can analyse, each cohort member needs a censoring date (when follow-up ends) and an event variable (did they experience the outcome?).

The censoring date is the earliest of: event date, date of death, emigration date and end of study period.

study_end <- as.Date("2024-12-31")   # replace with your actual study end date
                                      # format: "yyyy-mm-dd" (ISO 8601 β€” R's standard)

kohort <- kohort %>%
  mutate(
    # Censoring date = the earliest of all possible stopping reasons
    censor_date = pmin(event_date, death_date, emigration_date,
                       study_end, na.rm = TRUE),

    # Follow-up time in years
    followup_years = as.numeric(censor_date - index_date) / 365.25,

    # Event variable: 1 = outcome occurred before censoring, 0 = censored
    event = as.integer(!is.na(event_date) & event_date <= censor_date)
  )
Note

pmin() compares vectors position by position and returns the smallest value per person β€” it is the vectorised version of min(). na.rm = TRUE ensures that a missing death date (= alive) does not make the censoring date NA.


Build your study population

You now have all the building blocks β€” extractions (Phase 6), registers (Phase 8), LPR/SKS (Phase 9) and joins (above) β€” to assemble your study population. It happens in four steps:

  1. Define the exposed cohort. Filter the register that defines the exposure β€” e.g. bariatric surgery via SKS codes in lpr_sksopr / procedurer_kirurgi (Phase 9). Keep pnr and index date (the surgery date).
  2. Build the comparator cohort. Match controls to the exposed on e.g. age, sex and calendar year. Controls are assigned the matched exposed person’s index date.
  3. Apply the same inclusion/exclusion criteria to both groups, and assemble them (bind_rows()).
  4. Link outcomes and covariates with left_join() (one row per person), calculate follow-up time, and end with select().
Important

End with select() β€” keep only the columns the analysis requires.

cohort_final <- kohort %>%
  select(pnr, index_date, censor_date, followup_years, event,
         alder, koen, nmi_score, occupation_cat, education_cat, income_cat)

saveRDS(cohort_final, "datasets/analysis_dataset.rds")   # save the final dataset
Tip

Design choices (who is the comparator cohort, immortal time, matching ratio) belong to the planning stage β€” see Phase 1 β€” The comparator cohort. The actual cohort building and matching with heaven::riskSetMatch() β€” including pitfalls such as immortal time β€” are in Phase 10 β€” Build your study population.

Complete recipe β€” from cohort to analysis-ready dataset
# Complete recipe: all steps in order

# 0. Load your cohort (pnr + index_date, built in Phase 6/9)
kohort <- readRDS("datasets/full_cohort.rds")   # one row per person

# 1. Exclude prevalent cases BEFORE outcome extraction (see Phase 9)
# ... cohort_clean <- kohort %>% anti_join(prevalent, by = "pnr")

# 2. Link outcome (e.g. first dementia diagnosis after index date)
outcome  <- readRDS("datasets/extract_dementia.rds")   # pnr + event_date (NA = no event)
kohort <- kohort_clean %>% left_join(outcome, by = "pnr")

# 3. Link censoring: death date and emigration
deaths <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/dodsaars/") %>%
  rename_with(tolower) %>%
  filter(pnr %in% !!kohort$pnr) %>%
  select(pnr, death_date = d_dodsdto) %>% collect()

vnds_data <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/vnds/") %>%
  rename_with(tolower) %>%
  filter(pnr %in% !!kohort$pnr, indud_kode == "U") %>%
  select(pnr, emigration_date = haend_dato) %>% collect() %>%
  group_by(pnr) %>% arrange(emigration_date) %>% slice(1) %>% ungroup()

kohort <- kohort %>%
  left_join(deaths,    by = "pnr") %>%
  left_join(vnds_data, by = "pnr")

# 4. Calculate follow-up time and event variable
study_end <- as.Date("2024-12-31")
kohort <- kohort %>%
  mutate(
    censor_date    = pmin(event_date, death_date, emigration_date, study_end, na.rm = TRUE),
    followup_years = as.numeric(censor_date - index_date) / 365.25,
    event          = as.integer(!is.na(event_date) & event_date <= censor_date)
  )

# 5. Link covariates (demographic, SES, comorbidity)
bef_data <- readRDS("datasets/extract_bef.rds")    # age, sex from BEF
ses_data <- readRDS("datasets/extract_ses.rds")    # education, income, employment
nmi_data <- readRDS("datasets/extract_nmi.rds")    # NMI score

kohort <- kohort %>%
  left_join(bef_data, by = "pnr") %>%
  left_join(ses_data, by = "pnr") %>%
  left_join(nmi_data, by = "pnr")

# 6. Keep ONLY the columns the analysis requires
cohort_final <- kohort %>%
  select(pnr, index_date, censor_date, followup_years, event,
         alder, koen, nmi_score, occupation_cat, education_cat, income_cat)

saveRDS(cohort_final, "datasets/analysis_dataset.rds")   # save the final dataset
nrow(cohort_final)                                        # verify number of persons
names(cohort_final)                                       # verify column names

Next steps

You can now shape and combine your extracts. Next step is the socioeconomic variables:

β†’ Phase 13 β€” Socioeconomic variables

See also

Back to top