Build your study population
Identify your population, and build a comparator cohort if relevant
All extractions in the other phases, that is outcomes (Phase 9), covariates (Phase 6) and socioeconomics (Phase 13), assume you already have a cohort: a table with pnr and index_date per person. This page shows how to build it from scratch.
In short: You build your study population in six steps: identify the exposed (1), exclude prevalent cases (2), build a match pool (3), apply eligibility criteria (4), match a comparison cohort (5) and save the cohort (6). The result is a table with pnr + index_date per person.
This page is still under development. The code needs further review and testing before being used directly. Use it as structural guidance and adapt to your own project.
Study design determines the approach. This page shows an active cohort study with matching: one exposed group and one comparator cohort. Adapt step 1 to your exposure definition: SKS code, ICD diagnosis, ATC code, clinical measurement or other. See Phase 1: what type of study for the overview of case-control vs. cohort.
What is index date?
Index date is the point in time that marks the start of follow-up for a given person.
- Exposed: the date the person received the exposure (e.g. surgery date, diagnosis date, first prescription dispensed)
- Comparator cohort: the index date assigned from the matched exposed person
Everything that follows (outcome date, covariates at baseline, follow-up time) is calculated relative to index date. The definition of index date is crucial for study validity.
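As a minimal sketch of what "relative to index date" means in practice, the base R example below computes follow-up time on invented toy data. The table cohort_toy, its dates and study_end are assumptions made for illustration, not project data.

```r
# Toy cohort: one row per person, all dates invented for illustration
cohort_toy <- data.frame(
  pnr          = c("A", "B", "C"),
  index_date   = as.Date(c("2015-03-01", "2016-07-15", "2015-03-01")),
  outcome_date = as.Date(c("2017-03-01", NA, "2015-01-10")),  # NA = no outcome observed
  stringsAsFactors = FALSE
)
study_end <- as.Date("2018-12-31")  # hypothetical end of follow-up

# An outcome dated before index is prevalent, not incident (see Step 2)
cohort_toy$prevalent <- !is.na(cohort_toy$outcome_date) &
  cohort_toy$outcome_date < cohort_toy$index_date
incident <- cohort_toy[!cohort_toy$prevalent, ]

# Follow-up ends at the outcome if one is observed before study end, else at study end
incident$end_date <- rep(study_end, nrow(incident))
has_outcome <- !is.na(incident$outcome_date) & incident$outcome_date <= study_end
incident$end_date[has_outcome] <- incident$outcome_date[has_outcome]

# Follow-up time in days, counted from index date
incident$fu_days <- as.numeric(incident$end_date - incident$index_date)
print(incident[, c("pnr", "index_date", "end_date", "fu_days")])
```

Person C drops out because the outcome predates index date, which is the situation Step 2 handles on real data.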
Step 1: Identify the exposed
You scan the register that defines your exposure. You do not yet have a cohort_pnrs list, so you query the entire register and filter on the exposure criterion.
The exposure can be defined in many ways:
| Type | Example | Register |
|---|---|---|
| Surgery / procedure (SKS code) | Bariatric surgery KJDF10/KJDF11 | lpr_sksopr (LPR2), procedurer_kirurgi (LPR3) [a] |
| Hospital diagnosis (ICD code) | Type 2 diabetes E11 | lpr_diag + lpr_adm, lpr_a_diagnose + lpr_a_kontakt |
| Medication exposure (ATC code) | Metformin A10BA02 | LMDB |
| Clinical measurement / biomarker | BMI > 35, HbA1c > 75 mmol/mol | Project-specific data / OSDC / DBSO |
[a] lpr_sksopr and procedurer_kirurgi are the names on the DARTER project (708421); see Register reference. Names may vary on other projects.
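The medication row of the table follows the same logic as the register examples: filter on the exposure criterion, then keep the first event per person as index date. The sketch below shows this in base R on a toy dispensing table. The column names pnr, atc and eksd and all values are assumptions for illustration; check the names in your own LMDB extract.

```r
# Toy dispensing table; column names and values are assumed for illustration
lmdb_toy <- data.frame(
  pnr  = c("A", "A", "B", "C", "C"),
  atc  = c("A10BA02", "A10BA02", "C10AA01", "A10BA02", "A10BB01"),
  eksd = as.Date(c("2016-05-01", "2015-02-10", "2015-06-01",
                   "2017-01-20", "2014-03-03")),  # dispensing date
  stringsAsFactors = FALSE
)
METFORMIN <- "A10BA02"

disp <- lmdb_toy[lmdb_toy$atc %in% METFORMIN, ]  # filter on the exposure criterion
disp <- disp[order(disp$pnr, disp$eksd), ]       # oldest dispensing first within person

# First row per person = first dispensing = index date
exposed_atc <- disp[!duplicated(disp$pnr), c("pnr", "eksd")]
names(exposed_atc)[2] <- "index_date"
exposed_atc$exposed <- 1L
print(exposed_atc)
```

Person B never redeemed metformin and is therefore not in exposed_atc; person C's earlier dispensing of another drug does not move the index date.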
Example A: SKS codes (surgery/procedure)
library(arrow) # open_dataset
library(dplyr) # filter, select, group_by, slice, ungroup, bind_rows, mutate
# Adapt these codes to your study
RYGB <- c("KJDF10", "KJDF11") # Roux-en-Y gastric bypass
SG <- c("KJDF40", "KJDF41", "KJDF96", "KJDF97") # sleeve gastrectomy
BS_CODES <- c(RYGB, SG) # combined vector
# -- LPR2: procedures up to 2018/2019 ---------------------------------
lpr_sksopr <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_sksopr/") %>%
  rename_with(tolower)
lpr_adm <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_adm/") %>%
  rename_with(tolower)
exp_lpr2 <- lpr_sksopr %>%
  filter(c_opr %in% !!BS_CODES) %>% # only bariatric procedures
  select(recnum, sks_code = c_opr) %>% # recnum is the join key to lpr_adm
  inner_join(
    lpr_adm %>% select(pnr, recnum, index_date = d_inddto), # attach pnr and date
    by = "recnum"
  ) %>%
  collect()
# -- LPR3: procedures from 2019 onwards -------------------------------
proc_kir <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/procedurer_kirurgi/") %>%
  rename_with(tolower)
lpr_a_k <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_a_kontakt/") %>%
  rename_with(tolower)
exp_lpr3 <- proc_kir %>%
  filter(procedurekode %in% !!BS_CODES) %>% # only bariatric procedures
  select(dw_ek_forloeb, sks_code = procedurekode) %>%
  inner_join(
    lpr_a_k %>% select(pnr, dw_ek_forloeb, index_date = kont_starttidspunkt),
    by = "dw_ek_forloeb"
  ) %>%
  collect() %>%
  mutate(index_date = as.Date(index_date)) # datetime -> date
# -- Combine and take one procedure per person (the first) ------------
# The result is called "exposed": only the exposed group, NOT the full cohort yet
exposed <- bind_rows(exp_lpr2, exp_lpr3) %>%
  group_by(pnr) %>% # group per person
  arrange(index_date) %>% # oldest date first
  slice(1) %>% # one procedure per person (the first)
  ungroup() %>% # release grouping (see Phase 11)
  mutate(exposed = 1L) # mark as exposed (1 = yes)
nrow(exposed) # number of unique operated individuals
exposed contains only the operated individuals. The full cohort (exposed + comparator cohort) is built in step 5 and saved as kohort in step 6. It is kohort, not exposed, that you use to create cohort_pnrs for the other phases.
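Before moving on, it can help to confirm that the exposed table really has one row per person and no missing index dates. The helper check_exposed() below is a hypothetical name introduced for this sketch, and the toy table is invented; it is written in base R so it runs without register access.

```r
# Hypothetical helper: basic structural checks on an exposed table
check_exposed <- function(df) {
  c(
    n_rows        = nrow(df),
    n_persons     = length(unique(df$pnr)),
    dup_pnr       = sum(duplicated(df$pnr)),    # rows beyond the first per person
    missing_index = sum(is.na(df$index_date))   # persons without an index date
  )
}

# Toy table with two deliberate problems: B appears twice, C has no index date
exposed_toy <- data.frame(
  pnr        = c("A", "B", "B", "C"),
  index_date = as.Date(c("2012-04-01", "2019-09-10", "2014-01-05", NA)),
  stringsAsFactors = FALSE
)

res <- check_exposed(exposed_toy)
print(res)
```

On a correctly built exposed table, dup_pnr and missing_index should both be 0 and n_rows should equal n_persons.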
Example B: ICD diagnosis as exposure criterion
# Same LPR pattern as Phase 9, but without filter(pnr %in% !!cohort_pnrs),
# as the cohort does not yet exist. You query the full population to identify the exposed.
lpr_diag <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_diag/") %>%
  rename_with(tolower)
lpr_adm <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/lpr_adm/") %>%
  rename_with(tolower)
exposed <- lpr_adm %>%
  inner_join(
    lpr_diag %>%
      filter(c_diagtype %in% c("A", "B"),
             substr(c_diag, 2, 4) == "E11") %>% # T2D: "DE11", skip the D-prefix
      select(recnum, c_diag),
    by = "recnum"
  ) %>%
  select(pnr, index_date = d_inddto) %>%
  collect() %>%
  group_by(pnr) %>%
  arrange(index_date) %>%
  slice(1) %>% # first diagnosis per person
  ungroup() %>%
  mutate(exposed = 1L)
What should you do after step 1?
- Prevalence study: you now have your study population with an index date. Skip steps 2-6 and go directly to Phase 6: Extract covariates and Phase 11: Assemble the dataset.
- Cohort study or nested case-control: continue with steps 2-6 below.
Note for case-control: your "exposed" group in step 1 is your cases (those who experienced the outcome, identified from LPR), not an exposure. Steps 3-6 then sample controls: people without the outcome who were at risk at the time each case received their diagnosis.
Step 2: Exclude prevalent cases
Exclude persons who already had the diagnosis before index date; otherwise they count as new cases even though they are not. This step must happen before matching, so the comparator pool is not sampled from a contaminated source.
You need your extracted LPR diagnosis table from Phase 9: either alle_dx or your direct extraction.
EXCL_CODES <- c("G30", "F00", "F01", "F02", "F03") # change to your own exclusion codes
# diagnoses = alle_dx (Approach 2) or your direct extraction (Approach 1) from Phase 9
prevalent <- diagnoses %>%
  filter(icd3 %in% EXCL_CODES) %>%
  inner_join(exposed %>% select(pnr, index_date), by = "pnr") %>%
  filter(date_contact < index_date) %>%
  distinct(pnr)
cat("Exposed before exclusion: ", nrow(exposed), "\n")
exposed_excl <- exposed %>%
  anti_join(prevalent, by = "pnr")
cat("Exposed after exclusion: ", nrow(exposed_excl), "\n")
cat("Excluded: ", nrow(exposed) - nrow(exposed_excl), "\n")
Lookback period: Consider restricting date_contact < index_date further, e.g. to only the 5 years before index, to avoid false exclusions based on very old diagnoses.
filter(date_contact >= index_date - 365*5,
       date_contact < index_date)
Prevalence exclusion also applies to the matchpool (step 3): exclude people with prevalent outcome from the pool too, otherwise your exposed are matched to persons who were not actually at risk.
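To make the effect of the lookback window concrete, here is a small base R sketch on invented data. The tables exposed_toy and diagnoses_toy and the constant LOOKBACK_YEARS are assumptions for illustration; the column names follow the code above.

```r
exposed_toy <- data.frame(
  pnr        = c("A", "B", "C"),
  index_date = as.Date(c("2015-06-01", "2015-06-01", "2015-06-01")),
  stringsAsFactors = FALSE
)
diagnoses_toy <- data.frame(
  pnr          = c("A", "B", "C"),
  icd3         = c("G30", "G30", "F03"),
  date_contact = as.Date(c("2013-02-01", "2004-02-01", "2016-01-01")),
  stringsAsFactors = FALSE
)
EXCL_CODES     <- c("G30", "F00", "F01", "F02", "F03")
LOOKBACK_YEARS <- 5
lookback_days  <- round(365.25 * LOOKBACK_YEARS)  # approximates 5 calendar years

# Attach index date to each relevant diagnosis
dx <- merge(diagnoses_toy[diagnoses_toy$icd3 %in% EXCL_CODES, ],
            exposed_toy, by = "pnr")

# Prevalent = diagnosis inside the window [index - lookback, index)
in_window <- dx$date_contact >= dx$index_date - lookback_days &
  dx$date_contact < dx$index_date
prevalent_pnr <- unique(dx$pnr[in_window])

exposed_excl_toy <- exposed_toy[!exposed_toy$pnr %in% prevalent_pnr, ]
print(exposed_excl_toy)
```

Only person A is excluded: B's diagnosis is older than the window and C's diagnosis comes after index date, so C's is a potential incident outcome and not a prevalent case.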
Step 3: Build matchpool from BEF
Comparator candidates are drawn from BEF: everyone in the population not yet exposed. Add the variables you want to match on (birth year, sex, calendar year etc.).
bef <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/bef/") %>%
  rename_with(tolower)
matchpool <- bef %>%
  filter(aar == 2015) %>% # one BEF snapshot; adjust to your study window
  select(pnr, foed_dag, koen) %>%
  collect() %>%
  anti_join(exposed, by = "pnr") %>% # keep only pnr's NOT in exposed
  # anti_join = inverse inner_join: no match = keep
  # see Phase 11 for full explanation of join types
  mutate(
    birth_year = as.integer(format(as.Date(foed_dag), "%Y")), # matching variable
    exposed = 0L # mark as potential comparator
  )
nrow(matchpool) # number of potential comparators
Step 4: Apply inclusion criteria BEFORE sampling
Inclusion criteria must be applied BEFORE matching, not after. Exclusion of e.g. persons under 18 or not residing in Denmark must happen on the matchpool before comparators are sampled. Excluding after matching introduces systematic bias, because you change the population that the exposed are matched to.
See: Lund et al., Clinical Epidemiology 2015 for a review of bias from incorrect exclusion order.
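One visible symptom of the wrong order is that matched sets shrink after the fact. The toy sketch below illustrates this in base R. The helper assign_comparators() is hypothetical and deliberately deterministic (it hands out candidates in table order instead of sampling), so it is a stand-in for the idea, not for riskSetMatch().

```r
pool_toy <- data.frame(
  pnr   = paste0("P", 1:6),
  koen  = 2L,
  alder = c(16L, 45L, 17L, 52L, 38L, 41L),  # two candidates are under 18
  stringsAsFactors = FALSE
)
exposed_ids <- c("E1", "E2")
RATIO <- 2

# Hypothetical helper: give each exposed person the next `ratio` candidates
assign_comparators <- function(candidates, exposed_ids, ratio) {
  n <- min(nrow(candidates), length(exposed_ids) * ratio)
  out <- candidates[seq_len(n), ]
  out$caseid <- rep(exposed_ids, each = ratio)[seq_len(n)]
  out
}

# Wrong order: match first, exclude minors afterwards
after <- assign_comparators(pool_toy, exposed_ids, RATIO)
after <- after[after$alder >= 18, ]

# Right order: exclude minors first, then match
before <- assign_comparators(pool_toy[pool_toy$alder >= 18, ], exposed_ids, RATIO)

print(table(after$caseid))   # matched sets have lost comparators
print(table(before$caseid))  # every exposed person keeps the full ratio
```

With exclusion after matching, each exposed person is left with one comparator; with exclusion before matching, both keep two.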
# Attach BEF variables to matchpool and apply inclusion criteria
bef_baseline <- bef %>%
  filter(aar == 2015) %>% # use same snapshot year
  select(pnr, alder, opr_land) %>%
  collect()
matchpool_clean <- matchpool %>%
  inner_join(bef_baseline, by = "pnr") %>%
  filter(
    alder >= 18, # adults only
    opr_land == 5100 # country of origin = Denmark (opr_land is origin, not residence; confirm this is the criterion you intend)
  )
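# Apply the same criteria to exposed (prevalence exclusion already done in Step 2).
# koen and birth_year are attached here because Step 5 selects and matches on them.
exposed_clean <- exposed_excl %>%
  inner_join(bef_baseline, by = "pnr") %>%
  filter(alder >= 18, opr_land == 5100) %>%
  inner_join(
    bef %>% filter(aar == 2015) %>% select(pnr, foed_dag, koen) %>% collect(),
    by = "pnr"
  ) %>%
  mutate(birth_year = as.integer(format(as.Date(foed_dag), "%Y")))
Step 5: Match comparators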
heaven::riskSetMatch() implements risk-set sampling / incidence-density sampling, the standard in register-based cohorts.
library(heaven) # riskSetMatch
# Combine into one dataset: exposed = 1 (exposed) and exposed = 0 (potential comparator)
pool <- bind_rows(
  exposed_clean %>% select(pnr, index_date, exposed, koen, birth_year),
  matchpool_clean %>% select(pnr, exposed, koen, birth_year)
)
# Risk-set matching: 1:5 on sex and birth year
# The result "matched" becomes your full cohort: exposed + matched comparators
matched <- riskSetMatch(
  ptid = "pnr", # personal identifier
  event = "exposed", # 1 = exposed, 0 = potential comparator
  terms = c("koen", "birth_year"), # matching variables; adapt to your study
  dat = pool,
  Ncontrols = 5 # up to 5 comparators per exposed (argument name as listed in ?riskSetMatch; verify for your heaven version)
)
Chronological sampling and replacement. riskSetMatch() samples chronologically: persons can contribute to the comparator cohort up until the point they themselves become exposed. Take an explicit position on whether there should be replacement (a person can be matched to multiple exposed individuals) or not. Replacement increases effective sample size but gives correlated observations, which must be handled in the analysis. See Lund et al. 2015 and ?riskSetMatch for details.
riskSetMatch() is explained in Phase 14a: Packages and the design choice behind matching in Phase 1.
Matched comparators are automatically assigned the matched exposed person's index date.
More in depth: pitfalls when building the comparison cohort
- Immortal time bias: a person in the comparison cohort must be alive, resident in Denmark and meet the eligibility criteria on their assigned index date β not only at the start of the study.
- Risk-set / incidence-density sampling: a person can appear in the comparison cohort and later become exposed themselves and become a case. Decide how you handle that (riskSetMatch() supports risk-set matching).
- Matching ratio: e.g. 1:5 exposed:comparison. More controls give more precision, but with diminishing returns.
- The same exclusions are applied to both groups, otherwise you introduce selection bias.
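The pitfalls above are easier to reason about with the sampling logic written out. The following base R sketch implements a simplified risk-set sampler on invented data: for each exposed person, comparators are drawn from persons with the same sex and birth year who are still under observation and not yet exposed on that index date. It is an illustration of the principle, not the heaven implementation; sample_risk_set() is a hypothetical helper, and controls are allowed to be reused across cases.

```r
set.seed(1)  # reproducible sampling

persons <- data.frame(
  pnr        = c("A", "B", "C", "D", "E", "F"),
  koen       = c(1L, 1L, 1L, 1L, 2L, 2L),
  birth_year = c(1970L, 1970L, 1970L, 1970L, 1980L, 1980L),
  exp_date   = as.Date(c("2012-01-01", "2014-01-01", NA, NA,
                         "2013-06-01", NA)),           # NA = never exposed
  end_date   = as.Date(c("2020-12-31", "2020-12-31", "2013-03-01",
                         "2020-12-31", "2020-12-31", "2020-12-31")),
  stringsAsFactors = FALSE                             # end_date: death/emigration/end of data
)

cases <- persons[!is.na(persons$exp_date), ]
cases <- cases[order(cases$exp_date), ]  # process cases chronologically

# Hypothetical helper: sample up to `ratio` comparators from one case's risk set
sample_risk_set <- function(case, persons, ratio) {
  at_risk <- persons$pnr != case$pnr &
    persons$koen == case$koen &
    persons$birth_year == case$birth_year &
    (is.na(persons$exp_date) | persons$exp_date > case$exp_date) &  # not yet exposed
    persons$end_date >= case$exp_date                               # still under observation
  ids <- persons$pnr[at_risk]
  if (length(ids) == 0) return(NULL)
  picked <- ids[sample.int(length(ids), min(ratio, length(ids)))]
  # Comparators inherit the index date of the case they are matched to
  data.frame(pnr = picked, caseid = case$pnr, index_date = case$exp_date,
             exposed = 0L, stringsAsFactors = FALSE)
}

comparators <- do.call(rbind, lapply(seq_len(nrow(cases)), function(i) {
  sample_risk_set(cases[i, ], persons, ratio = 2)
}))
print(comparators)
```

Person B is eligible as a comparator for A (B is exposed only in 2014), while C cannot be matched to B because C left observation in 2013. That is the immortal time and risk-set logic from the list above in miniature.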
Step 6: Save your cohort
# "matched" from riskSetMatch() is now your full cohort:
# exposed (exposed = 1) + matched comparators (exposed = 0)
kohort <- matched # rename to "kohort": this is what you use from now on
saveRDS(kohort, "datasets/full_cohort.rds") # save
# Verify:
nrow(kohort) # total number of persons
table(kohort$exposed) # 0 = comparator cohort, 1 = exposed
names(kohort) # which columns are included?
What now?
You have full_cohort.rds: one row per person with pnr and index_date for both groups (exposed + comparator cohort). Use it in all subsequent extractions:
kohort <- readRDS("datasets/full_cohort.rds")
cohort_pnrs <- unique(kohort$pnr) # vector with all pnr's: this is what is inserted
                                  # wherever the other phases use "cohort_pnrs"
cohort_pnrs thus contains the pnr's for both exposed and comparators, not just the exposed. This is important: extraction of outcomes and covariates must cover the entire study population.
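A small base R sketch of why this matters, using invented tables (kohort_toy and register_toy are assumptions for illustration): filtering a register on the exposed pnr's alone silently drops every comparator's records.

```r
kohort_toy <- data.frame(
  pnr     = c("A", "B", "C", "D"),
  exposed = c(1L, 1L, 0L, 0L),
  stringsAsFactors = FALSE
)
register_toy <- data.frame(
  pnr  = c("A", "C", "D", "X"),          # X is not in the cohort
  icd3 = c("I21", "I21", "E11", "I21"),
  stringsAsFactors = FALSE
)

cohort_pnrs_toy <- unique(kohort_toy$pnr)                # exposed + comparators
exposed_only    <- kohort_toy$pnr[kohort_toy$exposed == 1L]

full_extract    <- register_toy[register_toy$pnr %in% cohort_pnrs_toy, ]
partial_extract <- register_toy[register_toy$pnr %in% exposed_only, ]

print(full_extract)     # records for A, C and D
print(partial_extract)  # records for A only: comparators C and D are lost
```

With the partial extract, comparators would appear to have no outcomes at all, which would bias any comparison between the groups.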
| Extraction | Phase |
|---|---|
| Outcomes (diagnoses, date of death, emigration) | Phase 9, Phase 11 |
| Covariates from BEF (age, sex) | Phase 6 |
| Socioeconomic variables | Phase 13 |
| Comorbidity (NMI, Charlson) | Phase 14c |
| Assemble into one analysis dataset | Phase 11 |
See also
- Phase 1: Study preparation (design choices behind cohort and matching)
- Phase 9: Hospital contacts (LPR) (diagnosis pattern for exposure identification and prevalence exclusion)
- Phase 11: Joins and pivots (assemble all extracts into one dataset)
- Phase 15d: Register reference (column names for lpr_sksopr, procedurer_kirurgi etc.)
- Lund et al. (2015), Clinical Epidemiology: bias from incorrect exclusion order in risk-set sampling