flowchart TD
E["Epidemiological studies"]:::neutral
O["Observational<br>register research lives here"]:::active
X["Experimental"]:::ref
D["Descriptive"]:::active
A["Analytical"]:::active
CC["Case-control"]:::active
CO["Cohort"]:::active
R["RCT<br>requires intervention,<br>not possible with register data"]:::ref
E --> O
E --> X
O --> D
O --> A
A --> CC
A --> CO
X --> R
classDef neutral fill:#eef0f2,stroke:#8a94a6,color:#1f2733;
classDef active fill:#eaf2fb,stroke:#4a78b5,color:#173a5e;
classDef ref fill:#f6f6f6,stroke:#cccccc,color:#999999;
Plan your study
Before opening R: define your question, cohort, and data model
Register-based research does not start in R. It starts with pen and paper. This page guides you through the things you should have in place before writing a single line of code.
In short, settle four things on paper before you code: a precise research question, your data model (which registers cover exposure, outcome and covariates), your covariates chosen with a DAG, and your comparison cohort.
What type of study are you doing?
Almost all register-based research is observational and analytical: you observe what has already happened, without intervening. The two classic analytical designs are case-control and cohort; Phase 10 shows how to build a matched cohort study. Randomised controlled trials (RCTs) require an intervention, so they cannot be done with register data and are included here only for the overview.
Case-control or cohort: what is the difference?
The two analytical designs differ in which end you start from:
| | Cohort | Case-control |
|---|---|---|
| Starting point | Exposure | Outcome |
| Direction | Follows forward: exposed → outcome | Looks back: case → prior exposure |
| Best when | Exposure is rare; multiple outcomes | Outcome is rare; single outcome |
| Effect measure | Incidence, relative risk (RR), hazard ratio | Odds ratio (OR) |
| In registers | Define exposed group + comparator cohort, follow forward | Find all cases, select controls, look back at exposure |
A cohort study follows people forward in time from the index date and measures how many develop the outcome, which is why you can compute incidence and risk. It is well suited when you have multiple outcomes (cf. the alle_dx approach in Phase 9b).
A case-control study starts from those who already have the outcome and matches them with controls without it. This is efficient for rare outcomes, but the design cannot estimate absolute risk.
With register data you can do both, because the entire population's history is available. Phase 10 shows a matched cohort study step by step.
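To make the two effect measures concrete, here is a minimal sketch that computes a relative risk and an odds ratio from the same 2x2 table. It is written in Python purely for illustration (the guide itself works in R), and the counts are invented for the example, not taken from any register.

```python
def risk_ratio(a, b, c, d):
    """Relative risk from cohort counts.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    # (a / (a + b)) / (c / (c + d)), rearranged to a single division
    return (a * (c + d)) / (c * (a + b))


def odds_ratio(a, b, c, d):
    """Odds ratio from the same 2x2 table: (a / b) / (c / d)."""
    return (a * d) / (b * c)


# Hypothetical counts: 1,000 exposed and 1,000 unexposed people
counts = dict(a=30, b=970, c=10, d=990)
rr = risk_ratio(**counts)
odds = odds_ratio(**counts)
print(f"RR = {rr:.2f}, OR = {odds:.2f}")
```

With a rare outcome, as in these made-up counts, the odds ratio lies close to the relative risk; the two drift apart as the outcome becomes more common.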
Key concepts
Before planning a study it is worth knowing these terms, which are used throughout the guide.
Cohort: A group of people followed over time because they share a particular characteristic at a particular point in time. Example: all patients who underwent bariatric surgery in the period 2010–2020.
Index date: The start date of follow-up, i.e. the point from which you begin counting. For operated patients this is typically the date of surgery. Matched comparators are assigned the same date as the operated patient they are matched to.
Exposure: The factor whose effect you are investigating, e.g. a surgery, a medication, or a diagnosis.
Outcome: What you are measuring, e.g. onset of a disease, a hospitalisation, or death.
Covariates: Variables you include to account for confounding, i.e. factors that affect both exposure and outcome. Examples: age, sex, comorbidity, socioeconomic status.
1. What do I want to investigate?
Formulate your research question precisely before looking at any data. A vague question produces a messy dataset. A precise question produces a clear plan.
Ask yourself:
| Question | Example |
|---|---|
| Who is my population? | All adults with T2D in Denmark, 2010–2020 |
| What is my exposure? | Bariatric surgery |
| What is my outcome? | Dementia |
| When does follow-up start? | Date of surgery (index date) |
| When does it end? | Diagnosis, death, emigration, or end of study period |
| Which confounders should be adjusted for? | Age, sex, comorbidity, SES |
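The start and end of follow-up in the table translate directly into a small rule: follow-up stops at whichever comes first of outcome, death, emigration, or end of study. The sketch below (Python, for illustration only) shows that rule for one person. The function name and arguments are hypothetical, and it assumes every supplied date falls on or after the index date.

```python
from datetime import date


def end_of_follow_up(index_date, study_end, outcome_date=None,
                     death_date=None, emigration_date=None):
    """Return (end_date, event, person_years) for one person.

    Assumes every supplied date is on or after index_date.
    Missing dates (None) mean the event never happened.
    """
    stops = [d for d in (outcome_date, death_date, emigration_date, study_end)
             if d is not None]
    end_date = min(stops)  # earliest stopping reason wins
    # The person counts as an event only if the outcome is what ended follow-up
    event = outcome_date is not None and outcome_date == end_date
    person_years = (end_date - index_date).days / 365.25
    return end_date, event, person_years


# Hypothetical person: operated 1 March 2012, died 15 June 2018, no outcome
end_date, event, person_years = end_of_follow_up(
    index_date=date(2012, 3, 1),
    study_end=date(2020, 12, 31),
    death_date=date(2018, 6, 15),
)
print(end_date, event, round(person_years, 2))
```

Here the person is censored at death rather than counted as a case, because death ended follow-up before any outcome was recorded.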
2. Which registers cover what?
Before mapping your data model it is useful to know which registers exist.
| What do you need to find? | Register |
|---|---|
| Demographics (age, sex, residence) | BEF – Population Register |
| Hospital diagnoses and contacts | LPR – National Patient Register (LPR2 + LPR3) |
| Dispensed prescriptions | LMDB – Prescription Register |
| Date of death (for censoring) | DODSAARS – Death Register |
| Emigration (for censoring) | VNDS – Migration Register |
| Education | UDDA – Education Register |
| Income | FAIK – Family Income Register |
| Employment | AKM – Labour Classification Module |
A complete description of all registers with column names and join keys is in Phase 15 – Register reference →
3. Choose your covariates using a DAG
Which variables should you adjust for? The answer is not "as many as possible". Adjusting for the wrong variables can introduce bias rather than remove it.
A DAG (directed acyclic graph, a causal diagram) is a drawing of your assumptions about how exposure, outcome and other variables relate to each other. It makes your assumptions explicit and helps you choose the right set of covariates.
Rules of thumb:
- Adjust for confounders: variables that affect both exposure and outcome (e.g. age, comorbidity).
- Do NOT adjust for mediators: variables that lie on the causal pathway between exposure and outcome (this removes part of the effect you want to measure).
- Do NOT adjust for colliders: common effects of two variables (this opens a spurious association).
Example: surgery and dementia (a DAG with confounder, mediator and collider)
A concrete example: does surgery affect the risk of dementia?
- Age is a confounder: it affects both the probability of surgery and of dementia. Adjust for it.
- Delirium (post-operative delirium) is a mediator: it lies on the path surgery → delirium → dementia. Do not adjust, because that removes part of the effect you want to measure.
- Hospitalisation is a collider: both surgery and dementia lead to hospitalisation. Do not adjust, because that opens a spurious association.
You can paste the model straight into dagitty.net and have the minimal adjustment set computed:
dag {
Age [pos="0,-1"]
Surgery [exposure, pos="-1.5,0"]
Delirium [pos="0,0"]
Dementia [outcome, pos="1.5,0"]
Hospitalisation [pos="0,1"]
Age -> Surgery
Age -> Dementia
Surgery -> Delirium
Delirium -> Dementia
Surgery -> Hospitalisation
Dementia -> Hospitalisation
}
For this DAG the minimal adjustment set is {Age}, so you only need to adjust for age.
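The three rules of thumb can also be checked mechanically on a small graph. The sketch below (Python, for illustration only) encodes the same edges as the dagitty model and labels each remaining variable with a simplified heuristic. It is not the algorithm dagitty uses and does not compute adjustment sets in general; the function names are hypothetical.

```python
EDGES = [
    ("Age", "Surgery"), ("Age", "Dementia"),
    ("Surgery", "Delirium"), ("Delirium", "Dementia"),
    ("Surgery", "Hospitalisation"), ("Dementia", "Hospitalisation"),
]


def descendants(node, edges, skip=None):
    """Nodes reachable from `node` along directed edges, never via `skip`."""
    children = {}
    for parent, child in edges:
        if skip in (parent, child):
            continue  # drop every edge that touches the skipped node
        children.setdefault(parent, set()).add(child)
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def classify(edges, exposure, outcome):
    """Label each other variable; a simplified heuristic, not a full analysis."""
    nodes = {n for edge in edges for n in edge} - {exposure, outcome}
    from_exposure = descendants(exposure, edges)
    from_outcome = descendants(outcome, edges)
    roles = {}
    for node in sorted(nodes):
        reaches = descendants(node, edges)
        reaches_outcome_directly = outcome in descendants(node, edges, skip=exposure)
        if (exposure in reaches and reaches_outcome_directly
                and node not in from_exposure):
            roles[node] = "confounder"   # causes exposure and, separately, outcome
        elif node in from_exposure and outcome in reaches:
            roles[node] = "mediator"     # on a path exposure -> node -> outcome
        elif node in from_exposure and node in from_outcome:
            roles[node] = "collider"     # common effect of exposure and outcome
        else:
            roles[node] = "other"
    return roles


roles = classify(EDGES, exposure="Surgery", outcome="Dementia")
for name, role in roles.items():
    print(f"{name}: {role}")
```

For real studies, rely on dagitty for the adjustment set; a per-variable label like this ignores paths that only open or close in combination.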
Tools
- dagitty.net: draw your diagram in the browser; it automatically calculates the minimal set of covariates to adjust for.
- Causal Diagrams: Draw Your Assumptions Before Your Conclusions (free HarvardX course by Miguel Hernán on exactly this).
- Background: Hernán & Robins, Causal Inference: What If (free PDF), also in Phase 15 – Learning resources.
4. The comparator cohort
Many studies compare an exposed group with a comparator cohort. How you build it is a design decision to be made on paper, before writing code.
Things to consider:
- Who is an appropriate comparator? E.g. for bariatric surgery: people with severe obesity who were not operated on, or a matched background population. The choice depends on the question.
- Index date for the comparator cohort. Your exposed cohort has an index date determined by the exposure (e.g. the surgery date). The comparator cohort does not: it must be assigned a date, typically the same date as the matched exposed person, so both groups are followed from a comparable point in time.
- Eligibility at index. The comparator cohort must meet the inclusion criteria on their assigned index date. Otherwise you risk immortal time bias (a distortion that arises when a person is credited with follow-up time during which, by design, they could not have experienced the outcome).
- Matching variables and ratio. E.g. age, sex and calendar year; decide the ratio (e.g. 1:5).
- Can anyone in the comparator cohort become exposed later? E.g. can a person who started as a control later undergo surgery? Decide what happens in that case β whether they remain a control or transfer to the exposed group.
- The same exclusions are applied to both groups.
→ The complete pattern for cohort construction and matching is in Phase 10 – Build your study population.
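The considerations above can be sketched as a small matching routine. This Python sketch is for illustration only and is not the Phase 10 pattern: the field names are hypothetical, matching is on sex and birth year only, and comparators are taken in list order without replacement, where a real study would usually sample them at random.

```python
from datetime import date


def eligible_at(candidate, index_date):
    """Alive and not yet exposed on the assigned index date."""
    died = candidate.get("death_date")
    exposed = candidate.get("exposure_date")
    alive = died is None or died > index_date
    unexposed = exposed is None or exposed > index_date
    return alive and unexposed


def match_comparators(exposed, pool, ratio):
    """Pick up to `ratio` comparators per exposed person, without replacement."""
    used, matched = set(), []
    for case in sorted(exposed, key=lambda p: p["index_date"]):
        picked = 0
        for cand in pool:
            if picked == ratio:
                break
            if cand["id"] in used:
                continue
            same_stratum = (cand["sex"], cand["birth_year"]) == (
                case["sex"], case["birth_year"])
            if same_stratum and eligible_at(cand, case["index_date"]):
                # The comparator inherits the index date of the exposed person
                matched.append({"id": cand["id"], "matched_to": case["id"],
                                "index_date": case["index_date"]})
                used.add(cand["id"])
                picked += 1
    return matched


# Hypothetical data: one operated person and five candidates
exposed = [{"id": 1, "sex": "F", "birth_year": 1970,
            "index_date": date(2015, 5, 1)}]
pool = [
    {"id": 10, "sex": "F", "birth_year": 1970},
    {"id": 11, "sex": "F", "birth_year": 1970, "death_date": date(2014, 1, 1)},
    {"id": 12, "sex": "M", "birth_year": 1970},
    {"id": 13, "sex": "F", "birth_year": 1970, "exposure_date": date(2018, 3, 1)},
    {"id": 14, "sex": "F", "birth_year": 1970},
]
matches = match_comparators(exposed, pool, ratio=2)
print([m["id"] for m in matches])
```

Candidate 13 is selected even though they are operated on in 2018: they are unexposed at the index date, which is exactly the situation where you must decide in advance whether such a person stays a comparator or transfers to the exposed group.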
5. Get an overview: pen and paper
Before opening R, answer these questions in writing:
- Which variables do I need? (patient information such as age, sex and diagnoses, and for which years)
- Which registers contain this information? (LPR, BEF, LMDB, …)
- In what order should data be assembled? (define population → extract outcome → extract covariates)
A solid overview on paper saves many hours of debugging in code.
Example: overview for a dementia study
Population:  Adults who have undergone bariatric surgery (identified via the Danish Obesity
             Treatment Database, DBSO), 2010–2024
             Matched comparators from the Population Register (BEF)
Outcome:     First dementia diagnosis (LPR, ICD-10: F00–F03, G30–G31)
             Date: first contact with a dementia code after the surgery date
Covariates:  Age and sex (BEF)
             Comorbidity (LPR, 5-year lookback, i.e. diagnoses in the 5 years before index date)
             Education (UDDA)
             Income (FAIK via BEF familie_id)
             Employment (AKM)
Censoring:   Death (DODSAARS)
             Emigration (VNDS)
             End of study period (31 Dec 2024)
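Two of the definitions in this overview are date rules: the outcome is the first dementia contact after the index date, and comorbidity is read from the five years before it. A minimal Python sketch of both follows, for illustration only. It assumes contacts arrive as (code, date) pairs with bare ICD-10 codes; how codes are actually stored in the register may differ, and the function names are hypothetical.

```python
from datetime import date

# ICD-10 groups from the overview: F00-F03 and G30-G31
DEMENTIA_PREFIXES = ("F00", "F01", "F02", "F03", "G30", "G31")


def years_before(d, years):
    """Same calendar day `years` earlier; 29 Feb falls back to 28 Feb."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        return d.replace(year=d.year - years, day=28)


def first_outcome_date(contacts, index_date):
    """Date of the first dementia-coded contact strictly after index."""
    dates = [when for code, when in contacts
             if code.startswith(DEMENTIA_PREFIXES) and when > index_date]
    return min(dates) if dates else None


def in_lookback(contacts, index_date, years=5):
    """Contacts in the window [index - years, index)."""
    start = years_before(index_date, years)
    return [(code, when) for code, when in contacts
            if start <= when < index_date]


# Hypothetical contact history for one person with index date 1 May 2015
index_date = date(2015, 5, 1)
contacts = [
    ("I10", date(2009, 1, 1)),   # outside the 5-year lookback
    ("E11", date(2012, 2, 10)),  # inside the lookback
    ("G30", date(2019, 9, 9)),   # first dementia contact after index
    ("F03", date(2021, 1, 1)),   # later dementia contact, ignored
]
print(in_lookback(contacts, index_date))
print(first_outcome_date(contacts, index_date))
```

Writing the window as half-open (up to but not including the index date) keeps covariates strictly before exposure, which is the point of a lookback.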
6. Write an analysis plan
An analysis plan is a document you write before looking at your data. It forces you to commit to design, statistics and variables before results can colour your decisions.
Use the STROBE checklist as a skeleton: STROBE Statement – checklists →
Pre-register your analysis plan, e.g. on OSF. This is good scientific practice and required by many journals: Open Science Framework – registration templates
7. Next steps
Once you have your overview in place:
- New to R? → Phase 2 – R: the bare essentials
- Ready for the DST server? → Phase 3 – Log in to DST
- Working on DARTER / project 708421? → Read this first: DARTER – overview and pipeline