Data Management

Merging extracted datasets and deriving analysis variables

Published

June 2, 2026

Important

All code on this page is example code — adapt to your own study variables before running.

This page explains how to use scripts/manage_dataset.R to combine extracted datasets into a single analysis-ready file.


What the data looks like after extraction

After running the extraction scripts you have three separate data frames:

Data frame Format One row =
dataset Wide One participant
diagnosis_events Long (Approach A) One diagnostic event
diagnosis_df Wide (Approach B) One participant, with first dates + binary flags
prescriptions Long One prescription record

The choice between long and wide format for diagnoses is made in extract_diagnoses.R — see that script for details. Always run dplyr::glimpse() on each data frame before merging, and again after.


GP data availability

Only about 45% of UKB participants have linked GP records. If your analysis uses GP data, filter before merging:

dataset_gp <- dataset %>%
  dplyr::filter(p42040 > 0)   # p42040 = count of linked GP clinical records

Not filtering introduces systematic bias — non-linked participants would be treated as “no disease” rather than “data not available.” See the Dataset Reference for the three GP linkage fields (p42040, p42039, p42038).


Derive first diagnosis date (Approach A)

If you used extract_diagnoses.R Approach A (all events), derive the first event per participant and condition before merging:

first_diagnosis <- diagnosis_events %>%
  dplyr::group_by(eid, condition) %>%
  dplyr::slice_min(date, n = 1, with_ties = FALSE) %>%
  dplyr::ungroup()

GP and HES may give different first dates for the same participant. Use first_by_source (see manage_dataset.R Step 4) for source-specific dates.

If you used Approach B (get_df()), first dates and binary flags are already in diagnosis_df — skip this step.


Merging datasets

Join by eid. Always coerce eid to integer in every table first:

analysis_dataset <- dataset %>%
  dplyr::left_join(first_diagnosis, by = "eid") %>%
  dplyr::left_join(prescriptions,   by = "eid")

See manage_dataset.R Step 5 for left_join, inner_join, and count + join options. After merging, run glimpse() on the result and check for duplicate eids — a one-to-many join can silently multiply rows.


Renaming, recoding, and derived variables

manage_dataset.R Step 7 provides example functions for:

  • rename_variables() — renames p-coded columns (p31 → sex, p53_i0 → date_baseline, etc.)
  • recode_variables() — converts integer codes to labelled factors (sex, ethnicity, smoking)
  • Derived variables — age at baseline from year and month of birth (day set to 15th)

Always run glimpse() on the merged dataset after renaming — not just on the inputs. Joining can silently change column types.


UKB built-in ICD-10 first occurrence

UKB provides pre-computed first-occurrence ICD-10 fields from HES:

  • Field 41270 — all ICD-10 diagnoses (all episodes, repeated)
  • Field 41280 — date for each entry in Field 41270 (yes, there is a date)
  • Field 41202 / 41204 — first occurrence, main and secondary positions

Key limitation: HES records only. For conditions commonly diagnosed in primary care, these fields substantially undercount cases. Combining GP and HES (Steps 4–5 above) gives a more complete picture.

UK Biobank Data Showcase