Data Management
Merging extracted datasets and deriving analysis variables
All code on this page is example code — adapt to your own study variables before running.
This page explains how to use scripts/manage_dataset.R to combine extracted datasets into a single analysis-ready file.
What the data looks like after extraction
After running the extraction scripts you have three separate data frames:
| Data frame | Format | One row = |
|---|---|---|
dataset |
Wide | One participant |
diagnosis_events |
Long (Approach A) | One diagnostic event |
diagnosis_df |
Wide (Approach B) | One participant, with first dates + binary flags |
prescriptions |
Long | One prescription record |
The choice between long and wide format for diagnoses is made in extract_diagnoses.R — see that script for details. Always run dplyr::glimpse() on each data frame before merging, and again after.
GP data availability
Only about 45% of UKB participants have linked GP records. If your analysis uses GP data, filter before merging:
dataset_gp <- dataset %>%
dplyr::filter(p42040 > 0) # p42040 = count of linked GP clinical recordsNot filtering introduces systematic bias — non-linked participants would be treated as “no disease” rather than “data not available.” See the Dataset Reference for the three GP linkage fields (p42040, p42039, p42038).
Derive first diagnosis date (Approach A)
If you used extract_diagnoses.R Approach A (all events), derive the first event per participant and condition before merging:
first_diagnosis <- diagnosis_events %>%
dplyr::group_by(eid, condition) %>%
dplyr::slice_min(date, n = 1, with_ties = FALSE) %>%
dplyr::ungroup()GP and HES may give different first dates for the same participant. Use first_by_source (see manage_dataset.R Step 4) for source-specific dates.
If you used Approach B (get_df()), first dates and binary flags are already in diagnosis_df — skip this step.
Merging datasets
Join by eid. Always coerce eid to integer in every table first:
analysis_dataset <- dataset %>%
dplyr::left_join(first_diagnosis, by = "eid") %>%
dplyr::left_join(prescriptions, by = "eid")See manage_dataset.R Step 5 for left_join, inner_join, and count + join options. After merging, run glimpse() on the result and check for duplicate eids — a one-to-many join can silently multiply rows.
Renaming, recoding, and derived variables
manage_dataset.R Step 7 provides example functions for:
rename_variables()— renames p-coded columns (p31 → sex, p53_i0 → date_baseline, etc.)recode_variables()— converts integer codes to labelled factors (sex, ethnicity, smoking)- Derived variables — age at baseline from year and month of birth (day set to 15th)
Always run glimpse() on the merged dataset after renaming — not just on the inputs. Joining can silently change column types.
UKB built-in ICD-10 first occurrence
UKB provides pre-computed first-occurrence ICD-10 fields from HES:
- Field 41270 — all ICD-10 diagnoses (all episodes, repeated)
- Field 41280 — date for each entry in Field 41270 (yes, there is a date)
- Field 41202 / 41204 — first occurrence, main and secondary positions
Key limitation: HES records only. For conditions commonly diagnosed in primary care, these fields substantially undercount cases. Combining GP and HES (Steps 4–5 above) gives a more complete picture.