Dataset Reference
Confirmed column names for all clinical data sources used in UK Biobank extraction
All column names on this page are confirmed via the UKDC source code and ukbrapR documentation. They are shown after any standardisation applied by the extraction scripts (e.g. renaming event_dt to date). The original column names returned by ukbrapR are noted where they differ.
To look up any Field ID, visit the UK Biobank Data Showcase.
Reverse lookup: I want to find…
| I want to find… | Dataset | Key column(s) |
|---|---|---|
| Participant identifier | All datasets | eid (integer) |
| Sex | Demographics | p31 (0 = Female, 1 = Male) |
| Year of birth | Demographics | p34 |
| Month of birth | Demographics | p52 |
| Date of birth (approximated) | Demographics | Derived: p34 + p52 |
| Assessment centre visit date | Demographics | p53_i0 (visit 0), p53_i1 (visit 1) |
| Whether GP records exist | Demographics | p42040, p42039, p42038 |
| GP clinical diagnosis events | GP clinical | eid, date, code, source |
| Hospital diagnosis events | HES | eid, date, diag_icd10, diag_icd9, source |
| Prescription events | GP prescriptions | eid, date, drug_name, bnf_code |
| HbA1c measurements | Demographics | p30750_i0, p30750_i1, p30750_i2 |
| Random glucose | Demographics | p30740_i0, p30740_i1, p30740_i2 |
GP clinical records
Returned by ukbrapR::get_diagnoses() as $gp_clinical.
GP records are available for approximately 45% of UK Biobank participants. A participant with no rows in this table may simply have no GP linkage; an absence of rows does not mean the participant is disease-free.
| Column | Type | Description | Notes |
|---|---|---|---|
| eid | integer | Participant identifier | Coerce to integer before joins |
| event_dt | Date | Date of the clinical event | ukbrapR column name. Renamed to date by the extraction script. Contains UKB placeholder dates (see below). |
| read_2 | character | Read version 2 code | Column name in ukbrapR v0.3.14+. Older versions use read_code. |
| read_3 | character | CTV3 code | Present in ukbrapR v0.3.14+. |
| source | character | Added by extraction script | Always "GP" after processing |
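The standardisation described in the table can be sketched in base R as follows. This is a minimal illustration, not the extraction script itself: the helper name standardise_gp and the toy rows (including the example Read codes) are hypothetical, and the real $gp_clinical output may contain additional columns.

```r
# Toy stand-in for the $gp_clinical element returned by ukbrapR::get_diagnoses()
gp_raw <- data.frame(
  eid      = c("1000001", "1000002"),   # stored as character to show the coercion
  event_dt = as.Date(c("2005-03-14", "2011-09-02")),
  read_2   = c("C10F.", NA),            # illustrative codes only
  read_3   = c(NA, "X40J5"),
  stringsAsFactors = FALSE
)

# Hypothetical helper mirroring the renames listed in the table above
standardise_gp <- function(df) {
  # Older ukbrapR versions name the Read v2 column read_code
  if ("read_code" %in% names(df) && !("read_2" %in% names(df))) {
    names(df)[names(df) == "read_code"] <- "read_2"
  }
  df$eid <- as.integer(df$eid)                   # integer join key
  names(df)[names(df) == "event_dt"] <- "date"   # standardised date column
  df$source <- "GP"                              # tag the data source
  df
}

gp_clinical <- standardise_gp(gp_raw)
```

Coercing eid once, immediately after loading, avoids type-mismatch problems when the table is later joined to demographics or HES data.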
UK Biobank uses three dates to represent missing or unknown event dates in GP records:
1901-01-01, 1902-02-02, 1903-03-03
These are not real event dates. The extraction script replaces them with NA. If you process raw ukbrapR output directly, remove them before any date-based analysis.
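If you are working with raw output, the replacement can be done along these lines. The sketch uses a toy data frame and the raw column name event_dt; the variable name placeholder_dates is an assumption for illustration.

```r
# The three placeholder dates listed above
placeholder_dates <- as.Date(c("1901-01-01", "1902-02-02", "1903-03-03"))

# Toy data: one genuine event date and two placeholders
gp_raw <- data.frame(
  eid      = c(1L, 2L, 3L),
  event_dt = as.Date(c("2008-05-20", "1902-02-02", "1903-03-03"))
)

# Set placeholder dates to NA so they cannot distort date-based analyses
gp_raw$event_dt[gp_raw$event_dt %in% placeholder_dates] <- NA
```

Setting the dates to NA rather than dropping the rows keeps the record that an event was coded, even though its timing is unknown.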
HES diagnosis records
Returned by ukbrapR::get_diagnoses() as $hesin_diag.
| Column | Type | Description | Notes |
|---|---|---|---|
| eid | integer | Participant identifier | Coerce to integer before joins |
| epistart | Date | Episode start date | ukbrapR column name. Renamed to date by the extraction script. |
| diag_icd10 | character | ICD-10 diagnosis code | Used from ~1995 onwards. Stored as the full sub-code (e.g. E110); ukbrapR matches on the first three characters. |
| diag_icd9 | character | ICD-9 diagnosis code | Used before ~1995. Must be queried alongside diag_icd10 to cover the full record history. |
| source | character | Added by extraction script | Always "HES" after processing |
A HES record has both diag_icd10 and diag_icd9, but only one will be populated depending on the episode date. Records from before approximately 1995 will have an ICD-9 code in diag_icd9 and a missing diag_icd10. If you filter on !is.na(diag_icd10) you will silently exclude all pre-1995 diagnoses.
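One way to avoid this pitfall is to test each coding system separately and combine the results. The sketch below uses toy rows and diabetes prefixes (E11 for ICD-10, 250 for ICD-9) chosen purely for illustration; substitute the code list for your own phenotype.

```r
# Toy HES rows: one pre-1995 episode coded in ICD-9, two later ones in ICD-10
hes <- data.frame(
  eid        = c(1L, 2L, 3L),
  date       = as.Date(c("1992-04-01", "2003-11-17", "2010-02-09")),
  diag_icd10 = c(NA, "E110", "I219"),
  diag_icd9  = c("2500", NA, NA),
  stringsAsFactors = FALSE
)

# Match on code prefixes; the !is.na() guard turns a missing code into FALSE
is_icd10_hit <- !is.na(hes$diag_icd10) & startsWith(hes$diag_icd10, "E11")
is_icd9_hit  <- !is.na(hes$diag_icd9)  & startsWith(hes$diag_icd9, "250")

# Keep a row if either coding system matches
diabetes_hes <- hes[is_icd10_hit | is_icd9_hit, ]
```

In this toy example the 1992 episode is retained because the ICD-9 column is checked independently of diag_icd10.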
GP prescriptions (gp_scripts)
Accessed via Arrow from gp_scripts.tsv (raw) or a pre-converted parquet version. The file contains approximately 57 million rows — one per prescription item.
| Column | Type | Description | Notes |
|---|---|---|---|
| eid | integer | Participant identifier | Coerce to integer before joins |
| issue_date | character | Date the prescription was issued | Raw format is dd/mm/yyyy (character). Must be parsed with as.Date(issue_date, format = "%d/%m/%Y"). |
| drug_name | character | Free-text drug name | Inconsistent capitalisation and abbreviations across practices. Use case-insensitive regex. |
| bnf_code | character | BNF chapter code | Missing in ~23% of rows. Never rely on BNF alone. |
| quantity | character | Quantity prescribed | Not used in extraction scripts; available for sensitivity analyses. |
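The advice on drug_name and bnf_code can be combined as in the following sketch, which matches on the free-text name with a case-insensitive pattern and treats the BNF code as a supplementary signal. The toy rows, the search term, and the BNF prefix and its format are illustrative assumptions, not values taken from the extraction scripts.

```r
# Toy prescription rows with inconsistent naming and missing BNF codes
scripts <- data.frame(
  eid       = c(1L, 2L, 3L, 4L),
  drug_name = c("Metformin 500mg tablets", "METFORMIN HCL 850MG TABS",
                "metformin m/r 1g", "Atorvastatin 20mg tablets"),
  bnf_code  = c("06.01.02.02.00", NA, NA, "02.12.00.00.00"),
  stringsAsFactors = FALSE
)

# Case-insensitive match on the free-text drug name
name_hit <- grepl("metformin", scripts$drug_name, ignore.case = TRUE)

# BNF prefix match; a missing code counts as no match rather than NA
bnf_hit <- !is.na(scripts$bnf_code) & startsWith(scripts$bnf_code, "06.01.02")

# Keep a row if either signal matches
hit <- name_hit | bnf_hit
metformin_scripts <- scripts[hit, ]
```

Here the BNF match alone would find only one of the three metformin rows, which is why the name match is the primary filter.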
issue_date is a character column in dd/mm/yyyy format
The raw gp_scripts file stores dates as strings in day/month/year order, not the ISO standard year-month-day. Calling as.Date() without format = "%d/%m/%Y" does not parse these strings correctly. R falls back to its default year-first formats, so depending on the value you get NA, an error, or a nonsensical date (for example, one in year 0001).
```r
# WRONG: no format given, so R tries year-first formats and misreads the string
as.Date("01/06/2010")
# CORRECT: specify the day/month/year format
as.Date("01/06/2010", format = "%d/%m/%Y")
```
Demographics (UKB field naming convention)
Extracted from the UKB database using ukbAid::proj_create_dataset(). Column names follow the RAP convention: p<field_id> for single-instance fields, p<field_id>_i<instance> for multi-instance fields.
| Column | Field ID | Type | Description | Notes |
|---|---|---|---|---|
| eid | n/a | integer | Participant identifier | Universal join key |
| p31 | 31 | integer | Sex | 0 = Female, 1 = Male |
| p34 | 34 | integer | Year of birth | Use with p52 to approximate date of birth |
| p52 | 52 | integer | Month of birth | 1–12. Combine with p34 to approximate date of birth (day set to 15). |
| p53_i0 | 53, instance 0 | Date | Assessment centre date, visit 0 | Initial assessment (2006–2010). Standard baseline date for incident/prevalent classification. |
| p53_i1 | 53, instance 1 | Date | Assessment centre date, visit 1 | First repeat visit (2012–2013). Not all participants attended. |
| p53_i2 | 53, instance 2 | Date | Assessment centre date, visit 2 | Second repeat visit (2014–2020). Sparse. |
| p42040 | 42040 | integer | Number of GP clinical event records | Non-zero if participant has linked GP data. Use to derive has_gp_data. |
| p42039 | 42039 | integer | Number of GP prescription records | Non-zero if participant has linked prescription data. |
| p42038 | 42038 | integer | Number of GP registration records | Non-zero if participant has registration record. |
| p30750_i0 | 30750, instance 0 | numeric | HbA1c (mmol/mol), visit 0 | Raw assay value. Matched by assay date p30751_i0. |
| p30751_i0 | 30751, instance 0 | Date | HbA1c assay date, visit 0 | Assay date for p30750_i0. |
| p30740_i0 | 30740, instance 0 | numeric | Random glucose (mmol/L), visit 0 | Non-fasting. Do not apply fasting glucose thresholds. |
| p30741_i0 | 30741, instance 0 | Date | Glucose assay date, visit 0 | Assay date for p30740_i0. |
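Because p53_i0 is the baseline date for incident/prevalent classification, a typical use is to compare each participant's first diagnosis date against it. The sketch below is one possible convention, not the extraction scripts' definition: it assumes a hypothetical first_dx table with one first_date per participant, and it treats a diagnosis on the baseline date itself as prevalent.

```r
# Toy baseline dates (p53_i0) and first diagnosis dates
baseline <- data.frame(
  eid    = c(1L, 2L, 3L),
  p53_i0 = as.Date(c("2008-01-10", "2009-06-30", "2007-03-05"))
)
first_dx <- data.frame(
  eid        = c(1L, 2L),
  first_date = as.Date(c("2005-02-01", "2012-08-19"))
)

# Left join so participants without any diagnosis are kept
cohort <- merge(baseline, first_dx, by = "eid", all.x = TRUE)

# On or before baseline = prevalent; after baseline = incident
cohort$status <- ifelse(
  is.na(cohort$first_date), "no recorded diagnosis",
  ifelse(cohort$first_date <= cohort$p53_i0, "prevalent", "incident")
)
```

Whether same-day diagnoses count as prevalent or incident is a study-design decision, so state the rule you use explicitly.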
UK Biobank does not release an exact date of birth. Approximate from year and month:
```r
demographics <- demographics |>
  dplyr::mutate(
    date_of_birth = as.Date(
      paste0(p34, "-", sprintf("%02d", p52), "-15")  # 15th of birth month
    )
  )
```
If month is not available, use July 1st as the annual midpoint: as.Date(paste0(p34, "-07-01")).
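Both rules can be applied in a single pass so that rows with a missing month fall back to the annual midpoint. The following base R sketch uses toy data; the intermediate names month_known and dob_string are assumptions for illustration.

```r
# Toy demographics: the second participant has no month of birth
demographics <- data.frame(
  eid = c(1L, 2L),
  p34 = c(1950L, 1962L),
  p52 = c(6L, NA)
)

month_known <- !is.na(demographics$p52)

# 15th of the birth month when known, otherwise 1 July of the birth year
dob_string <- ifelse(
  month_known,
  sprintf("%d-%02d-15", demographics$p34, demographics$p52),
  sprintf("%d-07-01", demographics$p34)
)

# An explicit format avoids relying on as.Date() guessing the layout
demographics$date_of_birth <- as.Date(dob_string, format = "%Y-%m-%d")
```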
has_gp_data
```r
demographics <- demographics |>
  dplyr::mutate(
    # %in% TRUE treats a missing count (NA) as "no records" instead of returning NA
    has_gp_data = (p42040 > 0) %in% TRUE |
      (p42039 > 0) %in% TRUE |
      (p42038 > 0) %in% TRUE
  )
```
has_gp_data = FALSE means the participant has no GP linkage, not that they have no diagnoses. These participants will have no GP clinical events or prescriptions, regardless of their health status.
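In practice this means a GP-derived flag should have three states rather than two. The sketch below shows one way to encode that, using toy data and a hypothetical column name gp_status: TRUE for a recorded event, FALSE for linked participants with no event, and NA where there is no linkage.

```r
# Toy data: participant 3 has no GP linkage
demographics <- data.frame(
  eid         = c(1L, 2L, 3L),
  has_gp_data = c(TRUE, TRUE, FALSE)
)
gp_events <- data.frame(
  eid  = 1L,
  date = as.Date("2010-01-01")
)

# NA = unknown (no linkage); TRUE/FALSE only for participants with GP data
demographics$gp_status <- ifelse(
  !demographics$has_gp_data,
  NA,
  demographics$eid %in% gp_events$eid
)
```

Keeping unlinked participants as NA stops them from being counted as disease-free in prevalence denominators.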
Standardised event output
After running the extraction scripts, you will have a tidy events data frame with consistent columns across all sources. This is the format used for downstream analysis.
| Column | Type | Description |
|---|---|---|
| eid | integer | Participant identifier |
| date | Date | Event date (placeholder dates removed; future dates removed) |
| source | character | Where the event came from: "GP" or "HES" |
| code | character | The diagnostic code (GP events only) |
| diag_icd10 | character | ICD-10 code (HES events only) |
| diag_icd9 | character | ICD-9 code (HES events only) |
| drug_name | character | Drug name (prescription events only) |
| bnf_code | character | BNF code (prescription events only) |
| drug_class | character | Drug class label added by extract_medications.R |
Combine GP events, HES events, and prescription events with dplyr::bind_rows() — missing columns are filled with NA.
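A minimal illustration of that combination, using one toy row per source, is shown here. The values are invented, and the prescription row deliberately omits source because this page does not specify what the extraction scripts set for prescriptions.

```r
library(dplyr)

# One toy row per source, each with only the columns that source provides
gp_events <- tibble(
  eid = 1L, date = as.Date("2010-01-05"), source = "GP", code = "C10F."
)
hes_events <- tibble(
  eid = 2L, date = as.Date("2012-03-11"), source = "HES",
  diag_icd10 = "E110", diag_icd9 = NA_character_
)
rx_events <- tibble(
  eid = 1L, date = as.Date("2010-02-01"),
  drug_name = "Metformin 500mg tablets", bnf_code = NA_character_,
  drug_class = "biguanide"
)

# bind_rows() matches columns by name and fills absent ones with NA
events <- bind_rows(gp_events, hes_events, rx_events)
```

The result has one row per event and the union of all columns, so a HES row has NA in code and a GP row has NA in diag_icd10.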