Data sources, query patterns, and platform orientation for RAP
Published
June 2, 2026
Set up your RAP session
Before you can work with any UK Biobank data, your RAP environment needs to be connected to your project and the necessary packages need to be installed. The RAP environment resets when a session ends — packages do not persist.
Start with ukbAid. If you have not already done so, complete the ukbAid initial setup. This covers requesting RAP access, configuring your GitHub personal access token, and creating your project repository. None of the scripts on this site will work without that foundation.
Open scripts/setup.R and run only the sections relevant to your current session. The script is structured in numbered steps — not all steps are needed every time:
Steps 1–3 (packages, GitHub, libraries): run every session
Step 4 (create dataset): run once per project, then comment out
Step 5 (load dataset): run every session after step 4 is done
Steps 6–8 (extract, join, save): run when you need that specific step
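Because packages do not persist between sessions, the start of each session typically reinstalls whatever is missing. The following is a minimal sketch of that check in base R. The helper names `missing_packages()` and `install_missing()` and the example package vector are illustrative assumptions, not the contents of scripts/setup.R, and the sketch covers CRAN packages only.

```r
# Hypothetical helper: report which packages are not installed in this session.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install only what is missing, and return those names.
install_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}

# Example call at the top of a session (package names are illustrative):
# install_missing(c("arrow", "dplyr"))
```

Checking first keeps the per-session setup quick when a package is already available.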
Important: per the ukbrapR documentation, run export_tables() before GP or HES extraction
According to the ukbrapR README on GitHub, before calling get_diagnoses() for the first time you must run:
ukbrapR::export_tables()
This submits a job that copies approximately 10 GB of linked clinical data (GP records, HES, cancer registry, death records) to RAP persistent storage. It takes several minutes. You only need to do it once per project — subsequent sessions read from the exported files automatically.
Note for discussion
UKDC’s extraction scripts call ukbrapR::get_diagnoses() directly without first running export_tables(), and diagnoses are returned correctly. It is unclear whether the linked tables were already present in this project’s RAP storage from a prior step, or whether export_tables() is not actually required in this setup. This should be clarified with supervisors before advising others to run or skip this step.
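One way to investigate the open question above is to check whether exported files are already present before submitting the export job. The sketch below assumes the exported data sit under /mnt/project/ukbrapr_data/ (the path used for gp_scripts.tsv later on this page) and that the presence of that one file is a reasonable proxy for a completed export. Both are assumptions to verify in your own project, and `needs_export()` is a hypothetical helper name.

```r
# Hypothetical helper: TRUE when the marker file is absent,
# i.e. an export may still be needed.
needs_export <- function(marker = "/mnt/project/ukbrapr_data/gp_scripts.tsv") {
  !file.exists(marker)
}

if (needs_export()) {
  message("No exported gp_scripts.tsv found; ukbrapR::export_tables() may be needed.")
  # ukbrapR::export_tables()  # uncomment on RAP once confirmed as required
} else {
  message("Exported files found; export_tables() has probably run already.")
}
```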
Extract diagnoses and medication data
Once your session is set up, you can extract clinical events for your study population. Two ready-to-use scripts are available.
GP clinical records — Read v2 and CTV3 codes from primary care
Hospital Episode Statistics (HES) — ICD-10 and ICD-9 codes from inpatient admissions
The script uses a code list CSV that you provide (see Code Lists for how to build one) and queries both sources in a single call to ukbrapR::get_diagnoses().
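As a rough illustration of that single call, the sketch below builds a tiny code list and passes it to get_diagnoses(). It assumes the code list needs the columns vocab_id and code, as the ukbrapR README describes at the time of writing (check the current version). The two codes are the illustrative ICD examples used further down this page, not a validated phenotype definition, and the file path in the comment is hypothetical.

```r
# Illustrative code list. In practice you would read your own CSV, e.g.
# codes_df <- read.csv("codes/my_condition.csv")  # hypothetical path
codes_df <- data.frame(
  vocab_id = c("ICD10", "ICD9"),
  code     = c("E11", "250"),
  stringsAsFactors = FALSE
)

# Basic sanity check before submitting the query.
stopifnot(all(c("vocab_id", "code") %in% names(codes_df)))

# Run the extraction only where ukbrapR is installed and project storage is mounted.
if (requireNamespace("ukbrapR", quietly = TRUE) && dir.exists("/mnt/project")) {
  diagnoses <- ukbrapR::get_diagnoses(codes_df)
  str(diagnoses, max.level = 1)  # inspect what came back
}
```

Read v2 and CTV3 codes for the GP records would go in the same table as additional rows with their own vocab_id values.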
Medications — GP prescriptions
scripts/extract_medications.R extracts prescription records from the gp_scripts table (~57 million rows) using BNF chapter codes and/or drug-name regex patterns. The script uses Arrow lazy loading so only matching rows enter R memory.
Three clinical data sources
UK Biobank provides three types of linked clinical data that are relevant to most disease studies. Each one is stored and accessed differently.
| Data source | What it contains | How it is accessed | Date column |
| --- | --- | --- | --- |
| GP clinical records | Read v2 and CTV3 diagnostic, procedure, and admin codes from primary care | ukbrapR::get_diagnoses(), after running export_tables() once | event_dt |
| Hospital Episode Statistics (HES) | ICD-10 and ICD-9 diagnosis codes from inpatient hospital admissions | ukbrapR::get_diagnoses(), after running export_tables() once | epistart |
| GP prescriptions (gp_scripts) | One row per prescription item: drug name, BNF code, issue date. ~57 million rows. | Arrow file scan: arrow::open_delim_dataset() | issue_date |
Note: Nothing is loaded into memory until you ask for it
GP prescriptions are stored as a large tab-delimited file (gp_scripts.tsv, ~57 million rows). The Arrow package lets you describe a query — which rows to keep, which columns to select — and only the matching rows are loaded into R memory when you call collect(). Opening the dataset with open_delim_dataset() is instant; the work happens at collect().
Warning: Only about 45% of participants have linked GP records
UK Biobank participants with linked GP data represent roughly 45% of the full cohort. The remaining 55% have no GP clinical records or prescriptions. This is not because they had no health events, but because their GP data was never linked.
This matters for two reasons:
A participant with zero GP events may simply lack GP linkage rather than be disease-free. Never interpret an empty GP record as evidence of health.
Participants without GP linkage are not a random sample of the cohort. Linkage rates vary by region and GP practice.
The UKB demographics fields p42040, p42039, and p42038 contain counts of linked GP records. A participant has GP data if any of these counts is non-zero. See the Dataset Reference for details.
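The rule above can be sketched as a derived flag. The example below uses a made-up demographics table and assumes that a missing count means no linked records of that type. That assumption should be checked against the Dataset Reference before use.

```r
# Toy demographics table; real data would come from your project dataset.
demog <- data.frame(
  eid    = 1:4,
  p42040 = c(120, 0, NA, 0),   # GP clinical event records
  p42039 = c(300, 0, NA, 15),  # GP prescription records
  p42038 = c(2,   0, NA, 1)    # GP registration records
)

# Treat missing counts as zero (assumption), then flag anyone with
# at least one linked record of any type.
gp_counts <- demog[, c("p42040", "p42039", "p42038")]
gp_counts[is.na(gp_counts)] <- 0
demog$has_gp_data <- rowSums(gp_counts) > 0
```

A flag like has_gp_data makes it possible to separate "no GP events recorded" from "no GP linkage" in downstream analyses.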
How Arrow lazy loading works
When working with the GP prescriptions file (57 million rows), you cannot load it into R memory directly — it will crash your session. Arrow solves this with lazy evaluation: open the file, push filters into the scan, then collect only matching rows. Here is the pattern used in extract_medications.R:
library(arrow)
library(dplyr)

# Step 1: open the file; no data enters R yet
ds <- arrow::open_delim_dataset(
  "/mnt/project/ukbrapr_data/gp_scripts.tsv",  # path on RAP
  delim = "\t"                                 # tab-delimited
)

# Step 2: build a lazy query (still no data in memory)
query <- ds |>
  dplyr::select(eid, issue_date, drug_name, bnf_code) |>  # keep only needed columns
  dplyr::filter(grepl("^0601", bnf_code))                 # push filter into Arrow scan

# Step 3: collect; only matching rows enter R
result <- dplyr::collect(query)
flowchart LR
A["open_delim_dataset()<br/>open connection<br/>(no data yet)"] --> B["select()<br/>filter()<br/>build lazy query"]
B --> C["collect()<br/>read matching rows<br/>into R memory"]
C --> D["join / mutate<br/>downstream work"]
The most important rule: filter BEFORE collect
Calling collect() before filtering loads the entire 57-million-row file into RAM. On most RAP instances this will crash your session with no warning.
# WRONG: loads all 57M rows, then filters in R
prescriptions <- ds |>
  dplyr::collect() |>
  dplyr::filter(grepl("^0601", bnf_code))  # too late; data already in memory

# CORRECT: filter pushes into Arrow, only matching rows enter R
prescriptions <- ds |>
  dplyr::filter(grepl("^0601", bnf_code)) |>  # inside Arrow scan
  dplyr::collect()
Why select() before collect() also matters
Arrow reads only the columns you ask for. Selecting four columns from a file with twenty reduces the amount of data Arrow has to parse even for matching rows. Always select() down to the columns you need before collect().
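To try the select-then-filter-then-collect pattern without touching the real file, the sketch below writes a four-row stand-in to a temporary file and queries it. The rows are made up, the extra quantity column exists only to show a column being dropped, and the combined BNF-or-drug-name filter is an illustration rather than the exact logic in extract_medications.R.

```r
library(arrow)
library(dplyr)

# Write a tiny tab-delimited stand-in for gp_scripts.tsv (made-up rows).
toy <- data.frame(
  eid        = c(1, 2, 3, 4),
  issue_date = c("2010-01-15", "2011-03-02", "2012-07-20", "2013-05-05"),
  drug_name  = c("Metformin 500mg tablets", "Simvastatin 40mg tablets",
                 "Gliclazide 80mg tablets", "metformin 850mg tablets"),
  bnf_code   = c("0601022B0", "0212000Y0", "0601021M0", "06.01.02.02.00"),
  quantity   = c("56 tablets", "28 tablets", "28 tablets", "56 tablets")
)
toy_path <- tempfile(fileext = ".tsv")
write.table(toy, toy_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Lazy query: select columns and filter inside the Arrow scan, then collect.
result <- arrow::open_delim_dataset(toy_path, delim = "\t") |>
  dplyr::select(eid, issue_date, drug_name, bnf_code) |>
  dplyr::filter(grepl("^0601", bnf_code) | grepl("[Mm]etformin", drug_name)) |>
  dplyr::collect()
```

The drug-name pattern catches the row whose BNF code is in a dotted format that the "^0601" prefix would miss. This is one reason to combine both kinds of filter.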
Field IDs and column naming
UK Biobank identifies every variable by a numeric Field ID. When you extract demographics using ukbAid::proj_create_dataset(), the column names in the output follow the RAP naming convention:
| Field ID | RAP column name | Contents |
| --- | --- | --- |
| 31 | p31 | Sex (0 = Female, 1 = Male) |
| 34 | p34 | Year of birth |
| 52 | p52 | Month of birth (1–12) |
| 53, instance 0 | p53_i0 | Date of attending assessment centre, visit 0 (2006–2010) |
| 53, instance 1 | p53_i1 | Date of attending assessment centre, visit 1 (2012–2013) |
| 42040 | p42040 | Number of GP clinical event records linked |
| 42039 | p42039 | Number of GP prescription records linked |
| 42038 | p42038 | Number of GP registration records linked |
The pattern is: single-instance fields are named p<field_id>; multi-instance fields are named p<field_id>_i<instance>.
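The naming pattern is simple enough to generate programmatically. Below is a small sketch, where `rap_col()` is a hypothetical helper and not part of ukbAid or ukbrapR:

```r
# Hypothetical helper: build a RAP column name from a field ID
# and an optional instance number.
rap_col <- function(field_id, instance = NULL) {
  field_id <- as.integer(field_id)  # avoids names like "p1e+05" for large IDs
  if (is.null(instance)) {
    paste0("p", field_id)
  } else {
    paste0("p", field_id, "_i", instance)
  }
}

rap_col(31)       # sex
rap_col(53, 0)    # assessment date, visit 0
rap_col(53, 0:1)  # both assessment visits
```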
UK Biobank does not release an exact date of birth. Year of birth (p34) and month of birth (p52) are available separately. A standard approximation is to set the day to the 15th of the birth month. If the month is not available, the 1st of July is a common fallback.
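The approximation can be sketched in base R as follows. The function name `approx_dob()` is hypothetical, and the code assumes p52 is stored as an integer from 1 to 12, as listed in the table above. If your extract stores month names instead, convert them first.

```r
# Hypothetical helper: approximate date of birth from year and month.
approx_dob <- function(year, month) {
  # Use the 15th when the month is known; fall back to 1 July when it is missing.
  day   <- ifelse(is.na(month), 1L, 15L)
  month <- ifelse(is.na(month), 7L, month)
  as.Date(
    sprintf("%04d-%02d-%02d", as.integer(year), as.integer(month), as.integer(day)),
    format = "%Y-%m-%d"
  )
}

# Toy example: one participant with a known month, one without.
demog <- data.frame(p34 = c(1950, 1962), p52 = c(3, NA))
demog$dob_approx <- approx_dob(demog$p34, demog$p52)
```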
The ICD-9 / ICD-10 split in HES
Hospital Episode Statistics used ICD-9 coding until approximately 1995 and ICD-10 coding from then onwards. The two coding systems are stored in separate columns of the HES data:
| Column | Coding system | Period |
| --- | --- | --- |
| diag_icd10 | ICD-10 (e.g. E11) | ~1995 onwards |
| diag_icd9 | ICD-9 (e.g. 250) | Before ~1995 |
If your study covers the full HES history, you must search both columns. Querying only diag_icd10 silently discards all diagnoses recorded before approximately 1995.
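A minimal sketch of searching both columns, using made-up rows and the two example codes from the table. Note that ICD-9 250 covers diabetes mellitus broadly while ICD-10 E11 is specific to type 2, so the two prefixes are illustrations, not an equivalent definition.

```r
# Toy HES diagnosis rows (made up); real rows come from the linked HES data.
hes <- data.frame(
  eid        = c(1, 2, 3),
  diag_icd10 = c("E119", NA, "I10"),
  diag_icd9  = c(NA, "2500", NA),
  stringsAsFactors = FALSE
)

# Match either coding system. grepl() returns FALSE for NA, so rows coded in
# only one system are handled without extra NA checks.
is_diabetes   <- grepl("^E11", hes$diag_icd10) | grepl("^250", hes$diag_icd9)
diabetes_rows <- hes[is_diabetes, ]
```

Supplying both ICD-10 and ICD-9 codes in the code list passed to get_diagnoses() serves the same purpose when using the extraction scripts.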