Start Here

A step-by-step path from zero to your first extraction

Published: June 2, 2026

This page tells you exactly what to do and in what order. Follow it top to bottom the first time you use this site.

Before you begin

You need two things before any of the scripts on this site will work:

- UK Biobank RAP access: applied for through the UK Biobank portal and the DNAnexus platform. If you do not have this yet, ukbAid’s initial setup guide walks you through the full application process.
- A GitHub project repository: set up via ukbAid. The scripts connect your RAP session to GitHub every time you open one. Without this, nothing will save correctly.

If you have completed the ukbAid initial setup and have a working RAP project, you are ready to continue.

The first time you work in a project, open scripts/setup.R in your RAP RStudio session and run steps 1–4 in order.

| Step | What it does | Run when |
|------|--------------|----------|
| 1 | Installs pak, cli, parquetize, ukbAid | Every session |
| 2 | Connects your session to GitHub | Every session |
| 3 | Installs ukbrapR and loads all libraries | Every session |
| 4 | Extracts your chosen UKB variables to a parquet file | Once, then comment out |

Step 4 takes 5–60 minutes depending on how many variables you request and current platform load. Do not close the session while it runs.

Important

After step 4 finishes, comment it out. If you run it again in a future session it will re-extract everything and overwrite your saved file.
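
If you would like a safety net in addition to commenting step 4 out, one option is to guard the extraction with a file check. The sketch below is only an illustration: `parquet_path`, `run_extraction()` and `extract_once()` are hypothetical names, not part of scripts/setup.R, and the placeholder function stands in for whatever call step 4 actually makes.

```r
# Hypothetical path; substitute the file that step 4 writes in your project.
parquet_path <- tempfile(fileext = ".parquet")

# Hypothetical stand-in for the real step 4 extraction call.
run_extraction <- function(path) {
  writeLines("placeholder", path)
  invisible(path)
}

# Run the extraction only when the output file does not exist yet.
# Returns TRUE if the extraction ran and FALSE if it was skipped.
extract_once <- function(path, extract_fn) {
  if (file.exists(path)) {
    message("Found ", path, "; skipping extraction.")
    return(invisible(FALSE))
  }
  extract_fn(path)
  invisible(TRUE)
}

first_run  <- extract_once(parquet_path, run_extraction)  # runs the extraction
second_run <- extract_once(parquet_path, run_extraction)  # skipped, file exists
```

With a guard like this, an accidental re-run leaves the saved file untouched. Commenting the step out remains the approach this guide recommends.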

At the start of each new RAP session, run steps 1–3 and step 5:

- Steps 1–3: reinstall packages and reconnect to GitHub (the RAP environment resets between sessions, so nothing persists automatically)
- Step 5: load your saved dataset from the parquet file (takes seconds)
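
As a rough picture of what the step 5 load can look like, here is a minimal sketch using the arrow package. It assumes arrow is available in your session; the file path and the column names (`eid`, `p31`) are placeholders, and the small file written at the top exists only so the example runs on its own.

```r
library(arrow)

# Stand-in for your saved extraction so the sketch is self-contained.
# In practice, point parquet_path at the file written by step 4.
parquet_path <- tempfile(fileext = ".parquet")
write_parquet(
  data.frame(eid = 1:3, p31 = c("Female", "Male", "Female"), p21022 = c(54, 61, 47)),
  parquet_path
)

# Load the whole dataset.
ukb <- read_parquet(parquet_path)

# Or load only the columns you need, which is faster on wide files.
ukb_small <- read_parquet(parquet_path, col_select = c("eid", "p31"))
```

Reading a subset of columns is worth considering when the extraction contains hundreds of variables but the current analysis needs only a few.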

Once your dataset is loaded, decide what clinical data you need.

If your study requires diagnostic events from GP records or hospital episodes:

- Build a code list first. Your extraction will find nothing without one. Read the Code Lists guide and create a CSV with your condition codes (Read v2, CTV3, ICD-10, ICD-9).
- Run extract_diagnoses.R. Open scripts/extract_diagnoses.R, point it at your code list CSV, and run it step by step.
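
To make the code list step concrete, the sketch below builds a small CSV and runs a few sanity checks on it. The column names (`condition`, `vocab_id`, `code`) are an assumption about the expected layout, and the three codes are illustrative only, not a validated code list. Check the Code Lists guide for the exact format the extraction script expects.

```r
# Assumed layout: one row per code, with a condition label, the coding
# system, and the code itself. Example codes are illustrative only.
code_list <- data.frame(
  condition = c("t2dm", "t2dm", "t2dm"),
  vocab_id  = c("ICD10", "Read2", "CTV3"),
  code      = c("E11", "C10F.", "X40J5"),
  stringsAsFactors = FALSE
)

# Hypothetical location; save it wherever your project keeps code lists.
csv_path <- tempfile(fileext = ".csv")
write.csv(code_list, csv_path, row.names = FALSE)

# Read everything as character so numeric-looking codes keep leading
# zeros and are not converted to numbers.
codes <- read.csv(csv_path, colClasses = "character")

# Basic checks before pointing an extraction at the file.
stopifnot(all(c("condition", "vocab_id", "code") %in% names(codes)))
stopifnot(!any(is.na(codes$code) | codes$code == ""))
stopifnot(!any(duplicated(codes[, c("vocab_id", "code")])))

# Count of codes per coding system.
table(codes$vocab_id)
```

An empty or malformed code list tends to produce an empty result rather than an error, so checks like these are cheap insurance.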

If your study requires GP prescription data:

Run extract_medications.R. Open scripts/extract_medications.R and update the two configuration values at the top:
- BNF_PREFIX — the BNF chapter for your drug class
- DRUG_PATTERN — the drug names to match
Test your pattern on a sample (Step 3 in the script) before running on all 57 million rows.
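
The idea behind testing on a sample can be sketched in base R. Everything below is illustrative: the values given to `BNF_PREFIX` and `DRUG_PATTERN`, the toy data, and the column names `bnf_code` and `drug_name` are assumptions, and combining the two filters with OR is a guess at the matching logic rather than a description of what extract_medications.R does.

```r
# Illustrative configuration values; replace with your own.
BNF_PREFIX   <- "06.01.02"
DRUG_PATTERN <- "metformin|glucophage"

# Toy stand-in for a sample of prescription rows.
sample_rx <- data.frame(
  eid       = c(1, 2, 3, 4),
  bnf_code  = c("06.01.02.02.00", "06.01.02.02.00", "02.12.00.00.00", NA),
  drug_name = c("Metformin 500mg tablets", "GLUCOPHAGE SR 500mg",
                "Simvastatin 40mg tablets", "metformin hydrochloride 850mg"),
  stringsAsFactors = FALSE
)

# Case-insensitive match on the drug name.
by_name <- grepl(DRUG_PATTERN, sample_rx$drug_name, ignore.case = TRUE)

# Prefix match on the BNF code; rows with a missing code count as no match.
by_bnf <- !is.na(sample_rx$bnf_code) & startsWith(sample_rx$bnf_code, BNF_PREFIX)

matched <- sample_rx[by_name | by_bnf, ]

# Inspect which drug names were caught, most frequent first.
sort(table(tolower(matched$drug_name)), decreasing = TRUE)
```

Scanning the matched names on a small sample is a quick way to spot a pattern that is too broad (unrelated drugs appear) or too narrow (brand names or spelling variants are missing) before committing to the full run.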

Open scripts/manage_dataset.R to combine your extracted datasets into a single analysis-ready file:

- Filter to GP-linked participants if your analysis uses GP data (filter(p42040 > 0)).
- Derive the first diagnosis date per participant and condition using slice_min() (or use the get_df() wide-format output if you chose Approach B in extract_diagnoses.R).
- Merge diagnosis events, prescriptions, and demographics by eid.
- Rename and recode: the script includes example functions for renaming p-coded columns and converting categoricals to labelled factors.
- Save the merged dataset to parquet.
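
The filter, first-date, and merge steps above can be sketched with dplyr on toy data. This is a minimal illustration, not the contents of manage_dataset.R: the column names `condition` and `event_date`, the new names `sex` and `n_gp_events`, and all values are assumptions.

```r
library(dplyr)

# Toy demographics: p42040 is used here as the GP record count.
demographics <- tibble(
  eid    = 1:4,
  p42040 = c(12, 0, 5, 3),
  p31    = c("Female", "Male", "Male", "Female")
)

# Toy long-format diagnosis events (hypothetical column names).
diagnoses <- tibble(
  eid        = c(1L, 1L, 3L, 3L, 2L),
  condition  = c("t2dm", "t2dm", "t2dm", "htn", "t2dm"),
  event_date = as.Date(c("2010-05-01", "2008-03-15", "2012-01-10",
                         "2011-07-04", "2009-09-09"))
)

# Keep only participants with linked GP data.
gp_linked <- demographics |>
  filter(p42040 > 0)

# Earliest event per participant and condition.
first_dx <- diagnoses |>
  group_by(eid, condition) |>
  slice_min(event_date, n = 1, with_ties = FALSE) |>
  ungroup()

# Merge by eid and give the p-coded columns readable names.
analysis <- gp_linked |>
  left_join(first_dx, by = "eid") |>
  rename(sex = p31, n_gp_events = p42040)
```

Using `left_join()` from the filtered demographics keeps GP-linked participants who have no diagnosis (their `condition` and `event_date` are `NA`), which is usually what a cohort analysis needs. Setting `with_ties = FALSE` guarantees one row per participant and condition even when two events share the earliest date.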

| I want to… | Go to |
|------------|-------|
| Understand the extraction scripts step by step | Extract Data |
| Build or find a diagnostic code list | Code Lists |
| Merge datasets and derive first diagnosis dates | Data Management |
| Look up what a function does | Functions |
| Check what columns are in the clinical data | Dataset Reference |
| Understand a silent error or unexpected result | Common Mistakes |