Start Here
A step-by-step path from zero to your first extraction
This page tells you exactly what to do and in what order. Follow it top to bottom the first time you use this site.
Before you begin
You need two things before any of the scripts on this site will work:
UK Biobank RAP access — applied for through the UK Biobank portal and the DNAnexus platform. If you do not have this yet, ukbAid’s initial setup guide walks you through the full application process.
A GitHub project repository — set up via ukbAid. The scripts connect your RAP session to GitHub every time you open one. Without this, nothing will save correctly.
If you have completed the ukbAid initial setup and have a working RAP project, you are ready to continue.
Session 1: First-time setup
Open scripts/setup.R in your RAP RStudio session and run steps 1–4 in order.
| Step | What it does | Run when |
|---|---|---|
| 1 | Installs pak, cli, parquetize, ukbAid | Every session |
| 2 | Connects your session to GitHub | Every session |
| 3 | Installs ukbrapR and loads all libraries | Every session |
| 4 | Extracts your chosen UKB variables to a parquet file | Once, then comment out |
| 5 | Loads your saved dataset from the parquet file | Every session after the first |
Step 4 takes 5–60 minutes depending on how many variables you request and current platform load. Do not close the session while it runs.
After step 4 finishes, comment it out. If you run it again in a future session it will re-extract everything and overwrite your saved file.
Every session after
At the start of each new RAP session, run steps 1–3 and step 5:
- Steps 1–3: reinstall packages and reconnect to GitHub (the RAP environment resets between sessions — nothing persists automatically)
- Step 5: load your saved dataset from the parquet file (takes seconds)
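The split between step 4 (extract once) and step 5 (load every session) can be pictured as a guard that only extracts when no saved file exists. This is a minimal base-R sketch, not the code in scripts/setup.R: load_or_extract(), fake_extraction(), and the file path are hypothetical names, and an .rds file stands in for the parquet file so the example runs without extra packages.

```r
# Hypothetical helper: load the saved dataset if present, otherwise extract once.
load_or_extract <- function(path, extract_fn) {
  if (file.exists(path)) {
    message("Saved dataset found: loading it (fast)")
    return(readRDS(path))
  }
  message("No saved dataset: running the extraction (slow)")
  dataset <- extract_fn()
  saveRDS(dataset, path)  # stand-in for writing the parquet file
  dataset
}

# Stand-in for the real extraction; counts how often it is called.
extraction_runs <- 0
fake_extraction <- function() {
  extraction_runs <<- extraction_runs + 1
  data.frame(eid = c(1001, 1002, 1003), p31 = c("Female", "Male", "Female"))
}

dataset_path <- tempfile(fileext = ".rds")  # hypothetical location
first_session <- load_or_extract(dataset_path, fake_extraction)  # extracts and saves
later_session <- load_or_extract(dataset_path, fake_extraction)  # only loads
```

The second call never reaches the extraction, which is the behaviour that commenting out step 4 gives you by hand.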
Your first extraction
Once your dataset is loaded, decide what clinical data you need.
GP and hospital diagnoses
If your study requires diagnostic events from GP records or hospital episodes:
Build a code list first. Your extraction will find nothing without one. Read the Code Lists guide and create a CSV with your condition codes (Read v2, CTV3, ICD-10, ICD-9).
Run extract_diagnoses.R. Open scripts/extract_diagnoses.R, point it at your code list CSV, and run it step by step.
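As an illustration of the code list step, the sketch below writes a small CSV and reads it back. The column names vocab_id, code, and condition are an assumption based on the layout ukbrapR describes for code lists; check the Code Lists guide for the format your script expects. The codes and file name are illustrative only, not a validated phenotype definition.

```r
# Illustrative code list: one row per code, tagged with its vocabulary.
# Column names are assumed, see the Code Lists guide for the required layout.
code_list <- data.frame(
  condition = c("diabetes", "diabetes", "diabetes"),
  vocab_id  = c("ICD10", "ICD10", "ICD9"),
  code      = c("E10", "E11", "250"),
  stringsAsFactors = FALSE
)

code_list_path <- file.path(tempdir(), "diabetes_codes.csv")  # hypothetical file name
write.csv(code_list, code_list_path, row.names = FALSE)

# Read codes back as text so numeric-looking codes such as "250" are not converted.
codes <- read.csv(code_list_path, colClasses = "character")

# Basic sanity checks before pointing the extraction script at the file.
stopifnot(all(c("condition", "vocab_id", "code") %in% names(codes)))
stopifnot(!anyNA(codes$code), all(nzchar(codes$code)))
```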
Medication prescriptions
If your study requires GP prescription data:
- Run extract_medications.R. Open scripts/extract_medications.R and update the two configuration values at the top:
  - BNF_PREFIX: the BNF chapter for your drug class
  - DRUG_PATTERN: the drug names to match
- Test your pattern on a sample (Step 3 in the script) before running on all 57 million rows.
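Testing on a sample can be sketched as follows. Only BNF_PREFIX and DRUG_PATTERN come from the script; the example values, the toy rows, and the bnf_code and drug_name column names are assumptions, so check them against Dataset Reference. The sketch reports the two checks separately rather than guessing how the script combines them.

```r
BNF_PREFIX   <- "02.12"                     # example value: lipid-regulating drugs
DRUG_PATTERN <- "atorvastatin|simvastatin"  # example value: regular expression

# Toy prescription rows (made up); real BNF code formats vary by data provider.
sample_scripts <- data.frame(
  eid       = c(1, 2, 3, 4),
  bnf_code  = c("02.12.00.00.00", "02.12.00.00.00", "04.03.03.00.00", NA),
  drug_name = c("Simvastatin 40mg tablets", "Ezetimibe 10mg tablets",
                "Sertraline 50mg tablets", "ATORVASTATIN 20mg tablets"),
  stringsAsFactors = FALSE
)

# Rows whose BNF code starts with the chosen chapter (missing codes count as no match).
in_chapter <- !is.na(sample_scripts$bnf_code) &
  startsWith(sample_scripts$bnf_code, BNF_PREFIX)

# Rows whose drug name matches the pattern, ignoring case.
name_match <- grepl(DRUG_PATTERN, sample_scripts$drug_name, ignore.case = TRUE)

# Inspect both flags side by side before running on the full table.
print(cbind(sample_scripts, in_chapter, name_match))
```

In this toy sample the ezetimibe row is in the chapter but not matched by name, and the atorvastatin row is matched by name despite a missing BNF code. Those are the kinds of rows to look for before a full run.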
After extraction: merge and save
Open scripts/manage_dataset.R to combine your extracted datasets into a single analysis-ready file:
- Filter to GP-linked participants if your analysis uses GP data (filter(p42040 > 0)).
- Derive the first diagnosis date per participant and condition using slice_min() (or use the get_df() wide-format output if you chose Approach B in extract_diagnoses.R).
- Merge diagnosis events, prescriptions, and demographics by eid.
- Rename and recode: the script includes example functions for renaming p-coded columns and converting categoricals to labelled factors.
- Save the merged dataset to parquet.
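The filter, first-date, and merge steps above can be sketched with dplyr on toy data. This is not the code in scripts/manage_dataset.R: the data frames and the condition and date column names are assumptions, and only eid, p42040, filter(), and slice_min() come from the steps above.

```r
library(dplyr)

# Toy demographics: p42040 is missing for a participant without GP linkage.
demographics <- tibble(
  eid    = c(1, 2, 3),
  p42040 = c(120, NA, 45)
)

# Toy long-format diagnosis events (column names assumed).
diagnoses <- tibble(
  eid       = c(1, 1, 3, 3),
  condition = c("t2dm", "t2dm", "t2dm", "ckd"),
  date      = as.Date(c("2010-05-01", "2008-03-15", "2012-01-10", "2015-07-20"))
)

# Keep GP-linked participants; filter() also drops rows where p42040 is NA.
gp_linked <- demographics %>% filter(p42040 > 0)

# Earliest event per participant and condition.
first_dx <- diagnoses %>%
  group_by(eid, condition) %>%
  slice_min(date, n = 1, with_ties = FALSE) %>%
  ungroup()

# Merge by eid, keeping every GP-linked participant.
merged <- gp_linked %>% left_join(first_dx, by = "eid")
print(merged)
```

A left join keeps GP-linked participants who have no diagnosis events, with missing condition and date. Use an inner join instead if your analysis only needs participants with at least one event.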
Where to go next
| I want to… | Go to |
|---|---|
| Understand the extraction scripts step by step | Extract Data |
| Build or find a diagnostic code list | Code Lists |
| Merge datasets and derive first diagnosis dates | Data Management |
| Look up what a function does | Functions |
| Check what columns are in the clinical data | Dataset Reference |
| Understand a silent error or unexpected result | Common Mistakes |