Start Here

A step-by-step path from zero to your first extraction

Published

June 2, 2026

This page tells you exactly what to do and in what order. Follow it top to bottom the first time you use this site.


Before you begin

You need two things before any of the scripts on this site will work:

  1. UK Biobank RAP access — applied for through the UK Biobank portal and the DNAnexus platform. If you do not have this yet, ukbAid’s initial setup guide walks you through the full application process.

  2. A GitHub project repository — set up via ukbAid. The scripts connect your RAP session to GitHub every time you open one. Without this, nothing will save correctly.

If you have completed the ukbAid initial setup and have a working RAP project, you are ready to continue.


Session 1: First-time setup

Open scripts/setup.R in your RAP RStudio session and run steps 1–4 in order.

| Step | What it does | Run when |
|------|--------------|----------|
| 1 | Installs pak, cli, parquetize, ukbAid | Every session |
| 2 | Connects your session to GitHub | Every session |
| 3 | Installs ukbrapR and loads all libraries | Every session |
| 4 | Extracts your chosen UKB variables to a parquet file | Once, then comment out |

Step 4 takes 5–60 minutes depending on how many variables you request and current platform load. Do not close the session while it runs.

Important

After step 4 finishes, comment it out. If you run it again in a future session, it will re-extract everything and overwrite your saved file.
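Commenting out step 4 relies on remembering to do it. As an optional extra safeguard, the extraction call can be wrapped in a check that skips it when the output file already exists. The sketch below is not part of scripts/setup.R: `extract_once()` and `fake_extract()` are hypothetical names, and the stand-in function only writes a placeholder file so the example can run outside the RAP.

```r
# Hypothetical helper: run an extraction only if its output file is missing.
extract_once <- function(path, extract_fn) {
  if (file.exists(path)) {
    message("Found ", path, "; skipping extraction.")
    return(invisible(FALSE))
  }
  extract_fn(path)
  invisible(TRUE)
}

# Stand-in for the real step 4 call. It writes a small text placeholder,
# not a real parquet file, purely to demonstrate the guard.
fake_extract <- function(path) writeLines("placeholder", path)

demo_path  <- tempfile(fileext = ".parquet")
first_run  <- extract_once(demo_path, fake_extract)  # file missing: extraction runs
second_run <- extract_once(demo_path, fake_extract)  # file present: extraction skipped
```

In a real session, the second argument would be a function wrapping whatever step 4 actually calls.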


Every session after

At the start of each new RAP session, run steps 1–3 and step 5:

  • Steps 1–3: reinstall packages and reconnect to GitHub (the RAP environment resets between sessions — nothing persists automatically)
  • Step 5: load your saved dataset from the parquet file (takes seconds)
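As a rough illustration of what step 5 amounts to, the sketch below reads a parquet file back into a data frame with the arrow package. It assumes arrow is available in your session. The file name and the toy data are hypothetical stand-ins so the example runs anywhere, and the real script may use a different reader.

```r
library(arrow)

# Hypothetical path; in practice, use the file that step 4 wrote.
dataset_path <- file.path(tempdir(), "ukb_dataset_demo.parquet")

# Toy stand-in for the extracted dataset (p31 is the UKB sex field).
toy <- data.frame(
  eid = c(1000001L, 1000002L),
  p31 = c("Female", "Male")
)
write_parquet(toy, dataset_path)

# Fail early with a clear message if the saved file is missing.
if (!file.exists(dataset_path)) {
  stop("Saved dataset not found. Run step 4 once to create it.")
}

ukb <- read_parquet(dataset_path)
str(ukb)
```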

Your first extraction

Once your dataset is loaded, decide what clinical data you need.

GP and hospital diagnoses

If your study requires diagnostic events from GP records or hospital episodes:

  1. Build a code list first. Your extraction will find nothing without one. Read the Code Lists guide and create a CSV with your condition codes (Read v2, CTV3, ICD-10, ICD-9).

  2. Run extract_diagnoses.R. Open scripts/extract_diagnoses.R, point it at your code list CSV, and run it step by step.
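A code list CSV might be built and sanity-checked along these lines. The column names (`condition`, `vocab_id`, `code`) and the vocabulary labels are assumptions based on the layout ukbrapR appears to expect, so check the Code Lists guide for the exact format. The rows are illustrative only and are not a validated code list.

```r
# Illustrative rows only: NOT a validated diabetes code list.
# Column names and vocab_id labels are assumptions; see the Code Lists guide.
code_list <- data.frame(
  condition = "diabetes",
  vocab_id  = c("ICD10", "ICD10", "ICD9", "Read2", "CTV3"),
  code      = c("E10", "E11", "250", "C10..", "C10.."),
  stringsAsFactors = FALSE
)

# Hypothetical file name, written to a temporary folder for the demo.
csv_path <- file.path(tempdir(), "diabetes_codes.csv")
write.csv(code_list, csv_path, row.names = FALSE)

# Read every column as text so numeric-looking codes are not converted to numbers.
codes <- read.csv(csv_path, colClasses = "character")

# Basic checks: no repeated vocabulary/code pairs, then a count per vocabulary.
stopifnot(!anyDuplicated(codes[, c("vocab_id", "code")]))
table(codes$vocab_id)
```

Reading the codes as character matters because a column that contains only numeric-looking codes would otherwise be converted to numbers, which can silently drop leading zeros.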

Medication prescriptions

If your study requires GP prescription data:

  1. Run extract_medications.R. Open scripts/extract_medications.R and update the two configuration values at the top:
    • BNF_PREFIX — the BNF chapter for your drug class
    • DRUG_PATTERN — the drug names to match
  2. Test your pattern on a sample (Step 3 in the script) before running it against all 57 million rows of GP prescription data.
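One way to test a pattern on a sample is sketched below with toy data. The column names `bnf_code` and `drug_name`, the configuration values, and the four rows are assumptions for illustration; the real BNF code format varies by data provider, and the script may combine the two settings differently.

```r
# Hypothetical configuration values for a metformin search.
BNF_PREFIX   <- "060102"
DRUG_PATTERN <- "metformin|glucophage"

# Toy sample standing in for a small slice of the prescription table.
scripts_sample <- data.frame(
  eid       = c(1L, 2L, 3L, 4L),
  bnf_code  = c("0601022B0", "0212000B0", NA, "0601021M0"),
  drug_name = c("Metformin 500mg tablets", "Atorvastatin 20mg tablets",
                "METFORMIN HCL 850MG TABS", "Gliclazide 80mg tablets"),
  stringsAsFactors = FALSE
)

# Flag rows by BNF prefix (missing codes count as no match) and by drug name.
bnf_hit  <- !is.na(scripts_sample$bnf_code) &
  startsWith(scripts_sample$bnf_code, BNF_PREFIX)
name_hit <- grepl(DRUG_PATTERN, scripts_sample$drug_name, ignore.case = TRUE)

# Rows the name pattern picks up.
matched <- scripts_sample[name_hit, ]

# Rows in the right BNF section that the name pattern does not pick up.
# Review these: they are either other drugs in the class or missed spellings.
review <- scripts_sample[bnf_hit & !name_hit, ]

# Cross-tabulate the two flags to see where they agree and disagree.
table(bnf_hit, name_hit)
```

Comparing the two flags on a small sample shows both false positives (other drugs in the same BNF section) and rows with a missing BNF code that only the name pattern can find.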

After extraction: merge and save

Open scripts/manage_dataset.R to combine your extracted datasets into a single analysis-ready file:

  1. Filter to GP-linked participants if your analysis uses GP data (filter(p42040 > 0)).
  2. Derive the first diagnosis date per participant and condition using slice_min() (or use the get_df() wide-format output if you chose Approach B in extract_diagnoses.R).
  3. Merge diagnosis events, prescriptions, and demographics by eid.
  4. Rename and recode — the script includes example functions for renaming p-coded columns and converting categoricals to labelled factors.
  5. Save the merged dataset to parquet.
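The steps above can be sketched with dplyr on toy data. All table names, column names other than `eid` and `p42040`, and the output file name are hypothetical. The script's own helper functions are not reproduced here, and the rename and recode step is omitted.

```r
library(dplyr)

# Toy inputs standing in for the extracted datasets.
demographics <- tibble(
  eid    = 1:4,
  p31    = c("Female", "Male", "Female", "Male"),
  p42040 = c(120, 0, 45, NA)   # number of GP clinical event records
)
diagnoses <- tibble(
  eid        = c(1L, 1L, 2L, 3L, 3L),
  condition  = c("diabetes", "diabetes", "diabetes", "diabetes", "asthma"),
  event_date = as.Date(c("2010-06-01", "2005-03-14", "2001-01-01",
                         "2012-01-20", "1999-09-09"))
)
prescriptions <- tibble(eid = c(1L, 1L, 3L))

# 1. Keep GP-linked participants (rows with a missing p42040 are dropped too).
gp_linked <- demographics %>% filter(p42040 > 0)

# 2. Earliest diagnosis date per participant and condition.
first_dx <- diagnoses %>%
  group_by(eid, condition) %>%
  slice_min(event_date, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  rename(first_date = event_date)

# 3. Merge by eid: one row per participant and condition.
n_scripts <- prescriptions %>% count(eid, name = "n_scripts")
analysis <- gp_linked %>%
  left_join(first_dx, by = "eid") %>%
  left_join(n_scripts, by = "eid")

# 5. Save (hypothetical file name; requires the arrow package).
# arrow::write_parquet(analysis, "analysis_dataset.parquet")
```

Setting `with_ties = FALSE` keeps exactly one row per participant and condition even when two events share the earliest date.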

Where to go next

| I want to… | Go to |
|------------|-------|
| Understand the extraction scripts step by step | Extract Data |
| Build or find a diagnostic code list | Code Lists |
| Merge datasets and derive first diagnosis dates | Data Management |
| Look up what a function does | Functions |
| Check what columns are in the clinical data | Dataset Reference |
| Understand a silent error or unexpected result | Common Mistakes |