Code Lists Guide

Building diagnostic code lists and medication patterns for UK Biobank extraction

Published

June 2, 2026

This page covers the two types of input files the extraction scripts need: a diagnostic code list CSV and medication regex patterns. It explains the required format, how to create them, where to find validated code lists, and how to pass them to the scripts.


Part 1: Diagnostic code lists

You can either create your own CSV code list or load one from an existing validated repository. The sections below first cover the required format and columns, then explain where to save your file, and finally show how to load directly from Prigge et al. without downloading anything manually.


What is a diagnostic code list?

UK Biobank GP records use two coding systems: Read version 2 and CTV3 (Clinical Terms Version 3). Hospital records use ICD-10 and, for records before ~1995, ICD-9. A code list is simply a table that tells the extraction script which codes correspond to the condition you want to find.

The script extract_diagnoses.R reads your code list and passes it to ukbrapR::get_diagnoses(), which searches the UKB database for matching records.


The code list schema

Your CSV must contain at least the three required columns. The remaining four columns are useful but will not cause errors if absent.

| Column | Required? | Example value | What it controls |
|---|---|---|---|
| code | Required | G20 | The diagnostic code to search for. Format depends on vocabulary; see the code format table below. ICD codes must be uppercase. Read v2 and CTV3 codes are case-sensitive, so keep their case exactly as published (e.g. XE1cO). No trailing whitespace. |
| condition | Required | parkinson | A short label grouping codes for the same condition. Used to tag extracted events. All codes for one condition should share the same label. |
| vocab_id | Required | Read2 | Which coding system the code belongs to. Must be one of: Read2, CTV3, ICD10, ICD9. |
| Description | Nice to have | Parkinson's disease | Human-readable description. Useful for review; not used by the script. |
| Sub-condition | Nice to have | parkinson_confirmed | A finer-grained label within a condition (e.g. to separate definite from possible cases). Not used by the script. |
| Diagnosis | Nice to have | Parkinson's disease | Longer clinical name. Not used by the script. |
| source | Nice to have | ClinicalCodes.org | Where the code came from. Useful for audit. |
Important: vocab_id must match exactly

The vocab_id values must be spelled exactly as shown. A common mistake is using “ICD-10” (with a hyphen) from a published code list. The script corrects “ICD-10” to “ICD10” and “ICD-9” to “ICD9” automatically, but “icd10” (lowercase) or “ICD 10” (with a space) will cause silent failures where no records are returned.

The script prints the vocabularies it found in your list before querying — check that message to confirm they were recognised.
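A quick pre-flight check along these lines can catch unrecognised vocab_id values before any query is sent. This is a minimal base-R sketch, not part of extract_diagnoses.R: the function name check_vocab_ids() is hypothetical, and the hyphen correction only mirrors the behaviour described above.

```r
# Hypothetical helper (not part of extract_diagnoses.R or ukbrapR).
# Returns the vocab_id values that would still be unrecognised after the
# documented hyphen correction ("ICD-10" -> "ICD10", "ICD-9" -> "ICD9").
check_vocab_ids <- function(vocab_id) {
  accepted  <- c("Read2", "CTV3", "ICD10", "ICD9")
  corrected <- vocab_id
  corrected[corrected %in% "ICD-10"] <- "ICD10"
  corrected[corrected %in% "ICD-9"]  <- "ICD9"
  unique(corrected[!corrected %in% accepted])
}

# "ICD-10" is corrected, but the lowercase and spaced variants are reported.
bad <- check_vocab_ids(c("Read2", "ICD-10", "icd10", "ICD 10", "CTV3"))
print(bad)
```

An empty result means every value is one of the four accepted spellings.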


Code format by vocabulary

Each coding system has a fixed format. Using the wrong format is a common source of silent failures where no records are returned.

Note: Background on Read v2 and CTV3

Read v2 was developed by Dr. James Read, a UK general practitioner, in 1982 and became the NHS standard coding system for GP records in England and Wales in the early 1990s. It remained the dominant system in practices using INPS Vision software until the mid-2010s.

CTV3 (Clinical Terms Version 3, also called Read Codes v3) was developed in the 1990s as a more flexible successor to Read v2, and later became one of the two terminologies merged to create SNOMED CT. It was adopted from around 2000 onwards, primarily in practices using TPP SystmOne software.

Both systems appear in UK Biobank GP records because different practices used different software. UKB GP data spans roughly 1985–2016 — a period when both coding schemes were in active use. A participant’s GP records may contain Read v2 codes, CTV3 codes, or a mix of both depending on which system their practice used and when. This is why code lists must include both vocabularies to capture all relevant events.

CTV3 was itself superseded by SNOMED CT in NHS primary care systems from 2018 onwards, but this falls after the main UKB data collection window.

| Vocabulary | Format | Length | Example | Notes |
|---|---|---|---|---|
| Read v2 | Letters and numbers, dots fill unused positions | Always 5 characters | F1200, F12.. | Dots are literal characters, not wildcards (see below) |
| CTV3 | Letters and numbers, dots fill unused positions | Always 5 characters | XE1cO, F1200 | Same rules as Read v2 |
| ICD-10 | Capital letter + 2 digits | 3 characters, no dots | G20, E11, I10 | ukbrapR prefix-matches: G20 catches G20.0, G20.1, etc. Do not include the dot. |
| ICD-9 | Digits only | 3 digits | 332, 250 | 3-digit prefix is sufficient |

Note: What do the dots in Read v2 mean?

Read v2 codes are always exactly 5 characters. When a code represents a broad category rather than a specific condition, the unused character positions are filled with dots.

For example, F12.. is a real, stored code meaning “Parkinson’s disease” at the parent level. F1200 is a specific subcode meaning “Parkinson’s disease NOS”. The dots in F12.. are literally part of the code — they are not wildcards. Some GP records are coded at the parent level (F12..), others at a specific subcode (F1200). Include both to capture all records.
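The format rules in the table above can be turned into a rough automated check. The sketch below is an illustration only: code_format_ok() is a hypothetical helper, and its patterns encode just the rules stated in this guide (5 characters with trailing dots for Read v2 and CTV3, a capital letter plus 2 digits for ICD-10, 3 digits for ICD-9), not the full specification of each coding system.

```r
# Hypothetical helper: TRUE where a code fits the format this guide gives
# for its vocabulary, FALSE otherwise (including unknown vocabularies).
code_format_ok <- function(code, vocab_id) {
  ok <- rep(FALSE, length(code))

  # Read v2 / CTV3: exactly 5 characters, letters and digits, dots only at the end.
  is_read <- vocab_id %in% c("Read2", "CTV3")
  ok[is_read] <- nchar(code[is_read]) == 5 &
    grepl("^[A-Za-z0-9]+[.]*$", code[is_read])

  # ICD-10: capital letter followed by 2 digits, no dot.
  is_icd10 <- vocab_id %in% "ICD10"
  ok[is_icd10] <- grepl("^[A-Z][0-9]{2}$", code[is_icd10])

  # ICD-9: exactly 3 digits.
  is_icd9 <- vocab_id %in% "ICD9"
  ok[is_icd9] <- grepl("^[0-9]{3}$", code[is_icd9])

  ok
}

codes <- data.frame(
  code     = c("F12..", "F1200", "F12", "XE1cO", "G20", "G20.0", "332"),
  vocab_id = c("Read2", "Read2", "Read2", "CTV3", "ICD10", "ICD10", "ICD9"),
  stringsAsFactors = FALSE
)
codes$format_ok <- code_format_ok(codes$code, codes$vocab_id)

# Rows needing attention: the unpadded Read code and the dotted ICD-10 code.
print(codes[!codes$format_ok, ])
```

A flagged row is a prompt to look at the code, not proof that it is wrong; published lists sometimes use conventions this sketch does not cover.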


Worked example: a minimal code list

Below is an example for Parkinson’s disease covering all four vocabulary types. Use this as a template.

condition,vocab_id,code,Description
parkinson,Read2,F12..,"Parkinson's disease (parent code — 5 chars, dots are literal)"
parkinson,Read2,F1200,"Parkinson's disease NOS (specific subcode)"
parkinson,CTV3,XE1cO,"Parkinson's disease (5 chars)"
parkinson,ICD10,G20,"Parkinson's disease (3 chars, no dot)"
parkinson,ICD10,G21,"Secondary parkinsonism (3 chars, no dot)"
parkinson,ICD9,332,"Parkinson's disease (3 digits)"

The Description column is for your own reference — it is not used by the extraction script.


Where to save your CSV

If you build your own CSV, save it in data-raw/ in your project. This is the standard R project folder for raw input files that you create or download manually. Load it in extract_diagnoses.R using here::here(), which resolves paths relative to your project root (the folder containing the .Rproj file):

your-project/
└── data-raw/
    ├── af_codes.csv
    ├── parkinson_codes.csv
    └── hypertension_codes.csv

# !! Check that this path matches where you actually saved the file !!
codes <- readr::read_csv(here::here("data-raw/af_codes.csv"), show_col_types = FALSE)

If you load from a URL (Prigge et al. method below), no file is saved at all. The CSV is read directly into R from GitHub and exists only in memory. You do not need a data-raw/ entry for URL-loaded code lists.


Validation checklist before running

Before passing your code list to the extraction script, check these five things:

Warning
  1. No trailing spaces in codes. A code "G20 " (with a space) will not match "G20". The script runs trimws() automatically, but check your source data.
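  2. Code case is correct. The script runs toupper() automatically, which suits ICD codes. Read v2 and CTV3 codes are case-sensitive (the CTV3 example XE1cO contains a lowercase c), so if your list contains such codes, confirm they still match after this step.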
  3. vocab_id spelling is exact. Accepted values: Read2, CTV3, ICD10, ICD9. No hyphens, no lowercase.
  4. condition label is consistent. All codes for the same condition must share the same condition value. A typo creates a second condition group in the output.
  5. ICD-10 codes without dots. Use E11, not E11.0. ukbrapR matches on the first three characters of the ICD code stored in HES.
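
Checks 4 and 5 are the ones most easily missed by eye. The base-R sketch below shows one way to surface them on a toy code list; the data and the deliberate mistakes in it are invented for illustration. Note that truncating a dotted ICD-10 code to its 3-character category broadens it to the whole category, which is what the prefix matching described above does anyway.

```r
# Toy code list with two deliberate problems: a typo in one condition label
# ("diabets") and a dotted ICD-10 code ("E11.0").
codes <- data.frame(
  condition = c("diabetes", "diabetes", "diabets", "diabetes"),
  vocab_id  = c("ICD10", "ICD10", "ICD10", "Read2"),
  code      = c("E11", "E11.0", "E10", "C10.."),
  stringsAsFactors = FALSE
)

# Check 4: count codes per condition label. An unexpected label with only
# a few codes is usually a typo.
label_counts <- table(codes$condition)
print(label_counts)

# Check 5: reduce dotted ICD-10 codes to their 3-character category,
# then drop the duplicate rows this creates. Read codes are left untouched.
is_icd10 <- codes$vocab_id == "ICD10"
codes$code[is_icd10] <- substr(codes$code[is_icd10], 1, 3)
codes <- codes[!duplicated(codes[, c("condition", "vocab_id", "code")]), ]
print(codes)
```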

Existing validated code lists

Building a code list from scratch is time-consuming and easy to do incompletely. For most common conditions, validated lists already exist that you can download and adapt.

Tip: Prigge et al. MLTC code lists (recommended starting point)

The most comprehensive freely available resource for Read v2 and CTV3 code lists is:

Prigge R, Fleetwood KJ, Jackson CA, Mercer SW, Kelly PA, Sudlow C, et al. Robustly measuring multimorbidity using disparate linked datasets. Commun Med (London). 2025 Jul 8;5(1):283.

Repository: https://github.com/rprigge-uoe/mltc-codelists

This repository provides directly downloadable Read v2, CTV3, and ICD-10 diagnostic code lists for 212 conditions, developed and validated for use with linked primary care data in UK cohort studies.

Loading directly from the repository

Rather than downloading files manually, you can read them into R directly from their GitHub raw URLs. On any file page in the repository, click Raw to get the direct download URL.

The files for each condition are split across three folders in the repo: Read v2/, ICD-10/, and CTV3/. Copy the raw URL for each.

Step 1: define the URLs

# Example: atrial fibrillation
AF_read_url <- "https://raw.githubusercontent.com/rprigge-uoe/mltc-codelists/master/Read%20v2/readv2_5_AF.csv"
AF_icd_url  <- "https://raw.githubusercontent.com/rprigge-uoe/mltc-codelists/refs/heads/master/ICD-10/ICD_AF.csv"
AF_ctv_url  <- "https://raw.githubusercontent.com/rprigge-uoe/mltc-codelists/refs/heads/master/CTV3/ctv3_AF.csv"

Step 2: define a helper function

This function reads all three files, standardises the column names to match the schema expected by extract_diagnoses.R, and combines them into one data frame.

create_codes_df3 <- function(read_url, icd_url, ctv_url, condition_label) {

  # Read one URL, extract the code column, add vocab_id and condition.
  # Prigge et al. CSVs use "Code" (capital C). Falls back to the first column
  # if neither "Code" nor "code" is present.
  read_one <- function(url, vocab) {
    # Read every column as text so all-digit codes are not converted to numbers.
    df <- readr::read_csv(
      url,
      col_types = readr::cols(.default = readr::col_character())
    )
    code_col <- if ("Code" %in% names(df)) {
      "Code"
    } else if ("code" %in% names(df)) {
      "code"
    } else {
      names(df)[1]
    }
    codes <- trimws(df[[code_col]])
    # Read v2 and CTV3 codes are case-sensitive (e.g. XE1cO), so only
    # ICD codes are uppercased here.
    if (vocab %in% c("ICD10", "ICD9")) codes <- toupper(codes)
    dplyr::tibble(
      code      = codes,
      vocab_id  = vocab,
      condition = condition_label
    )
  }

  dplyr::bind_rows(
    read_one(read_url, "Read2"),
    read_one(icd_url,  "ICD10"),
    read_one(ctv_url,  "CTV3")
  )
}

Step 3: build and inspect the combined code list

codes_af <- create_codes_df3(AF_read_url, AF_icd_url, AF_ctv_url, "af")
head(codes_af)

The result is a data frame in the standard three-column format, ready to pass directly to extract_diagnoses.R as the codes object.

Note

Always inspect the output with head() before running the extraction. Confirm that the code column contains the codes (not descriptions or row numbers) and that vocab_id values are Read2, ICD10, or CTV3.

Other resources:


Part 2: Medication extraction

Two approaches — and why you need both

The GP prescriptions file (gp_scripts) has two columns you can use to identify drug prescriptions:

| Column | What it contains | Limitation |
|---|---|---|
| bnf_code | BNF (British National Formulary) chapter code | Approximately 23% of rows have no BNF code recorded. BNF-only filtering silently misses these. |
| drug_name | Free-text drug name as entered by the GP system | Inconsistent: abbreviations, brand names, and formatting vary by practice and system. Needs careful regex design. |

Using both approaches together is recommended. The two queries are designed to be mutually exclusive, so there is no double-counting:

  • Approach 1 (BNF filter): collect all rows where bnf_code starts with your chapter prefix. This captures all well-coded prescriptions quickly.
  • Approach 2 (drug-name regex): apply a pattern to all rows that were not captured by Approach 1 (missing, empty, or different BNF chapter). This captures what BNF misses.

This principle is not specific to diabetes — it applies to any drug class you want to extract from UK Biobank prescriptions.
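
The two approaches can be sketched on a toy table as follows. This is a base-R illustration, not the code in extract_medications.R: the rows are invented, the undotted bnf_code format is an assumption to check against your own data, and BNF_PREFIX and DRUG_PATTERN are set here only for the example.

```r
# Toy stand-in for gp_scripts (invented rows; the real table has more columns).
scripts <- data.frame(
  drug_name = c("Metformin 500mg tablets", "METFORMIN HCL 850MG",
                "Glucophage 500mg tabs", "Simvastatin 40mg tablets",
                "Gliclazide 80mg tablets"),
  bnf_code  = c("0601022B0", NA, "", "0212000Y0", "0601021M0"),
  stringsAsFactors = FALSE
)

BNF_PREFIX   <- "0601"
DRUG_PATTERN <- "metformin|glucophage|gliclazide"

# Approach 1: BNF code starts with the chapter prefix.
# Missing codes are treated as "not captured".
by_bnf <- !is.na(scripts$bnf_code) & startsWith(scripts$bnf_code, BNF_PREFIX)

# Approach 2: drug-name regex, applied only to rows Approach 1 did not capture
# (missing, empty, or different BNF chapter).
by_name <- !by_bnf & grepl(DRUG_PATTERN, scripts$drug_name, ignore.case = TRUE)

# Record which approach found each row, then keep the union.
scripts$found_by <- ifelse(by_bnf, "bnf", ifelse(by_name, "drug_name", NA))
extracted <- scripts[by_bnf | by_name, ]
print(extracted)
```

Because Approach 2 is restricted to rows where by_bnf is FALSE, no row can be counted by both approaches, and the found_by column keeps the provenance available for later sensitivity checks.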


BNF chapter codes

The BNF organises drugs into chapters by therapeutic class. To find the relevant chapter for your drugs, search the NHS BNF online or check the bnf_code values in a sample of your drug’s prescriptions.

Common chapters relevant to chronic disease research:

| BNF prefix | Drug class |
|---|---|
| 0601 | Drugs used in diabetes |
| 0205 | Hypertension and heart failure (ACE inhibitors and ARBs sit under 020505, drugs affecting the renin-angiotensin system) |
| 0202 | Diuretics |
| 0212 | Lipid-regulating drugs (statins, fibrates) |
| 0403 | Antidepressants |
| 0402 | Antipsychotics |
| 0204 | Beta-blockers |

Set BNF_PREFIX in extract_medications.R to the prefix for your drug class.


Writing and testing a drug-name regex pattern

The drug_name column contains free-text entries that vary widely. A pattern for statins, for example, might look like:

DRUG_PATTERN <- paste(
  "simvastatin|zocor",
  "atorvastatin|lipitor",
  "rosuvastatin|crestor",
  "pravastatin|lipostat",
  "fluvastatin|lescol",
  sep = "|"
)

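Include both generic names and brand names. Matching is case-insensitive because the pattern is applied with stringr::regex(DRUG_PATTERN, ignore_case = TRUE), so lowercase alternatives are enough.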

Warning: Always test your pattern on a sample before running on 57 million rows

Step 3 in extract_medications.R shows how to pull a 1000-row sample and inspect which drug names match and which do not. This takes seconds and can save you from running a flawed pattern across the full file.

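# `ds` is assumed to be the prescriptions dataset already opened in
# extract_medications.R.
sample_names <- ds |>
  dplyr::select(drug_name) |>
  head(1000) |>
  dplyr::collect() |>
  dplyr::pull(drug_name)

# Drop missing names so they do not show up as NA in either group:
sample_names <- sample_names[!is.na(sample_names)]

# Check what matches and what does not:
is_match <- stringr::str_detect(
  sample_names,
  stringr::regex(DRUG_PATTERN, ignore_case = TRUE)
)
matched   <- sample_names[is_match]
unmatched <- sample_names[!is_match]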

Look at both the matched names (check for false positives) and the unmatched names (check for drugs you missed).


Classifying prescriptions after extraction

After extracting raw prescriptions, you will typically want to group them by drug class. The example below uses dplyr::case_when() — the first matching condition wins, so put the most specific patterns first.

prescriptions <- prescriptions_raw |>
  dplyr::mutate(
    date = as.Date(issue_date, format = "%d/%m/%Y"),  # Parse UKB date format
    drug_class = dplyr::case_when(
      stringr::str_detect(drug_name,
        stringr::regex("simvastatin|zocor", ignore_case = TRUE)) ~ "simvastatin",
      stringr::str_detect(drug_name,
        stringr::regex("atorvastatin|lipitor", ignore_case = TRUE)) ~ "atorvastatin",
      stringr::str_detect(drug_name,
        stringr::regex("rosuvastatin|crestor", ignore_case = TRUE)) ~ "rosuvastatin",
      TRUE ~ "other_statin"
    )
  )

Drug indication changes over time

Before treating all prescriptions of a drug as evidence of your condition of interest, check whether the drug’s licensed indications changed during your study period.

When a drug receives a new approval for a different condition, prescriptions after that date can no longer be assumed to reflect the original condition. This is a general problem — not specific to any one drug class.

Warning: Examples of indication expansions relevant to UK Biobank studies

| Drug class | Original indication | New indication | Approximate date |
|---|---|---|---|
| SGLT2 inhibitors (e.g. empagliflozin, dapagliflozin) | Type 2 diabetes | Heart failure, chronic kidney disease | 2020 |
| GLP-1 agonists (e.g. semaglutide, liraglutide) | Type 2 diabetes | Obesity | 2020–2021 |
| Methotrexate | Cancer | Rheumatoid arthritis, psoriasis | Expanded progressively |
| Beta-blockers (e.g. bisoprolol) | Hypertension | Heart failure | Late 1990s |
| Aspirin | Pain / fever | Cardiovascular prevention | 1980s onwards |

If your study period spans a change in approved indications, consider applying a date cutoff: include only prescriptions issued before the new indication became common, or stratify your analysis by period.
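
Both options can be sketched in a few lines of base R. The rows below are invented, and the cutoff date is an assumption chosen only for illustration; pick the real date from the licensing history of the drug you are studying.

```r
# Toy prescriptions with an already-parsed date column (invented rows).
prescriptions <- data.frame(
  drug_name = c("Dapagliflozin 10mg tablets", "Dapagliflozin 10mg tablets",
                "Empagliflozin 10mg tablets"),
  date      = as.Date(c("2015-03-01", "2021-06-15", "2019-11-30")),
  stringsAsFactors = FALSE
)

# Assumed cutoff, for illustration only.
INDICATION_CUTOFF <- as.Date("2020-01-01")

# Option 1: keep only prescriptions issued before the cutoff.
before_cutoff <- prescriptions[prescriptions$date < INDICATION_CUTOFF, ]

# Option 2: keep everything and label the period, so the analysis can be
# stratified instead of truncated.
prescriptions$period <- ifelse(
  prescriptions$date < INDICATION_CUTOFF, "pre_expansion", "post_expansion"
)
print(prescriptions)
```

Option 1 is simpler but discards data; option 2 keeps every row and lets you test whether results differ between periods.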


Part 3: Going further

Non-diagnostic code lists

For some research questions you may want to exclude events that represent administrative activity (e.g. annual review codes, referral codes) rather than a genuine diagnosis. These can be included in your code list with a distinct condition label (e.g. "condition_admin") and filtered out after extraction.

Delivery and procedure codes

If your study excludes events during pregnancy or within a window around a surgical procedure, you can build a separate code list for those events following the same CSV schema, extract them with extract_diagnoses.R, and use the resulting dates to define exclusion windows.
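
One way to apply such a window is sketched below in base R. Everything here is illustrative: the column names (eid, date, delivery_date), the toy rows, and the window lengths are assumptions, so adapt them to the actual output of your extraction.

```r
# Toy diagnosis events and delivery dates (invented; column names assumed).
events <- data.frame(
  eid  = c(1, 1, 2, 3),
  date = as.Date(c("2010-05-01", "2012-01-10", "2008-03-03", "2011-07-07"))
)
deliveries <- data.frame(
  eid           = c(1, 2),
  delivery_date = as.Date(c("2010-06-15", "2015-01-01"))
)

# Assumed exclusion window: 280 days before to 90 days after a delivery.
WINDOW_BEFORE <- 280
WINDOW_AFTER  <- 90

# Pair every event with every delivery of the same participant.
# all.x = TRUE keeps participants who have no delivery recorded.
paired <- merge(events, deliveries, by = "eid", all.x = TRUE)
days   <- as.numeric(paired$date - paired$delivery_date)
paired$in_window <- !is.na(days) & days >= -WINDOW_BEFORE & days <= WINDOW_AFTER

# An event is dropped if it falls inside ANY window for that participant.
excluded    <- unique(paired[paired$in_window, c("eid", "date")])
drop        <- paste(events$eid, events$date) %in% paste(excluded$eid, excluded$date)
events_kept <- events[!drop, ]
print(events_kept)
```

Here only the first event of participant 1 (45 days before a delivery) is removed; events far from any delivery, and events for participants with no delivery, are kept.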

Classifying diabetes in UK Biobank

If your research question involves identifying and classifying diabetes cases (type 1, type 2, or other) from UK Biobank linked records, a dedicated R package has been developed at Steno Diabetes Center Aarhus:

UKDC (UK Diabetes Classifier). Currently available at: https://github.com/steno-aarhus/UKDC (formal publication forthcoming)

UKDC handles GP diagnosis extraction, HES extraction, prescription classification, pregnancy exclusion, and type classification in a single workflow. The extraction scripts in this guide are general-purpose equivalents of its core extraction functions, designed to work for any condition.