Extracting data step by step

Open, filter, select, collect

Published

June 6, 2026

In Phase 4 you saw that registers are loaded lazily: you open a connection, not data. This page shows the working method that follows from that β€” the same pattern you will use in every single extraction for the rest of the guide.

Important

This is the most important pattern in the entire guide. Every extraction from DST registers follows this structure. Learn it here once β€” you will recognise it everywhere from Phase 6 onwards.

Note

Why can’t we just load a register like a CSV file? Because the registers are too large. A CSV (or readRDS) reads the entire file into memory at once β€” that works fine with your own small datasets, but BEF and LMDB have millions to hundreds of millions of rows and would crash your session. The lazy connection lets you filter before you fetch, so only the rows you actually need ever reach memory.


The pattern: open connection β†’ filter β†’ select β†’ fetch

Think of a supermarket. You can either buy the whole shop and sort at home β€” or send a shopping list and only have what is on the list delivered. Lazy loading is the latter: you describe what you want, and only fetch the items at the end with collect().

library(arrow)   # open_dataset
library(dplyr)   # filter, select, collect

# 1. Open lazy connection β€” no data is moved
bef <- open_dataset("E:/workdata/[projectnumber]/cleaned-data/parquet-registers/bef/") %>%
  rename_with(tolower)

# 2. Filter rows and select columns β€” still no data in R
bef_filtered <- bef %>%
  filter(aar == 2020) %>%
  select(pnr, koen, foed_dag, familie_id)

# 3. collect() β€” HERE the selected data is moved into R's memory
bef_data <- bef_filtered %>% collect()

Steps 1–2 are pure instructions β€” they cost almost no memory. collect() in step 3 is the only line that actually moves data.

Tip

Working on DARTER? Replace open_dataset("E:/workdata/...") with load_database("bef") from dstDataPrep. See DARTER β€” overview and pipeline.

Note

Why rename_with(tolower) as the first step? Raw column names in DST registers have inconsistent capitalisation β€” PNR, pnr, Pnr. By standardising to lower case immediately, the rest of your code can use pnr everywhere without thinking about it. Add it immediately after open_dataset() as the first step in your pipe β€” before filtering or selecting columns.


The most important rule: filter and select columns BEFORE fetching

# CORRECT β€” filter and select columns BEFORE collect
result <- large_register %>%
  filter(pnr %in% !!cohort_pnrs) %>%   # keep only the relevant rows
  select(pnr, date, value) %>%          # keep only the columns you use
  collect()

# WRONG β€” fetches millions of rows and all columns into RAM
result <- large_register %>%
  collect() %>%
  filter(pnr %in% cohort_pnrs)

BEF has millions of rows; LMDB has hundreds of millions. Calling collect() before filter() can crash your session β€” and because everyone shares the server’s RAM, it makes the server slow for everyone else at the same time. The more you have filtered and selected columns in advance, the faster and more stably it runs.

Tip

Rule of thumb: filter() limits rows. select() limits columns. Use both before collect().


In depth: what !! means, and why it is necessary

Did you notice !! in the example above β€” filter(pnr %in% !!cohort_pnrs)?

What is cohort_pnrs? It is a vector (a list) of the personal identifiers that should be included in your study β€” e.g. all those who received a particular surgery or diagnosis. You built it from a register by filtering on a procedure or ICD code. It looks like this in R:

cohort_pnrs <- unique(cohort$pnr)   # extracted from your cohort dataset
Warning

You must never write actual personal identifiers (pnr) in your code β€” only variable names and vectors. Writing actual CPR numbers directly in a script is a violation of DST’s rules for handling microdata.

Back to !!: The lazy query is not executed by R, but by the database engine (Arrow/DuckDB), which runs separately from R and does not know your local variables.

filter(pnr %in% !!cohort_pnrs)

Without !!, the engine would look for a column called cohort_pnrs inside the register. It does not exist β€” cohort_pnrs is an R vector you created locally. !! means: β€œthis is an R object, not a column name β€” look up its values in R and use them in the query.”

You will see !! in front of almost any local variable used inside a filter() on a lazy object.

Tip

Rule of thumb: if you are referring to a column in the register, write the name directly. If you are referring to an R variable you created yourself, put !! in front of it.


Check: do you have data or just a connection?

Many errors arise because you think you have fetched data β€” but actually still have an unexecuted query. Ask the object what it is:

class(my_object)
What class() returns What it means What to do
"tbl_df" "data.frame" Data is in R Nothing β€” you are ready
"tbl_duckdb_connection" Unsent DuckDB query Add %>% collect()
"Table" "ArrowObject" Unsent Arrow query Add %>% collect()

In depth: RAM is shared β€” clean up after large extractions

On DST all users share the server’s memory. When the memory bar in RStudio turns red, it slows everyone’s sessions.

rm(large_register)   # removes the object from your R session (disappears from the list in the top right of RStudio)
gc()                 # asks the operating system to reclaim memory R no longer uses

rm() removes the object from your active session β€” it can no longer be used in code and disappears from the Environment panel (top right in RStudio).

gc() (garbage collection) does not affect your session β€” all your other objects are still there. R normally holds on to freed memory a little longer, because it is faster to reuse than to request new memory from the operating system. gc() forces R to return that memory to the operating system immediately, so it becomes available to other users on the shared DST server. You lose nothing β€” it helps your colleagues.

Always clean up after large extractions before continuing.


What happens visually

flowchart TD
    P[("Parquet register<br>on the DST server")]:::store
    L["open_dataset()"]:::lazy
    F["filter() β€” choose rows"]:::lazy
    S["select() β€” choose columns"]:::lazy
    J["inner_join() β€” link (optional)"]:::lazy
    C(["collect()"]):::collect
    R[("data.frame in R")]:::store
    V["saveRDS() β†’ datasets/"]:::save

    P -->|"lazy β€” nothing moves yet"| L
    L --> F --> S --> J
    J -->|"send the query"| C
    C -->|"only chosen rows + columns"| R
    R --> V

    classDef store fill:#eef0f2,stroke:#8a94a6,color:#1f2733;
    classDef lazy fill:#eaf2fb,stroke:#4a78b5,color:#173a5e;
    classDef collect fill:#fff3e0,stroke:#e69500,color:#7a4f00;
    classDef save fill:#e9f7ef,stroke:#3fae6b,color:#14532d;

The blue steps are lazy β€” they only build up a query. Data moves into R only at collect() (orange). That is why filter() and select() must come before collect().


Want to know more? What are Parquet, Arrow and DuckDB really?

You do not need to know these names to use the pattern above β€” but here is what happens under the hood.

Parquet is a file format that stores data column by column rather than row by row. If you only want pnr and aar, the computer reads only those two columns β€” not the rest. That is why it is fast.

On the DARTER project (708421) most SAS datasets are already converted to parquet and can be opened directly with load_database() β€” see DARTER β€” Register paths and datastores. On other projects you can convert SAS files to parquet yourself β€” see Phase 4 β€” Convert SAS to parquet.

Arrow is the R package that reads parquet files and translates your dplyr commands into efficient parquet queries. open_dataset() uses Arrow.

DuckDB is a database that runs directly inside your R session. It is very fast and supports almost all dplyr syntax. If you use load_database() (DARTER/dstDataPrep), Arrow and DuckDB are used under the hood. You just write normal dplyr code β€” both ways.

More technical: Arrow vs. DuckDB β€” when is which used?

Both packages use lazy evaluation and collect().

Arrow is fast at reading parquet, but does not support all dplyr functions.

DuckDB (via duckplyr) supports almost all dplyr functions and falls back automatically to R’s engine for the few it cannot handle.

If you encounter an error about β€œunsupported function”, you are probably using a function DuckDB does not support. Solution: collect() data into R first, and perform the operation after that.


Next steps

You now know the pattern. In the next phase you use it on real (synthetic) register data from start to finish.

β†’ Phase 6 β€” First extraction

Back to top