File types and how to open them

What you encounter in your workspace, and what opens it

Published

June 6, 2026

When you open File Explorer on the DST server, you encounter files with different extensions. This page is your reference tool: what file type is it, which function opens it, and does the data enter memory immediately or not?

The last point (lazy vs. full loading) is the most important distinction on this page. The mechanism behind it is explained in Phase 5: Extracting data step by step; here you just need to know which file types behave which way.


What a project folder looks like

A typical project on the DST server is organised roughly like this. You work mostly in R/, datasets/ and output/; the registers themselves live under cleaned-data/:

E:/workdata/[projectnumber]/
├── rawdata/                     # raw register data from DST (often SAS)
│   └── bef/  lpr_adm/  lpr_diag/  …
├── cleaned-data/
│   └── parquet-registers/       # registers converted to parquet
│       └── bef/  lpr_adm/  lpr_a_kontakt/  …
├── R/                           # your analysis scripts (01_, 02_, …)
├── datasets/                    # your own intermediate results (.rds)
└── output/                      # tables, figures and logs

The paths in the code examples (E:/workdata/[projectnumber]/cleaned-data/parquet-registers/…) follow this structure. Your actual folder may look different; check with File Explorer. How to name the scripts in R/ is covered in Phase 12: Good code practice.


Overview: file type, package, function

| File type | Package | Function | Loading |
|---|---|---|---|
| .parquet / parquet folder | arrow | open_dataset("path/to/folder/") | Lazy: nothing in RAM until collect() |
| .rds | base R | readRDS("path/to/file.rds") | Full: entire file into RAM |
| .sas7bdat | haven | read_sas("path/to/file.sas7bdat") | Full: entire file into RAM |
| .csv | readr | read_csv("path/to/file.csv") | Full: entire file into RAM |
| .xlsx | readxl | read_xlsx("path/to/file.xlsx") | Full: entire file into RAM |

What do you write in the parentheses? You write the path to the parquet folder containing the register, e.g. open_dataset("path/to/bef/"). The exact path depends on your project and server. Column names for each register are in 14d: Register reference.

Rarer formats (Stata, SPSS, Feather, RData)

You rarely encounter these in a typical DST cohort study, but here they are for completeness:

| File type | Package | Function |
|---|---|---|
| .dta (Stata) | haven | read_dta() |
| .sav (SPSS) | haven | read_sav() |
| .feather | arrow | read_feather() |
| .rdata / .rda | base R | load() |

.rdata/.rda differs from .rds in that it can save multiple objects at once, but .rds is preferred because you control what the object is called when you read it back in.
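
The difference can be seen in a small sketch. The objects `cohort` and `settings` and the file names are made up for illustration, and the files are written to a temporary folder rather than a project folder:

```r
# Toy objects standing in for real results (hypothetical names)
cohort   <- data.frame(id = 1:3, sex = c("F", "M", "F"))
settings <- list(start_year = 2010)

rdata_file <- file.path(tempdir(), "results.RData")
rds_file   <- file.path(tempdir(), "cohort.rds")

# .RData: several objects in one file. load() restores them under the
# names they were saved with and returns those names.
save(cohort, settings, file = rdata_file)
restored_names <- load(rdata_file, envir = new.env())

# .rds: exactly one object per file. You choose the name when reading it back.
saveRDS(cohort, rds_file)
my_cohort <- readRDS(rds_file)
```

With load() an existing object of the same name in your session is silently overwritten, which is one practical reason to prefer readRDS() with an explicit assignment.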


When do you use what?

The three you actually work with day to day:

| File type | Used for |
|---|---|
| Parquet | The large registers (BEF, LPR, LMDB …). You load them lazily and filter before fetching data. |
| RDS | Your own intermediate results: datasets you save from one script and reload in the next. |
| SAS | Format tables and raw register data not yet converted to parquet. |

RDS: the format you write yourself most

RDS is R's own format. It is fast, compact and preserves all R properties (data types, factor levels, column names) perfectly.

If you are working with a pipeline of scripts (e.g. one script that builds your cohort and another that extracts diagnoses), you save the result from script 1 as .rds, so script 2 can read it in and continue from there, without having to re-run everything.

saveRDS(cohort, "datasets/full_cohort.rds")   # save an R object to disk
cohort <- readRDS("datasets/full_cohort.rds")   # read it back in the next script
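
To illustrate what "preserves all R properties" means in practice, here is a minimal sketch with a toy data frame. The column names are hypothetical, the files go to a temporary folder, and the CSV round trip is included only as a contrast:

```r
# Toy cohort with a factor (custom level order) and a Date column
cohort <- data.frame(
  id        = 1:3,
  sex       = factor(c("F", "M", "F"), levels = c("M", "F")),
  birthdate = as.Date(c("1980-01-01", "1975-06-15", "1990-12-31"))
)

rds_file <- file.path(tempdir(), "full_cohort.rds")
csv_file <- file.path(tempdir(), "full_cohort.csv")

# RDS round trip: the object comes back unchanged
saveRDS(cohort, rds_file)
cohort_back <- readRDS(rds_file)
same_after_rds <- identical(cohort, cohort_back)

# CSV round trip: only text is stored, so the factor level order
# and the Date class are not restored automatically
write.csv(cohort, csv_file, row.names = FALSE)
csv_back <- read.csv(csv_file)
```

After the CSV round trip you would have to rebuild the factor and convert the dates yourself, which is why RDS is the more convenient choice between scripts.
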

Code examples: SAS and CSV

SAS, for format tables and unconverted register data:

library(haven)
df <- read_sas("E:/rawdata/[projectnumber]/lpr_adm2018.sas7bdat")

Loading large SAS files is very slow, which is exactly why data on DST is converted to parquet. Only use SAS for format tables and files without a parquet version.

CSV, for exporting finished tables (e.g. at repatriation):

library(readr)
write_csv(my_table, "output/table1.csv")

Never save raw register data as CSV, only aggregated results. See Phase 16: Export and repatriation for the rules.
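
As a rough sketch of what "aggregated results only" can look like, the example below summarises a toy individual-level table before writing it to CSV. The data, the column names and the output location (a temporary folder) are all invented for illustration, and the sketch does not implement the actual export rules from Phase 16:

```r
library(dplyr)
library(readr)

# Toy individual-level data (hypothetical columns, not real register data)
cohort <- tibble(
  pnr = sprintf("%04d", 1:6),
  sex = c("F", "M", "F", "F", "M", "M"),
  age = c(34, 51, 47, 29, 63, 58)
)

# Aggregate first: one row per group, no person identifiers left
table1 <- cohort %>%
  group_by(sex) %>%
  summarise(n = n(), mean_age = mean(age), .groups = "drop")

# Only the aggregated table is written to CSV
out_file <- file.path(tempdir(), "table1.csv")
write_csv(table1, out_file)
```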


Lazy vs. full loading: what is the difference in practice?

Note the last column in the overview table. It divides the file types into two groups:

Full loading (readRDS, read_sas, read_csv, read_xlsx): The entire file is read into RAM immediately. For your own intermediate results this is fine, because they are relatively small. On a whole register it can exhaust the available memory and crash your session.

Lazy loading (parquet via open_dataset()): The file is opened as a connection, not as data. You can filter rows and select columns, and only when you call collect() are the selected rows and columns moved into RAM. This is how you work with registers of millions of rows without running out of memory.
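
The lazy pattern can be sketched with a toy register. The folder, the data and the column names (`pnr`, `year`, `koen`) are made up for this example and written to a temporary folder; on the server you would point open_dataset() at your own parquet folder instead:

```r
library(arrow)
library(dplyr)

# Write a small toy "register" to a temporary parquet folder
reg_path <- file.path(tempdir(), "toy_register")
dir.create(reg_path, showWarnings = FALSE)
toy <- data.frame(
  pnr  = sprintf("%04d", 1:1000),
  year = rep(2015:2019, each = 200),
  koen = rep(c(1L, 2L), times = 500)
)
write_parquet(toy, file.path(reg_path, "toy_register.parquet"))

# Open lazily: this is a Dataset object, no rows are in RAM yet
reg <- open_dataset(reg_path)

# Filter and select are recorded as a query, still nothing is read
query <- reg %>%
  filter(year == 2018) %>%
  select(pnr, koen)

# Only collect() reads the matching rows and columns into RAM
result <- collect(query)
```

The point of the pattern is the order: narrow the data down first, call collect() last.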

Important

SAS files are also large, and the server's memory is shared with everyone. On DST all users share the server's RAM, so read_sas() on a large SAS file burdens the server for everyone at the same time. If you use a SAS file repeatedly, it is worth converting it to parquet once; this saves significant RAM and makes loading much faster. See Convert SAS files to Parquet below for the procedure.

Why lazy loading works, and how collect() functions, is the topic of the next phase.

→ Phase 5: Extracting data step by step


Convert SAS files to Parquet

Most projects on DST receive registers as SAS files (.sas7bdat). Before you can use them with open_dataset() and lazy evaluation, they must be converted to Parquet once. After that you use them exactly like any other register.

Important

Relevant for most projects outside DARTER. If you are working on a project where the registers have not already been converted to parquet, this step is necessary before you can run extractions. It is done once per register; after that the normal extraction pattern applies.

Why Parquet is worth it:

| | SAS (.sas7bdat) | Parquet |
|---|---|---|
| Read time (1M rows) | ~30–120 sec | ~1–3 sec |
| Disk space | Large | 50–75% smaller |
| Requires package | haven | arrow |
| Lazy evaluation | No: all into RAM | Yes: filter BEFORE collect |

The conversion is done once:

library(haven)    # read SAS file
library(arrow)    # write Parquet
library(dplyr)    # %>%, rename_with() and glimpse()

sas_file  <- "E:/workdata/[projectnumber]/raw/my_register.sas7bdat"
parq_path <- "E:/workdata/[projectnumber]/cleaned-data/parquet-external/my_register/"

# 1. Read the SAS file into R
df <- read_sas(sas_file)            # reads the entire file into RAM; we call it "df", but you can use any name

# 2. Standardise column names
df <- df %>% rename_with(tolower)

# 3. Write as Parquet
dir.create(parq_path, recursive = TRUE, showWarnings = FALSE)
write_parquet(df, file.path(parq_path, "my_register.parquet"))

# 4. Verify
open_dataset(parq_path) %>% glimpse()

Tip

Large SAS files: heaven::importSAS() (pre-installed on DST) is more efficient than read_sas() for large files and can limit what is read, e.g. keep = c("pnr", "atc") (select columns), where = "..." (filter rows) or obs = 1000 (first rows only, good for testing).

After conversion you can use the register lazily with open_dataset() like any other:

my_reg <- open_dataset(parq_path) %>% rename_with(tolower)
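
If you want a slightly stricter check than glimpse() after a conversion, one option is to compare the parquet folder with the data frame it was written from. The sketch below uses a small stand-in data frame instead of a real read_sas() result, hypothetical column names and a temporary folder:

```r
library(arrow)
library(dplyr)

# Stand-in for the data frame returned by read_sas() (hypothetical columns)
df <- tibble(
  PNR = c("0001", "0002", "0003"),
  ATC = c("A10", "C09", "N02")
) %>%
  rename_with(tolower)

# Write it the same way as in the conversion steps
parq_path <- file.path(tempdir(), "my_register")
dir.create(parq_path, recursive = TRUE, showWarnings = FALSE)
write_parquet(df, file.path(parq_path, "my_register.parquet"))

# Re-open lazily and compare with the in-memory data frame
ds <- open_dataset(parq_path)

n_parquet <- ds %>% summarise(n = n()) %>% collect() %>% pull(n)
same_rows <- n_parquet == nrow(df)                   # same number of rows?
same_cols <- identical(ds$schema$names, names(df))   # same column names and order?
```

If both checks are TRUE, the parquet version at least has the same shape as the source; it does not prove that every value survived unchanged.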

Alternative: fastreg (dp-next, MIT licence) provides a more structured pipeline for converting entire DST workspaces from SAS to Parquet with convert() and a targets-based template. Relevant if you want to build a complete parquet workspace from scratch.
