File types and how to open them
What you encounter in your workspace, and what opens it
When you open File Explorer on the DST server, you encounter files with different extensions. This page is your reference tool: what file type is it, which function opens it, and does the data enter memory immediately or not?
The last point, lazy vs. full loading, is the most important distinction on this page. The mechanism behind it is explained in Phase 5 - Extracting data step by step; here you just need to know which file types behave which way.
What a project folder looks like
A typical project on the DST server is organised roughly like this. You work mostly in R/, datasets/ and output/; the registers themselves live under cleaned-data/:
E:/workdata/[projectnumber]/
├── rawdata/                  # raw register data from DST (often SAS)
│   └── bef/ lpr_adm/ lpr_diag/ …
├── cleaned-data/
│   └── parquet-registers/    # registers converted to parquet
│       └── bef/ lpr_adm/ lpr_a_kontakt/ …
├── R/                        # your analysis scripts (01_, 02_, …)
├── datasets/                 # your own intermediate results (.rds)
└── output/                   # tables, figures and logs
The paths in the code examples (E:/workdata/[projectnumber]/cleaned-data/parquet-registers/…) follow this structure. Your actual folder may look different, so check with File Explorer. How to name the scripts in R/ is covered in Phase 12 - Good code practice.
Overview: file type, package, function
| File type | Package | Function | Loading |
|---|---|---|---|
| .parquet / parquet folder | arrow | open_dataset("path/to/folder/") | Lazy: nothing in RAM until collect() |
| .rds | base R | readRDS("path/to/file.rds") | Full: entire file into RAM |
| .sas7bdat | haven | read_sas("path/to/file.sas7bdat") | Full: entire file into RAM |
| .csv | readr | read_csv("path/to/file.csv") | Full: entire file into RAM |
| .xlsx | readxl | read_xlsx("path/to/file.xlsx") | Full: entire file into RAM |
What do you write in the parentheses? You write the path to the parquet folder containing the register, e.g. open_dataset("path/to/bef/"). The exact path depends on your project and server. Column names for each register are in 14d - Register reference.
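As a minimal sketch of how such a path can be put together, the snippet below builds the folder path with base R's file.path() and checks that it exists before anything is opened. The variable names project_root and bef_path are made up for illustration, and [projectnumber] is left as a placeholder that you would replace with your own project number.

```r
# Placeholder root: replace [projectnumber] with your own project number
project_root <- "E:/workdata/[projectnumber]"

# file.path() joins the parts with "/" on every platform
bef_path <- file.path(project_root, "cleaned-data", "parquet-registers", "bef")
print(bef_path)

# Check that the folder is really there before passing it to open_dataset()
if (dir.exists(bef_path)) {
  print(list.files(bef_path))   # the parquet files inside the register folder
} else {
  message("Folder not found: ", bef_path)
}
```

If the folder is found, bef_path is what goes inside the parentheses of open_dataset().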
Rarer formats (Stata, SPSS, Feather, RData)
You rarely encounter these in a typical DST cohort study, but here they are for completeness:
| File type | Package | Function |
|---|---|---|
| .dta (Stata) | haven | read_dta() |
| .sav (SPSS) | haven | read_sav() |
| .feather | arrow | read_feather() |
| .rdata / .rda | base R | load() |
.rdata/.rda differs from .rds in that it can save multiple objects at once, but .rds is preferred because you control what the object is called when you read it back in.
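The difference in naming can be seen in a small base R sketch. The objects cohort and controls and the file names are invented for illustration, and everything is written to a temporary folder rather than to a project folder.

```r
# Two toy objects (hypothetical names and values, for illustration only)
cohort   <- data.frame(id = 1:3, age = c(34, 51, 67))
controls <- data.frame(id = 4:6, age = c(29, 45, 72))

rda_file <- file.path(tempdir(), "both.rda")
rds_file <- file.path(tempdir(), "cohort.rds")

# .rda: several objects in one file, stored together with their names
save(cohort, controls, file = rda_file)

# .rds: exactly one object, stored without a name
saveRDS(cohort, rds_file)

rm(cohort, controls)

# load() recreates the objects under their saved names and overwrites
# anything in the workspace that already has one of those names
restored <- load(rda_file)   # returns the names of the objects it created
print(restored)

# readRDS() returns the object; you choose the name when you assign it
my_cohort <- readRDS(rds_file)
```

With load() the names are decided by whoever saved the file; with readRDS() they are decided by the script that reads it.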
When do you use what?
The three you actually work with day to day:
| File type | Used for |
|---|---|
| Parquet | The large registers (BEF, LPR, LMDB …). You load them lazily and filter before fetching data. |
| RDS | Your own intermediate results: datasets you save from one script and reload in the next. |
| SAS | Format tables and raw register data not yet converted to parquet. |
RDS: the format you write yourself most
RDS is R's own format. It is fast, compact and preserves all R properties (data types, factor levels, column names) perfectly.
If you are working with a pipeline of scripts, e.g. one script that builds your cohort and another that extracts diagnoses, you save the result from script 1 as .rds, so script 2 can read it in and continue from there without having to re-run everything.
saveRDS(cohort, "datasets/full_cohort.rds") # save an R object to disk
cohort <- readRDS("datasets/full_cohort.rds") # read it back in the next script
Code examples: SAS and CSV
SAS, for format tables and unconverted register data:
library(haven)
df <- read_sas("E:/rawdata/[projectnumber]/lpr_adm2018.sas7bdat")
Loading large SAS files is very slow, which is exactly why data on DST is converted to parquet. Only use SAS for format tables and files without a parquet version.
CSV, for exporting finished tables (e.g. at repatriation):
library(readr)
write_csv(my_table, "output/table1.csv")
Never save raw register data as CSV, only aggregated results. See Phase 16 - Export and repatriation for the rules.
Lazy vs. full loading: what is the difference in practice?
Note the last column in the overview table. It divides the file types into two groups:
Full loading (readRDS, read_sas, read_csv, read_xlsx): the entire file is read into RAM immediately. For your own intermediate results this is fine, since they are relatively small. But it would crash your session if you tried it on a whole register.
Lazy loading (parquet via open_dataset()): the file is opened as a connection, not as data. You can filter and select columns, and only when you call collect() are the selected rows and columns moved into RAM. This is how you work with registers of millions of rows without running out of memory.
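The two behaviours can be sketched on a toy dataset. This is an illustration only: the columns pnr, birth_year and sex and their values are invented, the data is written to a temporary folder, and the arrow and dplyr packages are assumed to be installed.

```r
library(arrow)
library(dplyr)

# Toy stand-in for a register (hypothetical columns and values, not DST data)
toy <- data.frame(
  pnr        = sprintf("%04d", 1:6),
  birth_year = c(1950, 1962, 1975, 1981, 1990, 2001),
  sex        = c(1, 2, 1, 2, 2, 1)
)
toy_dir <- file.path(tempdir(), "toy_register")
dir.create(toy_dir, showWarnings = FALSE)
write_parquet(toy, file.path(toy_dir, "toy.parquet"))

# Lazy: this opens the folder but reads no rows into RAM
reg <- open_dataset(toy_dir)

# Still lazy: filter() and select() only describe the query
query <- reg %>%
  filter(birth_year >= 1980) %>%
  select(pnr, birth_year)

# Only collect() runs the query and returns an in-memory data frame
result <- collect(query)
print(result)
```

Until collect() is called, query is a description of what to fetch rather than a data frame, which is why the filtering costs almost no memory.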
SAS files are also large, and the RAM they fill is shared. On DST all users share the server's RAM, so read_sas() on a large SAS file burdens the server for everyone at the same time. If you use a SAS file repeatedly, it is worth converting it to parquet once: this saves significant RAM and makes loading much faster. See Convert SAS files to Parquet below for the procedure.
Why lazy loading works, and how collect() works, is the topic of the next phase.
→ Phase 5 - Extracting data step by step
Convert SAS files to Parquet
Most projects on DST receive registers as SAS files (.sas7bdat). Before you can use them with open_dataset() and lazy evaluation, they must be converted to Parquet once. After that you use them exactly like any other register.
Relevant for most projects outside DARTER. If you are working on a project where the registers have not already been converted to parquet, this step is necessary before you can run extractions. It is done once per register; after that the normal extraction pattern applies.
Why Parquet is worth it:
| | SAS (.sas7bdat) | Parquet |
|---|---|---|
| Read time (1M rows) | ~30-120 sec | ~1-3 sec |
| Disk space | Large | 50-75% smaller |
| Requires package | haven | arrow |
| Lazy evaluation | No: all into RAM | Yes: filter BEFORE collect |
The conversion is done once:
library(haven) # read SAS file
library(arrow) # write Parquet
library(dplyr) # %>%, rename_with() and glimpse()
sas_file <- "E:/workdata/[projectnumber]/raw/my_register.sas7bdat"
parq_path <- "E:/workdata/[projectnumber]/cleaned-data/parquet-external/my_register/"
# 1. Read the SAS file into R
df <- read_sas(sas_file) # reads the entire file into RAM; we call it "df", but you can use any name
# 2. Standardise column names
df <- df %>% rename_with(tolower)
# 3. Write as Parquet
dir.create(parq_path, recursive = TRUE, showWarnings = FALSE)
write_parquet(df, file.path(parq_path, "my_register.parquet"))
# 4. Verify
open_dataset(parq_path) %>% glimpse()
Large SAS files: heaven::importSAS() (pre-installed on DST) is more efficient than read_sas() for large files and can limit what is read, e.g. keep = c("pnr","atc"), where = "..." (filter rows) or obs = 1000 (first rows only, good for testing).
After conversion you can use the register lazily with open_dataset() like any other:
my_reg <- open_dataset(parq_path) %>% rename_with(tolower)
Alternative: fastreg (dp-next, MIT licence) provides a more structured pipeline for converting entire DST workspaces from SAS to Parquet with convert() and a targets-based template. Relevant if you want to build a complete parquet workspace from scratch.