Inspect and understand your data

The most important commands for seeing what you actually have

Published

June 6, 2026

In Phase 6 you made your first extraction — with synthetic data from fakeregs. Now you have data, and before you analyse it you need to understand what you have. It is only here that these commands make sense: they are tools for looking at a dataset, so they are meaningless without data to look at.

This page shows the commands you will use again and again to inspect your extracts.

Note

$ — access one column in a table data$koen means: “the column koen in the table data”. Replace data with your own dataset name and koen with your own column name. You will see $ everywhere in R code.

Tip

Want to practise the commands in RStudio? All examples on the page use bef_data — a BEF extract with columns such as koen, alder and foed_dag. You can generate it with two lines, provided you have run the preparation block in Phase 6:

# Continuing from Phase 6 — bef_data is already opened as a lazy Arrow connection
bef_data <- bef_data %>% filter(year == 2015) %>% collect()   # filter and fetch into R

Seeing a red error message?

Before looking in the code — do this in order:

Read the error message — which line is mentioned? Which object name appears?
Run class(object) — is it data ("data.frame") or still a connection ("tbl_duckdb_connection")?
Run names(object) — is the column named exactly what you think? A single letter or difference in capitalisation is enough to fail.
Isolate the failing line — run it alone and see what happens.
Use ?functionname — type e.g. ?colSums in the console to open the help documentation in the Help panel (bottom right). It shows what the function does, which arguments it takes, and examples.
Ask a colleague or search for the error message — see 2 — R: the bare essentials for the prioritised help list.

An overview of common error messages and what they typically mean is in Phase 15c — DST pitfalls.

See what you have

dim(bef_data)          # number of rows and columns — e.g. "1200 rows, 8 columns"
nrow(bef_data)         # number of rows only
ncol(bef_data)         # number of columns only
names(bef_data)        # column names as a vector

names() also works on a lazy Arrow object before collect() — e.g. names(bef) if you want to check which columns the register has before deciding what to select().

Example with simple data

df <- data.frame(
  pnr        = c("001", "002", "003"),
  sex        = c("M", "F", "M"),
  index_date = as.Date(c("2015-03-01", "2016-07-14", "2014-11-30"))
)

names(df)
# [1] "pnr"        "sex"        "index_date"

Understand the structure

glimpse(bef_data)      # column name, type and first values — compact and readable (requires dplyr)
str(bef_data)          # same information, but more verbose output
class(bef_data)        # object type — is data actually in R, or is it still a connection?
class(bef_data$alder)  # type for one column: "numeric", "character", "Date" etc.

class(bef_data) tells you whether you have real data ("data.frame"/"tbl_df") or still just an unsent query. You can see three different return values:

"data.frame" / "tbl_df" — data is in R. You can use all functions.
"tbl_duckdb_connection" / "Table" — lazy Arrow/DuckDB query. Missing collect().
"arrow_dplyr_query" — an Arrow query with one or more piped steps (e.g. filter() or select()), but not yet executed. Missing collect().

Tip

class() can help you debug Does the object look like data but behave strangely, or does your code fail with a mysterious message? Run class(your_object) — if it is not "data.frame", you are probably missing a collect(). The full table of what class() can return — and why — is in Phase 5 — Extracting data step by step.

See the first and last rows

head(bef_data)         # the first 6 rows — do the columns and types look right?
head(bef_data, 10)     # the first 10 rows
tail(bef_data)         # the last 6 rows — useful for detecting incomplete datasets

Explore the contents

As noted at the top of the page, $ is R’s way of saying “this column in this table”: data$koen is the column koen in the table data.

unique(bef_data$koen)                     # which unique values exist in the koen column?
table(bef_data$koen)                      # frequency table: how many rows have each value?
table(bef_data$koen, bef_data$civst)      # cross-table: distribution of sex across marital status

Example with simple data

# A small example with five patients and two variables:
df <- data.frame(
  sex          = c("M", "F", "M", "F", "M"),
  age_group    = c("18-40", "18-40", "41-60", "41-60", "41-60")
)

table(df$sex)
#  F  M
#  2  3       # 2 women, 3 men

table(df$sex, df$age_group)
#    18-40  41-60
# F      1      1    # 1 woman in 18-40, 1 woman in 41-60
# M      1      2    # 1 man in 18-40, 2 men in 41-60

Summarise data

Note

What is NA? NA (Not Available) is R’s term for a missing or unknown value. A cell can have NA because the information was not recorded, not reported, or does not exist for that person. Most calculation functions return NA if there is one NA in the data — unless you write na.rm = TRUE (“remove NAs”). is.na(x) returns TRUE for NA values and FALSE for everything else.

When is NA a problem? It depends on which column is missing:

NA in a key variable (index date, pnr, outcome) is serious — these persons cannot be correctly included in the analysis, and you must decide whether to exclude them.
NA in a covariate (e.g. income) can often be handled — e.g. with a separate “unknown” category or imputation.
NA from a join usually means a person was not found in the right-hand dataset — e.g. no prescription record. Here NA is effectively a “no/none”, not an error.

Always check colSums(is.na(data)) immediately after an extraction or a join, so you detect unexpected gaps before they propagate silently through the analysis.

summary(bef_data)              # min, max, median, mean and quartiles for all columns
summary(bef_data$foed_dag)     # summary of one column

For continuous variables:

min(bef_data$alder, na.rm = TRUE)    # smallest value (na.rm removes NAs - rm = remove)
max(bef_data$alder, na.rm = TRUE)    # largest value
mean(bef_data$alder, na.rm = TRUE)   # mean
median(bef_data$alder, na.rm = TRUE) # median
sd(bef_data$alder, na.rm = TRUE)     # standard deviation
IQR(bef_data$alder, na.rm = TRUE)    # interquartile range (Q3 - Q1)

Check missing values

sum(is.na(bef_data$koen))    # number of NAs in the koen column — replace with your own column name
colSums(is.na(bef_data))     # number of NAs per column — gives an overview of the entire dataset

colSums(is.na(bef_data)) returns one line with a counter per column:

#  pnr  koen  alder  foed_dag  year  civst  opr_land  reg
#    0     0      0         3     0      0         0    0

Here foed_dag is missing for 3 — everything else is complete. A column with 0 has no NAs at all.

Check dates

Date columns can contain impossible values — dates far outside the study period are a sign of a conversion error or wrong column.

min(bef_data$foed_dag, na.rm = TRUE)   # is the earliest birth date plausible?
max(bef_data$foed_dag, na.rm = TRUE)   # is the latest birth date plausible?

# Check for dates BEFORE an expected interval:
sum(bef_data$foed_dag < as.Date("1900-01-01"), na.rm = TRUE)   # replace the date with your lower bound

# Check for dates AFTER an expected interval:
sum(bef_data$foed_dag > as.Date("2015-12-31"), na.rm = TRUE)   # replace the date with your upper bound

See Phase 15c — Pitfalls for the most common date conversion errors and how to fix them.

More exploration: count, sort and quick plots

Count and sort (dplyr):

Replace bef_data with your own dataset name and column names with your own.

bef_data %>% count(koen)               # number of rows per category
bef_data %>% count(koen, civst)        # per combination of two variables
bef_data %>% arrange(foed_dag)         # sort ascending by birth date
bef_data %>% arrange(desc(foed_dag))   # sort descending

Quick visualisations — for getting an overview, not for publication:

Replace bef_data with your own dataset name and column names with your own.

# Continuous variables:
hist(bef_data$alder)                            # histogram
boxplot(bef_data$alder)                         # box plot
boxplot(alder ~ koen, data = bef_data)          # box plot split by sex

# Categorical variables:
barplot(table(bef_data$koen))                   # bar chart

Next steps

You can now inspect a dataset. Next steps are to know which registers contain what:

Phase 8 — Know your registers — decision table: which register contains what
Phase 15a — Functions: overview — filter(), select(), mutate(), left_join() and more
dataReporter — automatically generates an HTML report of all columns (distribution, missingness, outliers). Confirmed on DST.