Inspect and understand your data
The most important commands for seeing what you actually have
In Phase 6 you made your first extraction β with synthetic data from fakeregs. Now you have data, and before you analyse it you need to understand what you have. It is only here that these commands make sense: they are tools for looking at a dataset, so they are meaningless without data to look at.
This page shows the commands you will use again and again to inspect your extracts.
$ β access one column in a table data$koen means: βthe column koen in the table dataβ. Replace data with your own dataset name and koen with your own column name. You will see $ everywhere in R code.
Want to practise the commands in RStudio? All examples on the page use bef_data β a BEF extract with columns such as koen, alder and foed_dag. You can generate it with two lines, provided you have run the preparation block in Phase 6:
# Continuing from Phase 6 β bef_data is already opened as a lazy Arrow connection
bef_data <- bef_data %>% filter(year == 2015) %>% collect() # filter and fetch into RSeeing a red error message?
Before looking in the code β do this in order:
- Read the error message β which line is mentioned? Which object name appears?
- Run
class(object)β is it data ("data.frame") or still a connection ("tbl_duckdb_connection")? - Run
names(object)β is the column named exactly what you think? A single letter or difference in capitalisation is enough to fail. - Isolate the failing line β run it alone and see what happens.
- Use
?functionnameβ type e.g.?colSumsin the console to open the help documentation in the Help panel (bottom right). It shows what the function does, which arguments it takes, and examples. - Ask a colleague or search for the error message β see 2 β R: the bare essentials for the prioritised help list.
An overview of common error messages and what they typically mean is in Phase 15c β DST pitfalls.
See what you have
dim(bef_data) # number of rows and columns β e.g. "1200 rows, 8 columns"
nrow(bef_data) # number of rows only
ncol(bef_data) # number of columns only
names(bef_data) # column names as a vectornames() also works on a lazy Arrow object before collect() β e.g. names(bef) if you want to check which columns the register has before deciding what to select().
Example with simple data
df <- data.frame(
pnr = c("001", "002", "003"),
sex = c("M", "F", "M"),
index_date = as.Date(c("2015-03-01", "2016-07-14", "2014-11-30"))
)
names(df)
# [1] "pnr" "sex" "index_date"Understand the structure
glimpse(bef_data) # column name, type and first values β compact and readable (requires dplyr)
str(bef_data) # same information, but more verbose output
class(bef_data) # object type β is data actually in R, or is it still a connection?
class(bef_data$alder) # type for one column: "numeric", "character", "Date" etc.class(bef_data) tells you whether you have real data ("data.frame"/"tbl_df") or still just an unsent query. You can see three different return values:
"data.frame"/"tbl_df"β data is in R. You can use all functions."tbl_duckdb_connection"/"Table"β lazy Arrow/DuckDB query. Missingcollect()."arrow_dplyr_query"β an Arrow query with one or more piped steps (e.g.filter()orselect()), but not yet executed. Missingcollect().
class() can help you debug Does the object look like data but behave strangely, or does your code fail with a mysterious message? Run class(your_object) β if it is not "data.frame", you are probably missing a collect(). The full table of what class() can return β and why β is in Phase 5 β Extracting data step by step.
See the first and last rows
head(bef_data) # the first 6 rows β do the columns and types look right?
head(bef_data, 10) # the first 10 rows
tail(bef_data) # the last 6 rows β useful for detecting incomplete datasetsExplore the contents
As noted at the top of the page, $ is Rβs way of saying βthis column in this tableβ: data$koen is the column koen in the table data.
unique(bef_data$koen) # which unique values exist in the koen column?
table(bef_data$koen) # frequency table: how many rows have each value?
table(bef_data$koen, bef_data$civst) # cross-table: distribution of sex across marital statusExample with simple data
# A small example with five patients and two variables:
df <- data.frame(
sex = c("M", "F", "M", "F", "M"),
age_group = c("18-40", "18-40", "41-60", "41-60", "41-60")
)
table(df$sex)
# F M
# 2 3 # 2 women, 3 men
table(df$sex, df$age_group)
# 18-40 41-60
# F 1 1 # 1 woman in 18-40, 1 woman in 41-60
# M 1 2 # 1 man in 18-40, 2 men in 41-60Summarise data
What is NA? NA (Not Available) is Rβs term for a missing or unknown value. A cell can have NA because the information was not recorded, not reported, or does not exist for that person. Most calculation functions return NA if there is one NA in the data β unless you write na.rm = TRUE (βremove NAsβ). is.na(x) returns TRUE for NA values and FALSE for everything else.
When is NA a problem? It depends on which column is missing:
- NA in a key variable (index date, pnr, outcome) is serious β these persons cannot be correctly included in the analysis, and you must decide whether to exclude them.
- NA in a covariate (e.g. income) can often be handled β e.g. with a separate βunknownβ category or imputation.
- NA from a join usually means a person was not found in the right-hand dataset β e.g. no prescription record. Here
NAis effectively a βno/noneβ, not an error.
Always check colSums(is.na(data)) immediately after an extraction or a join, so you detect unexpected gaps before they propagate silently through the analysis.
summary(bef_data) # min, max, median, mean and quartiles for all columns
summary(bef_data$foed_dag) # summary of one columnFor continuous variables:
min(bef_data$alder, na.rm = TRUE) # smallest value (na.rm removes NAs - rm = remove)
max(bef_data$alder, na.rm = TRUE) # largest value
mean(bef_data$alder, na.rm = TRUE) # mean
median(bef_data$alder, na.rm = TRUE) # median
sd(bef_data$alder, na.rm = TRUE) # standard deviation
IQR(bef_data$alder, na.rm = TRUE) # interquartile range (Q3 - Q1)Check missing values
sum(is.na(bef_data$koen)) # number of NAs in the koen column β replace with your own column name
colSums(is.na(bef_data)) # number of NAs per column β gives an overview of the entire datasetcolSums(is.na(bef_data)) returns one line with a counter per column:
# pnr koen alder foed_dag year civst opr_land reg
# 0 0 0 3 0 0 0 0
Here foed_dag is missing for 3 β everything else is complete. A column with 0 has no NAs at all.
Check dates
Date columns can contain impossible values β dates far outside the study period are a sign of a conversion error or wrong column.
min(bef_data$foed_dag, na.rm = TRUE) # is the earliest birth date plausible?
max(bef_data$foed_dag, na.rm = TRUE) # is the latest birth date plausible?
# Check for dates BEFORE an expected interval:
sum(bef_data$foed_dag < as.Date("1900-01-01"), na.rm = TRUE) # replace the date with your lower bound
# Check for dates AFTER an expected interval:
sum(bef_data$foed_dag > as.Date("2015-12-31"), na.rm = TRUE) # replace the date with your upper boundSee Phase 15c β Pitfalls for the most common date conversion errors and how to fix them.
More exploration: count, sort and quick plots
Count and sort (dplyr):
Replace bef_data with your own dataset name and column names with your own.
bef_data %>% count(koen) # number of rows per category
bef_data %>% count(koen, civst) # per combination of two variables
bef_data %>% arrange(foed_dag) # sort ascending by birth date
bef_data %>% arrange(desc(foed_dag)) # sort descendingQuick visualisations β for getting an overview, not for publication:
Replace bef_data with your own dataset name and column names with your own.
# Continuous variables:
hist(bef_data$alder) # histogram
boxplot(bef_data$alder) # box plot
boxplot(alder ~ koen, data = bef_data) # box plot split by sex
# Categorical variables:
barplot(table(bef_data$koen)) # bar chartNext steps
You can now inspect a dataset. Next steps are to know which registers contain what:
- Phase 8 β Know your registers β decision table: which register contains what
- Phase 15a β Functions: overview β
filter(),select(),mutate(),left_join()and more dataReporterβ automatically generates an HTML report of all columns (distribution, missingness, outliers). Confirmed on DST.