While the song is playing...
Draw a mental model / concept map of last lecture's content on data visualisation.
Why would I want to deal with missing data?
I was handed a dataset in my PhD that had more than 65% of the data missing. We had to spend quite a bit of time exploring the missingness relationships, and the whole process was very frustrating. This inspired me to develop methods for exploring missing data, which ended up being a substantial part of my thesis.
But today we are going to talk about some example weather data from Melbourne.
GSODR
So here we can create a graphic where the data that are present are in blue, and the data that are missing are in red. What we see is that the missing air temperature values, in red, occur only beyond gust speeds of 8 km/h.
So in cases of strong winds, the air temperature measurements break. Done. Dusted.
Wait, What?
But, like all good explanations, this one is simple, while the process of getting there probably was not.
In order to get to a position where you could generate this graphic, and show this data, you probably had to spend more time than you would have liked developing exploratory data analyses and models.
It might seem obvious to point out the precise mechanism for generating the missingness, but this is kind of difficult when there is a lot of missing data, or when there are many variables.
What dealing with missing data looks like in a paper
What dealing with missing data actually looks like
What I want dealing with missing data to be like
visdat.njtierney.com
naniar.njtierney.com
What even are missing values
How to start looking at missing data
How to start exploring missing data
How to impute (fill in) Missing values
Missing values are values that should have been recorded but were not.
NA = Not Available.
Before we get started, we need to define missing values
Missing values are values that should have been recorded, but were not.
Think of it this way:
You might accidentally not record seeing a bird - this is a missing value. This is different to recording that there were no birds observed.
R stores missing values as NA, which stands for Not Available.
x <- c(1, NA, 3, NA, NA, 5)
any_na(x)
[1] TRUE
are_na(x)
[1] FALSE TRUE FALSE TRUE TRUE FALSE
n_miss(x)
[1] 3
prop_miss(x)
[1] 0.5
Missing values don't jump out and scream "I'm here!". They're usually hidden, like a needle in a haystack.
To detect missing values use any_na, which returns TRUE if there are any missings, and FALSE if there are none.
are_na asks "are these NA?" and returns TRUE/FALSE for each value.
Here are_na shows us 3 TRUE values - 3 missing values.
To avoid counting each TRUE yourself, n_miss counts the number of missings.
And prop_miss gives the proportion of missings, which gives important context: 50% of the data is missing!
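For readers without naniar installed, the four helpers above can be approximated with base R. This is a minimal sketch, not naniar's implementation: the mapping in the comments is an assumption about equivalent behaviour on a plain numeric vector.

```r
x <- c(1, NA, 3, NA, NA, 5)

# Base R counterparts (assumed equivalent for a simple vector)
has_missing   <- anyNA(x)        # plays the role of any_na(x)
which_missing <- is.na(x)        # plays the role of are_na(x)
count_missing <- sum(is.na(x))   # plays the role of n_miss(x)
share_missing <- mean(is.na(x))  # plays the role of prop_miss(x)

print(has_missing)
print(which_missing)
print(count_missing)
print(share_missing)
```

The last two lines work because TRUE counts as 1 and FALSE as 0, so summing the logical vector counts the missings and averaging it gives their proportion.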
NA + [anything] = NA
heights
Sophie    Dan   Fred
   165    177     NA
sum(heights)
[1] NA
na.rm = TRUE will remove missings
sum(heights, na.rm = TRUE)
[1] 342
Use this power responsibly!
So what happens when we mix missing values with our calculations? We need to know what happens, so we can be primed to find these cases. The general rule is this:
Calculations with NA return NA.
Say you have the height of three friends: Sophie, Dan, and Fred.
The sum of their heights returns NA.
This is because we don't know the sum of a number and NA.
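The rule can be checked directly in base R. The sketch below reuses the heights of the three friends, and also records how many values na.rm = TRUE silently dropped, which is one way to use that option responsibly (the variable names are only illustrative).

```r
heights <- c(Sophie = 165, Dan = 177, Fred = NA)

na_plus_one <- NA + 1                       # NA: the unknown stays unknown
total_all   <- sum(heights)                 # NA, because Fred's height is unknown
total_known <- sum(heights, na.rm = TRUE)   # sum over the recorded heights only
mean_known  <- mean(heights, na.rm = TRUE)  # mean over the recorded heights only

# Keep track of how many values were dropped to get these numbers
n_dropped <- sum(is.na(heights))

print(total_all)
print(total_known)
print(mean_known)
print(n_dropped)
```

Reporting n_dropped next to any summary computed with na.rm = TRUE makes it clear that the result describes two friends, not three.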
exercise-5a-intro-missing.Rmd
# install.packages("usethis")
library(usethis)
use_course("dmac.netlify.com/lectures/lecture5a/exercise/exercise-5a.zip")
Basic summaries of missingness:
n_miss
n_complete
Dataframe summaries of missingness:
miss_var_summary
miss_case_summary
These functions work with group_by
Now that you understand what missing values are, how to count them, and how they operate, let's scale these up to more detailed summaries of missingness.
We need to summarise missing data to identify variables, cases, or patterns of missingness, as these can bias our data analysis.
There are two main summaries: basic, and dataframe summaries.
Basic summaries return a single number, like the number of missing or complete values using n_miss or n_complete.
However, you will need more detailed missingness summaries to help you on your journey through a data analysis.
This lesson introduces you to missing data summaries.
naniar provides a family of functions all starting with miss_, which each provide different summaries of missingness, and return a dataframe.
This allows us to see features that can be difficult to articulate, or time consuming to calculate.
For example, miss_var_summary and miss_case_summary return the number and percentage of missings in each variable or case.
These summaries work with dplyr's group_by, so you can fluidly explore missingness by group.
miss_var_summary(dat_sf_clean)
## # A tibble: 6 x 3
##   variable       n_miss pct_miss
##   <chr>           <int>    <dbl>
## 1 temp_min           70    17.3
## 2 temp_max           70    17.3
## 3 temp_avg           70    17.3
## 4 wind_speed_max     23     5.68
## 5 date                0     0
## 6 month               0     0
Use miss_var_summary to summarise the number of missings in each variable.
This returns a dataframe where each row is a variable. It also includes summaries of the number and percentage of missings for each variable in the dataset, and is sorted by the number of missings.
For example, temp_min has 70 missing values, and is about 17.3 percent missing.
miss_case_summary(dat_sf_clean)
## # A tibble: 405 x 3
##     case n_miss pct_miss
##    <int>  <int>    <dbl>
##  1    89      4     66.7
##  2   182      4     66.7
##  3   188      4     66.7
##  4   271      4     66.7
##  5     6      3     50
##  6     7      3     50
##  7    10      3     50
##  8    29      3     50
##  9    37      3     50
## 10    39      3     50
## # … with 395 more rows
Similar to miss_var_summary, miss_case_summary returns a summary dataframe, where each row is a case, identified by its row number in the dataset.
Here, case 89 - the 89th row in the dataset - has 4 missing values, which means 66.7% of that case is missing.
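To see what these two summaries compute, here is a rough base R sketch. It uses the built-in airquality dataset rather than dat_sf_clean, which is not available outside the lecture, and it only imitates the shape of the naniar output (variable or case, n_miss, pct_miss); it is not how naniar implements them.

```r
# airquality ships with R (datasets package)

# Per-variable summary: count NAs down each column
n_miss_var <- colSums(is.na(airquality))
var_summary <- data.frame(
  variable = names(n_miss_var),
  n_miss   = as.integer(n_miss_var),
  pct_miss = 100 * as.numeric(n_miss_var) / nrow(airquality)
)
# Sort so the variable with the most missings comes first
var_summary <- var_summary[order(-var_summary$n_miss), ]

# Per-case summary: count NAs across each row
n_miss_case <- rowSums(is.na(airquality))
case_summary <- data.frame(
  case     = seq_len(nrow(airquality)),
  n_miss   = as.integer(n_miss_case),
  pct_miss = 100 * as.numeric(n_miss_case) / ncol(airquality)
)
case_summary <- case_summary[order(-case_summary$n_miss), ]

print(var_summary)
print(head(case_summary))
```

In airquality, Ozone has 37 missing values (about 24.2 percent), and case 5 has 2 missing values out of 6 variables, so about 33 percent of that case is missing.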
miss_var_table(dat_sf_clean)
## # A tibble: 3 x 3
##   n_miss_in_var n_vars pct_vars
##           <int>  <int>    <dbl>
## 1             0      2     33.3
## 2            23      1     16.7
## 3            70      3     50
miss_case_table(dat_sf_clean)
## # A tibble: 4 x 3
##   n_miss_in_case n_cases pct_cases
##            <int>   <int>     <dbl>
## 1              0     316    78.0
## 2              1      19     4.69
## 3              3      66    16.3
## 4              4       4     0.988
Tabulations of missingness count the number of times there are 0, 1, 2, 3, and so on, missings. They are very useful, compact summaries that reveal interesting structure.
miss_var_table returns a dataframe with the number of missings in a variable, and the number and percentage of variables affected.
For example, there are two variables with no missings detected, which corresponds to 33.3 percent of variables, there is 1 variable with 23 missings, and there are 3 variables with 70 missings.
Similarly, miss_case_table returns the same information, but for cases.
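The tabulations are counts of counts, which a short base R sketch makes concrete. It again uses the built-in airquality dataset as a stand-in, and the column names simply mirror the output shown above; treat it as an illustration of the idea rather than naniar's code.

```r
# Tabulate how many cases have 0, 1, 2, ... missing values
n_miss_in_case <- rowSums(is.na(airquality))
case_counts <- table(n_miss_in_case)
case_table <- data.frame(
  n_miss_in_case = as.integer(names(case_counts)),
  n_cases        = as.integer(case_counts),
  pct_cases      = 100 * as.integer(case_counts) / nrow(airquality)
)

# Tabulate how many variables have a given number of missing values
n_miss_in_var <- colSums(is.na(airquality))
var_counts <- table(n_miss_in_var)
var_table <- data.frame(
  n_miss_in_var = as.integer(names(var_counts)),
  n_vars        = as.integer(var_counts),
  pct_vars      = 100 * as.integer(var_counts) / ncol(airquality)
)

print(case_table)
print(var_table)
```

For airquality this gives four variables with no missings (66.7 percent of variables), one variable with 7 missings, and one variable with 37 missings.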
group_by
dat_sf_clean %>% group_by(month) %>% miss_var_summary()
## # A tibble: 60 x 4
##    month variable       n_miss pct_miss
##    <dbl> <chr>           <int>    <dbl>
##  1     1 temp_min            7    11.5
##  2     1 temp_max            7    11.5
##  3     1 temp_avg            7    11.5
##  4     1 wind_speed_max      4     6.56
##  5     1 date                0     0
##  6     2 temp_min            5    12.8
##  7     2 temp_max            5    12.8
##  8     2 temp_avg            5    12.8
##  9     2 wind_speed_max      4    10.3
## 10     2 date                0     0
## # … with 50 more rows
Sometimes you are interested in missingness for groups in the data.
Each missingness summary function can be calculated by group, using group_by from dplyr.
For example, we can look at the missingness by month in the dat_sf_clean dataset.
Here we see that in month 1 temp_min has 7 missings (11.5 percent), while in month 2 it has 5 missings (12.8 percent).
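The same grouped idea can be sketched in base R with tapply, here on the built-in airquality dataset (an assumption made so the example runs without the lecture data or dplyr). It counts the missing Ozone values within each Month.

```r
# Number of missing Ozone values in each month
ozone_miss_by_month <- tapply(is.na(airquality$Ozone), airquality$Month, sum)

# Percentage of missing Ozone values in each month
ozone_pct_by_month <- 100 * tapply(is.na(airquality$Ozone), airquality$Month, mean)

print(ozone_miss_by_month)
print(round(ozone_pct_by_month, 1))
```

In airquality, Ozone has 5 missings in Month 5 but 21 in Month 6, a difference that the ungrouped total of 37 hides.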
naniar provides a friendly family of missing data visualisation functions. We cover how to get a bird's eye view of the data, how to look at missings in the variables and cases, and how to generate visualisations for missing spans and across groups in the data.
We now know what missing values are, how they work, how to count and summarise them - now let's look at some of the built-in visualisations that come with naniar.
Data summaries are very useful, but sometimes an idea or a thought can be quickly captured with a visualisation.
naniar provides a friendly family of missing data visualisation functions, each presenting a different visualisation of missingness summaries.
In fact, each of these visualisations is a nice compact shorthand for the data summaries. While you could create similar and more complex visualisations using the summary information from the previous lesson, this can be repetitive. The visualisations in naniar reduce repetition and increase iteration, so you can operate closer to the speed of thought.
vis_miss(dat_sf_clean)
When you first get a dataset, it can be difficult to get a visceral sense of where the missings are.
To get an overview of the amount of missingness, use the vis_miss function from the visdat package.
vis_miss produces a "heatmap" of the missingness - as if the plot were the dataset laid out as a giant spreadsheet, with values coloured black for missing, and grey for present.
vis_miss also provides missingness summary statistics, showing the overall percentage of missingness in the legend, and the amount of missings in each variable.
These can be turned off in its options, described in the helpfile.
vis_miss(dat_sf_clean, cluster = TRUE)
vis_miss also allows for clustering of the missing data by setting cluster = TRUE: this orders the rows by missingness to identify common co-occurrences.
gg_miss_var(dat_sf_clean)
gg_miss_case(dat_sf_clean)
To quickly show the missingness in variables and cases, we visualise them using gg_miss_var and gg_miss_case. Note that these are visual analogues of the miss_var_summary and miss_case_summary functions.
These plots show the amount of missingness on the x axis. For gg_miss_var, each point represents the amount of missingness in that variable, and for gg_miss_case, each line represents the amount of missingness in that case.
Note that these visualisations are ordered so that the most missing is at the top. The ordering in gg_miss_case can be turned off with the option order_cases = FALSE.
gg_miss_var(dat_sf_clean, facet = month)
gg_miss_var and gg_miss_case also allow for facetting by one variable.
This means you can explore missingness in cases and variables across the levels of another group.
This plot is facetted by month, showing the number of missings in each variable for each month.
In the equivalent plot for the airquality dataset, Ozone in Month 6 has the most missings.
gg_miss_upset(dat_sf_clean)
To visualise the common combinations of missingness - which variables and cases go missing together - use gg_miss_upset.
This powerful visualisation shows the number of combinations of missing values that co-occur.
For example, an upset plot of the airquality dataset shows there are only missing values in Ozone and Solar.R: 35 cases are missing only Ozone, 5 are missing only Solar.R, and 2 are missing both.
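The counts that an upset plot displays are counts of missingness patterns, which can be tabulated by hand. The base R sketch below does this for the two airquality variables that contain missings; the pattern labels are made up for illustration.

```r
# Logical matrix: TRUE where a value is missing
miss_flags <- is.na(airquality[, c("Ozone", "Solar.R")])
ozone_na <- miss_flags[, "Ozone"]
solar_na <- miss_flags[, "Solar.R"]

# Label each case by which of the two variables are missing
pattern <- ifelse(ozone_na & solar_na, "both",
           ifelse(ozone_na, "Ozone only",
           ifelse(solar_na, "Solar.R only", "neither")))

pattern_counts <- table(pattern)
print(pattern_counts)
```

With more than a handful of variables the number of possible patterns grows quickly, which is where a dedicated plot becomes more practical than nested ifelse calls.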
gg_miss_fct(x = dat_sf_clean, fct = month)
To explore how missingness in each variable changes across a factor, use gg_miss_fct.
This displays a heatmap visualisation showing the factors on the x axis, each other variable on the y axis, and the amount of missingness coloured from dark purple to yellow.
gg_miss_fct does not support facetting.
miss_*
miss_var_*
miss_case_*
gg_miss_*
gg_miss_var
gg_miss_case
Principles of Tidy Missing Data
The Shadowlands
Representing Missing values
Variables in columns
Observations in Rows
One value per cell
Variable ends in NA
Values are missing (NA) or not (!NA)
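These principles can be sketched in a few lines of base R. make_shadow below is a hypothetical helper written for illustration, not naniar's bind_shadow: it builds one _NA column per variable, each a factor with the levels !NA and NA, and binds it next to the built-in airquality data.

```r
# Hypothetical helper: one shadow column per variable, named <variable>_NA
make_shadow <- function(data) {
  shadow <- lapply(data, function(col) {
    # The labels are the strings "NA" and "!NA", not real missing values
    factor(ifelse(is.na(col), "NA", "!NA"), levels = c("!NA", "NA"))
  })
  shadow <- as.data.frame(shadow)
  names(shadow) <- paste0(names(data), "_NA")
  shadow
}

# Data and shadow side by side, same rows, twice the columns
aq_shadow <- cbind(airquality, make_shadow(airquality))

print(names(aq_shadow))
print(table(aq_shadow$Ozone_NA))
```

Because the shadow columns never change, they still record where the missings were after the data columns have been imputed.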
bind_shadow(data)
bind_shadow()
bind_shadow(dat_sf_clean) %>% glimpse()
## Observations: 405
## Variables: 12
## $ date              <date> 2017-01-02, 2017-01-03, 2017-01-04, 2017-01-05, 2017-01-06, …
## $ month             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ temp_min          <dbl> 5.9, 8.4, 10.4, 5.8, 4.4, NA, NA, 11.9, 11.9, NA, 6.7, 5.5, 6…
## $ temp_max          <dbl> 10.5, 11.1, 14.5, 9.4, 9.1, NA, NA, 16.0, 14.6, NA, 11.2, 11.…
## $ temp_avg          <dbl> 8.4, 9.7, 12.4, 8.1, 6.4, NA, NA, 13.5, 13.2, NA, 8.5, 7.7, 8…
## $ wind_speed_max    <dbl> 5.1, 5.1, 6.7, 5.1, 5.1, 8.8, 8.2, 7.2, 7.7, 8.2, 5.1, 4.1, 4…
## $ date_NA           <fct> !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !…
## $ month_NA          <fct> !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !…
## $ temp_min_NA       <fct> !NA, !NA, !NA, !NA, !NA, NA, NA, !NA, !NA, NA, !NA, !NA, !NA,…
## $ temp_max_NA       <fct> !NA, !NA, !NA, !NA, !NA, NA, NA, !NA, !NA, NA, !NA, !NA, !NA,…
## $ temp_avg_NA       <fct> !NA, !NA, !NA, !NA, !NA, NA, NA, !NA, !NA, NA, !NA, !NA, !NA,…
## $ wind_speed_max_NA <fct> !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !…
dat_sf_clean %>% ggplot(aes(x = wind_speed_max)) + geom_density()
dat_sf_clean %>% bind_shadow() %>% ggplot(aes(x = wind_speed_max, colour = temp_avg_NA)) + geom_density()
Provides a consistent way to set the existing location of missing values.
This has additional great benefits when combined with imputation
Also allows for additional visualisations
ggplot(dat_sf_clean, aes(x = temp_avg, y = wind_speed_max)) + geom_point()
## # A tibble: 7 x 2
##   temp_avg temp_avg_NA
##      <dbl> <fct>
## 1      8.4 !NA
## 2      9.7 !NA
## 3     12.4 !NA
## 4      8.1 !NA
## 5      6.4 !NA
## 6     NA   NA
## 7     NA   NA
## # A tibble: 7 x 2
##   temp_avg temp_avg_NA
##      <dbl> <fct>
## 1     8.4  !NA
## 2     9.7  !NA
## 3    12.4  !NA
## 4     8.1  !NA
## 5     6.4  !NA
## 6     5.66 NA
## 7     5.69 NA
One approach (from ggobi) is to shift missing values below the minimum value.
This then means that they can be plotted on the same axis.
Typically, when exploring this data, you would do something like this:
The problem with this is that ggplot does not handle missings by default, and removes the missing values. This makes it hard to explore the missing values.
impute_below()
dat_sf_clean %>% slice(5:10) %>% mutate(temp_avg_shift = impute_below(temp_avg)) %>% select(temp_avg, temp_avg_shift)
## # A tibble: 6 x 2
##   temp_avg temp_avg_shift
##      <dbl>          <dbl>
## 1      6.4           6.4
## 2     NA             5.73
## 3     NA             5.74
## 4     13.5          13.5
## 5     13.2          13.2
## 6     NA             5.53
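The idea of shifting missings below the minimum can be sketched in base R. shift_below is a hypothetical helper, and its prop_below and jitter arguments are assumptions chosen for illustration; it does not reproduce the exact values that impute_below returns above.

```r
# Hypothetical helper: place missing values just below the observed minimum,
# with a little random noise so the points do not sit on top of each other
shift_below <- function(x, prop_below = 0.1, jitter = 0.05) {
  rng <- range(x, na.rm = TRUE)
  width <- diff(rng)
  centre <- rng[1] - prop_below * width
  n_na <- sum(is.na(x))
  # Noise is smaller than the shift, so every imputed value stays below the minimum
  x[is.na(x)] <- centre + runif(n_na, -jitter * width, jitter * width)
  x
}

set.seed(2019)  # only so the jitter is reproducible
temp_avg <- c(6.4, NA, NA, 13.5, 13.2, NA)
temp_avg_shift <- shift_below(temp_avg)

print(data.frame(temp_avg, temp_avg_shift))
```

The shifted values carry no information about the true temperatures; they exist only so the missing cases get a position on the axis.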
geom_miss_point()
ggplot(dat_sf_clean, aes(x = wind_speed_max, y = temp_avg)) + geom_miss_point()
Instead, we have created a ggplot2 geom, geom_miss_point().
This geom allows for missing values to be displayed, and also works with the rest of ggplot2 - themes, and facets as well.
Exploring imputed values
dat_sf_clean %>% simputation::impute_lm(temp_avg ~ wind_speed_max) %>% ggplot(aes(x = date, y = temp_avg)) + geom_point()
They are invisible!
Where are the imputed values?
bind_shadow(dat_sf_clean) %>% simputation::impute_lm(temp_avg ~ wind_speed_max + date) %>% ggplot(aes(x = date, y = temp_avg, colour = temp_avg_NA)) + geom_point() + scale_colour_brewer(palette = "Dark2")
bind_shadow(dat_sf_clean) %>% simputation::impute_lm(temp_avg ~ wind_speed_max) %>% ggplot(aes(x = temp_avg, fill = wind_speed_max_NA)) + geom_density(alpha = 0.5)
oceanbuoys
## # A tibble: 736 x 8
##     year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
##    <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl>
##  1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40
##  2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30
##  3  1997        0      -110       27.6       27       76.5   -5.10    4.5
##  4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5
##  5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10
##  6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60
##  7  1997        0      -110       28.0       27.0     76.5   -2       3.5
##  8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5
##  9  1997        0      -110       28.0       27.2     78.6   -4.20    5
## 10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5
## # … with 726 more rows
vis_dat(oceanbuoys)
vis_miss(oceanbuoys)
What do we learn?
Overall statistics, 3% of possible values are missing. That's not much. BUT, both air temperature and humidity have more than 10% missing which is too much to ignore.
This type of display is called a "heatmap". It displays the data table, with cells coloured according to some other information; in this case, the type of variable and the missingness status. What do we learn?
character (text) variables, or integer variables.
gg_miss_upset(oceanbuoys)
ggplot(oceanbuoys, aes(x = sea_temp_c, y = humidity)) + geom_point() + theme(aspect.ratio = 1)
This is a problem, because results computed on data with missing values might be biased. For example, ggplot ignores them, but at least tells you it's ignoring them:
geom_miss_point()
ggplot(oceanbuoys, aes(x = sea_temp_c, y = humidity)) + scale_colour_brewer(palette="Dark2") + geom_miss_point() + theme(aspect.ratio=1)
ggplot(oceanbuoys, aes(x = sea_temp_c, y = humidity)) + geom_miss_point() + scale_colour_brewer(palette = "Dark2") + facet_wrap(~year) + theme(aspect.ratio=1)
ggplot(oceanbuoys, aes(x = sea_temp_c, y = air_temp_c)) + geom_miss_point() + scale_colour_brewer(palette="Dark2") + facet_wrap(~year) + theme(aspect.ratio=1)
What do we learn?
tao_shadow <- bind_shadow(oceanbuoys)
tao_shadow
## # A tibble: 736 x 16
##     year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns year_NA
##    <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl> <fct>
##  1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40 !NA
##  2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30 !NA
##  3  1997        0      -110       27.6       27       76.5   -5.10    4.5  !NA
##  4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5  !NA
##  5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10 !NA
##  6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60 !NA
##  7  1997        0      -110       28.0       27.0     76.5   -2       3.5  !NA
##  8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5  !NA
##  9  1997        0      -110       28.0       27.2     78.6   -4.20    5    !NA
## 10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5  !NA
## # … with 726 more rows, and 7 more variables: latitude_NA <fct>, longitude_NA <fct>,
## #   sea_temp_c_NA <fct>, air_temp_c_NA <fct>, humidity_NA <fct>, wind_ew_NA <fct>,
## #   wind_ns_NA <fct>
tao_imp_mean <- tao_shadow %>% mutate(sea_temp_c = impute_mean(sea_temp_c), air_temp_c = impute_mean(air_temp_c))
tao_shadow
ggplot(tao_imp_mean, aes(x = sea_temp_c, y = air_temp_c, colour = air_temp_c_NA)) + geom_point(alpha = 0.7) + facet_wrap(~year) + scale_colour_brewer(palette = "Dark2") + theme(aspect.ratio = 1)
What do we learn?
tao_shadow <- tao_shadow %>% group_by(year) %>% mutate(sea_temp_c = impute_mean(sea_temp_c), air_temp_c = impute_mean(air_temp_c))
ggplot(tao_shadow, aes(x = sea_temp_c, y = air_temp_c, colour=air_temp_c_NA)) + geom_point(alpha=0.7) + facet_wrap(~year) + scale_colour_brewer(palette="Dark2") + theme(aspect.ratio=1)
What do we learn?
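One way to see the difference between the overall and the grouped mean imputation is on a toy example. The sketch below uses base R and a small made-up data frame (the values and the helper name impute_mean_vec are hypothetical, not taken from oceanbuoys or naniar), with a logical column playing the role of the shadow.

```r
# Hypothetical helper: replace missing values with the mean of the observed ones
impute_mean_vec <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Made-up data: two years with clearly different temperatures
toy <- data.frame(
  year       = c(1, 1, 1, 2, 2, 2),
  air_temp_c = c(24, NA, 26, 27, 28, NA)
)

# Record where the missings were before imputing (the shadow)
toy$air_temp_c_NA <- is.na(toy$air_temp_c)

# Overall mean: every missing value gets the same number
toy$imp_overall <- impute_mean_vec(toy$air_temp_c)

# Mean within year: each missing value gets its own year's mean
toy$imp_by_year <- ave(toy$air_temp_c, toy$year, FUN = impute_mean_vec)

print(toy)
```

The overall mean falls between the two years and so fits neither, while the grouped mean keeps each imputed value inside its own year's range. Either way, mean imputation places all imputed values for a group at a single point.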
ggplot(data = tao_shadow, aes(x = wind_ew, y=wind_ns, colour=air_temp_c_NA)) + scale_colour_brewer(palette="Dark2") + geom_point(alpha=0.7) + theme(aspect.ratio=1)
What do we learn?
# install.packages("usethis")
library(usethis)
use_course("dmac.netlify.com/lectures/lecture4b/exercise/exercise-5a.zip")
This work is licensed under a Creative Commons Attribution 4.0 International License.