ETC1010: Data Modelling and Computing

Lecture 5A: Missing Data

Dr. Nicholas Tierney & Professor Di Cook

EBS, Monash U.

2019-08-30

While the song is playing...

Draw a mental model / concept map of last lecture's content on data visualisation.

recap

Joins
advanced data vis

Motivation

Why would I want to deal with missing data?


I was handed a dataset in my PhD that had more than 65% of the data missing. We had to spend quite a bit of time exploring the missingness relationships, and the whole process was very frustrating. This inspired me to develop methods for exploring missing data, which ended up being a substantial part of my thesis.

But today we are going to talk about some example weather data from San Francisco.

Example

San Francisco weather data

|| Date | Wind | Temp ||

Using the R package: `GSODR`

(Global Surface Summary of the Day)

Written by Adam Sparks

github.com/ropensci/GSODR


Why data was missing

So here we can create a graphic where the data that are present are in blue, and the data that are missing are in red. What we see is that the missing air temperature values, shown in red, occur only at gust speeds above 8 km/h.

So in cases of strong winds, the air temperature measurements break. Done. Dusted.


Wait, What?


Like all good explanations, this one is simple, but the process of getting to the point where we could describe it probably was not.

In order to get to a position where you could generate this graphic, and show this data, you probably had to spend more time than you would have liked developing exploratory data analyses and models.

Pointing out the precise mechanism that generates the missingness might seem obvious in hindsight, but it is difficult when there is a lot of missing data, or when there are many variables.

What dealing with missing data looks like in a paper


What dealing with missing data actually looks like


What I want dealing with missing data to be like


Learn more

visdat.njtierney.com

naniar.njtierney.com


Overview

What even are missing values
How to start looking at missing data
How to start exploring missing data
How to impute (fill in) missing values

What are missing values?

Missing values are values that should have been recorded but were not.

NA = Not Available.


Before we get started, we need to define missing values

Missing values are values that should have been recorded, but were not.

Think of it this way:

You might accidentally not record seeing a bird - this is a missing value. This is different to recording that there were no birds observed.

R stores missing values as NA, which stands for Not Available.

How do I check if I have missing values?

x <- c(1, NA, 3, NA, NA, 5)

any_na(x)

[1] TRUE

are_na(x)

[1] FALSE  TRUE FALSE  TRUE  TRUE FALSE

n_miss(x)

[1] 3

prop_miss(x)

[1] 0.5


Missing values don't jump out and scream "I'm here!". They're usually hidden, like a needle in a haystack.

To detect missing values use any_na, which returns TRUE if there are any missings, and FALSE if there are none.

are_na asks "are these NA?" and returns TRUE/FALSE for each value

are_na shows us 3 TRUE values - 3 missing values.

To avoid counting each TRUE yourself, n_miss counts the number of missings

And prop_miss gives the proportion of missings, which gives important context: 50% of data is missing!
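The naniar helpers described here have close base R counterparts, which can help to see what each one computes. A minimal sketch, assuming only base R (the variable names are illustrative, not part of naniar):

```r
x <- c(1, NA, 3, NA, NA, 5)

# Is there at least one missing value? (compare with any_na)
has_missing <- anyNA(x)

# Which values are missing? (compare with are_na)
is_missing <- is.na(x)

# How many values are missing? TRUE counts as 1 when summed (compare with n_miss)
count_missing <- sum(is_missing)

# What proportion is missing? The mean of a logical vector is the share of TRUEs
# (compare with prop_miss)
prop_missing <- mean(is_missing)

count_missing
prop_missing
```

The naniar versions are mainly a convenience: their names say what they do, and they also accept whole data frames.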

Working with missing data

NA + [anything] = NA

heights

Sophie    Dan   Fred 
   165    177     NA

sum(heights)

[1] NA


Working with missing data

na.rm = TRUE removes missings

sum(heights, na.rm = TRUE)

[1] 342

Use this power responsibly!


So what happens when we mix missing values with our calculations? We need to know what happens, so we can be primed to find these cases. The general rule is this:

Calculations with NA return NA.

Say you have the height of three friends: Sophie, Dan, and Fred.

The sum of their heights returns NA.

This is because we cannot know the sum of a number and an unknown value (NA).
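The rule that calculations with NA return NA, and what na.rm = TRUE changes, can be sketched in base R with the heights example. The extra variable names are illustrative:

```r
heights <- c(Sophie = 165, Dan = 177, Fred = NA)

# Any arithmetic that touches the NA is itself unknown
total_all <- sum(heights)

# na.rm = TRUE drops the missing value before summing
total_known <- sum(heights, na.rm = TRUE)

# The same switch for mean() quietly changes the denominator:
# this is the mean of two friends, not three
mean_known <- mean(heights, na.rm = TRUE)
n_used <- sum(!is.na(heights))

total_all
total_known
mean_known
n_used
```

This is why na.rm = TRUE should be used responsibly: the result looks complete, but it is based on fewer observations than the data suggest.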

Your turn:

Open rstudio.cloud
go to exercise-5a-intro-missing.Rmd

If you want to use R / RStudio on your laptop:

Install R + RStudio (see )
open R
type the following:

# install.packages("usethis")
library(usethis)
use_course("dmac.netlify.com/lectures/lecture5a/exercise/exercise-5a.zip")

Introduction to missingness summaries

Basic summaries of missingness:

n_miss
n_complete

Dataframe summaries of missingness:

miss_var_summary
miss_case_summary

These functions work with group_by


Now that you understand what missing values are, how to count them, and how they operate, let's scale these up to more detailed summaries of missingness.

We need to summarise missing data to identify variables, cases, or patterns of missingness, as these can bias our data analysis.

There are two main summaries: basic, and dataframe summaries.

Basic summaries return a single number, like the number of missing or complete values using n_miss or n_complete.

However, you will need more detailed missingness summaries to help you on your journey through a data analysis.

This lesson introduces you to missing data summaries.

naniar provides a family of functions all starting with miss_, each of which provides a different summary of missingness and returns a dataframe.

This allows us to see features that can be difficult to articulate, or time consuming to calculate.

For example, miss_var_summary and miss_case_summary return the number and percentage of missings in each variable or case.

These summaries work with dplyr's group_by, so you can fluidly explore missingness by group.

Missing data summaries: Variables

miss_var_summary(dat_sf_clean)

## # A tibble: 6 x 3
##   variable       n_miss pct_miss
##   <chr>           <int>    <dbl>
## 1 temp_min           70    17.3 
## 2 temp_max           70    17.3 
## 3 temp_avg           70    17.3 
## 4 wind_speed_max     23     5.68
## 5 date                0     0   
## 6 month               0     0


Use miss_var_summary to summarise the number of missings in each variable.

This returns a dataframe where each row is a variable. It also includes summaries of the number and percentage of missings for each variable in the dataset, and is sorted by the number of missings.

For example, temp_min has 70 missing values, and is about 17.3 percent missing.
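To see what a per-variable summary computes, here is a rough base R sketch. It is an assumption-laden stand-in, not naniar's implementation: it uses the built-in airquality dataset (since dat_sf_clean ships with the lecture exercises only), and the object names are illustrative.

```r
# TRUE wherever a cell is missing; one column per variable
missing_flags <- is.na(airquality)

var_summary <- data.frame(
  variable = colnames(missing_flags),
  n_miss   = colSums(missing_flags),          # count of missings per variable
  pct_miss = 100 * colMeans(missing_flags),   # percentage of missings per variable
  row.names = NULL
)

# Sort so the variable with the most missings comes first
var_summary <- var_summary[order(var_summary$n_miss, decreasing = TRUE), ]
var_summary
```

In airquality, Ozone comes out on top with 37 missings, about 24.2 percent of its values.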

Missing data summaries: Cases

miss_case_summary(dat_sf_clean)

## # A tibble: 405 x 3
##     case n_miss pct_miss
##    <int>  <int>    <dbl>
##  1    89      4     66.7
##  2   182      4     66.7
##  3   188      4     66.7
##  4   271      4     66.7
##  5     6      3     50  
##  6     7      3     50  
##  7    10      3     50  
##  8    29      3     50  
##  9    37      3     50  
## 10    39      3     50  
## # … with 395 more rows


Similar to miss_var_summary, miss_case_summary returns a summary dataframe, where each case represents a dataset row number.

Here, case 89 - the 89th row in the dataset - has 4 missing values, which means 66.7% of that case is missing.
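The per-case summary is the same idea applied across rows instead of columns. A minimal base R sketch, again using the built-in airquality dataset as a stand-in and illustrative object names:

```r
# Number of missing cells in each row
n_miss_case <- rowSums(is.na(airquality))

case_summary <- data.frame(
  case     = seq_len(nrow(airquality)),             # row number in the dataset
  n_miss   = n_miss_case,
  pct_miss = 100 * n_miss_case / ncol(airquality)   # share of that row's cells
)

# Sort so the rows with the most missings come first
case_summary <- case_summary[order(case_summary$n_miss, decreasing = TRUE), ]
head(case_summary)
```

In airquality, row 5 is missing both Ozone and Solar.R, so 2 of its 6 values (about 33 percent) are missing.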

Missing data tabulations: variables

miss_var_table(dat_sf_clean)

## # A tibble: 3 x 3
##   n_miss_in_var n_vars pct_vars
##           <int>  <int>    <dbl>
## 1             0      2     33.3
## 2            23      1     16.7
## 3            70      3     50


Missing data tabulations: cases

miss_case_table(dat_sf_clean)

## # A tibble: 4 x 3
##   n_miss_in_case n_cases pct_cases
##            <int>   <int>     <dbl>
## 1              0     316    78.0  
## 2              1      19     4.69 
## 3              3      66    16.3  
## 4              4       4     0.988


Tabulation of missingness counts the number of times there are 0, 1, 2, 3, and so on, missings. They are very useful, compact summaries that reveal interesting structure.

miss_var_table returns a dataframe with the number of missings in a variable, and the number and percentage of variables affected.

For example, there are two variables with no missings, which corresponds to 33.3 percent of variables; there is 1 variable with 23 missings, and there are 3 variables with 70 missings.

Similarly, miss_case_table returns the same information, but for cases.
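A tabulation is just a count of counts. The sketch below shows the idea with base R's table() on the built-in airquality dataset. This is an illustration of the concept rather than how naniar builds its tables, and the object names are assumptions.

```r
# How many variables have 0, 7, 37, ... missings?
var_table <- table(n_miss_in_var = colSums(is.na(airquality)))

# How many cases (rows) have 0, 1, 2, ... missings?
case_table <- table(n_miss_in_case = rowSums(is.na(airquality)))

var_table
case_table
```

For airquality this shows four variables with no missings, one with 7 and one with 37, while most rows are complete and only two rows have two missings.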

Using summaries with `group_by`

dat_sf_clean %>%
  group_by(month) %>%
  miss_var_summary()

## # A tibble: 60 x 4
##    month variable       n_miss pct_miss
##    <dbl> <chr>           <int>    <dbl>
##  1     1 temp_min            7    11.5 
##  2     1 temp_max            7    11.5 
##  3     1 temp_avg            7    11.5 
##  4     1 wind_speed_max      4     6.56
##  5     1 date                0     0   
##  6     2 temp_min            5    12.8 
##  7     2 temp_max            5    12.8 
##  8     2 temp_avg            5    12.8 
##  9     2 wind_speed_max      4    10.3 
## 10     2 date                0     0   
## # … with 50 more rows


Sometimes you are interested in missingness for groups in the data.

Each missingness summary function can be calculated by group, using group_by from dplyr.

For example, we can look at the missingness by month in the San Francisco weather data.

Here we see that in month 1 temp_min has 7 missings, while in month 2 it has 5 missings.
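Grouped summaries boil down to counting missings within each level of the grouping variable. A small base R sketch of that idea, using the built-in airquality dataset (Month and Ozone are columns of that dataset; the result name is illustrative):

```r
# Count missing Ozone values separately for each month
ozone_missing_by_month <- tapply(
  is.na(airquality$Ozone),   # TRUE where Ozone is missing
  airquality$Month,          # grouping variable
  sum                        # TRUE counts as 1
)

ozone_missing_by_month
```

In airquality, month 5 has 5 missing Ozone values and month 6 has 21, so the missingness is far from evenly spread across groups.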

Your Turn

Open exercise-5a-summarise-missings.Rmd

Introduction to missing data visualisations in naniar

Visualisation can quickly capture an idea or thought.
naniar provides a friendly family of missing data visualisation functions.
Each visualisation corresponds to a data summary.
Visualisations help you operate closer to the speed of thought.

We cover how to get a bird's eye view of the data, how to look at missings in the variables and cases, and how to generate visualizations for missing spans and across groups in the data.

We now know what missing values are, how they work, how to count and summarise them - now let's look at some of the built-in visualisations that come with naniar.

Data summaries are very useful, but sometimes an idea or a thought can be quickly captured with a visualisation.

naniar provides a friendly family of missing data visualisation functions, each presenting a different visualisation of the missingness summaries.

In fact, each of these visualisations is a nice compact shorthand for the data summaries. While you could create similar and more complex visualisations using the summary information from the previous lesson, this can be repetitive. The visualisations in naniar reduce repetition and increase iteration, so you can operate closer to the speed of thought.

Get a bird's eye view of the missing data

vis_miss(dat_sf_clean)


When you first get a dataset, it can be difficult to get a visceral sense of where the missings are.

To get an overview of the amount of missingness, use the vis_miss function from the visdat package.

vis_miss produces a "heatmap" of the missingness, as if the dataset were a giant spreadsheet, with cells coloured black for missing and grey for present.

vis_miss also provides missingness summary statistics, showing the overall percentage of missingness in the legend, and the amount of missings in each variable.

These can be turned off in its options, described in the helpfile.

Get a bird's eye view of the missing data

vis_miss(dat_sf_clean, cluster = TRUE)


vis_miss also allows for clustering of the missing data by setting cluster = TRUE: this orders the rows by missingness to identify common co-occurrences.

Look at missings in variables

gg_miss_var(dat_sf_clean)


Look at missings in cases

gg_miss_case(dat_sf_clean)


To quickly show the missingness in variables and cases, we visualise them using gg_miss_var and gg_miss_case. Note that these are visual analogues of the miss_var_summary and miss_case_summary functions.

These plots show the amount of missingness on the x axis, and for gg_miss_var, each point represents the amount of missingness in that variable, and for gg_miss_case, each line represents the amount of missingness in that case.

Note that these visualisations are ordered so that the most missing is at the top. The ordering in gg_miss_case can be turned off with the option order_cases = FALSE.

Look at missings in variables

gg_miss_var(dat_sf_clean, facet = month)


gg_miss_var and gg_miss_case also allow for facetting by one variable.

This means you can explore missingness in cases and variables across the levels of another group.

This plot is facetted by month, showing the number of missings in each variable for each month.

For example, in the same plot for the airquality dataset, Ozone in Month 6 has the most missings.

Visualizing missingness patterns

gg_miss_upset(dat_sf_clean)


To visualise the common combinations of missingness - which variables and cases go missing together, use gg_miss_upset.

This powerful visualisation shows the number of combinations of missing values that co-occur.

An upset plot of the airquality dataset shows there are only missing values in Ozone and Solar.R, with 35 in only Ozone, 5 in Solar.R, and in both Ozone and Solar.R, there are 2 missing cases.

Visualizing factors of missingness

gg_miss_fct(x = dat_sf_clean, fct = month)


To explore how missingness in each variable changes across a factor, use gg_miss_fct.

This displays a heatmap visualisation showing the factors on the x axis, each other variable on the y axis, and the amount of missingness coloured from dark purple to yellow.

gg_miss_fct does not support facetting.

Your turn

complete exercise-5a-visualise-missings.Rmd

miss_*
miss_var_*
miss_case_*
gg_miss_*
gg_miss_var
gg_miss_case

Principles of Tidy Missing Data


The Shadowlands

Representing Missing values


Tidy Data

Variables in columns

Observations in Rows

One value per cell


Data Shadow

Variable names end in _NA

Values are missing (NA) or not (!NA)


Tidy Missing Data

bind_shadow(data)


`bind_shadow()`

bind_shadow(dat_sf_clean) %>% glimpse()

## Observations: 405
## Variables: 12
## $ date              <date> 2017-01-02, 2017-01-03, 2017-01-04, 2017-01-05, 2017-01-06, …
## $ month             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ temp_min          <dbl> 5.9, 8.4, 10.4, 5.8, 4.4, NA, NA, 11.9, 11.9, NA, 6.7, 5.5, 6…
## $ temp_max          <dbl> 10.5, 11.1, 14.5, 9.4, 9.1, NA, NA, 16.0, 14.6, NA, 11.2, 11.…
## $ temp_avg          <dbl> 8.4, 9.7, 12.4, 8.1, 6.4, NA, NA, 13.5, 13.2, NA, 8.5, 7.7, 8…
## $ wind_speed_max    <dbl> 5.1, 5.1, 6.7, 5.1, 5.1, 8.8, 8.2, 7.2, 7.7, 8.2, 5.1, 4.1, 4…
## $ date_NA           <fct> !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !…
## $ month_NA          <fct> !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !…
## $ temp_min_NA       <fct> !NA, !NA, !NA, !NA, !NA, NA, NA, !NA, !NA, NA, !NA, !NA, !NA,…
## $ temp_max_NA       <fct> !NA, !NA, !NA, !NA, !NA, NA, NA, !NA, !NA, NA, !NA, !NA, !NA,…
## $ temp_avg_NA       <fct> !NA, !NA, !NA, !NA, !NA, NA, NA, !NA, !NA, NA, !NA, !NA, !NA,…
## $ wind_speed_max_NA <fct> !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !NA, !…


Shadows In Practice: Explore one variable

dat_sf_clean %>% 
  ggplot(aes(x = wind_speed_max)) + 
  geom_density()


Shadows In Practice: Explore one variable

dat_sf_clean %>%
  bind_shadow() %>%
  ggplot(aes(x = wind_speed_max,
             colour = temp_avg_NA)) +
  geom_density()

The shadow provides a consistent way to record where values were originally missing.
This has great benefits when combined with imputation.
It also allows for additional visualisations.

In Practice: Explore two variables

ggplot(dat_sf_clean, 
       aes(x = temp_avg, 
           y = wind_speed_max)) + 
  geom_point()


Impute shadow values into our realm

## # A tibble: 7 x 2
##   temp_avg temp_avg_NA
##      <dbl> <fct>      
## 1      8.4 !NA        
## 2      9.7 !NA        
## 3     12.4 !NA        
## 4      8.1 !NA        
## 5      6.4 !NA        
## 6     NA   NA         
## 7     NA   NA

## # A tibble: 7 x 2
##   temp_avg temp_avg_NA
##      <dbl> <fct>      
## 1     8.4  !NA        
## 2     9.7  !NA        
## 3    12.4  !NA        
## 4     8.1  !NA        
## 5     6.4  !NA        
## 6     5.66 NA         
## 7     5.69 NA

One approach (ggobi) is to shift missing values below the minimum value.

This means that they can be plotted on the same axis as the observed values.

Typically, when exploring this data, you would start with a plain scatterplot of the two variables.

The problem with this is that ggplot2 does not display missing values by default: it removes them, which makes it hard to explore the missing values.

`impute_below()`

Impute missing values from the shadows into our realm

dat_sf_clean %>%
  slice(5:10) %>%
  mutate(temp_avg_shift = impute_below(temp_avg)) %>%
  select(temp_avg, temp_avg_shift)

## # A tibble: 6 x 2
##   temp_avg temp_avg_shift
##      <dbl>          <dbl>
## 1      6.4           6.4 
## 2     NA             5.73
## 3     NA             5.74
## 4     13.5          13.5 
## 5     13.2          13.2 
## 6     NA             5.53
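The idea of shifting missing values below the observed minimum can be sketched as follows. This is a hypothetical base R helper written for illustration, not naniar's impute_below; the 10% offset and the jitter amount are assumptions chosen for the example.

```r
set.seed(2019)  # only so that the random jitter is reproducible

# Hypothetical helper: place missing values just below the observed range
shift_below <- function(x, prop_below = 0.1, jitter = 0.05) {
  observed_range <- range(x, na.rm = TRUE)
  width <- diff(observed_range)
  # Starting point: a fraction of the range below the smallest observed value
  start <- observed_range[1] - prop_below * width
  is_missing <- is.na(x)
  # Small random offsets so that the shifted points do not overlap exactly
  x[is_missing] <- start - runif(sum(is_missing), min = 0, max = jitter * width)
  x
}

temp_avg <- c(6.4, NA, NA, 13.5, 13.2, NA)
temp_avg_shift <- shift_below(temp_avg)

data.frame(temp_avg, temp_avg_shift)
```

Observed values are left untouched, and every shifted value sits below the smallest observed value, so both can share one axis in a plot.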


`geom_miss_point()`

ggplot(dat_sf_clean, 
       aes(x = wind_speed_max, 
           y = temp_avg)) + 
  geom_miss_point()


Instead, we have created a ggplot2 geom, geom_miss_point().

This geom allows for missing values to be displayed, and also works with the rest of ggplot2 - themes, and facets as well.


Exploring imputed values

Imputation is the process of filling in missing values with some other estimate


What about this imputation thing?

dat_sf_clean %>%  
  simputation::impute_lm(temp_avg ~ wind_speed_max) %>%
  ggplot(aes(x = date,
             y = temp_avg)) + 
  geom_point()


They are invisible!

Where are the imputed values?


Tidy Missing Data reveals the imputations!

bind_shadow(dat_sf_clean) %>%
  simputation::impute_lm(temp_avg ~ wind_speed_max + date) %>%
  ggplot(aes(x = date,
             y = temp_avg,
             colour  = temp_avg_NA)) +
  geom_point() +
  scale_colour_brewer(palette = "Dark2")
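Why the shadow reveals the imputations can be seen in a base R sketch of regression imputation. This is roughly analogous to the pipeline above but is not simputation's implementation; it uses the built-in airquality dataset, and the predictors Temp and Wind are chosen only because they have no missing values there.

```r
# Remember where Ozone was missing BEFORE filling anything in
was_missing <- is.na(airquality$Ozone)

# lm() drops incomplete rows by default, so the model is fit on observed Ozone
fit <- lm(Ozone ~ Temp + Wind, data = airquality)

# Fill the gaps with model predictions
aq_imputed <- airquality
aq_imputed$Ozone[was_missing] <- predict(fit, newdata = airquality[was_missing, ])

# The shadow column keeps track of which values are imputed
aq_imputed$Ozone_NA <- factor(ifelse(was_missing, "NA", "!NA"),
                              levels = c("!NA", "NA"))

table(aq_imputed$Ozone_NA)
```

Without the Ozone_NA column, the imputed values are indistinguishable from observed ones; with it, they can be coloured or filtered in any later plot.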


Shadows make things clearer!

bind_shadow(dat_sf_clean) %>%
  simputation::impute_lm(temp_avg ~ wind_speed_max) %>%
  ggplot(aes(x = temp_avg,
             fill = wind_speed_max_NA)) +
  geom_density(alpha = 0.5)


Example data: oceanbuoys

oceanbuoys

## # A tibble: 736 x 8
##     year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
##    <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl>
##  1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40
##  2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30
##  3  1997        0      -110       27.6       27       76.5   -5.10    4.5 
##  4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5 
##  5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10
##  6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60
##  7  1997        0      -110       28.0       27.0     76.5   -2       3.5 
##  8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5 
##  9  1997        0      -110       28.0       27.2     78.6   -4.20    5   
## 10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5 
## # … with 726 more rows


Start looking at missing values

vis_dat(oceanbuoys)


Start looking at missing values

vis_miss(oceanbuoys)


What do we learn?

Two variables, air temperature and humidity, have a large number of missings.
Year, latitude and longitude have no missings
Sea temperature has a couple of missings
Some rows have a lot of missings

Overall, 3% of possible values are missing. That's not much. BUT, both air temperature and humidity have more than 10% missing, which is too much to ignore.

This type of display is called a "heatmap": it displays the data table, with cells coloured according to some other information. In this case it is the type of variable, and the missingness status. What do we learn?

All of the variables are numeric (<dbl>) variables.
Missing values are located in sea temperature, air temperature and humidity, and they occur in blocks, so perhaps correspond to some levels of the other variables.

Missing value patterns

gg_miss_upset(oceanbuoys)


Missings Tend to get ignored by most software

ggplot(oceanbuoys,
  aes(x = sea_temp_c,
      y = humidity)) +
  geom_point() + 
  theme(aspect.ratio = 1)


This is a problem, because results computed on data with missing values might be biased. For example, ggplot2 ignores them, but at least tells you it's ignoring them with a warning.

Add missings to plot with `geom_miss_point()`

ggplot(oceanbuoys,
       aes(x = sea_temp_c,
           y = humidity)) +
  scale_colour_brewer(palette="Dark2") +
  geom_miss_point() + theme(aspect.ratio=1)

Facet By year

ggplot(oceanbuoys,
       aes(x = sea_temp_c, y = humidity)) +
  geom_miss_point() + 
  scale_colour_brewer(palette = "Dark2") +
  facet_wrap(~year) + 
  theme(aspect.ratio=1)


Understanding missing dependencies

ggplot(oceanbuoys,
       aes(x = sea_temp_c,
           y = air_temp_c)) +
  geom_miss_point() + 
  scale_colour_brewer(palette="Dark2") +
  facet_wrap(~year) + 
  theme(aspect.ratio=1)


What do we learn?

There is a different missingness pattern in each of the years
Year needs to be accounted for in finding good substitute values.

Strategies for working with missing values

Small fraction of cases have several missings (around 5%) - explore data, and possibly drop the cases
A variable or two, out of many, have a lot of missings - drop the variables

Strategies for working with missing values

If missings are small in number, but located in many cases and variables, you need to impute these values to do most analyses
Designing the imputation should take into account dependencies that you have seen between missingness and existing variables.
For the ocean buoys data this means imputation needs to be done separately by year

Common ways to impute values

(Usually bad) Simple parametric: use the mean or median of the complete cases for each variable
(Better) More complex: use models to predict missing values
(Best) Multiple imputation: use a statistical distribution, e.g. a normal model, and simulate a value (or set of values, hot deck imputation) for the missings.

Setup for missings

tao_shadow <- bind_shadow(oceanbuoys)
tao_shadow

## # A tibble: 736 x 16
##     year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns year_NA
##    <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl> <fct>  
##  1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40 !NA    
##  2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30 !NA    
##  3  1997        0      -110       27.6       27       76.5   -5.10    4.5  !NA    
##  4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5  !NA    
##  5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10 !NA    
##  6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60 !NA    
##  7  1997        0      -110       28.0       27.0     76.5   -2       3.5  !NA    
##  8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5  !NA    
##  9  1997        0      -110       28.0       27.2     78.6   -4.20    5    !NA    
## 10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5  !NA    
## # … with 726 more rows, and 7 more variables: latitude_NA <fct>, longitude_NA <fct>,
## #   sea_temp_c_NA <fct>, air_temp_c_NA <fct>, humidity_NA <fct>, wind_ew_NA <fct>,
## #   wind_ns_NA <fct>


Imputing the Mean (ignoring year).

tao_imp_mean <- tao_shadow %>%
  mutate(sea_temp_c = impute_mean(sea_temp_c),
         air_temp_c = impute_mean(air_temp_c))
tao_imp_mean

## # A tibble: 736 x 16
##     year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns year_NA
##    <dbl>    <dbl>     <dbl>      <dbl>      <dbl>    <dbl>   <dbl>   <dbl> <fct>  
##  1  1997        0      -110       27.6       27.1     79.6   -6.40    5.40 !NA    
##  2  1997        0      -110       27.5       27.0     75.8   -5.30    5.30 !NA    
##  3  1997        0      -110       27.6       27       76.5   -5.10    4.5  !NA    
##  4  1997        0      -110       27.6       26.9     76.2   -4.90    2.5  !NA    
##  5  1997        0      -110       27.6       26.8     76.4   -3.5     4.10 !NA    
##  6  1997        0      -110       27.8       26.9     76.7   -4.40    1.60 !NA    
##  7  1997        0      -110       28.0       27.0     76.5   -2       3.5  !NA    
##  8  1997        0      -110       28.0       27.1     78.3   -3.70    4.5  !NA    
##  9  1997        0      -110       28.0       27.2     78.6   -4.20    5    !NA    
## 10  1997        0      -110       28.0       27.2     76.9   -3.60    3.5  !NA    
## # … with 726 more rows, and 7 more variables: latitude_NA <fct>, longitude_NA <fct>,
## #   sea_temp_c_NA <fct>, air_temp_c_NA <fct>, humidity_NA <fct>, wind_ew_NA <fct>,
## #   wind_ns_NA <fct>


Imputing the Mean (ignoring year).

ggplot(tao_imp_mean,
       aes(x = sea_temp_c,
           y = air_temp_c, 
           colour = air_temp_c_NA)) +
  geom_point(alpha = 0.7) + 
  facet_wrap(~year) + 
  scale_colour_brewer(palette = "Dark2") +
  theme(aspect.ratio = 1)


What do we learn?

Oh, this is so wrong!
The imputed values are nothing like the complete case values

Impute Mean by year

tao_shadow <- tao_shadow %>%
  group_by(year) %>%
  mutate(sea_temp_c = impute_mean(sea_temp_c),
         air_temp_c = impute_mean(air_temp_c))
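The difference between imputing one overall mean and imputing a mean within each group can be sketched in base R. This is an illustration under assumptions, not the naniar code above: it uses the built-in airquality dataset with Month as the grouping variable, and impute_mean_vec is a hypothetical helper.

```r
# Hypothetical helper: replace missing values with the mean of the observed ones
impute_mean_vec <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

aq <- airquality
ozone <- as.numeric(aq$Ozone)

# One overall mean for every missing value (ignores the grouping)
aq$Ozone_overall <- impute_mean_vec(ozone)

# ave() applies the helper within each month, so each month gets its own mean
aq$Ozone_by_month <- ave(ozone, aq$Month, FUN = impute_mean_vec)

# Compare the imputed value for a row where Ozone was missing in month 6
june_missing <- is.na(airquality$Ozone) & airquality$Month == 6
head(aq[june_missing, c("Month", "Ozone", "Ozone_overall", "Ozone_by_month")], 3)
```

The grouped version respects differences between groups, which is the same reason the oceanbuoys imputation is done separately by year.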


by year

ggplot(tao_shadow,
       aes(x = sea_temp_c,
           y = air_temp_c, 
           colour=air_temp_c_NA)) +
  geom_point(alpha=0.7) + 
  facet_wrap(~year) + 
  scale_colour_brewer(palette="Dark2") +
  theme(aspect.ratio=1)


What do we learn?

The imputed values are closer to the complete case values
However, they form a rigid line, mismatching the variation
They also extend outside the range of complete values. This is a problem because the imputed air temperature values for these high sea temperature cases are lower than we would expect, which may impede good model fitting.
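The mismatch in variation can be checked numerically: filling every gap with the same mean adds points with zero deviation, so the spread of the variable shrinks. A small base R sketch on the built-in airquality dataset (an illustrative stand-in for the buoy data):

```r
ozone <- airquality$Ozone

# Mean imputation: every missing value receives the same number
ozone_imputed <- ozone
ozone_imputed[is.na(ozone)] <- mean(ozone, na.rm = TRUE)

# Spread of the observed values versus spread after imputation
sd_observed <- sd(ozone, na.rm = TRUE)
sd_imputed  <- sd(ozone_imputed)

c(observed = sd_observed, imputed = sd_imputed)
```

The mean is unchanged but the standard deviation is smaller, which is one reason mean imputation is labelled "usually bad" for modelling.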

Two minute challenge

Change the code to plot sea temperature against humidity, with colour representing missing humidity values. What do you learn about the imputations?

Relationship with other variables

ggplot(data = tao_shadow,
       aes(x = wind_ew,
           y = wind_ns,
           colour = air_temp_c_NA)) +
  scale_colour_brewer(palette = "Dark2") +
  geom_point(alpha = 0.7) +
  theme(aspect.ratio = 1)

What do we learn?

The lowest values of east-west winds have no missing values. Maybe it is less likely to have air temperature missing values when there are light east-west winds?

Two minute challenge

Generate the shadow matrix and make a plot of the winds, coloured by missingness on humidity.

Your Turn:

lab quiz open (requires answering questions from Lab exercise)
go to rstudio.cloud and finish final exercise

If you want to use R / RStudio on your laptop:

Install R + RStudio (see )
open R
type the following:

# install.packages("usethis")
library(usethis)
use_course("dmac.netlify.com/lectures/lecture4b/exercise/exercise-5a.zip")

Resources


This work is licensed under a Creative Commons Attribution 4.0 International License.
