---
title: "ETC1010: Data Modelling and Computing"
author: "Professor Di Cook, EBS, Monash U."
output:
learnr::tutorial:
css: "css/logo.css"
runtime: shiny_prerendered
---
```{r setup, include=FALSE}
library(learnr)
knitr::opts_chunk$set(echo = TRUE,
message = FALSE,
warning = FALSE,
collapse = TRUE,
fig.height = 8,
fig.width = 8,
fig.align = "center",
cache = FALSE)
tutorial_html_dependency()
```
```{r echo=FALSE}
library(tidyverse)
library(broom)
library(gridExtra)
```
# Intermediate models
## Course web site
This is a link to the course web site, in case you need to go back and forth between tutorial and web materials: [http://dmac.dicook.org](http://dmac.dicook.org)
## Overview
- Transformations: building on a stable platform
- Adding interactions to the model - what does this mean?
- Model building - how do you decide when you've got the bets possible model?
## Transformations
*Transform your work larvae to butterfly* (Image source: Jan Thornhill http://sci-why.blogspot.com/2013/04/help-save-monarch-butterfly.html)
- **Numerical variables**, whether they are explanatory or response, work better for modeling if they are reasonably spread out. If they are right- or left-skewed the model puts more weight on the extreme values.
- **Categorical variables** may need to be re-levelled to balance classes, or remove low count classes.
```{r}
x <- runif(100)
df <- tibble(x, y=5-2*x+rnorm(100), x2=(2*x)^10)
p1 <- ggplot(df, aes(x=x, y=y)) + geom_point() +
geom_smooth(method="lm", se=FALSE) +
ggtitle("Ideal base for model")
p2 <- ggplot(df, aes(x=x2, y=y)) + geom_point() +
geom_smooth(method="lm", se=FALSE) +
ggtitle("Less stable base for model")
p3 <- ggplot(df, aes(x=x2, y=y)) + geom_point() +
scale_x_log10() + xlab("Log x2") +
geom_smooth(method="lm", se=FALSE) +
ggtitle("Log transform x over-corrects")
p4 <- ggplot(df, aes(x=x2^(1/10), y=y)) + geom_point() +
xlab("x2^(1/10)") +
geom_smooth(method="lm", se=FALSE) +
ggtitle("Power transform x fixes")
grid.arrange(p1, p2, p3, p4, ncol=2)
```
### Power transformation
This is a family of transformations you can make on quantitative data to help "symmetrise" or "normalise" (make is bell-shaped) the data. These are also called Box-Cox transformations. For $\lambda \in (-\infty, \infty)$,
$$x_i^{(\lambda)} = \frac{x_i^\lambda -1}{\lambda} ~~~ \mbox{if} ~\lambda \neq 0$$
but if $\lambda=0, x_i^{(\lambda)} =\ln{x_i}$
```{r}
df <- tibble(x1=rexp(500), x2=runif(500)^5, x3=abs(6-rexp(500)))
p1 <- ggplot(df, aes(x=x1)) + geom_density(fill="black", alpha=0.7)
p2 <- ggplot(df, aes(x=x2)) + geom_density(fill="black", alpha=0.7)
p3 <- ggplot(df, aes(x=x3)) + geom_density(fill="black", alpha=0.7)
p4 <- ggplot(df, aes(x=log(x1+0.1))) + geom_density(fill="black", alpha=0.7) + ggtitle("lambda=0")
p5 <- ggplot(df, aes(x=((x2)^(1/5)-1)/(1/5))) + geom_density(fill="black", alpha=0.7) + ggtitle("lambda=1/5")
p6 <- ggplot(df, aes(x=(x3^5-1)/5)) + geom_density(fill="black", alpha=0.7) + ggtitle("lambda=5")
grid.arrange(p1, p2, p3, p4, p5, p6, ncol=3)
```
It may be necessary to *shift* the values into the domain suitable for the function. For example, to take a log, all values must be bigger than 1.
### Transforming categorical variables
- Categories with small numbers of counts may need ot be combined.
- Ordered categories might be re-coded to reflect "real" difference between classes
#### Example: flying etiquette
```{r}
fly <- read_csv("data/flying-etiquette.csv") %>%
select(`How often do you travel by plane?`, `Do you ever recline your seat when you fly?`, `Is itrude to recline your seat on a plane?`) %>%
rename(travel_freq=`How often do you travel by plane?`,
recline=`Do you ever recline your seat when you fly?`,
rude=`Is itrude to recline your seat on a plane?`)
fly %>% count(travel_freq, sort=TRUE)
```
How often a person flies, is a pretty important context variable for understanding people's responses in the survey. There are two issues here:
- Categories `Every day`, `A few times per week` and even `A few times per month` have small counts. You may not have enough subjects from which to infer to the larger population of flyers in these categories.
- Category `Never`!!! If they have never flown how can they provide useful information on flying etiquette?
```{r}
library(forcats)
fly %>% count(recline, sort=TRUE)
fly %>% count(rude, sort=TRUE)
fly %>% filter(!is.na(recline)) %>%
filter(!is.na(rude)) %>%
count(recline, rude, sort=TRUE)
fly %>%
filter(!is.na(recline)) %>%
filter(!is.na(rude)) %>%
mutate(recline = fct_recode(recline, Always = "Usually")) %>%
count(recline, rude, sort=TRUE)
```
- Single variables have healthy counts
- But pairs of variables, some combinations of levels have low counts
- Collapsing two or more categories into one, might build better support
## Adding interactions to the model
![](images/interaction.png)
### Interaction between quantitative and categorical variables
An interaction term is needed in a model if the linear relationship is different for the response vs quantitative variable for different levels of the categorical variable. That is, a different *slope* needs to be used/estimated for each level.
Let's take a look at how this works for the [2015 OECD PISA data](http://www.oecd.org/pisa/data/2015database/). The question to be answered is whether more time spent studying science is associated with higher science scores, and how this varies with enjoyment of science.
```{r echo=FALSE, fig.width=6, fig.height=6}
load("data/pisa_au_sub.rda")
library(labelled)
pisa_au_science <- pisa_au %>%
filter(science_fun < 5) %>%
filter(!is.na(science_time)) %>%
select(science, science_fun, science_time, stuweight) %>%
mutate(science_fun = to_factor(science_fun, ordered=TRUE, drop_unused_labels = TRUE)) %>%
mutate(science_time = as.numeric(science_time)) %>%
filter(science_time>0)
ggplot(pisa_au_science, aes(x=science_time, y=science,
colour=science_fun)) +
geom_point(alpha=0.1) +
scale_colour_brewer("Enjoy science", palette="Dark2") +
facet_wrap(~science_fun, ncol=2) +
scale_x_log10() +
# geom_smooth(method="lm", se=FALSE) +
theme(legend.position="bottom") +
xlab("Time spent studying science per week (mins") +
ylab("Synthetic science score")
```
```{r pipeline, echo=FALSE}
quiz(
question("How would you describe the relationship between science score and time spent studying?",
answer("Weak", correct = TRUE),
answer("Moderate"),
answer("Strong")),
question("What do these lines of code do? `filter(science_fun < 5) %>%
filter(!is.na(science_time))`",
answer("Remove missing values", correct = TRUE),
answer("Remove extreme values, and missing values"),
answer("I have no idea")),
question("Why was `science_time` transformed to a log scale?",
answer("It has a right-skewed distribution.", correct = TRUE),
answer("It has a left-skewed distribution."),
answer("It is symmetric")),
question("Why were 0 values of `science_time` removed?",
answer("It could be argued that these are most likely missing values", correct = TRUE),
answer("They are outliers affecting the modeling"),
answer("No-one would be able to study science 0 minutes per week"))
)
```
There are two possible models:
$y_i = \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\varepsilon_i$ (Model 1)
$y_i = \beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i1}*x_{i2}+\varepsilon_i$ (Model 2)
where $y=$ science score, $x_1=$ science study time, $x_2=$ science enjoyment.
Model 2 has an interaction term. This means that the slope will be allowed to vary for the different levels of the categorical variables, science_fun.
*Note:* Ordered factors are treated as "numeric" in the default model fit, so we should convert `science_fun` to be an unordered categorical variable. Also, `science_time` is heavily skewed so should be transformed.
```{r}
pisa_au_science <- pisa_au_science %>%
mutate(log_science_time = log10(science_time)) %>%
mutate(science_fun_c = factor(science_fun, ordered=FALSE))
mod1 <- lm(science~log_science_time+science_fun_c, data=pisa_au_science, weights = stuweight)
mod2 <- lm(science~log_science_time*science_fun_c, data=pisa_au_science, weights = stuweight)
tidy(mod1)
tidy(mod2)
```
### Five minute challenge
- Write out the equations for both models. (Ignore the log transformation.)
- Make a **hand** sketch of both models.
### Which is the better model?
```{r}
glance(mod1)
glance(mod2)
```
`r emo::ji("shocked")` they are both pretty bad! The interaction model (mod2) is slightly better but its really not.
### Interaction between quantitative variables
- Interactions for two quantitative variables in a model, can be thought of as allowing the paper sheet (model) to curl.
```{r}
library(viridis)
x1 <- runif(500)
x2 <- runif(500)
df <- tibble(x1, x2, y1=5-2*x1+4*x2+rnorm(500)*0.5,
y2=5-2*x1+4*x2+10*(-x1)*x2+rnorm(500)*0.5)
p1 <- ggplot(df, aes(x=x1, y=x2, colour=y1)) +
geom_point(size=5, alpha=0.5) +
scale_colour_viridis() +
theme_bw() + theme(aspect.ratio=1, legend.position="none") +
ggtitle("Flat sheet: no interaction")
p2 <- ggplot(df, aes(x=x1, y=x2, colour=y2)) +
geom_point(size=5, alpha=0.5) +
scale_colour_viridis() +
theme_bw() + theme(aspect.ratio=1, legend.position="none") +
ggtitle("Curled sheet: needs an interaction term")
df_loess1 <- loess(y1~x1+x2, data=df)
df_loess2 <- loess(y2~x1+x2, data=df)
df_grid <- expand.grid(x1=seq(0, 1, 0.1), x2=seq(0, 1, 0.1))
df_grid <- df_grid %>%
mutate(y1 = as.vector(predict(df_loess1, newdata=df_grid)),
y2 = as.vector(predict(df_loess2, newdata=df_grid)))
p3 <- ggplot(df_grid, aes(x=x1, y=x2, z=y1, colour=y1)) +
geom_contour(aes(colour = stat(level))) +
scale_colour_viridis() +
theme_bw() + theme(aspect.ratio=1, legend.position="none") +
ggtitle("Flat sheet: no interaction") +
geom_point(data=df, aes(x=x1, y=x2, colour=y1), size=0.5, alpha=0.5)
p4 <- ggplot(df_grid, aes(x=x1, y=x2, z=y2)) +
geom_contour(aes(colour = stat(level))) +
scale_colour_viridis() +
theme_bw() + theme(aspect.ratio=1, legend.position="none") +
ggtitle("Curled sheet: needs an interaction term") +
geom_point(data=df, aes(x=x1, y=x2, colour=y2), size=0.5, alpha=0.5)
grid.arrange(p1, p2, p3, p4, ncol=2)
```
### Five minute challenge
Using the PISA data: How does science score relate to text anxiety and gender?
- Make a plot of science by anxtest, coloured by gender. Does it look like an interaction term might be necessary?
- Fit the model with `science` score as the response and `gender` and `anxtest`.
- Try an interaction between gender and anxtest.
- Which is the better model?
The code below has `???` where you need to replace with something sensible.
```{r eval=FALSE}
pisa_au_science <- pisa_au %>%
filter(!is.na(???)) %>%
select(science, gender, anxtest, stuweight)
ggplot(pisa_au_science, aes(x=anxtest, y=science, colour=gender)) + geom_smooth(method="lm")
sci_lm1 <- lm(science ~ ??? + ???, data=pisa_au_science, weights=stuweight)
sci_lm2 <- lm(science ~ ??? * ???, data=pisa_au_science, weights=stuweight)
tidy(???)
glance(???)
```
## Model building
Goal: The simplest model possible that provides similar predictive accuracy to most complex model.
Approach:
- Start simply, fit main effects models (single best variable, adding several more variables independently) and try to understand the effect that each has in the model.
- Explore transformations with the aim to build a stable foundation of explanatory variables for the model.
- Check model diagnostics, residual plots.
- Explore two variable interactions, and understand effect on model.
- Explore three variable interactions.
- Use model goodness of fit to help decide on final. There may be more than one model that are almost equally as good.
*Aside:* Ideally, values of explanatory variables cover all possible combinations in their domain. There should *not be any association between explanatory variables*. If there is, then the there is more uncertainty in the parameter estimates. Its like building a table with only two legs, that table would be a bit wobbly, and unstable. A work around is to first regress one explanatory variable on the other, and add the residuals from this fit to the model, instead of the original variable. That is, suppose $X_1, X_2$ are strongly linearly associated, then model $X_2\sim b_0+b_1X_1+e$, and use $e$ (call it $X^*_2$) in the model instead of $X_2$. You would then only be using the part of $X_2$ that is not related to $X_1$ to expand the model. This approach can be used for multiple explanatory variables that are associated.
### Lab exercise
Build the best model you can for science score, by exploring these variables: math score, reading score, tvs, books, breakfast, feel free to choose others. (Code provided in Rmd is just a sample, and needs to be modified.)
```{r echo=FALSE, eval=FALSE}
pisa_au_science <- pisa_au %>%
filter(science_fun < 5) %>%
filter(!is.na(science_time)) %>%
filter(!is.na(anxtest)) %>%
select(science, math, read, science_fun, science_time, breakfast, gender, anxtest, books, tvs, stuweight) %>%
mutate(science_time = as.numeric(science_time)) %>%
filter(science_time>0) %>%
mutate(log_science_time = log10(science_time)) %>%
mutate(science_fun_c = factor(science_fun, ordered=FALSE))
sci_lm <- lm(science ~ math + read + science_fun_c + science_time + gender*anxtest, data=pisa_au_science, weights=stuweight)
tidy(sci_lm)
glance(sci_lm)
```
## Share and share alike

This work is licensed under a Creative Commons Attribution 4.0 International License.