class: center, middle, inverse, title-slide # ETC1010: Data Modelling and Computing ## Lecture 7A: Linear Models ### Dr. Nicholas Tierney & Professor Di Cook ### EBS, Monash U. ### 2019-09-11 --- class: bg-main1 # Recap .huge[ - style - functions - map ] --- class: bg-main1 # Today: The language of models --- class: bg-main1 # Modelling .huge[ - Use models to explain the relationship between variables and to make predictions - For now we focus on **linear** models (but remember there are other types of models too!) ] --- class: center, middle # Packages .left-code[ ![](img/tidyverse.png) ![](img/broom.png) ] .right-plot.vlarge[ - You're familiar with the tidyverse: - The broom package takes the messy output of built-in functions in R, such as `lm`, and turns them into tidy data frames. ] --- class: bg-black .white.vvhuge.center.middle[ Data: Paris Paintings ] --- class: bg-main1 # Paris Paintings ```r pp <- read_csv("data/paris-paintings.csv", na = c("n/a", "", "NA")) pp ``` ``` ## # A tibble: 3,393 x 61 ## name sale lot position dealer year origin_author origin_cat school_pntg ## <chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> ## 1 L176… L1764 2 0.0328 L 1764 F O F ## 2 L176… L1764 3 0.0492 L 1764 I O I ## 3 L176… L1764 4 0.0656 L 1764 X O D/FL ## 4 L176… L1764 5 0.0820 L 1764 F O F ## 5 L176… L1764 5 0.0820 L 1764 F O F ## 6 L176… L1764 6 0.0984 L 1764 X O I ## 7 L176… L1764 7 0.115 L 1764 F O F ## 8 L176… L1764 7 0.115 L 1764 F O F ## 9 L176… L1764 8 0.131 L 1764 X O I ## 10 L176… L1764 9 0.148 L 1764 D/FL O D/FL ## # … with 3,383 more rows, and 52 more variables: diff_origin <dbl>, logprice <dbl>, ## # price <dbl>, count <dbl>, subject <chr>, authorstandard <chr>, artistliving <dbl>, ## # authorstyle <chr>, author <chr>, winningbidder <chr>, winningbiddertype <chr>, ## # endbuyer <chr>, Interm <dbl>, type_intermed <chr>, Height_in <dbl>, Width_in <dbl>, ## # Surface_Rect <dbl>, Diam_in <dbl>, Surface_Rnd <dbl>, Shape <chr>, Surface <dbl>, ## # material <chr>, mat <chr>, materialCat <chr>, quantity <dbl>, nfigures <dbl>, ## # engraved <dbl>, original <dbl>, prevcoll <dbl>, othartist <dbl>, paired <dbl>, ## # figures <dbl>, finished <dbl>, lrgfont <dbl>, relig <dbl>, landsALL <dbl>, ## # lands_sc <dbl>, lands_elem <dbl>, lands_figs <dbl>, lands_ment <dbl>, arch <dbl>, ## # mytho <dbl>, peasant <dbl>, othgenre <dbl>, singlefig <dbl>, portrait <dbl>, ## # still_life <dbl>, discauth <dbl>, history <dbl>, allegory <dbl>, pastorale <dbl>, ## # other <dbl> ``` --- # Meet the data curators .left-code[ ![](img/sandra-van-ginhoven.png) ![](img/hilary-coe-cronheim.png) ] .right-plot.vlarge[ Sandra van Ginhoven Hilary Coe Cronheim PhD students in the Duke Art, Law, and Markets Initiative in 2013 - Source: Printed catalogues of 28 auction sales in Paris, 1764- 1780 - 3,393 paintings, their prices, and descriptive details from sales catalogues over 60 variables ] --- # Auctions today <iframe width="1013" height="570" src="https://www.youtube.com/embed/apaE1Q7r4so" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> --- # Auctions back in the day <img src="img/old-auction.png" width="80%" style="display: block; margin: auto;" /> Pierre-Antoine de Machy, Public Sale at the Hôtel Bullion, Musée Carnavalet, Paris (18th century) --- # Paris auction market <img src="img/auction-trend-paris.png" width="80%" style="display: block; margin: auto;" /> --- class: bg-main1 # Modelling the relationship between variables --- class: bg-main1 
# Prices: Describe the distribution of prices of paintings.

```r
ggplot(data = pp,
       aes(x = price)) +
  geom_histogram(binwidth = 1000)
```

<img src="lecture-7a-slides_files/figure-html/gg-price-1.png" width="90%" style="display: block; margin: auto;" />

---

# Models as functions

.huge[
- We can represent relationships between variables using **functions**
- A function is a mathematical concept: the relationship between an output and one or more inputs.
- Plug in the inputs and receive back the output
]

---
class: bg-main1

# Models as functions: Example

.huge[
- The formula `\(y = 3x + 7\)` is a function with input `\(x\)` and output `\(y\)`. When `\(x\)` is `\(5\)`, the output `\(y\)` is `\(22\)`:

`y = 3 * 5 + 7 = 22`
]

--

```r
anon <- function(x) 3*x + 7
anon(5)
```

```
## [1] 22
```

---
class: bg-main1

# Height as a function of width

.huge[
Describe the relationship between height and width of paintings.
]

<img src="lecture-7a-slides_files/figure-html/gg-price-point-1.png" width="90%" style="display: block; margin: auto;" />

---
class: bg-main1

# Visualizing the linear model

```r
ggplot(data = pp,
       aes(x = Width_in, y = Height_in)) +
  geom_point() +
  geom_smooth(method = "lm") # lm for linear model
```

<img src="lecture-7a-slides_files/figure-html/gg-point-smooth-lm-1.png" width="90%" style="display: block; margin: auto;" />

---
class: bg-main1

# Visualizing the linear model (without the measure of uncertainty around the line)

```r
ggplot(data = pp,
       aes(x = Width_in, y = Height_in)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) # lm for linear model
```

<img src="lecture-7a-slides_files/figure-html/gg-point-lm-no-se-1.png" width="90%" style="display: block; margin: auto;" />

---

# Visualizing the linear model (style the line)

```r
ggplot(data = pp,
       aes(x = Width_in, y = Height_in)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE,
              col = "pink", # color
              lty = 2,      # line type
              lwd = 3)      # line weight
```

<img src="lecture-7a-slides_files/figure-html/gg-smooth-change-line-1.png" width="90%" style="display: block; margin: auto;" />

---
class: bg-main1

# Vocabulary

.huge[
- **Response variable:** Variable whose behavior or variation you are trying to understand, on the y-axis (dependent variable)
]

--

.huge[
- **Explanatory variables:** Other variables that you want to use to explain the variation in the response, on the x-axis (independent variables)
]

---
class: bg-main1

# Vocabulary

.huge[
- **Predicted value:** Output of the **model function**
  - The model function gives the typical value of the response variable *conditional* on the explanatory variables
]

--

.huge[
- **Residuals:** Show how far each case is from its predicted value
  - Residual = Observed value - Predicted value
  - Tells how far above/below the model function each case is
]

---

# Residuals

.huge[
- What does a negative residual mean?
- Which paintings on the plot have negative residuals: those below or above the line?
]

<img src="lecture-7a-slides_files/figure-html/gg-price-height-1.png" width="90%" style="display: block; margin: auto;" />

---

<img src="lecture-7a-slides_files/figure-html/gg-alpha-1.png" width="90%" style="display: block; margin: auto;" />

--

.vlarge[
- What feature is apparent in this plot that was not (as) apparent in the previous plots?
- What might be the reason for this feature?
]

???

The plot displays the relationship between height and width of paintings. It uses a lower alpha level for the points than the previous plots we looked at.
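A minimal sketch of code that could produce a plot like this one, assuming the same `pp` data; the exact alpha value used for the slide is an assumption:

```r
ggplot(data = pp,
       aes(x = Width_in, y = Height_in)) +
  geom_point(alpha = 0.2) # low alpha: heavily overplotted regions show up darker
```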
---

# Landscape vs portrait paintings

.pull-left.vlarge[
- Landscape painting is the depiction in art of landscapes – natural scenery such as mountains, valleys, trees, rivers, and forests, especially where the main subject is a wide view – with its elements arranged into a coherent composition.<sup>1</sup>
- Landscape paintings tend to be wider than they are tall.
]

.pull-right.vlarge[
- Portrait painting is a genre in painting, where the intent is to depict a human subject.<sup>2</sup>
- Portrait paintings tend to be taller than they are wide.

.footnote[
[1] Source: Wikipedia, [Landscape painting](https://en.wikipedia.org/wiki/Landscape_painting)

[2] Source: Wikipedia, [Portrait painting](https://en.wikipedia.org/wiki/Portrait_painting)
]
]

---

# Multiple explanatory variables

.vlarge[
How, if at all, does the relationship between width and height of paintings vary by whether or not they have any landscape elements?
]

```r
ggplot(data = pp,
       aes(x = Width_in, y = Height_in,
           color = factor(landsALL))) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(color = "landscape")
```

<img src="lecture-7a-slides_files/figure-html/gg-landscape-1.png" width="90%" style="display: block; margin: auto;" />

---

# Models - upsides and downsides

.huge[
- Models can sometimes reveal patterns that are not evident in a graph of the data. This is a great advantage of modelling over simple visual inspection of data.
- There is a real risk, however, that a model imposes structure on the scatter of data that is not really there, just as people imagine animal shapes in the stars. A skeptical approach is always warranted.
]

---

# Variation around the model...

.vlarge[
is just as important as the model, if not more!

*Statistics is the explanation of variation in the context of what remains unexplained.*

- The scatterplot suggests there might be other factors that account for large parts of painting-to-painting variability, or perhaps just that randomness plays a big role.
- Adding more explanatory variables to a model can sometimes usefully reduce the size of the scatter around the model. (We'll talk more about this later.)
]

---

# How do we use models?

.huge[
1. Explanation: Characterize the relationship between `\(y\)` and `\(x\)` via *slopes* for numerical explanatory variables or *differences* for categorical explanatory variables. (This is also called __inference__, as you __make inference__ about these relationships.)
2. Prediction: Plug in `\(x\)`, get the predicted `\(y\)`
]

---
class: bg-main1

# Your Turn: go to rstudio.cloud and start exercise 7a

---
class: bg-main1

# Characterizing relationships with models

---
class: bg-main1

## Height & width

```r
m_ht_wt <- lm(Height_in ~ Width_in, data = pp)
m_ht_wt
```

```
## 
## Call:
## lm(formula = Height_in ~ Width_in, data = pp)
## 
## Coefficients:
## (Intercept)     Width_in  
##      3.6214       0.7808
```

--

<br>

---
class: bg-main1

# Model of height and width

.huge[
`$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$`
]

--

.huge[
- **Slope:** For each additional inch the painting is wider, the height is expected to be higher, on average, by 0.78 inches.
]

--

.huge[
- **Intercept:** Paintings that are 0 inches wide are expected to be 3.62 inches high, on average.
]
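---
class: bg-main1

# Aside: tidy model output with broom

.vlarge[
The broom package shown at the start of the lecture can also present these coefficients as a tidy data frame. A minimal sketch (output not shown):
]

```r
library(broom)
tidy(m_ht_wt) # one row per term: estimate, std.error, statistic, p.value
```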
---
class: bg-main1

# The linear model with a single predictor

.vlarge[
- Interested in `\(\beta_0\)` (population parameter for the intercept) and `\(\beta_1\)` (population parameter for the slope) in the following model:

$$ \hat{y} = \beta_0 + \beta_1~x $$
]

---
class: bg-main1

# Least squares regression

.huge[
The regression line minimizes the sum of squared residuals.
]

--

.huge[
If `\(e_i = y_i - \hat{y}_i\)`, then the regression line minimizes `\(\sum_{i = 1}^n e_i^2\)`.
]

---

# Visualizing residuals

<img src="lecture-7a-slides_files/figure-html/vis-resid-1.png" width="90%" style="display: block; margin: auto;" />

---

## Visualizing residuals (cont.)

<img src="lecture-7a-slides_files/figure-html/vis-resid-line-1.png" width="90%" style="display: block; margin: auto;" />

---

## Visualizing residuals (cont.)

<img src="lecture-7a-slides_files/figure-html/vis-redis-segment-1.png" width="90%" style="display: block; margin: auto;" />

---

# Properties of the least squares regression line

.huge[
- The regression line goes through the center of mass point, the coordinates corresponding to average `\(x\)` and average `\(y\)`, `\((\bar{x}, \bar{y})\)`:

`$$\bar{y} = \beta_0 + \beta_1 \bar{x} ~ \rightarrow ~ \beta_0 = \bar{y} - \beta_1 \bar{x}$$`

- The slope has the same sign as the correlation coefficient:

`$$\beta_1 = r \frac{s_y}{s_x}$$`
]

---
class: bg-main1

# Properties of the least squares regression line

.huge[
- The sum of the residuals is zero:

`$$\sum_{i = 1}^n e_i = 0$$`

- The residuals and `\(x\)` values are uncorrelated.
]

---
class: bg-main1

# Height & landscape features

```r
m_ht_lands <- lm(Height_in ~ factor(landsALL), data = pp)
m_ht_lands
```

```
## 
## Call:
## lm(formula = Height_in ~ factor(landsALL), data = pp)
## 
## Coefficients:
##       (Intercept)  factor(landsALL)1  
##            22.680             -5.645
```

--

<br>

.huge[
`$$\widehat{Height_{in}} = 22.68 - 5.65~landsALL$$`
]

---
class: bg-main1

# Height & landscape features (cont.)

.huge[
- **Slope:** Paintings with landscape features are expected, on average, to be 5.65 inches shorter than paintings without landscape features.
  - Compares the baseline level (`landsALL = 0`) to the other level (`landsALL = 1`).
- **Intercept:** Paintings that don't have landscape features are expected, on average, to be 22.68 inches tall.
]

---
class: bg-main1

# Categorical predictor with 2 levels

```
## # A tibble: 8 x 3
##   name     price landsALL
##   <chr>    <dbl>    <dbl>
## 1 L1764-2    360        0
## 2 L1764-3      6        0
## 3 L1764-4     12        1
## 4 L1764-5a     6        1
## 5 L1764-5b     6        1
## 6 L1764-6      9        0
## 7 L1764-7a    12        0
## 8 L1764-7b    12        0
```

---
class: bg-main1

# Relationship between height and school

```r
(m_ht_sch <- lm(Height_in ~ school_pntg, data = pp))
```

```
## 
## Call:
## lm(formula = Height_in ~ school_pntg, data = pp)
## 
## Coefficients:
##     (Intercept)  school_pntgD/FL     school_pntgF     school_pntgG     school_pntgI  
##          14.000            2.329           10.197            1.650           10.287  
##    school_pntgS     school_pntgX  
##          30.429            2.869
```

--

.vlarge[
- When the categorical explanatory variable has many levels, they're encoded as **dummy variables**.
- Each coefficient describes the expected difference between heights in that particular school compared to the baseline level.
]
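---
class: bg-main1

# Aside: inspecting the dummy coding

.vlarge[
A minimal sketch of how you could look at the indicator (dummy) variables R creates for `school_pntg` behind the scenes; the next slide summarises the same coding (output not shown):
]

```r
# one indicator column per non-baseline school, plus an intercept column
head(model.matrix(~ school_pntg, data = pp))
```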
---
class: bg-main1

# Categorical predictor with >2 levels

```
## # A tibble: 7 x 7
## # Groups:   school_pntg [7]
##   school_pntg  D_FL     F     G     I     S     X
##   <chr>       <int> <int> <int> <int> <int> <int>
## 1 A               0     0     0     0     0     0
## 2 D/FL            1     0     0     0     0     0
## 3 F               0     1     0     0     0     0
## 4 G               0     0     1     0     0     0
## 5 I               0     0     0     1     0     0
## 6 S               0     0     0     0     1     0
## 7 X               0     0     0     0     0     1
```

---
class: bg-main1

# The linear model with multiple predictors

.huge[
- Population model:

$$ \hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k $$
]

--

.huge[
- Sample model that we use to estimate the population model:

$$ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k $$
]

---
class: bg-main1

# Correlation does not imply causation!

.vhuge[
- Remember this when interpreting model coefficients
]

---
class: bg-main1

# Prediction with models

---
class: bg-main1

# Predict height from width

.vlarge[
On average, how tall are paintings that are 60 inches wide?

`$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$`
]

--

```r
3.62 + 0.78 * 60
```

```
## [1] 50.42
```

.vlarge[
"On average, we expect paintings that are 60 inches wide to be 50.42 inches high."

**Warning:** We "expect" this to happen, but there will be some variability. (We'll learn about measuring the variability around the prediction later.)
]

---
class: bg-main1

# Prediction vs. extrapolation

.huge[
On average, how tall are paintings that are 400 inches wide?

`$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$`
]

<img src="lecture-7a-slides_files/figure-html/extrapolate-1.png" width="90%" style="display: block; margin: auto;" />

---
class: bg-black

.white[
## Watch out for extrapolation!
]

.vlarge.white[
> "When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6th it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on."<sup>1</sup>
<br>
Stephen Colbert, April 6th, 2010
]

.footnote.white[
[1] OpenIntro Statistics, "Extrapolation is treacherous."
]

---
class: center, middle

# Measuring model fit

---
class: bg-main1

# Measuring the strength of the fit

.huge[
- `\(R^2\)` is a common measure of the strength of a linear model fit.
- `\(R^2\)` tells us the percentage of variability in the response explained by the model.
- The remaining variation is explained by variables not in the model, or by inherent randomness in the data.
- `\(R^2\)` is sometimes called the coefficient of determination.
]

---
class: bg-main1

# Obtaining `\(R^2\)` in R

.vlarge[
- Height vs. width
]

```r
glance(m_ht_wt)
```

```
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC deviance
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <int>   <dbl>  <dbl>  <dbl>    <dbl>
## 1     0.683         0.683  8.30     6749.       0     2 -11083. 22173. 22191.  216055.
## # … with 1 more variable: df.residual <int>
```

```r
glance(m_ht_wt)$r.squared # extract R-squared
```

```
## [1] 0.6829468
```

.vlarge[
Roughly 68% of the variability in heights of paintings can be explained by their widths.
]

---
class: bg-main1

# Obtaining `\(R^2\)` in R

.huge[
- Height vs. landscape features
]

```r
glance(m_ht_lands)$r.squared
```

```
## [1] 0.03456724
```

---
class: bg-main1

# Your Turn: go to rstudio.cloud

---
class: bg-main1

# References

.huge[
- [data science in a box](https://datasciencebox.org/)
]
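---
class: bg-main1

# Appendix: prediction in R

.vlarge[
The 60-inch prediction earlier was computed by hand from the estimated coefficients. A minimal sketch of how the same prediction could be obtained from the fitted model object (output not shown):
]

```r
# predicted height, in inches, for a painting that is 60 inches wide
predict(m_ht_wt, newdata = tibble(Width_in = 60))
```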