---
class: bg-main1

# Model of height and width

.huge[
$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$
]

--

.huge[
- **Slope:** For each additional inch of width, a painting's height is expected to be higher, on average, by 0.78 inches.
]

--

.huge[
- **Intercept:** Paintings that are 0 inches wide are expected to be 3.62 inches high, on average.
]

---
class: bg-main1

# The linear model with a single predictor

.vlarge[
- We are interested in $\beta_0$ (the population parameter for the intercept) and $\beta_1$ (the population parameter for the slope) in the following model:

$$ \hat{y} = \beta_0 + \beta_1~x $$
]

---
class: bg-main1

# Least squares regression

.huge[
The regression line minimizes the sum of squared residuals.
]

--

.huge[
If $e_i = y_i - \hat{y}_i$, then the regression line minimizes $\sum_{i = 1}^n e_i^2$.
]

---

# Visualizing residuals

```{r vis-resid, echo=FALSE}
d <- tibble(
  Width_in  = m_ht_wt$model$Width_in,
  Height_in = m_ht_wt$model$Height_in,
  pred      = m_ht_wt$fitted.values,
  res       = m_ht_wt$residuals
)

p <- ggplot(data = d, mapping = aes(x = Width_in, y = Height_in)) +
  geom_point(alpha = 0.2) +
  theme_bw() +
  labs(title = "Height vs. width of paintings", subtitle = "Just the data") +
  xlim(0, 250) +
  ylim(0, 200)
p
```

---

## Visualizing residuals (cont.)

```{r vis-resid-line, echo=FALSE}
p <- p +
  geom_smooth(method = "lm", color = color_palette$darkblue, se = FALSE) +
  geom_point(mapping = aes(y = pred), color = color_palette$darkblue) +
  labs(subtitle = "Data + least squares regression line")
p
```

---

## Visualizing residuals (cont.)
```{r vis-resid-segment, echo=FALSE}
p +
  geom_segment(mapping = aes(xend = Width_in, yend = pred),
               color = color_palette$lightblue, alpha = 0.4) +
  labs(subtitle = "Data + least squares regression line + residuals")
```

---

# Properties of the least squares regression line

.huge[
- The regression line goes through the center of mass point, the coordinates corresponding to average $x$ and average $y$, $(\bar{x}, \bar{y})$:

$$\bar{y} = b_0 + b_1 \bar{x} ~ \rightarrow ~ b_0 = \bar{y} - b_1 \bar{x}$$

- The slope has the same sign as the correlation coefficient:

$$b_1 = r \frac{s_y}{s_x}$$
]

---
class: bg-main1

# Properties of the least squares regression line

.huge[
- The sum of the residuals is zero:

$$\sum_{i = 1}^n e_i = 0$$

- The residuals and $x$ values are uncorrelated.
]

---
class: bg-main1

# Height & landscape features

```{r fit-lands}
m_ht_lands <- lm(Height_in ~ factor(landsALL), data = pp)
m_ht_lands
```

--

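.vlarge[
Under the hood, `factor(landsALL)` becomes a single 0/1 indicator column in the design matrix. As a minimal sketch (the chunk name and the `model.matrix()` call are illustrative, assuming `pp` is loaded as above):
]

```{r lands-dummy-sketch}
# Peek at the design matrix R builds: an intercept column of 1s
# plus one indicator column, factor(landsALL)1, that is 1 when
# landsALL == 1 and 0 otherwise.
head(model.matrix(Height_in ~ factor(landsALL), data = pp))
```

--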
.huge[
$$\widehat{Height_{in}} = 22.68 - 5.65~landsALL$$
]

---
class: bg-main1

# Height & landscape features (cont.)

.huge[
- **Slope:** Paintings with landscape features are expected, on average, to be 5.65 inches shorter than paintings without landscape features.
    - Compares the baseline level (`landsALL = 0`) to the other level (`landsALL = 1`).
- **Intercept:** Paintings that don't have landscape features are expected, on average, to be 22.68 inches tall.
]

---
class: bg-main1

# Categorical predictor with 2 levels

```{r slice-paint, echo=FALSE}
pp %>%
  select(name, price, landsALL) %>%
  slice(1:8)
```

---
class: bg-main1

# Relationship between height and school

```{r fit-school}
(m_ht_sch <- lm(Height_in ~ school_pntg, data = pp))
```

--

.vlarge[
- When a categorical explanatory variable has many levels, the levels are encoded as **dummy variables**.
- Each coefficient describes the expected difference between heights in that particular school compared to the baseline level.
]

---
class: bg-main1

# Categorical predictor with >2 levels

```{r show-cats, echo=FALSE}
pp %>%
  select(school_pntg) %>%
  group_by(school_pntg) %>%
  sample_n(1) %>%
  mutate(
    D_FL = as.integer(school_pntg == "D/FL"),
    F    = as.integer(school_pntg == "F"),
    G    = as.integer(school_pntg == "G"),
    I    = as.integer(school_pntg == "I"),
    S    = as.integer(school_pntg == "S"),
    X    = as.integer(school_pntg == "X")
  )
```

---
class: bg-main1

# The linear model with multiple predictors

.huge[
- Population model:

$$ \hat{y} = \beta_0 + \beta_1~x_1 + \beta_2~x_2 + \cdots + \beta_k~x_k $$
]

--

.huge[
- Sample model that we use to estimate the population model:

$$ \hat{y} = b_0 + b_1~x_1 + b_2~x_2 + \cdots + b_k~x_k $$
]

---
class: bg-main1

# Correlation does not imply causation!
.vhuge[
- Remember this when interpreting model coefficients.
]

---
class: bg-main1

# Prediction with models

---
class: bg-main1

# Predict height from width

.vlarge[
On average, how tall are paintings that are 60 inches wide?

$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$
]

--

```{r add-slope}
3.62 + 0.78 * 60
```

.vlarge[
"On average, we expect paintings that are 60 inches wide to be 50.42 inches high."

**Warning:** We "expect" this to happen, but there will be some variability. (We'll learn about measuring the variability around the prediction later.)
]

---
class: bg-main1

# Prediction vs. extrapolation

.huge[
On average, how tall are paintings that are 400 inches wide?

$$\widehat{Height_{in}} = 3.62 + 0.78~Width_{in}$$
]

```{r extrapolate, warning=FALSE, echo=FALSE, fig.height=2.5}
newdata <- tibble(Width_in = 400)
newdata <- newdata %>%
  mutate(Height_in = predict(m_ht_wt, newdata = newdata))

ggplot(data = pp, aes(x = Width_in, y = Height_in)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", fullrange = TRUE,
              color = color_palette$darkblue, se = FALSE) +
  xlim(0, 420) +
  ylim(0, 320) +
  geom_segment(data = newdata,
               mapping = aes(x = Width_in, y = 0,
                             xend = Width_in, yend = Height_in),
               color = color_palette$salmon, lty = 2) +
  geom_segment(data = newdata,
               mapping = aes(x = Width_in, y = Height_in,
                             xend = 0, yend = Height_in),
               color = color_palette$salmon, lty = 2)
```

---
class: bg-black

.white[
## Watch out for extrapolation!
]

.vlarge.white[
> "When those blizzards hit the East Coast this winter, it proved to my satisfaction that global warming was a fraud. That snow was freezing cold. But in an alarming trend, temperatures this spring have risen. Consider this: On February 6th it was 10 degrees. Today it hit almost 80. At this rate, by August it will be 220 degrees. So clearly folks the climate debate rages on."

Stephen Colbert, April 6th, 2010
]

.footnote.white[
[1] OpenIntro Statistics, "Extrapolation is treacherous."
]

---
class: center, middle

# Measuring model fit

---
class: bg-main1

# Measuring the strength of the fit

.huge[
- $R^2$ is a common measure of the strength of a linear model's fit.
- $R^2$ tells us the percentage of variability in the response that is explained by the model.
- The remaining variability is explained by variables not included in the model.
- $R^2$ is sometimes called the coefficient of determination.
]

---
class: bg-main1

# Obtaining $R^2$ in R

.vlarge[
- Height vs. width
]

```{r glance-r-squared}
glance(m_ht_wt)
glance(m_ht_wt)$r.squared # extract R-squared
```

.vlarge[
Roughly 68% of the variability in the heights of paintings can be explained by their widths.
]

---
class: bg-main1

# Obtaining $R^2$ in R

.huge[
- Height vs. landscape features
]

```{r glance-lands}
glance(m_ht_lands)$r.squared
```

---
class: bg-main1

# Your Turn: go to rstudio.cloud

---
class: bg-main1

# References

.huge[
- [Data Science in a Box](https://datasciencebox.org/)
]
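
---
class: bg-main1

# Appendix: $R^2$ by hand

.vlarge[
As a sketch of what `glance()` reports, $R^2$ can also be computed directly from the residuals via $R^2 = 1 - SS_{resid} / SS_{total}$. (This chunk is illustrative and assumes `m_ht_wt` has been fit as above.)
]

```{r r-squared-by-hand}
ss_res <- sum(residuals(m_ht_wt)^2)   # sum of squared residuals
y      <- m_ht_wt$model$Height_in     # observed heights used in the fit
ss_tot <- sum((y - mean(y))^2)        # total variability around the mean
1 - ss_res / ss_tot                   # equals glance(m_ht_wt)$r.squared
```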