--- title: "ETC1010: Data Modelling and Computing" author: "Professor Di Cook, EBS, Monash U." output: learnr::tutorial: css: "css/logo.css" runtime: shiny_prerendered --- ```{r setup, include=FALSE} library(learnr) knitr::opts_chunk\$set(echo = TRUE, message = FALSE, warning = FALSE, collapse = TRUE, fig.height = 6, fig.width = 6, fig.align = "center", cache = FALSE) tutorial_html_dependency() library(tidyverse) library(broom) library(gridExtra) ``` # Many models ## Course web site This is a link to the course web site, in case you need to go back and forth between tutorial and web materials: [http://dmac.dicook.org](http://dmac.dicook.org) ## Beyond a single model ![](images/blind-men-and-the-elephant.png)
- Beyond a single model - Fitting many models Image source: https://balajiviswanathan.quora.com/Lessons-from-the-Blind-men-and-the-elephant ## Gapminder Hans Rosling was a Swedish doctor, academic and statistician, Professor of International Health at Karolinska Institute. He passed away in 2017. He developed a keen interest in health and wealth across the globe, and the relationship with other factors like agriculture, education, energy. His presentations on data are amazing! A starting place is https://www.youtube.com/watch?v=jbkSRLYSojo. And you can play with the gapminder data using animations at https://www.gapminder.org/tools/. ### R package The R package, called gapminder, contains a subset of the data. It has data on five year intervals from 1952 to 2007. ```{r} library(gapminder) glimpse(gapminder) ``` ## Fit linear models The question is "How has life expectancy changed over years, for each country?" Plot life expectancy by year, for each country. ```{r fig.height=4} gapminder %>% ggplot(aes(year, lifeExp, group = country)) + geom_line(alpha = 1/3) ``` - There generally appears to be an increase in life expectancy - A number of countries have big dips from the 70s through 90s - a cluster of countries starts off with low life expectancy but ends up close to the highest by the end of the period. ### Mutate year 1950 is the first year, so for model fitting we are going to shift year to begin in 1950, makes interpretability easier. ```{r} gapminder2 <- gapminder %>% mutate(year1950 = year-1950) ``` ## Australia Australia was already had one of the top life expectancies in the 1950s. ```{r fig.height=3, fig.width=8} oz <- gapminder2 %>% filter(country=="Australia") head(oz) p1 <- ggplot(data=oz, aes(x=year, y=lifeExp)) + geom_line() oz_lm <- lm(lifeExp~year1950, data=oz) tidy(oz_lm) oz_mod <- augment(oz_lm, oz) p2 <- ggplot(data=oz_mod, aes(x=year, y=.fitted)) + geom_line() p3 <- ggplot(data=oz_mod, aes(x=year, y=.std.resid)) + geom_hline(yintercept=0, colour="white", size=2) + geom_line() grid.arrange(p1, p2, p3, ncol=3) ``` - Life expectancy has increased 2.3 years every decade, on average. - There was a slow period from 1960 through to 1972, probably related to mortality during the Vietnam war. ```{r life, echo=FALSE} quiz( question("What was the average life expectancy in 1950?", answer("50"), answer("about 65"), answer("67.9", correct=TRUE)), question("What was the average life expectancy in 2000?", answer("59"), answer("about 69"), answer("79", correct=TRUE))) ``` ## Fit all countries ```{r echo=TRUE} library(purrr) by_country <- gapminder2 %>% select(country, year1950, lifeExp, continent) %>% group_by(country, continent) %>% nest() by_country <- by_country %>% mutate( model = purrr::map(data, ~ lm(lifeExp ~ year1950, data = .)) ) country_coefs <- by_country %>% unnest(model %>% purrr::map(broom::tidy)) country_coefs <- country_coefs %>% select(country, continent, term, estimate) %>% spread(term, estimate) %>% rename(intercept = `(Intercept)`) head(country_coefs) country_coefs %>% filter(country == "Australia") ``` It is also possible to use a `for` loop to compute the slope and intercept for each country. ```{r eval=FALSE} n <- length(table(gapminder2\$country)) country_coefs <- data.frame(country=gapminder2\$country[seq(1, 1704, 12)], continent=gapminder2\$continent[seq(1, 1704, 12)], intercept=rep(0,n), year1950=rep(0,n)) for (i in 1:n) { sub <- gapminder2 %>% filter(country==country_coefs\$country[i]) sub_lm <- lm(lifeExp~year1950, data=sub) sub_lm_coefs <- coefficients(sub_lm) country_coefs\$intercept[i] <- sub_lm_coefs country_coefs\$year1950[i] <- sub_lm_coefs } head(country_coefs) ``` ### Five minute challenge - Fit the models to all countries - Pick your favourite country (not Australia), print the coefficients, and make a hand sketch of the the model fit. ### Plot all the models ```{r fig.height=4} country_model <- by_country %>% unnest(model %>% purrr::map(broom::augment)) p1 <- gapminder %>% ggplot(aes(year, lifeExp, group = country)) + geom_line(alpha = 1/3) + ggtitle("Data") p2 <- ggplot(country_model) + geom_line(aes(x=year1950+1950, y=.fitted, group=country), alpha = 1/3) + xlab("year") + ggtitle("Fitted models") grid.arrange(p1, p2, ncol=2) ``` ### Plot all the model coefficients ```{r} p <- ggplot(country_coefs, aes(x=intercept, y=year1950, colour=continent, label=country)) + geom_point(alpha=0.5, size=2) + scale_color_brewer(palette = "Dark2") library(plotly) ggplotly(p) ``` Let's summarise the information learned from the model coefficients. - Generally the relationship is negative: this means that if a country started with a high intercept tends to have lower rate of increase. - There is a difference across the continents: Countries in Europe and Oceania tended to start with higher life expectancy and increased; countries in Asia and America tended to start lower but have high rates of improvement; Africa tends to start lower and have a huge range in rate of change. - Three countries had negative growth in life expectancy: Rwand, Zimbabwe, Zambia ```{r coefs, echo=FALSE} quiz( question("Name the country that has the lowest improvement in life expectancy", answer("Zimbabwe", correct=TRUE), answer("Oman"), answer("Norway"), answer("Gambia")), question("Name the country that has the highest improvement in life expectancy", answer("Zimbabwe"), answer("Oman", correct=TRUE), answer("Norway"), answer("Gambia")), question("Name the country that has the lowest initial life expectancy", answer("Zimbabwe"), answer("Oman"), answer("Norway"), answer("Gambia", correct=TRUE)), question("Name the country that has the highest initial life expectancy", answer("Zimbabwe"), answer("Oman"), answer("Norway", correct=TRUE), answer("Gambia")) ) ``` ## Model diagnostics by country ```{r} country_fit <- by_country %>% unnest(model %>% purrr::map(broom::glance)) ``` Or you can use a `for` loop to compute this. ```{r eval=FALSE} n <- length(unique(gapminder2\$country)) country_fit <- data.frame(country=gapminder2\$country[seq(1, 1704, 12)], continent=gapminder2\$continent[seq(1, 1704, 12)], intercept=rep(0,n), year1950=rep(0,n), r.squared=rep(0,n)) for (i in 1:n) { sub <- gapminder2 %>% filter(country==country_fit\$country[i]) sub_lm <- lm(lifeExp~year1950, data=sub) sub_lm_fit <- coefficients(sub_lm) country_fit\$intercept[i] <- sub_lm_coefs country_fit\$year1950[i] <- sub_lm_coefs country_fit\$r.squared[i] <- summary(sub_lm)\$r.squared } head(country_fit) ``` ### Plot the \$R^2\$ values as a histogram. ```{r} ggplot(country_fit, aes(x=r.squared)) + geom_histogram() ``` ### Countries with worst fit Examine the countries with the worst fit, countries with \$R^2<0.45\$, by making scatterplots of the data, with the linear model overlaid. ```{r fig.height=4} badfit <- country_fit %>% filter(r.squared <= 0.45) gapminder2_sub <- gapminder2 %>% filter(country %in% badfit\$country) ggplot(data=gapminder2_sub, aes(x=year, y=lifeExp)) + geom_point() + facet_wrap(~country) + scale_x_continuous(breaks=seq(1950,2000,10), labels=c("1950", "60","70", "80","90","2000")) + geom_smooth(method="lm", se=FALSE) ``` Each of these countries had been moving on a nice trajectory of increasing life expectancy, and then suffered a big dip during the time period. ### Five minute challenge Use google to explain these dips using world history and current affairs information. ## Lab exercise - Download the children per woman (total fertility) data from the [gapminder web site](https://www.gapminder.org/data/) - Conduct the same analysis, as done for life expectancy. - Find the unusual countries The code below will help you read in the data and process it, but you will need to make some changes to do the full analysis. ```{r eval=FALSE} library(readxl) fert <- read_xlsx("data/children_per_woman_total_fertility.xlsx") %>% rename(country = geo) fert <- fert %>% gather(year, fert, -country) %>% mutate(year = as.numeric(year)) %>% filter(year > 1950) %>% mutate(year1950 = year - 1950) ggplot(fert, aes(x=year, y=fert, group=country)) + geom_line(alpha=0.1) ``` ## Share and share alike 