class: center, middle, inverse, title-slide # ETC1010: Data Modelling and Computing ## Week of Data Visualisation: Lecture 3 ### Dr. Nicholas Tierney & Professor Di Cook ### EBS, Monash U. ### 2019-08-14 --- class: bg-main5 # Learning Tips --- class: bg-main5 # Understanding learning .huge[ - Growth and fixed mindsets - Reframe success + failure as opportunities for growth - Growing area of research by [Carol Dweck of Stanford](https://www.youtube.com/watch?v=hiiEeMN7vbQ) ] --- class: bg-main5 .center[ # Reframing ] .pull-left[ # From .huge[ > "I'll never understand" > "I just don't get programming" > "I'm not a maths person" ] ] -- .pull-right[ # To .vlarge[ > "I understand more than I did yesterday" > "I can learn how to program" > "Compared to this last week, I've learnt quite a bit!" ] ] --- # Overview for today .huge[ - Going from tidy data to a data plot, using a grammar - Mapping of variables from the data to graphical elements - Using different geoms ] --- class: bg-main1 # Example: Tuberculosis data .huge.left-code[ The case notifications table From [WHO](http://www.who.int/tb/country/data/download/en/). Data is tidied here, with only counts for Australia. ] .right-plot[ ```r tb_au ``` ``` ## # A tibble: 192 x 6 ## country iso3 year count gender age ## <chr> <chr> <dbl> <dbl> <chr> <chr> ## 1 Australia AUS 1997 8 m 1524 ## 2 Australia AUS 1998 11 m 1524 ## 3 Australia AUS 1999 13 m 1524 ## 4 Australia AUS 2000 16 m 1524 ## 5 Australia AUS 2001 23 m 1524 ## 6 Australia AUS 2002 15 m 1524 ## 7 Australia AUS 2003 14 m 1524 ## 8 Australia AUS 2004 18 m 1524 ## 9 Australia AUS 2005 32 m 1524 ## 10 Australia AUS 2006 33 m 1524 ## # … with 182 more rows ``` ] --- ## The "100% charts" ```r ggplot(tb_au, aes(x = year, y = count, fill = gender)) + geom_bar(stat = "identity", position = "fill") + facet_grid(~ age) + scale_fill_brewer(palette="Dark2") ``` <img src="lecture-3a-slides_files/figure-html/show-100-pct-1.png" width="100%" /> ??? 100% charts, is what excel names these beasts. What do we learn? --- class: middle center bg-main5 # Let's unpack a bit. --- class: bg-main1 # Data Visualisation .blockquote.huge[ "The simple graph has brought more information to the data analyst’s mind than any other device." — John Tukey ] --- class: bg-main1 # Data Visualisation .huge[ - The creation and study of the visual representation of data. - Many tools for visualizing data (R is one of them) - Many approaches/systems within R for making data visualizations (**ggplot2** is one of them, and that's what we're going to use). ] --- class: bg-main1 # ggplot2 `\(\in\)` tidyverse .left-code[ <img src="img/ggplot2-part-of-tidyverse.png" width="100%" /> ] .right-plot.vlarge[ - **ggplot2** is tidyverse's data visualization package - The `gg` in "ggplot2" stands for Grammar of Graphics - It is inspired by the book **Grammar of Graphics** by Leland Wilkinson <sup>†</sup> - A grammar of graphics is a tool that enables us to concisely describe the components of a graphic ] .footnote[ <sup>†</sup> Source: [BloggoType](http://bloggotype.blogspot.com/2016/08/holiday-notes2-grammar-of-graphics.html) ] --- <img src="img/grammar-of-graphics.png" width="100%" /> [From BloggoType](http://bloggotype.blogspot.com/2016/08/holiday-notes2-grammar-of-graphics.html) --- # Our first ggplot! .left-code[ ```r library(ggplot2) ggplot(tb_au) ``` ] .right-plot[ <img src="lecture-3a-slides_files/figure-html/first-gg-1-out-1.png" width="100%" /> ] --- # Our first ggplot! .left-code[ ```r library(ggplot2) ggplot(tb_au, * aes(x = year, * y = count)) ``` ] .right-plot[ <img src="lecture-3a-slides_files/figure-html/first-gg-2-out-1.png" width="100%" /> ] --- # Our first ggplot! .left-code[ ```r library(ggplot2) ggplot(tb_au, aes(x = year, y = count)) + * geom_point() ``` ] .right-plot[ <img src="lecture-3a-slides_files/figure-html/first-gg-3-out-1.png" width="100%" /> ] --- # Our first ggplot! (what's the data again?) .vlarge.center.middle[ <table> <thead> <tr> <th style="text-align:left;"> country </th> <th style="text-align:left;"> iso3 </th> <th style="text-align:right;"> year </th> <th style="text-align:right;"> count </th> <th style="text-align:left;"> gender </th> <th style="text-align:left;"> age </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 1997 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 1524 </td> </tr> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 1998 </td> <td style="text-align:right;"> 11 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 1524 </td> </tr> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 1999 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 1524 </td> </tr> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 2000 </td> <td style="text-align:right;"> 16 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 1524 </td> </tr> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 2001 </td> <td style="text-align:right;"> 23 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 1524 </td> </tr> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 2002 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 1524 </td> </tr> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 2003 </td> <td style="text-align:right;"> 14 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 1524 </td> </tr> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 2004 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 1524 </td> </tr> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 2005 </td> <td style="text-align:right;"> 32 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 1524 </td> </tr> <tr> <td style="text-align:left;"> Australia </td> <td style="text-align:left;"> AUS </td> <td style="text-align:right;"> 2006 </td> <td style="text-align:right;"> 33 </td> <td style="text-align:left;"> m </td> <td style="text-align:left;"> 1524 </td> </tr> </tbody> </table> ] --- # Our first ggplot! .left-code[ ```r library(ggplot2) ggplot(tb_au, aes(x = year, y = count)) + * geom_col() ``` ] .right-plot[ <img src="lecture-3a-slides_files/figure-html/first-gg-4-out-1.png" width="100%" /> ] --- # Our first ggplot! .left-code[ ```r library(ggplot2) ggplot(tb_au, aes(x = year, y = count, * fill = gender)) + geom_col() ``` ] .right-plot[ <img src="lecture-3a-slides_files/figure-html/first-gg-5-out-1.png" width="100%" /> ] --- # Our first ggplot! .left-code[ ```r library(ggplot2) ggplot(tb_au, aes(x = year, y = count, fill = gender)) + * geom_col(position = "fill") ``` ] .right-plot[ <img src="lecture-3a-slides_files/figure-html/first-gg-6-out-1.png" width="100%" /> ] --- # Our first ggplot! .left-code[ ```r library(ggplot2) ggplot(tb_au, aes(x = year, y = count, fill = gender)) + geom_col(position = "fill") + * scale_fill_brewer( * palette = "Dark2" * ) ``` ] .right-plot[ <img src="lecture-3a-slides_files/figure-html/first-gg-7-out-1.png" width="100%" /> ] --- # Our first ggplot! .left-code[ ```r library(ggplot2) ggplot(tb_au, aes(x = year, y = count, fill = gender)) + geom_col(position = "fill") + scale_fill_brewer( palette = "Dark2" ) + * facet_wrap(~ age) ``` ] .right-plot[ <img src="lecture-3a-slides_files/figure-html/first-gg-8-out-1.png" width="100%" /> ] ??? - First argument provided is the name of the data, `tb_au` - Variable mapping: year is mapped to x, count is mapped to y, gender is mapped to colour, and age is used to subset the data and make separate plots - The column geom is used, `geom_col` - We are mostly interested in proportions between gender, over years, separately by age. The `position = "fill"` option in `geom_bar` sets the heights of the bars to be all at 100%. It ignores counts, and emphasizes the proportion of males and females. --- ## The "100% charts" ```r ggplot(tb_au, aes(x = year, y = count, fill = gender)) + geom_bar(stat = "identity", position = "fill") + facet_grid(~ age) + scale_fill_brewer(palette="Dark2") ``` <img src="lecture-3a-slides_files/figure-html/gg-show-100pct-1.png" width="100%" /> -- .huge[ What do we learn ] ??? 100% charts, is what excel names these beasts. What do we learn? --- # What do we learn? .huge[ - Focus is on **proportion** in each category. - Across (almost) all ages, and years, the proportion of males having TB is higher than females - These proportions tend to be higher in the older age groups, for all years. ] --- class: bg-main1 # Code structure of ggplot .huge[ - `ggplot()` is the main function - Plots are constructed in layers - Structure of code for plots can often be summarised as ```r ggplot(data = [dataset], mapping = aes(x = [x-variable], y = [y-variable])) + geom_xxx() + other options ``` ] --- class: bg-main1 # How to use ggplot .huge[ - To use ggplot2 functions, first load tidyverse ```r library(tidyverse) ``` - For help with the ggplot2, see [ggplot2.tidyverse.org](http://ggplot2.tidyverse.org/) ] --- class: bg-main1 .vvhuge[ Let's look at some more options to emphasise different features ] --- # .left-code[ ```r ggplot(tb_au, aes(x = year, y = count, fill = gender)) + geom_col(position = "fill") + scale_fill_brewer( palette = "Dark2" ) + * facet_wrap(~ age) ``` ] .right-plot[ <img src="lecture-3a-slides_files/figure-html/first-gg-9-out-1.png" width="100%" /> ] --- # Emphasizing different features with ggplot2 ```r ggplot(tb_au, aes(x = year, y = count, fill = gender)) + geom_col(position = "fill") + scale_fill_brewer( palette = "Dark2") + * facet_grid(~ age) ``` <img src="lecture-3a-slides_files/figure-html/first-gg-10-1.png" width="100%" /> --- # Emphasise ... ? ```r ggplot(tb_au, aes(x = year, y = count, fill = gender)) + * geom_col() + scale_fill_brewer( palette = "Dark2") + facet_grid(~ age) ``` <img src="lecture-3a-slides_files/figure-html/first-gg-11-1.png" width="100%" /> --- class: bg-main1 # What do we learn? .huge[ - `, position = "fill"` was removed - Focus is on **counts** in each category. - Different across ages, and years, counts tend to be lower in middle age (45-64) - 1999 saw a bit of an outbreak, in most age groups, with numbers doubling or tripling other years. - Incidence has been increasing among younger age groups in recent years. ] --- # Emphasise ... ? ```r ggplot(tb_au, aes(x = year, y = count, fill = gender)) + * geom_col(position = "dodge") + scale_fill_brewer(palette = "Dark2") + facet_grid(~ age) ``` <img src="lecture-3a-slides_files/figure-html/gg-side-by-side-1.png" width="100%" /> --- class: bg-main1 # What do we learn? .huge[ - `, position="dodge"` is used in `geom_col` - Focus is on **counts by gender**, predominantly male incidence. - Incidence among males relative to females is from middle age on. - There is similar incidence between males and females in younger age groups. ] --- # Separate bar charts ```r ggplot(tb_au, aes(x = year, y = count, fill = gender)) + geom_col() + scale_fill_brewer(palette = "Dark2") + * facet_grid(gender ~ age) ``` <img src="lecture-3a-slides_files/figure-html/gg-separate-1.png" width="100%" /> --- class: bg-main1 # What do we learn? .huge[ - `facet_grid(gender ~ age) +` faceted by gender as well as age - note `facet_grid` vs `facet_wrap` - Easier to focus separately on males and females. - 1999 outbreak mostly affected males. - Growing incidence in the 25-34 age group is still affecting females but seems to be have stablised for males. ] --- # ~~Pie charts?~~ Rose Charts ```r ggplot(tb_au, aes(x = year, y = count, fill = gender)) + geom_col() + scale_fill_brewer(palette="Dark2") + facet_grid(gender ~ age) + * coord_polar() + theme(axis.text = element_blank()) ``` <img src="lecture-3a-slides_files/figure-html/gg-rose-1.png" width="100%" /> --- class: bg-main1 # What do we learn? .huge[ - Bar charts in polar coordinates produce rose charts. - `coord_polar() +` plot is made in polar coordinates, rather than the default Cartesian coordinates - Emphasizes the middle years as low incidence. ] --- # Rainbow charts? ```r ggplot(tb_au, aes(x = 1, y = count, * fill = factor(year))) + geom_col(position = "fill") + facet_grid(gender ~ age) ``` <img src="lecture-3a-slides_files/figure-html/gg-rainbow-1.png" width="100%" /> --- # What do we see in the code?? .huge[ - A single stacked bar, in each facet. - Year is mapped to colour. - Notice how the mappings are different. A single number is mapped to x, that makes a single stacked bar chart. - year is now mapped to colour (that's what gives us the rainbow charts!) ] --- # What do we learn? .vhuge[ - Pretty chart but not easy to interpret. ] --- ## (Actual) Pie charts ```r ggplot(tb_au, aes(x = 1, y = count, fill = factor(year))) + geom_col(position = "fill") + * facet_grid(gender ~ age) + * coord_polar(theta = "y") + theme(axis.text = element_blank()) ``` <img src="lecture-3a-slides_files/figure-html/gg-pie2-1.png" width="100%" /> --- class: bg-main1 # What is different in the code? .vhuge[ - `coord_polar(theta="y")` is using the y variable to do the angles for the polar coordinates to give a pie chart. ] --- # What do we learn? .vhuge[ - Pretty chart but not easy to interpret, or make comparisons across age groups. ] --- ## Why? [The various looks of David Bowie](https://www.wired.com/wp-content/uploads/2016/01/DB-Transformation-Colour.gif) .left-code.large[ - Using named plots, eg pie chart, bar chart, scatterplot, is like seeing animals in the zoo. - The grammar of graphics allows you to define the mapping between variables in the data, with elements of the plot. - It allows us to see and understand how plots are similar or different. - And you can see how variations in the definition create variations in the plot. ] .right-plot[ <img src="https://www.wired.com/wp-content/uploads/2016/01/DB-Transformation-Colour.gif" style="width:50%" /> ] --- # Your Turn: .huge[ - Do the lab exercises - Take the lab quiz - Use the rest of the lab time to coordinate with your group on the first assignment. ] --- # References .huge[ - [Chapter 3 of R for Data Science](https://r4ds.had.co.nz/data-visualisation.html) - [Data made available from WHO](https://www.who.int/tb/country/data/download/en/) - [Garret Aden Buie's gentle introduction to ggplot2](https://pkg.garrickadenbuie.com/gentle-ggplot2/#1) - [Mine Çetinkaya-Rundel's introduction to ggplot using star wars.](https://github.com/rstudio-education/datascience-box/tree/master/slides/u1_d02-data-and-viz) ] --- ## Share and share alike <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.