class: center, middle, inverse, title-slide # ETC1010: Data Modelling and Computing ## Lecture 9A: Networks and Graphs ### Dr. Nicholas Tierney & Professor Di Cook ### EBS, Monash U. ### 2019-09-25 --- class: bg-main1 # Announcements .vlarge[ - Assignment 3 has been released - NO LECTURE ON FRIDAY - Project deadlines: - **Deadline 3 (11th October) **: Electronic copy of your data, and a page of data description, and cleaning done, or needing to be done. - **Deadline 4 (18th October) **: Final version of story board uploaded. - Practical exam: **18th October in class at 8am** ] --- class: bg-main1 # recap: Last week on tidy text data --- class: bg-main1 # Network analysis -- # A description of phone calls .huge[ - Johnny --> Liz - Liz --> Anna - Johnny -- > Dan - Dan --> Liz - Dan --> Lucy ] --- class: bg-main1 ### As a graph <img src="lecture-9a-slides_files/figure-html/make-call-graph-1.png" width="90%" style="display: block; margin: auto;" /> --- class: bg-main1 # And as an association matrix .vvvhuge[ [DEMO] ] --- class: bg-main1 # Nodes and edges? .huge[ Netword data can be thought of as two related tables, **nodes** and **edges**: - **nodes** are connection points - **edges** are the connections between points ] --- class: bg-main1 # Example: Mad Men. (Nodes = characters from the series) ``` ## # A tibble: 45 x 2 ## label Gender ## <fct> <fct> ## 1 Betty Draper female ## 2 Don Draper male ## 3 Harry Crane male ## 4 Joan Holloway female ## 5 Lane Pryce male ## 6 Peggy Olson female ## 7 Pete Campbell male ## 8 Roger Sterling male ## 9 Sal Romano male ## 10 Henry Francis male ## # … with 35 more rows ``` --- class: bg-main1 # Example: Mad Men. (Edges = how they are associated) ``` ## # A tibble: 39 x 2 ## Name1 Name2 ## <fct> <fct> ## 1 Betty Draper Henry Francis ## 2 Betty Draper Random guy ## 3 Don Draper Allison ## 4 Don Draper Bethany Van Nuys ## 5 Don Draper Betty Draper ## 6 Don Draper Bobbie Barrett ## 7 Don Draper Candace ## 8 Don Draper Doris ## 9 Don Draper Faye Miller ## 10 Don Draper Joy ## # … with 29 more rows ``` --- class: bg-main1 # Why care about these relationships? .huge[ - **Telephone exchanges**: Nodes are the phone numbers. Edges would indicate a call was made betwen two numbers. - **Book or movie plots**: Nodes are the characters. Edges would indicate whether they appear together in a scene, or chapter. If they speak to each other, various ways we might measure the association. - **Social media**: nodes would be the people who post on facebook, including comments. Edges would measure who comments on who's posts. ] --- class: bg-main1 # Drawing these relationships out: .vlarge[ One way to describe these relationships is to provide association matrix between many objects. ] <img src="images/network_data.png" width="80%" style="display: block; margin: auto;" /> (Image created by Sam Tyner.) --- class: bg-main1 # Example: Madmen <img src="images/Mad-men-title-card.jpg" width="90%" style="display: block; margin: auto;" /> *Source: [wikicommons](https://en.wikipedia.org/wiki/Mad_Men#/media/File:Mad-men-title-card.jpg)* --- class: bg-main1 # Generate a network view .huge[ - Create a layout (in 2D) which places nodes which are most related close, - Plot the nodes as points, connect the appropriate lines - Overlaying other aspects, e.g. gender ] --- class: bg-main1 # introducing `tidygraph` and `ggraph` .pull-left[ ```r library(tidygraph) library(ggraph) madmen_graph <- tbl_graph( nodes = madmen$vertices, edges = madmen$edges, directed = FALSE ) ``` ] .pull-right[ ``` ## # A tbl_graph: 45 nodes and 39 edges ## # ## # An undirected simple graph with 6 components ## # ## # Node Data: 45 x 2 (active) ## label Gender ## <fct> <fct> ## 1 Betty Draper female ## 2 Don Draper male ## 3 Harry Crane male ## 4 Joan Holloway female ## 5 Lane Pryce male ## 6 Peggy Olson female ## # … with 39 more rows ## # ## # Edge Data: 39 x 2 ## from to ## <int> <int> ## 1 1 2 ## 2 1 3 ## 3 4 5 ## # … with 36 more rows ``` ] --- class: bg-main1 # plotting using ggraph ```r gg_madmen <- ggraph(madmen_graph, layout = "kk") + geom_edge_link() + geom_node_label(aes(colour = Gender, label = label)) ``` --- class: bg-main1 # plotting using ggraph ```r gg_madmen ``` <img src="lecture-9a-slides_files/figure-html/print-ggraph-madmen-1.png" width="90%" style="display: block; margin: auto;" /> --- class: bg-main1 # Which actor was most connected? ```r madmen_graph %>% activate(nodes) %>% mutate(count = centrality_degree()) %>% arrange(-count) %>% as_tibble() ``` ``` ## # A tibble: 45 x 3 ## label Gender count ## <fct> <fct> <dbl> ## 1 Joan Holloway female 14 ## 2 Woman at the Clios party female 6 ## 3 Janine female 5 ## 4 Duck Phillips male 4 ## 5 Betty Draper female 3 ## 6 Rachel Menken female 3 ## 7 Hildy female 3 ## 8 Joy female 2 ## 9 Vicky female 2 ## 10 Don Draper male 1 ## # … with 35 more rows ``` --- class: bg-main1 # `activate()` what now? .pull-left[ ```r madmen_graph ``` ``` ## # A tbl_graph: 45 nodes and 39 edges ## # ## # An undirected simple graph with 6 components ## # ## # Node Data: 45 x 2 (active) ## label Gender ## <fct> <fct> ## 1 Betty Draper female ## 2 Don Draper male ## 3 Harry Crane male ## 4 Joan Holloway female ## 5 Lane Pryce male ## 6 Peggy Olson female ## # … with 39 more rows ## # ## # Edge Data: 39 x 2 ## from to ## <int> <int> ## 1 1 2 ## 2 1 3 ## 3 4 5 ## # … with 36 more rows ``` ] .huge.pull-right[ - need to tell dplyr if you are working on `nodes` or `edges`. - `activate` means we don't need a `mutate_nodes` or `mutate_edges` commands ] --- class: bg-main1 # `centrality` what now? .pull-left[ ```r madmen_graph %>% activate(nodes) %>% * mutate(count = centrality_degree()) %>% arrange(-count) %>% as_tibble() ``` ``` ## # A tibble: 45 x 3 ## label Gender count ## <fct> <fct> <dbl> ## 1 Joan Holloway female 14 ## 2 Woman at the Clios party female 6 ## 3 Janine female 5 ## 4 Duck Phillips male 4 ## 5 Betty Draper female 3 ## 6 Rachel Menken female 3 ## 7 Hildy female 3 ## 8 Joy female 2 ## 9 Vicky female 2 ## 10 Don Draper male 1 ## # … with 35 more rows ``` ] .pull-right.vlarge[ - How central is a node or edge in a graph? - definition is inherently vague - there are many different centrality scores that exist - `centrality_degree()` says: "What is the number of adjacent edges?" ] --- class: bg-main1 # What do we learn? .huge[ - Joan Holloway had a lot of affairs, all with loyal partners except for his wife Betty, who had two affairs herself - Followed by Woman at Clios party ] --- class: bg-main1 # Example: American college football .huge[ Early American football outfits were like Australian AFL today! ] <img src="images/1480px-Unknown_Early_American_Football_Team.jpg" width="50%" style="display: block; margin: auto;" /> *Source: [wikicommons](https://commons.wikimedia.org/wiki/File:Unknown_Early_American_Football_Team.jpg)* --- class: bg-main1 # Example: American college football .huge[ Fall 2000 Season of [Division I college football](https://en.wikipedia.org/wiki/NCAA_Division_I). - Nodes are the teams, edges are the matches. - Teams are broken into "conferences" which are the primary competition, but they can play outside this group. ] --- class: bg-main1 # Example: American college football ```r foot_graph ``` ``` ## # A tbl_graph: 115 nodes and 613 edges ## # ## # A directed acyclic simple graph with 1 component ## # ## # Node Data: 115 x 3 (active) ## uni conference schools ## <chr> <chr> <chr> ## 1 BrighamYoung Mountain West <NA> ## 2 FloridaState Atlantic Coast <NA> ## 3 Iowa Big Ten <NA> ## 4 KansasState Big Twelve <NA> ## 5 NewMexico Mountain West <NA> ## 6 TexasTech Big Twelve <NA> ## # … with 109 more rows ## # ## # Edge Data: 613 x 3 ## from to same.conf ## <int> <int> <dbl> ## 1 1 2 0 ## 2 3 4 0 ## 3 1 5 1 ## # … with 610 more rows ``` --- class: bg-main1 ```r set.seed(2019-09-25-1117) gg_foot_graph <- ggraph(foot_graph, layout = "fr") + geom_edge_link(alpha = 0.2) + geom_node_point(size = 7, alpha = 0.9, aes(colour = conference)) + scale_colour_brewer(palette = "Paired") + theme(legend.position = "bottom") ``` --- class: bg-main1 <img src="lecture-9a-slides_files/figure-html/print-gg-foot-graph-1.png" width="80%" style="display: block; margin: auto;" /> --- class: bg-main1 # What do we learn? .vlarge[ - Remember layout is done to place nodes that are more similar close together in the display. - The colours indicate conference the team belongs too. For the most part, conferences are clustered, more similar to each other than other conferences. - There are some clusters of conference groups, eg Mid-American, Big East, and Atlantic Coast - The Independents are independent - Some teams play far afield from their conference. ] --- class: bg-main1 ## Example: Harry Potter characters <img src="images/1069px-Harry_Potter_Platform_Kings_Cross.jpg" width="50%" style="display: block; margin: auto;" /> *Source: [wikicommons](https://commons.wikimedia.org/wiki/File:Harry_Potter_Platform_Kings_Cross.jpg)* --- class: bg-main1 .huge[ There is a connection between two students if one provides emotional support to the other at some point in the book. - Code to pull the data together is provided by Sam Tyner [here](https://github.com/sctyner/geomnet/blob/master/README.Rmd#harry-potter-peer-support-network). ] --- class: bg-main1 # Harry potter data as nodes and edges ```r hp ``` ``` ## # A tbl_graph: 64 nodes and 434 edges ## # ## # A directed multigraph with 29 components ## # ## # Node Data: 64 x 4 (active) ## name schoolyear gender house ## <chr> <dbl> <chr> <chr> ## 1 Adrian Pucey 1989 M Slytherin ## 2 Alicia Spinnet 1989 F Gryffindor ## 3 Angelina Johnson 1989 F Gryffindor ## 4 Anthony Goldstein 1991 M Ravenclaw ## 5 Blaise Zabini 1991 M Slytherin ## 6 C. Warrington 1989 M Slytherin ## # … with 58 more rows ## # ## # Edge Data: 434 x 3 ## from to book ## <int> <int> <dbl> ## 1 11 25 1 ## 2 11 26 1 ## 3 11 44 1 ## # … with 431 more rows ``` --- class: bg-main1 # Let's plot the characters ```r ggraph_hp <- ggraph(hp, layout = "fr") + geom_edge_link(alpha = 0.2) + geom_node_point(aes(colour = house, shape = gender)) + # geom_node_text(aes(label = name)) + facet_edges(~book, ncol = 2) + scale_colour_manual(values = c("#941B08","#F1F31C", "#071A80", "#154C07")) ``` --- class: bg-main1 # Let's plot the characters ```r ggraph_hp ``` <img src="lecture-9a-slides_files/figure-html/ggraph-hp-1.png" width="90%" style="display: block; margin: auto;" /> --- class: bg-main1 # Your turn: rstudio.cloud .vlarge.pull-left[ - Read in last semesters class data, which contains `s1_name` and `s2_name` are the first names of class members, and tutors, with the latter being the "go-to" person for the former. - Write the code to produce a class network that looks something like below ] .pull-right[ ![](images/class_network.png) ] <!-- ## Simpsons Your turn to make a network diagram for the Simpsons. The measure of association will be "that the two characters had lines in the same episode together". - How many characters appeared only in one episode? (You will want to drop these) - Write code to search if a character has a line in an episode - Compile a dataset of episode (rows) and character (columns) which is a binary matrix where 1 indicates the character had a line in the episode, and 0 is otherwise - Gather the matrix into long form, with these columns: `episode`, `character`, `had a line` (0,1) - Filter the rows with `had a line` equal to 1. - Count the number of times the pair of characters appeared. This now forms your edge set, with an additional column of the strength of the relationship - Make your network display --> --- class: bg-main1 ## Share and share alike <a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.