class: center, middle, inverse, title-slide

# ETC1010: Data Modelling and Computing
## Lecture 5B: Web Scraping
### Dr. Nicholas Tierney & Professor Di Cook
### EBS, Monash U.
### 2019-08-30

---
background-image: url(https://www.kdnuggets.com/images/cartoon-turkey-data-science.jpg)
background-size: contain
background-position: 50% 50%
class: left, white

.vvhuge[
While the song is playing...
]

.vhuge[
Draw a mental model / concept map of last lecture's content on Missing Data.
]

---
class: bg-main1

# Overview

.huge[
- Assignment comments
- Exam / study advice
- Different file formats
  - Audio / binary
  - Web data
    - responsible scraping
    - scraping
    - JSON
]

---
class: bg-main1

# Assignment 2 notes

.huge[
- Some questions might seem a bit strange - this is normal!
- Sometimes you can't answer the exact question with the data you have.
- It might be frustrating - that's OK!
- Steer away from the idea of "correct code" and reframe this as "code that works". There is no one answer - remember the Tower of Babel example from the start of class: there are many ways to do the same thing in R.
]

---
class: bg-main1

# Assignment 2 notes

.huge[
- For the last two questions about combining the crime data with income data, **rest assured** there is no "trick" - the data isn't perfect here.
- Our marking advice: Describe how you think the solution you have provided helps answer the question.
- Show us a plot
- Explain what you see in the plot - what do you think we can learn from the information we have?
- Describe what other information you might like to have
]

---
class: bg-main5

# Learning technique: study

--

.huge[
- Practice the previous exam under exam conditions (last semester's exam is on dmac.netlify.com)
]

--

.huge[
- Think up exam questions, write them down
]

--

.huge[
- Practice explaining concepts out loud - by yourself, in the mirror, to a friend, to an empty room. Saying things out loud builds better connections.
]

---
class: bg-main5

# Learning technique: in exam

--

.huge[
- Peruse the exam before starting
]

--

.huge[
- Take off one shoe (My high school teacher claimed this worked)
]

--

.huge[
- Number questions in order from easy to hard - start with the easiest one
]

--

.huge[
- Work out a marks : minutes ratio (e.g., 60 marks in 60 minutes means one mark per minute)
]

---
class: bg-main1

# Recap on some tricky topics

.huge[
- assignment ("gets" - `<-`)
- pipes (from the textbook)
]

---
class: bg-main1

# The pipe operator: `%>%`

.pull-left.huge[
- Code to tell a story about a little bunny foo foo (borrowed from https://r4ds.had.co.nz/pipes.html):
- Using functions for each verb: `hop()`, `scoop()`, `bop()`.
]

.pull-right.huge[
> Little bunny Foo Foo
> Went hopping through the forest
> Scooping up the field mice
> And bopping them on the head
]

---
class: bg-main1

# Approach: Intermediate steps

.huge[
```r
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)
```
]

--

.huge[
- Main downside: forces you to name each intermediate element.
- Sometimes these steps form natural names. If this is the case - go ahead.
- **But many times there are not natural names**
- Adding number suffixes to make the names unique leads to problems.
]

---
class: bg-main1

# Approach: Intermediate steps

.huge[
```r
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)
```
]

--

.huge[
- Code is cluttered with unimportant names
- Suffix has to be carefully incremented on each line.
- I've done this!
- 99% of the time I miss a number somewhere, and there goes my evening ... debugging my code.
]

---
class: bg-main1

# Another Approach: Overwrite the original

.huge[
```r
foo_foo <- hop(foo_foo, through = forest)
foo_foo <- scoop(foo_foo, up = field_mice)
foo_foo <- bop(foo_foo, on = head)
```
]

--

.huge[
- Overwrite originals instead of creating intermediate objects
- Less typing (and less thinking). Less likely to make mistakes?
- **Painful debugging**: need to re-run the code from the top.
- Repetition of the object (`foo_foo` written 6 times!) obscures what changes.
]

---
class: bg-main1

# (Yet) Another approach: function composition

.huge[
```r
bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice
  ),
  on = head
)
```
]

--

.huge[
- You need to read inside-out, and right-to-left.
- Arguments are spread far apart
- Harder to read
]

---
class: bg-main1

# Pipe `%>%` can help!

.huge.pull-left[
`f(x)`

`g(f(x))`

`h(g(f(x)))`
]

--

.vlarge.pull-right[
`x %>% f()`

`x %>% f() %>% g()`

`x %>% f() %>% g() %>% h()`
]

---
class: bg-main1

# Solution: Use the pipe - `%>%`

.huge[
```r
foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)
```
]

.huge[
- focusses on verbs, not nouns.
- Can be read as a series of function compositions, like actions.

> Foo Foo hops, then scoops, then bops.

- read more at: https://r4ds.had.co.nz/pipes.html
]

---

# Take 3 minutes to discuss these two concepts with your table

.vhuge[
- What are pipes?
- What is assignment?
]

---
class: bg-main1

# The many shapes and sizes of data

---
class: bg-main1

# Data as an audio file

```
## # A tibble: 50,001 x 3
##        t  left right
##    <int> <int> <int>
##  1     1    28    29
##  2     2    27    28
##  3     3    26    24
##  4     4    24    27
##  5     5    22    18
##  6     6    15    19
##  7     7    15    13
##  8     8    12    13
##  9     9    15    16
## 10    10    18    16
## # … with 49,991 more rows
```

---
class: bg-main1

# Plotting audio data?
<img src="lecture-5b-slides_files/figure-html/show-audio-1.png" width="90%" style="display: block; margin: auto;" /> --- # Compare left and right channels <img src="lecture-5b-slides_files/figure-html/gg-compare-left-and-right-1.png" width="90%" style="display: block; margin: auto;" /> ??? Oh, same sound is on both channels! A tad drab. --- class: bg-main1 # Compute statistics ```r df_wavs %>% filter(channel == "left") %>% group_by(word) %>% summarise( m = mean(value), s = sd(value), mx = max(value), mn = min(value) ) %>% as_table() ``` <table> <thead> <tr> <th style="text-align:left;"> word </th> <th style="text-align:right;"> m </th> <th style="text-align:right;"> s </th> <th style="text-align:right;"> mx </th> <th style="text-align:right;"> mn </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> data </td> <td style="text-align:right;"> 0.004 </td> <td style="text-align:right;"> 1602.577 </td> <td style="text-align:right;"> 8393 </td> <td style="text-align:right;"> -15386 </td> </tr> <tr> <td style="text-align:left;"> word </td> <td style="text-align:right;"> 0.009 </td> <td style="text-align:right;"> 1506.626 </td> <td style="text-align:right;"> 6601 </td> <td style="text-align:right;"> -11026 </td> </tr> </tbody> </table> --- class: bg-main1 # Di's music ``` ## artist type lvar lave lmax lfener ## Dancing Queen Abba Rock 17600755.6 -90.0068667 29921 105.92095 ## Knowing Me Abba Rock 9543020.9 -75.7667187 27626 102.83616 ## Take a Chance Abba Rock 9049481.5 -98.0629244 26372 102.32488 ## Mamma Mia Abba Rock 7557437.3 -90.4710616 28898 101.61648 ## Lay All You Abba Rock 6282285.6 -88.9526309 27940 100.30076 ## Super Trouper Abba Rock 4665866.7 -69.0208450 25531 100.24848 ## I Have A Dream Abba Rock 3369670.3 -71.6828788 14699 104.59686 ## The Winner Abba Rock 1135862.0 -67.8190467 8928 104.34921 ## Money Abba Rock 6146942.6 -76.2807500 22962 102.24066 ## SOS Abba Rock 3482881.8 -74.1300038 15517 104.36243 ## V1 Vivaldi Classical 3677296.2 66.6530970 
24229 99.25243 ## V2 Vivaldi Classical 771491.8 21.6560941 6936 104.36737 ## V3 Vivaldi Classical 5227573.2 88.6465556 17721 104.61255 ## V4 Vivaldi Classical 334719.1 13.8318847 4123 104.35005 ## V5 Vivaldi Classical 836904.9 34.5677377 17306 88.82821 ## V6 Vivaldi Classical 13936436.7 216.2317586 30355 104.82354 ## V7 Vivaldi Classical 3636324.3 9.8366363 21450 100.52727 ## V8 Vivaldi Classical 295397.4 2.8143227 2985 108.28717 ## V9 Vivaldi Classical 4335879.0 10.9015767 22271 101.32881 ## V10 Vivaldi Classical 472630.0 3.8890862 8194 98.83731 ## M1 Mozart Classical 2819795.0 -5.8667602 14939 102.25358 ## M2 Mozart Classical 2836957.9 -5.6074580 13382 102.04646 ## M3 Mozart Classical 9089372.1 -5.9719205 25265 102.23796 ## M4 Mozart Classical 4056229.7 -6.2272904 21328 97.59887 ## M5 Mozart Classical 1568925.6 -6.1993790 15839 99.47580 ## M6 Mozart Classical 7758409.1 -5.7700183 22496 101.66942 ## All in a Days Work Eels Rock 40275677.4 -10.6893416 32759 106.07617 ## Saturday Morning Eels Rock 129472199.3 50.2115773 32759 114.00229 ## The Good Old Days Eels Rock 18838849.0 -0.0222764 30386 105.45611 ## Love of the Loveless Eels Rock 43201194.3 1.8926897 32759 108.37835 ## Girl Eels Rock 88547131.0 0.3358761 32744 112.00916 ## Agony Eels Rock 16285811.4 -0.1405876 30106 103.94171 ## Rock Hard Times Eels Rock 54651552.8 1.9848514 32759 108.87503 ## Restraining Eels Rock 12322434.4 1.0259790 23221 103.29906 ## Lone Wolf Eels Rock 63878899.8 3.3592505 32759 109.53291 ## Wrong About Bobby Eels Rock 43668186.9 -2.0125667 32759 108.35559 ## Love Me Do Beatles Rock 28806811.2 -5.7452189 24159 109.95808 ## I Want to Hold Your Hand Beatles Rock 61257693.6 -6.0340682 28502 111.81813 ## Cant Buy Me Love Beatles Rock 76729438.0 -5.9583971 30102 112.71792 ## I Feel Fine Beatles Rock 52497242.7 -5.7314591 29911 110.66333 ## Ticket to Ride Beatles Rock 68104547.0 -6.1449114 30415 111.59957 ## Help Beatles Rock 52569372.5 -5.7166301 32318 110.34618 ## Yesterday Beatles Rock 
23080907.6 -6.0298337 28169 107.30541 ## Yellow Submarine Beatles Rock 39908667.5 -6.2616915 29061 109.47442 ## Eleanor Rigby Beatles Rock 18819753.2 -6.1265193 21680 108.65894 ## Penny Lane Beatles Rock 58614798.7 -5.9971242 31131 111.18901 ## B1 Beethoven Classical 8368952.9 -0.9538330 26645 101.74095 ## B2 Beethoven Classical 293608.3 -0.1247094 4554 103.26160 ## B3 Beethoven Classical 8051764.6 -0.3316964 24194 101.20132 ## B4 Beethoven Classical 23493873.6 -0.9411538 32766 106.18220 ## B5 Beethoven Classical 1640232.8 1.3899979 20877 94.59029 ## B6 Beethoven Classical 343973.1 -2.4748955 9225 93.38874 ## B7 Beethoven Classical 3644784.2 -1.0426907 24633 97.86394 ## B8 Beethoven Classical 15030950.3 -1.4394652 26066 106.39731 ## The Memory of Trees Enya New wave 1135493.1 -10.6183398 9994 102.16132 ## Anywhere Is Enya New wave 12230252.2 -17.8372700 24968 105.75748 ## Pax Deorum Enya New wave 1723627.9 -6.8327065 13227 101.86845 ## Waterloo Abba Rock 24898675.9 -93.9961871 29830 107.73299 ## V11 Vivaldi Classical 1879989.2 12.7213373 8601 105.81750 ## V12 Vivaldi Classical 737349.6 5.7190022 7089 102.92123 ## V13 Vivaldi Classical 2865979.9 21.4467629 17282 102.11314 ## Hey Jude Beatles Rock 8651854.1 -6.1322408 18509 83.88195 ## lfreq ## Dancing Queen 59.57379 ## Knowing Me 58.48031 ## Take a Chance 124.59397 ## Mamma Mia 48.76513 ## Lay All You 74.02039 ## Super Trouper 81.40140 ## I Have A Dream 305.18689 ## The Winner 277.66056 ## Money 165.15799 ## SOS 146.73700 ## V1 329.53792 ## V2 843.83240 ## V3 165.76781 ## V4 293.99972 ## V5 198.38305 ## V6 198.46716 ## V7 877.77243 ## V8 58.41722 ## V9 176.53441 ## V10 526.04942 ## M1 342.26017 ## M2 511.85517 ## M3 429.27618 ## M4 343.75319 ## M5 288.44819 ## M6 459.24182 ## All in a Days Work 65.48281 ## Saturday Morning 41.40515 ## The Good Old Days 165.24210 ## Love of the Loveless 174.64185 ## Girl 392.28702 ## Agony 312.06322 ## Rock Hard Times 312.37864 ## Restraining 185.59771 ## Lone Wolf 66.13469 ## Wrong 
About Bobby 98.83404 ## Love Me Do 126.50757 ## I Want to Hold Your Hand 294.67263 ## Cant Buy Me Love 100.20089 ## I Feel Fine 110.39972 ## Ticket to Ride 107.30853 ## Help 137.10594 ## Yesterday 173.50631 ## Yellow Submarine 91.49508 ## Eleanor Rigby 164.92667 ## Penny Lane 92.46240 ## B1 354.12025 ## B2 316.03761 ## B3 397.18666 ## B4 529.26679 ## B5 445.00551 ## B6 331.68283 ## B7 181.01349 ## B8 249.08280 ## The Memory of Trees 155.06430 ## Anywhere Is 79.31957 ## Pax Deorum 49.01748 ## Waterloo 146.04306 ## V11 58.83780 ## V12 175.94562 ## V13 61.44533 ## Hey Jude 219.53773 ``` --- class: bg-main1 # Plot Di's music <img src="lecture-5b-slides_files/figure-html/gg-di-music-1.png" width="90%" style="display: block; margin: auto;" /> --- class: bg-main1 # Plot Di's Music <img src="lecture-5b-slides_files/figure-html/gg-di-music-points-1.png" width="90%" style="display: block; margin: auto;" /> Abba is just different from everyone else! --- class: bg-main1 # Question time: .huge[ - "How does `data` appear different than `statistics` in the time series?" - "What format is the data in an audio file?" - "How is Abba different from the other music clips?", ]
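---
class: bg-main1

# Sketch: rectangling a two-channel signal

The wide `t` / `left` / `right` tibble shown earlier can be reshaped so that each row holds one sample from one channel, which makes per-channel statistics a `group_by()` plus `summarise()` away. This is a minimal sketch under stated assumptions: the signal is a made-up sine wave rather than one of the lecture's recordings, and the names `audio_wide`, `audio_long` and `audio_summary` are hypothetical. It also assumes tidyr 1.0.0 or later for `pivot_longer()`.

```r
library(dplyr)
library(tidyr)

# Assumption: a synthetic two-channel signal stands in for a real audio clip.
audio_wide <- tibble(
  t = 1:1000,
  left = as.integer(round(1000 * sin(2 * pi * t / 50))),
  right = as.integer(round(800 * sin(2 * pi * t / 50 + 0.5)))
)

# Reshape from one column per channel to one row per (time, channel) pair.
audio_long <- audio_wide %>%
  pivot_longer(
    cols = c(left, right),
    names_to = "channel",
    values_to = "value"
  )

# Summarise each channel, mirroring the statistics computed earlier.
audio_summary <- audio_long %>%
  group_by(channel) %>%
  summarise(
    m = mean(value),
    s = sd(value),
    mx = max(value),
    mn = min(value)
  )

audio_summary
```

Each row of `audio_long` is one sample from one channel, which is the shape that a step such as `filter(channel == "left")` expects.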
---
class: bg-main1

# Why look at audio data?

.huge[
- Data comes in many shapes and sizes
- Audio data can be transformed ("rectangled") into a data.frame
- Another type of data is data on the web.
- Extracting data from websites is called "web scraping".
]

---

# Scraping the web: what? why?

.huge[
- An increasing amount of data is available on the web.
- These data are provided in an unstructured format: you can always copy & paste, but it's time-consuming and prone to errors.
- Web scraping is the process of extracting this information automatically and transforming it into a structured dataset.
]

---

# Scraping the web: what? why?

.huge[
1. Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
2. Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.

- Why R? It includes all the tools necessary to do web scraping, it is familiar, and the data can be analysed directly. Python, Perl, and Java are also efficient tools.
]

---
class: bg-main1

# Web Scraping with `rvest` and `polite`

---
class: bg-main1

# Hypertext Markup Language

.huge[
Most of the data on the web is still largely available as HTML - while it is structured (hierarchical / tree based) it often is not available in a form useful for analysis (flat / tidy).
]

```html
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
```

---
class: bg-main1

# rvest + polite: Simplify processing and manipulating HTML data

.huge[
- `bow()` - check if the data can be scraped appropriately
- `scrape()` - scrape website data (with nice defaults)
- `html_nodes()` - select specified nodes from the HTML document using CSS selectors.
- `html_table()` - parse an HTML table into a data frame.
- `html_text()` - extract the text contained between a pair of tags.
]

---
class: bg-main1

# SelectorGadget: CSS selectors

.large[
- Using a tool called SelectorGadget to **help** identify the HTML elements of interest
- Does this by constructing a CSS selector which can be used to subset the HTML document.

Selector          | Example          | Description
----------------- |------------------| ------------------------------------------------
element           | `p`              | Select all `<p>` elements
element element   | `div p`          | Select all `<p>` elements inside a `<div>` element
element>element   | `div > p`        | Select all `<p>` elements with `<div>` as a parent
.class            | `.title`         | Select all elements with class="title"
\#id              | `#name`          | Select all elements with id="name"
[attribute]       | `[class]`        | Select all elements with a class attribute
[attribute=value] | `[class=title]`  | Select all elements with class="title"
]

---
class: bg-main1

# SelectorGadget

.vlarge[
- SelectorGadget: Open source tool that eases CSS selector generation and discovery
- Install the [Chrome Extension](https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb)
- A box will open in the bottom right of the website. Click on a page element that you would like your selector to match (it will turn green). SelectorGadget will then generate a minimal CSS selector for that element, and will highlight (yellow) everything that is matched by the selector.
- Now click on a highlighted element to remove it from the selector (red), or click on an unhighlighted element to add it to the selector. Through this process of selection and rejection, SelectorGadget helps you come up with the appropriate CSS selector for your needs.
]

---
class: bg-main1

## Top 250 movies on IMDB

.left-code.huge[
Take a look at the source code and look for the `table` tag:
<br>
http://www.imdb.com/chart/top
]

.right-plot[
<img src="images/imdb_top_250.png" width="90%" style="display: block; margin: auto;" />
]

---
class: bg-main1

## First check to make sure you're allowed!
```r
# install.packages("remotes")
# remotes::install_github("dmi3kno/polite")
library(polite)
bow("http://www.imdb.com")
```

```
## <polite session> http://www.imdb.com
## User-agent: polite R package - https://github.com/dmi3kno/polite
## robots.txt: 25 rules are defined for 1 bots
## Crawl delay: 5 sec
## The path is scrapable for this user-agent
```

--

```r
bow("http://www.facebook.com")
```

```
## <polite session> http://www.facebook.com
## User-agent: polite R package - https://github.com/dmi3kno/polite
## robots.txt: 313 rules are defined for 15 bots
## Crawl delay: 5 sec
## The path is not scrapable for this user-agent
```

---
class: bg-main1

# Join in

.vhuge[
Go to [rstudio.cloud](https://rstudio.cloud/) `\(\rightarrow\)` Lecture 5B `\(\rightarrow\)` Make a copy `\(\rightarrow\)` `lecture-5B.Rmd`
]

---
class: bg-main1

# Demo

.huge[
Let's go to http://www.imdb.com/chart/top
]

---
class: bg-main1

# Bow and scrape

```r
imdb_session <- bow("http://www.imdb.com/chart/top")
imdb_session
```

```
## <polite session> http://www.imdb.com/chart/top
## User-agent: polite R package - https://github.com/dmi3kno/polite
## robots.txt: 25 rules are defined for 1 bots
## Crawl delay: 5 sec
## The path is scrapable for this user-agent
```

```r
imdb_data <- scrape(imdb_session)
imdb_data
```

```
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<scrip ...
## [2] <body id="styleguide-v2" class="fixed">\n <img height="1" width="1" styl ...
```

---
class: bg-main1

# Select and format pieces: titles - `html_nodes()`

```r
library(rvest)
imdb_data %>%
  html_nodes(".titleColumn a")
```

```
## {xml_nodeset (250)}
## [1] <a href="/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...
## [2] <a href="/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...
## [3] <a href="/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## [4] <a href="/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## [5] <a href="/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## [6] <a href="/title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## [7] <a href="/title/tt0167260/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## [8] <a href="/title/tt0110912/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## [9] <a href="/title/tt0060196/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## [10] <a href="/title/tt0137523/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## [11] <a href="/title/tt0120737/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## [12] <a href="/title/tt0109830/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## [13] <a href="/title/tt1375666/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## [14] <a href="/title/tt0080684/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## [15] <a href="/title/tt0167261/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## [16] <a href="/title/tt0133093/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## [17] <a href="/title/tt0073486/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## [18] <a href="/title/tt0099685/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## [19] <a href="/title/tt0047478/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## [20] <a href="/title/tt0114369/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ... ## ... 
``` --- class: bg-main1 # Select and format pieces: titles - `html_text() ` ```r imdb_data %>% html_nodes(".titleColumn a") %>% html_text() ``` ``` ## [1] "The Shawshank Redemption" ## [2] "The Godfather" ## [3] "The Godfather: Part II" ## [4] "The Dark Knight" ## [5] "12 Angry Men" ## [6] "Schindler's List" ## [7] "The Lord of the Rings: The Return of the King" ## [8] "Pulp Fiction" ## [9] "The Good, the Bad and the Ugly" ## [10] "Fight Club" ## [11] "The Lord of the Rings: The Fellowship of the Ring" ## [12] "Forrest Gump" ## [13] "Inception" ## [14] "Star Wars: Episode V - The Empire Strikes Back" ## [15] "The Lord of the Rings: The Two Towers" ## [16] "The Matrix" ## [17] "One Flew Over the Cuckoo's Nest" ## [18] "Goodfellas" ## [19] "Seven Samurai" ## [20] "Se7en" ## [21] "Cidade de Deus" ## [22] "Life Is Beautiful" ## [23] "The Silence of the Lambs" ## [24] "Star Wars" ## [25] "It's a Wonderful Life" ## [26] "Saving Private Ryan" ## [27] "Spirited Away" ## [28] "The Green Mile" ## [29] "Léon: The Professional" ## [30] "Interstellar" ## [31] "The Usual Suspects" ## [32] "Avengers: Endgame" ## [33] "American History X" ## [34] "The Lion King" ## [35] "Back to the Future" ## [36] "Modern Times" ## [37] "The Pianist" ## [38] "Terminator 2: Judgment Day" ## [39] "The Intouchables" ## [40] "Psycho" ## [41] "Gladiator" ## [42] "City Lights" ## [43] "The Departed" ## [44] "Whiplash" ## [45] "Once Upon a Time in the West" ## [46] "The Prestige" ## [47] "Casablanca" ## [48] "Grave of the Fireflies" ## [49] "Rear Window" ## [50] "Cinema Paradiso" ## [51] "Raiders of the Lost Ark" ## [52] "Alien" ## [53] "Memento" ## [54] "Apocalypse Now" ## [55] "The Great Dictator" ## [56] "The Lives of Others" ## [57] "Avengers: Infinity War" ## [58] "Spider-Man: Into the Spider-Verse" ## [59] "Django Unchained" ## [60] "The Shining" ## [61] "Paths of Glory" ## [62] "WALL·E" ## [63] "Sunset Blvd." ## [64] "Dr. 
Strangelove or: How I Learned to Stop Worrying and Love the Bomb" ## [65] "Princess Mononoke" ## [66] "Oldeuboi" ## [67] "Witness for the Prosecution" ## [68] "The Dark Knight Rises" ## [69] "Once Upon a Time in America" ## [70] "Aliens" ## [71] "Coco" ## [72] "American Beauty" ## [73] "Kimi no na wa." ## [74] "Braveheart" ## [75] "Das Boot" ## [76] "3 Idiots" ## [77] "Taare Zameen Par" ## [78] "Star Wars: Episode VI - Return of the Jedi" ## [79] "Reservoir Dogs" ## [80] "Toy Story" ## [81] "Amadeus" ## [82] "Dangal" ## [83] "M - Eine Stadt sucht einen Mörder" ## [84] "Requiem for a Dream" ## [85] "Good Will Hunting" ## [86] "Vertigo" ## [87] "Inglourious Basterds" ## [88] "2001: A Space Odyssey" ## [89] "Eternal Sunshine of the Spotless Mind" ## [90] "Citizen Kane" ## [91] "Full Metal Jacket" ## [92] "Jagten" ## [93] "North by Northwest" ## [94] "A Clockwork Orange" ## [95] "Snatch" ## [96] "Amelie" ## [97] "The Kid" ## [98] "Bicycle Thieves" ## [99] "Scarface" ## [100] "Singin' in the Rain" ## [101] "Lawrence of Arabia" ## [102] "Toy Story 3" ## [103] "The Sting" ## [104] "Taxi Driver" ## [105] "Per qualche dollaro in più" ## [106] "Metropolis" ## [107] "Double Indemnity" ## [108] "Jodaeiye Nader az Simin" ## [109] "To Kill a Mockingbird" ## [110] "Ikiru" ## [111] "Indiana Jones and the Last Crusade" ## [112] "The Apartment" ## [113] "Up" ## [114] "L.A. 
Confidential" ## [115] "Monty Python and the Holy Grail" ## [116] "Incendies" ## [117] "Capharnaüm" ## [118] "Heat" ## [119] "Rashomon" ## [120] "Yojimbo" ## [121] "Batman Begins" ## [122] "Die Hard" ## [123] "Green Book" ## [124] "Unforgiven" ## [125] "Downfall" ## [126] "Bacheha-Ye aseman" ## [127] "Some Like It Hot" ## [128] "Howl's Moving Castle" ## [129] "The Great Escape" ## [130] "My Neighbor Totoro" ## [131] "All About Eve" ## [132] "Ran" ## [133] "A Beautiful Mind" ## [134] "Pan's Labyrinth" ## [135] "Casino" ## [136] "El secreto de sus ojos" ## [137] "Raging Bull" ## [138] "Idi i smotri" ## [139] "Lock, Stock and Two Smoking Barrels" ## [140] "Babam ve Oglum" ## [141] "The Treasure of the Sierra Madre" ## [142] "The Wolf of Wall Street" ## [143] "Judgment at Nuremberg" ## [144] "Three Billboards Outside Ebbing, Missouri" ## [145] "Chinatown" ## [146] "The Gold Rush" ## [147] "Inside Out" ## [148] "Dial M for Murder" ## [149] "V for Vendetta" ## [150] "Det sjunde inseglet" ## [151] "Warrior" ## [152] "There Will Be Blood" ## [153] "Room" ## [154] "Trainspotting" ## [155] "Andhadhun" ## [156] "No Country for Old Men" ## [157] "The Elephant Man" ## [158] "The Sixth Sense" ## [159] "Shutter Island" ## [160] "The Thing" ## [161] "Eskiya" ## [162] "The Bridge on the River Kwai" ## [163] "Gone with the Wind" ## [164] "The Third Man" ## [165] "On the Waterfront" ## [166] "Blade Runner" ## [167] "Smultronstället" ## [168] "Finding Nemo" ## [169] "Jurassic Park" ## [170] "Gran Torino" ## [171] "Fargo" ## [172] "Kill Bill: Vol. 1" ## [173] "The Deer Hunter" ## [174] "Tôkyô monogatari" ## [175] "The Big Lebowski" ## [176] "Stalker" ## [177] "The Truman Show" ## [178] "Relatos salvajes" ## [179] "Mary and Max" ## [180] "Hacksaw Ridge" ## [181] "Gone Girl" ## [182] "In the Name of the Father" ## [183] "Sherlock Jr." ## [184] "The General" ## [185] "How to Train Your Dragon" ## [186] "Mr. 
Smith Goes to Washington" ## [187] "The Grand Budapest Hotel" ## [188] "Salinui chueok" ## [189] "Persona" ## [190] "Before Sunrise" ## [191] "Catch Me If You Can" ## [192] "Toy Story 4" ## [193] "Into the Wild" ## [194] "Cool Hand Luke" ## [195] "Le salaire de la peur" ## [196] "12 Years a Slave" ## [197] "Network" ## [198] "Andrei Rublev" ## [199] "Life of Brian" ## [200] "Rang De Basanti" ## [201] "La passion de Jeanne d'Arc" ## [202] "Prisoners" ## [203] "Stand by Me" ## [204] "Mad Max: Fury Road" ## [205] "Platoon" ## [206] "Rush" ## [207] "Million Dollar Baby" ## [208] "Hachi: A Dog's Tale" ## [209] "Logan" ## [210] "Ben-Hur" ## [211] "Barry Lyndon" ## [212] "Hotel Rwanda" ## [213] "Amores perros" ## [214] "Kaze no tani no Naushika" ## [215] "Spotlight" ## [216] "Harry Potter and the Deathly Hallows: Part 2" ## [217] "Les quatre cents coups" ## [218] "Dead Poets Society" ## [219] "Rebecca" ## [220] "Rocky" ## [221] "Gangs of Wasseypur" ## [222] "Monsters, Inc." ## [223] "Lagaan: Once Upon a Time in India" ## [224] "It Happened One Night" ## [225] "La haine" ## [226] "El ángel exterminador" ## [227] "Once Upon a Time in Hollywood" ## [228] "Ah-ga-ssi" ## [229] "Munna Bhai M.B.B.S." ## [230] "The Princess Bride" ## [231] "Swades: We, the People" ## [232] "PK" ## [233] "Butch Cassidy and the Sundance Kid" ## [234] "The Help" ## [235] "Drishyam" ## [236] "Before Sunset" ## [237] "A Wednesday" ## [238] "Paris, Texas" ## [239] "The Terminator" ## [240] "Akira" ## [241] "Sholay" ## [242] "Faa yeung nin wa" ## [243] "La battaglia di Algeri" ## [244] "The End of Evangelion" ## [245] "Mou gaan dou" ## [246] "Guardians of the Galaxy" ## [247] "La leggenda del pianista sull'oceano" ## [248] "Kis Uykusu" ## [249] "Aladdin" ## [250] "Das Cabinet des Dr. 
Caligari" ``` --- class: bg-main1 # Select and format pieces: save it ```r titles <- imdb_data %>% html_nodes(".titleColumn a") %>% html_text() ``` --- class: bg-main1 # Select and format pieces: years - nodes ```r imdb_data %>% html_nodes(".secondaryInfo") ``` ``` ## {xml_nodeset (250)} ## [1] <span class="secondaryInfo">(1994)</span> ## [2] <span class="secondaryInfo">(1972)</span> ## [3] <span class="secondaryInfo">(1974)</span> ## [4] <span class="secondaryInfo">(2008)</span> ## [5] <span class="secondaryInfo">(1957)</span> ## [6] <span class="secondaryInfo">(1993)</span> ## [7] <span class="secondaryInfo">(2003)</span> ## [8] <span class="secondaryInfo">(1994)</span> ## [9] <span class="secondaryInfo">(1966)</span> ## [10] <span class="secondaryInfo">(1999)</span> ## [11] <span class="secondaryInfo">(2001)</span> ## [12] <span class="secondaryInfo">(1994)</span> ## [13] <span class="secondaryInfo">(2010)</span> ## [14] <span class="secondaryInfo">(1980)</span> ## [15] <span class="secondaryInfo">(2002)</span> ## [16] <span class="secondaryInfo">(1999)</span> ## [17] <span class="secondaryInfo">(1975)</span> ## [18] <span class="secondaryInfo">(1990)</span> ## [19] <span class="secondaryInfo">(1954)</span> ## [20] <span class="secondaryInfo">(1995)</span> ## ... 
``` --- class: bg-main1 # Select and format pieces: years - text ```r imdb_data %>% html_nodes(".secondaryInfo") %>% html_text() ``` ``` ## [1] "(1994)" "(1972)" "(1974)" "(2008)" "(1957)" "(1993)" "(2003)" "(1994)" "(1966)" ## [10] "(1999)" "(2001)" "(1994)" "(2010)" "(1980)" "(2002)" "(1999)" "(1975)" "(1990)" ## [19] "(1954)" "(1995)" "(2002)" "(1997)" "(1991)" "(1977)" "(1946)" "(1998)" "(2001)" ## [28] "(1999)" "(1994)" "(2014)" "(1995)" "(2019)" "(1998)" "(1994)" "(1985)" "(1936)" ## [37] "(2002)" "(1991)" "(2011)" "(1960)" "(2000)" "(1931)" "(2006)" "(2014)" "(1968)" ## [46] "(2006)" "(1942)" "(1988)" "(1954)" "(1988)" "(1981)" "(1979)" "(2000)" "(1979)" ## [55] "(1940)" "(2006)" "(2018)" "(2018)" "(2012)" "(1980)" "(1957)" "(2008)" "(1950)" ## [64] "(1964)" "(1997)" "(2003)" "(1957)" "(2012)" "(1984)" "(1986)" "(2017)" "(1999)" ## [73] "(2016)" "(1995)" "(1981)" "(2009)" "(2007)" "(1983)" "(1992)" "(1995)" "(1984)" ## [82] "(2016)" "(1931)" "(2000)" "(1997)" "(1958)" "(2009)" "(1968)" "(2004)" "(1941)" ## [91] "(1987)" "(2012)" "(1959)" "(1971)" "(2000)" "(2001)" "(1921)" "(1948)" "(1983)" ## [100] "(1952)" "(1962)" "(2010)" "(1973)" "(1976)" "(1965)" "(1927)" "(1944)" "(2011)" ## [109] "(1962)" "(1952)" "(1989)" "(1960)" "(2009)" "(1997)" "(1975)" "(2010)" "(2018)" ## [118] "(1995)" "(1950)" "(1961)" "(2005)" "(1988)" "(2018)" "(1992)" "(2004)" "(1997)" ## [127] "(1959)" "(2004)" "(1963)" "(1988)" "(1950)" "(1985)" "(2001)" "(2006)" "(1995)" ## [136] "(2009)" "(1980)" "(1985)" "(1998)" "(2005)" "(1948)" "(2013)" "(1961)" "(2017)" ## [145] "(1974)" "(1925)" "(2015)" "(1954)" "(2005)" "(1957)" "(2011)" "(2007)" "(2015)" ## [154] "(1996)" "(2018)" "(2007)" "(1980)" "(1999)" "(2010)" "(1982)" "(1996)" "(1957)" ## [163] "(1939)" "(1949)" "(1954)" "(1982)" "(1957)" "(2003)" "(1993)" "(2008)" "(1996)" ## [172] "(2003)" "(1978)" "(1953)" "(1998)" "(1979)" "(1998)" "(2014)" "(2009)" "(2016)" ## [181] "(2014)" "(1993)" "(1924)" "(1926)" "(2010)" "(1939)" "(2014)" 
"(2003)" "(1966)" ## [190] "(1995)" "(2002)" "(2019)" "(2007)" "(1967)" "(1953)" "(2013)" "(1976)" "(1966)" ## [199] "(1979)" "(2006)" "(1928)" "(2013)" "(1986)" "(2015)" "(1986)" "(2013)" "(2004)" ## [208] "(2009)" "(2017)" "(1959)" "(1975)" "(2004)" "(2000)" "(1984)" "(2015)" "(2011)" ## [217] "(1959)" "(1989)" "(1940)" "(1976)" "(2012)" "(2001)" "(2001)" "(1934)" "(1995)" ## [226] "(1962)" "(2019)" "(2016)" "(2003)" "(1987)" "(2004)" "(2014)" "(1969)" "(2011)" ## [235] "(2015)" "(2004)" "(2008)" "(1984)" "(1984)" "(1988)" "(1975)" "(2000)" "(1966)" ## [244] "(1997)" "(2002)" "(2014)" "(1998)" "(2014)" "(1992)" "(1920)" ``` --- class: bg-main1 # Select and format pieces: years - remove-brackets ```r imdb_data %>% html_nodes(".secondaryInfo") %>% html_text() %>% str_replace("\\(", "") %>% # remove ( str_replace("\\)", "") %>% # remove ) as.numeric() ``` ``` ## [1] 1994 1972 1974 2008 1957 1993 2003 1994 1966 1999 2001 1994 2010 1980 2002 1999 1975 ## [18] 1990 1954 1995 2002 1997 1991 1977 1946 1998 2001 1999 1994 2014 1995 2019 1998 1994 ## [35] 1985 1936 2002 1991 2011 1960 2000 1931 2006 2014 1968 2006 1942 1988 1954 1988 1981 ## [52] 1979 2000 1979 1940 2006 2018 2018 2012 1980 1957 2008 1950 1964 1997 2003 1957 2012 ## [69] 1984 1986 2017 1999 2016 1995 1981 2009 2007 1983 1992 1995 1984 2016 1931 2000 1997 ## [86] 1958 2009 1968 2004 1941 1987 2012 1959 1971 2000 2001 1921 1948 1983 1952 1962 2010 ## [103] 1973 1976 1965 1927 1944 2011 1962 1952 1989 1960 2009 1997 1975 2010 2018 1995 1950 ## [120] 1961 2005 1988 2018 1992 2004 1997 1959 2004 1963 1988 1950 1985 2001 2006 1995 2009 ## [137] 1980 1985 1998 2005 1948 2013 1961 2017 1974 1925 2015 1954 2005 1957 2011 2007 2015 ## [154] 1996 2018 2007 1980 1999 2010 1982 1996 1957 1939 1949 1954 1982 1957 2003 1993 2008 ## [171] 1996 2003 1978 1953 1998 1979 1998 2014 2009 2016 2014 1993 1924 1926 2010 1939 2014 ## [188] 2003 1966 1995 2002 2019 2007 1967 1953 2013 1976 1966 1979 2006 1928 2013 1986 2015 ## 
[205] 1986 2013 2004 2009 2017 1959 1975 2004 2000 1984 2015 2011 1959 1989 1940 1976 2012 ## [222] 2001 2001 1934 1995 1962 2019 2016 2003 1987 2004 2014 1969 2011 2015 2004 2008 1984 ## [239] 1984 1988 1975 2000 1966 1997 2002 2014 1998 2014 1992 1920 ``` --- class: bg-main1 # Select and format pieces: years - `parse_number()` ```r imdb_data %>% html_nodes(".secondaryInfo") %>% html_text() %>% parse_number() ``` ``` ## [1] 1994 1972 1974 2008 1957 1993 2003 1994 1966 1999 2001 1994 2010 1980 2002 1999 1975 ## [18] 1990 1954 1995 2002 1997 1991 1977 1946 1998 2001 1999 1994 2014 1995 2019 1998 1994 ## [35] 1985 1936 2002 1991 2011 1960 2000 1931 2006 2014 1968 2006 1942 1988 1954 1988 1981 ## [52] 1979 2000 1979 1940 2006 2018 2018 2012 1980 1957 2008 1950 1964 1997 2003 1957 2012 ## [69] 1984 1986 2017 1999 2016 1995 1981 2009 2007 1983 1992 1995 1984 2016 1931 2000 1997 ## [86] 1958 2009 1968 2004 1941 1987 2012 1959 1971 2000 2001 1921 1948 1983 1952 1962 2010 ## [103] 1973 1976 1965 1927 1944 2011 1962 1952 1989 1960 2009 1997 1975 2010 2018 1995 1950 ## [120] 1961 2005 1988 2018 1992 2004 1997 1959 2004 1963 1988 1950 1985 2001 2006 1995 2009 ## [137] 1980 1985 1998 2005 1948 2013 1961 2017 1974 1925 2015 1954 2005 1957 2011 2007 2015 ## [154] 1996 2018 2007 1980 1999 2010 1982 1996 1957 1939 1949 1954 1982 1957 2003 1993 2008 ## [171] 1996 2003 1978 1953 1998 1979 1998 2014 2009 2016 2014 1993 1924 1926 2010 1939 2014 ## [188] 2003 1966 1995 2002 2019 2007 1967 1953 2013 1976 1966 1979 2006 1928 2013 1986 2015 ## [205] 1986 2013 2004 2009 2017 1959 1975 2004 2000 1984 2015 2011 1959 1989 1940 1976 2012 ## [222] 2001 2001 1934 1995 1962 2019 2016 2003 1987 2004 2014 1969 2011 2015 2004 2008 1984 ## [239] 1984 1988 1975 2000 1966 1997 2002 2014 1998 2014 1992 1920 ``` --- class: bg-main1 # Select and format pieces: years - remove-brackets ```r years <- imdb_data %>% html_nodes(".secondaryInfo") %>% html_text() %>% str_replace("\\(", "") %>% # remove ( 
str_replace("\\)", "") %>% # remove ) as.numeric() ``` --- class: bg-main1 # Select and format pieces: scores - nodes ```r imdb_data %>% html_nodes(".imdbRating strong") ``` ``` ## {xml_nodeset (250)} ## [1] <strong title="9.2 based on 2,128,368 user ratings">9.2</strong> ## [2] <strong title="9.1 based on 1,461,399 user ratings">9.1</strong> ## [3] <strong title="9.0 based on 1,016,494 user ratings">9.0</strong> ## [4] <strong title="9.0 based on 2,093,060 user ratings">9.0</strong> ## [5] <strong title="8.9 based on 606,501 user ratings">8.9</strong> ## [6] <strong title="8.9 based on 1,104,626 user ratings">8.9</strong> ## [7] <strong title="8.9 based on 1,514,027 user ratings">8.9</strong> ## [8] <strong title="8.9 based on 1,669,018 user ratings">8.9</strong> ## [9] <strong title="8.8 based on 632,203 user ratings">8.8</strong> ## [10] <strong title="8.8 based on 1,701,403 user ratings">8.8</strong> ## [11] <strong title="8.8 based on 1,529,661 user ratings">8.8</strong> ## [12] <strong title="8.8 based on 1,637,303 user ratings">8.8</strong> ## [13] <strong title="8.7 based on 1,866,315 user ratings">8.7</strong> ## [14] <strong title="8.7 based on 1,065,616 user ratings">8.7</strong> ## [15] <strong title="8.7 based on 1,369,352 user ratings">8.7</strong> ## [16] <strong title="8.6 based on 1,531,882 user ratings">8.6</strong> ## [17] <strong title="8.6 based on 842,283 user ratings">8.6</strong> ## [18] <strong title="8.6 based on 919,580 user ratings">8.6</strong> ## [19] <strong title="8.6 based on 288,358 user ratings">8.6</strong> ## [20] <strong title="8.6 based on 1,307,507 user ratings">8.6</strong> ## ... 
``` --- class: bg-main1 # Select and format pieces: scores - text ```r imdb_data %>% html_nodes(".imdbRating strong") %>% html_text() ``` ``` ## [1] "9.2" "9.1" "9.0" "9.0" "8.9" "8.9" "8.9" "8.9" "8.8" "8.8" "8.8" "8.8" "8.7" "8.7" ## [15] "8.7" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.5" "8.5" "8.5" ## [29] "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" ## [43] "8.5" "8.5" "8.5" "8.5" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" ## [57] "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.3" "8.3" "8.3" "8.3" "8.3" ## [71] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" ## [85] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.2" "8.2" ## [99] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" ## [113] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" ## [127] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" ## [141] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" ## [155] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" ## [169] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" ## [183] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" ## [197] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.0" ## [211] "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" ## [225] "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" ## [239] "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" ``` --- class: bg-main1 # Select and format pieces: scores - as-numeric ```r imdb_data %>% html_nodes(".imdbRating strong") %>% html_text() %>% as.numeric() ``` ``` ## [1] 9.2 9.1 9.0 9.0 8.9 8.9 8.9 8.9 
8.8 8.8 8.8 8.8 8.7 8.7 8.7 8.6 8.6 8.6 8.6 8.6 8.6 ## [22] 8.6 8.6 8.6 8.6 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 ## [43] 8.5 8.5 8.5 8.5 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 ## [64] 8.4 8.4 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 ## [85] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 ## [106] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 ## [127] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.1 8.1 8.1 8.1 8.1 8.1 8.1 ## [148] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 ## [169] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 ## [190] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.0 ## [211] 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 ## [232] 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 ``` --- class: bg-main1 # Select and format pieces: scores - save ```r scores <- imdb_data %>% html_nodes(".imdbRating strong") %>% html_text() %>% as.numeric() ``` --- class: bg-main1 # Select and format pieces: put it all together ```r imdb_top_250 <- tibble(title = titles, year = years, score = scores) imdb_top_250 ``` ``` ## # A tibble: 250 x 3 ## title year score ## <chr> <dbl> <dbl> ## 1 The Shawshank Redemption 1994 9.2 ## 2 The Godfather 1972 9.1 ## 3 The Godfather: Part II 1974 9 ## 4 The Dark Knight 2008 9 ## 5 12 Angry Men 1957 8.9 ## 6 Schindler's List 1993 8.9 ## 7 The Lord of the Rings: The Return of the King 2003 8.9 ## 8 Pulp Fiction 1994 8.9 ## 9 The Good, the Bad and the Ugly 1966 8.8 ## 10 Fight Club 1999 8.8 ## # … with 240 more rows ``` --- <table> <thead> <tr> <th style="text-align:left;"> title </th> <th style="text-align:left;"> year </th> <th style="text-align:left;"> score </th> </tr> </thead> 
<tbody> <tr> <td style="text-align:left;"> The Shawshank Redemption </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 9.2 </td> </tr> <tr> <td style="text-align:left;"> The Godfather </td> <td style="text-align:left;"> 1972 </td> <td style="text-align:left;"> 9.1 </td> </tr> <tr> <td style="text-align:left;"> The Godfather: Part II </td> <td style="text-align:left;"> 1974 </td> <td style="text-align:left;"> 9 </td> </tr> <tr> <td style="text-align:left;"> The Dark Knight </td> <td style="text-align:left;"> 2008 </td> <td style="text-align:left;"> 9 </td> </tr> <tr> <td style="text-align:left;"> 12 Angry Men </td> <td style="text-align:left;"> 1957 </td> <td style="text-align:left;"> 8.9 </td> </tr> <tr> <td style="text-align:left;"> Schindler's List </td> <td style="text-align:left;"> 1993 </td> <td style="text-align:left;"> 8.9 </td> </tr> <tr> <td style="text-align:left;"> The Lord of the Rings: The Return of the King </td> <td style="text-align:left;"> 2003 </td> <td style="text-align:left;"> 8.9 </td> </tr> <tr> <td style="text-align:left;"> Pulp Fiction </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 8.9 </td> </tr> <tr> <td style="text-align:left;"> The Good, the Bad and the Ugly </td> <td style="text-align:left;"> 1966 </td> <td style="text-align:left;"> 8.8 </td> </tr> <tr> <td style="text-align:left;"> Fight Club </td> <td style="text-align:left;"> 1999 </td> <td style="text-align:left;"> 8.8 </td> </tr> <tr> <td style="text-align:left;"> The Lord of the Rings: The Fellowship of the Ring </td> <td style="text-align:left;"> 2001 </td> <td style="text-align:left;"> 8.8 </td> </tr> <tr> <td style="text-align:left;"> Forrest Gump </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 8.8 </td> </tr> <tr> <td style="text-align:left;"> Inception </td> <td style="text-align:left;"> 2010 </td> <td style="text-align:left;"> 8.7 </td> </tr> <tr> <td style="text-align:left;"> Star Wars: 
Episode V - The Empire Strikes Back </td> <td style="text-align:left;"> 1980 </td> <td style="text-align:left;"> 8.7 </td> </tr> <tr> <td style="text-align:left;"> The Lord of the Rings: The Two Towers </td> <td style="text-align:left;"> 2002 </td> <td style="text-align:left;"> 8.7 </td> </tr> <tr> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> </tr> </tbody> </table> --- class: bg-main1 # Aside: Yet another approach - pull the table with `html_table()` .vlarge[ - requires notation we haven't used yet (e.g., what is `[[]]`) - requires substantial text cleaning - If there is time we can cover this at the end of class ] ```r imdb_table <- html_table(imdb_data) glimpse(imdb_table[[1]]) ``` ``` ## Observations: 250 ## Variables: 5 ## $ `` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ `Rank & Title` <chr> "1.\n The Shawshank Redemption\n (1994)", "2.\n … ## $ `IMDb Rating` <dbl> 9.2, 9.1, 9.0, 9.0, 8.9, 8.9, 8.9, 8.9, 8.8, 8.8, 8.8, 8.8, 8.7,… ## $ `Your Rating` <chr> "12345678910\n \n \n \n NOT … ## $ `` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ``` --- class: bg-main1 # Clean up / enhance .huge[ May or may not be a lot of work depending on how messy the data are - See if you like what you got: ] ```r glimpse(imdb_top_250) ``` ``` ## Observations: 250 ## Variables: 3 ## $ title <chr> "The Shawshank Redemption", "The Godfather", "The Godfather: Part II", "T… ## $ year <dbl> 1994, 1972, 1974, 2008, 1957, 1993, 2003, 1994, 1966, 1999, 2001, 1994, 2… ## $ score <dbl> 9.2, 9.1, 9.0, 9.0, 8.9, 8.9, 8.9, 8.9, 8.8, 8.8, 8.8, 8.8, 8.7, 8.7, 8.7… ``` --- class: bg-main1 # Clean up / enhance .huge[ - Add a variable for rank ] ```r imdb_top_250 %>% mutate( rank = 1:nrow(imdb_top_250) ) ``` ``` ## # A tibble: 250 x 4 ## title year score rank ## <chr> <dbl> <dbl> <int> ## 1 The Shawshank Redemption 1994 9.2 1 ## 2 The Godfather 1972 9.1 2 ## 3 The 
Godfather: Part II 1974 9 3 ## 4 The Dark Knight 2008 9 4 ## 5 12 Angry Men 1957 8.9 5 ## 6 Schindler's List 1993 8.9 6 ## 7 The Lord of the Rings: The Return of the King 2003 8.9 7 ## 8 Pulp Fiction 1994 8.9 8 ## 9 The Good, the Bad and the Ugly 1966 8.8 9 ## 10 Fight Club 1999 8.8 10 ## # … with 240 more rows ``` --- <table> <thead> <tr> <th style="text-align:left;"> title </th> <th style="text-align:left;"> year </th> <th style="text-align:left;"> score </th> <th style="text-align:left;"> rank </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> The Shawshank Redemption </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 9.2 </td> <td style="text-align:left;"> 1 </td> </tr> <tr> <td style="text-align:left;"> The Godfather </td> <td style="text-align:left;"> 1972 </td> <td style="text-align:left;"> 9.1 </td> <td style="text-align:left;"> 2 </td> </tr> <tr> <td style="text-align:left;"> The Godfather: Part II </td> <td style="text-align:left;"> 1974 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 3 </td> </tr> <tr> <td style="text-align:left;"> The Dark Knight </td> <td style="text-align:left;"> 2008 </td> <td style="text-align:left;"> 9 </td> <td style="text-align:left;"> 4 </td> </tr> <tr> <td style="text-align:left;"> 12 Angry Men </td> <td style="text-align:left;"> 1957 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 5 </td> </tr> <tr> <td style="text-align:left;"> Schindler's List </td> <td style="text-align:left;"> 1993 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 6 </td> </tr> <tr> <td style="text-align:left;"> The Lord of the Rings: The Return of the King </td> <td style="text-align:left;"> 2003 </td> <td style="text-align:left;"> 8.9 </td> <td style="text-align:left;"> 7 </td> </tr> <tr> <td style="text-align:left;"> Pulp Fiction </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 8.9 </td> <td 
style="text-align:left;"> 8 </td> </tr> <tr> <td style="text-align:left;"> The Good, the Bad and the Ugly </td> <td style="text-align:left;"> 1966 </td> <td style="text-align:left;"> 8.8 </td> <td style="text-align:left;"> 9 </td> </tr> <tr> <td style="text-align:left;"> Fight Club </td> <td style="text-align:left;"> 1999 </td> <td style="text-align:left;"> 8.8 </td> <td style="text-align:left;"> 10 </td> </tr> <tr> <td style="text-align:left;"> The Lord of the Rings: The Fellowship of the Ring </td> <td style="text-align:left;"> 2001 </td> <td style="text-align:left;"> 8.8 </td> <td style="text-align:left;"> 11 </td> </tr> <tr> <td style="text-align:left;"> Forrest Gump </td> <td style="text-align:left;"> 1994 </td> <td style="text-align:left;"> 8.8 </td> <td style="text-align:left;"> 12 </td> </tr> <tr> <td style="text-align:left;"> Inception </td> <td style="text-align:left;"> 2010 </td> <td style="text-align:left;"> 8.7 </td> <td style="text-align:left;"> 13 </td> </tr> <tr> <td style="text-align:left;"> Star Wars: Episode V - The Empire Strikes Back </td> <td style="text-align:left;"> 1980 </td> <td style="text-align:left;"> 8.7 </td> <td style="text-align:left;"> 14 </td> </tr> <tr> <td style="text-align:left;"> The Lord of the Rings: The Two Towers </td> <td style="text-align:left;"> 2002 </td> <td style="text-align:left;"> 8.7 </td> <td style="text-align:left;"> 15 </td> </tr> <tr> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> <td style="text-align:left;"> ... </td> </tr> </tbody> </table> --- class: bg-main5 # Your Turn .huge[ How would you go about answering this question: Which 1995 movies made the list? 
]

---
class: bg-main1

```r
imdb_top_250 %>%
  filter(year == 1995)
```

```
## # A tibble: 8 x 3
##   title               year score
##   <chr>              <dbl> <dbl>
## 1 Se7en               1995   8.6
## 2 The Usual Suspects  1995   8.5
## 3 Braveheart          1995   8.3
## 4 Toy Story           1995   8.3
## 5 Heat                1995   8.2
## 6 Casino              1995   8.2
## 7 Before Sunrise      1995   8.1
## 8 La haine            1995   8
```

---
class: bg-main1

# Your Turn

.huge[
How would you go about answering this question:

Which years have the most movies on the list?
]

--

```r
imdb_top_250 %>%
  group_by(year) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  head(5)
```

```
## # A tibble: 5 x 2
##    year total
##   <dbl> <int>
## 1  1995     8
## 2  2014     8
## 3  2004     7
## 4  1957     6
## 5  1997     6
```

---
class: bg-main1

# Your Turn

.huge[
How would you go about creating this visualization:

Visualize the average yearly score for movies that made it onto the top 250 list over time.
]

--

```r
imdb_top_250 %>%
  group_by(year) %>%
  summarise(avg_score = mean(score)) %>%
  ggplot(aes(y = avg_score, x = year)) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlab("year")
```

---
class: bg-main1

<img src="lecture-5b-slides_files/figure-html/visualise-score-year-print-1.png" width="90%" style="display: block; margin: auto;" />

---
class: bg-main1

# Other common formats: JSON

.huge[
- JavaScript Object Notation (JSON).
- A language-independent data format, often used in place of Extensible Markup Language (XML).
- Data are sometimes stored as JSON, which requires special unpacking
]

---
class: bg-main1

# Unpacking JSON: Example JSON from [jsonlite](https://cran.r-project.org/web/packages/jsonlite/vignettes/json-aaquickstart.html)

.pull-left[
```r
library(jsonlite)
json_mario <- '[
  {
    "Name": "Mario",
    "Age": 32,
    "Occupation": "Plumber"
  },
  {
    "Name": "Peach",
    "Age": 21,
    "Occupation": "Princess"
  },
  {},
  {
    "Name": "Bowser",
    "Occupation": "Koopa"
  }
]'
```
]

.pull-right[
```r
mydf <- fromJSON(json_mario)
mydf
```

```
##     Name Age Occupation
## 1  Mario  32    Plumber
## 2  Peach  21   Princess
## 3   <NA>  NA       <NA>
## 4 Bowser  NA      Koopa
```
]

---

# Potential challenges with web scraping

.huge[
- Unreliable formatting at the source
- Data broken into many pages
- Data arriving in multiple Excel file formats
- ...

We will come back to this when we learn about functions next week.

> Compare the display of information at [Gumtree Melbourne](https://www.gumtree.com.au/s-monash/l3001600) to the IMDb Top 250 list. What challenges can you foresee in scraping a list of the available apartments?
]

---

# Further exploring

People write R packages to access online data! Check out:

.huge[
- [cricinfo by Sayani Gupta and Rob Hyndman](https://docs.ropensci.org/cricketdata/)
- [rwalkr by Earo Wang](https://github.com/earowang/rwalkr)
- [fitzRoy for AFL data](https://github.com/jimmyday12/fitzRoy/)
- [Top 40 lists of R packages by Joe Rickert](https://rviews.rstudio.com/2019/07/24/june-2019-top-40-r-packages/) - they usually include a "data" section.
]

---

# Share and share alike

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.
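
---
class: bg-main1

# Aside: the scraping steps on a toy page

.huge[
A minimal sketch of the select-and-format steps from the IMDb example, run on a small hand-written HTML snippet so it works offline. The snippet, the movie names, and the `.titleColumn a` selector are assumptions made for illustration; they are not the real IMDb markup, which may change at any time.
]

```r
library(rvest)   # read_html(), html_nodes(), html_text(), %>%
library(readr)   # parse_number()
library(tibble)  # tibble()

# Hypothetical stand-in for a scraped page (not real IMDb markup).
toy_html <- '<table>
  <tr>
    <td class="titleColumn"><a>Movie A</a> <span class="secondaryInfo">(1994)</span></td>
    <td class="imdbRating"><strong>9.2</strong></td>
  </tr>
  <tr>
    <td class="titleColumn"><a>Movie B</a> <span class="secondaryInfo">(1972)</span></td>
    <td class="imdbRating"><strong>9.1</strong></td>
  </tr>
</table>'

toy_data <- read_html(toy_html)

# Select each piece with a CSS selector, then convert it to a useful type.
toy_titles <- toy_data %>%
  html_nodes(".titleColumn a") %>%
  html_text()

toy_years <- toy_data %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  parse_number()   # drops the surrounding brackets

toy_scores <- toy_data %>%
  html_nodes(".imdbRating strong") %>%
  html_text() %>%
  as.numeric()

# Put the pieces together, one row per movie.
toy_top <- tibble(title = toy_titles, year = toy_years, score = toy_scores)
toy_top
```

.huge[
Testing selectors on a small snippet like this is one way to check your code before sending requests to a real site.
]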