While the song is playing...
Draw a mental model / concept map of last lecture's content on Missing Data.
<- and %>%, with hop(), scoop(), and bop()

Little bunny Foo Foo
Went hopping through the forest
Scooping up the field mice
And bopping them on the head
foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)
foo_foo <- hop(foo_foo, through = forest)
foo_foo <- scoop(foo_foo, up = field_mice)
foo_foo <- bop(foo_foo, on = head)
(foo_foo written 6 times!) Obscures what changes.

bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice
  ),
  on = head
)
%>% can help!

f(x)
g(f(x))
h(g(f(x)))
x %>% f()
x %>% f() %>% g()
x %>% f() %>% g() %>% h()
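A minimal sketch of the equivalence above. The functions f, g, and h here are hypothetical toy definitions chosen only so the two spellings can be compared; the pipe comes from the magrittr package.

```r
library(magrittr)  # provides the %>% pipe

# Hypothetical toy functions, used only for illustration
f <- function(x) x + 1
g <- function(x) x * 2
h <- function(x) x^2

x <- 3

nested <- h(g(f(x)))                 # read inside-out
piped  <- x %>% f() %>% g() %>% h()  # read left to right

identical(nested, piped)
```

Both spellings call the same functions in the same order; only the reading direction changes.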
%>%
foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)
Foo Foo hops, then scoops, then bops.
## # A tibble: 50,001 x 3
##        t  left right
##    <int> <int> <int>
##  1     1    28    29
##  2     2    27    28
##  3     3    26    24
##  4     4    24    27
##  5     5    22    18
##  6     6    15    19
##  7     7    15    13
##  8     8    12    13
##  9     9    15    16
## 10    10    18    16
## # … with 49,991 more rows
Oh, same sound is on both channels! A tad drab.
df_wavs %>%
  filter(channel == "left") %>%
  group_by(word) %>%
  summarise(
    m = mean(value),
    s = sd(value),
    mx = max(value),
    mn = min(value)
  ) %>%
  as_table()
word | m | s | mx | mn |
---|---|---|---|---|
data | 0.004 | 1602.577 | 8393 | -15386 |
word | 0.009 | 1506.626 | 6601 | -11026 |
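The summary above can be tried on a small stand-in. This is only a sketch: toy_wavs is a hypothetical data frame with invented values that mimics the columns used on df_wavs (word, channel, value), and the final as_table() step is left out because its definition is not shown here.

```r
library(dplyr)

# Hypothetical stand-in for df_wavs; values are made up for illustration
toy_wavs <- tibble(
  word    = rep(c("data", "word"), each = 4),
  channel = rep(c("left", "right"), times = 4),
  value   = c(10, 12, -4, -6, 3, 5, -9, -7)
)

toy_summary <- toy_wavs %>%
  filter(channel == "left") %>%   # keep one channel only
  group_by(word) %>%              # one summary row per word
  summarise(
    m  = mean(value),
    s  = sd(value),
    mx = max(value),
    mn = min(value)
  )

toy_summary
```

The result has one row per word, with the same four summary columns as the table above.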
##                          artist    type             lvar        lave  lmax    lfener
## Dancing Queen            Abba      Rock       17600755.6 -90.0068667 29921 105.92095
## Knowing Me               Abba      Rock        9543020.9 -75.7667187 27626 102.83616
## Take a Chance            Abba      Rock        9049481.5 -98.0629244 26372 102.32488
## Mamma Mia                Abba      Rock        7557437.3 -90.4710616 28898 101.61648
## Lay All You              Abba      Rock        6282285.6 -88.9526309 27940 100.30076
## Super Trouper            Abba      Rock        4665866.7 -69.0208450 25531 100.24848
## I Have A Dream           Abba      Rock        3369670.3 -71.6828788 14699 104.59686
## The Winner               Abba      Rock        1135862.0 -67.8190467  8928 104.34921
## Money                    Abba      Rock        6146942.6 -76.2807500 22962 102.24066
## SOS                      Abba      Rock        3482881.8 -74.1300038 15517 104.36243
## V1                       Vivaldi   Classical   3677296.2  66.6530970 24229  99.25243
## V2                       Vivaldi   Classical    771491.8  21.6560941  6936 104.36737
## V3                       Vivaldi   Classical   5227573.2  88.6465556 17721 104.61255
## V4                       Vivaldi   Classical    334719.1  13.8318847  4123 104.35005
## V5                       Vivaldi   Classical    836904.9  34.5677377 17306  88.82821
## V6                       Vivaldi   Classical  13936436.7 216.2317586 30355 104.82354
## V7                       Vivaldi   Classical   3636324.3   9.8366363 21450 100.52727
## V8                       Vivaldi   Classical    295397.4   2.8143227  2985 108.28717
## V9                       Vivaldi   Classical   4335879.0  10.9015767 22271 101.32881
## V10                      Vivaldi   Classical    472630.0   3.8890862  8194  98.83731
## M1                       Mozart    Classical   2819795.0  -5.8667602 14939 102.25358
## M2                       Mozart    Classical   2836957.9  -5.6074580 13382 102.04646
## M3                       Mozart    Classical   9089372.1  -5.9719205 25265 102.23796
## M4                       Mozart    Classical   4056229.7  -6.2272904 21328  97.59887
## M5                       Mozart    Classical   1568925.6  -6.1993790 15839  99.47580
## M6                       Mozart    Classical   7758409.1  -5.7700183 22496 101.66942
## All in a Days Work       Eels      Rock       40275677.4 -10.6893416 32759 106.07617
## Saturday Morning         Eels      Rock      129472199.3  50.2115773 32759 114.00229
## The Good Old Days        Eels      Rock       18838849.0  -0.0222764 30386 105.45611
## Love of the Loveless     Eels      Rock       43201194.3   1.8926897 32759 108.37835
## Girl                     Eels      Rock       88547131.0   0.3358761 32744 112.00916
## Agony                    Eels      Rock       16285811.4  -0.1405876 30106 103.94171
## Rock Hard Times          Eels      Rock       54651552.8   1.9848514 32759 108.87503
## Restraining              Eels      Rock       12322434.4   1.0259790 23221 103.29906
## Lone Wolf                Eels      Rock       63878899.8   3.3592505 32759 109.53291
## Wrong About Bobby        Eels      Rock       43668186.9  -2.0125667 32759 108.35559
## Love Me Do               Beatles   Rock       28806811.2  -5.7452189 24159 109.95808
## I Want to Hold Your Hand Beatles   Rock       61257693.6  -6.0340682 28502 111.81813
## Cant Buy Me Love         Beatles   Rock       76729438.0  -5.9583971 30102 112.71792
## I Feel Fine              Beatles   Rock       52497242.7  -5.7314591 29911 110.66333
## Ticket to Ride           Beatles   Rock       68104547.0  -6.1449114 30415 111.59957
## Help                     Beatles   Rock       52569372.5  -5.7166301 32318 110.34618
## Yesterday                Beatles   Rock       23080907.6  -6.0298337 28169 107.30541
## Yellow Submarine         Beatles   Rock       39908667.5  -6.2616915 29061 109.47442
## Eleanor Rigby            Beatles   Rock       18819753.2  -6.1265193 21680 108.65894
## Penny Lane               Beatles   Rock       58614798.7  -5.9971242 31131 111.18901
## B1                       Beethoven Classical   8368952.9  -0.9538330 26645 101.74095
## B2                       Beethoven Classical    293608.3  -0.1247094  4554 103.26160
## B3                       Beethoven Classical   8051764.6  -0.3316964 24194 101.20132
## B4                       Beethoven Classical  23493873.6  -0.9411538 32766 106.18220
## B5                       Beethoven Classical   1640232.8   1.3899979 20877  94.59029
## B6                       Beethoven Classical    343973.1  -2.4748955  9225  93.38874
## B7                       Beethoven Classical   3644784.2  -1.0426907 24633  97.86394
## B8                       Beethoven Classical  15030950.3  -1.4394652 26066 106.39731
## The Memory of Trees      Enya      New wave    1135493.1 -10.6183398  9994 102.16132
## Anywhere Is              Enya      New wave   12230252.2 -17.8372700 24968 105.75748
## Pax Deorum               Enya      New wave    1723627.9  -6.8327065 13227 101.86845
## Waterloo                 Abba      Rock       24898675.9 -93.9961871 29830 107.73299
## V11                      Vivaldi   Classical   1879989.2  12.7213373  8601 105.81750
## V12                      Vivaldi   Classical    737349.6   5.7190022  7089 102.92123
## V13                      Vivaldi   Classical   2865979.9  21.4467629 17282 102.11314
## Hey Jude                 Beatles   Rock        8651854.1  -6.1322408 18509  83.88195
##                              lfreq
## Dancing Queen             59.57379
## Knowing Me                58.48031
## Take a Chance            124.59397
## Mamma Mia                 48.76513
## Lay All You               74.02039
## Super Trouper             81.40140
## I Have A Dream           305.18689
## The Winner               277.66056
## Money                    165.15799
## SOS                      146.73700
## V1                       329.53792
## V2                       843.83240
## V3                       165.76781
## V4                       293.99972
## V5                       198.38305
## V6                       198.46716
## V7                       877.77243
## V8                        58.41722
## V9                       176.53441
## V10                      526.04942
## M1                       342.26017
## M2                       511.85517
## M3                       429.27618
## M4                       343.75319
## M5                       288.44819
## M6                       459.24182
## All in a Days Work        65.48281
## Saturday Morning          41.40515
## The Good Old Days        165.24210
## Love of the Loveless     174.64185
## Girl                     392.28702
## Agony                    312.06322
## Rock Hard Times          312.37864
## Restraining              185.59771
## Lone Wolf                 66.13469
## Wrong About Bobby         98.83404
## Love Me Do               126.50757
## I Want to Hold Your Hand 294.67263
## Cant Buy Me Love         100.20089
## I Feel Fine              110.39972
## Ticket to Ride           107.30853
## Help                     137.10594
## Yesterday                173.50631
## Yellow Submarine          91.49508
## Eleanor Rigby            164.92667
## Penny Lane                92.46240
## B1                       354.12025
## B2                       316.03761
## B3                       397.18666
## B4                       529.26679
## B5                       445.00551
## B6                       331.68283
## B7                       181.01349
## B8                       249.08280
## The Memory of Trees      155.06430
## Anywhere Is               79.31957
## Pax Deorum                49.01748
## Waterloo                 146.04306
## V11                       58.83780
## V12                      175.94562
## V13                       61.44533
## Hey Jude                 219.53773
Abba is just different from everyone else!
data
appear different than statistics
in the time series?"
rvest and polite
Most data on the web is still available mainly as HTML. Although HTML is structured (hierarchical / tree based), it is often not in a form useful for analysis (flat / tidy).
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
  </body>
</html>
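To see the tree structure in action, the small document above can be parsed from a string, with no web request involved. This is a sketch using rvest; the variable names are illustrative, not part of the lecture code.

```r
library(rvest)

# The example document from above, held as a string
doc <- read_html('<html>
  <head><title>This is a title</title></head>
  <body><p align="center">Hello world!</p></body>
</html>')

title_text <- doc %>% html_node("title") %>% html_text()     # text inside <title>
p_text     <- doc %>% html_node("p") %>% html_text()         # text inside <p>
p_align    <- doc %>% html_node("p") %>% html_attr("align")  # value of the align attribute

title_text
p_text
p_align
```

Each call walks the parsed tree to a node and then pulls out either its text or one of its attributes.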
bow() - check if the data can be scraped appropriately
scrape() - scrape website data (with nice defaults)
html_nodes() - select specified nodes from the HTML document using CSS selectors
html_table() - parse an HTML table into a data frame
html_text() - extract tag pairs' content

Selector | Example | Description |
---|---|---|
element | p | Select all <p> elements |
element element | div p | Select all <p> elements inside a <div> element |
element>element | div > p | Select all <p> elements with <div> as a parent |
.class | .title | Select all elements with class="title" |
#id | #name | Select all elements with id="name" |
[attribute] | [class] | Select all elements with a class attribute |
[attribute=value] | [class=title] | Select all elements with class="title" |
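The selectors in the table can be tried offline on a tiny page. The HTML fragment below is hypothetical, made up so that each selector matches something different; pick() is a small helper defined here, not an rvest function.

```r
library(rvest)

# Hypothetical HTML fragment, made up to exercise the selectors above
page <- read_html('<html><body>
  <p class="title">Top</p>
  <div>
    <p id="name">Child</p>
    <blockquote><p>Nested</p></blockquote>
  </div>
</body></html>')

# Helper: text of every node matched by a CSS selector
pick <- function(css) page %>% html_nodes(css) %>% html_text()

pick("p")              # every <p>
pick("div p")          # <p> anywhere inside a <div>
pick("div > p")        # <p> whose direct parent is a <div>
pick(".title")         # class="title"
pick("#name")          # id="name"
pick("[class]")        # any element with a class attribute
pick("[class=title]")  # class attribute equal to "title"
```

Comparing "div p" with "div > p" shows the difference between any descendant and a direct child.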
Take a look at the source code and look for the table tag:
http://www.imdb.com/chart/top
# install.packages("remotes")
# remotes::install_github("dmi3kno/polite")
library(polite)
bow("http://www.imdb.com")

## <polite session> http://www.imdb.com
##     User-agent: polite R package - https://github.com/dmi3kno/polite
##     robots.txt: 25 rules are defined for 1 bots
##    Crawl delay: 5 sec
##   The path is scrapable for this user-agent
bow("http://www.facebook.com")
## <polite session> http://www.facebook.com
##     User-agent: polite R package - https://github.com/dmi3kno/polite
##     robots.txt: 313 rules are defined for 15 bots
##    Crawl delay: 5 sec
##   The path is not scrapable for this user-agent
Go to rstudio.cloud → Lecture 5B → Make a copy → lecture-5B.Rmd
imdb_session <- bow("http://www.imdb.com/chart/top")
imdb_session

## <polite session> http://www.imdb.com/chart/top
##     User-agent: polite R package - https://github.com/dmi3kno/polite
##     robots.txt: 25 rules are defined for 1 bots
##    Crawl delay: 5 sec
##   The path is scrapable for this user-agent

imdb_data <- scrape(imdb_session)
imdb_data

## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<scrip ...
## [2] <body id="styleguide-v2" class="fixed">\n <img height="1" width="1" styl ...
html_nodes()
library(rvest)
imdb_data %>%
  html_nodes(".titleColumn a")
## {xml_nodeset (250)}## [1] <a href="/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [2] <a href="/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [3] <a href="/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [4] <a href="/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [5] <a href="/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [6] <a href="/title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [7] <a href="/title/tt0167260/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [8] <a href="/title/tt0110912/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [9] <a href="/title/tt0060196/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [10] <a href="/title/tt0137523/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [11] <a href="/title/tt0120737/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [12] <a href="/title/tt0109830/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [13] <a href="/title/tt1375666/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [14] <a href="/title/tt0080684/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [15] <a href="/title/tt0167261/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [16] <a href="/title/tt0133093/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [17] <a href="/title/tt0073486/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [18] <a href="/title/tt0099685/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [19] <a href="/title/tt0047478/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## [20] <a href="/title/tt0114369/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8 ...## ...
html_text()
imdb_data %>% html_nodes(".titleColumn a") %>% html_text()
## [1] "The Shawshank Redemption" ## [2] "The Godfather" ## [3] "The Godfather: Part II" ## [4] "The Dark Knight" ## [5] "12 Angry Men" ## [6] "Schindler's List" ## [7] "The Lord of the Rings: The Return of the King" ## [8] "Pulp Fiction" ## [9] "The Good, the Bad and the Ugly" ## [10] "Fight Club" ## [11] "The Lord of the Rings: The Fellowship of the Ring" ## [12] "Forrest Gump" ## [13] "Inception" ## [14] "Star Wars: Episode V - The Empire Strikes Back" ## [15] "The Lord of the Rings: The Two Towers" ## [16] "The Matrix" ## [17] "One Flew Over the Cuckoo's Nest" ## [18] "Goodfellas" ## [19] "Seven Samurai" ## [20] "Se7en" ## [21] "Cidade de Deus" ## [22] "Life Is Beautiful" ## [23] "The Silence of the Lambs" ## [24] "Star Wars" ## [25] "It's a Wonderful Life" ## [26] "Saving Private Ryan" ## [27] "Spirited Away" ## [28] "The Green Mile" ## [29] "Léon: The Professional" ## [30] "Interstellar" ## [31] "The Usual Suspects" ## [32] "Avengers: Endgame" ## [33] "American History X" ## [34] "The Lion King" ## [35] "Back to the Future" ## [36] "Modern Times" ## [37] "The Pianist" ## [38] "Terminator 2: Judgment Day" ## [39] "The Intouchables" ## [40] "Psycho" ## [41] "Gladiator" ## [42] "City Lights" ## [43] "The Departed" ## [44] "Whiplash" ## [45] "Once Upon a Time in the West" ## [46] "The Prestige" ## [47] "Casablanca" ## [48] "Grave of the Fireflies" ## [49] "Rear Window" ## [50] "Cinema Paradiso" ## [51] "Raiders of the Lost Ark" ## [52] "Alien" ## [53] "Memento" ## [54] "Apocalypse Now" ## [55] "The Great Dictator" ## [56] "The Lives of Others" ## [57] "Avengers: Infinity War" ## [58] "Spider-Man: Into the Spider-Verse" ## [59] "Django Unchained" ## [60] "The Shining" ## [61] "Paths of Glory" ## [62] "WALL·E" ## [63] "Sunset Blvd." ## [64] "Dr. 
Strangelove or: How I Learned to Stop Worrying and Love the Bomb"## [65] "Princess Mononoke" ## [66] "Oldeuboi" ## [67] "Witness for the Prosecution" ## [68] "The Dark Knight Rises" ## [69] "Once Upon a Time in America" ## [70] "Aliens" ## [71] "Coco" ## [72] "American Beauty" ## [73] "Kimi no na wa." ## [74] "Braveheart" ## [75] "Das Boot" ## [76] "3 Idiots" ## [77] "Taare Zameen Par" ## [78] "Star Wars: Episode VI - Return of the Jedi" ## [79] "Reservoir Dogs" ## [80] "Toy Story" ## [81] "Amadeus" ## [82] "Dangal" ## [83] "M - Eine Stadt sucht einen Mörder" ## [84] "Requiem for a Dream" ## [85] "Good Will Hunting" ## [86] "Vertigo" ## [87] "Inglourious Basterds" ## [88] "2001: A Space Odyssey" ## [89] "Eternal Sunshine of the Spotless Mind" ## [90] "Citizen Kane" ## [91] "Full Metal Jacket" ## [92] "Jagten" ## [93] "North by Northwest" ## [94] "A Clockwork Orange" ## [95] "Snatch" ## [96] "Amelie" ## [97] "The Kid" ## [98] "Bicycle Thieves" ## [99] "Scarface" ## [100] "Singin' in the Rain" ## [101] "Lawrence of Arabia" ## [102] "Toy Story 3" ## [103] "The Sting" ## [104] "Taxi Driver" ## [105] "Per qualche dollaro in più" ## [106] "Metropolis" ## [107] "Double Indemnity" ## [108] "Jodaeiye Nader az Simin" ## [109] "To Kill a Mockingbird" ## [110] "Ikiru" ## [111] "Indiana Jones and the Last Crusade" ## [112] "The Apartment" ## [113] "Up" ## [114] "L.A. 
Confidential" ## [115] "Monty Python and the Holy Grail" ## [116] "Incendies" ## [117] "Capharnaüm" ## [118] "Heat" ## [119] "Rashomon" ## [120] "Yojimbo" ## [121] "Batman Begins" ## [122] "Die Hard" ## [123] "Green Book" ## [124] "Unforgiven" ## [125] "Downfall" ## [126] "Bacheha-Ye aseman" ## [127] "Some Like It Hot" ## [128] "Howl's Moving Castle" ## [129] "The Great Escape" ## [130] "My Neighbor Totoro" ## [131] "All About Eve" ## [132] "Ran" ## [133] "A Beautiful Mind" ## [134] "Pan's Labyrinth" ## [135] "Casino" ## [136] "El secreto de sus ojos" ## [137] "Raging Bull" ## [138] "Idi i smotri" ## [139] "Lock, Stock and Two Smoking Barrels" ## [140] "Babam ve Oglum" ## [141] "The Treasure of the Sierra Madre" ## [142] "The Wolf of Wall Street" ## [143] "Judgment at Nuremberg" ## [144] "Three Billboards Outside Ebbing, Missouri" ## [145] "Chinatown" ## [146] "The Gold Rush" ## [147] "Inside Out" ## [148] "Dial M for Murder" ## [149] "V for Vendetta" ## [150] "Det sjunde inseglet" ## [151] "Warrior" ## [152] "There Will Be Blood" ## [153] "Room" ## [154] "Trainspotting" ## [155] "Andhadhun" ## [156] "No Country for Old Men" ## [157] "The Elephant Man" ## [158] "The Sixth Sense" ## [159] "Shutter Island" ## [160] "The Thing" ## [161] "Eskiya" ## [162] "The Bridge on the River Kwai" ## [163] "Gone with the Wind" ## [164] "The Third Man" ## [165] "On the Waterfront" ## [166] "Blade Runner" ## [167] "Smultronstället" ## [168] "Finding Nemo" ## [169] "Jurassic Park" ## [170] "Gran Torino" ## [171] "Fargo" ## [172] "Kill Bill: Vol. 1" ## [173] "The Deer Hunter" ## [174] "Tôkyô monogatari" ## [175] "The Big Lebowski" ## [176] "Stalker" ## [177] "The Truman Show" ## [178] "Relatos salvajes" ## [179] "Mary and Max" ## [180] "Hacksaw Ridge" ## [181] "Gone Girl" ## [182] "In the Name of the Father" ## [183] "Sherlock Jr." ## [184] "The General" ## [185] "How to Train Your Dragon" ## [186] "Mr. 
Smith Goes to Washington" ## [187] "The Grand Budapest Hotel" ## [188] "Salinui chueok" ## [189] "Persona" ## [190] "Before Sunrise" ## [191] "Catch Me If You Can" ## [192] "Toy Story 4" ## [193] "Into the Wild" ## [194] "Cool Hand Luke" ## [195] "Le salaire de la peur" ## [196] "12 Years a Slave" ## [197] "Network" ## [198] "Andrei Rublev" ## [199] "Life of Brian" ## [200] "Rang De Basanti" ## [201] "La passion de Jeanne d'Arc" ## [202] "Prisoners" ## [203] "Stand by Me" ## [204] "Mad Max: Fury Road" ## [205] "Platoon" ## [206] "Rush" ## [207] "Million Dollar Baby" ## [208] "Hachi: A Dog's Tale" ## [209] "Logan" ## [210] "Ben-Hur" ## [211] "Barry Lyndon" ## [212] "Hotel Rwanda" ## [213] "Amores perros" ## [214] "Kaze no tani no Naushika" ## [215] "Spotlight" ## [216] "Harry Potter and the Deathly Hallows: Part 2" ## [217] "Les quatre cents coups" ## [218] "Dead Poets Society" ## [219] "Rebecca" ## [220] "Rocky" ## [221] "Gangs of Wasseypur" ## [222] "Monsters, Inc." ## [223] "Lagaan: Once Upon a Time in India" ## [224] "It Happened One Night" ## [225] "La haine" ## [226] "El ángel exterminador" ## [227] "Once Upon a Time in Hollywood" ## [228] "Ah-ga-ssi" ## [229] "Munna Bhai M.B.B.S." ## [230] "The Princess Bride" ## [231] "Swades: We, the People" ## [232] "PK" ## [233] "Butch Cassidy and the Sundance Kid" ## [234] "The Help" ## [235] "Drishyam" ## [236] "Before Sunset" ## [237] "A Wednesday" ## [238] "Paris, Texas" ## [239] "The Terminator" ## [240] "Akira" ## [241] "Sholay" ## [242] "Faa yeung nin wa" ## [243] "La battaglia di Algeri" ## [244] "The End of Evangelion" ## [245] "Mou gaan dou" ## [246] "Guardians of the Galaxy" ## [247] "La leggenda del pianista sull'oceano" ## [248] "Kis Uykusu" ## [249] "Aladdin" ## [250] "Das Cabinet des Dr. Caligari"
titles <- imdb_data %>% html_nodes(".titleColumn a") %>% html_text()
imdb_data %>% html_nodes(".secondaryInfo")
## {xml_nodeset (250)}
##  [1] <span class="secondaryInfo">(1994)</span>
##  [2] <span class="secondaryInfo">(1972)</span>
##  [3] <span class="secondaryInfo">(1974)</span>
##  [4] <span class="secondaryInfo">(2008)</span>
##  [5] <span class="secondaryInfo">(1957)</span>
##  [6] <span class="secondaryInfo">(1993)</span>
##  [7] <span class="secondaryInfo">(2003)</span>
##  [8] <span class="secondaryInfo">(1994)</span>
##  [9] <span class="secondaryInfo">(1966)</span>
## [10] <span class="secondaryInfo">(1999)</span>
## [11] <span class="secondaryInfo">(2001)</span>
## [12] <span class="secondaryInfo">(1994)</span>
## [13] <span class="secondaryInfo">(2010)</span>
## [14] <span class="secondaryInfo">(1980)</span>
## [15] <span class="secondaryInfo">(2002)</span>
## [16] <span class="secondaryInfo">(1999)</span>
## [17] <span class="secondaryInfo">(1975)</span>
## [18] <span class="secondaryInfo">(1990)</span>
## [19] <span class="secondaryInfo">(1954)</span>
## [20] <span class="secondaryInfo">(1995)</span>
## ...
imdb_data %>% html_nodes(".secondaryInfo") %>% html_text()
## [1] "(1994)" "(1972)" "(1974)" "(2008)" "(1957)" "(1993)" "(2003)" "(1994)" "(1966)"## [10] "(1999)" "(2001)" "(1994)" "(2010)" "(1980)" "(2002)" "(1999)" "(1975)" "(1990)"## [19] "(1954)" "(1995)" "(2002)" "(1997)" "(1991)" "(1977)" "(1946)" "(1998)" "(2001)"## [28] "(1999)" "(1994)" "(2014)" "(1995)" "(2019)" "(1998)" "(1994)" "(1985)" "(1936)"## [37] "(2002)" "(1991)" "(2011)" "(1960)" "(2000)" "(1931)" "(2006)" "(2014)" "(1968)"## [46] "(2006)" "(1942)" "(1988)" "(1954)" "(1988)" "(1981)" "(1979)" "(2000)" "(1979)"## [55] "(1940)" "(2006)" "(2018)" "(2018)" "(2012)" "(1980)" "(1957)" "(2008)" "(1950)"## [64] "(1964)" "(1997)" "(2003)" "(1957)" "(2012)" "(1984)" "(1986)" "(2017)" "(1999)"## [73] "(2016)" "(1995)" "(1981)" "(2009)" "(2007)" "(1983)" "(1992)" "(1995)" "(1984)"## [82] "(2016)" "(1931)" "(2000)" "(1997)" "(1958)" "(2009)" "(1968)" "(2004)" "(1941)"## [91] "(1987)" "(2012)" "(1959)" "(1971)" "(2000)" "(2001)" "(1921)" "(1948)" "(1983)"## [100] "(1952)" "(1962)" "(2010)" "(1973)" "(1976)" "(1965)" "(1927)" "(1944)" "(2011)"## [109] "(1962)" "(1952)" "(1989)" "(1960)" "(2009)" "(1997)" "(1975)" "(2010)" "(2018)"## [118] "(1995)" "(1950)" "(1961)" "(2005)" "(1988)" "(2018)" "(1992)" "(2004)" "(1997)"## [127] "(1959)" "(2004)" "(1963)" "(1988)" "(1950)" "(1985)" "(2001)" "(2006)" "(1995)"## [136] "(2009)" "(1980)" "(1985)" "(1998)" "(2005)" "(1948)" "(2013)" "(1961)" "(2017)"## [145] "(1974)" "(1925)" "(2015)" "(1954)" "(2005)" "(1957)" "(2011)" "(2007)" "(2015)"## [154] "(1996)" "(2018)" "(2007)" "(1980)" "(1999)" "(2010)" "(1982)" "(1996)" "(1957)"## [163] "(1939)" "(1949)" "(1954)" "(1982)" "(1957)" "(2003)" "(1993)" "(2008)" "(1996)"## [172] "(2003)" "(1978)" "(1953)" "(1998)" "(1979)" "(1998)" "(2014)" "(2009)" "(2016)"## [181] "(2014)" "(1993)" "(1924)" "(1926)" "(2010)" "(1939)" "(2014)" "(2003)" "(1966)"## [190] "(1995)" "(2002)" "(2019)" "(2007)" "(1967)" "(1953)" "(2013)" "(1976)" "(1966)"## [199] "(1979)" "(2006)" "(1928)" "(2013)" "(1986)" 
"(2015)" "(1986)" "(2013)" "(2004)"## [208] "(2009)" "(2017)" "(1959)" "(1975)" "(2004)" "(2000)" "(1984)" "(2015)" "(2011)"## [217] "(1959)" "(1989)" "(1940)" "(1976)" "(2012)" "(2001)" "(2001)" "(1934)" "(1995)"## [226] "(1962)" "(2019)" "(2016)" "(2003)" "(1987)" "(2004)" "(2014)" "(1969)" "(2011)"## [235] "(2015)" "(2004)" "(2008)" "(1984)" "(1984)" "(1988)" "(1975)" "(2000)" "(1966)"## [244] "(1997)" "(2002)" "(2014)" "(1998)" "(2014)" "(1992)" "(1920)"
imdb_data %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_replace("\\(", "") %>% # remove (
  str_replace("\\)", "") %>% # remove )
  as.numeric()
## [1] 1994 1972 1974 2008 1957 1993 2003 1994 1966 1999 2001 1994 2010 1980 2002 1999 1975## [18] 1990 1954 1995 2002 1997 1991 1977 1946 1998 2001 1999 1994 2014 1995 2019 1998 1994## [35] 1985 1936 2002 1991 2011 1960 2000 1931 2006 2014 1968 2006 1942 1988 1954 1988 1981## [52] 1979 2000 1979 1940 2006 2018 2018 2012 1980 1957 2008 1950 1964 1997 2003 1957 2012## [69] 1984 1986 2017 1999 2016 1995 1981 2009 2007 1983 1992 1995 1984 2016 1931 2000 1997## [86] 1958 2009 1968 2004 1941 1987 2012 1959 1971 2000 2001 1921 1948 1983 1952 1962 2010## [103] 1973 1976 1965 1927 1944 2011 1962 1952 1989 1960 2009 1997 1975 2010 2018 1995 1950## [120] 1961 2005 1988 2018 1992 2004 1997 1959 2004 1963 1988 1950 1985 2001 2006 1995 2009## [137] 1980 1985 1998 2005 1948 2013 1961 2017 1974 1925 2015 1954 2005 1957 2011 2007 2015## [154] 1996 2018 2007 1980 1999 2010 1982 1996 1957 1939 1949 1954 1982 1957 2003 1993 2008## [171] 1996 2003 1978 1953 1998 1979 1998 2014 2009 2016 2014 1993 1924 1926 2010 1939 2014## [188] 2003 1966 1995 2002 2019 2007 1967 1953 2013 1976 1966 1979 2006 1928 2013 1986 2015## [205] 1986 2013 2004 2009 2017 1959 1975 2004 2000 1984 2015 2011 1959 1989 1940 1976 2012## [222] 2001 2001 1934 1995 1962 2019 2016 2003 1987 2004 2014 1969 2011 2015 2004 2008 1984## [239] 1984 1988 1975 2000 1966 1997 2002 2014 1998 2014 1992 1920
parse_number()
imdb_data %>% html_nodes(".secondaryInfo") %>% html_text() %>% parse_number()
## [1] 1994 1972 1974 2008 1957 1993 2003 1994 1966 1999 2001 1994 2010 1980 2002 1999 1975## [18] 1990 1954 1995 2002 1997 1991 1977 1946 1998 2001 1999 1994 2014 1995 2019 1998 1994## [35] 1985 1936 2002 1991 2011 1960 2000 1931 2006 2014 1968 2006 1942 1988 1954 1988 1981## [52] 1979 2000 1979 1940 2006 2018 2018 2012 1980 1957 2008 1950 1964 1997 2003 1957 2012## [69] 1984 1986 2017 1999 2016 1995 1981 2009 2007 1983 1992 1995 1984 2016 1931 2000 1997## [86] 1958 2009 1968 2004 1941 1987 2012 1959 1971 2000 2001 1921 1948 1983 1952 1962 2010## [103] 1973 1976 1965 1927 1944 2011 1962 1952 1989 1960 2009 1997 1975 2010 2018 1995 1950## [120] 1961 2005 1988 2018 1992 2004 1997 1959 2004 1963 1988 1950 1985 2001 2006 1995 2009## [137] 1980 1985 1998 2005 1948 2013 1961 2017 1974 1925 2015 1954 2005 1957 2011 2007 2015## [154] 1996 2018 2007 1980 1999 2010 1982 1996 1957 1939 1949 1954 1982 1957 2003 1993 2008## [171] 1996 2003 1978 1953 1998 1979 1998 2014 2009 2016 2014 1993 1924 1926 2010 1939 2014## [188] 2003 1966 1995 2002 2019 2007 1967 1953 2013 1976 1966 1979 2006 1928 2013 1986 2015## [205] 1986 2013 2004 2009 2017 1959 1975 2004 2000 1984 2015 2011 1959 1989 1940 1976 2012## [222] 2001 2001 1934 1995 1962 2019 2016 2003 1987 2004 2014 1969 2011 2015 2004 2008 1984## [239] 1984 1988 1975 2000 1966 1997 2002 2014 1998 2014 1992 1920
years <- imdb_data %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_replace("\\(", "") %>% # remove (
  str_replace("\\)", "") %>% # remove )
  as.numeric()
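The two cleaning approaches can be compared side by side on a short vector. This is a sketch: raw_years is a hypothetical sample in the same "(YYYY)" shape as the scraped text, and the stringr, readr, and magrittr packages are assumed to be installed.

```r
library(magrittr)  # %>%
library(stringr)   # str_replace()
library(readr)     # parse_number()

# Hypothetical sample in the same "(YYYY)" shape as the scraped text
raw_years <- c("(1994)", "(1972)", "(2008)")

# Approach 1: strip each parenthesis, then convert
years_manual <- raw_years %>%
  str_replace("\\(", "") %>%
  str_replace("\\)", "") %>%
  as.numeric()

# Approach 2: let parse_number() pull out the number
years_parsed <- parse_number(raw_years)

all.equal(years_manual, years_parsed)
```

Both approaches give the same numeric vector here; parse_number() is shorter because it ignores the non-numeric characters around the number.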
imdb_data %>% html_nodes(".imdbRating strong")
## {xml_nodeset (250)}## [1] <strong title="9.2 based on 2,128,368 user ratings">9.2</strong>## [2] <strong title="9.1 based on 1,461,399 user ratings">9.1</strong>## [3] <strong title="9.0 based on 1,016,494 user ratings">9.0</strong>## [4] <strong title="9.0 based on 2,093,060 user ratings">9.0</strong>## [5] <strong title="8.9 based on 606,501 user ratings">8.9</strong>## [6] <strong title="8.9 based on 1,104,626 user ratings">8.9</strong>## [7] <strong title="8.9 based on 1,514,027 user ratings">8.9</strong>## [8] <strong title="8.9 based on 1,669,018 user ratings">8.9</strong>## [9] <strong title="8.8 based on 632,203 user ratings">8.8</strong>## [10] <strong title="8.8 based on 1,701,403 user ratings">8.8</strong>## [11] <strong title="8.8 based on 1,529,661 user ratings">8.8</strong>## [12] <strong title="8.8 based on 1,637,303 user ratings">8.8</strong>## [13] <strong title="8.7 based on 1,866,315 user ratings">8.7</strong>## [14] <strong title="8.7 based on 1,065,616 user ratings">8.7</strong>## [15] <strong title="8.7 based on 1,369,352 user ratings">8.7</strong>## [16] <strong title="8.6 based on 1,531,882 user ratings">8.6</strong>## [17] <strong title="8.6 based on 842,283 user ratings">8.6</strong>## [18] <strong title="8.6 based on 919,580 user ratings">8.6</strong>## [19] <strong title="8.6 based on 288,358 user ratings">8.6</strong>## [20] <strong title="8.6 based on 1,307,507 user ratings">8.6</strong>## ...
imdb_data %>% html_nodes(".imdbRating strong") %>% html_text()
## [1] "9.2" "9.1" "9.0" "9.0" "8.9" "8.9" "8.9" "8.9" "8.8" "8.8" "8.8" "8.8" "8.7" "8.7"## [15] "8.7" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.5" "8.5" "8.5"## [29] "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5"## [43] "8.5" "8.5" "8.5" "8.5" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4"## [57] "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.3" "8.3" "8.3" "8.3" "8.3"## [71] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3"## [85] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.2" "8.2"## [99] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"## [113] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"## [127] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"## [141] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1"## [155] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1"## [169] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1"## [183] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1"## [197] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.0"## [211] "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0"## [225] "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0"## [239] "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0"
imdb_data %>% html_nodes(".imdbRating strong") %>% html_text() %>% as.numeric()
## [1] 9.2 9.1 9.0 9.0 8.9 8.9 8.9 8.9 8.8 8.8 8.8 8.8 8.7 8.7 8.7 8.6 8.6 8.6 8.6 8.6 8.6## [22] 8.6 8.6 8.6 8.6 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5## [43] 8.5 8.5 8.5 8.5 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4## [64] 8.4 8.4 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3## [85] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2## [106] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2## [127] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.1 8.1 8.1 8.1 8.1 8.1 8.1## [148] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1## [169] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1## [190] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.0## [211] 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0## [232] 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0
scores <- imdb_data %>% html_nodes(".imdbRating strong") %>% html_text() %>% as.numeric()
imdb_top_250 <- tibble(title = titles, year = years, score = scores)
imdb_top_250

## # A tibble: 250 x 3
##    title                                          year score
##    <chr>                                         <dbl> <dbl>
##  1 The Shawshank Redemption                       1994   9.2
##  2 The Godfather                                  1972   9.1
##  3 The Godfather: Part II                         1974   9
##  4 The Dark Knight                                2008   9
##  5 12 Angry Men                                   1957   8.9
##  6 Schindler's List                               1993   8.9
##  7 The Lord of the Rings: The Return of the King  2003   8.9
##  8 Pulp Fiction                                   1994   8.9
##  9 The Good, the Bad and the Ugly                 1966   8.8
## 10 Fight Club                                     1999   8.8
## # … with 240 more rows
title | year | score |
---|---|---|
The Shawshank Redemption | 1994 | 9.2 |
The Godfather | 1972 | 9.1 |
The Godfather: Part II | 1974 | 9 |
The Dark Knight | 2008 | 9 |
12 Angry Men | 1957 | 8.9 |
Schindler's List | 1993 | 8.9 |
The Lord of the Rings: The Return of the King | 2003 | 8.9 |
Pulp Fiction | 1994 | 8.9 |
The Good, the Bad and the Ugly | 1966 | 8.8 |
Fight Club | 1999 | 8.8 |
The Lord of the Rings: The Fellowship of the Ring | 2001 | 8.8 |
Forrest Gump | 1994 | 8.8 |
Inception | 2010 | 8.7 |
Star Wars: Episode V - The Empire Strikes Back | 1980 | 8.7 |
The Lord of the Rings: The Two Towers | 2002 | 8.7 |
... | ... | ... |
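The whole title / year / score workflow can be rehearsed offline before touching the real site. The page below is a hypothetical two-row fragment that only reuses the class names from the selectors above; its titles, years, and scores are invented, and the real page's markup may differ or change.

```r
library(rvest)
library(readr)
library(tibble)

# Hypothetical two-row page reusing the class names from the selectors above;
# the titles, years, and scores are invented.
toy_page <- read_html('<table><tbody>
  <tr>
    <td class="titleColumn">1. <a href="/title/a/">Movie A</a>
      <span class="secondaryInfo">(1994)</span></td>
    <td class="imdbRating"><strong>9.2</strong></td>
  </tr>
  <tr>
    <td class="titleColumn">2. <a href="/title/b/">Movie B</a>
      <span class="secondaryInfo">(1972)</span></td>
    <td class="imdbRating"><strong>9.1</strong></td>
  </tr>
</tbody></table>')

toy_titles <- toy_page %>% html_nodes(".titleColumn a") %>% html_text()
toy_years  <- toy_page %>% html_nodes(".secondaryInfo") %>% html_text() %>% parse_number()
toy_scores <- toy_page %>% html_nodes(".imdbRating strong") %>% html_text() %>% as.numeric()

# One column per scraped vector; all three must have the same length
toy_top <- tibble(title = toy_titles, year = toy_years, score = toy_scores)
toy_top
```

Scraping each column separately relies on every row having all three pieces; a missing year or score would misalign the vectors, so checking the lengths first is a sensible habit.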
html_table() (extract the table from the returned list with [[ ]])

imdb_table <- html_table(imdb_data)
glimpse(imdb_table[[1]])

## Observations: 250
## Variables: 5
## $ ``             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ `Rank & Title` <chr> "1.\n The Shawshank Redemption\n (1994)", "2.\n …
## $ `IMDb Rating`  <dbl> 9.2, 9.1, 9.0, 9.0, 8.9, 8.9, 8.9, 8.9, 8.8, 8.8, 8.8, 8.8, 8.7,…
## $ `Your Rating`  <chr> "12345678910\n \n \n \n NOT …
## $ ``             <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
May or may not be a lot of work depending on how messy the data are
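For a clean table, html_table() does most of the work. The sketch below uses a hypothetical page with one small, tidy table (invented values), so very little cleaning is left afterwards.

```r
library(rvest)

# Hypothetical page with one small table (invented values)
toy_page <- read_html('<table>
  <tr><th>Title</th><th>Rating</th></tr>
  <tr><td>Movie A</td><td>9.2</td></tr>
  <tr><td>Movie B</td><td>9.1</td></tr>
</table>')

toy_tables <- html_table(toy_page)  # one entry per <table> found
toy_table  <- toy_tables[[1]]       # [[ ]] extracts the first table
toy_table
```

On a real page the cells often carry extra whitespace, merged fields, or empty columns, as in the glimpse() output above, and that is where the cleaning effort goes.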
glimpse(imdb_top_250)
## Observations: 250
## Variables: 3
## $ title <chr> "The Shawshank Redemption", "The Godfather", "The Godfather: Part II", "T…
## $ year  <dbl> 1994, 1972, 1974, 2008, 1957, 1993, 2003, 1994, 1966, 1999, 2001, 1994, 2…
## $ score <dbl> 9.2, 9.1, 9.0, 9.0, 8.9, 8.9, 8.9, 8.9, 8.8, 8.8, 8.8, 8.8, 8.7, 8.7, 8.7…
imdb_top_250 %>%
  mutate(
    rank = 1:nrow(imdb_top_250)
  )

## # A tibble: 250 x 4
##    title                                          year score  rank
##    <chr>                                         <dbl> <dbl> <int>
##  1 The Shawshank Redemption                       1994   9.2     1
##  2 The Godfather                                  1972   9.1     2
##  3 The Godfather: Part II                         1974   9       3
##  4 The Dark Knight                                2008   9       4
##  5 12 Angry Men                                   1957   8.9     5
##  6 Schindler's List                               1993   8.9     6
##  7 The Lord of the Rings: The Return of the King  2003   8.9     7
##  8 Pulp Fiction                                   1994   8.9     8
##  9 The Good, the Bad and the Ugly                 1966   8.8     9
## 10 Fight Club                                     1999   8.8    10
## # … with 240 more rows
title | year | score | rank |
---|---|---|---|
The Shawshank Redemption | 1994 | 9.2 | 1 |
The Godfather | 1972 | 9.1 | 2 |
The Godfather: Part II | 1974 | 9 | 3 |
The Dark Knight | 2008 | 9 | 4 |
12 Angry Men | 1957 | 8.9 | 5 |
Schindler's List | 1993 | 8.9 | 6 |
The Lord of the Rings: The Return of the King | 2003 | 8.9 | 7 |
Pulp Fiction | 1994 | 8.9 | 8 |
The Good, the Bad and the Ugly | 1966 | 8.8 | 9 |
Fight Club | 1999 | 8.8 | 10 |
The Lord of the Rings: The Fellowship of the Ring | 2001 | 8.8 | 11 |
Forrest Gump | 1994 | 8.8 | 12 |
Inception | 2010 | 8.7 | 13 |
Star Wars: Episode V - The Empire Strikes Back | 1980 | 8.7 | 14 |
The Lord of the Rings: The Two Towers | 2002 | 8.7 | 15 |
... | ... | ... | ... |
How would you go about answering this question: Which 1995 movies made the list?
imdb_top_250 %>% filter(year == 1995)
## # A tibble: 8 x 3
##   title               year score
##   <chr>              <dbl> <dbl>
## 1 Se7en               1995   8.6
## 2 The Usual Suspects  1995   8.5
## 3 Braveheart          1995   8.3
## 4 Toy Story           1995   8.3
## 5 Heat                1995   8.2
## 6 Casino              1995   8.2
## 7 Before Sunrise      1995   8.1
## 8 La haine            1995   8
How would you go about answering this question: Which years have the most movies on the list?
imdb_top_250 %>%
  group_by(year) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  head(5)

## # A tibble: 5 x 2
##    year total
##   <dbl> <int>
## 1  1995     8
## 2  2014     8
## 3  2004     7
## 4  1957     6
## 5  1997     6
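Both questions can be practised on a miniature list. toy_top below is a hypothetical five-row stand-in for imdb_top_250 with invented rows; the count() spelling is shown as an alternative to the group_by() / summarise() version, not as the lecture's own solution.

```r
library(dplyr)

# Hypothetical miniature of imdb_top_250 (invented rows)
toy_top <- tibble(
  title = c("A", "B", "C", "D", "E"),
  year  = c(1995, 1995, 2014, 1957, 2014),
  score = c(8.6, 8.5, 8.3, 8.9, 8.1)
)

# Which 1995 movies made the list?
movies_1995 <- toy_top %>% filter(year == 1995)

# Which years have the most movies? Two equivalent spellings:
by_group <- toy_top %>%
  group_by(year) %>%
  summarise(total = n()) %>%
  arrange(desc(total))

by_count <- toy_top %>%
  count(year, sort = TRUE, name = "total")

movies_1995
by_group
by_count
```

count() bundles the grouping, counting, and sorting into one call, which is handy once the longer version is understood.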
How would you go about creating this visualization: Visualize the average yearly score for movies that made it on the top 250 list over time.
imdb_top_250 %>%
  group_by(year) %>%
  summarise(avg_score = mean(score)) %>%
  ggplot(aes(y = avg_score, x = year)) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlab("year")
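The same plot can be built on a few invented rows to see what each layer contributes. This is a sketch: toy_top is hypothetical data, and the plot is stored in p rather than drawn, so it can be inspected or printed later.

```r
library(dplyr)
library(ggplot2)

# Hypothetical scores (invented), two movies per year
toy_top <- tibble(
  year  = c(1990, 1990, 2000, 2000, 2010, 2010),
  score = c(8.6, 8.2, 8.4, 8.0, 8.3, 8.1)
)

# One row per year, holding that year's mean score
yearly <- toy_top %>%
  group_by(year) %>%
  summarise(avg_score = mean(score))

p <- ggplot(yearly, aes(x = year, y = avg_score)) +
  geom_point() +                 # one point per year
  geom_smooth(method = "lm") +   # straight-line trend with a confidence band
  labs(x = "year", y = "average score")

# print(p) draws the plot
```

Summarising first means each point is a yearly average, so years with a single movie count as much as years with many.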
library(jsonlite)
json_mario <- '[
  { "Name": "Mario", "Age": 32, "Occupation": "Plumber" },
  { "Name": "Peach", "Age": 21, "Occupation": "Princess" },
  {},
  { "Name": "Bowser", "Occupation": "Koopa" }
]'

mydf <- fromJSON(json_mario)
mydf

##     Name Age Occupation
## 1  Mario  32    Plumber
## 2  Peach  21   Princess
## 3   <NA>  NA       <NA>
## 4 Bowser  NA      Koopa
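Fields that are absent from a JSON record show up as NA in the data frame, which links back to last lecture's Missing Data topic. A short sketch, repeating the same records so the block runs on its own; missing_per_column is an illustrative name.

```r
library(jsonlite)

# Same records as json_mario above, repeated so the block runs on its own
json_mario <- '[
  { "Name": "Mario", "Age": 32, "Occupation": "Plumber" },
  { "Name": "Peach", "Age": 21, "Occupation": "Princess" },
  {},
  { "Name": "Bowser", "Occupation": "Koopa" }
]'

mydf <- fromJSON(json_mario)

# Number of missing values in each column
missing_per_column <- colSums(is.na(mydf))
missing_per_column
```

The empty record contributes one NA to every column, and Bowser's missing Age adds a second NA to the Age column.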
Compare how information is displayed on Gumtree Melbourne with the IMDb Top 250 list. What challenges can you foresee in scraping a list of the available apartments?
People write R packages to access online data! Check out:
This work is licensed under a Creative Commons Attribution 4.0 International License.