class: center, middle, inverse, title-slide # ETC1010: Data Modelling and Computing ## Lecture 6A: Style, file paths, & functions ### Dr. Nicholas Tierney & Professor Di Cook ### EBS, Monash U. ### 2019-09-04 --- class: bg-main1 # Recap .huge[ - Missing Data - Web Scraping ] --- class: bg-main1 # Upcoming due dates .huge[ - Assignment 3: Thursday 26th September (Released 12th September) - Practical Exam: 16th October (see practice exam on dmac) - Project: 18th October (See examples of past projects) ] --- class: bg-main1 # Practical Exam? .huge[ - A live data analysis - 1 Hour - See example on website ] --- class: bg-main1 # Project? .huge[ - Collect / find your own data - Clean the data - Determine interesting questions to answer about the data - Plan how to execute analysis of the data - Communicate the idea, data cleaning, and analysis (oral presentation) - Further details are on the dmac website ] --- class: bg-main1 # Lecture Overview .huge[ - Coding style - File paths and Rstudio projects - (Intro to) Using functions ] --- class: bg-main5 # Style guide .huge[ > "Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread." -- Hadley Wickham - Style guide for this course is based on the Tidyverse style guide: http://style.tidyverse.org/ - There's more to it than what we'll cover today, we'll mention more as we introduce more functionality, and do a recap later in the semester ] --- class: bg-main5 # File names and code chunk labels .huge[ - Do not use spaces in file names, use `-` or `_` to separate words - Use all lowercase letters ] ```r # Good ucb-admit.csv # Bad UCB Admit.csv ``` --- class: bg-main5 # Object names .huge[ - Use `_` to separate words in object names - Use informative but short object names - Do not reuse object names within an analysis ] ```r # Good acs_employed # Bad acs.employed acs2 acs_subset acs_subsetted_for_males ``` --- class: bg-main5 # Spacing .huge[ - Put a space before and after all infix operators (=, +, -, <-, etc.), and when naming arguments in function calls. - Always put a space after a comma, and never before (just like in regular English). ] ```r # Good average <- mean(feet / 12 + inches, na.rm = TRUE) # Bad average<-mean(feet/12+inches,na.rm=TRUE) ``` --- class: bg-main5 # ggplot .huge[ - Always end a line with `+` - Always indent the next line ] ```r # Good ggplot(diamonds, mapping = aes(x = price)) + geom_histogram() # Bad ggplot(diamonds,mapping=aes(x=price))+geom_histogram() ``` --- class: bg-main5 # Long lines .huge[ - Limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font. - Take advantage of RStudio editor's auto formatting for indentation at line breaks. ] --- class: bg-main5 # Assignment .huge[ - Use `<-` not `=` ] ```r # Good x <- 2 # Bad x = 2 ``` --- class: bg-main5 # Quotes .huge[ Use `"`, not `'`, for quoting text. The only exception is when the text already contains double quotes and no single quotes. ] ```r ggplot(diamonds, mapping = aes(x = price)) + geom_histogram() + # Good labs(title = "`Shine bright like a diamond`", # Good x = "Diamond prices", # Bad y = 'Frequency') ``` --- class: bg-main1 # Your Turn: ed .huge[ - Go to ed and assess your understanding of code style (there may be some technical difficulty) ] --- class: bg-main3 # File Paths and organising yourself .huge[ - It's important when you start working on your own machine that you understand _file storage hygiene_. - It helps prevent unexpected problems and makes you more productive - You'll spend less time fighting against strange file paths. - Not sure what a file path is? We'll explain that as well. ] --- class: bg-main1 # Your Turn .huge[ Discuss: 1. What your normal "workflow" is for starting a new project 2. Possible challenges that might arise when maintaining your project ] --- class: bg-main1 # A Mantra: When you start a new project \- Open a new RStudio project .huge[ - This section is heavily influenced by [Jenny Bryan's great blog post on project based workflows.](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/) - Sometimes this is the first line of an R Script or R markdown file. ```r setwd("c:/really/long/file/path/to/this/directory) ``` - What do you think the `setwd` code does? ] --- class: bg-main1 # What does `setwd()` do? .huge[ - "set my working directory to this specific working directory". - It means that you can read in data and other things like this: ```r data <- read_csv("data/mydata.csv") ``` - Instead of ```r data <- read_csv("c:/really/long/file/path/to/this/directory/data/mydata.csv") ``` ] --- class: bg-main1 # What even is a file path? .huge[ - This has the effect of **making the file paths work in your file** - This is a problem because, among other things, using `setwd()` like this: - Has 0% chance of working on someone else's machine (**this includes you in >6 months**) - Your file is not self-contained and portable. (Think: _"What if this folder moved to /Downloads, or onto another machine?"_) - To get this to work, you need to hand edit the file path to your machine. - This is painful. And when you do this all the time, it gets old, fast. ] --- class: bg-main1 # What even is a file path? .huge[ - This all might be a bit confusing if you don't know what a file path is. - A file path is: "the machine-readable directions to where files on your computer live." - So, this file path: ``` /Users/njtierney/rmd4sci-materials/demo-gapminder.Rmd ``` Describes the location of the file "demo-gapminder.Rmd". ] --- class: bg-main1 # What even is a file path .huge[ We could visualise the path ``` /Users/njtierney/rmd4sci-materials/demo-gapminder.Rmd ``` as: ``` users └── njtierney └── rmd4sci-materials └── demo-gapminder.Rmd ``` ] --- class: bg-main1 .huge[ - To read in the `gapminder.csv` file, you might need to write code like this: ```r gapminder <- read_csv("/Users/njtierney/Desktop/rmd4sci-materials/data/gapminder.csv") ``` This is a problem, because this is not portable code. ] --- class: bg-main1 .huge[ If you have an RStudio project file inside the `rmd4sci-materials` folder, you can instead write the following: ```r gapminder <- read_csv("data/gapminder.csv") ``` ] --- class: bg-main1 # Your Turn .huge[ - (1-2 minutes) What folders are above the `health.csv` file in the following given file path? ] `"/Users/miles/etc1010/week1/data/health.csv"` --- class: bg-main1 # Your Turn .huge[ * What would be the result of using the following code in `demo-gapminder.Rmd`, and then using the code, and then moving this to another location, say inside your C drive? ```r setwd("Downloads/etc1010/week1/week1.Rmd) ``` ] --- class: bg-main1 # Is there an answer to the madness? .huge[ - This file path situation is a real pain. - Is there an answer to the madness? ] -- .huge[ The answer is yes! I highly recommend when you start on a new idea, new research project, paper. Anything that is new. It should start its life as an **rstudio project**. ] --- class: bg-main1 # Rstudio projects .huge[ An rstudio project helps keep related work together in the same place. Amongst other things, they: * Keep all your files together * Set the working directory to the project directory * Starts a new session of R * Restore previously edited files into the editor tabs * Restore other rstudio settings * Allow for multiple R projects open at the same time. ] --- class: bg-main1 # Rstudio projects .vlarge[ This helps keep you sane, because: * Your projects are each independent. * You can work on different projects at the same time. * Objects and functions you create and run from project idea won't impact one another. * You can refer to your data and other projects in a consistent way. And finally, the big one **RStudio projects help resolve file path problems**, because they automatically set the working directory to the location of the rstudio project. ] --- class: bg-main1 # The "here" package .huge[ - RStudio projects help resolve file path problems - In some cases you might have many folders in your r project. To help navigate them appropriately, you can use the `here` package to provide the full path directory, in a compact way. ```r here::here("data") ``` returns ``` [1] "/Users/njtierney/Desktop/rmd4sci-materials/data" ``` ] --- class: bg-main1 # The `here` package .huge[ ```r here::here("data", "gapminder.csv") ``` returns ``` [1] "/Users/njtierney/Desktop/rmd4sci-materials/data/gapminder.csv" ``` You can read the above `here` code as: > In the folder `data`, there is a file called `gapminder.csv`, can you please give me the full path to that file? ] --- class: bg-main1 # The `here` package .huge[ This is really handy for a few reasons: 1. It makes things _completely_ portable 1. Rmarkdown documents have a special way of looking for files, this helps eliminate file path pain. 1. If you decide to not use RStudio projects, you have code that will work on _any machine_ ] --- class: bg-main1 # Remember .huge[ > If the first line of your R script is ``` setwd("C:\Users\jenny\path\that\only\I\have") ``` > I will come into your office and SET YOUR COMPUTER ON FIRE 🔥. -- Jenny Bryan ] --- class: bg-main1 # Aside: How to create an RStudio project .huge[ - Go to [section 5.12 of rmarkdown for scientists](https://rmd4sci.njtierney.com/workflow.html#aside-creating-an-rstudio-project) ] --- class: bg-main1 # Summary of file paths and rstudio projects .huge[ In this lesson we've: * Learnt what file paths are * How to setup an rstudio project * How to construct full file paths with the `here` package ] --- class: bg-black center middle .white[ # Motivating Functions ] --- class: bg-main1 # Do you see any problems with this code? ```r st_episode <- st %>% html_nodes(".np_right_arrow .bp_sub_heading") %>% html_text() %>% str_replace(" episodes", "") %>% as.numeric() got_episode <- got %>% html_nodes(".np_right_arrow .bp_sub_heading") %>% html_text() %>% str_replace(" episodes", "") %>% as.numeric() twd_episode <- got %>% html_nodes(".np_right_arrow .bp_sub_heading") %>% html_text() %>% str_replace(" episodes", "") %>% as.numeric() ``` --- class: bg-main1 # Next Lecture: Why functions? .huge[ - Automate common tasks in a power powerful and general way than copy-and-pasting: - You can give a function an evocative name that makes your code easier to understand. - As requirements change, you only need to update code in one place, instead of many. - You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another). ] -- .huge[ - Down the line: Improve your reach as a data scientist by writing functions (and packages!) that others use ] --- ## Share and share alike <a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.