This assignment is designed to simulate a scenario where you are taking over someone’s existing work, and continuing with it to draw some further insights.
This is a real world dataset taken from the Crime Statistics Agency Victoria. https://www.crimestatistics.vic.gov.au/download-data, specifically the data called “Data tables - Spotlight: Burglary/Break and Enter Offences Recorded in Victoria visualisation - year ending December 2018 (XLSX, 4.4 MB)”. The raw data is used for this assignment, with no changes made.
You are writing a quick summary of the data, following along from guidance from Amelia, and some of the questions your manager has. This is not a formal report, but rather something you are giving to your manager that describes the data, and some interesting insights. We have written example text for the first section on Monash, and would like you to explore another area. Our example writings are a good example of how to get full marks.
Your “colleague”, Amelia (in the text treatment below) has written some helpful hints throughout the assignment to help guide you.
Questions that are work marks are indicated with **
at the start and end of the question, as well as a number of marks in parenthesis.
This assignment will be worth 4% of your total grade, and will be marked out of 16 marks total.
10 marks for the questions
Your marks will then be weighted according to peer evaluation.
Sections that contain marks are indicated with **
, and will have the number of marks indicated in parentheses. For example:
# `**` What are the types of item divisions? How many are there? (0.5 Mark) `**`
As of week 1, you have seen some of the code used here, but I do not expect you to know immediately what the code below does. This is a challenge for you! We will be covering skills on data summary and data visualisation in the next two weeks, but this assignment is designed to simulate a real life work situation - this means that there are some things where you need to “learn on the job”. But the vast majority of the assignment will cover things that you will have seen in class, or the readings.
Remember, you can look up the help file for functions by typing ?function_name
. For example, ?mean
. Feel free to google questions you have about how to do other kinds of plots, and post on the ED if you have any questions about the assignment.
To complete the assignment you will need to fill in the blanks for function names, arguments, or other names. These sections are marked with ***
or ___
. At a minimum, your assignment should be able to be “knitted” using the knit
button for your Rmarkdown document.
If you want to look at what the assignment looks like in progress, but you do not have valid R code in all the R code chunks, remember that you can set the chunk options to eval = FALSE
. If you do this, please remember to ensure that you remove this chunk option or set it to eval = TRUE
when you submit the assignment, to ensure all your R code runs.
You will be completing this assignment in your assigned groups. A reminder regarding our recommendations for completing group assignments:
Your assignments will be peer reviewed, and results checked for reproducibility. This means:
Each student will be randomly assigned another team’s submission to provide feedback on three things:
This assignment is due in by close of business (5pm) on Friday 16th August. You will submit the assignment via ED. Please change the file name to include your teams name. For example, if you are team dplyr
, your assignment file name could read: “assignment-1-2019-s2-team-dplyr.Rmd”
You work as a data scientist in the well named company, “The Security Company”, that sells security products: alarms, surveillance cameras, locks, screen doors, big doors, and so on.
It’s your second day at the company, and you’re taken to your desk. Your boss says to you:
Amelia has managed to find this treasure trove of data - get this: crime statistics on breaking and entering around Victoria for the past years! Unfortunately, Amelia just left on holiday to New Zealand. They discovered this dataset the afternoon before they left on holiday, and got started on doing some data analysis.
We’ve got a meeting coming up soon where we need to discuss some new directions for the company, and we want you to tell us about this dataset and what we can do with it. We want to focus on Monash, since we have a few big customers in that area, and then we want you to help us compare that whatever area has the highest burglary.
You’re in with the new hires of data scientists here. We’d like you to take a look at the data and tell me what the spreadsheet tells us. I’ve written some questions on the report for you to answer, and there are also some questions from Amelia I would like you to look at as well.
Most Importantly, can you get this to me by COB Friday 16th August (COB = Close of Business at 5pm).
I’ve given this dataset to some of the other new hire data scientists as well, you’ll all be working as a team on this dataset. I’d like you to all try and work on the questions separately, and then combine your answers together to provide the best results.
From here, you are handed a USB stick. You load this into your computer, and you see a folder called “vic-crime”. In it is a folder called “data-raw”, and an Rmarkdown file. It contains the start of a data analysis. Your job is to explore the data and answer the questions in the document.
Note that the text that is written was originally written by Amelia, and you need to make sure that their name is kept up top, and to pay attention to what they have to say in the document! # Data read in.
Amelia: First, let’s read in the data using the function
read_excel()
from thereadxl
package, and clean up the names, using therename
function fromdplyr
.
library(readxl)
crime_raw <- read_excel("data-raw/Data_tables_spotlight_burglary_break_and_enter_visualisation_year_ending_December_2018_v3.xlsx",
sheet = 6)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
crime <- crime_raw %>%
rename(year = `Year ending December`,
local_gov_area = `Local Government Area`,
offence_subgroup = `Offence Subgroup`,
item_division = `Property Item Division`,
item_subdivision = `Property Item Subdivision`,
n_property_items = `Number of Property Items`)
Amelia: Let’s print the data and look at the first few rows.
year | local_gov_area | offence_subgroup | item_division | item_subdivision | n_property_items |
---|---|---|---|---|---|
2009 | Alpine | B321 Residential non-aggravated burglary | Cash/Document | Cash/Document | 3 |
2009 | Alpine | B321 Residential non-aggravated burglary | Cigarettes/Liquor | Cigarettes/Liquor | 3 |
2009 | Alpine | B321 Residential non-aggravated burglary | Electrical Appliances | Other Electrical Appliances | 4 |
Amelia: And what are the names of the columns in the dataset?
names(crime)
## [1] "year" "local_gov_area" "offence_subgroup"
## [4] "item_division" "item_subdivision" "n_property_items"
Amelia: How many years of data are there?
summary(crime$year)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2009 2011 2014 2014 2016 2018
Amelia: We have data that goes from 2009 until 2018, that’s nine years of data!
How many Local Government Areas (LGAs) are there? And what are the LGAs called?
n_distinct(crime$local_gov_area)
## [1] 80
unique(crime$local_gov_area)
## [1] "Alpine" "Ararat" "Ballarat"
## [4] "Banyule" "Bass Coast" "Baw Baw"
## [7] "Bayside" "Benalla" "Boroondara"
## [10] "Brimbank" "Buloke" "Campaspe"
## [13] "Cardinia" "Casey" "Central Goldfields"
## [16] "Colac-Otway" "Corangamite" "Darebin"
## [19] "East Gippsland" "Frankston" "Gannawarra"
## [22] "Glen Eira" "Glenelg" "Golden Plains"
## [25] "Greater Bendigo" "Greater Dandenong" "Greater Geelong"
## [28] "Greater Shepparton" "Hepburn" "Hindmarsh"
## [31] "Hobsons Bay" "Horsham" "Hume"
## [34] "Indigo" "Kingston" "Knox"
## [37] "Latrobe" "Loddon" "Macedon Ranges"
## [40] "Manningham" "Mansfield" "Maribyrnong"
## [43] "Maroondah" "Melbourne" "Melton"
## [46] "Mildura" "Mitchell" "Moira"
## [49] "Monash" "Moonee Valley" "Moorabool"
## [52] "Moreland" "Mornington Peninsula" "Mount Alexander"
## [55] "Moyne" "Murrindindi" "Nillumbik"
## [58] "Northern Grampians" "Port Phillip" "Pyrenees"
## [61] "Queenscliffe" "South Gippsland" "Southern Grampians"
## [64] "Stonnington" "Strathbogie" "Surf Coast"
## [67] "Swan Hill" "Towong" "Wangaratta"
## [70] "Warrnambool" "Wellington" "West Wimmera"
## [73] "Whitehorse" "Whittlesea" "Wodonga"
## [76] "Wyndham" "Yarra" "Yarra Ranges"
## [79] "Yarriambiack" "Victoria"
Amelia: That’s a lot of areas - about 80!
What are the types of offence subgroups? How many are there?
unique(crime$offence_subgroup)
## [1] "B321 Residential non-aggravated burglary"
## [2] "B322 Non-residential non-aggravated burglary"
## [3] "B311 Residential aggravated burglary"
## [4] "B319 Unknown aggravated burglary"
## [5] "B329 Unknown non-aggravated burglary"
## [6] "B312 Non-residential aggravated burglary"
n_distinct(crime$offence_subgroup)
## [1] 6
Amelia: Remember that you can learn more about what these functions do by typing
?unique
or?n_distinct
into the console.
**
What are the types of item divisions? How many are there? (0.5 Mark) **
RESPONSE: There are a total of 25 item divisions.
unique(crime$item_division)
## [1] "Cash/Document" "Cigarettes/Liquor"
## [3] "Electrical Appliances" "Firearms/Ammunition"
## [5] "Food" "Garden Items"
## [7] "Household Items" "Jewellery"
## [9] "Marine Property" "Other"
## [11] "Personal Property" "Sporting Goods"
## [13] "Tools" "Tv/Vcr"
## [15] "Clothing" "Photographic Equip"
## [17] "Power Tools" "Car Accessories"
## [19] "Weapons" "Domestic Pets"
## [21] "Furniture" "Timber/Build Mat"
## [23] "Police Property" "Livestock"
## [25] "Explosives"
n_distinct(crime$item_division)
## [1] 25
**
What are the types of item subdivisions? (0.5 Mark) **
RESPONSE: There are a total of 32 item sub-divisions.
unique(crime$item_subdivision)
## [1] "Cash/Document" "Cigarettes/Liquor"
## [3] "Other Electrical Appliances" "Video Game Unit"
## [5] "Firearms/Ammunition" "Food"
## [7] "Garden Items" "Household Items"
## [9] "Jewellery" "Marine Property"
## [11] "Other Property Items" "Personal Property"
## [13] "Sporting Goods" "Tools"
## [15] "Tv/Vcr" "Clothing"
## [17] "Photographic Equip" "Power Tools"
## [19] "Car Accessories" "Computer"
## [21] "Mobile Phone" "Weapons"
## [23] "Key" "Domestic Pets"
## [25] "Speaker" "Furniture"
## [27] "Timber/Build Mat" "Police Property"
## [29] "Livestock" "Explosives"
## [31] "Laptop" "Tablet Computer"
n_distinct(crime$item_subdivision)
## [1] 32
**
What is the summary of the number of property items? (0.5 Mark) **
summary(crime$n_property_items)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 7.00 74.15 29.00 45027.00
**
Can you tell me what each row represents, and what each of the columns measure? (1 Mark) **
Amelia: We need to describe what each row of the data represents, and take our best guess at what we think each column measures. It might be worthwhile looking through the excel sheet in the
data
folder, or on the website where the data was extracted.
RESPONSE: Each row represents an observation. Using Table 6 (from the raw data, and as per the prior question), the row tells us that for a given year, government area, offence subgroup, property item division and property item subdivision, the number of units (or property units) that were stolen. That is, for a given set of variables (the columns), we are able to find out the related number of property items stolen. From this, summaries can be formed as completed in the question prior.
Amelia: Let’s group by year and then sum up the number of property items. Then we can take this information and use
ggplot
to plot the year on the x axis, andn_items
on the y axis, and number of items as a column withgeom_col()
.
crime_year_n_items <- crime %>%
group_by(year) %>%
summarise(n_items = sum(n_property_items))
library(ggplot2)
ggplot(crime_year_n_items,
aes(x = year,
y = n_items)) +
geom_col()
Amelia: I try and write three sentences complete about what I learn in a graphic. You should start with a quick summary of what the graphic shows you. Then, describe what is on the x axis, the y axis, and any other colours used to separate the data. You then need to describe what you learn. So, I would say:
“A summary of the number of items stolen from burglaries for each year from 2009 until 2018. On the x axis is each year, and the y axis is the number of items stolen. We learn that the number of items stolen stays around 300,000 (3e+05 means the number 3 with 5 zeros after it), but from 201, the number of items stolen has decreased each year.”
Amelia: Let’s filter the data down to the ‘Monash’ LGAs.
crime_monash <- crime %>% filter(local_gov_area == "Monash")
Amelia: Let’s count the number of crimes per year.
crime_count_monash <- crime_monash %>% count(year)
ggplot(crime_count_monash,
aes(x = year,
y = n)) +
geom_col()
Amelia: This plot shows the number of burglary crimes per year across Victoria. The x axis shows the year, and the y axis shows the number of crimes scored for that year. There appears to be a slight upwards trend, but it looks variable for each year.
Amelia: We count the number of observations in each
offence_subgroup
to tell us which are the most common.
crime_monash %>% count(offence_subgroup)
## # A tibble: 6 x 2
## offence_subgroup n
## <chr> <int>
## 1 B311 Residential aggravated burglary 117
## 2 B312 Non-residential aggravated burglary 10
## 3 B319 Unknown aggravated burglary 6
## 4 B321 Residential non-aggravated burglary 273
## 5 B322 Non-residential non-aggravated burglary 248
## 6 B329 Unknown non-aggravated burglary 35
Amelia: The top subgroups are “B321 Residential non-aggravated burglary”, at 273, followed by “B322 Non-residential non-aggravated burglary” at 248.
Amelia: We take the crime data, then group by year, and count the number of offences in each year. We then plot this data. On the x axis we have year. On the y axis we have n, the number of crimes that take place in a subgroup in a year, and we are colouring according to the offence subgroup, and drawing this with a line, then making sure that the limits go from 0 to 30.
crime_year_offence_monash <- crime_monash %>%
group_by(year) %>%
count(offence_subgroup)
ggplot(crime_year_offence_monash,
aes(x = year,
y = n,
colour = offence_subgroup)) +
geom_line() +
lims(y = c(0, 35)) # Makes sure the y axis goes to zero
Amelia: This shows us that the most common offence is “residential non-aggravated burglary”,
Amelia: We count up the item subdivisions, which is the smallest category on items. We then plot number of times an item is stolen, and reorder the y axis so that the items are in order of most to least.
crime_items_monash <- crime_monash %>%
count(item_subdivision)
# save an object of the maximum number of items stolen
# to help construct the plot below.
max_items_stolen <- max(crime_items_monash$n)
ggplot(crime_items_monash,
aes(x = n,
y = reorder(item_subdivision, n))) +
geom_point() +
lims(x = c(0, max_items_stolen)) # make sure x axis goes from 0
Amelia:
Amelia: This could be where we focus our next marketing campaign! Let’s take the crime data, then count the number of rows in each local_gov_area, and take the top 5 results using
top_n
, and arrange in descending order by the column “n”
crime %>%
count(local_gov_area) %>%
top_n(n = 5) %>%
arrange(desc(n))
## Selecting by n
## # A tibble: 5 x 2
## local_gov_area n
## <chr> <int>
## 1 Victoria 1335
## 2 Casey 831
## 3 Wyndham 830
## 4 Greater Geelong 828
## 5 Brimbank 817
**
) Which LGA had the most crime? (0.5 Mark) (**
)RESPONSE: Victoria was the LGA with the most crime. It had 1,335 reported instances.
crime %>%
count(local_gov_area) %>%
top_n(n = 1) %>%
arrange(desc(n))
## Selecting by n
## # A tibble: 1 x 2
## local_gov_area n
## <chr> <int>
## 1 Victoria 1335
**
Subset the data to be the LGA with the most crime. (0.5 Mark) **
crime_victoria <- crime %>%
filter(local_gov_area == "Victoria")
**
Is crime in Victoria increasing? (1 Mark) **
crime_count_victoria <- crime_victoria %>% count(year)
ggplot(crime_count_victoria,
aes(x = year,
y = n)) +
geom_col()
RESPONSE: This plot shows the number of burglary crimes per year across the local government area of “Victoria”. The x-axis shows the year, and the y-axis shows the number of crimes recorded for that year. No specific trend is apparent, however each years’ number of recorded crimes varies/fluctuates around approximately 135.
**
What are the most common offences at Victoria across all years? (1 Marks) **
crime_victoria %>%
count(offence_subgroup)
## # A tibble: 6 x 2
## offence_subgroup n
## <chr> <int>
## 1 B311 Residential aggravated burglary 287
## 2 B312 Non-residential aggravated burglary 154
## 3 B319 Unknown aggravated burglary 29
## 4 B321 Residential non-aggravated burglary 308
## 5 B322 Non-residential non-aggravated burglary 308
## 6 B329 Unknown non-aggravated burglary 249
RESPONSE: The most common offences in Victoria across all years are tied with “Residential non-aggravated burglary” and “Non-residential non-aggravated burglary” both with 308 recorded instnces across all years.
**
Are any of these offences increasing over time? (1 Mark) **
crime_year_offence_victoria <- crime_victoria %>%
group_by(year) %>%
count(offence_subgroup)
ggplot(crime_year_offence_victoria,
aes(x = year,
y = n,
colour = offence_subgroup)) +
geom_line() +
lims(y = c(0, 35)) # Makes sure the y axis goes to zero
Amelia: I would write three sentences complete about what I learn in this graphic. You should start with a quick summary of what the graphic shows you. Then, describe what is on the x axis, the y axis, and any other colours used to separate the data. You then need to describe what you learn.
RESPONSE: This shows us that the most common offence in 2018 became “non-residential non-aggravated burglary”. The graph depicts a general decreasing trend in “Unknown aggravated burglary” and “Unknown non-aggravated burglary”. However, a general increasing trend is depicted for “Non-residential aggravated burglary”, “Residential aggravateg burglary” and “Non-residential non-aggravated burglary”. In summary, from the graph above we see (1) B321 and B322 and B311 have the similar fluctuate trend around 30. (2) B329 is fluctuate around 25 to 30 between 1999 to 2013, then it shows an obviously decreasing utill year 2017. After 2017 it shows an increasing trend. (3) B312 shows a fluctuate around 15 between 2009 to 2018 (4) B319 has a sharp decreas from 2009 to 2010, then it shows a fluctuate between 2010 and 2013. After 2013, it shows a relatively smooth trends until 2016. After 2016, it shows a decreasing trend.
crime_items_victoria <- crime_victoria %>%
count(item_subdivision)
ggplot(crime_items_victoria,
aes(x = n,
y = reorder(item_subdivision, n))) +
geom_point()
RESPONSE: The most common sub-division items stolen are Personal Property and Cash/Documents. This is closely followed by items such as Mobile Phones, Keys and Computers.
bind_rows()
Amelia: You can stack the data together using
bind_rows()
.
crime_top_monash <- bind_rows(crime_monash,
crime_victoria)
Amelia: Use ggplot to create two separate plots for each local government area using
facet_wrap()
on local government area.
crime_year_offence_both <- crime_top_monash %>%
group_by(year, local_gov_area) %>%
count(offence_subgroup)
gg_crime_offence <- ggplot(crime_year_offence_both,
aes(x = year,
y = n,
colour = offence_subgroup)) +
geom_line() +
facet_wrap(~ local_gov_area)
gg_crime_offence
crime_items_both <- crime_top_monash %>%
group_by(local_gov_area) %>%
count(item_subdivision)
ggplot(crime_items_both,
aes(x = n,
y = reorder(item_subdivision, n), # reorder the points
colour = local_gov_area)) +
geom_point()
**
Do you have any recommendations about future directions with this dataset? Is there anything else in the excel spreadsheet we could look at? (2 Mark) **
Amelia: I was planning on looking at the other tabs in the spreadsheet to help us use information on the tool used to break in. How could we use what is in there? And what is in there that looks useful?
RESPONSE: Our analysis is yet to consider numerous variables provided in the dataset. These include recorded instances of the “Building Point of Entry”, the “Building Method of Entry” and the “Tool Used” to gain entry. Analysis of these related instances may highlight a certain point of entry and method that is more common than the rest. As a result, a campaign can be launched targeting that specific means, informing the community as such and further helping reduce instances of this form of crime into the future. For example, if the “Rear/Back” is the most common building entry point with the most common method of entry being “Cut/Remove Flywire” (again, for example), then a campaign can be launched encouraging the community to install ‘crimsafe’ or related products protecting themselves against this mode/means of entry.
**
For our presentation to stake holders, you get to pick one figure to show them, which of the ones above would you choose? Why? Recreate the figure below here and write 3 sentences about it (2.5 Marks) **
I would include the following figure:
crime_year_offence_both <- crime_top_monash %>%
group_by(year, local_gov_area) %>%
count(offence_subgroup)
gg_crime_offence <- ggplot(crime_year_offence_both,
aes(x = year,
y = n,
colour = offence_subgroup)) +
geom_line() +
facet_wrap(~ local_gov_area)
gg_crime_offence
```
Amelia: Remember, when you are describing data visualisation, You should start with a quick summary of what the graphic shows you. Then, describe what is on the x axis, the y axis, and any other colours used to separate the data. You then need to describe what you learn.
RESPONSE: We would include the figure above. This figure demonstrates the reported instances of various forms of offence subgroups (being the alternating [non]residential [non]aggrevated burglary combinations), with their related reported number of intances attributed to a particular year. These subgroups are easily identifiable by their unique colour in the graph with a key being provided so they can easily be distinguished. The reason for choosing this graph relates to the context we are in. Being security company, we do wish to sell security goods (such as cameras, sensors and various preventative break-and-enter products such as ‘crim-safe’) to our customers. As such, we include the regions of Monash (assuming this is the region our firm operates in) as well as Victoria as this demonstrates the highest number of burglaries for all LGA’s (and is therefore used as a reference point). From these graphs, we can present to our stakeholders that certain forms of burglaries (being “unknown non-aggrevateg burglary” and “residential non-aggrevated burglary”) are on the decline for our local region. However, being a business their purpose is generally adopting a profit motive primarily, then informing the community (educational purpose) following this. As this graph demonstrates that all forms of aggrevated burglary are on the rise, particularly residential, our stakeholders will be better informed regarding this current crime environment. From this, the security firm can begin to promote their goods and services up for sale (such as ‘crim-safe’ to prevent break-and-enter style burglaries). From portraying this message, the security company is able to not only inform the community but will also likely generate revenue and profit as a result.
Amelia: I have got to remember to cite all the R packages that I have used, and any Stack Overflow questions, blog posts, text books, from online that I have used to help me answer questions.
Data downloaded from https://www.crimestatistics.vic.gov.au/download-data
Packages used (look for things which were loaded with library()
): * ggplot2 * dplyr * readxl * base * datasets * graphics * grDevices * methods * stats * utils