---
title: "ETC1010: Data Modelling and Computing"
author: "Professor Di Cook, EBS, Monash U."
output:
learnr::tutorial:
css: "css/logo.css"
runtime: shiny_prerendered
---
```{r setup, include=FALSE}
library(learnr)
library(knitr)
knitr::opts_chunk$set(echo = TRUE,
message = FALSE,
warning = FALSE,
collapse = TRUE,
fig.height = 6,
fig.width = 6,
fig.align = "center",
cache = FALSE)
tutorial_html_dependency()
library(tidyverse)
```
# Advanced data modeling
## Course web site
This is a link to the course web site, in case you need to go back and forth between tutorial and web materials: [http://dmac.dicook.org](http://dmac.dicook.org)
## Recap
To make a network analysis, you need
- an association matrix, that describes how nodes (vertices) are connected to each other
- a layout algorithm to place the nodes optimally so that the fewest edges cross, or that the nodes that are most closely associated are near to each other.
## Quantitative association matrices
The previous association matrices were black and white:
![](images/network_data.png)
but you could have the association between nodes described as real numbers also. For example, these are the number of times that these people called each other in the last week:
```{r echo=FALSE}
d <- matrix(c(0, 5, 4, 1, 1,
5, 0, 4, 2, 1,
4, 4, 0, 0, 0,
1, 2, 0, 0, 6,
1, 1, 0, 6, 0), ncol=5, byrow=T)
colnames(d) <- c("Meg", "Tay", "Yat", "Zili", "Jess")
rownames(d) <- colnames(d)
kable(d)
```
We would need to turn this into an edge data set:
```{r echo=FALSE}
d_edges <- d %>% as_tibble() %>%
mutate(from = rownames(d)) %>%
gather(to, count, -from)
d_edges
```
and then decide what corresponds to a "connection". Let's say, they need to have called each other at least 4 times, to be considered connected.
```{r echo=FALSE}
d_edges <- d_edges %>%
filter(count>3)
d_edges
```
and then we can make the network diagram.
```{r}
library(geomnet)
set.seed(1110100)
ggplot(data = d_edges, aes(from_id = from, to_id = to)) +
geom_net(layout.alg = "kamadakawai",
size = 2, labelon = TRUE, vjust = -0.6, ecolour = "grey60",
directed =FALSE, fontsize = 3, ealpha = 0.5) +
theme_net()
```
## Data: Last 4 months of currency USD cross-rates
`r set.seed(7);emo::ji("shocked")` SO let's try this with cross-currency rates across the globe!
- Data extracted from http://openexchangerates.org/api/historical
- R packages `jsonlite`, processed with `tidyverse`, `lubridate`
```{r echo=FALSE, fig.width=5, fig.height=5}
library(tidyverse)
library(lubridate)
library(gridExtra)
ru <- read_csv("data/rates_new.csv")
ru <- ru %>% arrange(date)
ru %>% head()
p1 <- ggplot(ru, aes(x=date, y=AUD)) + geom_line()
p2 <- ggplot(ru, aes(x=date, y=EUR)) + geom_line()
p3 <- ggplot(ru, aes(x=date, y=JPY)) + geom_line()
p4 <- ru %>% select(date, AUD, EUR, JPY) %>%
gather(curr, value, -date) %>%
ggplot(aes(x=date, y=value, colour=curr, group=curr)) +
geom_line() + theme(legend.position="none") +
scale_colour_brewer(palette="Dark2")
grid.arrange(p1, p2, p3, p4, ncol=2)
```
### Your turn
Make some plots (or google) to answer these questions
- Is the NZD more similar to AUD, EUR, or JPY? (What currency is NZD?)
- Is SGD more similar to AUD, EUR, or JPY? (What currency is SGD?)
- How many currencies are there in the British Isles?
```{r}
p1 <- ggplot(ru, aes(x=date, y=AUD)) + geom_line()
p2 <- ggplot(ru, aes(x=date, y=EUR)) + geom_line()
p3 <- ggplot(ru, aes(x=date, y=JPY)) + geom_line()
p4 <- ggplot(ru, aes(x=date, y=SGD)) + geom_line()
grid.arrange(p1, p2, p3, p4, ncol=2)
```
## Pre-processing data
### Keep only currencies that change over the period
Some currencies don't change very much. These should be filtered from the analysis, because in a study of currency movement, if it doesn't move then there is nothing more to be said.
To filter out these currencies we use a statistic called [coefficient of variation](https://en.wikipedia.org/wiki/Coefficient_of_variation). This measures the standard deviation of a currency relative to its mean. If a mean is high, we would expect a currency to change more, that is, relatively the standard deviation would be larger to consider it to be changing.
```{r fig.width=5, fig.height=5}
# Compute coefficient of variation. We will only analyse
# currencies that have changes substantially over this time.
# Dates dropped
cv <- function(x) sd(x)/mean(x)
s <- ru %>% select(-date) %>%
summarise_all(funs(cv)) %>%
gather(curr, cv) %>%
filter(cv > 0.0027)
ru_sub <- ru %>% select(s$curr)
```
### Remove currencies that are not currencies
Some of the currencies, are not really currencies. Google these ones: ALL, XAG, XDR, XPT - what are they?
```{r}
# Remove non-currencies
ru_sub <- ru_sub %>% select(-ALL, -XAG, -XDR, -XPT)
```
### Standardize the currencies
To examine overall trend regardless of actual USD cross rate, standardise the values to have mean 0 and standard deviation 1.
```{r}
scale01 <- function(x) (x-mean(x))/sd(x)
ru_sub <- ru_sub %>%
mutate_all(funs(scale01))
ru_sub_dt <- ru_sub %>% mutate(date=ru$date)
p1 <- ggplot(ru_sub_dt, aes(x=date, y=AUD)) + geom_line()
p2 <- ggplot(ru_sub_dt, aes(x=date, y=EUR)) + geom_line()
p3 <- ggplot(ru_sub_dt, aes(x=date, y=JPY)) + geom_line()
p4 <- ru_sub_dt %>% select(date, AUD, EUR, JPY) %>%
gather(curr, value, -date) %>%
ggplot(aes(x=date, y=value, colour=curr, group=curr)) +
geom_line() + theme(legend.position="none") +
scale_colour_brewer(palette="Dark2")
grid.arrange(p1, p2, p3, p4, ncol=2)
```
### Compute distances between all pairs of currencies
Euclidean distance is used to compute similarity between all pairs of currencies.
$d_{ij} = \sqrt{\sum_{i=1}^{t}{(C_{1i}-C_{2i})^2}}$
```{r}
# Compute distance between currencies
# Need to transpose! Turn matrix around, rows <--> columns
ru_sub_t <- t(ru_sub)
ru_sub_t <- data.frame(ru_sub_t)
d <- as.matrix(dist(ru_sub_t, diag=TRUE, upper=TRUE))
colnames(d) <- as.factor(colnames(ru_sub))
rownames(d) <- as.factor(colnames(ru_sub))
quantile(d, probs=c(0, 0.25, 0.5, 0.75, 1))
```
NOTE: A distance matrix is the inverse of an association matrix. With a distance matrix close to 0 means the pair are most similar. For an association matrix far from zero means the pair are close. Either can be used to generate a network.
## Make the network
### Gather the data into long form, and filter based on similarity
Here only the pairs of currencies who are closer than "4" to each other are kept.
```{r}
d_zero <- d
d_zero_tbl <- d_zero %>%
as_tibble() %>%
mutate(curr1=rownames(d_zero)) %>%
gather(curr2, dst, -curr1) %>%
filter(dst<3) %>%
filter(curr1 != curr2)
```
### Network laid out
```{r}
# Make network
library(geomnet)
set.seed(10052016)
ggplot(data = d_zero_tbl, aes(from_id = curr1, to_id = curr2)) +
geom_net(layout.alg = "kamadakawai",
size = 2, labelon = TRUE, vjust = -0.6, ecolour = "grey60",
directed =FALSE, fontsize = 3, ealpha = 0.5) +
#xlim(c(-0.05, 1.05)) +
theme_net() +
theme(legend.position = "bottom")
```
### Your turn
Make a plot of the AUD vs the SGD (using the standardised units). Do they look like they are trending together as suggested by the network?
```{r eval=FALSE}
p1 <- ggplot(ru_sub_dt, aes(x=date, y=AUD)) + geom_line()
p2 <- ggplot(ru_sub_dt, aes(x=date, y=SGD)) + geom_line()
p3 <- ggplot(ru_sub_dt, aes(x=date, y=NZD)) + geom_line()
p4 <- ru_sub_dt %>% select(date, AUD, SGD, NZD) %>%
gather(curr, value, -date) %>%
ggplot(aes(x=date, y=value, colour=curr, group=curr)) +
geom_line() + #theme(legend.position="none") +
scale_colour_brewer(palette="Dark2")
grid.arrange(p1, p2, p3, p4, ncol=2)
```
## Cluster analysis
Network analysis is related to clustering the data. This means, can we group the nodes (points) into a small number of clusters. Both start from a distance matrix.
### k-means algorithm
The simplest clustering algorithm is called $k$-means. The goal is to divide the observations into $k$ groups. You can try different values of $k$ to determine which is better.
The algorithm is simple:
1. Partition the $n$ objects into
$k$ initial clusters $c_1,...,c_k$.
2. Do the following:
- Assign all objects $X_i, i=1,...,n$ to cluster $c_k$ that has the closest mean.
- Update the cluster means, from the new groups
3. Repeat 2 until no changes in cluster assignment is observed (this means the algorithm has converged).
There is a really nice explanation, and animation [here](https://theanlim.rbind.io/post/clustering-k-means-k-means-and-gganimate/).
And the animation package also has a small demonstration
```{r eval=FALSE}
library(animation)
ani.options(interval = 1)
par(mar = c(3, 3, 1, 1.5), mgp = c(1.5, 0.5, 0))
kmeans.ani()
```
```{r eval=FALSE}
kmeans.ani(x = cbind(X1 = runif(100), X2 = runif(100)), centers = 5,
pch = 1:5, col = 1:5)
```
## Clustering currencies
We will choose to divide the currencies into four groups.
```{r}
k <- 6
cntrs <- ru_sub_t[1:6,]
ru_km <- kmeans(ru_sub_t, cntrs, iter.max=50, nstart=5)
sort(ru_km$cluster)
```
Join the cluster ids to the data and plot the original time series.
```{r fig.height=5, fig.width=5}
ru_cl <- tibble(curr=names(ru_km$cluster), cl=factor(ru_km$cluster))
ru_sub_long <- ru_sub %>%
mutate(date=ru$date) %>%
gather(curr, value, -date) %>%
left_join(ru_cl, by="curr")
p <- ggplot(ru_sub_long, aes(x=date, y=value, group=curr, colour=cl, label=curr)) +
geom_line() +
facet_wrap(~cl)
library(plotly)
ggplotly(p)
```
*What do we learn?*
- Cluster 4 currencies have effectively been devaluing relative to the USD ove the last few months. This cluster includes the AUD.
- Cluster 3 had a short sharp devaluing, but has been stable since.
- Cluster 1 are currencies that all have the same, many, short term fluctuations.
- Cluster 4 is a miscellaneous cluster.
### Deciding on k
Deciding on $k$, can be done by examining the total within sum of squares (tot.withinss). This measures the spread of the points around its cluster mean. If the value is 0, then all points would coincide with the cluster mean - the perfect solution. As you increase $k$ this will always get smaller tot.withinss, so a good solution is one that produces a large reduction relative to the smaller $k$.
```{r fig.height=4, fig.width=4}
ru_km$tot.withinss
twss <- NULL
for (i in 3:7) {
k <- i
cntrs1 <- ru_sub_t[1:k,]
ru_km1 <- kmeans(ru_sub_t, cntrs1, iter.max=50, nstart=5)
twss <- c(twss, ru_km1$tot.withinss)
}
twss_df <- tibble(k=3:7, twss=twss)
ggplot(twss_df, aes(x=k, y=twss)) + geom_line()
```
### Your turn
- Fit the $k=6$ cluster $k$-means model
- Plot the solution, the time series of the currencies that belong to each of the 6 clusters.
## Share and share alike

This work is licensed under a Creative Commons Attribution 4.0 International License.