Introduction

Team members: Michael Roberts, Min Liu, Nicholas Hovens, Zhexu Zhang

What is tennis?


  • Tennis is a racket sport
  • It can be played individually against a single opponent or between two teams of two players each
  • Each player uses a racket to hit the tennis ball over the net into the opponent's side of the court
  • The object of the game is to hit the ball in such a way that the opponent cannot play a valid return
  • The player who plays the winning shot wins the point

About our data

Australian Men’s ATP Rankings

Cleaning of the data

atp_players1 <- atp_players %>%
 select(name, player_id, country_code)%>%
 filter(country_code == "AUS")

atp_rankings1 <- atp_rankings %>%
 select(ranking, player_id, date)%>%
 filter(ranking <= 100) 

atp_data <- atp_players1 %>%
  left_join(atp_rankings1, by = "player_id") %>%
  mutate(year = year(date), 
         day = day(date),
         month = month(date, label = TRUE, abbr = TRUE))

current_atprankings <- fetch_atp_rankings("2019-05-20", min_rank = 1, max_rank = 100) %>%
 select(player, rank, date) %>%
  rename(c("player" = "name", "rank" = "ranking"))

currentatp_data <- atp_players1 %>%
  full_join(current_atprankings, by = "name") %>%
  na.omit(currentatp_data)
  • The atp_players data was read in, the relevant variables were selected and the country was filtered to show only Australian players
  • The atp_rankings data was read in, the relevant variables were selected and the rankings were filtered to show the top 100
  • The ATP players and ATP rankings data were then combined via a left join so that each ranking could be attributed to a player's name
  • The date of each ranking was also split into three variables: year, month and day
  • The same was done for the current ATP rankings (fetch_atp_rankings); a couple of variables were renamed so they could be joined to the atp_players data, and NA values were omitted

Individual Australian players ATP rankings

  • Lleyton Hewitt has been Australia's highest-ranked male tennis player since 2001, reaching No. 1
  • He was ranked in the top 100 from 2001 to 2012, and again from 2013 to 2015


  • We have had at least 1 player ranked in the top 100 each year since 2001
  • We have had at least 1 player ranked in the top 50 each year except 2009 and 2011
  • Australia hasn’t had any men ranked in the top 10 since 2006

Total Australians in the ATP top 100 per year

  • Between 2001-2018, 2001 saw the most Australian male tennis players in the top 100 with 7
  • Rapid decline after that, with Australia having only 1 or 2 male tennis players in the top 100 from 2003 to 2012
  • There looks to be a steady increase from 2013 onwards

Current Australians in the top 100 ATP Rankings

  • As of right now, Australia has 5 players ranked in the ATP top 100
  • Continues with the trending increase of Australian players in the ATP top 100
  • Highest Australian right now is Nick Kyrgios who is ranked at 36


Top ten countries with players in the ATP top 100 since 2001

  • Spain has had the most players ranked in the top 100 since 2001 with 234 players
  • Australia has had the 10th most with 52 players
  • 7 of the countries in the top ten are from Europe

Australian Women’s WTA Rankings

Cleaning of the data

wta_players1 <- wta_players %>%
 select(name, player_id, country_code)%>%
 filter(country_code == "AUS")

wta_rankings1 <- wta_rankings %>%
 select(ranking, player_id, date)%>%
 filter(ranking <= 100) 

wta_data <- wta_players1 %>%
  left_join(wta_rankings1, by = "player_id") %>%
  mutate(year = year(date), 
         day = day(date),
         month = month(date, label = TRUE, abbr = TRUE))

wta_ausrankings <- wta_data %>%
  select(name, ranking, year, day, month) %>%
  filter(year > 2000, month == "Jan", day < 8)

current_wtarankings <- fetch_wta_rankings(min_rank = 1, max_rank = 100) %>%
  select(player, nation, rank, date) %>%
  filter(nation == "Australia")
  • Similar to the cleaning for the ATP rankings
  • wta_players, wta_rankings and fetch_wta_rankings were used to obtain the data

Individual Australian players WTA rankings

  • Samantha Stosur has been Australia's highest-ranked female tennis player since 2001, reaching No. 6
  • She has been in the top 100 since 2005 and was in the top 25 from 2010 to 2015


  • Since 2001, there hasn’t been a year with no Australian female tennis players in the top 100
  • Australia has had at least one female tennis player ranked in the top 50 every year since 2001, except for 2009

Total Australians in the WTA top 100 per year

  • 2002 and 2008 had the most female players in the top 100 with ten
  • 2014 and 2017 had the least with two
  • There seems to be a steady decrease in Australian players in the WTA top 100 since 2008

Current Australians in the top 100 WTA Rankings

  • As of right now, Australia has 5 players ranked in the WTA top 100
  • This is the most Australia has had in the WTA top 100 since 2012
  • This could show the beginning of another rise in Australians in the WTA top 100
  • Highest Australian right now is Ash Barty who is ranked 8th


Top ten countries with players in the WTA top 100 since 2001

  • Russia has had the most players ranked in the top 100 since 2001 with 314 players
  • Australia has had the 9th most with 89
  • 7 of the countries in the top ten are from Europe

Predicting players ATP rankings in ten years

In this section, we will attempt to discover whether the ranking of an ATP player can be predicted with any degree of reliability by looking at their ranking in the early days of their professional career and their overall performance in ATP tournaments, with a special focus on grand slams. This section continues to use the deuce package, and the first task is to obtain a profile of each player, with their rankings and some other personal information, such as date of birth. This is accomplished through the atp_rankings data set, which is joined to the atp_players table, a second table containing a more complete record of each player's name and date of birth. The result of this join is displayed below:

We would now like to split the rankings of players into their ATP ranking points after the first 3, 6 and 9 months, then 1 through to 5 years, of their arrival on the ATP tour. This is done by taking the ranking table formed previously, grouping by player, and filtering out all dates outside the time period of interest (so to determine ATP ranking points after 6 months, we select the last recorded value of ATP ranking points after filtering out dates more than 6 months after their first recorded date). This produces a table for each time period, and these tables are then joined together to create a single time breakdown of the progression of these players' ATP ranking points, as sketched below.
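
A minimal sketch of the idea for the 6-month cutoff (player_rankings and ranking_points are placeholder names for the joined ranking table built above and its points column; the exact code used in the report may differ):

library(dplyr)
library(lubridate)

six_months <- player_rankings %>%
  group_by(name) %>%
  filter(date <= min(date) %m+% months(6)) %>%  # keep only the first 6 months on tour
  filter(date == max(date)) %>%                 # last recorded ranking in that window
  slice(1) %>%                                  # one row per player, even with ties
  ungroup() %>%
  select(name, rnkng_pnts_6mnths = ranking_points)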

OBTAINING SOME INDICATION OF GRAND SLAM PERFORMANCE

How players perform in the glamorous and exciting grand slams over the years may well prove to be a predictor of ranking points after a solid amount of time competing at the top level (10 years, hopefully). Below we calculate the number of wins and losses recorded by each player in grand slams. It is not, however, a good idea to split this variable into the same time frames as the ranking variables above, as some players may not have played a grand slam at all in their first year, or even their first 2 or 3 years on tour. Therefore, we take this observation after 4 years.

This is found by taking the table provided by the deuce package, atp_matches (containing point by point breakdown statistics of many grand slam matches over the course of history), and joining this data set with that describing ATP ranking points after 4 years. The date in this second table provides the cutoff date past which we should not consider grand slam results. A snippet of the code to produce this overall data set is shown below:

Separate tables are created for wins and losses in grand slams, and finally they are joined to create the final data set.

# ---- Grand slam wins and losses in the first 4 years on tour ----
# (plyr-style rename()/join() syntax is used here)
atp_matches <- rename(atp_matches, replace = c("winner_name" = "name"))

# attach each player's 4-year cutoff date and ranking points to every match
tournaments_4yr <- join(atp_matches, four_year, by = "name") %>%
  select(c(tourney_name, tourney_level, name, winner_rank, winner_rank_points,
           loser_name, loser_rank, loser_rank_points, tourney_start_date,
           date_4yr, rnkng_pnts_4yr))

# grand slam wins recorded before the 4-year cutoff (winner side)
gs_4yr <- tournaments_4yr %>%
  filter(tourney_level == "Grand Slams") %>%
  group_by(name) %>%
  filter(tourney_start_date <= date_4yr)
gs_4yr_wins <- count(gs_4yr$name) %>%
  rename(c("x" = "name", "freq" = "gs_wins"))
gs_4yr_wins <- join(gs_4yr_wins, four_year, by = "name") %>%
  separate(date_4yr, into = c("year_4yr", "month_4yr", "day_4yr")) %>%
  select(name, gs_wins)

# grand slam losses recorded before the 4-year cutoff (loser side)
gs_4yr <- tournaments_4yr %>%
  filter(tourney_level == "Grand Slams") %>%
  group_by(loser_name) %>%
  filter(tourney_start_date <= date_4yr)
gs_4yr_losses <- count(gs_4yr$loser_name) %>%
  rename(c("x" = "name", "freq" = "gs_losses"))
gs_4yr_losses <- join(gs_4yr_losses, four_year, by = "name") %>%
  separate(date_4yr, into = c("year_4yr", "month_4yr", "day_4yr")) %>%
  select(name, gs_losses)

# combine wins and losses into a single win/loss data set
gs_4yr_wlr <- join(gs_4yr_losses, gs_4yr_wins, by = "name")

head(gs_4yr_wlr)

OBTAINING OVERALL ATP TOURNAMENT PERFORMANCE

In tennis, there is a fair number of players who may do well in tournaments other than grand slams, but who repeatedly fail to bring it together on the big stage (Alexander Zverev and Nick Kyrgios are arguably two of those, just to name a few). We thought it would be good to include some reference to these other tournaments as indicators of performance on the ATP tour.

Luckily this information was available in the atp_matches table once more, and allowed us to focus on all professional-level tournaments including grand slams. This table was again joined to the four-year ranking table, for the same reason as in the previous step (obtaining grand slam performance). Four years was chosen as the cutoff again, to make the wins and losses obtained here comparable to those of the grand slam data.

## 'data.frame':    14729 obs. of  10 variables:
##  $ name             : chr  "A Benson" "A Escofet" "A Hall" "A Macdonald" ...
##  $ rnkng_pnts_3mnths: chr  "0" "0" "0" "1" ...
##  $ rnkng_pnts_6mnths: chr  "0" "0" "0" "1" ...
##  $ rnkng_pnts_9mnths: chr  "0" "0" "0" "1" ...
##  $ rnkng_pnts_1yr   : chr  "0" "0" "0" "1" ...
##  $ rnkng_pnts_2yr   : chr  "0" "0" "0" "1" ...
##  $ rnkng_pnts_3yr   : chr  "0" "0" "0" "1" ...
##  $ rnkng_pnts_4yr   : chr  "0" "0" "0" "1" ...
##  $ rnkng_pnts_5yr   : chr  "0" "0" "0" "1" ...
##  $ rnkng_pnts_10yr  : chr  "0" "0" "0" "1" ...

Having obtained this data, we could now join the ATP tournament, grand slam and ranking statistics and combine them all in one table, shown below.

## join ATP tournament performance to the ranking data
ranking_variables_distinct <- merge(ranking_variables, atp, by = "name")


## this join has to happen before adjusting for duplicate names
ranking_variables_distinct <- merge(ranking_variables_distinct, gs_4yr_wlr, by = "name")


head(ranking_variables_distinct)

The data set contained some duplicate rows because some players had identical names! This was dealt with by adjusting each player's name whenever it reappeared, appending an incrementing number to the end of it, e.g. William Brown 1, William Brown 2 and so on, if there were multiple instances of William Brown for some reason.
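
A minimal sketch of this de-duplication step (the report's exact code is not shown; this is just one way to do it):

library(dplyr)

# append an incrementing number to any name that appears more than once,
# e.g. "William Brown 1", "William Brown 2"
ranking_variables_distinct <- ranking_variables_distinct %>%
  group_by(name) %>%
  mutate(name = if (n() > 1) paste(name, row_number()) else name) %>%
  ungroup()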

## 'data.frame':    1089 obs. of  14 variables:
##  $ name             : chr  "Aaron Krickstein" "Abe Segal" "Adrian Mannarino" "Adrian Marcu" ...
##  $ rnkng_pnts_3mnths: chr  "0" "0" "4" "0" ...
##  $ rnkng_pnts_6mnths: chr  "0" "0" "4" "0" ...
##  $ rnkng_pnts_9mnths: chr  "0" "0" "7" "0" ...
##  $ rnkng_pnts_1yr   : chr  "0" "0" "5" "0" ...
##  $ rnkng_pnts_2yr   : chr  "0" "0" "60" "0" ...
##  $ rnkng_pnts_3yr   : chr  "0" "0" "82" "0" ...
##  $ rnkng_pnts_4yr   : chr  "0" "0" "340" "0" ...
##  $ rnkng_pnts_5yr   : chr  "0" "0" "391" "0" ...
##  $ rnkng_pnts_10yr  : chr  "0" "0" "729" "22" ...
##  $ atp_wins         : int  122 2 3 2 17 245 3 4 3 27 ...
##  $ atp_losses       : int  47 6 6 7 30 158 14 4 9 9 ...
##  $ gs_losses        : int  5 5 2 1 4 22 1 1 1 1 ...
##  $ gs_wins          : int  15 2 NA NA 5 44 1 NA NA 3 ...

PCA ON THE VARIABLES IN THIS DATA SET

In this section we conduct principal component analysis (PCA) on all the variables in this data set. This will allow us to identify which variables are intercorrelated, and which of them are most important in explaining the variation in the data. The variables will be combined into principal components which highlight the most important information in the data set.

The first step here is to take our data set, remove the name column, and format all the remaining variables as numeric, a key requirement for PCA.
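
A minimal sketch of this preparation step (judging by the second str() output below, rnkng_pnts_3mnths was subsequently dropped as well):

# overwrite the table with the numeric-only version used for the PCA below
ranking_variables_distinct <- ranking_variables_distinct %>%
  select(-name) %>%        # drop the identifier column
  mutate_all(as.numeric)   # PCA requires numeric variables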

## 'data.frame':    1089 obs. of  13 variables:
##  $ rnkng_pnts_3mnths: num  0 0 4 0 2 0 0 0 0 0 ...
##  $ rnkng_pnts_6mnths: num  0 0 4 0 2 0 57 0 0 0 ...
##  $ rnkng_pnts_9mnths: num  0 0 7 0 28 0 57 0 0 0 ...
##  $ rnkng_pnts_1yr   : num  0 0 5 0 57 0 48 0 0 0 ...
##  $ rnkng_pnts_2yr   : num  0 0 60 0 50 0 156 0 0 0 ...
##  $ rnkng_pnts_3yr   : num  0 0 82 0 150 0 260 0 0 458 ...
##  $ rnkng_pnts_4yr   : num  0 0 340 0 724 0 281 0 0 261 ...
##  $ rnkng_pnts_5yr   : num  0 0 391 0 885 0 611 74 0 42 ...
##  $ rnkng_pnts_10yr  : num  0 0 729 22 410 0 547 12 0 1 ...
##  $ atp_wins         : num  122 2 3 2 17 245 3 4 3 27 ...
##  $ atp_losses       : num  47 6 6 7 30 158 14 4 9 9 ...
##  $ gs_losses        : num  5 5 2 1 4 22 1 1 1 1 ...
##  $ gs_wins          : num  15 2 NA NA 5 44 1 NA NA 3 ...
## 'data.frame':    1089 obs. of  12 variables:
##  $ rnkng_pnts_6mnths: num  0 0 4 0 2 0 57 0 0 0 ...
##  $ rnkng_pnts_9mnths: num  0 0 7 0 28 0 57 0 0 0 ...
##  $ rnkng_pnts_1yr   : num  0 0 5 0 57 0 48 0 0 0 ...
##  $ rnkng_pnts_2yr   : num  0 0 60 0 50 0 156 0 0 0 ...
##  $ rnkng_pnts_3yr   : num  0 0 82 0 150 0 260 0 0 458 ...
##  $ rnkng_pnts_4yr   : num  0 0 340 0 724 0 281 0 0 261 ...
##  $ rnkng_pnts_5yr   : num  0 0 391 0 885 0 611 74 0 42 ...
##  $ rnkng_pnts_10yr  : num  0 0 729 22 410 0 547 12 0 1 ...
##  $ atp_wins         : num  122 2 3 2 17 245 3 4 3 27 ...
##  $ atp_losses       : num  47 6 6 7 30 158 14 4 9 9 ...
##  $ gs_losses        : num  5 5 2 1 4 22 1 1 1 1 ...
##  $ gs_wins          : num  15 2 NA NA 5 44 1 NA NA 3 ...

DATA STANDARDISATION

In principal component analysis, variables are often scaled (i.e. standardised). This is a good idea when variables are measured on different scales (e.g. kilograms, kilometres, centimetres); otherwise, the PCA outputs obtained will be severely affected.

The goal here is to make the variables comparable. Generally, variables are scaled to have: i) standard deviation one and ii) mean zero. The function PCA() from the FactoMineR package can be used for this.

This standardisation to a common scale prevents some variables from becoming dominant simply because of their large measurement units. It makes the variables comparable.

The R code below computes the principal component analysis on the active individuals/variables:

library("FactoMineR")
res.pca1 <- PCA(ranking_variables_distinct, graph = FALSE)  # scale.unit = TRUE by default, so variables are standardised
print(res.pca1)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 1089 individuals, described by 12 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

VISUALIZATION AND INTERPRETATION

We will use the factoextra R package to help in the interpretation of PCA.

EIGENVALUES / VARIANCES

The eigenvalues measure the amount of variation retained by each principal component in the data set. Eigenvalues are large for the first PCs and small for the subsequent PCs. That is, the first PCs correspond to the directions with the maximum amount of variation in the data set.

We examine the eigenvalues to determine the number of principal components to be considered. The eigenvalues and the proportion of variances (i.e., information) retained by the principal components (PCs) can be extracted using the function get_eigenvalue() from the factoextra package.
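
For example, using the PCA object created above (a sketch; the hidden code behind the output below will be similar):

library("factoextra")
eig.val <- get_eigenvalue(res.pca1)
eig.val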

##        eigenvalue variance.percent cumulative.variance.percent
## Dim.1  5.50886318       45.9071932                    45.90719
## Dim.2  3.19194422       26.5995352                    72.50673
## Dim.3  1.29124833       10.7604028                    83.26713
## Dim.4  0.52624329        4.3853607                    87.65249
## Dim.5  0.49032933        4.0860777                    91.73857
## Dim.6  0.27878221        2.3231851                    94.06175
## Dim.7  0.22789740        1.8991450                    95.96090
## Dim.8  0.13433485        1.1194571                    97.08036
## Dim.9  0.12205968        1.0171640                    98.09752
## Dim.10 0.09557539        0.7964616                    98.89398
## Dim.11 0.07987972        0.6656643                    99.55965
## Dim.12 0.05284239        0.4403533                   100.00000

Eigenvalues can be used to determine the number of principal components to retain after PCA (Kaiser 1961):

An eigenvalue > 1 indicates that PCs account for more variance than accounted by one of the original variables in standardized data. This is commonly used as a cutoff point for which PCs are retained. This is true only when the data is standardized.

You can also limit the number of components to the number that accounts for a certain fraction of the total variance. For example here, as can be seen from the eigenvalues above, we might choose to stop at dimension 7, as up to this point around 96% of the variation is explained.

SCREE PLOT

An alternative method for determining the number of principal components is to look at a scree plot, which is the plot of eigenvalues ordered from largest to smallest. The number of components is determined at the point beyond which the remaining eigenvalues are all relatively small and of comparable size (Jolliffe 2002; Peres-Neto, Jackson, and Somers 2005).

The scree plot can be produced using the function fviz_eig() or fviz_screeplot() from the factoextra package.
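
For example, the plot discussed below can be produced with:

# scree plot of the percentage of variance explained by each component
fviz_eig(res.pca1, addlabels = TRUE, ylim = c(0, 50))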

From the plot above, we might want to stop at the 7th principal component: roughly 96% of the information (variance) contained in the data is retained by the first seven principal components.

RESULTS

To extract the results for variables from a PCA output we can use the function get_pca_var() from the factoextra package. This function provides a list of matrices containing all the results for the active variables (coordinates, correlation between variables and axes, squared cosine and contributions).

The components of get_pca_var() can be used in the plot of our variables as follows:

  • var$coord: coordinates of the variables, used to create a scatter plot
  • var$cos2: the quality of representation of the variables on the factor map, calculated as the squared coordinates: var.cos2 = var.coord * var.coord
  • var$contrib: the contributions (in percentage) of the variables to the principal components. The contribution of a variable (var) to a given principal component is (in percentage): (var.cos2 * 100) / (total cos2 of the component)
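
These can be extracted as follows (a sketch; the hidden code behind the output below will be similar):

var <- get_pca_var(res.pca1)
var                   # overview of the available result matrices
head(var$coord)       # coordinates of the variables
head(var$cos2)        # quality of representation
head(var$contrib)     # contributions to the principal components (%)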

## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"
##                       Dim.1      Dim.2      Dim.3        Dim.4
## rnkng_pnts_6mnths 0.7396830 0.06636937  0.5356839  0.183535865
## rnkng_pnts_9mnths 0.8003324 0.07234540  0.5316438  0.097870521
## rnkng_pnts_1yr    0.8290986 0.08838776  0.4255562 -0.002166276
## rnkng_pnts_2yr    0.8603207 0.15971845 -0.0469459 -0.260805001
## rnkng_pnts_3yr    0.8709076 0.14336455 -0.1810676 -0.268405180
## rnkng_pnts_4yr    0.8767204 0.13948107 -0.3413019 -0.027754990
##                          Dim.5
## rnkng_pnts_6mnths -0.083898074
## rnkng_pnts_9mnths -0.055824477
## rnkng_pnts_1yr     0.005418478
## rnkng_pnts_2yr     0.185113611
## rnkng_pnts_3yr     0.166862814
## rnkng_pnts_4yr     0.061171122
##                       Dim.1       Dim.2       Dim.3        Dim.4
## rnkng_pnts_6mnths 0.5471309 0.004404894 0.286957200 3.368541e-02
## rnkng_pnts_9mnths 0.6405320 0.005233857 0.282645175 9.578639e-03
## rnkng_pnts_1yr    0.6874045 0.007812396 0.181098086 4.692751e-06
## rnkng_pnts_2yr    0.7401516 0.025509985 0.002203917 6.801925e-02
## rnkng_pnts_3yr    0.7584800 0.020553394 0.032785471 7.204134e-02
## rnkng_pnts_4yr    0.7686386 0.019454968 0.116487020 7.703395e-04
##                          Dim.5
## rnkng_pnts_6mnths 0.0070388868
## rnkng_pnts_9mnths 0.0031163722
## rnkng_pnts_1yr    0.0000293599
## rnkng_pnts_2yr    0.0342670488
## rnkng_pnts_3yr    0.0278431986
## rnkng_pnts_4yr    0.0037419062
##                       Dim.1     Dim.2      Dim.3        Dim.4       Dim.5
## rnkng_pnts_6mnths  9.931829 0.1380003 22.2232388 6.401110e+00 1.435542699
## rnkng_pnts_9mnths 11.627299 0.1639708 21.8892964 1.820192e+00 0.635567175
## rnkng_pnts_1yr    12.478155 0.2447535 14.0250393 8.917456e-04 0.005987792
## rnkng_pnts_2yr    13.435651 0.7991990  0.1706811 1.292544e+01 6.988578295
## rnkng_pnts_3yr    13.768358 0.6439146  2.5390523 1.368974e+01 5.678468970
## rnkng_pnts_4yr    13.952762 0.6095021  9.0212716 1.463847e-01 0.763141420

CORRELATION CIRCLE

The correlation between a variable and a principal component (PC) is used as the coordinates of the variable on the PC. The representation of variables differs from the plot of the observations: the observations are represented by their projections, whereas the variables are represented by their correlations (Abdi and Williams 2010). In the plot below, the variables are coloured by their cos2 values (quality on the factor map).
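
A plot of this kind can be produced with fviz_pca_var() from factoextra, for example (the colour gradient is just one possible choice):

fviz_pca_var(res.pca1, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)   # avoid overlapping variable labels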

The plot above is also known as a variable correlation plot. It shows the relationships between all the variables and can be interpreted as follows:

Positively correlated variables are grouped together. We can see that atp_wins and gs_wins are correlated, as are atp_losses and gs_losses. This makes sense, as players who do well in most ATP tournaments will tend to also perform in the grand slams, although there are some outliers as mentioned previously. Negatively correlated variables are positioned on opposite sides of the plot origin (opposed quadrants). Interestingly, ranking points and tournament performance are not correlated in this data. This is somewhat counterintuitive, as one would suspect that good tournament performance would result in better rankings. However, there will be some players who do well in tournaments that don't offer a high number of ranking points, which would explain why tournament performance and ranking points are not really correlated here.

The distance between a variable and the origin measures the quality of that variable on the factor map. Variables that are far from the origin are well represented. Due to the cluster of variables on this correlation plot, it is hard to see the quality of each variable; the cos2 scale in the key of this plot is explained below.

QUALITY OF REPRESENTATION

The quality of representation of the variables on the factor map is called cos2 (square cosine, squared coordinates).

A high cos2 indicates a good representation of the variable on the principal component. In this case the variable is positioned close to the circumference of the correlation circle.

A low cos2 indicates that the variable is not perfectly represented by the PCs. In this case the variable is close to the center of the circle.

##                       Dim.1       Dim.2       Dim.3        Dim.4
## rnkng_pnts_6mnths 0.5471309 0.004404894 0.286957200 3.368541e-02
## rnkng_pnts_9mnths 0.6405320 0.005233857 0.282645175 9.578639e-03
## rnkng_pnts_1yr    0.6874045 0.007812396 0.181098086 4.692751e-06
## rnkng_pnts_2yr    0.7401516 0.025509985 0.002203917 6.801925e-02
##                          Dim.5
## rnkng_pnts_6mnths 0.0070388868
## rnkng_pnts_9mnths 0.0031163722
## rnkng_pnts_1yr    0.0000293599
## rnkng_pnts_2yr    0.0342670488

A bar chart of the cos2 is probably a more user-friendly result:
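
Such a chart can be produced with fviz_cos2(), for example:

# total cos2 of the variables on the first two dimensions
fviz_cos2(res.pca1, choice = "var", axes = 1:2)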

CONTRIBUTIONS OF VARIABLES TO PCs

The contributions of variables in accounting for the variability in a given principal component are expressed in percentage.

Variables that are correlated with PC1 (i.e., Dim.1) and PC2 (i.e., Dim.2) are the most important in explaining the variability in the data set. Variables that are not correlated with any PC or correlated with the last dimensions are variables with low contribution and might be removed to simplify the overall analysis. The larger the value of the contribution, the more the variable contributes to the component.

##                       Dim.1     Dim.2      Dim.3        Dim.4       Dim.5
## rnkng_pnts_6mnths  9.931829 0.1380003 22.2232388 6.401110e+00 1.435542699
## rnkng_pnts_9mnths 11.627299 0.1639708 21.8892964 1.820192e+00 0.635567175
## rnkng_pnts_1yr    12.478155 0.2447535 14.0250393 8.917456e-04 0.005987792
## rnkng_pnts_2yr    13.435651 0.7991990  0.1706811 1.292544e+01 6.988578295

Once again, bar charts are helpful to visualise which variables contribute most to the variation. Here are the contributions of each variable to PC1.

Likewise the contributions to PC2

Now the total contribution to PC1 and PC2
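
These bar charts can be produced with fviz_contrib(), for example:

fviz_contrib(res.pca1, choice = "var", axes = 1)    # contributions to PC1
fviz_contrib(res.pca1, choice = "var", axes = 2)    # contributions to PC2
fviz_contrib(res.pca1, choice = "var", axes = 1:2)  # total contribution to PC1 and PC2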

Clearly these plots confirm the information shown in the correlation circle above: the ranking points are correlated with Dim 1 and tournament performance with Dim 2.

The red dashed line on the graphs above indicates the expected average contribution.

LINEAR REGRESSION TO PREDICT RANKING AT TEN YEAR MARK

We will now perform multiple linear regression to determine the contributions of different variables to rnkng_pnts_10yr, the outcome variable here.

The first model takes into account all possible variables, with ranking points at ten years as the outcome.

## 
## Call:
## lm(formula = rnkng_pnts_10yr ~ rnkng_pnts_6mnths + rnkng_pnts_9mnths + 
##     rnkng_pnts_1yr + rnkng_pnts_2yr + rnkng_pnts_3yr + rnkng_pnts_4yr + 
##     rnkng_pnts_5yr + gs_losses + gs_wins + atp_wins + atp_losses, 
##     data = ranking_variables_distinct)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3416.0   -49.6   -36.2    -3.5  6816.2 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       48.69587   33.81612   1.440 0.150274    
## rnkng_pnts_6mnths  6.98672    2.87544   2.430 0.015338 *  
## rnkng_pnts_9mnths  0.65027    2.80687   0.232 0.816855    
## rnkng_pnts_1yr    -5.08387    1.53472  -3.313 0.000968 ***
## rnkng_pnts_2yr     0.68122    0.32309   2.108 0.035317 *  
## rnkng_pnts_3yr    -0.66500    0.17652  -3.767 0.000178 ***
## rnkng_pnts_4yr     1.00740    0.12186   8.267 6.14e-16 ***
## rnkng_pnts_5yr     0.53293    0.08612   6.189 9.93e-10 ***
## gs_losses          3.88989    8.99951   0.432 0.665694    
## gs_wins           -1.25455    5.11875  -0.245 0.806454    
## atp_wins           0.38167    0.87349   0.437 0.662269    
## atp_losses        -1.08804    1.32196  -0.823 0.410738    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 569.1 on 759 degrees of freedom
##   (318 observations deleted due to missingness)
## Multiple R-squared:  0.5443, Adjusted R-squared:  0.5376 
## F-statistic:  82.4 on 11 and 759 DF,  p-value: < 2.2e-16

Ranking points at 9 months, along with the grand slam and overall tournament results, are clearly insignificant here, indicating that, as one might expect, ranking at the ten-year mark can only be loosely predicted, and mostly by rankings from around 5 or 6 years before it. The adjusted R-squared of this regression is a weak 0.5376, telling us that prediction will be unreliable at the best of times.

We will now remove the variables which don’t contribute to the prediction, and build a new model.

## 
## Call:
## lm(formula = rnkng_pnts_10yr ~ rnkng_pnts_1yr + rnkng_pnts_2yr + 
##     rnkng_pnts_3yr + rnkng_pnts_4yr + rnkng_pnts_5yr, data = ranking_variables_distinct)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3382.3   -49.1   -49.1   -49.1  6998.3 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    49.07364   17.33843   2.830 0.004736 ** 
## rnkng_pnts_1yr -2.03759    0.76367  -2.668 0.007741 ** 
## rnkng_pnts_2yr  0.56868    0.27292   2.084 0.037423 *  
## rnkng_pnts_3yr -0.55301    0.15042  -3.676 0.000248 ***
## rnkng_pnts_4yr  0.89055    0.10558   8.435  < 2e-16 ***
## rnkng_pnts_5yr  0.62342    0.07373   8.456  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 523.5 on 1083 degrees of freedom
## Multiple R-squared:  0.5146, Adjusted R-squared:  0.5124 
## F-statistic: 229.7 on 5 and 1083 DF,  p-value: < 2.2e-16

Interestingly, the model hardly changes, meaning that even though ranking points in a player's first 5 years on tour can be somewhat helpful in predicting where they will be at ten years, it is still largely anybody's guess.

It can, however, be seen that the later ranking points (those at 3, 4 and 5 years) are more significant than the earlier ones, so a final regression model is constructed taking only these into account.

## 
## Call:
## lm(formula = rnkng_pnts_10yr ~ rnkng_pnts_3yr + rnkng_pnts_4yr + 
##     rnkng_pnts_5yr, data = ranking_variables_distinct)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3414.6   -44.1   -44.1   -44.1  6993.2 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    44.13758   17.31243   2.549   0.0109 *  
## rnkng_pnts_3yr -0.46959    0.11983  -3.919 9.45e-05 ***
## rnkng_pnts_4yr  0.91851    0.10076   9.115  < 2e-16 ***
## rnkng_pnts_5yr  0.60105    0.07258   8.281 3.55e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 525.2 on 1085 degrees of freedom
## Multiple R-squared:  0.5107, Adjusted R-squared:  0.5093 
## F-statistic: 377.4 on 3 and 1085 DF,  p-value: < 2.2e-16

Once again, the R-squared value barely changes, with the adjusted R-squared now 0.5093.

CONCLUSIONS

You may be wondering why ranking points at 10 years was chosen as the outcome variable, given how variable it is. Surely career-high ranking or cumulative ranking points would have been more informative outcome variables. The simple answer is that there was insufficient data (not to mention time!) to derive those sorts of variables.

Overall, however, it is plain that predicting the chosen outcome variable was unreliable at best. Only the variables measured closest to the outcome could explain anything approaching 50% of its variation.

It is perhaps not surprising that sport in general is highly difficult to predict; if it could be predicted easily, there would be little interest in it. This simply highlights the exciting nature of tennis: matches can change drastically across a couple of sets, or even games or points, and there are so many variables which contribute to the outcome of any one match (location, court surface, player injury record or status, for example) that it is very hard to forecast results.

Throughout this analysis it became increasingly evident that one cannot simply trust the data as given, and the importance of conducting sanity checks throughout the analysis cannot be overestimated, both for detecting errors in one's own analysis and for noticing unrealistic values in the supplied data. For example, some of the data sets in the deuce package contained many missing values, making analysis unreliable, and sometimes contained multiple instances of the same player (up to 80 rows for an identical player).

Left-handed vs Right-handed players

Dataset

  • First, we use the atp_players data set and join it to atp_rankings with a left_join by player_id, to get each player's ranking data and handedness at the same time.
  • Then we drop the date variable from this data set because it is the same as ranking_date.
  • Finally, we filter the dates to 2015 through 2018 for our research; a sketch of these steps is shown below.
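
A minimal sketch of these steps, reusing the deuce tables from earlier (the hand column comes from atp_players; the exact column names in the joined table, e.g. date vs ranking_date, may differ from those in the report's code):

library(dplyr)
library(lubridate)

atp_hand <- atp_players %>%
  select(player_id, name, hand) %>%
  left_join(atp_rankings, by = "player_id") %>%
  mutate(year = year(date)) %>%
  filter(year >= 2015, year <= 2018)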

Average ranking of each player

  • Because each player's ranking period is not the same, we take the average ranking to analyse whether handedness has any effect on player ranking.
  • Because the full data set is large, we take only the latest four years, 2015 to 2018, and calculate the average ranking for each handedness type, giving a data set like the one sketched below.
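
A sketch of the averaging step (atp_htpall and AVE are the names used in the plot below; atp_hand is the table sketched above, so the exact code may differ):

atp_htpall <- atp_hand %>%
  group_by(name, hand, year) %>%
  summarise(AVE = mean(ranking)) %>%
  ungroup()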

Distribution of average ranking

  • We use a box plot to observe the distribution of players' average rankings for right-handed and left-handed players:
ggplot(atp_htpall,aes(x=hand, y=AVE)) + 
  geom_boxplot() +
  facet_wrap(~year)

  • We can see that the difference between the two handedness types is tiny.
  • The distribution for left-handed players looks slightly worse than for right-handed players, but this alone cannot prove the claim true or false.

Detailed analysis

  • For the detailed analysis, we create a subset to compare the rankings of left-handed and right-handed players.
  • In tennis, players' rankings are refreshed once a week based on their performance in matches.

After cleaning the data set, we found that the number of ranking weeks common to left-handed and right-handed players in 2018 is 33.

So we chose these 33 weeks as our ranking period and calculated the average rankings of left-handed and right-handed players in the latest year, 2018.

Comparison

ggplot(Sample18,aes(x=month, y=n, color=hand, group=hand, width=0.5)) +
  geom_line() 

  • We can see that the average ranking of left-handed players is much lower than that of right-handed players across the 33 weeks.
  • What about the trend in average ranking for left-handed and right-handed players from 2015 to 2018?

Among 2015-2018

  • Similarly, we built the common-ranking-weeks data set of average rankings for all left-handed and right-handed players from 2015 to 2018

We then calculated the average rankings of left-handed and right-handed players from 2015 to 2018 within their common ranking period.

Then we plot the data as a bar chart.

ggplot(Sampleall,aes(x=month,y=n,fill=hand)) +
 geom_bar(stat="identity", position=position_dodge())+
  facet_wrap(~year)+
  scale_x_continuous(breaks = seq(1,12,1))

  • It can be clearly seen that right-handed players' average rankings were better than left-handed players' in 2015 and 2016, while left-handed players' average rankings were better in 2017 and 2018.

What we learnt

  • Overall, there is no direct relationship between being right-handed or left-handed and ranking.
  • A left-hander's forehand naturally goes to a right-hander's backhand, which is largely a matter of habit.
  • The backhand is generally considered the weaker side for returning the ball, and that is exactly where a left-hander's forehand can apply its strength.
  • So there is a saying that left-handers have an advantage over right-handers, but it does not directly affect player rankings.

Winners and Losers

Tidy the data

  • For exploring topic 4, we use an ATP data file from the deuce package named atp_matches. This original data file has a lot of missing (NA) values, and not all of its variables are ones we want to focus on. Thus, we use the following code to tidy the original data file:

matches <- atp_matches %>%
  select(year, tourney_name, surface, winner_name, loser_name) %>%
  na.omit()

  • i.e. first we select the variables we want, then we exclude the NA values.

Count the number of wins and losses for each player
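
A sketch of how these counts could be obtained from the tidied matches table (the report's exact code is not shown):

library(dplyr)

wins   <- matches %>% count(winner_name, name = "wins")   %>% arrange(desc(wins))
losses <- matches %>% count(loser_name,  name = "losses") %>% arrange(desc(losses))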

  • Clearly, Roger Federer has the most wins, with 1218 match wins during 1997-2018.
  • Second is Jimmy Connors, who recorded 1197 match wins from 1970-1995.
  • Joao Sousa has the most losses, with 695 from 2014 to 2018.

Roger Federer’s Career growth

  • Looking at the two graphs above, Roger Federer started his career in 1997.
  • Federer was ranked in the top 100 in the world in 1998.
  • His tennis career grew steadily from 2000 and reached its peak in 2005/2006.
  • From 2008, his results slipped due to infectious mononucleosis, and he reached a trough in 2013 due to both illness and injury.
  • After 2014, his results show an up-and-down trend due to intermittent injuries.

Jimmy Connors’s Career growth

  • Looking at the two graphs above, Jimmy Connors started his career in 1970.
  • Between 1970 and 1974, his career showed significant growth.
  • After that, there was a gradual decline, and he finally retired in 1995.

Joao Sousa’s Career growth

  • Looking at the overall trend from 2004 to 2018, there is no doubt that Joao Sousa's professional career is in decline, since both his proportion of losses and his number of losses increase year by year.

THANKS FOR LISTENING! :)