Team members: Michael Roberts, Min Liu, Nicholas Hovens, Zhexu Zhang
atp_players1 <- atp_players %>%
  select(name, player_id, country_code) %>%
  filter(country_code == "AUS")
atp_rankings1 <- atp_rankings %>%
  select(ranking, player_id, date) %>%
  filter(ranking <= 100)
atp_data <- atp_players1 %>%
  left_join(atp_rankings1, by = "player_id") %>%
  mutate(year = year(date),
         day = day(date),
         month = month(date, label = TRUE, abbr = TRUE))
current_atprankings <- fetch_atp_rankings("2019-05-20", min_rank = 1, max_rank = 100) %>%
  select(player, rank, date) %>%
  rename(c("player" = "name", "rank" = "ranking"))
currentatp_data <- atp_players1 %>%
  full_join(current_atprankings, by = "name") %>%
  na.omit()  # drop players without a current ranking
wta_players1 <- wta_players %>%
  select(name, player_id, country_code) %>%
  filter(country_code == "AUS")
wta_rankings1 <- wta_rankings %>%
  select(ranking, player_id, date) %>%
  filter(ranking <= 100)
wta_data <- wta_players1 %>%
  left_join(wta_rankings1, by = "player_id") %>%
  mutate(year = year(date),
         day = day(date),
         month = month(date, label = TRUE, abbr = TRUE))
wta_ausrankings <- wta_data %>%
  select(name, ranking, year, day, month) %>%
  filter(year > 2000, month == "Jan", day < 8)
current_wtarankings <- fetch_wta_rankings(min_rank = 1, max_rank = 100) %>%
  select(player, nation, rank, date) %>%
  filter(nation == "Australia")
* As of right now, Australia has 5 players ranked in the WTA top 100
* This is the most Australia has had in the WTA top 100 since 2012
* This could mark the beginning of another rise of Australians in the WTA top 100
* The highest-ranked Australian right now is Ash Barty, who is ranked 8th
In this section, we attempt to discover whether an ATP player's ranking can be predicted with any degree of reliability by looking at their ranking in the early days of their professional career and their overall performance in ATP tournaments, with a special focus on grand slams. This section continues to use the deuce package, and the first task is to obtain a profile of each player, with their rankings and some other personal information, such as date of birth. This is accomplished through the atp_rankings data set, which is joined to the atp_players table, a second table containing a more fleshed-out version of each player's name and date of birth. The result of this join is displayed below:
We would now like to split players' rankings into their ATP ranking points after the first 3, 6 and 9 months, then 1 through 5 years, of their arrival on the ATP tour. This is done by taking the ranking table formed previously, grouping by player, and filtering out all dates outside the time period of interest (so to determine ATP ranking points after 6 months, we select the last recorded value of ATP ranking points after filtering out dates more than 6 months after the player's first recorded date). This results in a table for each time point, which are then joined together to create a single time breakdown of the progression of these players' ATP ranking points.
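A sketch of one of these snapshots, the 6-month one, assuming the rankings table has `player_id`, `date` and a points column here called `ranking_points` (the actual column name in deuce may differ):

```r
library(dplyr)
library(lubridate)

six_months <- atp_rankings %>%
  group_by(player_id) %>%
  filter(date <= min(date) + months(6)) %>%  # keep only the first 6 months on tour
  filter(date == max(date)) %>%              # last recorded date in that window
  summarise(rnkng_pnts_6mnths = last(ranking_points))
```

The same pattern, with a different offset, yields the 9-month, 1-year, ..., 5-year tables, which are then joined by player.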
How players perform in the glamorous and exciting grand slams over the years may well prove to be a predictor of ranking points after a solid amount of time competing at the top level (10 years, hopefully). Below we calculate the number of wins and losses recorded for each player in grand slams. It is not, however, a good idea to split this variable into the same time frames as the ranking variables above, as some players may not have played a grand slam at all in their first year, or even their first 2 or 3 years on tour. Therefore, we take this observation after 4 years.
This is found by taking the table provided by the deuce package, atp_matches (containing point by point breakdown statistics of many grand slam matches over the course of history), and joining this data set with that describing ATP ranking points after 4 years. The date in this second table provides the cutoff date past which we should not consider grand slam results. A snippet of the code to produce this overall data set is shown below:
Separate tables are created for wins and losses in grand slams, and finally they are joined to create the final data set.
## ---- Grand slam results after 4 years ----
atp_matches <- rename(atp_matches, replace = c("winner_name" = "name"))
tournaments_4yr <- join(atp_matches, four_year, by = "name") %>%
  select(c(tourney_name, tourney_level, name, winner_rank, winner_rank_points,
           loser_name, loser_rank, loser_rank_points, tourney_start_date, date_4yr, rnkng_pnts_4yr))
## wins: each row's `name` is the match winner
gs_4yr <- tournaments_4yr %>% filter(tourney_level == "Grand Slams") %>%
  group_by(name) %>% filter(tourney_start_date <= date_4yr)
gs_4yr_wins <- count(gs_4yr$name) %>% rename(c("x" = "name", "freq" = "gs_wins"))
gs_4yr_wins <- join(gs_4yr_wins, four_year, by = "name") %>%
  separate(date_4yr, into = c("year_4yr", "month_4yr", "day_4yr")) %>% select(name, gs_wins)
## losses: count appearances as `loser_name`
gs_4yr <- tournaments_4yr %>% filter(tourney_level == "Grand Slams") %>%
  group_by(loser_name) %>% filter(tourney_start_date <= date_4yr)
gs_4yr_losses <- count(gs_4yr$loser_name) %>% rename(c("x" = "name", "freq" = "gs_losses"))
gs_4yr_losses <- join(gs_4yr_losses, four_year, by = "name") %>%
  separate(date_4yr, into = c("year_4yr", "month_4yr", "day_4yr")) %>% select(name, gs_losses)
## combine wins and losses into a single table
gs_4yr_wlr <- join(gs_4yr_losses, gs_4yr_wins, by = "name")
head(gs_4yr_wlr)
In tennis, there is a fair number of players who may do well in tournaments other than grand slams, but who repeatedly fail to bring it together on the big stage (Alexander Zverev and Nick Kyrgios are arguably two such players, to name a few). We thought it would be good to include some reference to these other tournaments as indicators of performance on the ATP tour.
Luckily this information was available in the atp_matches table once more, allowing us to cover all professional-level tournaments, including grand slams. This table was again joined to the four-year ranking table, for the same reason as in the previous step (obtaining grand slam performance). Four years was chosen as the cutoff again, to make the wins and losses obtained here comparable to those of the grand slam data.
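The calculation would have mirrored the grand slam step; a sketch, reusing tournaments_4yr from above, without the tourney_level filter, and assuming plyr's count(), rename() and join() as elsewhere in the code:

```r
## all professional tournaments (grand slams included) before the 4-year cutoff
atp_4yr    <- tournaments_4yr %>% filter(tourney_start_date <= date_4yr)
## count appearances as winner and as loser, then join into one table
atp_wins   <- count(atp_4yr$name)       %>% rename(c("x" = "name", "freq" = "atp_wins"))
atp_losses <- count(atp_4yr$loser_name) %>% rename(c("x" = "name", "freq" = "atp_losses"))
atp        <- join(atp_losses, atp_wins, by = "name")
```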
## 'data.frame': 14729 obs. of 10 variables:
## $ name : chr "A Benson" "A Escofet" "A Hall" "A Macdonald" ...
## $ rnkng_pnts_3mnths: chr "0" "0" "0" "1" ...
## $ rnkng_pnts_6mnths: chr "0" "0" "0" "1" ...
## $ rnkng_pnts_9mnths: chr "0" "0" "0" "1" ...
## $ rnkng_pnts_1yr : chr "0" "0" "0" "1" ...
## $ rnkng_pnts_2yr : chr "0" "0" "0" "1" ...
## $ rnkng_pnts_3yr : chr "0" "0" "0" "1" ...
## $ rnkng_pnts_4yr : chr "0" "0" "0" "1" ...
## $ rnkng_pnts_5yr : chr "0" "0" "0" "1" ...
## $ rnkng_pnts_10yr : chr "0" "0" "0" "1" ...
Having obtained this data, we could now join the ATP tournament, grand slam and ranking statistics, combining them all in one table, shown below.
## join ATP tournament performance to the ranking data
ranking_variables_distinct <- merge(ranking_variables, atp, by = "name")
## this join must happen BEFORE adjusting for duplicate names
ranking_variables_distinct <- merge(ranking_variables_distinct, gs_4yr_wlr, by = "name")
head(ranking_variables_distinct)
The data set contained some duplicate rows because some players had identical names! This was dealt with by adjusting a player's name each time it came up, adding an incrementing number to the end of it, e.g. William Brown 1, William Brown 2 and so on, whenever there were multiple instances of William Brown.
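One way to implement this in base R (a sketch, not the exact code used in the report):

```r
## append an incrementing index to every instance of a repeated name,
## e.g. "William Brown 1", "William Brown 2", ...
idx  <- ave(seq_len(nrow(ranking_variables_distinct)),
            ranking_variables_distinct$name, FUN = seq_along)  # position within each name group
size <- ave(seq_len(nrow(ranking_variables_distinct)),
            ranking_variables_distinct$name, FUN = length)     # how many rows share the name
ranking_variables_distinct$name <- ifelse(
  size > 1,
  paste(ranking_variables_distinct$name, idx),
  ranking_variables_distinct$name
)
```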
## 'data.frame': 1089 obs. of 14 variables:
## $ name : chr "Aaron Krickstein" "Abe Segal" "Adrian Mannarino" "Adrian Marcu" ...
## $ rnkng_pnts_3mnths: chr "0" "0" "4" "0" ...
## $ rnkng_pnts_6mnths: chr "0" "0" "4" "0" ...
## $ rnkng_pnts_9mnths: chr "0" "0" "7" "0" ...
## $ rnkng_pnts_1yr : chr "0" "0" "5" "0" ...
## $ rnkng_pnts_2yr : chr "0" "0" "60" "0" ...
## $ rnkng_pnts_3yr : chr "0" "0" "82" "0" ...
## $ rnkng_pnts_4yr : chr "0" "0" "340" "0" ...
## $ rnkng_pnts_5yr : chr "0" "0" "391" "0" ...
## $ rnkng_pnts_10yr : chr "0" "0" "729" "22" ...
## $ atp_wins : int 122 2 3 2 17 245 3 4 3 27 ...
## $ atp_losses : int 47 6 6 7 30 158 14 4 9 9 ...
## $ gs_losses : int 5 5 2 1 4 22 1 1 1 1 ...
## $ gs_wins : int 15 2 NA NA 5 44 1 NA NA 3 ...
In this section we conduct principal component analysis on all the variables in this data set. This will allow us to identify which variables are intercorrelated, and which of them are most important in explaining variation among them. The variables will be separated into principal components which will highlight the most important information in the data set.
The first step here is to take our data set, remove the name column, and format all the other variables as numeric, a key requirement for PCA.
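A minimal sketch of this preparation step, assuming the table from the previous section:

```r
## drop the name column, then coerce every remaining column to numeric
pca_input <- ranking_variables_distinct[, setdiff(names(ranking_variables_distinct), "name")]
pca_input[] <- lapply(pca_input, function(x) as.numeric(as.character(x)))
str(pca_input)
```

Going via as.character() first avoids silently converting factor levels to their integer codes.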
## 'data.frame': 1089 obs. of 13 variables:
## $ rnkng_pnts_3mnths: num 0 0 4 0 2 0 0 0 0 0 ...
## $ rnkng_pnts_6mnths: num 0 0 4 0 2 0 57 0 0 0 ...
## $ rnkng_pnts_9mnths: num 0 0 7 0 28 0 57 0 0 0 ...
## $ rnkng_pnts_1yr : num 0 0 5 0 57 0 48 0 0 0 ...
## $ rnkng_pnts_2yr : num 0 0 60 0 50 0 156 0 0 0 ...
## $ rnkng_pnts_3yr : num 0 0 82 0 150 0 260 0 0 458 ...
## $ rnkng_pnts_4yr : num 0 0 340 0 724 0 281 0 0 261 ...
## $ rnkng_pnts_5yr : num 0 0 391 0 885 0 611 74 0 42 ...
## $ rnkng_pnts_10yr : num 0 0 729 22 410 0 547 12 0 1 ...
## $ atp_wins : num 122 2 3 2 17 245 3 4 3 27 ...
## $ atp_losses : num 47 6 6 7 30 158 14 4 9 9 ...
## $ gs_losses : num 5 5 2 1 4 22 1 1 1 1 ...
## $ gs_wins : num 15 2 NA NA 5 44 1 NA NA 3 ...
## 'data.frame': 1089 obs. of 12 variables:
## $ rnkng_pnts_6mnths: num 0 0 4 0 2 0 57 0 0 0 ...
## $ rnkng_pnts_9mnths: num 0 0 7 0 28 0 57 0 0 0 ...
## $ rnkng_pnts_1yr : num 0 0 5 0 57 0 48 0 0 0 ...
## $ rnkng_pnts_2yr : num 0 0 60 0 50 0 156 0 0 0 ...
## $ rnkng_pnts_3yr : num 0 0 82 0 150 0 260 0 0 458 ...
## $ rnkng_pnts_4yr : num 0 0 340 0 724 0 281 0 0 261 ...
## $ rnkng_pnts_5yr : num 0 0 391 0 885 0 611 74 0 42 ...
## $ rnkng_pnts_10yr : num 0 0 729 22 410 0 547 12 0 1 ...
## $ atp_wins : num 122 2 3 2 17 245 3 4 3 27 ...
## $ atp_losses : num 47 6 6 7 30 158 14 4 9 9 ...
## $ gs_losses : num 5 5 2 1 4 22 1 1 1 1 ...
## $ gs_wins : num 15 2 NA NA 5 44 1 NA NA 3 ...
In principal component analysis, variables are often scaled (i.e. standardized). This is a good idea when variables are measured on different scales (e.g. kilograms, kilometres, centimetres, ...); otherwise, the PCA outputs obtained will be severely affected.
The goal here is to make the variables comparable. Generally, variables are scaled to have: i) standard deviation one and ii) mean zero. The function PCA() from the FactoMineR package can be used.
This standardization to a common scale prevents some variables from becoming dominant just because of their large measurement units.
The R code below computes principal component analysis on the active individuals/variables:
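For illustration, the same standardisation can be done manually with base R's scale(); PCA() performs it by default via scale.unit = TRUE, so this is not needed before the call below. X here stands for the all-numeric PCA input:

```r
## manual standardisation: centre to mean 0, scale to standard deviation 1
X_scaled <- scale(X, center = TRUE, scale = TRUE)
round(colMeans(X_scaled), 10)  # all approximately 0
apply(X_scaled, 2, sd)         # all exactly 1
```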
library("FactoMineR")
res.pca1 <- PCA(ranking_variables_distinct, graph = FALSE)
print(res.pca1)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 1089 individuals, described by 12 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
We will use the factoextra R package to help in the interpretation of PCA.
Eigenvalues / Variances
The eigenvalues measure the amount of variation retained by each principal component in the data set. Eigenvalues are large for the first PCs and small for the subsequent PCs; that is, the first PCs correspond to the directions with the maximum amount of variation in the data set.
We examine the eigenvalues to determine the number of principal components to be considered. The eigenvalues and the proportion of variances (i.e., information) retained by the principal components (PCs) can be extracted using the function get_eigenvalue() from the factoextra package.
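The eigenvalue table below is obtained with a call of this form:

```r
library("factoextra")
eig.val <- get_eigenvalue(res.pca1)  # eigenvalue, % variance, cumulative % per dimension
eig.val
```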
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 5.50886318 45.9071932 45.90719
## Dim.2 3.19194422 26.5995352 72.50673
## Dim.3 1.29124833 10.7604028 83.26713
## Dim.4 0.52624329 4.3853607 87.65249
## Dim.5 0.49032933 4.0860777 91.73857
## Dim.6 0.27878221 2.3231851 94.06175
## Dim.7 0.22789740 1.8991450 95.96090
## Dim.8 0.13433485 1.1194571 97.08036
## Dim.9 0.12205968 1.0171640 98.09752
## Dim.10 0.09557539 0.7964616 98.89398
## Dim.11 0.07987972 0.6656643 99.55965
## Dim.12 0.05284239 0.4403533 100.00000
Eigenvalues can be used to determine the number of principal components to retain after PCA (Kaiser 1961):
An eigenvalue > 1 indicates that a PC accounts for more variance than one of the original variables does in standardized data. This is commonly used as a cutoff point for deciding which PCs are retained; it holds only when the data are standardized.
You can also limit the number of components to however many account for a certain fraction of the total variance. For example, as can be seen from the eigenvalues above, we might choose to stop at dimension 7, since up to that point roughly 96% of the variation is explained.
An alternative method of determining the number of principal components is to look at a scree plot, which plots the eigenvalues ordered from largest to smallest. The number of components is determined at the point beyond which the remaining eigenvalues are all relatively small and of comparable size (Jollife 2002; Peres-Neto, Jackson, and Somers 2005).
The scree plot can be produced using the function fviz_eig() or fviz_screeplot() from the factoextra package.
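For example (addlabels and the y-axis limit are optional cosmetic choices):

```r
## scree plot of the percentage of variance explained per dimension
fviz_eig(res.pca1, addlabels = TRUE, ylim = c(0, 50))
```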
From the plot above, we might choose to stop at the seventh principal component: roughly 96% of the information (variance) contained in the data is retained by the first seven principal components.
To extract the results for variables from a PCA output we can use the function get_pca_var() (factoextra package). This function provides a list of matrices containing all the results for the active variables (coordinates, correlation between variables and axes, squared cosine and contributions)
The components of the get_pca_var() can be used in the plot of our variables as follows:
* var$coord: coordinates of variables, used to create a scatter plot.
* var$cos2: represents the quality of representation of the variables on the factor map. It is calculated as the squared coordinates: var.cos2 = var.coord * var.coord.
* var$contrib: contains the contributions (in percentage) of the variables to the principal components. The contribution of a variable (var) to a given principal component is, in percentage: (var.cos2 * 100) / (total cos2 of the component).
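The results printed below come from calls of this form:

```r
var <- get_pca_var(res.pca1)
var                  # overview of the available matrices
head(var$coord)      # coordinates of the variables
head(var$cos2)       # quality of representation
head(var$contrib)    # contributions in percent
```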
## Principal Component Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the variables"
## 2 "$cor" "Correlations between variables and dimensions"
## 3 "$cos2" "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
## Dim.1 Dim.2 Dim.3 Dim.4
## rnkng_pnts_6mnths 0.7396830 0.06636937 0.5356839 0.183535865
## rnkng_pnts_9mnths 0.8003324 0.07234540 0.5316438 0.097870521
## rnkng_pnts_1yr 0.8290986 0.08838776 0.4255562 -0.002166276
## rnkng_pnts_2yr 0.8603207 0.15971845 -0.0469459 -0.260805001
## rnkng_pnts_3yr 0.8709076 0.14336455 -0.1810676 -0.268405180
## rnkng_pnts_4yr 0.8767204 0.13948107 -0.3413019 -0.027754990
## Dim.5
## rnkng_pnts_6mnths -0.083898074
## rnkng_pnts_9mnths -0.055824477
## rnkng_pnts_1yr 0.005418478
## rnkng_pnts_2yr 0.185113611
## rnkng_pnts_3yr 0.166862814
## rnkng_pnts_4yr 0.061171122
## Dim.1 Dim.2 Dim.3 Dim.4
## rnkng_pnts_6mnths 0.5471309 0.004404894 0.286957200 3.368541e-02
## rnkng_pnts_9mnths 0.6405320 0.005233857 0.282645175 9.578639e-03
## rnkng_pnts_1yr 0.6874045 0.007812396 0.181098086 4.692751e-06
## rnkng_pnts_2yr 0.7401516 0.025509985 0.002203917 6.801925e-02
## rnkng_pnts_3yr 0.7584800 0.020553394 0.032785471 7.204134e-02
## rnkng_pnts_4yr 0.7686386 0.019454968 0.116487020 7.703395e-04
## Dim.5
## rnkng_pnts_6mnths 0.0070388868
## rnkng_pnts_9mnths 0.0031163722
## rnkng_pnts_1yr 0.0000293599
## rnkng_pnts_2yr 0.0342670488
## rnkng_pnts_3yr 0.0278431986
## rnkng_pnts_4yr 0.0037419062
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## rnkng_pnts_6mnths 9.931829 0.1380003 22.2232388 6.401110e+00 1.435542699
## rnkng_pnts_9mnths 11.627299 0.1639708 21.8892964 1.820192e+00 0.635567175
## rnkng_pnts_1yr 12.478155 0.2447535 14.0250393 8.917456e-04 0.005987792
## rnkng_pnts_2yr 13.435651 0.7991990 0.1706811 1.292544e+01 6.988578295
## rnkng_pnts_3yr 13.768358 0.6439146 2.5390523 1.368974e+01 5.678468970
## rnkng_pnts_4yr 13.952762 0.6095021 9.0212716 1.463847e-01 0.763141420
The correlation between a variable and a principal component (PC) is used as the coordinates of the variable on the PC. The representation of variables differs from the plot of the observations: the observations are represented by their projections, while the variables are represented by their correlations (Abdi and Williams 2010). Colouring by cos2 values shows each variable's quality on the factor map.
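A correlation circle coloured by cos2 can be drawn with fviz_pca_var(); the gradient colours here are an arbitrary choice:

```r
fviz_pca_var(res.pca1, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)  # repel avoids overlapping variable labels
```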
The plot above is also known as a variable correlation plot. It shows the relationships between all variables and can be interpreted as follows:
Positively correlated variables are grouped together. We can see that atp_wins and gs_wins are correlated, as are atp_losses and gs_losses. This makes sense, as players who do well in most ATP tournaments will tend to also perform in the grand slams, although there are some outliers as mentioned previously. Negatively correlated variables are positioned on opposite sides of the plot origin (opposed quadrants). Interestingly, ranking points and tournament performance are not correlated in this data. This is somewhat counterintuitive, as one would suspect that good tournament performance would result in better rankings. However, some players do well in tournaments which don't offer a high number of ranking points, which would explain why tournament performance and ranking points are not really correlated here.
The distance between a variable and the origin measures the quality of that variable on the factor map: variables far from the origin are well represented. Due to the cluster of variables on this correlation plot, it is hard to see the quality of each variable. The cos2 in the key of this plot is explained below.
The quality of representation of the variables on the factor map is called cos2 (square cosine, squared coordinates).
A high cos2 indicates a good representation of the variable on the principal component; in this case the variable is positioned close to the circumference of the correlation circle.
A low cos2 indicates that the variable is not perfectly represented by the PCs; in this case the variable is close to the centre of the circle.
## Dim.1 Dim.2 Dim.3 Dim.4
## rnkng_pnts_6mnths 0.5471309 0.004404894 0.286957200 3.368541e-02
## rnkng_pnts_9mnths 0.6405320 0.005233857 0.282645175 9.578639e-03
## rnkng_pnts_1yr 0.6874045 0.007812396 0.181098086 4.692751e-06
## rnkng_pnts_2yr 0.7401516 0.025509985 0.002203917 6.801925e-02
## Dim.5
## rnkng_pnts_6mnths 0.0070388868
## rnkng_pnts_9mnths 0.0031163722
## rnkng_pnts_1yr 0.0000293599
## rnkng_pnts_2yr 0.0342670488
A bar chart of the cos2 is probably a more user-friendly result:
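Such a bar chart can be produced with fviz_cos2() from factoextra:

```r
## total cos2 of each variable on the first two dimensions
fviz_cos2(res.pca1, choice = "var", axes = 1:2)
```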
Contributions of variables to PCs
The contributions of variables in accounting for the variability in a given principal component are expressed as percentages.
Variables that are correlated with PC1 (i.e., Dim.1) and PC2 (i.e., Dim.2) are the most important in explaining the variability in the data set. Variables that are not correlated with any PC or correlated with the last dimensions are variables with low contribution and might be removed to simplify the overall analysis. The larger the value of the contribution, the more the variable contributes to the component.
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## rnkng_pnts_6mnths 9.931829 0.1380003 22.2232388 6.401110e+00 1.435542699
## rnkng_pnts_9mnths 11.627299 0.1639708 21.8892964 1.820192e+00 0.635567175
## rnkng_pnts_1yr 12.478155 0.2447535 14.0250393 8.917456e-04 0.005987792
## rnkng_pnts_2yr 13.435651 0.7991990 0.1706811 1.292544e+01 6.988578295
Once again, bar charts are helpful for visualising which variables contribute most to the variation. Here are the contributions of each variable to PC1.
Likewise, the contributions to PC2.
Now the total contribution to PC1 and PC2.
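These contribution plots can be produced with fviz_contrib():

```r
fviz_contrib(res.pca1, choice = "var", axes = 1)    # contributions to PC1
fviz_contrib(res.pca1, choice = "var", axes = 2)    # contributions to PC2
fviz_contrib(res.pca1, choice = "var", axes = 1:2)  # combined PC1 + PC2
```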
Clearly these plots confirm the information in the correlation circle above: the ranking points are correlated with Dim.1 and tournament performance with Dim.2.
The red dashed line on the graphs above indicates the expected average contribution.
We will now perform multiple linear regression to determine the contributions of different variables to rnkng_pnts_10yr, the outcome variable here.
The first model takes into account all possible variables, with ranking points at ten years as the outcome.
##
## Call:
## lm(formula = rnkng_pnts_10yr ~ rnkng_pnts_6mnths + rnkng_pnts_9mnths +
## rnkng_pnts_1yr + rnkng_pnts_2yr + rnkng_pnts_3yr + rnkng_pnts_4yr +
## rnkng_pnts_5yr + gs_losses + gs_wins + atp_wins + atp_losses,
## data = ranking_variables_distinct)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3416.0 -49.6 -36.2 -3.5 6816.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.69587 33.81612 1.440 0.150274
## rnkng_pnts_6mnths 6.98672 2.87544 2.430 0.015338 *
## rnkng_pnts_9mnths 0.65027 2.80687 0.232 0.816855
## rnkng_pnts_1yr -5.08387 1.53472 -3.313 0.000968 ***
## rnkng_pnts_2yr 0.68122 0.32309 2.108 0.035317 *
## rnkng_pnts_3yr -0.66500 0.17652 -3.767 0.000178 ***
## rnkng_pnts_4yr 1.00740 0.12186 8.267 6.14e-16 ***
## rnkng_pnts_5yr 0.53293 0.08612 6.189 9.93e-10 ***
## gs_losses 3.88989 8.99951 0.432 0.665694
## gs_wins -1.25455 5.11875 -0.245 0.806454
## atp_wins 0.38167 0.87349 0.437 0.662269
## atp_losses -1.08804 1.32196 -0.823 0.410738
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 569.1 on 759 degrees of freedom
## (318 observations deleted due to missingness)
## Multiple R-squared: 0.5443, Adjusted R-squared: 0.5376
## F-statistic: 82.4 on 11 and 759 DF, p-value: < 2.2e-16
Clearly, ranking points at 9 months and the tournament win/loss variables are insignificant here, indicating that, as one might expect, ranking can only be loosely predicted from rankings around 5 or 6 years prior to the date of interest. The adjusted R-squared of this regression is a weak 0.5376, telling us that prediction will be unreliable at the best of times.
We will now remove the variables which don’t contribute to the prediction, and build a new model.
##
## Call:
## lm(formula = rnkng_pnts_10yr ~ rnkng_pnts_1yr + rnkng_pnts_2yr +
## rnkng_pnts_3yr + rnkng_pnts_4yr + rnkng_pnts_5yr, data = ranking_variables_distinct)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3382.3 -49.1 -49.1 -49.1 6998.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.07364 17.33843 2.830 0.004736 **
## rnkng_pnts_1yr -2.03759 0.76367 -2.668 0.007741 **
## rnkng_pnts_2yr 0.56868 0.27292 2.084 0.037423 *
## rnkng_pnts_3yr -0.55301 0.15042 -3.676 0.000248 ***
## rnkng_pnts_4yr 0.89055 0.10558 8.435 < 2e-16 ***
## rnkng_pnts_5yr 0.62342 0.07373 8.456 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 523.5 on 1083 degrees of freedom
## Multiple R-squared: 0.5146, Adjusted R-squared: 0.5124
## F-statistic: 229.7 on 5 and 1083 DF, p-value: < 2.2e-16
Interestingly, the model hardly changes (the adjusted R-squared in fact drops slightly, to 0.5124), meaning that even though ranking points in a player's first 5 years on tour are somewhat helpful in predicting where they will be at ten years, it is still anybody's guess.
It can however be seen that the later ranking points (those at 3, 4 and 5 years) are more significant than earlier ones, so a final regression model will be constructed, taking only these into account.
##
## Call:
## lm(formula = rnkng_pnts_10yr ~ rnkng_pnts_3yr + rnkng_pnts_4yr +
## rnkng_pnts_5yr, data = ranking_variables_distinct)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3414.6 -44.1 -44.1 -44.1 6993.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.13758 17.31243 2.549 0.0109 *
## rnkng_pnts_3yr -0.46959 0.11983 -3.919 9.45e-05 ***
## rnkng_pnts_4yr 0.91851 0.10076 9.115 < 2e-16 ***
## rnkng_pnts_5yr 0.60105 0.07258 8.281 3.55e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 525.2 on 1085 degrees of freedom
## Multiple R-squared: 0.5107, Adjusted R-squared: 0.5093
## F-statistic: 377.4 on 3 and 1085 DF, p-value: < 2.2e-16
Once again, the adjusted R-squared barely changes, falling slightly to 0.5093.
You may be wondering why ranking points at 10 years was chosen as an outcome variable, given how variable it is. Surely career-high ranking or cumulative ranking points would have been more informative outcome variables. The simple answer is that there was insufficient data (not to mention time) to derive those sorts of variables.
Overall, however, it is plain that predicting the chosen outcome variable is unreliable at best. Even the best models here, using only the variables closest in time to the outcome, explain only around 50% of the variance in ranking points at ten years.
It is perhaps not surprising that sport in general is highly difficult to predict; if it could be predicted easily, there would be no interest in it. This simply highlights the exciting nature of tennis: matches can change drastically across a couple of sets, or even games or points, and so many variables contribute to the outcome of any one match (location, court surface, or a player's injury record or status, for example) that it is very hard to forecast results.
Throughout this analysis it became increasingly evident that one cannot simply trust the data as given, and the importance of conducting sanity checks throughout the analysis cannot be overstated, both for detecting errors in one's own analysis and for noticing unrealistic readings in the supplied data. For example, some of the data sets in the deuce package contained many missing values, making analysis unreliable, and sometimes contained multiple rows for the same player (up to 80 instances of identical players).
Do left-handed players have an advantage over right-handed players in the ATP across the four years from 2015 to 2018?
There is a saying that left-handed people have an advantage over right-handed people. Is it true or not?
ggplot(atp_htpall,aes(x=hand, y=AVE)) +
geom_boxplot() +
facet_wrap(~year)
* We can see that the difference between the two handedness types is tiny.
* The distribution for left-handed players is slightly worse than for right-handed players, but this alone cannot prove the claim true or false.
After cleaning the data set, we found that the number of ranking weeks in 2018 common to left-handed and right-handed players is 33.
So we chose 33 weeks as our ranking period and calculated the average ranking for left-handed and right-handed players in the latest year, 2018.
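A sketch of that calculation, assuming a hypothetical player-week table weekly18 with columns hand and ranking (the real object and column names may differ):

```r
## average ranking per handedness over the common 33-week period
weekly18 %>%
  group_by(hand) %>%
  summarise(AVE = mean(ranking, na.rm = TRUE))
```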
ggplot(Sample18,aes(x=month, y=n, color=hand, group=hand, width=0.5)) +
geom_line()
We then obtained the average ranking for left-handed and right-handed players from 2015 to 2017 over their ranking periods.
Then we plot the data by bar chart.
ggplot(Sampleall,aes(x=month,y=n,fill=hand)) +
geom_bar(stat="identity", position=position_dodge())+
facet_wrap(~year)+
scale_x_continuous(breaks = seq(1,12,1))
matches %>%
  select(year, tourney_name, surface, winner_name, loser_name) %>%
  na.omit()