
Improving the Efficiency Index and Quantifying Player Value

Eric Xu, Timothy Tran, Rony Sulca, Ahmad Ahmadiyan

12/2/2016

Abstract

This report creates an improved efficiency index to measure NBA players’ efficiency relative to other players in the league, and ultimately attempts to quantify the value of each player with respect to their salary. We began by compiling and formatting data from basketball-reference.com. The variables in our dataset range from quantitative variables such as points and rebounds to qualitative variables such as country of origin and position. We then conducted exploratory data analysis (EDA). This process began with calculating summary statistics and frequencies for each variable in our dataset, and continued with plotting the distributions of the quantitative variables and the frequencies of the qualitative variables. We used histograms, boxplots, and bar charts to represent these distributions and frequencies. Each plot gave valuable insight into its variable. We then sought a more comprehensive measure of a player’s skill than the existing efficiency rating, so we created an efficiency index using principal component analysis and calculated an efficiency value for each player. This lets us compare players using multivariate analysis. We compared the players our final efficiency values rate highly with the players the NBA regards as good. Finally, we understand that not everyone who likes basketball is a statistician; they might want to understand this project but lack the background knowledge. Our solution is a shiny app that displays data in scatterplots based on a user’s selections in a straightforward interface. We believe the app is simple to use and makes in-depth data analysis accessible to anyone.

Introduction

Basketball has long used a composite statistic called an efficiency rating (EFF) to represent the performance of players. In theory, the EFF accounts equally for a player’s offensive and defensive contributions through various player statistics, but because currently recorded statistics struggle to capture defensive performance, in practice the EFF tends to favor offensive players. Through our research, we aim to improve upon the existing EFF with our own efficiency index, calculated through principal component analysis, and to compare that index with salary to calculate value ratings for each player.

Data

For our data, we used the latest statistics on basketball players’ performance and profiles available on the Basketball Reference website. Our analysis focuses on players in the 2015-2016 regular season. We mainly used three tables for each of the 30 active NBA teams: Roster, Totals, and Salaries. Together these contain 34 different variables, 32 of which have been incorporated in our analysis.

Methodology

Data Collection

Our primary data was scraped directly from the Basketball Reference website. After fetching the URLs of each team, we parsed the HTML pages using the XML package. From the tables on each team’s page, we identified the Roster, Totals, and Salary tables, read their contents as strings into data frames, and saved them as comma-separated values (csv) files. We organized the files into distinct directories to simplify the later stages of data manipulation and analysis. Along the way, we named the columns consistently using the glossary available on the source website. Finally, we merged the data for each team into a single dataset, saved in a master csv file.

Data Cleaning

Because the scraped data was noisy, we could not use it directly for our main analysis. To clean it, we dropped unnecessary columns, including the players’ rankings and numbers, which are team specific and unrelated to the performance of individual players relative to others across the league. Most columns had been read as factors or characters, so we converted each to its proper class for later analysis. Players’ birth dates in particular were converted to date format. Using R’s regular-expression functions (the grep family), we reduced the salary amounts to plain numbers and converted heights from feet-and-inches notation to a single numeric value (inches, as shown in Figure 1). Some players had played on multiple teams in the 2015-2016 season and appeared repeatedly in the aggregated data; these repeated rows were dropped using the duplicated() function.
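The cleaning steps described above can be sketched in base R as follows. This is a minimal illustration, not the project’s actual script: the data frame `raw`, its column names, the "6-5" height format, and every value in it are assumptions invented for the example, and heights are converted to inches to match the units shown in Figure 1.

```r
# Hypothetical raw rows shaped like scraped roster and salary data.
# All names and values are invented for illustration.
raw <- data.frame(
  player     = c("Player A", "Player B", "Player A"),
  height     = c("6-5", "7-0", "6-5"),
  salary     = c("$2,000,000", "$15,500,000", "$2,000,000"),
  birth.date = c("1989-07-01", "1992-03-16", "1989-07-01"),
  stringsAsFactors = FALSE
)

# Keep only the digits of the salary string, then convert to numeric.
raw$salary <- as.numeric(gsub("[^0-9]", "", raw$salary))

# Convert "feet-inches" strings to total inches.
parts <- strsplit(raw$height, "-", fixed = TRUE)
raw$height <- sapply(parts, function(p) as.numeric(p[1]) * 12 + as.numeric(p[2]))

# Convert birth dates from character to Date.
raw$birth.date <- as.Date(raw$birth.date)

# Drop players who appear more than once (e.g. after a mid-season trade).
clean <- raw[!duplicated(raw$player), ]
print(clean)
```

Deduplicating on the player name alone keeps the first row for each player; whether the first row is the right one to keep depends on how the team tables were stacked, which the sketch does not try to settle.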

Subset of clean dataset

## # A tibble: 10 × 7
##      X  player           position  height  weight  birth.date  origin
##  1   1  Kent Bazemore    SF            77     201  1989-07-01  us
##  2   2  Tim Hardaway     SG            78     205  1992-03-16  us
##  3   3  Kirk Hinrich     PG            76     190  1981-01-02  us
##  4   4  Al Horford       C             82     245  1986-06-03  do
##  5   5  Kris Humphries   PF            81     235  1985-02-06  us
##  6   6  Kyle Korver      SG            79     212  1981-03-17  us
##  7   7  Paul Millsap     PF            80     246  1985-02-10  us
##  8   8  Mike Muscala     C             83     240  1991-07-01  us
##  9   9  Lamar Patterson  SG            77     225  1991-08-12  us
## 10  10  Dennis Schroder  PG            73     172  1993-09-15  de

Figure 1. The first ten rows and seven columns of our clean dataset.

Exploratory Data Analysis

After the data was merged across tables, we conducted exploratory data analysis (EDA) to analyze the distribution of each variable individually. We began by calculating summary statistics. For each of our quantitative variables, we calculated the mean, minimum, first quartile, median, third quartile, maximum, standard deviation, and range. We did this by writing a function that calculates each of these values and then using the apply() function to run it across each quantitative variable in the clean data frame.

For the qualitative variables, we used a for loop to go through each variable and print a table of each value with its corresponding tally and proportion, created using dplyr. We used the sink() function to save the output of these functions to eda-output.txt. We then plotted graphs of the quantitative and qualitative variables. For each of the quantitative variables, e.g. height, salary, and turnovers, we plotted histograms and boxplots, using a for loop to go through each variable and generate the graphs. For the qualitative variables, we plotted bar charts of frequencies. We used ggsave() to export the plots to the images/ directory.
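A minimal sketch of the summary step might look like the following. It is an illustration under assumptions: the data frame `players` and its values are made up, the helper name `summary_stats` is hypothetical, and base R’s table() stands in for the dplyr tallies the report describes.

```r
# Hypothetical stand-in for the clean dataset (values invented).
players <- data.frame(
  height   = c(77, 78, 76, 82, 81),
  weight   = c(201, 205, 190, 245, 235),
  position = c("SF", "SG", "PG", "C", "PF"),
  stringsAsFactors = FALSE
)

# One function that returns every summary value for a numeric vector.
summary_stats <- function(x) {
  c(mean   = mean(x),
    min    = min(x),
    q1     = unname(quantile(x, 0.25)),
    median = median(x),
    q3     = unname(quantile(x, 0.75)),
    max    = max(x),
    sd     = sd(x),
    range  = max(x) - min(x))
}

# Apply the function to each quantitative column (one column per variable).
quant_vars <- c("height", "weight")
stats <- apply(players[, quant_vars], 2, summary_stats)
print(stats)

# Tally and proportion for one qualitative variable.
counts <- table(players$position)
freq <- data.frame(
  value = names(counts),
  count = as.vector(counts),
  prop  = as.vector(counts) / sum(counts)
)
print(freq)
```

Wrapping the print() calls between sink("eda-output.txt") and sink() would send the same output to a file, as the report describes.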

Efficiency Index

Comparing a player’s points, rebounds, assists, and other statistics against other players is simple. However, when deciding to contract a player, a team cannot simply pick the person with the best shooting or best rebounding skills: they need the entire package, and looking at each statistic individually makes it difficult to see who is the better player overall. That is where our efficiency index comes into play. By performing a principal component analysis, we can see how players rank against each other overall, not just in individual stats.

While efficiency can be calculated from raw statistics, that would not take into account the player’s position on a team. For example, it is natural for a taller player like a center to get more rebounds. Conversely, a smaller player such as a guard would get more assists. To account for those differences, in this analysis we compare efficiencies of players within their respective positions.

To begin, we took the data from the Data Cleaning section and trimmed it down further. The statistics we considered were points, rebounds, assists, steals, blocks, missed field goals, missed free throws, and turnovers. This is not an exhaustive list of every component in basketball, but we believe these are the variables that best discriminate between players’ performance. We calculated missed shots by subtracting made shots from shots attempted. Next, we filtered the data by the five positions in professional basketball: Center, Power Forward, Small Forward, Shooting Guard, and Point Guard. After filtering, we had five small data frames that we used to calculate the principal components.

We used the prcomp() function in R to compute the principal components. We made sure to use scaled data, which is mean-centered and standardized. This can be achieved by setting the arguments “center” and “scale.” to TRUE. We used the loadings of the first principal component as weights for our calculations. We then multiplied each statistic by its respective weight, taking the standard deviation into account: the final weight is the first principal component’s loading for a statistic divided by that statistic’s standard deviation. We did this for each player in each position and summed the weighted values to get the final efficiency value. It is important to note that missed field goals, missed free throws, and turnovers are all negative measures of performance in terms of winning a basketball game. To account for this, we applied a simple correction factor so that those statistics would negatively affect a player’s final efficiency.

Since we had collected salary data, we naturally became curious about the value of a player per dollar. We constructed a new column containing the ratio of a player’s efficiency to their salary during that season. Finally, we bound all the small data frames together with the vector containing the efficiency indexes. This helped greatly in our understanding of how PCA functions work and how well their results correspond to established rankings in basketball.
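The calculation for a single position can be sketched as below. This is a hedged illustration, not the authors’ code: the data frame `centers`, its column names, and all numbers are invented, and the sign-forcing step is only one assumed way to implement the “correction factor”, whose exact form the report does not give.

```r
# Hypothetical season totals for eight players at one position (values invented).
centers <- data.frame(
  points    = c(1200, 800, 450, 1500, 300, 950, 620, 1100),
  rebounds  = c(700, 650, 300, 900, 250, 500, 410, 560),
  assists   = c(120, 90, 40, 200, 30, 110, 75, 150),
  steals    = c(50, 40, 20, 70, 15, 45, 30, 60),
  blocks    = c(90, 110, 30, 120, 25, 60, 55, 40),
  missed_fg = c(500, 350, 220, 640, 160, 420, 300, 480),
  missed_ft = c(100, 120, 40, 150, 35, 80, 70, 60),
  turnovers = c(150, 110, 60, 210, 45, 130, 95, 140),
  salary    = c(12, 8, 1.5, 20, 1, 6, 3, 10) * 1e6
)
stat_cols <- setdiff(names(centers), "salary")
stats <- centers[, stat_cols]

# PCA on mean-centered, standardized data.
pca <- prcomp(stats, center = TRUE, scale. = TRUE)

# Weight = first-component loading divided by the statistic's standard deviation.
weights <- pca$rotation[, "PC1"] / pca$scale

# Assumed correction: force negative statistics to count against a player
# and positive statistics to count in favor, whatever sign PCA returns.
negative <- c("missed_fg", "missed_ft", "turnovers")
signs <- ifelse(names(weights) %in% negative, -1, 1)
weights <- signs * abs(weights)

# Efficiency = weighted sum of a player's statistics; value = efficiency per dollar.
centers$efficiency <- as.vector(as.matrix(stats) %*% weights)
centers$value <- centers$efficiency / centers$salary
print(centers[, c("efficiency", "value")])
```

Forcing the signs also sidesteps the fact that the sign of a principal component is arbitrary; running the same steps on each of the five position data frames and binding the results with rbind() would reproduce the overall structure described above.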

Shiny App

When creating the shiny app, we opted for a sidebar panel layout. We placed the modifiable variables on the left of the app in a small side panel, and we put the displayed output on the main panel, located at the center and right of the app. Once this layout was chosen, we began adding variables for the user to manipulate.

Figure 2. Screenshot of our shiny app, showing the correlation between points scored and salary with players color coded by position.

For the app describing the players’ statistics, we began by adding the variables that will be depicted on the two axes of our scatterplot. Those variables are obtained from the EFF breakdown performed in the previous step. We then added an option to choose whether to color code the plot and, if so, which color palette to use. To aid the visualization, we also added a numeric input that sets the size of the points in the plot. Furthermore, to show the concentration of points in any specific area, we included a slider that controls the depth of the points’ color. This way, the user can lower the color depth and see where in the plot the points cluster the most. As indicated by the instructions, we also included, under the plot, the correlation coefficient between the two variables, so the user can immediately see how strongly one variable is related to the other. Note that correlation coefficients apply only to continuous values; as a result, when displaying data about the players’ positions (discrete values), no correlation coefficient is given. We also decided to allow our app to display the plot for other data files, provided the files contain columns with the same names as the default file. That is, if another group does a similar project with different numbers, our app can display those results as well. The user need only upload their file using our widget.

For the app describing the teams’ statistics, we displayed a horizontal bar graph whose y-axis is always the team names (given by their initials). The variable on the x-axis (the actual data) can be changed with our first widget. The statistics available for this axis come from the team salaries file obtained in previous steps. The data can be displayed in either ascending or descending order, depending on the user’s choice, which is made with a radio button. When choosing additional variables for the user to manipulate, we began by adding a slider that sets the width of the bars in the plot, so the user can choose a width that helps them better visualize our results. We also added color coding to the bar plot. The variable used for the color coding is total payroll by default, but we added a selection widget so that the user can choose which variable to base the color coding on. We continued by adding a “show legend” option, which is always helpful in apps like these. Finally, to give the user total control over the color coding, just like the previous app, we added two numeric widgets that set the colors (identified by number) used for the two extremes of the color range. That is, the user can choose the color that represents the lowest value and the color that represents the highest value.

Note: we sprinkled short lines of text in both of our apps to briefly explain what some widgets or plots describe.
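For the correlation shown under the player scatterplot, the rule of reporting a coefficient only for continuous variables could be handled with a small helper like the one below. The function name `safe_cor` and the demo data are hypothetical; this is a sketch of the idea, not the app’s actual server code.

```r
# Hypothetical helper: Pearson correlation when both columns are numeric,
# NA when either one is a discrete (non-numeric) variable such as position.
safe_cor <- function(data, x, y) {
  if (is.numeric(data[[x]]) && is.numeric(data[[y]])) {
    cor(data[[x]], data[[y]], use = "complete.obs")
  } else {
    NA_real_
  }
}

# Invented demo data; salary is an exact linear function of points here.
demo <- data.frame(
  points   = c(300, 600, 900, 1200),
  salary   = c(1e6, 2e6, 3e6, 4e6),
  position = c("C", "PG", "SF", "SG"),
  stringsAsFactors = FALSE
)

print(safe_cor(demo, "points", "salary"))    # both numeric: a coefficient is returned
print(safe_cor(demo, "position", "salary"))  # discrete variable: NA
```

In a shiny app, a function like this would typically be called inside a render function with the two user-selected column names, and the output left blank when the result is NA.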

Results

Exploratory Data Analysis

Using the output of our EDA summary statistics and plots, we began by analyzing our quantitative variables. Unsurprisingly, basketball players are quite tall, with a median height of 6’ 7”. The standard deviation is small at about 3.5”. The median weight is 220 lbs with a standard deviation of 26.5 lbs. Salary has a distribution skewed highly to the right, with the mean salary of $4.7M far higher than the median salary of $2.6M. 75% of players have a salary between $1.1M and $6.3M, with a large cluster of outliers earning above $15M. The age distribution of players is skewed slightly right, with a mean age of 26.6 years compared to a median age of 26 years.

Figure 3. Salary Distribution of Teams (x-axis: Salary (USD)). This boxplot shows that the distribution of salaries is skewed heavily to the right, with a significant number of outliers earning salaries much higher than the median.

Looking at years of experience in the NBA, the distribution is skewed to the right. The middle 75% of players have between 2 and 8 years of experience, while the top 25% ranges from 8 to 20 years of experience. Regarding the qualitative variables, we noted that positions are evenly distributed, with the standard deviation of the proportions of each position equal to only 0.97%. The most common position, at 21.4% of all players, was the power forward, and the least common was the center, at 18.7% of all players. For countries of origin, the most common by far was the USA, with 78.3% of all players from there. Four other countries had above a 1% share of all players: Canada, Britain, France, and Australia, in descending order. The remaining 36 countries each have fewer than 5 players in the NBA.

Figure 4. Number of Players in Each Position (x-axis: Position; y-axis: Number of Players). This bar chart shows the similar proportion of players in each position.

For college data, 18.3% of our players have no recorded college information. Of the remainder, the top five schools are University of Kentucky, Duke University, University of Kansas, UCLA, and University of North Carolina. No college has an overwhelming proportion of representation in the NBA. Regarding player counts per team, most teams have a similar number of players on the roster, with a median value of 15. The Memphis Grizzlies have the greatest number of players at 23.

Efficiency Index Values

The results for efficiency were interesting, to say the least. Before beginning the efficiency analysis, we acknowledged that defensive players like Kawhi Leonard might not have a high rating because there is no single defensive “stat.” Naturally, we assumed that strong offensive players or triple-double players like LeBron James would be favored in this analysis. From the results, that was not the case. Overall, the efficiency indexes seemed to favor “low variance” players. By low variance, we do not mean consistent players. For example, one player who seemed to have the best balance of stats ended up near the middle of the pack in the efficiency column. By low variance, we mean that the analysis seemed to favor players whose statistics fell in similar ranges across the factors. Kobe Bryant, for example, was above average in points scored, rebounds, and assists but dominated none of those categories that season, and the PCA calculation rewarded that aspect of his performance. On the other hand, Stephen Curry crushed the points scored category but hovered around average in rebounds, and the calculation put him in the bottom 25 percent of players, although he was the first unanimous MVP for that regular season.

Another thing to note is that the analysis seemed to especially punish missed shots. James Harden, who scored the most points that season, ended up with nearly the worst efficiency. This differs from the variance cases because his rebounding and assist categories were still very good. The best explanation is that missed shots and turnovers were really punishing to players like James Harden and Damian Lillard, both of whom had great scores in points, rebounding, and assists but were also at the top of missed shots and turnovers.

Another interesting trend is that the analysis appeared to prefer larger players overall, specifically those who were decent scorers and decent rebounders. Possibly the system rewards the fact that larger players tend to play closer to the basket, resulting in fewer missed shots. Larger players also usually pass less than other players, likely reducing their overall turnover rate.

Overall, we think the efficiency analysis is good but could be improved. Players seem to be punished too hard for missing shots; a possible solution is to take shooting percentage into account. Naturally, the players who take more shots will end up with a higher miss total. Instead, we believe players should be punished for bad percentages, because in a sense that is shooting efficiency. A similar thing could be said for turnovers: possible alternatives are an assist-to-turnover ratio or a ratio of turnovers per period of time controlling the ball. This could help players who run the offense for their team. We like that the analysis favors balanced players; however, as a result, specialists seemed to be penalized. For example, Bismack Biyombo had a great rebounding season but did not score much, and this gave him a poor efficiency rating. However, it is obvious he is on the court to rebound, so we think there could be a way to scale the weights to help players like this. Ultimately, we stand by our final analysis given the categories as constraints, because it seems to be only a couple of tweaks from being close to optimal.

Finally, the value ratings calculated using our efficiency analysis and salary statistics poorly measure a player’s overall value to a team. The salary distribution of players in the NBA is heavily skewed right and varies greatly, with a standard deviation of $5.2M. Because the season salaries of star players in the NBA are so high, their value ratings were very low, and the highest value ratings went to players who contributed little throughout the season and were paid the least.
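To illustrate why raw miss totals penalize high-volume shooters, the following sketch compares two hypothetical players by missed shots, field goal percentage, and assist-to-turnover ratio. All names and numbers are invented for the example; the ratios are the alternatives suggested above, not statistics the index currently uses.

```r
# Two invented players: one takes many shots, one takes few.
shooters <- data.frame(
  player = c("High volume", "Low volume"),
  fga    = c(1600, 400),  # field goals attempted
  fgm    = c(800, 160),   # field goals made
  ast    = c(500, 60),    # assists
  tov    = c(250, 60),    # turnovers
  stringsAsFactors = FALSE
)

# Raw miss total: the high-volume shooter looks far worse (800 vs 240 misses).
shooters$missed <- shooters$fga - shooters$fgm

# Shooting percentage: the same player is actually the better shooter (0.50 vs 0.40).
shooters$fg_pct <- shooters$fgm / shooters$fga

# Assist-to-turnover ratio: rewards ball handlers who create more than they give away.
shooters$ast_tov <- shooters$ast / shooters$tov

print(shooters)
```

Replacing the raw miss and turnover totals with ratios like these would stop the index from penalizing players simply for handling the ball more often.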

Trends Illustrated in our Shiny App

One interesting result that our shiny app on teams’ statistics demonstrated was the positive correlation between Total Payroll and Maximum Salary. When displaying the bar plot with Total Payroll as our axis and Maximum Salary as our color coding variable, we can see that the teams with the highest maximum salary tend to be the teams with the highest Total Payroll, as expected. Also, when color coding by Minimum Salary, we can see that although the highest payroll teams have high minimum salaries, some high payroll teams have very low minimum salaries. This is expected, since the minimum salary a team pays to one player says little about how much the team pays its most valuable athletes. The average salary can also be seen to depend strongly on total payroll, as one would hypothesize.

Figure 5. This screenshot of our shiny app shows the total payrolls of teams, with bars colored by median salary, and shows that some high total payroll teams still have low median salaries (shown in red).

Surprisingly, the median salary sheds light on aspects of the teams’ statistics that we might not have guessed previously. By color coding by the median salary, we noticed that some high payroll teams have a very low median salary. This shows that these high paying teams have a salary distribution skewed heavily to the right, a trend reflected in our EDA. For a high payroll team like the Cleveland Cavaliers to have a low median salary, most of its players must be paid a lower salary. So, for the team to reach the high total payroll group, it must pay its most valuable athletes a lot of money. This plot therefore shows us which teams have a very wide gap between the typical player’s salary and the most valuable players’ salaries. We can see that LAC doesn’t have as wide a gap: it has the highest total payroll while keeping its median salary very high.
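The contrast between a high payroll and a low median salary can be reproduced with a small sketch. The two teams and their salaries below are invented, and the column names are assumptions; the real team salaries file may be organized differently.

```r
# Invented salaries: team AAA pays two stars and three minimum contracts,
# team BBB spreads a similar payroll evenly.
salaries <- data.frame(
  team   = rep(c("AAA", "BBB"), each = 5),
  salary = c(25, 22, 1, 1, 1, 10, 10, 10, 10, 9) * 1e6,
  stringsAsFactors = FALSE
)

# Per-team payroll statistics, one row per team.
team_stats <- do.call(rbind, lapply(
  split(salaries$salary, salaries$team),
  function(s) data.frame(total = sum(s), mean = mean(s), median = median(s),
                         min = min(s), max = max(s))
))

# A large mean-median gap signals a right-skewed salary distribution.
team_stats$gap <- team_stats$mean - team_stats$median
print(team_stats)
```

Both teams have nearly the same total payroll, yet AAA’s median is a tenth of BBB’s, which is the pattern the color coding by median salary makes visible.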

Conclusion

Our efficiency index builds a good foundation for improving upon the existing EFF. It could be made more accurate by taking into account additional statistics, such as shooting percentage, as well as custom weights assigned by position so that players such as Biyombo, a dedicated rebounder, receive a higher efficiency rating. Overall, our efficiency index succeeds at balancing the contributions of offensive and defensive players, valuing low variance players most highly.

Our shiny app also demonstrates some clear relationships between salary and player performance. High salary players do not necessarily all perform better than median salary players. Additionally, there are a significant number of very highly paid players, suggesting that factors besides performance on the court influence the amount paid to each player.

A question raised during the calculation of the value rating was whether we could capture the value of players in terms of their contribution to the team’s revenue in addition to their performance on the court. We believe this would more accurately represent the monetary value of both star players and low salary players. Our value rating could be improved by accounting for the celebrity of each player, given that a player’s popularity and star value increase attendance at games, purchases of merchandise, and overall revenue for the teams. This would require data beyond the scope of our analysis and would be a good next step for research.
