Improving the Basketball Efficiency Index and Quantifying Player Value
Eric Xu, Timothy Tran, Rony Sulca, Ahmad Ahmadiyan
12/2/2016

Abstract

This report develops an improved efficiency index that measures NBA players' efficiency relative to other players in the league and ultimately attempts to quantify the value of each player with respect to their salary. We began by compiling and formatting the data from basketball-reference.com. The variables in our dataset range from quantitative variables such as points and rebounds to qualitative variables such as country of origin and position. We then conducted exploratory data analysis (EDA). This process began with calculating summary statistics and frequencies for each variable in our dataset; we then plotted the distributions of the quantitative variables and the frequencies of the qualitative variables. We used histograms, boxplots, and bar charts to represent these distributions and frequencies. Each plot gave valuable insights into its variable. We then sought a more heuristic and comprehensive measure of a player's skill than the existing efficiency rating, so we created an efficiency index using principal component analysis and calculated an efficiency value for each player. This way, we can compare players using multivariate analysis. We compared the players that our final efficiency values rate as good with the players that the NBA regards as good. Finally, we understand that not everyone who likes basketball is a statistician; some readers might want to understand this project but lack the background knowledge. Our solution is a Shiny app that displays data in scatterplots based on a user's selections in a straightforward interface. We believe that the app is simple to use and makes in-depth data analysis accessible to anyone.

Introduction

Basketball has long used a composite statistic called an efficiency rating (EFF) to represent the performance of players.
In theory, the EFF accounts equally for a player's offensive and defensive contributions through various player statistics, but because currently recorded statistics struggle to capture defensive performance, in practice the EFF tends to favor offensive players. Through our research, we aim to improve upon the existing EFF with our own efficiency index, calculated through principal component analysis, and compare the EFF with salary to calculate value ratings for each player.

Data

For our data, we used the latest statistics on basketball players' performance and profiles available on the Basketball Reference website. Our analysis focuses on players in the 2015-2016 regular season. We drew mainly on three tables for each of the 30 active NBA teams (Roster, Totals, and Salaries), which together contain 34 variables, 32 of which we incorporated in our analysis.

Methodology

Data Collection

Our primary data source was the Basketball Reference website, which we scraped directly. After fetching the URLs of each team, we parsed the HTML pages using the XML package. After reading the tables on each team's page, we identified the Roster, Totals, and Salaries tables, stored their contents as strings in data frames, and saved each data frame as a comma-separated values (csv) file. We organized these files into separate directories to simplify the later stages of data manipulation and analysis. Along the way, we named the columns consistently using the glossary available on the source website. Next, we merged the data for all teams into a single dataset, saved in a master csv file.

Data Cleaning

Due to the noisiness of the scraped data, we could not directly use it for our main analysis.
For cleaning purposes, we dropped unnecessary columns, including players' rankings and numbers, which were team specific and unrelated to the performance of individual players relative to others across the league. Most of the columns had been read as factors or characters, so we converted each to its proper class for later analysis. In particular, players' birth dates were converted to date format. Using R's grep family of pattern-matching functions, we reformatted the salary amounts to contain only numbers, and we converted heights from feet-and-inches notation to a single number of inches (see Figure 1). Some players had played on multiple teams in the 2015-2016 season and therefore appeared more than once in the aggregated data; we dropped these repeated entries using the duplicated() function.

Subset of clean dataset

## # A tibble: 10 × 7
##        X player          position height weight birth.date origin
##    <int> <chr>           <chr>     <dbl>  <int> <date>     <chr>
##  1     1 Kent Bazemore   SF           77    201 1989-07-01 us
##  2     2 Tim Hardaway    SG           78    205 1992-03-16 us
##  3     3 Kirk Hinrich    PG           76    190 1981-01-02 us
##  4     4 Al Horford      C            82    245 1986-06-03 do
##  5     5 Kris Humphries  PF           81    235 1985-02-06 us
##  6     6 Kyle Korver     SG           79    212 1981-03-17 us
##  7     7 Paul Millsap    PF           80    246 1985-02-10 us
##  8     8 Mike Muscala    C            83    240 1991-07-01 us
##  9     9 Lamar Patterson SG           77    225 1991-08-12 us
## 10    10 Dennis Schroder PG           73    172 1993-09-15 de

Figure 1. The first ten rows and seven columns of our clean dataset.

Exploratory Data Analysis

After the data was merged across tables, we conducted exploratory data analysis (EDA) to analyze the distribution of each variable individually. We began by calculating summary statistics. For each of our quantitative variables, we calculated the mean, minimum, first quartile, median, third quartile, maximum, standard deviation, and range.
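The eight summary statistics listed above can be sketched in a few lines. The report's analysis was carried out in R; the example below uses Python's standard library purely as an illustration, and the names summarize and columns are hypothetical. The two columns are the height and weight values from the ten rows shown in Figure 1. The "inclusive" quantile method is chosen on the assumption that the authors relied on the default quartile definition of R's quantile() function.

```python
import statistics


def summarize(values):
    """Return the eight summary statistics named in the text for one variable."""
    # method="inclusive" interpolates between order statistics over the observed
    # range (assumed here to match R's default quantile definition).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
        "range": max(values) - min(values),
    }


# Height and weight columns copied from the ten rows of Figure 1.
columns = {
    "height": [77, 78, 76, 82, 81, 79, 80, 83, 77, 73],
    "weight": [201, 205, 190, 245, 235, 212, 246, 240, 225, 172],
}

# Apply the same function to every quantitative column.
summary = {name: summarize(values) for name, values in columns.items()}
for name, stats in summary.items():
    print(name, stats)
```

Writing one function and mapping it over the columns keeps the set of statistics identical for every variable, which makes the resulting table easy to compare across variables.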
We computed these statistics by writing a function that calculates each value and then using the apply() function to run it across each quantitative variable in the clean data frame. For the qualitative variables, we used a for loop to go through each variable and print a table of each value with its corresponding tally and proportion, created using dplyr. We used the sink() function to save the output of these functions to eda-output.txt.

We then plotted graphs of the quantitative and qualitative variables. For each of the quantitative variables, e.g. height, salary, and turnovers, we plotted histograms and boxplots, using a for loop to go through each variable and generate the graphs. For the qualitative variables, we plotted bar charts of frequencies. We used ggsave() to export the plots to the images/ directory.

Efficiency Index

Comparing a player's points, rebounds, assists, and other statistics against those of other players is simple. However, when deciding whether to sign a player, a team cannot simply pick the person with the best shooting or the best rebounding skills: it needs the entire package. Looking at each statistic individually makes it difficult to see who is the better player overall. That is where our efficiency index comes into play. By performing a principal component analysis, we can see how players rank against each other overall, not just in individual statistics.

While efficiency can be calculated directly from raw statistics, doing so would not take into account the player's position on a team. For example, it is natural for a taller player such as a center to get more rebounds. Conversely, a smaller player such as a point guard would get more assists. To account for those differences, in this analysis we compare the efficiencies of players within their respective positions. To begin, we took the data from the Data Cleaning section and trimmed it down even more.
The statistics we considered were points, rebounds, assists, steals, blocks, missed field goals, missed free throws, and turnovers. This is not an exhaustive list of every component in basketball, but we believe these are the key variables that best discriminate between players' performance. We extracted the clean data and calculated missed shots by subtracting made shots from shots attempted. Next, we filtered the data by the five positions in professional basketball: Center, Power Forward, Small Forward, Shooting Guard, and Point Guard. After filtering, we had five small data frames that we used to calculate the principal components.

We used the prcomp() function in R to compute the principal components. We made sure to use scaled data, which is mean-centered and standardized, by setting the arguments "center" and "scale." to TRUE. We used the first principal component returned by prcomp() as the weights for our calculations. We then multiplied each statistic by its respective weight, taking the standard deviation into account: the final weight for a statistic is its first principal component loading divided by the standard deviation of that statistic. We did this for each player in each position and summed the weighted statistics to get the final efficiency value. It is important to note that missed field goals, missed free throws, and turnovers are all negative measures of performance in terms of winning a basketball game. To account for this, we applied a simple correction factor so that those statistics negatively affect a player's final efficiency.

Since we had collected salary data, we naturally became curious about the value of a player per dollar. We constructed a new column containing the ratio of a player's efficiency to their salary during that season. Finally, we bound all the small data frames together with the vector containing the efficiency indexes.
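The weighting scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the report used prcomp() in R, whereas this Python example finds the first principal component by power iteration on the correlation matrix of the standardized data. The player names, statistics, and salaries are invented, only four of the eight statistics are included to keep the example short, and the "correction factor" is assumed to be a sign flip on the weights of the negative statistics. All variable and function names are hypothetical.

```python
import math

# Hypothetical season totals for five players at one position (invented numbers).
STATS = ["points", "rebounds", "assists", "turnovers"]
NEGATIVE = {"turnovers"}  # statistics that should lower efficiency
players = {
    "Player A": [1500, 400, 300, 150],
    "Player B": [900, 600, 120, 90],
    "Player C": [600, 250, 450, 130],
    "Player D": [300, 150, 60, 40],
    "Player E": [1200, 700, 200, 160],
}
salaries = {  # hypothetical salaries in dollars
    "Player A": 15_000_000,
    "Player B": 8_000_000,
    "Player C": 5_000_000,
    "Player D": 1_000_000,
    "Player E": 12_000_000,
}

rows = list(players.values())
n, p = len(rows), len(STATS)

# Center and scale each column using the sample standard deviation (n - 1).
means = [sum(col) / n for col in zip(*rows)]
sds = [
    math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
    for col, m in zip(zip(*rows), means)
]
z = [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

# Covariance of the standardized data, i.e. the correlation matrix.
corr = [
    [sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
    for i in range(p)
]


def leading_eigenvector(matrix, iterations=500):
    """Power iteration: repeatedly multiply by the matrix and renormalize."""
    size = len(matrix)
    v = [1.0] * size
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


loadings = leading_eigenvector(corr)
# The sign of a principal component is arbitrary; orient it so points count positively.
if loadings[STATS.index("points")] < 0:
    loadings = [-x for x in loadings]

# Final weight = loading / standard deviation, negated for negative statistics
# (the sign flip is an assumed form of the correction factor).
weights = [
    (-1.0 if name in NEGATIVE else 1.0) * loading / sd
    for name, loading, sd in zip(STATS, loadings, sds)
]

# Efficiency is the weighted sum of a player's statistics; value is efficiency per dollar.
efficiency = {
    name: sum(w * x for w, x in zip(weights, stats))
    for name, stats in players.items()
}
value = {name: efficiency[name] / salaries[name] for name in players}

for name in players:
    print(f"{name}: efficiency={efficiency[name]:.3f}, per dollar={value[name]:.3e}")
```

Dividing each loading by the standard deviation puts the weights back on the scale of the raw statistics, so a statistic with a large numeric range such as points does not dominate the sum simply because of its units.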
This exercise helped greatly in our understanding of how PCA functions work and how well their results correspond to established rankings in basketball.

Shiny App

When creating the Shiny app, we opted for a sidebar panel layout. We placed the modifiable variables in a small side panel on the left of the app and the displayed output in the main panel, located at the center and right of the app.