OPTIMIZING NBA LINEUPS by Jason Spector
Item Type: text; Electronic Thesis
Authors: Spector, Jason
Publisher: The University of Arizona.
Rights: Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction, presentation (such as public display or performance) of protected items is prohibited except with permission of the author.
Download date: 03/10/2021 19:00:48
Link to Item: http://hdl.handle.net/10150/641741

Copyright © Jason Spector 2020

A Thesis Submitted to the Faculty of the
GRADUATE INTERDISCIPLINARY PROGRAM IN STATISTICS AND DATA SCIENCE
In Partial Fulfillment of the Requirements
For the Degree of
MASTER OF SCIENCE
In the Graduate College
THE UNIVERSITY OF ARIZONA
2020

THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE

As members of the Master’s Committee, we certify that we have read the thesis prepared by Jason Spector, titled Optimizing NBA Lineups and recommend that it be accepted as fulfilling the dissertation requirement for the Master’s Degree.

Joseph Watkins    Date: 05/28/2020
Edward Bedrick    Date: 05/29/2020
Jin Zhou    Date:

Final approval and acceptance of this thesis is contingent upon the candidate’s submission of the final copies of the thesis to the Graduate College.

I hereby certify that I have read this thesis prepared under my direction and recommend that it be accepted as fulfilling the Master’s requirement.

Joseph Watkins    Date: 05/28/2020
Master’s Thesis Committee Chair
GIDP in Statistics and Data Science

TABLE OF CONTENTS

List of Figures
List of Tables
Abstract
1 Introduction
  1.1 Background
  1.2 Data
  1.3 Methods
  1.4 Methods Detailed
2 Results
  2.1 Feature Selection
  2.2 Classification
3 Conclusion
Appendix
References

LIST OF FIGURES

Figure 1: Points Per Minutes Played Before Filtering
Figure 2: Points Per Minutes Played After Filtering
Figure 3: QQ Plot Points Per Minute
Figure 4: Neural Network Diagram
Figure 5: Dense Layers 1/16th Scale Diagram
Figure 6: Neural Network Training
Figure 7: Points Per Minutes Played Predictions
Figure 8: Points Per Minutes Player Neural Network Predictions
Figure 9: Random Forest Variable Importance
Figure 10: Neural Network Training Reduced Features
Figure 11: Neural Network Training Offensive Points Scored
Figure 12: Neural Network Training Defensive Points Allowed

LIST OF TABLES

Table 1: Summary of Data before Filtering
Table 2: Summary of Data after Filtering
Table 3: Prediction Distributions
Table 4: Simplified Prediction Distributions
Table 5: Confusion Matrix Offensive Points Scored
Table 6: Classification Scores for Offensive Points Scored
Table 7: Confusion Matrix Defensive Points Allowed
Table 8: Classification Scores for Defensive Points Allowed
Table 9: Best Subset Variable Forward Selection
Table 10: Variable Glossary

Abstract

The goal of every NBA coach is to put the best lineup on the floor against the given opposition. A coach will pick individual players to form a lineup from a variety of factors with the end goal of scoring more points than the opposing lineup. This paper aims to analyze and assess NBA lineup creation from individual statistics using various forms of machine learning. We started by web-scraping individual player statistics and five-man lineup data from basketballreference.com. Then the general box score and advanced statistics of the individual players were joined to the players in the lineup.
The lineup data was used to train a linear regression model, a random forest, a support vector machine, an extreme gradient boosted model, and a neural network. All models were evaluated on their mean absolute error, with the final goal of getting as close as possible to the points the lineup actually scored. None of the models produced a conclusive algorithm that accurately portrays lineup capabilities from individual statistics. This was because individual per game statistics do not hold enough information about how combinations of players might perform together, or about whether a player’s performance would be above or below their expected individual statistics. Furthermore, the models often overfit the data: they learned the patterns of the training data well but were not able to generalize. Because of this, our focus shifted to creating simpler models with fewer features. Using fewer features improved performance, but only slightly. Finally, we repeated the process with a classification of bad, average, or great offensive or defensive ability as the output, in hopes of at least being able to classify a good offensive or defensive lineup. However, both classification networks were only slightly better than random guessing.

Chapter 1
Introduction

1.1 Background

In an NBA game each team prepares a strategy that is designed to allow them to score more points than their opponent. The first part of that strategy is which players to play. With 15 players on a team and only 5 allowed on the floor, which of the 3003 possible combinations does a coach pick? Generally, we start with the best five players, but they cannot play the whole game. How do we know which group of five players will score the most points? Coaches could look at matchups against the opposing lineup or try balancing their obviously best players with small combinations of other players. Other basketball analytics enthusiasts have investigated lineup performance and player types.
Zigzaganalytics explored lineups by showing which factors of the game made the same lineup perform well in one game and poorly in another [1]. A paper submitted to the MIT Sloan conference by Kalman and Bosch created clusters of players and examined which combination of these clusters created the best net-rated lineup [2]. At the end of the day, a successful lineup is one that either hinders the opponent from scoring or scores more points. Since defensive statistics are hard to measure, this paper focuses on points scored, with the goal of using the individual statistics of specific players to estimate the points scored by the lineup those players form.

1.2 Data

The data used was scraped from basketballreference, LLC. 5-man lineup data [3], individual player statistics (basic and advanced) [4], NBA regular season standings [5], current rosters [6], and schedule results for all teams [7] from the current 2019-2020 season were all scraped and put into data frames. The 5-man lineup data was game by game data. If a lineup was used in more than one game, as a starting lineup would be, then more than one row with that lineup can be found in the data. The individual statistics were per game statistics in order to keep consistency between the lineup data and individual data. The standings data was used for the win percentages of the teams, to give an idea of the abilities of the teams. The win percentages were those at the time of the season suspension in the 2019-2020 season.

1.3 Methods

The goal of the procedure is to predict actual lineup production in real games and to mimic a coach’s decision making process for creating a lineup based on individual statistics and how long a lineup will play. Modeling lineup points from lineup data is not particularly difficult: one can simply look at the field goals made values. Modeling lineup points from individual data, where the only connection is that the data belongs to the same individual, proved to be much more difficult.

Summary:
1. Clean data and conduct an exploratory analysis.
2. Create feature matrix and implement machine learning models for linear regression, random forest, support vector machine, extreme gradient boost, and neural network to predict points per minute against an opponent.
3. Compare the models’ ability to accurately assess a lineup’s propensity to score points against a particular team.
4. Analyze feature importance in the models and refit simplified models.
5. Compare regression output with a classification output.

1.4 Methods Detailed

We began by exploring the data set for problems that could cause difficulties in implementing any machine learning procedure. Missing data was common for players who did not participate in some aspect of the game. For example, players who had never shot a three pointer had a blank data point for three point field goals made (3P), three point field goals attempted (3PA), and 3P percent instead of having a value of zero or not applicable. These data points were replaced with zeros.

Other problems occurred with lineups that were put into games for only a short period of time. Per minute data for lineups that played small periods in a game, and only a few times in the season, created noisy predictions for the variable points per minute. Playing once for two minutes and scoring zero points does not imply that the lineup will never score in a similar situation. Table 1 shows that an NBA team plays an average of 15.46 lineups per game with an average of 3.14 minutes per lineup. These lineups score anywhere from zero to 40 points per minute, so the problems with per minute data are already apparent. Figure 1 shows the distribution of minutes played and points per minute. The outliers are clearly from lineups that played few minutes. Lastly, lineups in the data set played an average of 6.24 games of the roughly 65 games per team that had been played before the season was suspended.
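The cleaning steps described above can be sketched as follows. This is a minimal illustration in Python, not the code used in the thesis. The field names (`points`, `minutes`, `lineup_id`), the sample rows, and the two-minute cutoff `MIN_MINUTES` are all assumptions made for the example, since the exact filtering rule is not stated at this point in the text.

```python
# Hypothetical field names and cutoff; illustrative only, not taken from the thesis.
THREE_POINT_FIELDS = ("3P", "3PA", "3P%")
MIN_MINUTES = 2.0  # assumed cutoff for dropping very short lineup stints


def fill_missing_three_point_stats(player):
    """Return a copy of a player record with blank three-point fields set to 0."""
    cleaned = dict(player)
    for field in THREE_POINT_FIELDS:
        if cleaned.get(field) in (None, ""):
            cleaned[field] = 0.0
    return cleaned


def points_per_minute(lineup):
    """Points scored by a lineup divided by the minutes it played in that game."""
    return lineup["points"] / lineup["minutes"]


def filter_short_stints(lineups, min_minutes=MIN_MINUTES):
    """Keep only lineup rows that played at least `min_minutes` in the game."""
    return [row for row in lineups if row["minutes"] >= min_minutes]


# Made-up rows, used only to exercise the functions above.
players = [
    {"name": "Player A", "3P": 1.5, "3PA": 4.0, "3P%": 0.375},
    {"name": "Player B", "3P": None, "3PA": None, "3P%": ""},
]
lineups = [
    {"lineup_id": "L1", "minutes": 0.5, "points": 5},  # 10 points per minute: a noisy outlier
    {"lineup_id": "L2", "minutes": 8.0, "points": 18},
    {"lineup_id": "L3", "minutes": 2.0, "points": 0},
]

cleaned_players = [fill_missing_three_point_stats(p) for p in players]
kept = filter_short_stints(lineups)
rates = [points_per_minute(row) for row in kept]
```

In this toy data the half-minute stint produces a rate of 10 points per minute, which is the kind of outlier the text attributes to lineups that played few minutes. Filtering on minutes played removes it while keeping the two-minute scoreless stint, which remains a legitimate but uninformative observation.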
Table 1: Summary of Data before Filtering
        Number Lineups per Game    Minutes Played by Lineups    Points per Minute    Games Played by Lineup
Min.