
OPTIMIZING NBA LINEUPS

by

Jason Spector

Copyright © Jason Spector 2020

A Thesis Submitted to the Faculty of the

GRADUATE INTERDISCIPLINARY PROGRAM

IN STATISTICS AND DATA SCIENCE

In Partial Fulfillment of the Requirements

For the Degree of

MASTER OF SCIENCE

In the Graduate College

THE UNIVERSITY OF ARIZONA

2020


THE UNIVERSITY OF ARIZONA GRADUATE COLLEGE

As members of the Master's Committee, we certify that we have read the thesis prepared by Jason Spector, titled Optimizing NBA Lineups, and recommend that it be accepted as fulfilling the dissertation requirement for the Master's Degree.

Date: 05/28/2020    Joseph Watkins

Date: 05/29/2020    Edward Bedrick

Date: ______    Jin Zhou

Final approval and acceptance of this thesis is contingent on the candidate's submission of the final copies of the thesis to the Graduate College.

I hereby certify that I have read this thesis prepared under my direction and recommend that it be accepted as fulfilling the Master's requirement.

Date: 05/28/2020    Joseph Watkins, Master's Thesis Committee Chair, GIDP in Statistics and Data Science


TABLE OF CONTENTS

List of Figures
List of Tables
Abstract
1 Introduction
  1.1 Background
  1.2 Data
  1.3 Methods
  1.4 Methods Detailed
2 Results
  2.1 Feature Selection
  2.2 Classification
3 Conclusion
Appendix
References

LIST OF FIGURES

Figure 1: Points Per Minutes Played Before Filtering
Figure 2: Points Per Minutes Played After Filtering
Figure 3: QQ Plot Points Per Minute
Figure 4: Neural Network Diagram
Figure 5: Dense Layers 1/16th Scale Diagram
Figure 6: Neural Network Training
Figure 7: Points Per Minutes Played Predictions
Figure 8: Points Per Minutes Played Neural Network Predictions
Figure 9: Random Forest Variable Importance
Figure 10: Neural Network Training Reduced Features
Figure 11: Neural Network Training Offensive Points Scored
Figure 12: Neural Network Training Defensive Points Allowed

LIST OF TABLES

Table 1: Summary of Data before Filtering
Table 2: Summary of Data after Filtering
Table 3: Prediction Distributions
Table 4: Simplified Prediction Distributions
Table 5: Confusion Matrix Offensive Points Scored
Table 6: Classification Scores for Offensive Points Scored
Table 7: Confusion Matrix Defensive Points Allowed
Table 8: Classification Scores for Defensive Points Allowed
Table 9: Best Subset Variable Forward Selection
Table 10: Variable Glossary


Abstract

The goal of every NBA coach is to put the best lineup on the floor against the given opposition. A coach picks individual players to form a lineup based on a variety of factors, with the end goal of scoring more points than the opposing lineup. This paper aims to analyze and assess NBA lineup creation from individual statistics using various forms of machine learning. We started by web-scraping individual player statistics and five-man lineup data from basketball-reference.com. The general box score and advanced statistics of the individual players were then joined to the players in each lineup. The lineup data was used to train a linear regression model, a random forest, a support vector machine, an extreme gradient boosted model, and a neural network. All models were evaluated on their mean absolute error, with the goal of predicting as close as possible to the points the lineup actually scored. None of the models produced an algorithm that accurately captured lineup capability from individual statistics. This was because individual per game statistics do not hold enough information about how combinations of players might perform together, or whether a player's performance would be above or below his expected individual statistics.

Furthermore, the models often overfit the data: they learned the patterns of the training data well but were unable to generalize. Because of this, our focus shifted to creating simpler models with fewer features. Fewer features caused a slight increase in performance, but not by much. Finally, we repeated the process with a classification of bad, average, and great offensive or defensive ability as the output, in hopes of at least being able to classify a good offensive or defensive lineup. However, both classification networks were only slightly better than random guessing.


Chapter 1

Introduction

1.1 Background

In an NBA game, each team prepares a strategy designed to allow it to score more points than its opponent. The first part of that strategy is deciding which players to play. With 15 players on a team and only 5 allowed on the floor, which of the 3003 possible combinations does a coach pick?

Generally, a coach starts with the best five players, but they cannot play the whole game. How do we know which group of five players will score the most points? Coaches could look at matchups against the opposing lineup or try balancing their obviously best players with small combinations of other players. Other analytics enthusiasts have investigated lineup performance and player types. ZigZag Analytics explored lineups by showing which factors of the game made the same lineup perform well in one game and poorly in another [1], while a paper submitted to the MIT Sloan conference by Kalman and Bosch clustered players and looked at which combinations of these clusters created the best net-rated lineup [2]. At the end of the day, a successful lineup is one that either hinders the opponent from scoring or scores more points. Since defensive statistics are hard to measure, this paper focuses on points scored, with the goal of using the individual statistics of specific players to estimate the points scored by the lineup formed from those players.


1.2 Data

The data used was scraped from Basketball Reference (Sports Reference LLC): 5-man lineup data [3], individual player statistics (basic and advanced) [4], NBA regular season standings [5], current rosters [6], and schedule results for all teams [7] from the 2019-2020 season were all scraped and put into data frames. The 5-man lineup data was game-by-game data; if a lineup was used in more than one game, as a starting lineup typically is, then more than one row with that lineup appears in the data.

The individual statistics were per game statistics in order to keep consistency between the lineup data and the individual data. The standings data provided team win percentages to give an idea of the quality of the teams; these were the win percentages at the time the 2019-2020 season was suspended.
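The thesis does not reproduce its scraping code, but the summary tables in Chapter 2 are formatted like R's summary() output, so a minimal R sketch with rvest is shown here, not the author's actual code. The URL comes from reference [4]; the header-row cleanup reflects a known quirk of basketball-reference tables and is otherwise an assumption.

    # Sketch: pull the per-game player table and drop repeated header rows.
    library(rvest)
    library(dplyr)

    per_game_url <- "https://www.basketball-reference.com/leagues/NBA_2020_per_game.html"

    player_stats <- read_html(per_game_url) %>%
      html_element("table") %>%       # first stats table on the page
      html_table() %>%
      filter(Player != "Player")      # basketball-reference repeats headers mid-table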

1.3 Methods

The goal of the procedure is to predict actual lineup production in real games and to mock a coach's decision-making process for creating a lineup based on individual statistics and how long the lineup will play. Modeling lineup points from lineup data is not particularly difficult: one can simply look at the lineup's field goals made. Modeling lineup points from individual data, where the only connection is that the data belongs to the same individuals, proved to be much more difficult.

Summary:

1. Clean data and conduct an exploratory analysis.
2. Create the feature matrix and implement machine learning models (linear regression, random forest, support vector machine, extreme gradient boost, and neural network) to predict points per minute against an opponent (see the sketch after this list).
3. Compare the models' ability to accurately assess a lineup's propensity to score points against a particular team.
4. Analyze feature importance in the models and fit simplified models.
5. Compare the regression output with a classification output.
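As a rough sketch of step 2 under assumed R tooling (the thesis prints no code), the four non-neural models might be fit as below. The data frame `train` and its response column `lineup_ppm` (lineup points per minute) are hypothetical names; all arguments stay at library defaults except the boosting rounds, which Chapter 2 reports as 50. The neural network setup is sketched in Chapter 2.

    library(randomForest)
    library(e1071)
    library(xgboost)

    lm_fit  <- lm(lineup_ppm ~ ., data = train)
    rf_fit  <- randomForest(lineup_ppm ~ ., data = train)
    svm_fit <- svm(lineup_ppm ~ ., data = train)   # eps-regression by default

    # xgboost needs a numeric matrix and an explicit number of rounds
    x_mat   <- as.matrix(train[, setdiff(names(train), "lineup_ppm")])
    xgb_fit <- xgboost(data = x_mat, label = train$lineup_ppm,
                       nrounds = 50, verbose = 0)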

1.4 Methods Detailed

We began by exploring the data set for problems that could cause difficulties in implementing any machine learning procedure. Missing data was common for players who had not participated in some aspect of the game. For example, players who had never attempted a three pointer had blank values for three point field goals made (3P), three point field goals attempted (3PA), and 3P percentage, instead of a value of zero or "not applicable." These data points were replaced with zeros.
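A sketch of that zero-fill, assuming the scraped stats sit in a data frame named `player_stats` (a hypothetical name): since every numeric blank in this data set means "never attempted," a blanket numeric fill implements the cleaning rule.

    # Replace every numeric NA with zero; non-numeric columns pass through.
    player_stats[] <- lapply(player_stats, function(col) {
      if (is.numeric(col)) replace(col, is.na(col), 0) else col
    })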

Other problems occurred with lineups put into games for short periods of time: per minute data for lineups playing small stretches of a game, and only a few times in the season, created noisy values for points per minute. Playing once for two minutes and scoring zero points does not imply that a lineup will never score in a similar situation.

Table 1 shows that an NBA team plays an average of 15.46 lineups per game, with an average of 3.14 minutes per lineup. These lineups score anywhere from zero to 40 points per minute, illustrating the per minute data problems noted above. Figure 1 shows the distribution of minutes played and points per minute; the outliers clearly come from lineups playing few minutes. Lastly, lineups in the data set played an average of 6.24 games of the roughly 65 games per team that had been played before the season was suspended.

Table 1: Summary of Data before Filtering

             Number Lineups   Minutes Played   Points per   Games Played
             per Game         by Lineups       Minute       by Lineup
    Min.       5.0000           0.1000           0.0000       1.0000
    1st Qu.   13.0000           1.0000           1.1764       2.0000
    Median    15.0000           2.0000           2.1428       4.0000
    Mean      15.4660           3.1398           2.2553       6.2364
    3rd Qu.   18.0000           3.6000           2.9591       8.0000
    Max.      31.0000          30.1000          40.0000      47.0000

We want to be able to assess lineups that can be used consistently. Thus, we eliminated all lineups that played less than 3 minutes or appeared in fewer than 6 games. We then removed outliers by dropping lineups more than two standard deviations from the new mean. Table 2, Figure 2, and Figure 3 show the new distributions of our data: approximately normal with a heavily populated center and short tails.
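A sketch of those filtering rules, with assumed column names: `minutes` (minutes the lineup played in that game), `games` (games the lineup appeared in over the season), and `ppm` (points per minute).

    library(dplyr)

    lineups <- lineups %>%
      filter(minutes >= 3, games >= 6)       # keep consistently used lineups

    mu    <- mean(lineups$ppm)
    sigma <- sd(lineups$ppm)
    lineups <- lineups %>%
      filter(abs(ppm - mu) <= 2 * sigma)     # drop outliers beyond two SDs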

Table 2: Summary of Data after Filtering

             Number Lineups   Minutes Played   Points per   Games Played
             per Game         by Lineups       Minute       by Lineup
    Min.       1.0000           3.0000           0.9090       6.0000
    1st Qu.    2.0000           4.1000           1.9218       8.0000
    Median     2.0000           6.3500           2.3333      12.0000
    Mean       2.6534           8.1991           2.3443      15.2752
    3rd Qu.    4.0000          11.3000           2.7777      20.0000
    Max.       8.0000          30.1000           3.8095      47.0000


Figure 3: QQ Plot Points Per Minute

After cleaning our data, we had 1009 unique lineups playing in a total of 4702 situations. These lineups played in an average of 15 games for 8.20 minutes per game, scoring 2.34 points per minute. The distribution was normal with a slight left skew.

Team win percentages were joined to the lineup data to attempt to capture the quality of the matchup between the two teams. Additionally, individual per game statistics were joined to the lineup data by player name and game date, while player positions, teams, and opponent teams were converted into one-hot encoded vectors. One key issue, however, was the order of players. The player data was joined by name, but the player column a given player appeared in could vary. For example, Anthony Davis would generally appear as the first player, but when playing with Avery Bradley he would appear second. We accounted for order by average-pooling the in-game statistics across the five players.

The final input of our data contained the average-pooled player statistics, team and opponent win-loss percentages, and minutes played by a lineup, with an output of points per minute scored by the lineup.
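A sketch of the order-pooling fix: average each statistic across the five player slots so that slot order no longer matters. The joined column layout p1_<stat> through p5_<stat> is an assumption about how the five players' columns are named after the join.

    # Average a per-player statistic across the five lineup slots.
    pool_stat <- function(df, stat) {
      rowMeans(df[, paste0("p", 1:5, "_", stat)])
    }

    lineups$ast_pct <- pool_stat(lineups, "ast_pct")
    lineups$obpm    <- pool_stat(lineups, "obpm")
    # ...and so on for every per-player statistic in Table 10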

Chapter 2

Results

All arguments in the machine learning models were set to the default values from the library each function came from, except XGBoost, which requires a specified number of rounds; a value of 50 rounds was used. The neural network required substantial trial and error to determine the appropriate architecture in terms of the number of hidden layers and the number of neurons in each layer. The final architecture was a pooling layer followed by three hidden layers with ReLU activation functions. The network's loss function was mean absolute error, optimized using the Adam optimizer with a learning rate of 0.001. Figure 4 diagrams the neural network architecture.

Figure 4: Neural Network Diagram

The numbers represent the input size and the labels on the left represent the layer type. Figure 5 is a 1/16th scale diagram of the dense layers after the average pooling layer.

Figure 5: Dense Layers 1/16th Scale Diagram

Early stopping was implemented with a monitor on the loss function and a patience of 50 epochs. The network was trained for a total of 200 epochs with a batch size of 32 and a validation split of 20 percent. Multiple network architectures were considered but yielded similar results. If the network was trained for fewer than roughly 25 epochs, the validation MAE would remain close to the training MAE, but the network would only predict a very small range.

Essentially, it guessed close to the mean: because a large number of lineups score near the mean, guessing close to the mean minimizes the absolute error at the cost of larger errors on the fewer lineups further from the mean. However, even though the goal of the model was the lowest MAE, we wanted the model to disperse from the mean in order to distinguish a good lineup from a bad one. At the risk of over-fitting, we continued training beyond 50 epochs to have the neural network predict higher or lower point production values. Figure 6 shows how the neural network started over-fitting near the 25-epoch mark: the network continues to learn on the training data but not on the validation data, a classic sign of over-fitting.
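A sketch of the network and training loop described above, using the R keras interface. The hidden-layer widths are assumptions (the text reports three ReLU hidden layers after a pooling layer but not their sizes); the loss, optimizer, learning rate, epochs, batch size, validation split, and early-stopping patience follow the text. `x_train` is a hypothetical (samples, 5, n_features) array of per-player statistics.

    library(keras)

    model <- keras_model_sequential() %>%
      layer_global_average_pooling_1d(input_shape = c(5, n_features)) %>%
      layer_dense(units = 256, activation = "relu") %>%
      layer_dense(units = 128, activation = "relu") %>%
      layer_dense(units = 64,  activation = "relu") %>%
      layer_dense(units = 1)                    # predicted points per minute

    model %>% compile(
      loss      = "mae",
      optimizer = optimizer_adam(learning_rate = 0.001)
    )

    history <- model %>% fit(
      x_train, y_train,
      epochs = 200, batch_size = 32, validation_split = 0.2,
      callbacks = list(callback_early_stopping(monitor = "loss", patience = 50))
    )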

The hyper parameters and optimizers tried included the network's depth and width, stochastic gradient descent (SGD) with momentum versus Adam, and different learning rates and batch sizes. Batch sizes between 16 and 32 had similar performance; batch sizes of 8 or below increased training time without significant improvement. Learning rate followed a similar pattern: increasing the rate produced similar training patterns with more dramatic MAE differences between epochs. SGD with momentum performed better on the classification problems, while Adam performed better on regression. The size of the network proved very important: designs with two or three hidden layers and a maximum width of 64 neurons never drifted from the mean. These models technically predicted better in terms of MAE and did not over-fit, but they provided no usable information for coaches.

Figure 6: Neural Network Training


Table 3: Prediction Distributions

                Actual Points   lm_pred   rf_pred   svm_pred   xgboost_pred   nn_pred
                Scored
    Min.         0.9375          1.8185    1.5104    1.6408      1.1054        1.0983
    1st Qu.      1.9354          2.2298    2.1920    2.1715      2.1573        2.0660
    Median       2.3684          2.3417    2.3405    2.3384      2.3449        2.3168
    Mean         2.3789          2.3378    2.3446    2.3321      2.3354        2.3424
    3rd Qu.      2.8225          2.4571    2.5049    2.4842      2.5151        2.6192
    Max.         3.7837          2.7500    3.1247    2.9740      3.5933        3.9602
    MAE Train    NA              0.4781    0.2279    0.3506      0.2747        0.2317
    MAE Test     NA              0.5197    0.5481    0.5317      0.5573        0.5904
    r-squared    NA             -0.0117   -0.1385   -0.0693     -0.2121       -0.3436

Table 3 shows the range, MAE, and r-squared of the model predictions against the training and test sets. Each model's MAE was smaller on the training set than on the test set, indicating that the models were all over-fitting and not generalizing well to new data. Linear regression over-fit the least, with MAEs of 0.48 (train) and 0.52 (test); the worst over-fitting models were the neural network and XGBoost, with training MAEs of 0.23 and 0.27 and test MAEs of 0.59 and 0.56. Random forest was similar, with a training MAE of 0.23 and a test MAE of 0.55. However, the neural network best followed the distribution of the actual points scored.

All of the r-squared values are negative, meaning we would have done better by simply guessing the mean every time; doing so would give an MAE of 0.518. This shows how little relevance individual statistics have when determining a lineup's actual ability to produce points.
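A sketch of the scoring behind Table 3, with hypothetical vectors `y_test` (actual points per minute) and `pred` (one model's predictions).

    mae <- function(actual, predicted) mean(abs(actual - predicted))

    r_squared <- function(actual, predicted) {
      1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
    }

    mae(y_test, pred)
    r_squared(y_test, pred)   # negative whenever the model is worse than the mean

    # The mean-guessing baseline the text compares against:
    mae(y_test, rep(mean(y_test), length(y_test)))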

Figure 7 below plots one hundred random samples from the test set, with actual minutes played on the x-axis and points per minute on the y-axis. The turquoise filled circles are the actual points scored and the coral 'x' points are the predicted values. The fewer the minutes played by a lineup, the more variance in scores and the closer the models predicted to the mean. The models from left to right and top to bottom are linear regression, random forest, SVM, XGBoost, and neural network. The neural network and XGBoost appear best at separating from the mean and distinguishing lineups, a result of their over-fitting.

Figure 8 contains the entire test set plotted with the neural network predictions over the top. The tails indicate how the model still had difficulty determining which individual statistics most contributed to a lineup's ability.

2.1 Feature Selection

Having seen that all models suffered from over-fitting, our next step was to reduce the number of features to those with the most predictive value. We employed a best subset regression model with forward selection and the random forest model to select the most informative features.

The Appendix contains the output for the best subset model, where each row depicts a model with n variables and each column represents an available variable; a (*) in the nth column means the variable was included in the best linear model of size n under forward selection. The best seven variables in linear regression were: Offensive Box Plus-Minus (obpm), Opponent Win Percentage (opp_win_per), Offensive Rebound Percentage (orb_pct), Assists Per Minute (ast_per_min), Assist Percentage (ast_pct), Points Per Minute (pts_per_min), and Usage Percentage (usg_pct).
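A sketch of the forward-selection search behind Table 9, assuming the leaps package and the hypothetical `train` frame used in the earlier sketches.

    library(leaps)

    fwd <- regsubsets(lineup_ppm ~ ., data = train,
                      method = "forward", nvmax = 9)
    summary(fwd)$which   # logical grid matching the (*) pattern in Table 9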

To complete the top ten variables, we also searched for variable importance using random forests. Figure 9 shows the importance of each variable, in terms of node purity, when predicting actual lineup points. The top seven were: Actual Minutes Played (actual_MP), Opponent Win Percentage (opp_win_per), Game Location (game_location), Two Point Percentage (two_P_pct), Free Throw Percentage (ft_pct), Assists Per Minute (ast_per_min), and Personal Fouls Per Minute (pf_per_min).

Figure 9: Random Forest Variable Importance
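A sketch of producing an importance ranking like Figure 9 with the randomForest package (assumed tooling, not the thesis code), reusing the `rf_fit` regression forest from the earlier sketch.

    library(randomForest)

    imp <- importance(rf_fit)   # IncNodePurity for a regression forest
    head(sort(imp[, "IncNodePurity"], decreasing = TRUE), 10)   # top ten variables
    varImpPlot(rf_fit)          # a plot comparable to Figure 9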

We took the top variables from each method to create a top-10-features data frame. The top ten were Offensive Box Plus-Minus (obpm), Opponent Win Percentage (opp_win_per), Offensive Rebound Percentage (orb_pct), Assists Per Minute (ast_per_min), Actual Minutes Played (actual_MP), Game Location (game_location), Points Per Minute (pts_per_min), Usage Percentage (usg_pct), Two Point Percentage (two_P_pct), and Free Throw Percentage (ft_pct). We used the same hyper parameters for our models as before, simply lowering the number of features.


Figure 10: Neural Network Training Reduced Features

Table 4: Simplified Prediction Distributions

                Actual Points   lm_pred   rf_pred   svm_pred   xgboost_pred   nn_pred
                Scored
    Min.         0.9375          1.9685    1.5498    1.5255      1.0030        1.4999
    1st Qu.      1.9354          2.2607    2.2160    2.2033      2.1693        2.1513
    Median       2.3684          2.3354    2.3375    2.3281      2.3464        2.3850
    Mean         2.3789          2.3336    2.3392    2.3311      2.3334        2.4204
    3rd Qu.      2.8225          2.4099    2.4750    2.4477      2.5065        2.6046
    Max.         3.7837          2.6964    3.0941    2.9607      3.3954        3.7943
    MAE Train    NA              0.4877    0.2483    0.4424      0.2747        0.2317
    MAE Test     NA              0.5154    0.5397    0.5264      0.5527        0.5624
    r-squared    NA              0.0057   -0.0918   -0.0395     -0.1523       -0.2417

Table 4 shows a slight improvement in r-squared but similar MAEs for both training and testing. Linear regression remained exactly the same when reduced to the 10 features we chose. In fact, most models remained almost identical to their full-feature versions. This led us to believe that many of the variables were not informative and that we could achieve the same MAE with only 10 features.

2.2 Classification

The inability to shift away from the mean without sacrificing generalization suggested a shift to classification approaches. Thus, we binned actual points scored into three classes: low scoring, average scoring, and high scoring. The low and high scoring classes were the bottom and top twenty percent of the points per minute data; the rationale was to focus on top tier lineups. Here we consider both offensive and defensive rankings of lineups. The offensive model used the ten features determined most important in the section above. The defensive model used only defensive statistics, such as steals, Defensive Box Plus-Minus (dbpm), and Defensive Rebounds (drb).

A feed-forward network was used in both cases. Each network had four hidden layers and a softmax output layer to predict one of the three classes. Instead of letting the largest probability determine the class, a softer boundary was used to push predictions toward the high or low scoring classes. This caused the low scoring class to be predicted much less often but provided more guesses for high scoring lineups.
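A sketch of the three-way binning: the bottom 20 percent of points per minute becomes the low class (1), the top 20 percent the high class (2), and everything else average (0), matching the labels in Tables 5 through 8. The "softer boundary" is described only loosely in the text; one plausible reading, assumed here, is to take a tail class whenever its softmax probability clears a threshold below 1/3 rather than always taking the argmax.

    cuts <- quantile(lineups$ppm, probs = c(0.2, 0.8))

    lineups$class <- ifelse(lineups$ppm <= cuts[1], 1L,
                     ifelse(lineups$ppm >= cuts[2], 2L, 0L))

    # Assumed soft decision rule; `p` is a length-3 softmax output ordered
    # (average, low, high).
    soft_class <- function(p, thresh = 0.25) {
      if (p[3] >= thresh) 2L else if (p[2] >= thresh) 1L else 0L
    }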


Figure 11: Neural Network Training Offensive Points Scored

Table 5: Confusion Matrix Offensive Points Scored

                  Predicted
    Actual      0      1      2
    0         283    143    125
    1          55     75     51
    2          60     85     64

Accuracy: 0.4484591

Table 6: Classification Scores for Offensive Points Scored

         precision   recall     f1
    0     0.7110     0.5136   0.5964
    1     0.2475     0.4143   0.3099
    2     0.2666     0.3062   0.2850

The tables above show the actual and predicted classes of the lineups, where zero is average, one is low scoring, and two is high scoring. An accuracy of 45 percent is fairly low considering that 33 percent accuracy is expected from random guessing. Additionally, a 30 percent recall for the highest scoring lineups is worse than random guessing. However, when the model misclassified high scoring lineups, it generally predicted them as average rather than low scoring.
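A sketch of the scoring behind Tables 5 through 8, with hypothetical vectors `actual` and `predicted` holding the class labels 0, 1, 2.

    cm <- table(Actual = actual, Predicted = predicted)

    accuracy  <- sum(diag(cm)) / sum(cm)
    precision <- diag(cm) / colSums(cm)   # correct guesses per predicted class
    recall    <- diag(cm) / rowSums(cm)   # correct guesses per actual class
    f1        <- 2 * precision * recall / (precision + recall)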

Figure 12: Neural Network Training Defensive Points Allowed

Table 7: Confusion Matrix Defensive Points Allowed

                  Predicted
    Actual      0      1      2
    0         334    110    127
    1          57     60     62
    2          66     60     65

Accuracy: 0.487779

Table 8: Classification Scores for Defensive Points Allowed

         precision   recall     f1
    0     0.7308     0.5849   0.6498
    1     0.2608     0.3351   0.2933
    2     0.2559     0.3403   0.2921


Predicting defensive classes was slightly more accurate than offensive classes at 0.49; however, this was because the model guessed the average class more often, and by sheer proportion those guesses are correct more often. The recall for the high and low scoring classes was about even.

Overall, building a classification model instead of minimizing MAE was not very helpful and continued the pattern of pushing towards the mean.


Chapter 3

Conclusion

Overall, none of the models accurately predicted a lineup's actual points per minute from individual per game data. Each model had its advantages and disadvantages. Linear regression was a simple model that produced the lowest MAE and did not over-fit the training data, but it predicted a range that never moved far from the mean. The opposite occurred with our best model in terms of matching the distribution of the test set: the neural network over-fit the training data, but because of that it was able to move further from the mean and flag lineups that could be better or worse than average. Random forest and XGBoost over-fit the data the way the neural network did without pulling away from the mean as much. The support vector machine sat between linear regression and the neural network: practically the same MAE, with more range than the linear model but less than the neural network, especially at the minimum.

All of the r-squared values were negative, meaning we would have been better off selecting the mean for every lineup than using any of the models' predictions. Again, this showed that none of the models were able to take in information on five individual players and accurately predict how those five players would perform together against a particular opponent. Narrowing down the features to make the models more generalizable helped slightly, but not enough to make any of the models usable. Lastly, instead of predicting numbers, we predicted which lineups would do significantly better and which significantly worse, using offensive and defensive classification models; both were only slightly better than random guessing.

In the end, this paper shows that creating lineups blindly from the statistics of individual players is much harder than expected. Coaches attach many qualitative factors to players' names that cannot be recorded in statistics. Draymond Green is a prime example: while his stat line appears unremarkable, his value to the team is not captured by the usual player statistics.

The next steps would be to reduce variance by considering two-player combinations. Is there chemistry between two players that causes them to play particularly well together? If so, at what number of players can we no longer model their ability to play well together enough to predict their point output?


Appendix

Table 9: Best subset regression model with forward selection to select the most informative features. Each row shows a variable and each numbered column a model size n; a (*) in column n means the variable was included in the best linear model of size n under forward selection. Only variables selected in at least one model are shown; every other candidate (all remaining statistics in Table 10, including the team_*, opp_*, and p*_pos_* indicator variables) was never selected.

    Variable        1  2  3  4  5  6  7  8  9
    obpm            *  *  *  *  *  *  *  *  *
    opp_win_per        *  *  *  *  *  *  *  *
    orb_pct               *  *  *  *  *  *  *
    ast_per_min              *  *  *  *  *  *
    ast_pct                     *  *  *  *  *
    pts_per_min                    *  *  *  *
    usg_pct                           *  *  *
    p3_pos_C                             *  *
    opp_DAL                                 *

Table 10: Variable Glossary

    Variable            Definition
    game_location       Home or away game
    tm_win_per          Team win percentage
    opp_win_per         Opponent win percentage
    age                 Age of player
    G                   Games played
    GS                  Games started
    MP                  Minutes played
    fg_pct              Field goal percentage
    three_P_pct         Three point percentage
    two_P_pct           Two point percentage
    efg_pct             Effective field goal percentage
    ft_pct              Free throw percentage
    per                 Player Efficiency Rating
    ts_pct              True shooting percentage
    fg3a_per_fga_pct    Three point attempts per field goal attempt percentage
    fta_per_fga_pct     Free throw attempts per field goal attempt percentage
    orb_pct             Offensive rebound percentage
    drb_pct             Defensive rebound percentage
    trb_pct             Total rebound percentage
    ast_pct             Assist percentage
    stl_pct             Steal percentage
    blk_pct             Block percentage
    tov_pct             Turnover percentage
    usg_pct             Usage percentage
    ows                 Offensive Win Shares
    dws                 Defensive Win Shares
    ws                  Win Shares
    ws_per_48           Win Shares per 48 minutes
    obpm                Offensive Box Plus-Minus
    dbpm                Defensive Box Plus-Minus
    bpm                 Box Plus-Minus
    vorp                Value Over Replacement Player
    pts_per_min         Points per minute
    fg_per_min          Field goals made per minute
    fga_per_min         Field goal attempts per minute
    threeP_per_min      Three pointers made per minute
    threePA_per_min     Three point attempts per minute
    twoP_per_min        Two pointers made per minute
    twoPA_per_min       Two point attempts per minute
    ft_per_min          Free throws made per minute
    fta_per_min         Free throw attempts per minute
    orb_per_min         Offensive rebounds per minute
    drb_per_min         Defensive rebounds per minute
    trb_per_min         Total rebounds per minute
    ast_per_min         Assists per minute
    stl_per_min         Steals per minute
    blk_per_min         Blocks per minute
    tov_per_min         Turnovers per minute
    pf_per_min          Personal fouls per minute
    p*_pos_*            Player number * position
    team_*              Team * abbreviation
    opp_*               Opponent * abbreviation


References

[1] “Getting the Most out of NBA Lineups with Machine Learning.” ZigZag Analytics, http://www.zigzaganalytics.com/home/getting-the-most-out-of-nba-lineups-with-machine-learning.

[2] Samuel Kalman and Jonathan Bosch. “NBA Lineup Analysis on Clustered Player Tendencies: A new approach to the positions of basketball & modeling lineup efficiency of soft lineup aggregates.” MIT Sloan Sports Analytics Conference.

[3] “NBA Lineup Stats: Lineup, Per Game.” Sports Reference LLC, https://www.basketball-reference.com/play-index/lineup_finder

[4] “NBA Player Stats: Per Game.” Sports Reference LLC, https://www.basketball-reference.com/leagues/NBA_2020_per_game.html

[5] “NBA Standing Stats.” Sports Reference LLC, https://www.basketball-reference.com/leagues/NBA_2020_standings.html

[6] “NBA Current Roster (varies by team).” Sports Reference LLC, https://www.basketball-reference.com/teams/GSW/2020.html

[7] “NBA Schedule (varies by month).” Sports Reference LLC, https://www.basketball-reference.com/leagues/NBA_2020_games-october.html