
This thesis has been approved by

The Honors Tutorial College and the Department of Business Administration

______

Dr. William A. Young II, O’Bleness Associate Professor of Analytics & Information Systems, Thesis Advisor

______

Dr. Ehsan Ardjmand, Assistant Professor, Analytics & Information Systems, Committee Member

______

Dr. Raymond Frost, Director of Studies, Business Administration, Analytics & Information Systems

______


ALL-NBA TEAM VOTING PATTERNS: USING CLASSIFICATION MODELS TO IDENTIFY HOW AND WHY PLAYERS ARE NOMINATED

______

A Thesis

Presented to

The Honors Tutorial College

Ohio University

______

In Partial Fulfillment of the Requirements for Graduation from the Honors Tutorial College with the degree of

Bachelor of Business Administration

______

by

Graydon R. Levine

May 2019


Abstract

This research aims to connect the implications of on-court performance in the National Basketball Association (NBA) to All-NBA Team voting patterns. At the end of each regular season, a panel of writers and broadcasters votes on who they deem to be the 15 best players of that season, forming the All-NBA Teams. As both the ultimate signifier of high performance and the ultimate determinant of maximum contracts, a player’s selection to one of the All-NBA Teams can change both the course of his career and the long-term success of his team. Due to the outsize importance of nominations and the lack of available research in this field, this research’s main purpose is two-fold: it aims to find both the best classification model for voting patterns and, concurrently, the most sensitive attributes, in order to construct a framework for future players. Subjects in the initial data set consist of all players who started at least 82 regular-season games from 2010 to 2017, with the dependent variable being whether or not they were nominated to one of the three teams within the same timeframe. Due to the rarity of an All-NBA Team selection, this data set contains a dominant majority class. Twenty-one models will be constructed from it, and the methodology utilizes existing classification methods—bagging, boosting, random forests, logistic regressions, and classification trees—for model generation. After constructing these models on the original imbalanced data set, an oversampled set, and an undersampled set, the study scores all of the models against a control group and runs a sensitivity analysis on the best model. The results of the study found that the bagged classification tree run on the original imbalanced data set best predicted voting patterns, and that the assists per game a player registers during the regular season in which he either received his first nomination or posted his best PER is the most sensitive attribute.

Keywords: classification; multiple models; voting; ensembles; basketball


Table of Contents

1 Introduction
1.1 Motivation for the Study
1.2 Significance of the Research
1.3 Organization of the Report
2 Literature Review
2.1 Existing Literature in the NBA
2.1.1 Research on player decision-making and performance in the NBA
2.1.2 Research on team decision-making and performance in the NBA
2.2 Classification Techniques
2.2.1 Classification trees
2.2.2 Logistic regressions
2.3 Ensemble modeling
2.3.1 Aggregation of models and practical applications of methodology
3 Methodology
3.1 Data Set Formulation
3.2 Pre-Analysis PER Insights
3.3 Pre-Processing
3.4 Model Creation
4 Results
5 Conclusion
6 Reflection
6.1 Initial Project Formulation
6.1.1 Personal interests
6.1.2 Academic field
6.1.3 Professional pursuits
6.1.4 Initial project
6.2 Deviation, New Project Formulation
6.3 Learning Outcomes
6.3.1 Academic
6.3.2 Personal
7 Bibliography


Acknowledgements

Both the formulation of my thesis idea and its execution have been long, arduous processes that could not have happened without the help of many, many people.

First of all, I would like to express my sincere thanks and appreciation for my director of studies, Dr. Raymond Frost. Seeing as he is the man that granted me a position within the Honors Tutorial College in the first place, without him, none of this would be possible. Ever since my freshman year in 2015, Dr. Frost has continually encouraged me to pursue whatever weird class or project I desired with vigor. Moreover, his emphasis on fraternizing with fellow HTC Business Administration students has been instrumental in bringing me closer to my cohort; as the expression goes, you are the average of the people you spend the most time around, and Dr. Frost has facilitated my association with many, many incredible students. Finally, his more-than-evident sense of general care has provided me with the academic backbone I knew I needed the second I applied for college.

The inception of my thesis occurred during the spring semester of my sophomore year in Dr. Katherine Hartman’s research tutorial class. In her class, my cohort and I were tasked with learning different research methods each week by collecting reports incorporating various methodologies. Throughout the semester, we were to collect sources that pertained to one central idea. Mine happened to be basketball-related, and, at the end of the semester, I had compiled a sizeable bibliography with reports that can be found in this very paper. Thank you, Dr. Hartman, for giving me the foundation necessary for this report. Also, I am sorry for never joining the CRC.

Next, I would like to give a huge, huge thank-you to Dr. Norman O’Reilly. As my first thesis advisor, Dr. O’Reilly provided my fledgling idea with a solid analytical background. Working with him week-to-week was a pleasure, as it is not often that an undergraduate student gets to work with one of the best researchers in any field, and even less often when that researcher is patient, receptive, and kind. Even after moving back to Canada, Dr. O’Reilly took time out of his incredibly busy schedule to work with me—for that, I am very thankful. I am sure that if Dr. O’Reilly had never moved, I would have a different yet equally successful thesis fully under his guidance.

The road to completing my thesis was jagged and included one massive fork: switching advisors. After parting ways with Dr. O’Reilly, I chose Dr. William A. Young II to be my “anchor” of sorts. With a CV longer than my thesis and a breadth of wisdom to boot, Dr. Young was the perfect advisor to take my thesis home. Due to the circumstances of my thesis being delayed and my impending graduation, I forced a sped-up timeline of completion onto Dr. Young. Thankfully, he did not flinch. Through countless mental battles, countless needs for clarification, and countless emails and texts, Dr. Young was nothing but supportive and helpful to me. He even went the extra mile by bringing in Dr. Ehsan Ardjmand (too many esteemed colleagues to count!), yet another friendly face who provided new insights that were instrumental in this project’s completion.

Finally, I would like to thank the Honors Tutorial College as a whole. While my entire HTC experience has been guided by the idea of “wearing my honors lightly,” here, I want to put on an extra metaphorical coat. HTC has provided me with too many opportunities to count, including: linking me up with other amazing people in the honors dorms, letting me bypass generic education classes, exposing me to the best professors Ohio University has to offer on an intimate basis, giving me full tuition, and, yes, allowing me to conduct this very research. While I could have and should have been more directly involved with the college throughout my years in it, I cannot thank Cary Frith, Margie Huber, Kathy White, and Dean Webster enough. You all have given me the best four years I could have ever asked for.


1 Introduction

The National Basketball Association (NBA) is one of the four major U.S. sports leagues, along with the National Football League, the National Hockey League, and Major League Baseball. While the NBA has held this position for more than a half-century, as of 2019, it is more popular and profitable than ever. In the 2017-2018 season, the league generated $7.4 billion in revenue (Nath 2018). Reasons for the league’s surge in popularity are many, including: its non-stop nature, the increased exposure and vulnerability of its players, the alignment between the league and the current political climate, and more.

More than anything, though, the NBA is more popular than ever due to its aesthetically pleasing nature. As of 2018, all thirty teams have access to SportVu, a tracking system that analyzes the movement of players in real time and has opened up new ways to run offensive sets. This development, along with a new wave of marketable star players, has created an environment in which the league’s 30 teams play a brand of basketball that is fast, high-scoring, and beautiful (Kopf 2017).

The democratization of access to on-court metrics enables all 30 teams to construct competitive rosters, but salary stipulations make roster construction challenging to navigate, especially for small-market organizations. To address salary-related issues, the league’s executives periodically meet with the National Basketball Players Association (NBPA) to adjust salary rules as part of collective bargaining agreements (CBAs). Set by the 2016 collective bargaining agreement, the cap for each team (the determinant of how much it can spend per year on total salaries) in the 2017-2018 season rested at $99.093 million. This hike represented a $5 million increase from the season before, which is quite significant given previous, incremental changes. Conversely, the minimum amount teams were required to spend was 90 percent of the cap, or $89.18 million (Nath 2018). If teams decide to exceed the cap in any given year, they are forced to pay an additional “luxury tax.” Looking at recent data, this overspending strategy proves to be successful, as, from 2012 to 2017, the five teams that spent the most money year-over-year all made the NBA playoffs (HoopsHype 2018). With spending loosely correlated to team success, it is imperative for all teams to understand how to maximize efficiency when team building.

1.1 Motivation for the Study

Under the new CBA, measures such as performance and longevity are linked to larger contracts. Contracts for the following scenarios are all subject to higher pay than before (Aldridge 2016):

• Veteran minimums for players with over ten years of experience

• Maximum contracts for players with seven to nine years of experience

• Players with five or more years of playing experience

• Rookie contracts

• Mid-level exceptions

• The bi-annual exception for players with over ten years of experience

Possibly more impactful than any of the previous scenarios is the new “supermax.” Introduced in the 2016 CBA, the “supermax” permits a player “entering (or have just completed) his eighth or ninth season” who has achieved at least one of the following: making one of the three All-NBA Teams or being named either Defensive Player of the Year or MVP in the year prior, making one of the three All-NBA Teams or being named Defensive Player of the Year in two of the past three seasons, or being named MVP in one of the past three seasons. Additionally, the player looking to receive a “supermax” must still be on the team that drafted him, unless he was traded while still on his rookie deal. For all teams, the value of a “supermax” can be up to 35 percent of the team’s total salary cap (Adams 2017).

The ultimate measure of individual success in the NBA is the All-NBA Teams. Each year, the three teams have somewhat rigid positional requirements: each must be composed of two designated backcourt players (point guards and shooting guards), two designated frontcourt players (small forwards and power forwards), and one designated center. Each year, players are nominated by a:

126-member voting panel of writers and broadcasters throughout the United States and Canada [consisting] of national media members and members from each of the league’s 30 teams. . .The media [votes] for All-NBA First, Second and Third Teams by position with points awarded on a 5-3-1 basis. (NBA 2006)
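As a concrete illustration of the 5-3-1 basis quoted above, the short sketch below tallies hypothetical ballots. The player names and ballots are invented for illustration and are not actual voting data; only the point weights come from the NBA’s description.

```python
from collections import defaultdict

# Point values per the 5-3-1 basis: First Team = 5, Second = 3, Third = 1.
POINTS = {"First": 5, "Second": 3, "Third": 1}

def tally_votes(ballots):
    """Sum each player's weighted All-NBA points across all ballots."""
    totals = defaultdict(int)
    for ballot in ballots:
        for player, team in ballot:
            totals[player] += POINTS[team]
    return dict(totals)

# Two hypothetical ballots from a (much smaller) voting panel.
ballots = [
    [("Player A", "First"), ("Player B", "Second")],
    [("Player A", "First"), ("Player B", "Third")],
]
print(tally_votes(ballots))  # {'Player A': 10, 'Player B': 4}
```

In the real process, the players with the highest point totals at each position fill the First, Second, and Third Teams in order.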

The implications of these teams are three-fold. First and foremost, they most closely point to overall individual success, as they consider a player’s complete repertoire of talents. While many people believe an All-NBA selection to be equivalent to an All-Star designation, it is not, as All-Star selections are often simply popularity contests that over-emphasize fan voting and undersell true performance. For example, in many years, retiring legends are granted All-Star status in their waning seasons, even though they no longer contribute to successful on-court outcomes. Second, All-NBA Team nominations open pathways to newer and higher contract negotiations, shaking up not only a player’s personal finances, but the team’s. To illustrate how drastic such a shift can be, consider that, from 2013 to 2016, Golden State Warriors point guard Stephen Curry made $11 million per year. During this period, Curry received multiple All-NBA Team nominations; his salary in 2019 rests at $40 million per year (Spotrac 2019). Finally, selection to even one of the three teams is rare, creating a large disparity among players in any given year. In summation, if an organization drafts or trades for a player with at least one All-NBA Team selection, it is safe to say that it has acquired an asset that can shift the entire franchise, both on the court and in the finance department.

1.2 Significance of the Research

While the ramifications of All-NBA Team selections are well-documented and understood, research in the field still has not answered the questions of how and why players are selected to the team. That is the purpose of this research.

Theoretically, this research aims to uncover any linkages from on-court performance to All-NBA Team voting patterns by comparing multiple classification models against each other through a controlled data set. As previously stated, there is always a massive imbalance between active players with and without All-NBA Team selections to their name; the original data set is no different. First, three training data sets will be constructed—an imbalanced set, an undersampled set, and an oversampled set. Then, each data set will be classified by both ensemble and non-ensemble classification methods. After the models are constructed, each will be scored against the control group in order to find the most instructive model. Finally, a sensitivity analysis will be run on the model of best fit to see which attributes prove most sensitive in predicting All-NBA Team voting patterns. The chosen journal for this research’s anticipated submission is the Journal of Quantitative Analysis in Sports. As “an official journal of the American Statistical Association,” the Journal of Quantitative Analysis in Sports is the perfect place for research of this type. It contains a multitude of analytical reports that represent both interesting topical pieces in sport and complex methodologies that align with those found in this report (Journal of Quantitative Analysis in Sports 2019).

If done correctly, this analysis will grant future player-personnel managers a tool that optimizes roster construction and minimizes harmful, long-term decision-making. Moreover, it will be one small step toward incorporating and encouraging more forward-thinking research in NBA circles. As one of the best-positioned leagues going forward, the inclusion of future studies along these lines will give the NBA’s on-court product one more competitive advantage to lean on.

1.3 Organization of the Report

This study will be organized in the traditional ILMRC format. Its literature review will discuss both existing quantitative research done on the NBA and the types of methodologies found in this report. Its methodology section will describe the pre-processing, processing, and post-processing techniques used in the report. Its results section will then lay out the results of each of the models in detail. Finally, its conclusion will both summarize the report as a whole and highlight the limitations and areas of improvement associated with the research. After the conclusion, a reflection portion will detail my thesis journey from start to finish.

2 Literature Review

In 1980, Bill James deemed the emerging advanced analysis of baseball “Sabermetrics,” effectively beginning the era of sports analytics (Birnbaum 2019). For years, Sabermetrics grew within the baseball community. However, it was not until the new millennium that basketball started accepting and utilizing analytical tools similar to Sabermetrics. Since then, multitudes of quantitative studies have been conducted on the NBA. While the vast majority of the research focuses on off-court matters, a still-emerging sector has tried to describe on-court issues. This literature review will first describe the existing research done on the NBA both chronologically and by subject type. Then, it will explain the various types of classification models used in this report.

2.1 Existing Literature in the NBA

While mostly descriptive, a multitude of quantitative studies surrounding on-court performance factors in the NBA have been conducted since the turn of the millennium. For the sake of the report, this research will be broken into two camps: research surrounding individual performance and research surrounding team performance.

2.1.1 Research on Player Decision-Making and Performance in the NBA

In 2010, both the availability of real-time player-tracking data and NBA research were limited. A year before the introduction of John Hollinger’s Player Efficiency Rating (PER), researchers and teams had few holistic measures to work with (Hollinger 2011). In March of that year, researcher Joseph Sill conducted a study questioning the validity of an already-introduced, yet little-tested, performance measurement—Adjusted Plus-Minus (APM), a statistic that measures the difference between points scored for and against a given team with any given player on the floor. Through comparing various plus-minus splits from 2007 to 2009 while setting parameters based on years and regression, Sill found that APM is accurate, provided it is regularized by ridge regression and parameter tuning (Sill 2010).

In 2011, researchers Matthew Goldman and Justin Rao aimed to expand the young field of NBA descriptive analytics by conducting a research report describing how NBA players select which shots to take. In their report, the researchers broke the idea of a shot into two segments: dynamic efficiency, which “requires that marginal shot value exceeds the continuation value of the possession,” and allocative efficiency, which “is the additional requirement that at that ‘moment’, each player in the line-up has equal marginal efficacy” (Goldman and Rao 2011). Using player-tracked shot data from the 2006-2010 NBA seasons, the researchers found that NBA players, as a general rule, select shots at optimal volumes and times (Goldman and Rao 2011).

In 2012, Goldman and Rao continued studying how and why NBA players make decisions when they conducted a longitudinal study of free-throw efficiency and effectiveness in the NBA. Using on-court statistics from 2005 to 2010, the researchers found that, in pressure-filled free-throw situations, the home team performed worse at a statistically significant level while the away team experienced no significant changes (Goldman and Rao 2012).

By the end of that year, the NBA’s increased awareness of sports analytics was apparent, as more research was coming out more frequently. In late 2012, researcher Kirk Goldsberry conducted research seeking new ways to quantify and measure NBA players’ shooting abilities through a case study that employed CourtVision—at the time, recently developed player-tracking technology—in the hopes of concretely finding the NBA’s best shooter. Via a case study that “[used] game data sets for every NBA game played between 2006 and 2011. . .[that included] player name, shot location, and shot outcome for over 700,000 attempts,” Goldsberry created two new metrics: Spread and Range, with the former measuring how much of the shooting area a player utilizes and the latter finding the “percentage of the scoring area in which a player averages more than 1 PPA” (Goldsberry 2012).

In 2013, Goldsberry, along with researcher Eric Weiss, published another report, this time with the goal of measuring defensive performance better via two case studies: “The Basket Proximity Condition” and “The Shot Proximity Condition.” Data for the studies came from “player tracking data provided by STATS (SportVu). . .for over 75,000 NBA shots during the 2011-2012 and 2012-2013 seasons. . .in the presence of 52 NBA interior defenders who faced at least 500 shot attempts during the study period” (Goldsberry and Weiss 2013). Results for “The Basket Proximity Condition” study indicated that its best player was Roy Hibbert, as, while he was within five feet of the basket, opposing players had the lowest field-goal percentage. Results for “The Shot Proximity Condition” study indicated that its best player was Dwight Howard, as, throughout the duration of his time within five feet of the basket, opposing players shot the lowest percentage of shots (Goldsberry and Weiss 2013).

With the turn of the new year to 2014, the NBA reached an inflection point with analytics: while previous studies focused on describing simplistic measures with fairly simplistic data, new studies broke new ground. In April of 2014, sports analytics researchers Christopher Barnes and Eric Uhlmann aimed to discover both how NBA players make decisions when faced with high-pressure situations and how their decisions correlate to their salaries. In their report, the researchers cited the internal conflict NBA players face when choosing between individual glory and team performance, hypothesizing that the former option is pursued more often than not. After collecting both on-court data and salaries for all NBA teams from 2004 to 2012, the researchers employed a “multi-level analysis using hierarchical linear modeling” to check their hypothesis. Barnes and Uhlmann found that the ratio of assists per made field goal decreased significantly in the playoffs, from 0.59 to 0.54. Moreover, they found a positive correlation between team cooperation and player performance. Finally, they found that field goals positively correlated with salary increases while assists did not (Barnes and Uhlmann 2016).

In 2015, multiple analytics researchers conducted research that aimed to empirically categorize NBA players by both star quality and role. For data, the researchers looked at 1,230 games, with 548 players, from the 2013 season. After dividing key performance metrics by minutes per game, the researchers conducted descriptive discriminant analysis. They found elbow touches, defensive rebounds, pull-up points, close points, close touches, and defensive speed to be the key separating statistics between regular and All-Star-caliber players. Moreover, they found that “total distance covered in offense and defense,” as well as variations in passing, were the best identifiers of specific NBA roles (Balcunias, et al. 2015).

2.1.2 Research on Team Decision-Making and Performance in the NBA

Running parallel to the development of analysis on specific players in the NBA was the development of case studies examining team dynamics. For example, in a 2010 study, researchers Chad Cross and Masaru Teramoto hypothesized that how teams win games in the NBA differs between the regular season and the playoffs. The two took data from 1999 to 2009 (Cross and Teramoto 2010):

Specifically, we examined the contributions of overall efficiency. . .along with the Four Factors (effective field goal percentage, turnover percentage, [offensive] rebound percentage, and [free throw] rate) to winning games in the regular season and the playoffs, using a multiple linear regression and a logistic regression analysis.

The researchers found that, in the regular season, offensive and defensive efficiency were essential to winning, with the importance of defensive efficiency increasing in the playoffs. Moreover, they found that shooting efficiency is the most important factor in both settings (Cross and Teramoto 2010).

The amount of team-related research was thin over the next half-decade. However, in 2015, a study conducted by researchers Brian Skinner and Stephen Guy aimed to create a method that could instantly predict the success of random NBA line-ups based on individual talent and team interactions. Using “hand-recorded data from the 2011 playoff series between the Oklahoma City Thunder and the [opposing team],” they constructed a network-modeled view of NBA offenses (Guy and Skinner 2015). For nodes, they separated the court into unique, distinct regions, so as to identify specific areas in which offenses can move. Next, to describe player movement within and throughout the nodes, the researchers combined the ideas of NBA players acting as either “random agents” or “coached agents.” For modeling, they split the network into two parts: the areas above and below the free-throw line. Finally, the researchers introduced an inference algorithm that accounted for coaching intentions and original play design, so as to explain NBA networks contextually by accounting for the success rate of attempted plays, the skill sets of the players involved, and play designs. Results of the study found that players’ skill sets translated identically to any lineup, regardless of role (Guy and Skinner 2015).

2.2 Classification Techniques

While a fairly large collection of classification techniques exists, the subsequent sections discuss only the ones used in this report’s methodology.

2.2.1 Classification Trees

Classifying a data set by creating trees is one of the most popular classification techniques. When predicting a data set, classification trees compartmentalize the overall data set of interest into subsets, creating pathways with start and end nodes. Each node has branches that represent its attribute’s values. Each pathway is built by “partitioning the data into subsets that contain instances with similar value,” as determined by the homogeneity of the samples’ standard deviations. This is done by splitting up the data set by attributes, finding the reduction in standard deviation achieved by splitting on each attribute, and choosing the attribute with the largest reduction—the decision node. The process repeats until each pathway reaches its end, defined as the point when the coefficient of deviation of each branch falls below a previously defined threshold (Sayad 2019).
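The thesis builds its trees in Frontline’s Excel add-in; purely as an illustrative sketch, the same partitioning idea can be reproduced with scikit-learn’s DecisionTreeClassifier on synthetic data. The feature names (PTS, AST, TRB) and the rule generating the labels are invented stand-ins for per-game statistics, not values from the study.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-game stats; the label loosely marks an
# "All-NBA-like" season when scoring and playmaking are both high.
X = rng.normal(loc=[15.0, 4.0, 6.0], scale=[5.0, 2.0, 2.5], size=(300, 3))
y = ((X[:, 0] > 20) & (X[:, 1] > 5)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Print the learned splits (the decision nodes) in readable form.
print(export_text(tree, feature_names=["PTS", "AST", "TRB"]))
```

The printed output shows the successive partitions the tree chose, mirroring the decision-node selection described above.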

2.2.2 Logistic Regressions

Logistic regression is the most apt means of classification for binary data sets. Its purpose is to describe data by explaining “the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables” (Complete Dissertation 2019). To run it correctly, the data set should be cleansed of outliers via a multitude of methods, including standardization, normalization, etc. When its algorithm runs, a logistic regression models the “log odds” of the event being measured as a linear combination of the independent variables (Complete Dissertation 2019).

2.3 Ensemble Modeling

Ideally, data sets used for analysis have balanced majority and minority classes—in other words, the dependent variable is evenly distributed. However, in most cases, this is impossible, as real-world scenarios can be messy. When dealing with imbalanced data sets, simple classification models often fail to accurately create a framework due to the data’s skew. In these cases, combining multiple, weaker models can help fit the data set more accurately. This is the core principle of ensemble modeling. There are three main types of ensemble methods that can be applied to complement commonly used classification techniques (Nagpal 2017):

1. Bagging – creates random samples of the training data, classifies them, and averages out the classifiers to reduce variance

2. Boosting – sequentially classifies the data set, adding weight to missed observations to keep correcting for prior errors

a. Gradient Boosting combines gradient descent with boosting

3. Random Forest – a variation of bagging that also incorporates random selection of features

If desired, two tactics are available when trying to balance majority and minority samples in a data set: oversampling and undersampling. The former entails the artificial replication of minority-class samples, while the latter entails randomly clustering groups of samples together in order to find and subsequently erase clusters with the highest disparity of majority-to-minority samples (Young, et al. 2015). In this report specifically, the undersampled data set will be generated in a fashion similar to that of researchers Davis and Rahman (2013). In a variation on traditional K-means clustering, the researchers divided their data set (in a study on cardiovascular research) into sets composed purely of majority and minority samples. Then, they clustered the majority samples together and combined these clusters with the minority sample in order to reduce the gap in ratios, creating a combined data set of K clusters (Davis and Rahman 2013).
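As a toy illustration of the two balancing tactics, the sketch below applies simple random oversampling and undersampling (not the cluster-based variant of Davis and Rahman that the thesis adapts). The 321/49 split mirrors the study’s sample sizes; the feature values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(42)

def oversample(X, y):
    """Replicate random minority-class rows until classes are balanced."""
    minority, majority = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def undersample(X, y):
    """Keep a random majority-class subset the size of the minority class."""
    minority = np.flatnonzero(y == 1)
    keep = rng.choice(np.flatnonzero(y == 0), size=len(minority), replace=False)
    idx = np.concatenate([keep, minority])
    return X[idx], y[idx]

# 321 players, 49 of them All-NBA selections, as in the study's sample.
X = rng.normal(size=(321, 4))
y = np.zeros(321, dtype=int)
y[:49] = 1

X_over, y_over = oversample(X, y)
X_under, y_under = undersample(X, y)
print(y_over.mean(), y_under.mean())  # both 0.5 once balanced
```

Oversampling grows the data set (here to 544 rows) while undersampling shrinks it (to 98 rows); both end with an even class split.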

2.3.1 Aggregation of Models and Practical Applications of Methodology

Many real-world scenarios with binomial data sets are improved not only by using the aforementioned classification techniques and models, but by the construction of multiple models. Often, one model fits a specific data set better than another, which can only be discovered through the trial and error of building several. For example, in 2014, multiple sports researchers conducted a study that tried to predict the outcomes of each individual NCAA men’s basketball tournament game. After using a combination of logistic regressions, neural networks, ensemble models, gradient boosts, and log-losses, the researchers found that the logistic models, gradient-boosted models, and neural networks worked best (Borrn, et al. 2015).
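The trial-and-error comparison described above can be sketched as follows, again with scikit-learn and synthetic data as stand-ins (the thesis itself builds its models in Excel). One candidate per technique named in the methodology is scored by cross-validation.

```python
import numpy as np
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)  # synthetic nonlinear target

# One candidate per classification technique used in this report.
models = {
    "classification tree": DecisionTreeClassifier(max_depth=4),
    "logistic regression": LogisticRegression(),
    "bagging": BaggingClassifier(n_estimators=50),
    "boosting": GradientBoostingClassifier(),
    "random forest": RandomForestClassifier(n_estimators=100),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {acc:.3f}")
```

Whichever model scores best on held-out folds is the analogue of the “model of best fit” the thesis later probes with a sensitivity analysis.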

3 Methodology

In this section, a full description of the data set’s design process, its pre-processing, and the models developed from it is provided. All pre-processing, processing, and post-processing was conducted in Microsoft Excel using Frontline Solvers’ Analytic Data Mining for Excel extension pack (Frontline Solvers 2019).

3.1 Data Set Formulation

Play-style in the NBA is incredibly dynamic and cyclical. The league is apt to switch from one mode of play to another from season to season, making it imperative to define what kind of NBA needs measuring. To address this, data used in this study begin with the 2010-2011 NBA season and run through the 2017-2018 season. The 2010-2011 season was chosen as the start of the data’s time period because it was the first year of the LeBron James-Dwyane Wade-Chris Bosh trio on the Miami Heat. In forming the “super” team, the three changed the way the league played, forcing other teams to adapt in two ways. Offensively, that Heat roster, composed of three elite players with a multitude of role players filling out the edges, emphasized efficient shot selection by aiming for close shots and three-pointers. Defensively, the team employed a hyper-aggressive scheme that required all five players on the court to be both fast and able to switch defensive assignments. In response to the Heat, the league’s 29 other teams began fundamentally changing how they played to create stylistic matches, effectively creating the current “pace-and-space” era of basketball analyzed in this report.

Subjects of the report include all players who started at least 82 games between the 2010 and 2017 NBA seasons. This specific parameter was chosen for three reasons (NBA 2019):

1. No rookie has been named to one of the three All-NBA Teams since 2010.
2. 82 games represent a full season's worth of starts, ensuring that every player in the sample had the chance to earn a nomination after playing at the highest possible level of involvement: starting the equivalent of a full 82-game season.
3. Rarely, if ever, have bench players been nominated to the All-NBA Team.

With these limitations in place, 321 players qualified for the study, including the 49 players who made one of the three teams within the timeframe. The binary dependent variable for the study is, of course, whether the player being examined made one of the three All-NBA Teams from 2010 to 2017, encoded as either a "1" or a "0," with a "1" indicating a nomination. The ratio of majority to minority classes in the sample can be found below in Table 1.

Table 1: Sample Class Distribution

Subject Type    Count    Percent
1               49       15.26
0               272      84.74
Total           321      100.00


For the 272 players that did not make the All-NBA Team during the allotted time span, ESPN columnist John Hollinger's Player Efficiency Rating (PER) statistic was used to define their best season, in other words, the season in which the player was most likely to make one of the three teams. According to Hollinger, "The PER sums up all a player's positive accomplishments, subtracts the negative accomplishments, and returns a per-minute rating of a player's performance" (Greenberg 2017). While it is not a perfect encapsulation of a player's talent (it over-emphasizes statistics such as rebounds), it was necessary for the sake of comparison. The league-average PER is 15.00.
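The best-season selection step reduces to a maximum over each player's seasons. A minimal sketch, with invented player rows and PER values (the study pulled actual PER values from Basketball-Reference):

```python
# Hypothetical sketch: for each non-nominated player, pick the season with the
# highest PER as their "best season". Players and values here are invented.
seasons = [
    {"player": "A", "season": "2012-13", "PER": 16.4},
    {"player": "A", "season": "2013-14", "PER": 19.1},
    {"player": "B", "season": "2011-12", "PER": 14.8},
    {"player": "B", "season": "2012-13", "PER": 13.2},
]

best = {}
for row in seasons:
    cur = best.get(row["player"])
    if cur is None or row["PER"] > cur["PER"]:
        best[row["player"]] = row

print(best["A"]["season"])  # -> 2013-14, the higher-PER season for player A
```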

All performance metrics for the study were taken from Basketball-Reference's website. While the site contains thousands of personal performance metrics, specific sets were chosen for their perceived voting impact. They can be broken into eight distinct per-game camps, organized by time duration, with 184 variables in total. A full description of the inputs can be found below in Table 2.

Table 2: Input Variables

Type           Time window                                               Count
Traditional    Regular season of the first nomination / best PER season   26
Traditional    Career regular-season averages through that season         24
Traditional    Career playoff averages through the prior postseason       24
Traditional    Playoffs in the postseason prior to that season            26
Advanced       Regular season of the first nomination / best PER season   21
Advanced       Career regular-season averages through that season         21
Advanced       Career playoff averages through the prior postseason       21
Advanced       Playoffs in the postseason prior to that season            21
Total                                                                    184

All windows are defined relative to the season in which the given player either earned their first All-NBA nomination or registered their best season according to their PER within the period examined.

Data for each player was organized in this fashion to account for several underlying beliefs. Career averages for both regular seasons and postseasons were chosen to account for possible longevity biases: the more exposure a player has to voters, the more, or less, likely they may be to be chosen for one of the three All-NBA Teams. Averages during the season in which the player either made the All-NBA Team for the first time or registered their best PER within the study's time period were taken to account for recency bias, as were averages for the playoffs during the season prior. Finally, as previously stated, while many other player-tracking metrics exist (such as shooting splits, play-by-play logs, etc.), the most standard traditional and advanced metrics were chosen for both their truly summative qualities and ease of comparison. All variables were indexed against the maximum value registered for each metric across the respective players, with a 1.00 representing the top score on that metric.
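One reading of this indexing step can be sketched as follows. The sketch assumes each metric (column) is divided by the maximum value observed for that metric, so that the top score becomes 1.00; the metric names and values are invented.

```python
# Sketch of max-indexing, assuming column-wise division by each metric's
# maximum observed value (invented example values).
data = {
    "ppg": [28.0, 14.0, 7.0],   # points per game for three stand-in players
    "apg": [10.0,  5.0, 2.5],   # assists per game
}

indexed = {
    metric: [v / max(values) for v in values]
    for metric, values in data.items()
}
print(indexed["ppg"])  # -> [1.0, 0.5, 0.25]
```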

3.2 Pre-Analysis Insights

Table 3: All Other Players PER Measures

                 PER/1st Nom/BST    PER/PBFR    PER/G       PER/1st Nom/BST
                 > PER/PBFR         > PER/PC    > PER/PC    > PER/G
No               35                 162         1           177
Yes              237                110         271         95
Total            272                272         272         272

Table 4: All-NBA Team Players Only PER Measures

                 PER/1st Nom/BST    PER/PBFR    PER/G       PER/1st Nom/BST
                 > PER/PBFR         > PER/PC    > PER/PC    > PER/G
No               0                  18          10          27
Yes              49                 31          39          22
Total            49                 49          49          49

Table 3 and Table 4 (see above) illustrate various dynamics associated with the sample subjects' PERs. In both tables, it is apparent that the PERs during the players' best seasons were almost universally higher than their PERs during the playoffs before. This is partially due to the fact that many players in the sample did not even make the playoffs the year before their best PER season, but it also speaks to the extra challenges the playoffs present. Not only are most of the subjects' single-season record PERs better than their PERs during the previous year's playoffs, but their career-average regular season PERs are also higher than their career-average playoff PERs, most likely for the same two reasons.

The final similarity the two tables share is the uniformity of the subjects' best single-season PERs being lower than their career-average PERs, indicative of the large number of older subjects who peaked before the timespan of the study. Finally, and perhaps most interestingly, there is one disparity between the two tables: while the All-NBA Team group has a majority of players with higher pre-best-season playoff PERs than career-average playoff PERs, the rest of the subjects show the opposite, possibly indicating the outsized impact the previous season's playoffs have on a given year's voting patterns.

Table 5: Descriptive Statistics of All Players

                      Max     Min      Mean    Median   Mode    Std. Dev.
Max                   1.00    0.11     0.63    0.68     1.00    0.37
Mean                  0.97    -0.29    0.31    0.23     0.15    0.28
Min                   0.88    -1.80    0.14    0.00     0.00    0.22
Standard Deviation    0.06    0.26     0.09    0.16     0.31    0.03

Above in Table 5 are multiple statistics that describe the parameters of the original data set by player (i.e., column one, row one describes the average maximum value of each player's best attribute across all input variables). The descriptors in the table bring two particularly interesting things to light. First, the averages of all mean and median values are within 0.08 of each other, which, in a data set with a total range of 2.80 between absolute minimum and maximum values and an average mode of 0.15, indicates a fairly normally distributed data set. Second, the average mode of 0.15 itself points to the overall tendency of players to register lower attribute scores across all inputs relative to the maximum attribute value; this makes sense, as the majority class is composed of players with lower skill levels.

3.3 Pre-Processing

One of the main theoretical goals of this report is to test the validity of multiple classification models against each other in order to find which one best fits the data. Therefore, a control group was created to standardize the results of each model.

Before constructing models, the original data set of 321 players was first randomly partitioned. Each of the 184 on-court variables was used as an input for the partition, with the dependent variable, whether the subject made the All-NBA Team between 2010 and 2017, serving as the output. The data was randomly split, with 80 percent (255 records) used for training and 20 percent (66 records) held out as a control group for scoring after construction of the models.
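The partition step can be sketched with scikit-learn on stand-in data; the study itself performed this step in Frontline Solvers' Excel add-in, and the random labels below are invented placeholders for the All-NBA indicator.

```python
# Sketch of the 80/20 random partition on stand-in data (321 players x 184
# inputs); labels are random placeholders, not real All-NBA outcomes.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(321, 184))             # 321 players x 184 on-court inputs
y = (rng.random(321) < 0.15).astype(int)    # ~15% minority class, as in Table 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
print(len(X_train), len(X_test))
```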


3.4 Model Creation

Below in Figure 1 is the workflow process breakdown for this research's methodology:

[Figure 1 is a flow diagram: the original data set, a balanced oversampled set, and a balanced undersampled set each receive an 80 percent training partition (with a 20 percent test set held out from the original data). Each partition feeds ensemble methods (bagging and boosting, each producing a scored classification tree and a scored stepwise logistic regression, plus a random forest producing a scored classification tree) and non-ensemble methods (a scored classification tree and a scored stepwise logistic regression). The oversampled and undersampled sets follow the same path as the original 80 percent partition.]

Figure 1: Methodology Workflow Process Breakdown

For the sake of comparison, model construction was split into three camps: working with the partitioned imbalanced training set and with its oversampled and undersampled offshoots. The oversampled data set was generated by random seeding and produced a training set with a 50-50 majority-minority split, with the total number of records coming out to 166. The undersampled data set was created manually. First, a cutoff of 40 percent was determined to be sufficient for the proportion of majority records in the undersampled data set. From there, the K-Means clustering method discussed in the literature review was used to create a training set with a 40-60 split between the majority and minority samples. The total number of records in this set came out to 89.
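The two resampling ideas can be sketched on a toy labeled set. The sketch replaces the study's cluster-based selection of majority records with simple random choice, so it illustrates the class-balance arithmetic only:

```python
# Toy sketch of oversampling to a 50-50 split and undersampling toward a
# roughly 40-60 majority-minority split (random choice stands in for the
# cluster-based selection used in the study).
import random

random.seed(0)
majority = [("maj", 0)] * 20   # stand-in majority-class records
minority = [("min", 1)] * 4    # stand-in minority-class records

# Oversample: draw minority records with replacement until classes are even.
oversampled = majority + random.choices(minority, k=len(majority))

# Undersample: keep a random majority subset sized for a ~40:60 ratio
# against the untouched minority class.
target_majority = round(0.4 / 0.6 * len(minority))
undersampled = random.sample(majority, target_majority) + minority
print(len(oversampled), len(undersampled))
```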

For each data set, classification models were then split again according to whether they were created using ensemble or non-ensemble techniques. The non-ensemble models constructed for each were simple logistic regressions and classification trees. For the ensemble models, a random forest, a boosted model, and a bagged model were constructed for each of the three data sets. For each boosted and bagged model, the classified data sets were again run through logistic regressions and classification trees. However, for both the undersampled and oversampled data sets, additional work had to be done before the logistic regressions could be applied: because of the landscape nature of the two sets (far more attributes than records), the attribute count had to be reduced before Frontline Solvers' software could run. To reduce each model's set of attributes, attributes were eliminated in order of lowest variance until the regression could be applied.

Each ensemble and non-ensemble model constructed utilized adjusted normalization to standardize the data with a 0.01 correction. For all models, the Success Probability was set to 0.5 and the Prior Probability Calculation was set empirically.

For each data set's logistic regression models, 50 iterations were run. Stepwise Selection was the method of choice, with the F-Statistic (In) and F-Statistic (Out) thresholds set to 3.84 and 2.71, respectively. For each data set's classification tree, the maximum numbers of levels, splits, and nodes were set to 10, 50, and 20, respectively. For scoring, a fully-grown tree was created that displayed up to seven levels in addition to feature importance. For all bagged and boosted models, 25 weak learners were used for processing. Finally, all boosted models were generated according to the M1_Freund method (Freund and Schapire 1996).
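The tree limits above can be approximated in scikit-learn terms as a rough sketch: `max_depth` caps the levels and `max_leaf_nodes` caps the terminal nodes, while the 50-split cap has no direct scikit-learn equivalent and is omitted here.

```python
# Approximate sketch of the study's tree limits (10 levels, 20 nodes) using
# scikit-learn settings on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=255, n_features=30, random_state=1)

tree = DecisionTreeClassifier(max_depth=10, max_leaf_nodes=20, random_state=1)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())  # both within the caps
```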

4 Results

The results of all models were scored against the holdout data, in other words, the random 20 percent set aside from the original data set. The charts detailing each model's kappa statistic, accuracy rate, true positive rate, true negative rate, and precision rate can be found below in Figures 2-6.
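The five scores in Figures 2-6 can all be computed from a model's confusion matrix. A short sketch with an invented confusion matrix (the counts are illustrative, not the study's results):

```python
# The five reported scores, computed from an invented confusion matrix.
tp, fp, fn, tn = 12, 0, 3, 51   # invented counts for illustration

n = tp + fp + fn + tn
accuracy  = (tp + tn) / n
tpr       = tp / (tp + fn)      # true positive rate (sensitivity/recall)
tnr       = tn / (tn + fp)      # true negative rate (specificity)
precision = tp / (tp + fp)      # share of predicted positives that are real

# Cohen's kappa: observed agreement corrected for chance agreement.
p_exp = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
kappa = (accuracy - p_exp) / (1 - p_exp)
print(round(accuracy, 3), round(kappa, 3))
```

Because kappa discounts agreement expected by chance, it is a more honest score than raw accuracy on an imbalanced set like this one, where always predicting the majority class already yields high accuracy.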

[Figure 2 is a bar chart of the kappa statistic for each of the 21 models; values range from 4 percent to 83 percent.]

Figure 2: Kappa Statistic by Model

[Figure 3 is a bar chart of accuracy rate by model; values range from 55 percent to 95 percent.]

Figure 3: Accuracy Rate by Model

[Figure 4 is a bar chart of true positive rate by model; values range from 42 percent to 92 percent.]

Figure 4: True Positive Rate by Model

[Figure 5 is a bar chart of true negative rate by model; values range from 52 percent to 100 percent.]

Figure 5: True Negative Rate by Model

[Figure 6 is a bar chart of precision rate by model; values range from 20 percent to 100 percent. The models charted include the imbalanced, undersampled, and oversampled variants of the stepwise logistic regression, bagged classification tree, bagged logistic regression, random forest classification tree, boosted classification tree, boosted logistic regression, and standalone classification tree.]

Figure 6: Precision Rate by Model

After looking at each model's measurements across the five statistics, it appears that the bagged classification tree run on the imbalanced training set models the scenario best. It has a 95.5 percent accuracy rate and a 4.5 percent error rate, the highest and lowest respective rates of all 21 models. Moreover, its true negative rate was a perfect 100 percent, equivalent to a false positive rate of zero, meaning that it did not falsely predict any All-NBA Team nominations. Perfection in this category is especially pertinent in practice, as the model would not lead teams to falsely bet on players wrongly believed to have promising careers. Finally, its precision rate of 100 percent indicates that every player the model flagged as a nominee was in fact nominated to the All-NBA Team.

When examining the 21 models across all five rates, a few trends are apparent. First, almost all of the plain logistic regressions, across all three data sets, modeled the data poorly. This is especially evident in the oversampled and undersampled models, as their logistic regressions are the two worst models generated. When comparing the three groups of seven models by the data set they were run on, the models run on the original imbalanced data set fit best; this may be due to the inaccuracy inherent in running models on the small number of samples left in the undersampled and oversampled data sets. Finally, it is evident that, no matter the data set, classification trees garnered the best results.

After running a sensitivity analysis on the bagged classification tree model, it appears that the most sensitive attribute, with a standard deviation of 0.25, is the assists per game a player registers during the regular season in which they either received their first nomination or registered their best PER (hereafter, their best season). The next most sensitive attribute is the player's free throw attempts per game during that best season. Rounding out the top five are the total Win Shares a player earns during their best season, the three-point field goals the player makes per game during their best season, and the offensive rebounds the player grabs per game during their best season.
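One common way to run this kind of sensitivity analysis is permutation importance: shuffle one input at a time and measure how much the model's score degrades. The Excel add-in used in the study may compute sensitivity differently, so the sketch below is illustrative only, on synthetic data:

```python
# Illustrative permutation-based sensitivity check on a bagged tree model,
# using synthetic stand-in data (not the study's actual procedure or data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=255, n_features=10, n_informative=4,
                           random_state=0)
model = BaggingClassifier(n_estimators=25, random_state=0).fit(X, y)

# Shuffle each feature 5 times; larger mean accuracy drop = more sensitive.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
most_sensitive = int(np.argmax(result.importances_mean))
print("most sensitive input index:", most_sensitive)
```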

The distribution of the aforementioned attributes’ standard deviations and sensitivities, as well as the highlighted sensitivity of the most sensitive attribute, can be found below in Figure 7, Figure 8, and Figure 9.

[Figure 7 is a bar chart of the standard deviation of each attribute in the bagged classification tree, with values ranging from 0.00 to roughly 0.25.]

Figure 7: Imbalanced Data Set – Standard Deviation of Attributes of the Bagged Classification Tree

[Figure 8 is a bar chart of sensitivity (0 to 50 percent) for each attribute of the bagged classification tree, led by AST/1st Nom/BS, FTA/1st Nom/BS, WS/1st Nom/BST, 3P/1st Nom/BS, and TRB/1st Nom/BS, and including attributes such as ORB/PLBFR, 2P/1st Nom/BS, DRB/1st Nom/BS, VORP/1st Nom/BST, TOV/1st Nom/BS, FG/1st Nom/BS, PER/1st Nom/BST, OWS/1st Nom/BST, PTS/1st Nom/BS, and others.]

Figure 8: Imbalanced Data Set – Bagged Classification Tree Sensitivity Analysis by Attribute

[Figure 9 charts the sensitivity of AST/1st Nom/BS on a 0-100 percent scale: its minimum and mean sensitivities are 0 percent and its maximum is 44 percent.]

Figure 9: Sensitivity of AST/1st Nom/BS

5 Conclusion

The NBA is an ever-changing, ever-growing league. Since the turn of the decade, both its popularity and the breadth of quantitative research surrounding it have exploded, making sound decision-making within the league more important than ever. While available descriptive research is plentiful, as of 2019, the amount of predictive research conducted is minuscule.

This research was conducted in the hopes of both exploiting this lack of research and better informing NBA front offices about how to build successful rosters. The main practical purpose of the study was to discover which on-court attributes were most likely to predict a player being nominated to one of the three All-NBA Teams, a nomination that carries both on-court impact and off-court salary implications for all players. Theoretically, this research was conducted for the sake of finding the best existing classification model by comparing the validity of multiple models against one another, in order to find the best framework for predicting future All-NBA Team nominations.

For comparison, 21 models were created: seven were run on the original imbalanced data set, seven on an oversampled set, and seven on an undersampled set. All models were then scored against a random control group that encapsulated 66 records from the initial data set. In short, the best classification model for this specific scenario was found to be the bagged classification tree run on the imbalanced training data. This makes sense, as a multi-layered model with multiple classifiers would be the most instructive on an imbalanced data set. Moreover, the most sensitive attributes (listed above) make sense.

The most sensitive attribute, assists per game during the regular season in which the player either received their first nomination or registered their best PER, could make a pronounced impact on All-NBA Team voters, as players with high per-game assist numbers may be viewed as selfless and worthy of a vote. Furthermore, per-game free throw attempts and per-game total rebounds both highlight a player's willingness to hustle and to look for contact, traits traditionally coded as masculine that may be admirable in the eyes of voters. The sensitivity of Win Shares is also interesting and explainable, as the statistic encapsulates the lump-sum contributions of a player towards winning, both perceivable and not. Finally, the sensitivity of per-game three-point field goals makes sense given the way the NBA plays in 2019: if a player makes three-pointers, he is more valuable and thus more likely to earn a nomination.

Though this research is a step in the right direction, it comes with many parameters and is far from perfect. For one, the design of the original data set and its concurrent models was more landscape than portrait, in that it had many attributes yet few subjects. This design proved difficult during analysis, as a portrait structure (many samples relative to fewer attributes) is preferred. Future studies could correct this shortcoming by including more subjects for analysis, either by lowering the standards of admittance or by lengthening the given time period. Moreover, the set of on-court statistics chosen for the study was extensive but incomplete: totals, per-36-minute statistics, per-100-possession statistics, shot location statistics, play-by-play statistics, game highs, All-Star game statistics, and Similarity Scores are all measurement points listed on Basketball-Reference's website, yet all were left out. Maybe even more important than the types of statistics excluded is the study's disposition towards time: while career averages were included, players who earned All-NBA Team nominations before 2010 were still treated as if those nominations had never happened, thus ignoring the possible effect of repeat voting. Not only were multiple on-court, player-specific metrics left out of the research, but multiple team statistics, such as team success and franchise value, were omitted too. If the full breadth of individual and team-oriented statistics were included, more holistic and informative classification models could be constructed.

Methodology-wise, the full available catalog of classification models was not used, as discriminant analysis, neural networks, k-Nearest Neighbors, and Naïve Bayes methods were not employed. This presents a huge opportunity for future research: if subsequent studies add each of the aforementioned classification models, the possibility of finding an even better fit for the scenario would make the results even more usable. Finally, the omission of various off-court inputs, such as social media following, city data, and endorsement deals, made the analysis much more black-and-white than the reality it models. All-NBA Team voting is most certainly impacted by off-court intangibles such as these, and future models that incorporate them would be truly honest.

If future studies corrected the aforementioned ills, a truly consummate predictive model could be constructed. Even so, this research is absolutely a step in the right direction. Practically, it proposes a set of on-court metrics that were quantitatively shown to be quite sensitive when it comes to All-NBA Team voting. By demonstrating this empirically, in spite of the multitude of parameters listed, the study gives NBA front offices an informative framework that clearly highlights which players would maximize both on-court success and salary cap negotiations. Theoretically, this study expands the already-growing area of quantitative NBA research by utilizing both predictive models and a mixture-of-models approach. While most of the existing literature at the time of this report is purely descriptive, hopefully this study will encourage future researchers to be more exploratory and predictive in their methodologies.

6 Reflection

While my thesis is no more or less demanding than others in the literal sense, its completion has been a long, strenuous journey. Though I began more than a calendar year ago, in September of 2017, it will never feel truly complete. From changing advisors, to changing methodology, to changing (and fighting) software multiple times, I have certainly not made it any easier on myself. In this reflective portion of the report, I will detail my journey towards publication.

6.1 Initial Project Formulation

In September of 2017, I began brainstorming possible topics for my thesis. In the brainstorming process, it was my goal to try to find a topic at the intersection of my personal interests, my current academic fields, and my aspiring professional career path.

6.1.1 Personal Interests

Sports are and have been a huge passion of mine ever since I was born. I love not only the literal entertainment value they bring, but also the broader implications and consequences sports entail. While I am a fan of sports in general, basketball has always been my favorite for a few reasons. For one, it has a dynamic playstyle that makes every single game watchable regardless of each game's relative importance. For two, the visibility inherent in the sport is incredibly appealing: in all professional leagues, players' faces are exposed, and, in most, seasons are long. For three, in the NBA specifically, ever since the league's inception, players have been at the forefront of political conversation and change. Finally, basketball has always been right behind baseball when it comes to accepting analytics and year-over-year playstyle changes, thus making it an incredible historical case study.

After considering all four of these factors, I decided that NBA basketball would be the basis of my study.

6.1.2 Academic Field

While my degree upon graduation will indicate that I pursued a Bachelor's in Business Administration, my academic career at Ohio University has revolved around a specific field within business: analytics.

At Ohio University, I have taken analytics classes in business intelligence and information management, descriptive analytics, predictive analytics, and prescriptive analytics. Throughout my time exploring each of these analytics-related fields, I have come to understand the value they instill in both the academic and professional communities. While qualitative studies are excellent anecdotally, quantitative studies and measurements may be even more powerful. In a society with more available data than ever, the ability to accurately analyze and convey it can change the world. Nowhere is this data influx more visible than in sports. Though a relatively new field of research, sports analytics is booming, especially in the NBA. Front offices are increasingly taking advantage of player tracking data to make more informed decisions, thus changing the landscape of the league forever. Whereas, in the past, academics conducting studies on NBA analytics would be shunted aside, personnel employees are becoming more and more receptive. If a researcher conducts a quantitative study on the NBA now, it has a much higher likelihood of being utilized by a front office employee. Realizing this, I thought it only natural to pursue an analytics-based project.

6.1.3 Professional Pursuits

Though I went into my undergraduate career knowing only that I wanted to pursue a career in business, my time at Ohio University has directed me to one place: an NBA front office. I have always wanted to work in both a field and specific position that value innovation, efficiency, and maximum impact—NBA front offices value all three. If I do end up in an NBA front office, I wish to continue conducting research just like this, in the hopes of advancing both the field of sports analytics and the team I am on.

6.1.4 Initial Project

After deciding on my thesis' field and vessel, I then had to choose what specific hypothesis to explore. In late 2017, most of the existing literature surrounding NBA analytics tended to focus on two areas: how to maximize team revenue via off-court measures, and how to more accurately describe on-court performance. While each of these two fields is incredibly instructive and useful for both academics and NBA front offices, they are somewhat saturated. Seeing this, I wanted to exploit an area that lacked existing research.

After examining the array of reports currently available for public use, I found something incredibly compelling: descriptive modeling that linked the implications of current on-court performance to overall roster construction and overall talent. While many on-court descriptive models existed that described how and why players and teams operated the way they did, few models existed that could describe what attributes, if possessed by a player, indicated overall success. If more were created, their frameworks would not only be more robust and usable long-term but would significantly add to the NBA analytics research repertoire.

In light of all of this, I decided that my initial hypothesis was going to revolve around a somewhat-definitive player performance model. I wanted to describe not how and why players succeeded in certain situations, but what it was about them and their situations that made them valuable assets. To measure a player's overall value, I first had to create a dependent variable that described what value meant for an NBA player: whether or not they made the All-NBA Team. After this, I had to define my parameters for both my subjects and for the attributes used to measure them. While the on-court parameters and attributes remain the same in my current iteration, I initially aimed to encapsulate multiple off-court measurements in my study, including:

• Players' physical measurements, taken at their NBA Draft Combine
• Their team and their team's franchise value, number of championships, and conference
• Their college status, draft position, and citizenship
• Their jersey number, age, and the number of franchises they had played for at the time of their best season
• Their average salary, peak salary, and contract situation at the time of their best season
• Their Olympic history
• Census data of their team's host city at the time of their best season

I included all of these attributes due to their perceived impact on All-NBA Team voting. I assumed that physical measurements played a part in that they allow for quick "eye test" analysis that could be favorable or unfavorable depending on whether a player was below or above average across a variety of factors, including height, weight, and shooting hand. I included each player's team and team-adjacent information due to the perceived impacts of conference skew and team history. College status, draft position, and citizenship were included to account for possible college-centric bias (whether the player was already favored before college or had to spend a prolonged amount of time there), draft position bias (the higher the pick, the greater the expectations), and U.S.-centric bias. Jersey number, age, and franchises played for were included due to jersey implications, age bias, and visibility throughout the league, respectively. Salary information was included to account for the expectations that come with a higher or lower salary. Olympic history was included to account for possible international exposure bias. Finally, census data was included to account for the possible ramifications of market size.

All of the aforementioned data was collected manually. Over a period spanning multiple months, I collected data from:

• Basketball-Reference.com
• The U.S. Census Bureau
• Forbes

All data was copied and pasted into a master Excel spreadsheet. In total, I had compiled 73,199 unique records, or cells, for analysis.

After compiling my master list, my first thesis advisor, Dr. Norman O'Reilly, and I began constructing descriptive models. The overall hypothesis of the report was to run a descriptive model on my 321-player sample that aimed to describe why the players that were selected to one of the All-NBA Teams were selected, and why the others were not. In addition to the overall hypothesis, we constructed four sub-questions (with means of analysis listed):

1. What is the platonic ideal of each position? In other words, what unique mixture of attributes listed in the data most accurately correlated to an All-NBA Team vote?
2. Which set of attributes correlated most strongly to an All-NBA Team vote: advanced metrics or traditional metrics?
3. Does regular season or playoff performance have a more pronounced impact on All-NBA Team voting?
4. How could one produce the "perfect roster"? In other words, how could one get five players on the same team at the same time, all having career years?

6.2 Deviation, New Project Formulation

After formulating my initial thesis' hypothesis and adjacent questions, insurmountable issues arose. For one, I realized that the scope of my research was so broad and so unfounded that it was very unlikely any meaningful outcomes could be reached. While the on-court metrics fully encapsulated everything that could and should be measured pertaining to my dependent variable, the off-court measurements did not. For example, census data for players on the Toronto Raptors could not be found and perfectly matched with that of their American counterparts. Moreover, for players that were traded during their best season, no census or team data could be tracked. With these loads of missing records, the models were sure to be lacking in quality.

More sinister than these issues, however, was my general lack of aim. If I wished to be published in a leading sports analytics journal, my report had to be tightened up. By including specific off-court metrics, I would then have had to defend my thesis against countless other omitted off-court metrics, from social media numbers to family and more.

Furthermore, the four sub-questions muddled the overall purpose and significance of my report, as I obscured and deviated from my main point of interest.

Finally, the issue of advisor relations became too much to overcome. While Dr. O’Reilly was still at Ohio University at the start of my thesis, he took a new position at a Canadian university in the fall of 2018. Though we continued trying to work together through this period, the impossibility of frequent, face-to-face meetings made it necessary to switch advisors.

In January of 2019, I officially switched advisors, choosing Dr. William Young of the Ohio University Analytics and Information Systems department. In choosing Dr. Young, I gained yet another incredibly knowledgeable sports analytics researcher and, more importantly, an advisor who could meet face-to-face weekly.

After an initial meeting with Dr. Young, I reworked my thesis topic in three key ways:

1. I decided to use only my already-existing on-court metrics with my already-existing sample group, so as to tighten the results of the analysis and the overall research scope.

2. I decided to make my initial hypothesis predictive: by examining the impact of each attribute on All-NBA Team voting outcomes, I would create a framework that could be used to predict future players’ fates, yielding an even more robust model.

3. I eliminated the sub-questions entirely, so as not to distract from the report’s main idea.

After switching both ideas and advisors, my thesis officially had a clear end point. Now I had a clear idea and an advisor I could meet with face-to-face on a weekly basis.

6.3 Learning Outcomes

Throughout my year-and-a-half thesis journey, I learned a multitude of academic and personal lessons. The main takeaways of each are below.

6.3.1 Academic

While I have always been an excellent student, delving into a true research project was something of a shock. In the past, writing essays involved little to no planning, little to no organization, and little to no interdependence. I would decide on a topic, write about it in solitude, and forget it the second a first draft was written.

This thesis turned every previous mode of operation I had internalized upside down. Time and again while collecting data, analyzing data, and writing, I moved too quickly, only to fail even more quickly. Whether it was typing in data incorrectly, running the wrong analysis, or writing about something completely irrelevant, I learned that planning should take precedence over doing. After slowing down and writing out frameworks for each of the three aforementioned areas, I was able to move through my plan with ease and without error, though the initial process was intensive and slow.

Second to learning the virtues of planning was my realization of the need to defend everything I write. With my (hopefully) publishable manuscript and my attached scholarly essay, I realized that I could not write anything unsubstantiated. Whereas, in the past, my essays and reports fell on professors’ ears with little to no consequence, the public and serious nature of these reports mandated that they include only information that would hold up in true academic settings. No longer will I write without a reason.

6.3.2 Personal

As an incoming salesman, I thought that I understood the necessities of expectations, transparency, and accountability; however, this journey proved me wrong on numerous occasions.

While working with both Dr. O’Reilly and Dr. Young, I learned how to hold both myself and my associates truly accountable. As the writer and conductor of my thesis, I came to understand that it was my responsibility to set proper guidelines for both myself and my advisors. Too often, I put forth loose deadlines and loose demands, which led to nothing but confusion. In doing so, I not only slowed the completion of the thesis but hurt my own social and professional capital with my advisors. Upon completion of my research, I now understand how important it is to set SMART (specific, measurable, achievable, relevant, and time-bound) goals.

Not only did I learn to hold myself and my associates accountable in conducting this research, but I also learned the value of being open and clear. Many times, between deliverables and assigned work, I would communicate that I had either already finished something or was working on it when I was not. Though these were white lies, they vastly inhibited my ability to communicate effectively with my advisors; had I been honest, I might have received more pertinent help than I actually did. Moreover, I tended to wait to reach out for help until it was absolutely necessary, raising my stress, tightening my timeline, and instilling false confidence. After conducting this research, I now know the importance of reaching out frequently and clearly.

Finally, my thesis journey taught me how honest I must be with myself and my associates. Personally, I let too many things slide without any guilt or self-reflection. This, again, led to a false sense of security that only masked what really needed to get done. With my advisors, my initial lack of candor was even more apparent. When working with Dr. O’Reilly (who was wonderful to work with; I only left due to the aforementioned geographical constraints and my own inability to formulate a proper research scope), I frequently allowed days, and sometimes weeks, of no communication while he was in Canada. Through no fault of his own, I decided to keep working with him out of fear of alienating him when, in reality, parting ways would have been best for both parties. Moreover, in working with both him and Dr. Young, I often convinced myself that I understood the various analytical processes I was employing when, in fact, I did not.

Now and forevermore, I will make sure to be honest with both myself and my associates so as to foster more mutually beneficial personal and academic relationships.


Bibliography

Adams, Jonathan. 2017. NBA Free Agency 2017: What Is a Supermax Contract? July 1. Accessed February 10, 2019. https://heavy.com/sports/2017/07/nba-supermax-contract-deal-salary-how-much-designated-player/.

Aldridge, David. 2016. NBA, NBPA reach tentative seven-year CBA agreement. December 14. Accessed February 12, 2019. https://www.nba.com/article/2016/12/14/nba-and-nbpa-reach-tentative-labor-deal.

Balciunas, Mindaugas, Julio Calleja-Gonzalez, Tim McGarry, Sergio Saiz, Jaime Sampaio, and Xavi Schelling. 2015. "Exploring Game Performance in the National Basketball Association Using Player Tracking Data." PLOS ONE.

Barnes, Christopher, and Eric Uhlmann. 2016. "Selfish Play Increases during High-Stakes NBA Games and is Rewarded with More Lucrative Contracts." PLOS ONE.

Birnbaum, Phil. 2019. A Guide To Sabermetric Research. Accessed February 11, 2019. https://sabr.org/sabermetrics.

Bornn, Luke, Peter Bull, Dmitri Illushin, Aaron Kaufman, Anthony Liu, Andrew Reece, Sherrie Wang, Alec Yeh, and Lo-Hua Yuan. 2015. "A mixture-of-modelers approach to forecasting NCAA tournament outcomes." Journal of Quantitative Analysis in Sports 13-27.

Complete Dissertation. 2019. What is Logistic Regression. Accessed February 15, 2019. https://www.statisticssolutions.com/what-is-logistic-regression/.

Cross, Chad, and Masaru Teramoto. 2010. "Relative Importance of Performance Factors in Winning NBA Games in Regular Season versus Playoffs." Journal of Quantitative Analysis in Sports 2-2.

Davis, Darryl, and Mostafizur Rahman. 2013. "Cluster Based Under-Sampling for Unbalanced Cardiovascular Data." Proceedings of the World Congress on Engineering.

Frontline Solvers. 2019. Analytic Solver Data Mining Add-In For Excel (Formerly XLMiner). Accessed January 11, 2019. https://www.solver.com/xlminer-data-mining.

Freund, Yoav, and Robert Schapire. 1996. "Experiments with a New Boosting Algorithm." Machine Learning: Proceedings of the Thirteenth International Conference 1-9.


Goldman, Matt, and Justin Rao. 2011. Allocative and Dynamic Efficiency in NBA Decision Making. Accessed March 2, 2019. http://www.sloansportsconference.com/wp-content/uploads/2011/08/Allocation-and-Dynamic-Efficiency-in-NBA-Decision-Making1.pdf.

—. 2012. Effort vs. Concentration: The Asymmetric Impact of Pressure on NBA Performance. March. Accessed January 17, 2019. https://pdfs.semanticscholar.org/35be/724131d8f47ba2f5835020717b1d6a9ad1cd.pdf.

Goldsberry, Kirk. 2012. CourtVision: New Visual and Spatial Analytics for the NBA. Accessed March 1, 2019. http://www.sloansportsconference.com/wp-content/uploads/2012/02/Goldsberry_Sloan_Submission.pdf.

Goldsberry, Kirk, and Eric Weiss. 2013. The Dwight Effect: A New Ensemble of Interior Defense Analytics for the NBA. Accessed February 20, 2019. http://www.sloansportsconference.com/wp-content/uploads/2013/The%20Dwight%20Effect%20A%20New%20Ensemble%20of%20Interior%20Defense%20Analytics%20for%20the%20NBA.pdf.

Greenberg, Neil. 2017. What is Player Efficiency Rating? April 13. Accessed March 2, 2019. https://www.washingtonpost.com/what-is-player-efficiency-rating/37939879-1c08-4cfa-aff3-51c2a2ae060e_note.html?noredirect=on&utm_term=.923c116f0419.

Guy, Stephen, and Brian Skinner. 2015. "A Method for Using Player Tracking Data in Basketball to Learn Player Skills and Predict Team Performance." PLOS ONE.

Hollinger, John. 2011. What is PER? August 8. Accessed January 17, 2019. http://www.espn.com/nba/columns/story?columnist=hollinger_john&id=2850240.

HoopsHype. 2018. 2017/18 NBA Salaries. Accessed January 6, 2019. https://hoopshype.com/salaries/2017-2018/.

Journal of Quantitative Analysis in Sports. 2019. Journal of Quantitative Analysis in Sports. Accessed December 15, 2018. https://www.degruyter.com/view/j/jqas.

Kopf, Dan. 2017. Data analytics have made the NBA unrecognizable. October 18. Accessed January 11, 2019. https://qz.com/1104922/data-analytics-have-revolutionized-the-nba/.

Nagpal, Anuja. 2017. Decision Tree Ensembles- Bagging and Boosting: Random Forest and Gradient Boosting. October 17. Accessed December 19, 2018. https://towardsdatascience.com/decision-tree-ensembles-bagging-and-boosting-266a8ba60fd9.


Nath, Trevir. 2018. The NBA's Business Model. June 22. Accessed February 18, 2019. https://www.investopedia.com/articles/investing/070715/nbas-business-model.asp.

NBA. 2006. MVP Nash Highlights All-NBA First Team. May 17. Accessed March 3, 2019. http://www.nba.com/news/AllNBA_060517.html.

—. 2019. Year-by-year list of All-NBA Teams. Accessed March 3, 2019. https://www.nba.com/history/awards/all-nba-team.

Sayad, Saed. 2019. Decision Tree - Regression. Accessed December 18, 2018. https://www.saedsayad.com/decision_tree_reg.htm.

Sill, Joseph. 2010. Improved NBA Adjusted +/- Using Regularization and Out-of-Sample Testing. March 6. Accessed February 10, 2019. http://www.sloansportsconference.com/wp-content/uploads/2015/09/joeSillSloanSportsPaperWithLogo.pdf.

Spotrac. 2019. Stephen Curry. Accessed March 16, 2019. https://www.spotrac.com/nba/golden-state-warriors/stephen-curry-6287/.

Young, William, Scott Nykl, Gary Weckman, and David Chelberg. 2015. "Using Voronoi diagrams to improve classification performances when modeling imbalanced datasets." Neural Computing and Applications 1041-1054.