CALIFORNIA STATE UNIVERSITY
Channel Islands

The Importance of NBA Box Score Statistics and the Value of Statistical Outbursts

A Thesis Presented to
The Faculty of the Computer Science Department

In (Partial) Fulfillment of the Requirements for the Degree
Masters of Science in Computer Science

by

Student Name: Jack BJ Bension
Advisor: Dr. Michael Soltys

November 2019

© 2019 Jack BJ Bension
ALL RIGHTS RESERVED

APPROVED FOR MS IN COMPUTER SCIENCE
Advisor: Dr. Michael Soltys    Date
Dr. Jason Isaacs    Date
Dr. Brian Thoms    Date

APPROVED FOR THE UNIVERSITY
Dr. Osman Ozturgut    Date

Non-Exclusive Distribution License

In order for California State University Channel Islands (CSUCI) to reproduce, translate and distribute your submission worldwide through the CSUCI Institutional Repository, your agreement to the following terms is necessary. The author(s) retain any copyright currently on the item as well as the ability to submit the item to publishers or other repositories.

By signing and submitting this license, you (the author(s) or copyright owner) grants to CSUCI the nonexclusive right to reproduce, translate (as defined below), and/or distribute your submission (including the abstract) worldwide in print and electronic format and in any medium, including but not limited to audio or video.

You agree that CSUCI may, without changing the content, translate the submission to any medium or format for the purpose of preservation.

You also agree that CSUCI may keep more than one copy of this submission for purposes of security, backup and preservation.

You represent that the submission is your original work, and that you have the right to grant the rights contained in this license. You also represent that your submission does not, to the best of your knowledge, infringe upon anyone's copyright. You also represent and warrant that the submission contains no libelous or other unlawful matter and makes no improper invasion of the privacy of any other person.

If the submission contains material for which you do not hold copyright, you represent that you have obtained the unrestricted permission of the copyright owner to grant CSUCI the rights required by this license, and that such third party owned material is clearly identified and acknowledged within the text or content of the submission. You take full responsibility to obtain permission to use any material that is not your own. This permission must be granted to you before you sign this form.

IF THE SUBMISSION IS BASED UPON WORK THAT HAS BEEN SPONSORED OR SUPPORTED BY AN AGENCY OR ORGANIZATION OTHER THAN CSUCI, YOU REPRESENT THAT YOU HAVE FULFILLED ANY RIGHT OF REVIEW OR OTHER OBLIGATIONS REQUIRED BY SUCH CONTRACT OR AGREEMENT.

The CSUCI Institutional Repository will clearly identify your name(s) as the author(s) or owner(s) of the submission, and will not make any alteration, other than as allowed by this license, to your submission.

Title of item: The Importance of NBA Box Score Statistics and the Value of Statistical Outbursts
3 to 5 keywords or phrases to describe the item: machine learning, nba statistics, decision tree classifier
Author(s) Name (Print): Jack BJ Bension
Date: 12/13/19

This is a permitted, modified version of the Non-Exclusive Distribution License from MIT Libraries and the University of Kansas.

The Importance of NBA Box Score Statistics and the Value of Statistical Outbursts

Jack BJ Bension

December 9, 2019

Abstract

The National Basketball Association (NBA) has embraced the 21st century by increasing its use of advanced analytics. New and evolving statistics can be used to determine how efficient a player is while he is on the court. However, even when a player is efficient, his performance may not lead to victories. This paper creates a Decision Tree Classifier model that helps to determine, through game-by-game statistics, an NBA player's value to a team's chance to win.
Players tested in the model demonstrate that a high Player Efficiency Rating (PER) does not always make a player a great asset to his team. The models created also distinguish which statistics are important for All-Star and Starter level players. The All-Star model favors individually focused, offensive statistics, whereas the Starter model places a higher level of importance on team statistics.

Contents

1 Introduction
  1.1 Motivation
2 Background
  2.1 History of NBA Analytics
  2.2 Applications of Advanced Analytics
    2.2.1 General Managers
    2.2.2 Coaches
    2.2.3 Scouts
  2.3 New Statistics
  2.4 What is Machine Learning?
    2.4.1 Different Types of Machine Learning
    2.4.2 Regression vs Classification Models
  2.5 Python
    2.5.1 pandas API
    2.5.2 NumPy API
    2.5.3 scikit-learn API
3 Contribution
  3.1 Statistics Analyzed
  3.2 Creating Models
  3.3 Gathering and Organizing the Data
  3.4 Training the Models
4 Experiments and Justification
  4.1 Finding Important Variables
  4.2 Creating New Models
  4.3 The Lack of Defense
    4.3.1 DRtg
    4.3.2 The Value of a Block
  4.4 Player Tests
    4.4.1 Difference in Play Style
  4.5 Salary vs Data
5 Conclusion and Future Work
References

List of Figures

1 This chart demonstrates one of the effects that the 2004-2005 Phoenix Suns had on the NBA. Between their performance and the increased emphasis on analytics, the NBA has seen a dramatic increase in the use of the 3 point shot [19].
2 A sample provided by Synergy that lists Kobe Bryant's scoring tendencies [19].
3 A sample of Second Spectrum's software. A green bar indicates that the player is wide open and is not currently being contested by a defender. An orange bar indicates that the player is being contested by an opponent but still has room to score. Red indicates that the player is fully guarded and has little to no room to shoot the ball. The percentage over each player shows their accuracy from that range based on data gathered on the player [17].
4 A sample of different reports that are produced by Second Spectrum's software [17].
5 A graphical representation of clustering. Each circle of data points represents a group of records that have similar traits [14].
6 A graphical representation of a supervised learning model [14].
7 An example of a linear regression model plot. The predictor variables are represented by x and the response variables are represented by y [14].
8 An example of the data splitting process that occurs in a decision tree classifier. Each colored section represents a different category [14].
9 A plot of a decision tree. The rules and logic created by the Decision Tree Classifier can be clearly viewed through this image [2].
10 The documentation on the read_csv procedure with basic parameter inputs [18].
11 An example of concat and its ability to merge different DataFrame structures together [18].
12 The isfinite function is a tool that NumPy provides. It is used to identify null data and eliminate it from the DataFrame structures [11].
13 The definitions of the procedures and the attribute