Simulation-Based Projections for Baseball Statistics
A Thesis
Presented to the
Faculty of
California State Polytechnic University, Pomona

In Partial Fulfillment
Of the Requirements for the Degree
Master of Science
In
Computer Science

By
Daniel Adam Acevedo
2018

SIGNATURE PAGE

THESIS: SIMULATION-BASED PROJECTIONS FOR BASEBALL STATISTICS
AUTHOR: DANIEL ADAM ACEVEDO
TERM SUBMITTED: Spring 2018
Computer Science Department

Dr. Yu Sun, Thesis Committee Chair, Department of Computer Science
Dr. Abdelfattah Amamra, Department of Computer Science
Dr. Sampath Jayarathna, Department of Computer Science

ACKNOWLEDGEMENTS

I would like to thank my family for their love and support, and for all of the sacrifices they have made so that I could pursue an education culminating in a Master’s degree. I would like to thank my advisor, Dr. Yu Sun, for his guidance and help throughout my work on this thesis. I would also like to thank Dr. Sampath Jayarathna and Dr. Abdelfattah Amamra for being members of my committee.

ABSTRACT

Baseball is an unpredictable sport. The introduction of sabermetrics created an opening for applying computer science methods to the evaluation of the game. Every Major League Baseball organization has developed its own method of measuring players’ results and predicting what to expect from a player entering a season. While most industry models rely on their own statistical analysis to make predictions, this thesis introduces a new model that uses simulations in addition to statistical analysis. The results of this thesis show that this model is comparable to some of the best projection systems available.

TABLE OF CONTENTS

SIGNATURE PAGE
ACKNOWLEDGEMENTS
ABSTRACT
LIST OF FIGURES
1. INTRODUCTION
2. ACQUIRING DATA AND WEIGHTS
    2.1. Data Used
    2.2. Data Acquisition
        2.2.1. Setting up the Databases
        2.2.2. Obtaining the Most Recent Four Year Period
        2.2.3. Obtain Weights
        2.2.4. Applying Weights and Regression to the Mean
3. IMPLEMENTATION
    3.1. Explanation of a Simulation
    3.2. Creating a Prediction
    3.3. Apply Age Regression
    3.4. Prediction Example
4. ANALYSIS
    4.1. Metrics for Evaluation
    4.2. Explanation of Industry Projections
    4.3. Explanation of Metrics
    4.4. Predictions Comparison
        4.4.1. All Players
        4.4.2. Players with less than three years played
        4.4.3. Players with four or more years played
5. CONCLUSION

LIST OF FIGURES

Figure 1. Description of SQLite tables
Figure 2. Description of statistics considered for predictions
Figure 3. Spinner Board example featuring BIP vs. Not a BIP
Figure 4. Spinner Board example with numerous outcomes
Figure 5. Wil Myers' Generalized Spinner Board
Figure 6. Myers' Outcome Spinner Board
Figure 7. Myers' Probabilities of outcomes given certain events occurring
Figure 8. Myers' Resulting BIP vs Not a BIP
Figure 9. Myers' Resulting Outcomes
Figure 10. Myers' true rates and predicted values based on those rates
Figures 11-15. Any Years Played MAE of Hit, Home Run, Runs Scored, RBI, and WOBA Predictions
Figures 16-20. Any Years Played RMSE of Hit, Home Run, Runs Scored, RBI, and WOBA Predictions
Figures 21-25. Any Years Played R of Hit, Home Run, Runs Scored, RBI, and WOBA Predictions
Figures 26-30. Less than Four Years Played MAE of Hit, Home Run, Runs Scored, RBI, and WOBA Predictions
Figures 31-35. Less than Four Years Played RMSE of Hit, Home Run, Runs Scored, RBI, and WOBA Predictions
Figures 36-40. Less than Four Years Played R of Hit, Home Run, Runs Scored, RBI, and WOBA Predictions
Figures 41-45. Four or More Years Played MAE of Hit, Home Run, Runs Scored, RBI, and WOBA Predictions
Figures 46-50. Four or More Years Played RMSE of Hit, Home Run, Runs Scored, RBI, and WOBA Predictions
Figures 51-55. Four or More Years Played R of Hit, Home Run, Runs Scored, RBI, and WOBA Predictions

1. INTRODUCTION

Baseball is often called a game of failure: a player who gets a hit in only three out of ten at-bats is considered among the best players in the game. The prediction of baseball statistics could also be considered a game of failure. Projection models draw on large amounts of data to project a player’s performance, but they will never be able to predict that performance perfectly and consistently.

Major League Baseball (MLB) organizations have a substantial interest in the performance of their projection systems: teams pay players salaries that are consistent with their performance, on the assumption that their success is reasonably sustainable in future years. In 2014, for example, the Miami Marlins awarded Giancarlo Stanton the largest contract in MLB history to that point, worth $325,000,000 over a thirteen-year period [1]. Such large contracts are risks: Stanton performed below his pre-contract average over the next two years before posting the best statistical season of his career in 2017 [2]. The Marlins, under new ownership, then traded Stanton to the New York Yankees because the team could no longer afford such a large contract. This highlights how important it is for teams to sign players to salaries that are consistent with their past performance, with the expectation that this performance will hold steady or improve, while staying within budget.

A well-known example of the importance of maintaining a budget is portrayed in the film Moneyball. Based on a true story, it follows Oakland Athletics’ general manager Billy Beane and his use of sabermetrics, the application of