Major Qualifying Project: Advanced Baseball Statistics
Total Page:16
File Type:pdf, Size:1020Kb
Major Qualifying Project: Advanced Baseball Statistics Matthew Boros, Elijah Ellis, Leah Mitchell Advisors: Jon Abraham and Barry Posterro April 30, 2020 Contents 1 Background 5 1.1 The History of Baseball . .5 1.2 Key Historical Figures . .7 1.2.1 Jerome Holtzman . .7 1.2.2 Bill James . .7 1.2.3 Nate Silver . .8 1.2.4 Joe Peta . .8 1.3 Explanation of Baseball Statistics . .9 1.3.1 Save . .9 1.3.2 OBP,SLG,ISO . 10 1.3.3 Earned Run Estimators . 10 1.3.4 Probability Based Statistics . 11 1.3.5 wOBA . 12 1.3.6 WAR . 12 1.3.7 Projection Systems . 13 2 Aggregated Baseball Database 15 2.1 Data Sources . 16 2.1.1 Retrosheet . 16 2.1.2 MLB.com . 17 2.1.3 PECOTA . 17 2.1.4 CBS Sports . 17 2.2 Table Structure . 17 2.2.1 Game Logs . 17 2.2.2 Play-by-Play . 17 2.2.3 Starting Lineups . 18 2.2.4 Team Schedules . 18 2.2.5 General Team Information . 18 2.2.6 Player - Game Participation . 18 2.2.7 Roster by Game . 18 2.2.8 Seasonal Rosters . 18 2.2.9 General Team Statistics . 18 2.2.10 Player and Team Specific Statistics Tables . 19 2.2.11 PECOTA Batting and Pitching . 20 2.2.12 Game State Counts by Year . 20 2.2.13 Game State Counts . 20 1 CONTENTS 2 2.3 Conclusion . 20 3 Cluster Luck 21 3.1 Quantifying Cluster Luck . 22 3.2 Circumventing Cluster Luck with Total Bases . 24 3.3 Conclusion . 26 4 Garbage Time 27 4.1 Calculating Situational Odds of Winning . 28 4.2 Quantifying Garbage Time . 30 4.3 Garbage-Adjusted Statistics . 31 5 Preseason Model 34 5.1 Calculating Cluster Luck Adjusted Wins . 34 5.2 Roster Changes . 36 5.3 Injury Adjustments . 37 5.4 Alternative Attempts at Creating the Preseason Model . 38 5.5 Alternative Methods to Creating Preseason Models . 39 5.5.1 PECOTA and ZIPS . 39 5.5.2 Total Bases . 39 5.5.3 Fractional Rosters . 39 5.6 Regression Results . 40 5.7 Second Regression . 42 6 In-Season Model 43 6.1 Runs . 43 6.2 Cluster Luck Runs . 44 6.3 Total Bases . 45 6.4 Wins Adapted In-Season . 46 6.5 Garbage Time Exlcuded . 47 6.6 Results . 48 7 Combination of the Two Models 51 7.1 Credibility Theory . 51 7.2 Linear . 52 7.3 Mixed . 53 7.4 Results . 54 8 Single Game Betting 55 8.1 Starting Pitcher, Relief Pitcher, and Lineup . 55 8.2 Games . 56 8.3 Results . 57 8.4 Conclusion . 58 9 2019 Testing 59 9.1 Preseason Testing . 59 9.2 Conclusion . 60 10 Conclusion 61 CONTENTS 3 A Database Technical Details 63 A.1 Design Decisions . 63 A.1.1 Choosing SQLite and SQL Alchemy . 63 A.1.2 Table Design and Code Reusability . 65 A.2 Retrieving and Cleaning Data . 66 A.2.1 Retrosheet . 66 A.2.2 MLB.com Rosters . 66 A.2.3 CBS Sports Injury Reports . 67 A.3 Using the Database . 69 A.4 Full Table List . 70 A.4.1 GameLog . 70 A.4.2 PlayByPlay . 70 A.4.3 SeasonalRoster . 74 A.4.4 StartingLineup . 74 A.4.5 TeamStats . 74 A.4.6 TeamBattingStats . 75 A.4.7 TeamPitchingStats . 75 A.4.8 NonGbgTeamBattingStats . 75 A.4.9 NonGbgTeamPitchingStats . 76 A.4.10 PlayerBattingStats . 76 A.4.11 PlayerPitchingStats . 77 A.4.12 NonGbgPlayerBattingStats . 77 A.4.13 NonGbgPlayerPitchingStats . 77 A.4.14 Game State Counts By Year ........................... 78 A.4.15 Game State Counts . 78 A.4.16 RosterByGame . 78 A.4.17 Schedule . 78 A.4.18 BettingOdds . ..