CSI 5335 Project

Baseball Analytics v1.4

due: 8am December 9, 2019

Revisions

• Initial version 1.0 - November 1

• Version 1.1 -

1. Added option 4 for phase 3
2. Specified versions of system components
3. Clarified TB calculation
4. Clarified data output
5. Corrected math on Scherzer RC and RC27
6. Corrected typo on HDFS file name

• Version 1.2 - November 18

1. Corrected the weights on Phase 2 (D should have been 0.52)
2. Added the web site with batter runs created to the Phase 1 description

• Version 1.3 - November 23

1. Corrected the equation in Phase 2 to allow linear regression to work

• Version 1.4 - December 4

1. Revised the due date to December 9 at 8am

1 Overview

Baseball collects a large amount of data. From its beginnings in the late 1800s, statistics such as batting average (number of hits divided by number of at-bats) and ERA (number of earned runs allowed times 9, divided by innings pitched) have defined the best (and worst) players in the game.

Name          AB    H   1B   2B   3B   HR   BB  IBB  HBP  SF  SH  GiDP  SB  CS
Mike Trout    470  137   63   27    2   45  110   14   16   4   0     5  11   2
Alex Bregman  554  164   84   37    2   41  119    2    9   8   0     9   5   1
Max Scherzer  650  144   83   43    0   18   33    2    7   ?   ?     ?   8   0

Table 1: 2019 statistics for Mike Trout, Alex Bregman, and Max Scherzer

Recently, sophisticated statistics like WAR (wins above replacement) have combined multiple data points to generate a definitive answer as to who is the best player (Mike Trout) in the game. One of the original statistics to determine an overall offensive contribution is Runs Created, defined by Bill James in the 1970s. Runs Created is designed to determine the number of runs which should be scored based on the possible results of a play. The simple formula for runs created (RC) is

RC = OBP × TB

where OBP is on-base percentage and TB is total bases (1B + 2×2B + 3×3B + 4×HR). Over the years, Bill James improved the formula to one that considers each hitter outcome. The resulting formula is

RC = (H + BB − CS + HBP − GDP) × (TB + 0.26 × (BB − IBB + HBP) + 0.52 × (SH + SF + SB)) / (AB + BB + HBP + SF + SH)

where H is hits, BB is bases on balls (walks), CS is caught stealing, HBP is hit by pitch, GDP is grounded into double plays (GiDP in Table 1), TB is total bases, IBB is intentional walks, SH is sacrifice hits, SF is sacrifice flies, and AB is at-bats. This formula is amazingly accurate for MLB as a whole. In 2019, 23467 runs were scored. The RC formula estimates 23600 runs would be scored, an error of 133 runs, or 0.6%.

Since the stats used are all cumulative, we can apply RC to each player in order to determine that player's offensive contribution (the portion of all runs in MLB generated by that player). For example, in 2019, Mike Trout made the offensive contributions shown in Table 1 to his team, which total 145.03 runs created.

However, for individual players, the statistics can be misleading. Some ballparks are easier to score runs in than others. For example, Colorado plays in a stadium a mile above sea level; the ball travels farther in thin air, so players hit better and teams score more runs. Thus individual player stats should be adjusted by ballpark. Mike Trout played for the Los Angeles Angels of Anaheim in 2019, whose park has a small park factor of 1.018. Assuming Trout's road games average out to park neutral (a 1.000 park effect), we should divide Trout's runs created by 1.009 (the average of 1.018 and 1.000), yielding 143.74.

Now consider Alex Bregman of the Houston Astros. Bregman's stats are also in Table 1. Bregman has 150.2 runs created, which is more than Trout. However, Houston has a park factor of 1.083, so we need to divide Bregman's number by 1.0415, yielding 144.22. Still more than Trout, but now very close. BTW, these two players are the leading candidates for the MVP award in the American League.

You might have noticed that Bregman has far more at-bats than Trout. This is because Trout missed a month of the season with an injury while Bregman played all year. Another stat, RC27, eliminates playing-time differences by considering the number of runs created per 27 outs. Outs are

Outs = AB − H + SF + SH + GiDP + CS

and RC27 is (RC × 27) / Outs. Using RC27, Trout has a value of 11.28 while Bregman is at 9.54.

Finally, every time a hitter accomplishes a result, a pitcher also allows that result. Thus, we can take the same data for pitchers and determine the runs created allowed (RCA). However, the data for pitchers is harder to find. For example, see Max Scherzer in Table 1: I was unable to find the number of sacrifices allowed or the number of grounded-into-double-plays allowed. Note! This data exists, I just don't know where to find it. I also had to calculate some fields from others, such as AB and 1B. However, if we assume 0 for the missing fields and use 1.101 as the park factor for Scherzer, we can compute an RCA of 64.46 and an RC27 of 3.44.
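As a sanity check, Trout's numbers above can be reproduced in a few lines of plain Python. The stat line is copied from Table 1, and the 1.009 divisor is the average of the 1.018 home park factor and the assumed neutral 1.000 road factor:

```python
# Mike Trout, 2019 (from Table 1)
AB, H, B1, B2, B3, HR = 470, 137, 63, 27, 2, 45
BB, IBB, HBP, SF, SH, GIDP, SB, CS = 110, 14, 16, 4, 0, 5, 11, 2

TB = B1 + 2 * B2 + 3 * B3 + 4 * HR                  # total bases: 303
RC = ((H + BB - CS + HBP - GIDP)
      * (TB + 0.26 * (BB - IBB + HBP) + 0.52 * (SH + SF + SB))
      / (AB + BB + HBP + SF + SH))                  # 145.03
RC_adj = RC / ((1.018 + 1.000) / 2)                 # park adjusted: 143.74
outs = AB - H + SF + SH + GIDP + CS                 # 344
RC27 = RC_adj * 27 / outs                           # 11.28
print(round(RC, 2), round(RC_adj, 2), round(RC27, 2))
```

Swapping in Bregman's line from Table 1 (with a 1.0415 divisor) reproduces his 150.2 and 144.22 figures the same way.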

2 Project

The project contains three phases. While there is a dependency between phase 1 and phase 2, phase 3 can be completed in parallel.

2.1 Data

Sean Lahman maintains baseball data that is free to use. You can download various files at http://www.seanlahman.com/baseball-archive/statistics.

Since the project is for academic purposes, you can freely download the data. However, be sure to attribute the data as requested. On this site, you can find sufficient data to calculate the park-adjusted RC and RC27 for batters for all seasons up to 2018, including park effects for batters (and pitchers). The park effects are on a scale of 100, so you will need to adjust accordingly. The data will be stored in HDFS in the directory /user/baseball (all of the csv files will be loaded there). The system I am using is Red Hat Linux 7.5, Python 3.6.5, and Spark 2.4.4, with HDFS on port 8020. I can build a Windows configuration if needed.

2.2 Phase 1 - Big Data Portion

Write a pySpark program to accept a year and generate a report containing all of the batters with their RC and RC27. The program should be based on the data from SeanLahman.com and should use the map and reduce functions to simulate parallel processing. The user should be allowed to set the following limits:

• The minimum number of atbats (default 0; any number greater than 0 accepted)

• The attribute to be sorted (default RC; RC27 also accepted)

• The number of players to be listed (default all; any number greater than 0 accepted)

• The year of interest (no default; error if not provided)

Note that you will have to determine the appropriate park factors for each team and each year. The output should be a csv file on HDFS (or a collection of HDFS files) stored in /user/〈 your last name 〉/BD. Fox Sports has a website with runs created calculated for 150 players. You can use the link at

https://www.foxsports.com/mlb/stats?season=2018&category=BATTING+II&group=1&sort=4&time=0&pos=0&qual=1&sortOrder=0&splitType=0&page=1&statID=0

Note that some of the players may have slightly different stats, so if your answer is off by a little bit, double check the data before spending too much time debugging.
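A minimal sketch of the Phase 1 pipeline is below. The column indices assume the usual Lahman Batting.csv layout (playerID, yearID, stint, teamID, ...); verify them against your download. The park_factor stub, the hard-coded limit values, and the output path placeholder are all my own assumptions, not part of the spec:

```python
# Phase 1 sketch: park-adjusted RC and RC27 for all batters in one year.
from pyspark import SparkContext

sc = SparkContext(appName="Phase1RC")
year, min_ab, sort_key, top_n = 2018, 400, "RC", 25   # example limit values

def park_factor(team, yr):
    # Stub: return the park effect for this team/year as a multiplier.
    # The Lahman park-effect data is on a scale of 100, so divide by 100.
    return 1.000

lines = sc.textFile("hdfs://localhost:8020/user/baseball/Batting.csv")
header = lines.first()

def parse(line):
    f = line.split(",")
    g = lambda x: int(x) if x else 0                  # blank fields -> 0
    # assumed Lahman column order: 6=AB 8=H 9=2B 10=3B 11=HR 13=SB 14=CS
    #                              15=BB 17=IBB 18=HBP 19=SH 20=SF 21=GIDP
    stats = tuple(g(f[i]) for i in (6, 8, 9, 10, 11, 13, 14, 15, 17, 18, 19, 20, 21))
    return ((f[0], f[3]), stats)                      # key on (playerID, teamID)

def rc_row(item):
    (pid, team), (ab, h, b2, b3, hr, sb, cs, bb, ibb, hbp, sh, sf, gidp) = item
    tb = h + b2 + 2 * b3 + 3 * hr                     # = 1B + 2*2B + 3*3B + 4*HR
    den = ab + bb + hbp + sf + sh
    rc = 0.0
    if den > 0:
        rc = ((h + bb - cs + hbp - gidp)
              * (tb + 0.26 * (bb - ibb + hbp) + 0.52 * (sh + sf + sb)) / den)
        rc /= (park_factor(team, year) + 1.0) / 2.0   # road games assumed neutral
    outs = ab - h + sf + sh + gidp + cs
    return (pid, (ab, rc, outs))

add = lambda a, b: tuple(x + y for x, y in zip(a, b))
report = (lines.filter(lambda l: l != header)
               .filter(lambda l: l.split(",")[1] == str(year))
               .map(parse)
               .reduceByKey(add)                      # merge stints with the same team
               .map(rc_row)
               .reduceByKey(add)                      # merge a traded player's teams
               .filter(lambda r: r[1][0] >= min_ab and r[1][2] > 0)
               .map(lambda r: (r[0], r[1][0], r[1][1], r[1][1] * 27.0 / r[1][2]))
               .sortBy(lambda r: r[2] if sort_key == "RC" else r[3], ascending=False))

rows = report.take(top_n) if top_n else report.collect()
(sc.parallelize(rows, 1)
   .map(lambda r: "%s,%d,%.2f,%.2f" % r)
   .saveAsTextFile("hdfs://localhost:8020/user/yourlastname/BD"))  # substitute your last name
```

Park factors are adjusted per team before a traded player's stints are combined, since each stint was played in a different home park.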

2.3 Phase 2 - Machine Learning Portion

Calculating park-adjusted RC for each player may not generate the same RC totals as using the formula over the MLB totals for the season; it would be expected that the per-player error is larger. It may be possible to tweak some constants and generate better results. In particular, consider the constants B, C, and D in this reworked version of RC.

RC = (H + BB − CS + HBP − GDP) × (B × TB + C × (BB − IBB + HBP) + D × (SH + SF + SB)) / (AB + BB + HBP + SF + SH)

where in the original RC formula

• B = 1

• C = 0.26

• D = 0.52

Write a pySpark program that uses linear regression to estimate the weights. Use data from 2017 to generate your estimates, then data from 2018 to test your estimates. Ideally, the error for 2018 should be less than with the current weights, but this is not required. Similar to phase 1, the user should be allowed to query the results computed with your new weights by setting the following limits:

• The minimum number of atbats (default 0; any number greater than 0 accepted)

• The attribute to be sorted (default RC; RC27 also accepted)

• The number of players to be listed (default all; any number greater than 0 accepted)

• The year of interest (default 2017; 2018 also accepted)

The output should be a csv file on HDFS (or a collection of HDFS files) stored in /user/〈 your last name 〉/ML.
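Because the reworked formula is linear in B, C, and D once the leading factor and the denominator are fixed, each batter contributes three feature terms and ordinary linear regression applies. The sketch below aggregates those terms per team and regresses them against actual team runs from Teams.csv; treating team runs as the regression target (and using the DataFrame API rather than raw map/reduce) is my assumption about the setup, not something the spec dictates:

```python
# Phase 2 sketch: fit B, C, D by linear regression on 2017 data.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("Phase2Weights").getOrCreate()
bat = (spark.read.csv("hdfs://localhost:8020/user/baseball/Batting.csv",
                      header=True, inferSchema=True).na.fill(0))
teams = spark.read.csv("hdfs://localhost:8020/user/baseball/Teams.csv",
                       header=True, inferSchema=True)

def features(year):
    # Per-batter terms: RC = B*f1 + C*f2 + D*f3, with the formula's fixed
    # leading factor p and denominator q folded into each term.
    p = F.col("H") + F.col("BB") - F.col("CS") + F.col("HBP") - F.col("GIDP")
    q = F.col("AB") + F.col("BB") + F.col("HBP") + F.col("SF") + F.col("SH")
    tb = F.col("H") + F.col("2B") + 2 * F.col("3B") + 3 * F.col("HR")
    return (bat.filter((F.col("yearID") == year) & (F.col("AB") > 0))
               .withColumn("f1", p * tb / q)
               .withColumn("f2", p * (F.col("BB") - F.col("IBB") + F.col("HBP")) / q)
               .withColumn("f3", p * (F.col("SH") + F.col("SF") + F.col("SB")) / q)
               .groupBy("teamID")
               .agg(F.sum("f1").alias("f1"), F.sum("f2").alias("f2"),
                    F.sum("f3").alias("f3"))
               .join(teams.filter(F.col("yearID") == year)
                          .select("teamID", F.col("R").alias("label")), "teamID"))

asm = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
# fitIntercept=False keeps the model in the same form as the RC formula
model = LinearRegression(labelCol="label", fitIntercept=False).fit(
    asm.transform(features(2017)))
print("B, C, D:", list(model.coefficients))           # original: 1, 0.26, 0.52
print("2018 RMSE:", model.evaluate(asm.transform(features(2018))).rootMeanSquaredError)
```

The fitted coefficients can then be dropped into the Phase 1 report code in place of 1, 0.26, and 0.52.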

2.4 Phase 3 - Data Science Portion

Calculating RC and RC27 for pitchers is more difficult since the Lahman data does not contain the needed information. For example, the data does not include the singles, doubles, and triples allowed by pitchers. In order to generate this information, extra work is required. There are four basic possibilities:

1. Find the data - The missing data may be present in another available repository. You can download the information and store it within the /user/〈 your last name 〉/DS directory in HDFS. You can then combine data from the two repositories as needed (some caveats apply).

2. Generate the data - Retrosheet.org contains every play from every major league baseball game going back to 1878. The data can be downloaded and accessed, with the missing data calculated (or all of the data). The new data can be stored in the /user/〈 your last name 〉/DS directory in HDFS.

3. Ignore the data - The RC and RC27 formulas can be modified to work without the missing data. In this case, you must apply the modified formula to the batter data and calculate the RMSE (root mean square error) between the new formula and the original formula (a sketch of the RMSE computation follows this list). The size of the RMSE will impact the grade. The results of applying your new formula to the data can be stored in the /user/〈 your last name 〉/DS directory in HDFS.

4. Use 3rd party software to parse the Retrosheet data. You can then integrate the resulting data into the Lahman data as in option 1 and store it within the /user/〈 your last name 〉/DS directory in HDFS.
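For option 3, the RMSE comparison reduces to one map and one reduce once both formulas have been evaluated per batter. A minimal sketch; the placeholder pairs stand in for the (original RC, modified RC) values your Phase 1 code would actually produce:

```python
# Option 3 sketch: RMSE between the original and modified RC formulas.
from math import sqrt
from pyspark import SparkContext

sc = SparkContext(appName="Phase3RMSE")

# (original RC, modified RC) per batter; in the real program these come
# from mapping both formulas over the Phase 1 batter data.  The literals
# below are placeholders so the sketch runs on its own.
rc_pairs = sc.parallelize([(145.03, 141.2), (150.2, 147.9), (64.46, 61.8)])

n = rc_pairs.count()
sse = rc_pairs.map(lambda p: (p[0] - p[1]) ** 2).reduce(lambda a, b: a + b)
print("RMSE:", sqrt(sse / n))
```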

However the missing data is handled, you will then write a pySpark program to generate RC and RC27 data for every pitcher. Similar to phase 1, the user should be allowed to set the following limits:

• The minimum number of batters faced (default 0; any number greater than 0 accepted)

• The attribute to be sorted (default RC; RC27 also accepted)

• The number of players to be listed (default all; any number greater than 0 accepted)

• The year of interest (no default; error if not provided)

The output should be an HDFS file (or collection of HDFS files) stored in /user/〈 your last name 〉/DS.

3 Submission

Every program should be executable from the command line in a Linux environment. I would suggest writing a small test program first to make sure you can read the data as stored (a sketch appears at the end of this section). I found success with the command

sc.textFile("hdfs://localhost:8020/user/baseball/Batting.csv")

where sc is a SparkContext. You should submit in a zip file:

• three pySpark report generating programs, one for each phase

• the pySpark linear regression program for phase 2

• EITHER:

– the pySpark program to combine the Lahman data with your new data (option 1), OR

– the pySpark program to generate the data from Retrosheet (option 2), OR

– the pySpark program to compute the RMSE (option 3), OR

– the link to the parser program and the pySpark program to combine the Lahman data with the new data (option 4)

• A 2 page (max) report on the linear regression of phase 2 and your methodology for phase 3
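For the suggested read test, something along these lines is enough; it only confirms that the HDFS path is reachable and that Batting.csv parses as text:

```python
# Smoke test: confirm the Lahman data is readable from HDFS before
# building the real reports.
from pyspark import SparkContext

sc = SparkContext(appName="SmokeTest")
batting = sc.textFile("hdfs://localhost:8020/user/baseball/Batting.csv")
print(batting.count(), "lines")
for line in batting.take(3):      # header plus the first two data rows
    print(line)
```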