CSI 5335 Project

Baseball Analytics v1.4

due: 8am December 9, 2019

Revisions

• Initial version 1.0 - November 1

• Version 1.1 -

1. Added option 4 for phase 3
2. Specified versions of system components
3. Clarified TB calculation
4. Clarified data output
5. Corrected math on Scherzer RC and RC27
6. Corrected typo on HDFS file name

• Version 1.2 - November 18

1. Corrected the weights on Phase 2 (D should have been 0.52)
2. Added the web site with batter runs created to the Phase 1 description

• Version 1.3 - November 23

1. Corrected the equation in Phase 2 to allow linear regression to work

• Version 1.4 - December 4

1. Revised the due date to December 9 at 8am

1 Overview

Baseball collects a large amount of data. From its beginnings in the late 1800s, statistics such as batting average (number of hits divided by number of at-bats) and ERA (number of earned runs allowed times 9, divided by innings pitched) have defined the best (and worst) players in the game.

Name          AB    H   1B   2B   3B   HR   BB  IBB  HBP  SF  SH  GiDP  SB  CS
Mike Trout    470  137   63   27    2   45  110   14   16   4   0     5  11   2
Alex Bregman  554  164   84   37    2   41  119    2    9   8   0     9   5   1
Max Scherzer  650  144   83   43    0   18   33    2    7   ?   ?     ?   8   0

Table 1: 2019 statistics for Mike Trout, Alex Bregman, and Max Scherzer

Recently, sophisticated statistics like WAR (wins above replacement) have combined multiple data points to generate a definitive answer as to who is the best player (Mike Trout) in the game. One of the original statistics to determine an overall offensive contribution is Runs Created, defined by Bill James in the 1970s. Runs Created is designed to determine the number of runs which should be scored based on the possible results of a play. The simple formula for runs created (RC) is

RC = OBP × TB

where OBP is on-base percentage and TB is total bases (1B + 2×2B + 3×3B + 4×HR). Over the years, Bill James improved the formula to one that considers each hitter outcome. The resulting formula is

RC = (H + BB − CS + HBP − GDP) × (TB + 0.26 × (BB − IBB + HBP) + 0.52 × (SH + SF + SB)) / (AB + BB + HBP + SF + SH)

where H is hits, BB is bases on balls (walks), CS is caught stealing, HBP is hit by pitch, GDP is grounded into double plays (GiDP in Table 1), TB is total bases, IBB is intentional walks, SH is sacrifice hits, SF is sacrifice flies, and AB is at-bats. This formula is amazingly accurate for MLB as a whole. In 2019, 23467 runs were scored. The RC formula estimates 23600 runs would be scored, an error of 133 runs, or 0.6%.

Since the stats used are all cumulative, we can apply RC to each player in order to determine that player's offensive contribution (the portion of all runs in MLB generated by that player). For example, in 2019, Mike Trout made the offensive contributions shown in Table 1 to his team, which total 145.03 runs created.

However, for individual players, the statistics can be misleading. Some ballparks are easier to score runs in than others. For example, Colorado plays in a stadium a mile above sea level; the ball travels farther in thin air, so players hit better and teams score more runs. Thus individual player stats should be adjusted by ballpark. Mike Trout played for the Los Angeles Angels of Anaheim in 2019, whose park has a small park factor of 1.018. Assuming Trout's road games average out to park neutral (a 1.000 park effect), we should divide Trout's runs created by 1.009 (the average of 1.018 and 1.000), yielding 143.74.

Now consider Alex Bregman of the Houston Astros. Bregman's stats are also in Table 1. Bregman has 150.2 runs created, which is more than Trout. However, Houston has a park factor of 1.083, so we need to divide Bregman's number by 1.0415, yielding 144.22. Still more than Trout, but now very close. BTW, these two players are the leading candidates for the MVP award in the American League.

You might have noticed that Bregman has far more at-bats than Trout. This is because Trout missed a month of the season with an injury while Bregman played all year. Another stat, RC27, eliminates playing-time differences by considering the number of runs created per 27 outs. Outs are

Outs = AB − H + SF + SH + GiDP + CS

and RC27 is (RC × 27) / Outs. Using RC27, Trout has a value of 11.28 while Bregman is at 9.54.

Finally, every time a hitter accomplishes a result, a pitcher also allows that result. Thus, we can take the same data for pitchers and determine the runs created allowed (RCA). However, the data for pitchers is harder to find. For example, see Max Scherzer in Table 1: I was unable to find the number of sacrifices allowed or the number of grounded-into-double-plays allowed. Note! This data exists, I just don't know where to find it. I also had to calculate some fields from others, such as AB and 1B. However, if we assume 0 for the missing fields and use 1.101 as the park factor for Scherzer, we can compute an RCA of 64.46 and an RC27 of 3.44.
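As a sanity check, Trout's numbers above can be reproduced in a few lines of plain Python. The stat line is copied from Table 1, and the 1.009 divisor is the average of the 1.018 home park factor and the assumed neutral 1.000 road factor:

```python
# Mike Trout, 2019 (from Table 1)
AB, H, B1, B2, B3, HR = 470, 137, 63, 27, 2, 45
BB, IBB, HBP, SF, SH, GIDP, SB, CS = 110, 14, 16, 4, 0, 5, 11, 2

TB = B1 + 2 * B2 + 3 * B3 + 4 * HR                  # total bases: 303
RC = ((H + BB - CS + HBP - GIDP)
      * (TB + 0.26 * (BB - IBB + HBP) + 0.52 * (SH + SF + SB))
      / (AB + BB + HBP + SF + SH))                  # 145.03
RC_adj = RC / ((1.018 + 1.000) / 2)                 # park adjusted: 143.74
outs = AB - H + SF + SH + GIDP + CS                 # 344
RC27 = RC_adj * 27 / outs                           # 11.28
print(round(RC, 2), round(RC_adj, 2), round(RC27, 2))
```

Swapping in Bregman's line from Table 1 (with a 1.0415 divisor) reproduces his 150.2 and 144.22 figures the same way.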

2 Project

The project contains three phases. While there is a dependency between phase 1 and phase 2, phase 3 can be completed in parallel.

2.1 Data

Sean Lahman maintains baseball data that is free to use. You can download various files at http://www.seanlahman.com/baseball-archive/statistics.

Since the project is for academic purposes, you can freely download the data. However, be sure to attribute the data as requested. On this site, you can find sufficient data to calculate the park-adjusted RC and RC27 for batters for all seasons up to 2018, including park effects for batters (and pitchers). The park effects are on a scale of 100, so you will need to adjust accordingly. The data will be stored in HDFS in the directory /user/baseball (all of the csv files will be loaded there). The system I am using is Red Hat Linux 7.5, Python 3.6.5, and Spark 2.4.4, with HDFS on port 8020. I can build a Windows configuration if needed.

2.2 Phase 1 - Big Data Portion

Write a pySpark program to accept a year and generate a report containing all of the batters with their RC and RC27. The program should be based on the data from SeanLahman.com and should use the map and reduce functions to simulate parallel processing. The user should be allowed to set the following limits:

• The minimum number of atbats (default 0; any number greater than 0 accepted)

• The attribute to be sorted (default RC; RC27 also accepted)

• The number of players to be listed (default all; any number greater than 0 accepted)

• The year of interest (no default; error if not provided)

Note that you will have to determine the appropriate park factors for each team and each year. The output should be a csv file on HDFS (or a collection of HDFS files) stored in /user/〈 your last name 〉/BD. Fox Sports has a website with runs created calculated for 150 players. You can use the link at

https://www.foxsports.com/mlb/stats?season=2018&category=BATTING+II&group=1&sort=4&time=0&pos=0&qual=1&sortOrder=0&splitType=0&page=1&statID=0

Note that some of the players may have slightly different stats, so if your answer is off by a little bit, double check the data before spending too much time debugging.
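A minimal sketch of the Phase 1 pipeline is below. The column indices assume the usual Lahman Batting.csv layout (playerID, yearID, stint, teamID, ...); verify them against your download. The park_factor stub, the hard-coded limit values, and the output path placeholder are all my own assumptions, not part of the spec:

```python
# Phase 1 sketch: park-adjusted RC and RC27 for all batters in one year.
from pyspark import SparkContext

sc = SparkContext(appName="Phase1RC")
year, min_ab, sort_key, top_n = 2018, 400, "RC", 25   # example limit values

def park_factor(team, yr):
    # Stub: return the park effect for this team/year as a multiplier.
    # The Lahman park-effect data is on a scale of 100, so divide by 100.
    return 1.000

lines = sc.textFile("hdfs://localhost:8020/user/baseball/Batting.csv")
header = lines.first()

def parse(line):
    f = line.split(",")
    g = lambda x: int(x) if x else 0                  # blank fields -> 0
    # assumed Lahman column order: 6=AB 8=H 9=2B 10=3B 11=HR 13=SB 14=CS
    #                              15=BB 17=IBB 18=HBP 19=SH 20=SF 21=GIDP
    stats = tuple(g(f[i]) for i in (6, 8, 9, 10, 11, 13, 14, 15, 17, 18, 19, 20, 21))
    return ((f[0], f[3]), stats)                      # key on (playerID, teamID)

def rc_row(item):
    (pid, team), (ab, h, b2, b3, hr, sb, cs, bb, ibb, hbp, sh, sf, gidp) = item
    tb = h + b2 + 2 * b3 + 3 * hr                     # = 1B + 2*2B + 3*3B + 4*HR
    den = ab + bb + hbp + sf + sh
    rc = 0.0
    if den > 0:
        rc = ((h + bb - cs + hbp - gidp)
              * (tb + 0.26 * (bb - ibb + hbp) + 0.52 * (sh + sf + sb)) / den)
        rc /= (park_factor(team, year) + 1.0) / 2.0   # road games assumed neutral
    outs = ab - h + sf + sh + gidp + cs
    return (pid, (ab, rc, outs))

add = lambda a, b: tuple(x + y for x, y in zip(a, b))
report = (lines.filter(lambda l: l != header)
               .filter(lambda l: l.split(",")[1] == str(year))
               .map(parse)
               .reduceByKey(add)                      # merge stints with the same team
               .map(rc_row)
               .reduceByKey(add)                      # merge a traded player's teams
               .filter(lambda r: r[1][0] >= min_ab and r[1][2] > 0)
               .map(lambda r: (r[0], r[1][0], r[1][1], r[1][1] * 27.0 / r[1][2]))
               .sortBy(lambda r: r[2] if sort_key == "RC" else r[3], ascending=False))

rows = report.take(top_n) if top_n else report.collect()
(sc.parallelize(rows, 1)
   .map(lambda r: "%s,%d,%.2f,%.2f" % r)
   .saveAsTextFile("hdfs://localhost:8020/user/yourlastname/BD"))  # substitute your last name
```

Park factors are adjusted per team before a traded player's stints are combined, since each stint was played in a different home park.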

2.3 Phase 2 - Machine Learning Portion

Calculating park-adjusted RC for each player may not generate the same RC totals as using the formula over the MLB totals for the season; it would be expected that the per-player error is larger. It may be possible to tweak some constants and generate better results. In particular, consider the constants B, C, and D in this reworked version of RC.

RC = (H + BB − CS + HBP − GDP) × (B × TB + C × (BB − IBB + HBP) + D × (SH + SF + SB)) / (AB + BB + HBP + SF + SH)

where in the original RC formula

• B = 1

• C = 0.26

• D = 0.52

Write a pySpark program that uses linear regression to estimate the weights. Use data from 2017 to generate your estimates, then data from 2018 to test your estimates. Ideally, the error for 2018 should be less than with the current weights, but this is not required. Similar to phase 1, the user should be allowed to query the results computed with your new weights by setting the following limits:

• The minimum number of atbats (default 0; any number greater than 0 accepted)

• The attribute to be sorted (default RC; RC27 also accepted)

• The number of players to be listed (default all; any number greater than 0 accepted)

• The year of interest (default 2017; 2018 also accepted)

The output should be a csv file on HDFS (or a collection of HDFS files) stored in /user/〈 your last name 〉/ML.
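Because the reworked formula is linear in B, C, and D once the leading factor and the denominator are fixed, each batter contributes three feature terms and ordinary linear regression applies. The sketch below aggregates those terms per team and regresses them against actual team runs from Teams.csv; treating team runs as the regression target (and using the DataFrame API rather than raw map/reduce) is my assumption about the setup, not something the spec dictates:

```python
# Phase 2 sketch: fit B, C, D by linear regression on 2017 data.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("Phase2Weights").getOrCreate()
bat = (spark.read.csv("hdfs://localhost:8020/user/baseball/Batting.csv",
                      header=True, inferSchema=True).na.fill(0))
teams = spark.read.csv("hdfs://localhost:8020/user/baseball/Teams.csv",
                       header=True, inferSchema=True)

def features(year):
    # Per-batter terms: RC = B*f1 + C*f2 + D*f3, with the formula's fixed
    # leading factor p and denominator q folded into each term.
    p = F.col("H") + F.col("BB") - F.col("CS") + F.col("HBP") - F.col("GIDP")
    q = F.col("AB") + F.col("BB") + F.col("HBP") + F.col("SF") + F.col("SH")
    tb = F.col("H") + F.col("2B") + 2 * F.col("3B") + 3 * F.col("HR")
    return (bat.filter((F.col("yearID") == year) & (F.col("AB") > 0))
               .withColumn("f1", p * tb / q)
               .withColumn("f2", p * (F.col("BB") - F.col("IBB") + F.col("HBP")) / q)
               .withColumn("f3", p * (F.col("SH") + F.col("SF") + F.col("SB")) / q)
               .groupBy("teamID")
               .agg(F.sum("f1").alias("f1"), F.sum("f2").alias("f2"),
                    F.sum("f3").alias("f3"))
               .join(teams.filter(F.col("yearID") == year)
                          .select("teamID", F.col("R").alias("label")), "teamID"))

asm = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
# fitIntercept=False keeps the model in the same form as the RC formula
model = LinearRegression(labelCol="label", fitIntercept=False).fit(
    asm.transform(features(2017)))
print("B, C, D:", list(model.coefficients))           # original: 1, 0.26, 0.52
print("2018 RMSE:", model.evaluate(asm.transform(features(2018))).rootMeanSquaredError)
```

The fitted coefficients can then be dropped into the Phase 1 report code in place of 1, 0.26, and 0.52.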

2.4 Phase 3 - Data Science Portion

Calculating RC and RC27 for pitchers is more difficult since the Lahman data does not contain the needed information. For example, the data does not include the singles, doubles, and triples allowed by pitchers. In order to generate this information, extra work is required. There are four basic possibilities:

1. Find the data - The missing data may be present in another available repository. You can download the information and store it within the /user/〈 your last name 〉/DS directory in HDFS. You can then combine data from the two repositories as needed (some caveats apply).

2. Generate the data - Retrosheet.org contains every play from every major league baseball game going back to 1878. The data can be downloaded and accessed, with the missing data calculated (or all of the data). The new data can be stored in the /user/〈 your last name 〉/DS directory in HDFS.

3. Ignore the data - The RC and RC27 formulas can be modified to work without the missing data. In this case, you must apply the modified formula to the batter data and calculate the RMSE (root mean square error) between the new formula and the original formula (a sketch of the RMSE computation follows this list). The size of the RMSE will impact the grade. The results of applying your new formula to the data can be stored in the /user/〈 your last name 〉/DS directory in HDFS.

4. Use 3rd party software to parse the Retrosheet data. You can then integrate the resulting data into the Lahman data as in option 1 and store it within the /user/〈 your last name 〉/DS directory in HDFS.
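For option 3, the RMSE comparison reduces to one map and one reduce once both formulas have been evaluated per batter. A minimal sketch; the placeholder pairs stand in for the (original RC, modified RC) values your Phase 1 code would actually produce:

```python
# Option 3 sketch: RMSE between the original and modified RC formulas.
from math import sqrt
from pyspark import SparkContext

sc = SparkContext(appName="Phase3RMSE")

# (original RC, modified RC) per batter; in the real program these come
# from mapping both formulas over the Phase 1 batter data.  The literals
# below are placeholders so the sketch runs on its own.
rc_pairs = sc.parallelize([(145.03, 141.2), (150.2, 147.9), (64.46, 61.8)])

n = rc_pairs.count()
sse = rc_pairs.map(lambda p: (p[0] - p[1]) ** 2).reduce(lambda a, b: a + b)
print("RMSE:", sqrt(sse / n))
```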

However the missing data is handled, you will then write a pySpark program to generate RC and RC27 data for every pitcher. Similar to phase 1, the user should be allowed to set the following limits:

• The minimum number of batters faced (default 0; any number greater than 0 accepted)

• The attribute to be sorted (default RC; RC27 also accepted)

• The number of players to be listed (default all; any number greater than 0 accepted)

• The year of interest (no default; error if not provided)

The output should be an HDFS file (or collection of HDFS files) stored in /user/〈 your last name 〉/DS.

3 Submission

Every program should be executable from the command line in a Linux environment. I would suggest writing a small test program first to make sure you can read the data as stored (a sketch appears at the end of this section). I found success with the command

sc.textFile("hdfs://localhost:8020/user/baseball/Batting.csv")

where sc is a SparkContext. You should submit in a zip file:

• three pySpark report generating programs, one for each phase

• the pySpark linear regression program for phase 2

• EITHER:

– the pySpark program to combine the Lahman data with your new data (option 1), OR

– the pySpark program to generate the data from Retrosheet (option 2), OR

– the pySpark program to compute the RMSE (option 3), OR

– the link to the parser program and the pySpark program to combine the Lahman data with the new data (option 4)

• A 2 page (max) report on the linear regression of phase 2 and your methodology for phase 3
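For the suggested read test, something along these lines is enough; it only confirms that the HDFS path is reachable and that Batting.csv parses as text:

```python
# Smoke test: confirm the Lahman data is readable from HDFS before
# building the real reports.
from pyspark import SparkContext

sc = SparkContext(appName="SmokeTest")
batting = sc.textFile("hdfs://localhost:8020/user/baseball/Batting.csv")
print(batting.count(), "lines")
for line in batting.take(3):      # header plus the first two data rows
    print(line)
```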