Understanding Baseball, an Exercise in Data Science
Dr. Chris Edwards, January 31, 2020

The website retrosheet.org has downloads of play-by-play information for over 100 years of baseball seasons. The information can be used to learn about the game of baseball, at least as it is played at the Major League level. My interest in the data is in creating a method to simulate how the game plays. There are many table-top baseball games (Strat-o-Matic, APBA, and Statis Pro, among others) which allow fans to recreate games or seasons, or even to conduct hypothetical matchups between, for instance, Babe Ruth and Nolan Ryan. The game of baseball is rich in record keeping, and this is its appeal to many fans.

Other uses of this data could include learning strategies to win more games. Do managers use the proper tactics? Could we improve how the game is managed by learning what actions in the past produced success? Without being able to properly experiment, we are forced to rely on the record of observational data. However, if we can discover patterns, they may be helpful in answering some of these questions on strategy.

Initial data gathering

The appendix to this document contains a sample game of www.retrosheet.org data. This particular game is from April 14, 1978 and featured the visiting San Francisco Giants against the home team San Diego Padres. This game's data file consists of 158 rows of information: some of it is data about the game, some about the players participating, and some about events that occurred during the game. (In this game there were 101 events, which is the bulk of the data one would analyze for trends about how the game plays.) The data exists in CSV (comma-separated values) format, so it is quite easy for various computer applications to read and store.
Given that each Major League baseball team is scheduled to play 162 games in a season, and there are presently 30 teams, there is a tremendous number of lines of data if one were to look at an entire season. For example, in the 1978 season, there are over 300,000 lines of information. Loading all of the game data into a computer isn't the challenging part of this exercise; rather, the fact that the play-by-play data is simply represented as a code without context is what makes the following analyses challenging. For example, in our sample game, the first inning's information is coded thus:

    play,1,0,hernl001,??,,63
    play,1,0,andrr101,??,,53
    play,1,0,evand001,??,,6/L
    play,1,1,richg001,??,,6
    play,1,1,thomd001,??,,S4
    play,1,1,gambo001,??,,NP
    sub,heint101,"Tom Heintzelman",0,2,4
    com,"Rob Andrews left with split nail on right thumb"
    play,1,1,gambo001,??,,7
    play,1,1,winfd001,??,,5

The first batter was Larry Herndon, and he hit a ground ball to the shortstop, who threw to the first baseman for the first out. There are up to 7 pieces of information on each line, some of it missing for some years. The first line of this snippet has 5 pieces of information present: the record is a play; in the first inning, 1; with the visiting team batting, 0; the batter was Larry Herndon, hernl001; and the result of the play was 63. There is no record of how many outs there are, or what the score is, or even whether the play resulted in an out or something else. The user is expected to understand baseball well enough to puzzle out that information. As such, this data requires a tremendous amount of interpretation before any meaningful analysis can even be attempted. Furthermore, we see that sometimes during the game players are replaced, either for tactical reasons (bringing in a new pitcher, for example) or for injuries (as in the above snippet, where Rob Andrews was replaced by Tom Heintzelman). Other entries include comments, game info, starters, and end-of-game data.
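The field layout just described can be split apart mechanically. Here is a minimal Python sketch (the author's own code is in R; the `parse_play` helper and the field names are my own labels for the layout, not retrosheet's):

```python
# Minimal sketch: split one retrosheet "play" record into named fields.
# The key names here are descriptive labels of my own, not official ones.

def parse_play(line):
    fields = line.strip().split(",")
    if fields[0] != "play":
        raise ValueError("not a play record: " + line)
    keys = ["type", "inning", "batting_team", "batter_id",
            "count", "pitches", "result"]
    return dict(zip(keys, fields))

record = parse_play("play,1,0,hernl001,??,,63")
print(record["batter_id"], record["result"])   # hernl001 63
```

Note that splitting the line is the easy part; interpreting the result code ("63" means ground ball, shortstop to first baseman) is where the real work lies.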
Before any analysis can be attempted, then, this data needs a lot of attention. While it is easy to find out how a particular batter fared over the course of a season by creating a frequency table, the missing context is quite important. For example, we can find that Dave Winfield of the San Diego Padres had singles in 18.21% of his appearances in the record. What is missing, however, is how and when these singles happened. Were they against good pitchers? Were they hit when the game result was in doubt, when they would be most useful, or did they happen in blowouts?

    > tab( a$Single, a$Batter == "Dave WinfieldSDP" )
       b
    a    FALSE   TRUE
      0 138803    548
      1  26224    122
       b
    a     FALSE   TRUE
      0  0.8411 0.8179
      1  0.1589 0.1821

I needed much more information than was presented in the data download. I had to create context as well.

Transformation issues

In addition to the need to create context, there were other problems with the information. There are many examples in this dataset where the coding of the results was inconsistent. For example, "B-1" is used to indicate that the batter ended up at first base on the play. However, it is often only used when the play needed some clarity in the explanation. For instance, when a batter gets a single, it is implied that he ended up at first base, so the scorers never include the "B-1". If there was a fielding error by the first baseman, though, coded as an "E3", it is not clear in every case where the batter ended up. (Sometimes, when the first baseman makes a fielding error, such as letting the ball go between his legs so that the right fielder eventually makes the play, the batter ends up on second base, and the scorer would likely have coded it as "E3.B-2".) Other times, the codes are inconsistent between different retrosheet scorers, so that different entries represent the same game outcome. I have often heard that "80 percent of a data scientist's time is spent cleaning the data". That was very much true in this project.
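The two-way table produced by R's table() above can be mimicked with a few lines of standard-library Python. The rows below are toy data I made up for illustration, not the real 1978 event file, and the player IDs are reused from the sample game:

```python
from collections import Counter

# Toy stand-in for the event data: (batter_id, hit_a_single) pairs.
# Counts are illustrative only, not the figures from the real file.
events = [
    ("winfd001", 1), ("winfd001", 0), ("winfd001", 0),
    ("hernl001", 1), ("hernl001", 0), ("hernl001", 0), ("hernl001", 0),
]

# Cross-tabulate singles against "is this batter Winfield?",
# mirroring the two-way counts and proportions shown above.
table = Counter((batter == "winfd001", single) for batter, single in events)

for is_winfield in (False, True):
    n = table[(is_winfield, 0)] + table[(is_winfield, 1)]
    print(is_winfield, "single rate:", table[(is_winfield, 1)] / n)
```

The same shape of computation, run over the full season, produces the 18.21% figure quoted in the text.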
I needed to use my knowledge of the game of baseball and become a detective to figure out what the various codes really meant. I made copious use of the table() command, and of comparisons with known facts about the season in question. Fortunately, for some data, other websites had aggregated the data, so I could check my work against their values.

As mentioned, in addition to the problem of coding the results into something consistent, I also had to create the "game states" that existed before the result, such as the number of outs in the inning, the score of the game, and what the base situation was (man on first, bases loaded, etc.). One thing I did not have to create was the inning number; fortunately, that was one of the data values coded. Creating these game states would normally be easy, if only the coding were done consistently. In fact, one of the methods I used to determine whether I had covered all coding translations correctly was to count how many innings did not have 3 outs according to my "Outs" variable. An exception to this check is the case of the game ending in the middle of an inning, when there may not yet have been three outs. I also kept track of which players were at each of the nine positions defensively, and whether there were any assists or putouts during the play. Another consistency check was then to compare whether the number of outs calculated was equal to the number of putouts I assigned. For each year of data I examined, making this comparison alerted me to new anomalies in the coding not encountered in the previous data sets.

Other variables created during this "data cleanup" exercise were such things as "Hits", "Outs", "Double Plays", and others that I would add as time passed and I had other questions I wanted answered. For example, some time later in my work I realized I wanted to know under what circumstances a manager might instruct his pitcher to issue an intentional walk.
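The three-outs check described above can be sketched in a few lines. This is a Python illustration under made-up data, not the author's R implementation; the record layout (game, inning, half, outs credited on the play) is my own simplification:

```python
# Sketch of the "every completed half-inning should have 3 outs" check.
# Records are (game_id, inning, half, outs_credited_on_play); toy data only.
plays = [
    ("SFN@SDN", 1, 0, 1), ("SFN@SDN", 1, 0, 1), ("SFN@SDN", 1, 0, 1),
    ("SFN@SDN", 1, 1, 1), ("SFN@SDN", 1, 1, 1),   # only 2 outs recorded
]

outs = {}
for game, inning, half, n in plays:
    key = (game, inning, half)
    outs[key] = outs.get(key, 0) + n

# Half-innings whose outs do not sum to 3 either ended the game early
# or contain a miscoded play worth investigating.
suspect = {key: total for key, total in outs.items() if total != 3}
print(suspect)
```

Any key that lands in `suspect` is either the legitimate end-of-game exception noted above or a translation the cleanup code has not yet handled.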
I had to go back and break the "Walks" variable into two variables, one for "Intentional Walks" and one for "Other Walks". Fortunately, using R to code the transformations meant that I just had to add the new line and rerun the entire exercise.

    # W: any walk (result starts with "W" or "IW"), excluding wild pitches ("WP")
    master$W <- ifelse( ( substr( master$Result, 1, 1 ) == "W" |
                          substr( master$Result, 1, 2 ) == "IW" ) &
                        substr( master$Result, 1, 2 ) != "WP", 1, 0 )
    # IW: intentional walks only
    master$IW <- ifelse( substr( master$Result, 1, 2 ) == "IW", 1, 0 )

As mentioned, after the massive effort of cleaning up the result codes and creating the game state variables, I had to perform consistency checks to verify that I had reasonable data to analyze. This involved such things as verifying that there was a "man on first base" in my variables whenever a "1-2" (which means a man on first base ended up on second base after the play) was coded in the result. In some cases, my original coding transformations were miscoding some events, and these consistency checks helped to catch the mistakes. I also wanted to add some columns to the data set representing information about the particular players involved in the current play. There are statistics available for each player (fielder, pitcher, batter, etc.) that can be used to understand the play.
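The runner-advance consistency check described above can be expressed very compactly. This Python sketch uses a hypothetical helper of my own naming, not the author's R code:

```python
# Sketch of the runner-advance check: a "1-2" in the result code claims
# a runner went from first to second, so the derived game state must
# already show a man on first before the play.

def advance_is_consistent(result_code, man_on_first):
    """Return False when '1-2' appears without a runner on first."""
    if "1-2" in result_code:
        return man_on_first
    return True   # the result makes no claim about first base

print(advance_is_consistent("S8.1-2", man_on_first=True))    # consistent
print(advance_is_consistent("S8.1-2", man_on_first=False))   # flags a miscoding
```

Running such a predicate over every row is cheap, and each failure points either at a scorer's inconsistency or at a bug in the state-building transformations.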