
Cella 1

Scott T. Cella MAT 8435 Final Project

Abstract

Statistics have always played a vital role in the game of baseball, but it wasn't until recently that numbers came to dominate decision making within the sport. Sabermetrics is a way to organize this data to exercise optimal decision making, even if it means changing a popular batting order or pinch-hitting for a superstar late in the game. This paper will consider three models (basic, intermediate, and complex) that attempt to optimize the scoring of runs, and hence increase the number of victories for any given team. Each model uses the probability tool of Markov chains, which resemble the spreadsheets used by statisticians. Although baseball consists of defense (primarily pitching) as well, we will focus our full attention on offensive production.

Introduction

In 1997, the Oakland Athletics hired long-time scout and former player Billy Beane as their general manager (GM). When it became clear that Oakland did not have the luxury to sign marquee free agents, Beane took on a new strategy that came to be known as "moneyball"; that is, he extended contracts to lesser-known players who had high probabilities of getting on base. Though criticized by fellow GMs and the Oakland fan base, the team went on to win twenty consecutive games, the most by any team since the 1935 Chicago Cubs. Sabermetrics, which had been analyzed only by statisticians prior to this, crept its way into baseball and has changed the game ever since.

Behind each major league ball club is a minor league system populated with young talent, but many of these players are considered 'trade bait' for the GMs to discuss. GMs get their input from team scouts who watch the minor league games, collecting data on players of interest. If a player has tremendous statistics, he can either be promoted to the majors or traded for a significant price. These stats can be inserted into a Markov chain to analyze how a player performs at different stages in the game; however, with the high number of variables in the game of baseball, these formulas can become quite complicated.

Using such tools, team coaches must make decisions that guide their team to the most victories possible. These decisions include selecting an appropriate batting order, authorizing a base stealer to take a chance, substituting a pinch hitter, etc. We will first examine some of these models, and then analyze how strategic decisions can affect a team's performance. Ultimately, baseball is about winning, so the purpose of each model is to maximize the number of runs collectively scored by a team.

Sabermetrics and Markov Chains

Bill James, one of the founders of a statistically driven system, coined the term sabermetrics from the acronym of the Society for American Baseball Research (SABR). In short, sabermetrics uses objective evidence to explain in-game phenomena in order to help coaches make wise decisions. For instance, if a right-handed pitcher is on the mound late in a game, the manager may pinch-hit a left-handed batter because the probability of getting on base is higher for most players when they bat against an opposite-handed pitcher. As data continued to be collected and analyzed, sports agents began using sabermetrics as a bargaining tool to maximize financial contracts for their players. GMs tend to counter with other sabermetrics in these negotiations so that they can minimize their cost while still signing desired players. With the insertion of such statistics, there is no doubt that baseball is a business.

Historically, scouts have used spreadsheets to compile the necessary data before reporting to the GM or the coaches. A standard score card can track where a player is at every stage in the game. The use of spreadsheets has made it easy to translate this data into a Markov chain. A Markov chain is a mathematical tool that analyzes transitions to and from different states of a process. Due to its 'memoryless' nature, the next state disregards how the process arrived at its current point, and depends only on the current state. A chain is said to be absorbing if there exists a state that is impossible to leave. Obviously, every baseball game that has ever been played has ended at some point, so baseball can be represented as an absorbing Markov chain. With these tools in mind, one can analyze these Markov chains through the eyes of a baseball scout.

Model #1: The Basic Model of Understanding

Since a Markov chain forces each item to be in exactly one state at a time, we can consider twenty-four different states in a baseball half-inning (since each team gets a chance to bat per inning). These twenty-four states come from having zero, one, or two outs at a time and the eight different base occupancy possibilities (3 x 8 = 24). We will denote each state in (runners, outs) form.

                                   Runners
Outs    0        1        2        3        12        13        23        123
0       (0,0)    (1,0)    (2,0)    (3,0)    (12,0)    (13,0)    (23,0)    (123,0)
1       (0,1)    (1,1)    (2,1)    (3,1)    (12,1)    (13,1)    (23,1)    (123,1)
2       (0,2)    (1,2)    (2,2)    (3,2)    (12,2)    (13,2)    (23,2)    (123,2)

Runner notation:

Bases empty                0
Man on first               1
Man on second              2
Man on third               3
Men on first and second    12
Men on first and third     13
Men on second and third    23
Bases loaded               123

Technically, there could exist four additional states in which the third out occurs (scoring no runs, 1 run, 2 runs, or 3 runs), but this will mark the end of a half-inning. Note that this is a certainty in a baseball game, so the probability of reaching a third out state is one (hence, it is an absorbing state).
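As a minimal sketch of the state space described above, the 24 transient states can be enumerated in Python. The names RUNNERS, STATES, and INDEX are illustrative choices and are not part of the original models; the ordering (outs varying slowest) is an assumption made so that states with the same number of outs sit next to each other.

```python
from itertools import product

# Base-occupancy labels in the order used by the state table.
RUNNERS = ["0", "1", "2", "3", "12", "13", "23", "123"]
OUTS = [0, 1, 2]

# Outs vary slowest so that states with the same number of outs are contiguous.
STATES = [(runners, outs) for outs, runners in product(OUTS, RUNNERS)]
INDEX = {state: i for i, state in enumerate(STATES)}

print(len(STATES))        # number of transient states
print(INDEX[("12", 1)])   # position of the (12, 1) state in this ordering
```

With this ordering, a row or column of a transition matrix can be looked up by its (runners, outs) label rather than by a bare integer.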

The keys to the analysis of Markov chains in the field of baseball are the transition matrices. A transition matrix gives the probabilities of moving from one state to the next, using statistics compiled by a player over his career. Note that some of these transitions are impossible. For instance, if a player begins his at-bat in the (12 , 1) state, there is no possible way to enter the (12 , 0) state after the at-bat. This would indicate that there was one out, but now there are zero outs after he has batted. Obviously this cannot happen in a baseball game, so each transition matrix is expected to contain a substantial number of zeros. Ideally, each player should have his own transition matrix with all of the necessary data; however, baseball tends to be a complicated game.
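The point about impossible transitions can be made concrete with a short sketch. It applies only the rule named in the text (the number of outs can never decrease); other impossible transitions exist but are not modeled here, and the helper name can_follow is hypothetical.

```python
RUNNERS = ["0", "1", "2", "3", "12", "13", "23", "123"]
STATES = [(r, o) for o in range(3) for r in RUNNERS]

def can_follow(before, after):
    """Return False for transitions ruled out because outs would decrease."""
    return after[1] >= before[1]

# Count the entries of a 24 x 24 matrix that this rule alone forces to zero.
impossible = sum(1 for b in STATES for a in STATES if not can_follow(b, a))
print(impossible, "of", len(STATES) ** 2)
```

Under this single rule, one third of the entries are already structural zeros, which is why the matrices in the later models have a block upper-triangular shape.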

Sometimes it is important to recognize whether the batter has changed when a transition has occurred. Consider the state (1 , 0), indicating that there is a man on first with zero outs. The (2 , 0) state can be reached in one of two ways: the batter hit a double that scored the runner from first, or the runner on first advanced to second (by stealing, a wild pitch, a passed ball, or a balk). The first instance required a change of batter and a run scored, but the second left the batter unchanged without a runner crossing home plate. Thus each batter would need several transition matrices, but this analysis can be rather overwhelming since baseball has an abundance of variables (Pankin).

The above model has laid the groundwork for basic understanding, but it has many deficiencies that make it impractical. First, the model sheds no light on how to calculate the number of runs scored by a team. Since scoring runs is the objective of offensive players, this is a glaring problem. Secondly, a single transition matrix is insufficient to determine what is happening in the actual game. An optimal solution would have one transition matrix containing all of the necessary data to inform the analyst as to what occurred on the field. In addition, there is no strategy embedded in these chains. Baseball is a sport composed of strategic decisions, and they are neglected in this case. On the positive side, this model gives us a foundation to work with as we venture into a stronger model that interprets the game more completely.

Model #2: The Intermediate Model of Interpretation

A more informative Markov chain for analyzing baseball was formulated by Bukiet, Harold, and Palacios with the hope of looking at an entire team's performance. We will examine individual players in addition to the team, but baseball is a team sport that requires the production of nine players at a time. Using the same states from the previous section, consider the following 25 x 25 transition matrix (combining the four 3-out states into one state):

$$
T = \begin{bmatrix}
A_0 & B_0 & C_0 & D_0 \\
0 & A_1 & B_1 & E_1 \\
0 & 0 & A_2 & F_2 \\
0 & 0 & 0 & 1
\end{bmatrix}
$$

where the $A$, $B$, and $C$ blocks are 8 x 8 matrices and $D_0$, $E_1$, and $F_2$ are column vectors.

The zeros in the second and third row are 8 x 8 matrices with all zero entries, while the final row is a row-matrix consisting of 24 consecutive zeros followed by a one. All subscripts represent the number of outs at the beginning of that state. For each individual 8 x 8 matrix with non-zero entries, let each column represent the base occupancy state mentioned in the basic model. The $A$ matrices represent all plate appearances that result in no additional outs, the $B$ matrices signify at-bats where one out occurs (but not the third out), and the $C$ matrix indicates a double play that does not end the inning. The other three column vectors consist of plays that end the inning ($D_0$ is a triple play, $E_1$ is a one-out double play, and $F_2$ is a traditional play to end the inning).

To simplify the process that follows, let us assume that all players have the same transition matrix (T). For notational purposes, let us denote the probability of moving from state i to state j as $p_{i,j}$. As an example, $p_{(0,0),(2,0)}$ is the probability of a batter hitting a double, plus the probability of earning a walk and then stealing or advancing on an error, a balk, or a wild pitch, plus the probability of a two-base error by the defense.

Due to the nature of matrix multiplication, the transition matrix for two consecutive batters can be denoted as $T^2$, for three consecutive batters as $T^3$, and so forth. Also, we will adopt D'Esopo and Lefkowitz's run production model with the following assumptions:

- A single will score a runner from second or third; man on first advances to second.

- A double scores runners on second and third; man on first advances to third.

- A triple scores everyone on base.

- There is no bunting or stealing of third base.

- Runners do not advance on outs.
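The runner-advancement assumptions listed above can be sketched as a small Python function. This is an illustration rather than the authors' implementation: the function name advance and the set-based encoding of the bases are hypothetical, and the home run rule (batter and all runners score) is an added assumption that the list does not state explicitly.

```python
def advance(bases, hit):
    """Apply one hit under the listed assumptions.

    bases: frozenset of occupied bases, e.g. frozenset({1, 2}).
    hit:   bases taken by the batter (1 single, 2 double, 3 triple, 4 home run).
    Returns (new_bases, runs_scored).
    """
    if hit == 1:
        # Runners on second and third score; a runner on first moves to second.
        runs = len(bases & {2, 3})
        new = {1} | ({2} if 1 in bases else set())
    elif hit == 2:
        # Runners on second and third score; a runner on first moves to third.
        runs = len(bases & {2, 3})
        new = {2} | ({3} if 1 in bases else set())
    elif hit == 3:
        # A triple scores everyone on base.
        runs = len(bases)
        new = {3}
    elif hit == 4:
        # Assumption not in the list: a home run scores the batter and every runner.
        runs = len(bases) + 1
        new = set()
    else:
        raise ValueError("hit must be 1, 2, 3, or 4")
    return frozenset(new), runs

print(advance(frozenset({1, 2}), 1))  # single with men on first and second
print(advance(frozenset({1}), 2))     # double with a man on first
```

Because runners do not advance on outs, an out would leave the base set unchanged and only increase the out count, so it needs no entry in this function.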

With this in mind, the model can attempt to predict the average number of runs scored per game.

Let R be a 28 x 1 column vector giving the expected number of runs that a team scores on a single play from each state (24 entries for the states in our matrix in addition to the four possibilities for the third out). We see the following probabilities occur:

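$$R_{(0,0)} = p_{(0,0),(0,0)}$$ (Only a home run will score a run)

$$R_{(0,1)} = p_{(0,1),(0,1)}$$ (Only a home run will score a run)

$$R_{(1,0)} = 2\,p_{(1,0),(0,0)} + p_{(1,0),(2,0)} + p_{(1,0),(3,0)} + p_{(1,0),(0,1)}$$

(A home run scores two runs; a double or triple scores one run, and a batter thrown out trying to advance on a throw can score one)

Here $p_{i,j}$ denotes the probability of moving from state i to state j. For the four third-out states the entry of R is zero, since no play follows the third out (there will always be a third out).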

The key to this Markov chain is to determine how many runs one can expect a team to score from any state of the half-inning. If a team is in the state (0,2), the probability of scoring even one run is low, so the usage of a pinch hitter would be discouraged at this point. To capture these expectations, let us introduce E, another 28 x 1 column matrix that contains the expected runs in question. Then $E = R + TR + T^2R + \cdots$; that is, the expected number of runs is the sum of the runs expected after one play, two plays, three plays, etc. Unfortunately, this equation is not user-friendly. It is rare for a team to send more than 15 batters to the plate in one half-inning, so the terms with larger exponents are mostly negligible. Therefore we can use matrix inverses to show the following progression:

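$$
\begin{aligned}
E &= (I + T + T^2 + \cdots)R \\
(I - T)E &= (I - T)(I + T + T^2 + \cdots)R \\
(I - T)E &= (I + T + T^2 + \cdots)R - (T + T^2 + T^3 + \cdots)R \\
(I - T)E &= R \\
E &= (I - T)^{-1}R
\end{aligned}
$$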

This provides a much easier equation to manipulate, assuming that I - T has an inverse (which it will in a baseball situation once the absorbing state is set aside). Because there can be no additional runs after a three-out state, we only need to concern ourselves with the first 24 rows and columns of T.
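A minimal numerical sketch of the inverse formula follows. It does not use the 24-state matrix from the text; instead it assumes a deliberately tiny toy chain in which the bases are ignored, every plate appearance is either a home run (hypothetical probability p = 0.1) or an out, and the three transient states are 0, 1, and 2 outs. The solver name solve is illustrative. The sketch compares the truncated series R + TR + T^2 R + ... with the solution of (I - T)E = R.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

p = 0.1  # hypothetical home-run probability; every other plate appearance is an out
# Transient states are 0, 1, and 2 outs (bases are ignored in this toy chain).
T = [[p, 1 - p, 0.0],
     [0.0, p, 1 - p],
     [0.0, 0.0, p]]
R = [p, p, p]  # expected runs on the next play from each state

I_minus_T = [[(1.0 if i == j else 0.0) - T[i][j] for j in range(3)] for i in range(3)]
E = solve(I_minus_T, R)

# Truncated series R + TR + T^2 R + ... for comparison.
series, term = [0.0] * 3, R[:]
for _ in range(200):
    series = [s + t for s, t in zip(series, term)]
    term = [sum(T[i][j] * term[j] for j in range(3)) for i in range(3)]

print(E)
print(series)
```

In this toy chain the expected runs from the start of the inning have the closed form 3p/(1 - p), which gives a quick check on both computations.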

Now we will demonstrate how to read our transition matrix T at the beginning of a regulation, nine-inning baseball game. For simplicity, let us assume that the nine starting players will play the entire game. Let $u_0$ be a 1 x 25 vector with all zero entries, except for a one in the first spot. This represents the fact that the first inning begins with no outs and the bases empty. Likewise, we denote by $u_n$ a similar vector that gives the distribution over states after n batters have taken their turn at the plate. If we simply multiply $u_n T$ (the latter being the current batter's transition matrix), we obtain the probability distribution after n + 1 hitters. The benefit of this model is that it enables us to calculate the score of each inning. To do so, let $U_n$ be a 21 x 25 matrix; we use 21 rows because no team has ever exceeded this number of runs in one inning (not even in Little League, courtesy of the 10-run rule).

The rows represent the number of runs scored while the columns indicate what state the batter was in before scoring these runs.
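The step from n batters to n + 1 batters is a row vector multiplied by the transition matrix. The sketch below illustrates this on an assumed toy chain (bases ignored, home run with hypothetical probability p = 0.1, otherwise an out) with four states: 0, 1, and 2 outs plus the absorbing three-out state. The function name step is illustrative.

```python
p = 0.1  # hypothetical home-run probability; all other plate appearances are outs
# States: 0, 1, 2 outs, and the absorbing three-out state (bases ignored).
T = [[p, 1 - p, 0.0, 0.0],
     [0.0, p, 1 - p, 0.0],
     [0.0, 0.0, p, 1 - p],
     [0.0, 0.0, 0.0, 1.0]]

def step(u, T):
    """Row vector times matrix: the state distribution after one more batter."""
    return [sum(u[i] * T[i][j] for i in range(len(u))) for j in range(len(T[0]))]

u = [1.0, 0.0, 0.0, 0.0]  # the inning starts with no outs
for n in range(1, 4):
    u = step(u, T)
    print(n, [round(x, 4) for x in u])
```

After three batters, the last entry is the probability that the inning is already over, which in this toy chain is the chance of three straight outs.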

Each time that a run scores in an inning, the probability of this occurrence is shifted down one row of the runs-by-state matrix $U_n$. If we let T0, T1, T2, T3, and T4 represent the five matrices associated with zero through four runs scored in a single plate appearance, we can analyze the entries of each. For example, the probability of transitioning from (123, w) to (0, w) with four runs scoring is the entry of T4 in row (123, w) and column (0, w).

This is the chance of a batter earning four RBIs in one plate appearance (by slugging a grand slam or through a plethora of errors by the defense; the w shows that this could happen regardless of the number of outs). Since every half-inning ends after finitely many batters, the entries of $U_n$ will eventually all be zero except in the column indicating the third out. The final column of $U_m$ (where m indicates the end of the inning) then gives the distribution of the runs scored in that particular inning.

We see that calculating a score is as basic as multiplying the unit vector by each player's transition matrix. If $T^{(1)}$ is the transition matrix of the leadoff hitter, $T^{(2)}$ that of the second batter, etc., and $u_0$ is the unit vector for the start of the inning, then the distribution after one inning can be written as $u_0 T^{(1)} T^{(2)} T^{(3)} \cdots$, continuing until a hitter records the third out. If we wanted to extend this idea to an entire game, then we can expand our matrix to have 189 rows, where the first 21 rows are the first inning, rows 22-42 are the second, etc. Once an inning concludes, the number of runs is transferred to the beginning spot in the next inning, and the process repeats. There will be a small error in this analysis because it is possible for teams to score over 20 runs in a game, though this feat is incredibly rare (the 2007 Texas Rangers did have a 30-run game against the Orioles!).
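The run-tracking idea (probability mass moving down one row each time a run scores) can be sketched on an assumed toy chain. The bases are ignored, each plate appearance is a home run with hypothetical probability p = 0.1 or an out otherwise, and the state columns are 0, 1, 2 outs and the absorbing third out. T0 and T1 here are the no-run and one-run parts of that toy matrix, not the matrices from the text, and the helper name row_times is illustrative.

```python
MAX_RUNS = 20
p = 0.1  # hypothetical home-run probability; all other plate appearances are outs

# T0: transitions that score no run (an out); T1: transitions that score one run.
T0 = [[0.0, 1 - p, 0.0, 0.0],
      [0.0, 0.0, 1 - p, 0.0],
      [0.0, 0.0, 0.0, 1 - p],
      [0.0, 0.0, 0.0, 1.0]]
T1 = [[p, 0.0, 0.0, 0.0],
      [0.0, p, 0.0, 0.0],
      [0.0, 0.0, p, 0.0],
      [0.0, 0.0, 0.0, 0.0]]

def row_times(row, M):
    """Multiply a length-4 row vector by a 4 x 4 matrix."""
    return [sum(row[i] * M[i][j] for i in range(4)) for j in range(4)]

# U[r][s] = probability of having scored r runs and being in state s.
U = [[0.0] * 4 for _ in range(MAX_RUNS + 1)]
U[0][0] = 1.0
for _ in range(100):  # far more batters than any realistic half-inning
    new = []
    for r in range(MAX_RUNS + 1):
        stay = row_times(U[r], T0)
        shifted = row_times(U[r - 1], T1) if r > 0 else [0.0] * 4
        new.append([a + b for a, b in zip(stay, shifted)])
    U = new

runs_dist = [U[r][3] for r in range(MAX_RUNS + 1)]  # final column: inning over
print([round(x, 4) for x in runs_dist[:4]])
```

Capping the rows at 20 runs discards a negligible amount of probability in this example, which mirrors the small truncation error the text mentions for full games.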

The intermediate model is more complicated than the basic approach, but statisticians regard it as the method of choice since it explains the game without being overly confusing. However, there are plenty of deficiencies with this model. In order for the transition matrices to be simply calculated, we assumed that each player has the same T. This implies that all players are the same; for example, that a slugger and a pitcher such as Roy Halladay have the same chance of hitting a home run during an arbitrary plate appearance. If we could determine these matrices for individual players, then we could set up an optimal batting order that maximizes run production (which we will see later).

Another concern with this model is that it only tracks one team. Statisticians would prefer to look at one piece of data and understand how the game proceeded rather than continually comparing two matrices. Lastly, we assumed that the nine players who ended the game were the same teammates who began it. In an age of pinch hitting, designated hitters, double switches, and injuries, this is a rare occasion. In recognition of these flaws, the third model will attempt to remedy these concerns with one matrix.

Model #3: The Complex Model of Analysis

The final model that we will examine combines the number of half-innings (9 x 2, since both teams have an offensive chance per inning), the number of outs, the eight base occupancies, and the conclusion of the game. This is a total of

$$9 \times 2 \times 3 \times 8 + 1 = 433$$

states in one Markov chain. For the home team's offensive production, the transition matrix looks like the following:

[Block transition matrix with rows and columns labeled T1, B1, ..., T9, B9, END]

In the above matrix, T and B respectively stand for the top and bottom of each inning; meanwhile, END marks the end of the game with a probability of one, hence an absorbing state. Notice that the top of each inning contains a zero entry, denoting a 24 x 24 matrix with all zero entries. This makes sense since it is impossible for the home team to score while playing defense. The 24 x 24 Q matrices represent the transitions that may happen within a half inning, while the remaining 24 x 24 blocks indicate any transitions to the next inning. F is a 24 x 1 vector that marks any play that may end the game, whether it is the final out or a walk-off game winner (this model does not include extra innings). The matrix that demonstrates the visiting team's performance is similar, with the Q entries trading positions with the 0 entries (Hirotsu).
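The size of this state space can be checked by enumerating it. The sketch below is only an illustration of the counting argument; the tuple layout (inning, half, runners, outs) and the names states and index are assumptions made for the example.

```python
HALF_INNINGS = [(inning, half) for inning in range(1, 10) for half in ("T", "B")]
RUNNERS = ["0", "1", "2", "3", "12", "13", "23", "123"]

states = [(inning, half, runners, outs)
          for inning, half in HALF_INNINGS
          for outs in range(3)
          for runners in RUNNERS]
states.append("END")  # single absorbing state for the end of the game

index = {s: i for i, s in enumerate(states)}
print(len(states))              # 18 half-innings x 24 states, plus END
print(index[(1, "B", "0", 0)])  # first state of the bottom of the first inning
```

Each half-inning occupies a contiguous run of 24 indices in this ordering, which corresponds to the 24 x 24 blocks described above.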

We refer back to D'Esopo and Lefkowitz's run production model (introduced with the intermediate model), which helps us specify the blocks of this matrix. The block forms of Q, the inning-transition matrices, and F are as follows:

[Displayed equations: block forms of Q, the inning-transition matrix, and F]

In these forms, I is the 8 x 8 identity matrix; the zeros outside of the diagonal show that outs do not advance runners. The definitions also involve a 24 x 1 vector and an 8 x 8 matrix. The entries embedded in the F and inning-transition matrices signify the end of each half inning and the transition into the next inning or the end of the game.

Needless to say, this model is extremely complicated! Although it may be more telling than the intermediate model, it is rarely used because statisticians struggle to master the notation. Even though it is more precise, it is not the best model in practice because it carries too much information. Nonetheless, mathematicians and statisticians continue to examine these models as a way to educate GMs and coaches and improve franchises. The analysis below looks at several decisions that managers make on a daily basis, seeking to confirm or reject their success.

Selecting the Optimal Batting Order

Choosing a batting order is a critical aspect of a manager's job, but how does he know which order is the best one? We will define the best order to be the sequence of players that is expected to score the most runs per game. Given that there are 9! = 362,880 possible line-ups, it is not feasible to experiment with each order. In fact, if the same teammates played one game every day, it would take over 994 years to test every batting order! However, most managers share a common philosophy when setting their lineup: the speediest player hits first, followed by the best contact hitter on the team; a high-average hitter takes the third position and the slugger is inserted in the cleanup role. In almost all situations, the pitcher is the last batter. One glaring exception to this philosophy was Tony LaRussa, manager of the World Series-winning St. Louis Cardinals in 2011, who elected to bat his slugger in the third spot, bumping the pitcher up to the eighth hole. Though criticized by some of his colleagues, his strategy was extremely successful. Instead of analyzing all 362,880 possible batting orders, statisticians have looked to create a model to optimize the number of runs scored.
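The number of batting orders and the 994-year figure can be reproduced with a couple of lines of Python. The only assumptions are the ones stated in the text: nine fixed players and one game per day (leap years are ignored here).

```python
import math

orders = math.factorial(9)   # distinct batting orders for nine fixed players
years = orders / 365         # one game per day, ignoring leap years
print(orders, round(years, 1))
```

Exhaustive testing on the field is clearly out of reach, which is why the search is left to simulation and modeling.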

The chart below lists the desirable attributes for each spot in a batting order as recommended by computer simulators. Many of these are obvious choices; for instance, pitchers and sluggers should not bat first and a speedy player should not follow a slow base runner. The sabermetrics key can be found below the chart. A (+) marks a desirable trait and a (-) indicates that the attribute is not highly valued.

Position   1          2          3          4          5          6          7
1          +OBA       +BB/PA     -INPLAY    -HR/H      -SBTRY
2          +SLUG      +OBA       -EBA       +BB/PA     -INPLAY
3          +SLUG      +BB/PA     +INPLAY
4          +SLUG      +OBA       -HR/H
5          +SLUG      -HR/H      +INPLAY    +SBTRY
6          -RC/G      +SLUG      +OBA       +INPLAY    +K/PA      +SBTRY
7          -OBA       +INPLAY    +SBTRY
8          -SLUG      -OBA       -BB/PA     +HR/H      +INPLAY
9          -INPLAY    -K/PA      -SLUG      -OBA       -BB/PA     +1B/H      -SBTRY

*OBA: On base average
*BB/PA: Walks per plate appearance
*INPLAY: Ball in play percentage
*HR/H: Home run ratio
*SBTRY: Stolen base attempt percentage
*SLUG: Slugging percentage
*EBA: Extra base average
*RC/G: Runs created per game
*K/PA: Strikeouts per plate appearance
*1B/H: Singles ratio

This chart immediately raises concern because it suggests that the leadoff hitter should not put the ball in play or steal bases. Ironically, putting speed at the top of a lineup is precisely how GMs build baseball teams. Another puzzling aspect is that the cleanup hitter should not be a home run hitter. A high slugging percentage and a low home run ratio indicate that the fourth-hole batter ought to be a doubles hitter instead of a power threat (Pankin).

The research above has brought about skepticism, so an experiment was run to test whether these results are valid. A simulator was run twice using the 2007 Philadelphia Phillies roster. The first simulation used the actual lineup, while the experimental run used a lineup molded to follow the chart above. The batting orders are listed below, and the statistics were analyzed to see which lineup produced more runs.

Lineup #1              Lineup #2
1.                     1.
2.                     2.
3. Chase Utley         3. Ryan Howard
4. Ryan Howard         4. Jimmy Rollins
5. Pat Burrell         5. Shane Victorino
6.                     6. Jayson Werth
7.                     7.
8. Carlos Ruiz         8. Pedro Feliz
9. Pitcher's Spot      9. Pitcher's Spot

Lineup   Wins   Losses   Average   Runs   Runs/Game   Hits
1        71     91       .256      714    4.41        1436
2        94     68       .264      770    4.75        1487

Oddly enough, the simulator confirmed the research, suggesting that a lineup formulated from the chart will produce higher offensive production in all major statistical categories; however, the size of the gap (23 more wins) is drastic. This could be attributed to poorer pitching during the first simulation, unnoticed injuries, or simply bad fortune with the probabilities. A more realistic conclusion could be drawn if the experiment were repeated many times, but the simulation takes approximately 15 minutes to complete once. The reader may feel free to confirm these results at his/her convenience.

The Smaller Aspects of the Game

The models thus far have been fairly abstract, but it is important to remember that baseball is a concrete sport. Decisions are made and outcomes persist, but we hope that the theory of Markov chains can explain the phenomena on the field. This last section will look at two routine aspects of a game that can determine the difference between a walk-off victory and a trip to extra innings. In many tied ballgames during the ninth inning, the leadoff batter could earn a walk or stroke a single. Instead of having the next batter swing away, a manager has a choice to make. We will first examine the decision whether or not to steal second base, then consider whether a sacrifice bunt would be a better option. These two strategies are usually implemented in the National League due to the fact that there is no designated hitter.

A stolen base can be the spark that ignites an exciting victory, but it can also destroy a team's momentum and scoring opportunity. The decision that a manager makes is always under the media's microscope, but is it statistically wise to endorse this chance?

The data below was compiled from the 1986 Cincinnati Reds regarding their attempts at stealing a base from the (1,0) state.

Ending situation                  Number   Percent   Scoring probability   Expected runs
(0,1) Caught stealing             11       .324      .162                  0.297
(2,0) Successful steal            19       .559      .609                  1.098
(3,0) Successful steal + error    4        .118      .923                  1.533

Weighted average                                     .501                  0.890
Values for (1,0)                                     .471                  1.017
Gain/loss from SB attempt                            0.030                 -0.127
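The weighted averages in the steal table can be reproduced directly from the counts. The sketch below uses the raw counts (11, 19, and 4 of 34 attempts) rather than the rounded percentages, and the baseline values are the ones the table lists for remaining at first base with no outs. The break-even line at the end is an added illustration, not a figure from the source: it ignores the steal-plus-error outcome and asks what success rate would leave expected runs unchanged.

```python
# ending state: (number of attempts, scoring probability, expected runs)
outcomes = {
    "(0,1) caught stealing": (11, 0.162, 0.297),
    "(2,0) successful steal": (19, 0.609, 1.098),
    "(3,0) steal plus error": (4, 0.923, 1.533),
}
baseline_prob, baseline_runs = 0.471, 1.017  # values for staying at first, no outs

attempts = sum(n for n, _, _ in outcomes.values())
prob = sum(n * p for n, p, _ in outcomes.values()) / attempts
runs = sum(n * r for n, _, r in outcomes.values()) / attempts
print(round(prob, 3), round(runs, 3))
print(round(prob - baseline_prob, 3), round(runs - baseline_runs, 3))

# Break-even success rate for expected runs, ignoring the steal-plus-error case.
break_even = (baseline_runs - 0.297) / (1.098 - 0.297)
print(round(break_even, 3))
```

Under these assumptions the runner would need to succeed roughly nine times in ten for the attempt not to cost expected runs, which is consistent with the negative entry in the table.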

Judging from the data, the expected number of runs decreased when the Reds attempted to steal a base, even though the probability of scoring rose slightly. This suggests that a team would score more if the current batter swung away instead of taking pitches for his teammate on first base; but even this may not be the optimal choice if the batter is a weaker hitter or a pitcher. Having a man in scoring position with one out may have a higher scoring index than the (1,0) state, which could be achieved by authorizing a sacrifice bunt. The numbers for the same 1986 Reds team are as follows.

Ending situation               Number   Percent   Scoring probability
(0,2) Double play              3        .077      .071
(1,1) Failed attempt           4        .103      .281
(2,1) Successful sacrifice     28       .718      .411
(12,0) Batter safe             4        .103      .735

So the probability of scoring after a sacrifice bunt is equal to the sum of each percentage multiplied by its respective scoring probability:

$$(.077)(.071) + (.103)(.281) + (.718)(.411) + (.103)(.735) \approx .405.$$

But it is known that the probability of scoring a run from the (1,0) state is .472. Therefore the chances of scoring a run go down after a sacrifice bunt. If a manager had this data for his team, the percentages indicate that the optimal decision is neither to steal nor to bunt, but rather to hit away (Pankin).
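The bunt comparison is a weighted average and can be checked with a short sketch. The counts and scoring probabilities are taken from the bunt table; the variable names are illustrative, and the raw counts (out of 39 attempts) are used in place of the rounded percentages.

```python
# ending state: (number of bunts, probability of scoring afterwards)
bunt_outcomes = {
    "(0,2)": (3, 0.071),
    "(1,1)": (4, 0.281),
    "(2,1)": (28, 0.411),
    "(12,0)": (4, 0.735),
}
hit_away = 0.472  # scoring probability from the (1,0) state

attempts = sum(n for n, _ in bunt_outcomes.values())
after_bunt = sum(n * p for n, p in bunt_outcomes.values()) / attempts
print(round(after_bunt, 3), round(after_bunt - hit_away, 3))
```

The difference is negative, matching the conclusion that the bunt lowered this team's chance of scoring at least one run.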

The expected run probabilities can be found in the table at the end of this document. It is important for the reader to realize that these numbers have been compiled as a case study and not a full research assignment. If we wanted to run similar calculations for present teams, a healthy amount of data would need to be collected and analyzed. Other aspects of the game can also be examined, such as scoring a runner from second base on a single, the effect of turf vs. grass, intentional walks, etc. However, one thing seems clear: the more a manager knows about his team, the better he can lead that team to more victories.

Conclusions

One of the reasons why baseball is such a beloved sport is its unpredictability. With 162 games per team in a single season, anything can happen during any given game because probability runs rampant behind the scenes. In order to bring order to this chaos, statisticians have attempted to build models that explain what happened on the field and how the data can improve a team's performance. The Markov chain approach is a powerful tool that can record this data to help GMs and managers make important decisions with the goal of giving their team a better chance to win. Modeling baseball games is by no means a complete science. Most of the methods mentioned above have remained in academia, but further research can advance sabermetrics and make it a bigger part of the game.

Statistics compiled by the 1986 Cincinnati Reds (Pankin)

No.  Situation  Number  Observed prob. of runs  Observed avg. runs  Markov avg. runs (no pit.)  Markov avg. runs (w/pit.)

1 (0,0) 1397 0.295 0.515 0.570 0.527

2 (0,1) 1000 0.162 0.259 0.297 0.270

3 (0,2) 804 0.071 0.102 0.114 0.103

4 (1,0) 369 0.472 0.900 1.017 0.955

5 (1,1) 395 0.281 0.532 0.622 0.578

6 (1,2) 408 0.145 0.243 0.281 0.254

7 (2,0) 110 0.609 0.955 1.098 1.034

8 (2,1) 202 0.411 0.678 0.632 0.599

9 (2,2) 246 0.236 0.325 0.360 0.330

10 (3,0) 26 0.923 1.423 1.533 1.498

11 (3,1) 69 0.667 0.942 1.015 0.945

12 (3,2) 125 0.272 0.408 0.373 0.355

13 (12,0) 83 0.735 1.590 1.797 1.703

14 (12,1) 127 0.480 1.087 1.148 1.079

15 (12,2) 194 0.242 0.407 0.514 0.469

16 (13,0) 34 0.824 1.353 1.815 1.748

17 (13,1) 66 0.667 1.045 1.255 1.161

18 (13,2) 77 0.234 0.351 0.451 0.391

19 (23,0) 11 0.818 1.909 1.778 1.715

20 (23,1) 55 0.745 1.455 1.413 1.350

21 (23,2) 57 0.404 0.772 0.726 0.715

22 (123,0) 22 0.955 2.091 2.402 2.282

23 (123,1) 55 0.764 1.764 1.966 1.786

24 (123,2) 73 0.288 0.521 0.751 0.654


References

1. B. Bukiet, E.R. Harold and J.L. Palacios: A Markov chain approach to baseball. Operations Research, 45 (1997) 14–23.

2. D.A. D'Esopo and B. Lefkowitz: The distribution of runs in the game of baseball. In S.P. Ladany and R.E. Machol (eds.): Optimal Strategies in Sports (North-Holland, Amsterdam, 1977), 55–62.

3. N. Hirotsu and M. Wright: A Markov chain approach to optimal pinch hitting strategies in a designated hitter rule baseball game. Journal of the Operations Research Society of Japan, 46 (2003), 353–371.

4. M.D. Pankin: Evaluating offensive performance in baseball. Operations Research, 26 (1978), 610–619.