
The Impact of Hitters on Winning and Salary Using Sabermetrics

Zach Mandrik

Professor Deprano

4/11/2017

Economics 490 Directed Research

Abstract

This paper addresses the question of whether advanced sabermetrics and traditional statistics are efficient predictors for estimating a hitter's salary entering free agency. In addition, the research also asks whether these same statistics can significantly predict a team's winning percentage. Using data from 1985-2011, the models found that both traditional and advanced metrics are significant in relation to salary and winning percentage. However, all of the regression models have low explanatory power in their ability to forecast results. I suggest that MLB clubs use these models as a checkpoint for free agent player salaries and/or ability to contribute to a team (winning percentage), rather than as an absolute determination of these variables.


Overview

Introduction

Literature Review

Statistics Review

wOBA

wRC+

BsR

Regression Models

The Data

What Do the Regression Outputs Mean?

Forecasts and Analysis

Conclusion

Suggestions for the Sabermetrics Community

References

Appendix: Graphs and Tables


Introduction

One of the major goals for a baseball franchise, or any professional sports franchise in general, is ultimately to win a championship and bring in fans. Winning typically brings an inflow of revenue, which is what an owner desires. A portion of building a winning baseball team is centered on statistics and analytics. Thanks to the work of Bill James and many other baseball analysts, the development of sabermetrics has revolutionized the way business is done in baseball. Major League Baseball (MLB) front offices year in and year out use analytics to sign players in the off-season to bolster their rosters. It's essentially their job to maximize the efficiency of these investments to ensure their team can compete and win more games. The goal of this research is to develop statistical models that predict a hitter's impact on winning along with forecasting the salary they deserve, using both advanced and traditional statistics.

The invention of advanced statistics, including wins above replacement (WAR), weighted runs created plus (wRC+), weighted on base average (wOBA), and more, provides baseball analysts a deeper understanding of a player's abilities. Using all of the necessary and available data, I want to test two separate relationships (salary and winning percentage) by regressing a multitude of hitting variables. I hypothesize that the statistics chosen will have a strong relationship with both salary and winning percentage. In addition, I hypothesize these models will enable teams to accurately validate whether a player is worth the investment.


Before diving straight into the results of this paper, I plan to introduce similar studies that address my question in a different manner. This leads directly into a review of the advanced statistics chosen, followed by my reasons for choosing these variables. I'll explain the origins of the data and then reveal the results of the experiment.

Lastly, I’ll provide conclusions about sabermetrics and their relationship to salary and winning percentage tied to players entering or continuing free agency.

Literature Review

Past researchers have analyzed the relationships between a variety of baseball performance variables related to pay and winning. A couple of scholarly articles, including Miceli and Huber's (2009) article in the Journal of Quantitative Analysis in Sports, explain that there is indeed a significant relationship between performance and winning. They also concluded that there isn't a strong relationship between pay and performance at the team level.

To test this hypothesis, Nicholas Miceli and Alan Huber used a factor analysis to distinguish which team-level variables should be included in their regressions. The hitting variables chosen based on their analysis included hits, home runs, and walks. After running their models, they found that pay and performance are not strongly related at the team level. They did, however, find a statistically significant relationship between performance variables and individual pay, but the practical importance of the relationships (the R squared) was extremely low.


Miceli and Huber's models and methods focus on the team level rather than the player level to determine where a team should focus its spending. This limits their regression models to using traditional statistics as independent variables to measure predicted salary and winning percentage. My models focus on the use of advanced sabermetrics to test the relationships with salary and winning percentage.

In another academic paper, Chang and Zenilman's "A Study of Sabermetrics in Major League Baseball…" (2013) focuses on the impact of sabermetrics on free agents. They created a hedonic pricing model, which included contract length, player height, stolen bases, on-base plus slugging (OPS), ground into double plays (GDP), and Wins Above Replacement (WAR). With their model, they found that Moneyball theory¹ has had a tangible and lasting impact on MLB player valuations.

Chang and Zenilman (2013) ran regressions using player salary as the dependent variable with all of the previously mentioned independent variables for three different time periods. These time periods were labeled as pre-Moneyball (before 2000), post-Moneyball (2005), and post-post-Moneyball (2011). As a result, they found increasing significance in certain variables, including WAR. As time has passed, WAR has shown an increasing trend in monetary value as well as statistical significance.

Although this paper focuses its attention on multiple variables to create a pricing model, these authors revealed the impact of WAR over time on salaries. I'd assume that if WAR has had an increasing impact on salary, then other advanced statistics will carry a similar trend. It's another reason why I analyze the effects of the other available advanced statistics on salary and winning percentage.

¹ Chang and Zenilman's reference to "Moneyball theory" essentially means sabermetrics.

Statistics Review

wOBA

Tom Tango, the author of The Book, created weighted on base average (wOBA), which essentially goes beyond standard rate statistics like OPS (on base plus slugging) or batting average (AVG). The purpose of wOBA is to measure a hitter's overall offensive value based on the relative values of each distinct offensive event (Tango 2007).

Unlike On-Base Percentage (OBP) or AVG, wOBA treats each offensive outcome with a linear weight to credit the hitter based on the outcome (e.g., a home run carries a weight of about 2.1). I included wOBA in my models because it's easy to comprehend, being scaled similarly to OBP. In addition, the formula is based on a continually changing weight system according to the league average, keeping the statistic current with each passing year.

The weight system that creates the seasonal constants is a core part of building wOBA. It requires calculating run expectancy matrices for each year to correspond to each year's player wOBA. In general, "run expectancy measures the average number of runs scored (through the end of the current inning) given the current base-out state" (Weinberg 2016). These run expectancies essentially derive the weights, which are scaled based on on-base percentage (OBP). A further explanation of the weights and scaling used for wOBA can be found in Weinberg's FanGraphs article, "The Beginner's Guide to Deriving wOBA" (2016).

The formula for wOBA can be found in the appendix as formula 1. It basically multiplies each statistic in the numerator by its corresponding weight: unintentional walks (uBB), hit by pitch (HBP), singles, doubles, triples, and home runs. The denominator is simply at bats (AB) plus walks (BB) minus intentional walks (IBB) plus sacrifice flies (SF) plus hit by pitches (HBP). Since the weights associated with each variable change annually, formula 1 in the appendix does not always contain the same weights from year to year.
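Because the image for formula 1 may not reproduce here, a generic version consistent with the description above is sketched below; the season-specific weights are written as w terms rather than fixed numbers, since they change annually:

$$wOBA = \frac{w_{uBB}\,uBB + w_{HBP}\,HBP + w_{1B}\,1B + w_{2B}\,2B + w_{3B}\,3B + w_{HR}\,HR}{AB + BB - IBB + SF + HBP}$$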

wRC+

In baseball, the only way to win is to score more runs than the opposing team. The sabermetrics community improved on Bill James' runs created metric, which measures a hitter's ability to produce runs, with weighted runs created plus (wRC+) (FanGraphs 2017). Similar to wOBA, wRC+ is a rate statistic that credits a hitter based on the run value of each offensive outcome, but it also controls for run environment (ballpark and league). A further detailed explanation of how these factors are calculated can be found on the FanGraphs website, www.fangraphs.com (2017).

Park factors make wRC+ a highly regarded statistic because it values the hitter based on the ballpark they play in. Every ballpark has different distances, altitudes, and other factors, which is why it may be useful to provide this additional context in a statistic. For instance, a hitter who plays at Coors Field (a hitter's park due to thinner air) won't be credited as strongly as one who hits in Petco Park (a pitcher's park). Because wRC+ is another metric that attempts to capture a player's overall offensive abilities, I used it as a variable in my models.

wRC+ is appealing because it's measured on a scale where league average is set at 100. For instance, if a player has a wRC+ of 150, that means they are 50 percent better than the average player at creating runs for their team. It's an efficient and easy-to-understand statistic for comparing players' offensive abilities. The rule of thumb located in the appendix (Rule of Thumb 1) indicates the rating scale for wRC+.

The formula for wRC+ is from the FanGraphs website and can be located in the appendix (Formula 2). In general terms, wRC+ essentially compares the target player's run production to league average metrics while also including park and league factors calculated by FanGraphs.
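As a rough guide, the FanGraphs formula referenced here takes approximately the following form (a paraphrase, not a substitute for the appendix's formula 2; wRAA is weighted runs above average, lgR/PA is league runs per plate appearance, and PF is the park factor):

$$wRC+ = \frac{\dfrac{wRAA}{PA} + \dfrac{lgR}{PA} + \left(\dfrac{lgR}{PA} - PF\cdot\dfrac{lgR}{PA}\right)}{\text{league } wRC/PA\ \text{(excluding pitchers)}} \times 100$$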

BsR

Hitting isn't the entirety of a player's ability to score or drive in runs. That's why I included BsR, an all-encompassing base-running statistic created by the people at FanGraphs (FanGraphs 2017). BsR is the base-running component of Wins Above Replacement (WAR) and goes beyond stolen bases. It's an improvement over counting stolen bases alone to analyze a player's abilities.

BsR is a blend of three other existing base-running statistics: weighted stolen bases (wSB), ultimate base running (UBR), and weighted ground into double plays (wGDP). The variables wGDP and wSB are self-explanatory, as they are both centered on the league average annually, while UBR evaluates a player based on the run expectancies of advancing (or not) on the bases. Additional information on how these are calculated can be found at FanGraphs.com (2017). Overall, BsR evaluates the impact a player has on the bases in providing runs for his team. The formula (Formula 3) and the rule of thumb for BsR (Rule of Thumb 2) are located in the appendix. The question now becomes, "how do these variables fit into the models?"

Regression Model

The regressions test whether there is any difference in predictive power between traditional and advanced statistics. The goal is to discover the relationships between the dependent variables (salary and winning percentage) and a variety of independent variables. I split some of my models up by traditional statistics and advanced sabermetrics to also check whether there is a difference between the two in regard to statistical significance.


Before running these models, I create correlation matrices for traditional statistics and advanced sabermetrics to avoid multicollinearity in my regressions (see fig. 1 and fig. 2). The figures illustrate the relationships among all of the variables I consider using in my analysis. The colors indicate how strongly or weakly correlated the variables are: darker shades of blue indicate higher correlation, while no color or red indicates a zero or negative correlation between variables.
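As an illustration of how such matrices might be produced in R (the figures note they were created in RStudio), here is a minimal sketch; the data frame fa_data and its column names are assumed placeholders rather than the exact names in my working dataset.

```r
# Minimal sketch of the correlation check; fa_data and the column names
# are assumed placeholders for the merged free agent data.
library(corrplot)

traditional <- fa_data[, c("H", "HR", "BB", "SO", "SB")]
advanced    <- fa_data[, c("wOBA", "wRC_plus", "BsR")]

cor_trad <- cor(traditional, use = "pairwise.complete.obs")
cor_adv  <- cor(advanced,    use = "pairwise.complete.obs")

# Shade each cell by correlation strength, as in figures 1 and 2
corrplot(cor_trad, method = "color")
corrplot(cor_adv,  method = "color")
```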

I decided to use wOBA (weighted on base average), wRC+ (weighted runs created plus), hits (H), and strikeouts (SO) as independent variables based on the matrices. These variables are included for both dependent variables (salary and winning percentage).

I did not use multiple regression in most of my models because the independent variables are correlated with one another in some way, which would influence the coefficients in the models (see regression results).
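Under the same assumed column names, the eight regressions summarized in table 1 would look roughly like the following in R.

```r
# Sketch of models 1-8 from table 1; win_pct, salary, and the predictor
# column names are assumed placeholders for the merged free agent data.
m1 <- lm(win_pct ~ wRC_plus + BsR, data = fa_data)  # Model 1
m2 <- lm(win_pct ~ wRC_plus,       data = fa_data)  # Model 2
m3 <- lm(win_pct ~ wOBA,           data = fa_data)  # Model 3
m4 <- lm(win_pct ~ H + SO,         data = fa_data)  # Model 4
m5 <- lm(salary  ~ wRC_plus + BsR, data = fa_data)  # Model 5
m6 <- lm(salary  ~ wRC_plus,       data = fa_data)  # Model 6
m7 <- lm(salary  ~ wOBA,           data = fa_data)  # Model 7
m8 <- lm(salary  ~ H + SO,         data = fa_data)  # Model 8

summary(m1)  # coefficients, p-values, and R squared as reported in table 1
```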

The Data

I use data from both FanGraphs and Sean Lahman’s baseball databases, which contain numerous tables of data with filtering capabilities. FanGraphs enables users to create custom data tables with any traditional or advanced statistics for any given period of time. The data I use includes various variables from the period 1985-2011 since I want to compare my predictions versus actual results from 2012-2016. After gathering data from FanGraphs, I merged it with all of the data from Lahman’s database.


Lahman's database contains multiple tables of data, including a teams table, master table, batting table, and salary table. To get the FanGraphs data to merge correctly with the salary data, I combine the teams table with the batting table by yearID and teamID. This is because we want to know how many wins a player's team had with that player on the team. I then integrate the salary table within Lahman by playerID.

After successfully performing these tasks, I merge the advanced statistics from FanGraphs with this newly formed table.

Because FanGraphs data has a different way of identifying playerID, I merge the master table with the newly merged team/batter/salary table to include full first and last names. After combining the datasets by season, first name, and last name, I have a working dataset that has advanced metrics, salary, and team wins for each player for the years 1985-2011.
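A sketch of these merge steps in R is shown below; it assumes Lahman-style table and column names (Teams, Batting, Salaries, Master) and a FanGraphs export fg keyed by season and player name, which may differ slightly from the exact objects used in this research.

```r
# Combine the teams and batting tables so each batting row carries team wins
team_bat <- merge(Batting,
                  Teams[, c("yearID", "teamID", "W", "G")],
                  by = c("yearID", "teamID"))
team_bat$win_pct <- team_bat$W / team_bat$G

# Attach salaries, then full names from the master table
with_sal  <- merge(team_bat, Salaries, by = c("yearID", "teamID", "playerID"))
with_name <- merge(with_sal,
                   Master[, c("playerID", "nameFirst", "nameLast")],
                   by = "playerID")

# Join the FanGraphs advanced metrics by season, first name, and last name
fa_data <- merge(with_name, fg,
                 by.x = c("yearID", "nameFirst", "nameLast"),
                 by.y = c("Season", "firstName", "lastName"))

# Keep hitters assumed to be free-agent eligible (age 28 or older, see below)
fa_data <- subset(fa_data, age >= 28)
```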

One last point to mention is that all of the samples include players of age 28 or older. The reason behind this is that the average rookie age is 24 years old and a player must have six years of service to be eligible for free agency (Isaacs 2012). This isn't a given fact but an assumption based on the average age of rookies and the expectation that they stay up for six years of service. The assumption is that a free agent will typically be around 28-32 years of age their first time around.

The final data set includes 2,289 observations with 61 variables. Of course, not every variable was used in this paper, but I decided to include them anyway in case I ever decided to go back. The details of the data can be found in the appendix (Table 5). The table contains the first 20 rows of each variable of the free agent data from 1985-2011.

What Do the Regression Outputs Mean?

All of the variables in the winning percentage regressions are highly statistically significant (illustrated by low p-values). This is a positive sign because the models indicate a positive relationship between winning percentage and the independent variables. However, all of these models have R squared values between .02 and .07, which is fairly low. Although the explanatory power isn't as high as expected, it isn't necessarily a terrible thing because the variables in all of the regressions were still significant.

In all of the regressions, the coefficients produce expected results as far as their relationship to winning percentage. Model 1, for example, indicates that for every additional unit of wRC+, winning percentage increases on average by .00005217, holding all other variables constant. That's a relatively small number (even if scaled) and doesn't necessarily give stellar predictions when applying the model. The good news is there's an indication that wRC+ is statistically significant.
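To put that coefficient in perspective, consider a hypothetical 20-point jump in wRC+ under model 1, holding BsR fixed:

$$\Delta \widehat{WinPct} = 0.00005217 \times 20 \approx 0.001,$$

which over a 162-game season works out to roughly $0.001 \times 162 \approx 0.17$ additional wins, a very small practical effect despite the statistical significance.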

As with winning percentage, all of the variables in the salary regressions are statistically significant (illustrated by low p-values). Again, this informs researchers that there is a relationship between salary and the sabermetric/traditional statistics in each model. Some independent variables in the regressions give expected coefficient signs, while others don't.

For the majority of the models, the independent variables have the expected coefficient signs and statistical relationships with salary. In models 5 and 6, wRC+ demonstrates a positive relationship. However, in model 5, BsR shows a negative coefficient. One would expect BsR on average to increase salary, because the higher the BsR, the better the player is at running the bases. This might be the case because there is a slight negative correlation between BsR and wRC+. Model 8 regresses the traditional statistics and has a similar problem to model 5, in the sense that strikeouts have a positive coefficient for salary. Again, this is probably due to the correlation between hits and strikeouts. Other than these two models, the independent variables carry the expected coefficient signs. The summary for each model can be found in table 1 in the appendix.

Forecasts and Analysis

The subjects for these models include three players of different calibers (All-Star, average, and bench), all from free agent classes beyond 2011. These players are Jose Reyes, Jeff Keppinger, and Mike Aviles. Tables 2-4 contain the actual results versus the predicted results for winning percentage and salary for each player. Though I wouldn't deem these models excellent, they provided moderately accurate results.


Before explaining the results for each player, I must mention that in every case the winning percentage predictions should be taken with a grain of salt. This is because winning percentage is not determined by a single player and depends on a team's structure. The most significant and interesting results from the winning percentage models come from model 3. For all players, we see that wOBA is a significant predictor, in some context, of a team's ability to win. But again, take this lightly because of the caveats around winning percentage. Overall, there is a significant relationship between wOBA and a team's winning percentage.

The accuracy of the salary results varies for each player's case. We'll start with the All-Star case, Jose Reyes. Reyes at the time was classified as a perennial All-Star for his abilities to hit and run the bases. After the 2011 season, he became a free agent and signed a 6-year, $106 million contract with the Miami Marlins (Brisbee 2011). The annual average value of his contract was about $17 million, which is the number I compared the predicted values against. All of the predicted results were in the $4-$6 million range, which isn't anywhere close to $17 million (see table 2). These prediction values aren't necessarily a surprise given that regressions predict the dependent variable based on averages. This might explain why the salary predictions for Jeff Keppinger and Mike Aviles were more accurate.

Jeff Keppinger was never an All-Star in his career but served as an everyday starter for a majority of it. Based on his 2012 numbers, the models closely predict the annual average value of the 3-year, $12 million deal he signed for the 2013 season (Padilla 2012). Out of the salary models (5-8), model 7 (wOBA) has the smallest differential between the actual and predicted results, at $656,243.14 (see table 3). So far, the models slightly overvalue a player with similar abilities to Keppinger.
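As a check on where that figure comes from, the model 7 coefficients in table 1 imply

$$\widehat{Salary} = -7{,}497{,}207 + 34{,}526{,}847 \times wOBA,$$

so a wOBA of roughly .352 (the input implied by the prediction in table 3; Keppinger's exact 2012 wOBA is not reported here) yields about $4.66 million against an actual average annual value of $4 million.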

The last prediction case is Mike Aviles, who served as a starter with the Cleveland Indians in 2013-2014 but later phased into a bench role with them in 2015. After completing his 2-year contract with the Indians, he re-signed with the Tribe in 2015 on a 1-year, $3.5 million contract. As in Keppinger's case, the models were able to predict within roughly $1 million of Aviles' actual contract. However, model 8, which uses hits and strikeouts as independent variables, has the lowest differential (see table 4). One could make the case that traditional statistics are better predictors, but in Keppinger's case the advanced metrics did a better job. Overall, these models predicted well for average and bench players in the league but not as well for All-Star caliber players.

Conclusion

After running these regressions using both advanced sabermetrics and traditional statistics, I believe MLB teams should use these prediction models with caution. Although the variables are significant in each model and provide sufficient results for an average or bench player, neither the sabermetric statistics nor the traditional statistics provide sufficient explanatory power for forecasting. We know this because all of the models had extremely low R squared values. Before concluding matters entirely, there are a few problems with these models, and suggestions, that the sabermetric community could further analyze.

Suggestions for the Sabermetric Community

Aside from the models themselves, there are a couple of other things I would have done differently. One of the concerns throughout the research process was gathering accurate free agent data. To avoid assumptions, I wanted contract data for every free agent player dating back to 1985, since this was as far back as I could go using Lahman's database. Instead, I assumed players hitting free agency were around age 28 or older based on average rookie ages. I would have preferred to identify every player who reached free agency based on service time rather than an age assumption, for validity.

With the proper resources, in my case time, I would have gathered data on every free agent, including the length of their contracts, to ensure sound results. In addition, I believe this experiment could be improved by using a dependent variable other than winning percentage, perhaps marginal wins.

It would be interesting to predict efficiency using marginal wins attached to a player as the dependent variable. This would entail looking at a team's wins before a player signed with the team versus the number of wins attained after the player signed.

Maybe there's a significant econometric model that uses sabermetrics to predict marginal wins and uses those predicted wins in an efficiency formula. Unfortunately, due to time constraints, I couldn't perform this sort of analysis. However, I encourage those in the sabermetrics community to investigate this hypothesis and reveal what they come up with.
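As a purely hypothetical starting point for that analysis, the R sketch below computes marginal wins; signings, its columns, and the one-season-before/after window are assumptions rather than anything done in this paper.

```r
# Team wins the season before each free agent signed versus the season after;
# `signings` (playerID, teamID, yearID, plus the hitter's metrics) is assumed.
key <- function(year, team) paste(year, team)

signings$wins_before <- Teams$W[match(key(signings$yearID - 1, signings$teamID),
                                      key(Teams$yearID, Teams$teamID))]
signings$wins_after  <- Teams$W[match(key(signings$yearID, signings$teamID),
                                      key(Teams$yearID, Teams$teamID))]
signings$marginal_wins <- signings$wins_after - signings$wins_before

# A possible follow-up regression of marginal wins on the same sabermetrics
mw_model <- lm(marginal_wins ~ wOBA + wRC_plus + BsR, data = signings)
summary(mw_model)
```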

Lastly, the most significant downfall of this research is the lack of adjustment for salary inflation. The salary data dates back to 1985, which could have affected the results since salaries are not adjusted to today's dollars. One possible solution would be to run the regressions on only a couple of years of data and use those results to forecast salary. The other option would be to go through and adjust every player's salary to today's dollars, but this could become extremely time consuming. All in all, I encourage aspiring baseball researchers interested in forecasting baseball salaries to test these results and report their findings.
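For that second option, a minimal sketch of the inflation adjustment in R might look like the following, assuming an external table cpi with year and index columns (not part of the data used here):

```r
# Deflate every salary to a common base year's dollars using a CPI table
base_index <- cpi$index[cpi$year == 2016]          # assumed base year
fa_data$real_salary <- fa_data$salary *
  base_index / cpi$index[match(fa_data$yearID, cpi$year)]
```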


Sources

Brisbee, Grant. 2011. "Jose Reyes Contract Details Released." SB Nation. http://www.sbnation.com/2011/12/7/2618435/jose-reyes-contract-details-released

FanGraphs. 2017. "Salary Arbitration." http://www.fangraphs.com/library/business/mlb-salary-arbitration-rules/

Chang, Jason, and Zenilman, Joshua. 2013. "A Study of Sabermetrics in Major League Baseball: The Impact of Moneyball on Free Agent Salaries." Washington University in St. Louis. http://olinblog.wustl.edu/wp-content/uploads/AStudyofSabermetricsinMajorLeagueBaseball.pdf

Krautmann, Anthony, and Solow, John. 2009. "The Dynamics of Performance Over the Duration of Major League Baseball Long-Term Contracts." Journal of Sports Economics. p. 1. http://journals.sagepub.com/doi/pdf/10.1177/1527002508327382

Gennaro, Vince. 2007. "Diamond Dollars: The Economics of Winning in Baseball (Part I)." The Hardball Times. p. 1. http://www.hardballtimes.com/diamond-dollars-the-economics-of-winning-in-baseball-part-1/

Isaacs, Noah. 2012. "Minor League Leaderboard Context." FanGraphs. p. 1. http://www.fangraphs.com/blogs/minor-league-leaderboard-context/

Miceli, Nicholas S., and Huber, Alan D. 2009. "'If the Team Doesn't Win, Nobody Wins:' A Team-Level Analysis of Pay and Performance Relationships in Major League Baseball." Journal of Quantitative Analysis in Sports: Vol. 5, Iss. 2, Article 6. https://www.degruyter.com/downloadpdf/j/jqas.2009.5.2/jqas.2009.5.2.1170/jqas.2009.5.2.1170.pdf

Padilla, Doug. 2012. "Jeff Keppinger Joins White Sox." ESPN. http://www.espn.com/chicago/mlb/story/_/id/8732840/jeff-keppinger-agrees-deal-chicago-white-sox

Tango, Tom, Lichtman, Mitchel, and Dolphin, Andrew. 2007. "Toolshed." The Book: Playing the Percentages in Baseball. p. 29. Potomac Books, Inc.

Weinberg, Neil 2016. “The Beginner’s Guide to Deriving wOBA” FanGraphs http://www.fangraphs.com/library/the-beginners-guide-to-deriving-/

Associated Press. 2014. "Mike Aviles Gets Two-Year Contract." ESPN. http://www.espn.com/mlb/story/_/id/8926748/cleveland-indians-sign-mike-aviles-two-year-deal

MLB Trade Rumors. 2013. "2013 MLB Free Agent Tracker." http://www.mlbtraderumors.com/2013-mlb-free-agent-tracker

FanGraphs. 2013. "What is WAR?" http://www.fangraphs.com/library/misc/war/

FanGraphs. 2017. "Park Factors – 5 Year Regressed." http://www.fangraphs.com/library/park-factors-5-year-regressed/

FanGraphs. 2017. Player pages. http://www.fangraphs.com/statss.aspx?playerid=1943&position=P

FanGraphs. 2017. "wRC and wRC+." http://www.fangraphs.com/library/offense/wrc/

FanGraphs. 2017. "wRC+ and Lessons of Context." http://www.fangraphs.com/library/wrc-and-lessons-of-context/

FanGraphs. 2017. "wOBA." http://www.fangraphs.com/library/offense/woba/


Appendix

Formula 1: wOBA

Formula 2: wRC+

Rule of Thumb 1: wRC+ (source: FanGraphs)

Formula 3: BsR

BsR = wSB + UBR + wGDP

Rule of Thumb 2: BsR (source: FanGraphs)


Fig 1. Correlation Matrix of Traditional Statistics

Created in RStudio

Fig 2. Correlation Matrix of Advanced Sabermetrics

Created in RStudio


Table 1: Summary of models

Model | Dependent Variable | Intercept + Coefficient(s) | R Squared | P-Value
1 | Winning Percentage | .04515 + .00005217 wRC+ + .0001459 BsR | 0.03975 | < 2.2e-16
2 | Winning Percentage | .04523 + .00005164 wRC+ | 0.06762 | < 2.2e-16
3 | Winning Percentage | .40551 + .29881 wOBA | 0.02925 | < 2.2e-16
4 | Winning Percentage | .04524 + .00003256 H + .00001309 SO | 0.02859 | 4.00e-15
5 | Salary | -1005676 + 50201 wRC+ - 118827 BsR | 0.09118 | < 2.2e-16
6 | Salary | -1070287 + 50631 wRC+ | 0.0848 | < 2.2e-16
7 | Salary | -7497207 + 34526847 wOBA | 0.09613 | < 2.2e-16
8 | Salary | -159843 + 18254 H + 25516 SO | 0.05899 | < 2.2e-16

Table 2: Jose Reyes Predictions 2012 (All Star)

Model | Predicted Results | Actual Results | Differential
1 | 0.054 | 0.475 | 0.422
2 | 0.053 | 0.475 | 0.423
3 | 0.518 | 0.475 | 0.043
4 | 0.052 | 0.475 | 0.424
5 | $5,326,725.10 | $17,666,667.00 | $12,339,941.90
6 | $6,119,315.00 | $17,666,667.00 | $11,547,352.00
7 | $5,484,887.47 | $17,666,667.00 | $12,181,779.53
8 | $4,190,287.00 | $17,666,667.00 | $13,476,380.00

Table 3: Jeff Keppinger Predictions 2013 (Average)

Model | Predicted Results | Actual Results | Differential
1 | 0.051 | 0.389 | 0.338
2 | 0.052 | 0.389 | 0.337
3 | 0.406 | 0.389 | 0.017
4 | 0.045 | 0.389 | 0.344
5 | $5,776,533.00 | $4,000,000.00 | $1,776,533.00
6 | $5,410,481.00 | $4,000,000.00 | $1,410,481.00
7 | $4,656,243.14 | $4,000,000.00 | $656,243.14
8 | $2,912,903.00 | $4,000,000.00 | $1,087,097.00

Table 4: Mike Aviles Predictions 2015 (Bench)

Model | Predicted Results | Actual Results | Differential
1 | 0.045 | 0.503 | 0.46
2 | 0.045 | 0.503 | 0.46
3 | 0.406 | 0.503 | 0.10
4 | 0.045 | 0.503 | 0.46
5 | $2,258,857.30 | $3,500,000.00 | $1,241,142.70
6 | $2,473,883.00 | $3,500,000.00 | $1,026,117.00
7 | $1,894,095.38 | $3,500,000.00 | $1,605,904.62
8 | $2,642,031.00 | $3,500,000.00 | $857,969.00

Model 1: WinPct = wRC+ + BsR


Model 2: WinPct = wRC+


Model 3: WinPct = wOBA

Model 4: WinPct = H + SO


Model 5: Salary = wRC+ + BsR

Model 6: Salary = wRC+


Model 7: Salary = wOBA

Model 8: Salary = H + SO


Table 5: Free Agent Data (first 20 rows) Used in Research (1985-2011)
