
Batter Handedness Project - Herb Wilson

Contents

Introduction

Data Upload

Join with Lahman database

Change in Proportion of RHP PA by Year

MLB-wide differences in BA against LHP vs. RHP

Equilibration of Batting Average

Individual variation in splits

Logistic regression using Batting Average splits

Logistic regressions using weighted On-base Average (wOBA)

Summary of Results

Introduction

This project is an exploration of batter performance against like-handed and opposite-handed pitchers. We have long known that, collectively, batters have higher batting averages against opposite-handed pitchers. Differences in performance against left-handed versus right-handed pitchers will be referred to as splits. The general pattern of splits favoring batters who face opposite-handed pitchers masks variability in the magnitude of splits among batters and variability in splits for a single player among seasons. In this contribution, I test the adequacy of split values in predicting batter handedness and then examine individual variability to explore some of the nuances of the relationships. The data come primarily from Retrosheet events data, with the Lahman dataset used for some biographical information such as full name. I used the R programming language for all statistical testing and for the creation of the graphics. A copy of the code is available on request by contacting me at [email protected]

Data Upload

The Retrosheet events data are distributed by year. The first step in the analysis is to read the data for each year into its own dataframe, stitch the yearly dataframes together into a single dataframe with the rbind function, and add column names with the colnames function. For this study, I combined the data from 1955 through 2017, yielding a dataframe with nearly 10 million rows.
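The stacking step described above might look roughly like the following. This is a minimal sketch, not the author's code: the two tiny per-season tables and the column name `game_id` and `pitcher_hand` are hypothetical stand-ins (real Retrosheet event tables carry many more fields), while `batter_RetroID` is the key name that appears in the join messages later in this report.

```r
# Hypothetical stand-ins for two per-season event tables, read without headers.
events_1955 <- data.frame(V1 = c("BAL195504120", "BAL195504120"),
                          V2 = c("playa001", "playb001"),
                          V3 = c("R", "L"),
                          stringsAsFactors = FALSE)
events_1956 <- data.frame(V1 = "BOS195604170",
                          V2 = "playa001",
                          V3 = "R",
                          stringsAsFactors = FALSE)

# Stack every season into one dataframe; do.call() lets rbind take a list,
# which scales to all 63 seasons without writing each one out by hand.
season_list <- list(events_1955, events_1956)
all_events <- do.call(rbind, season_list)

# Attach column names after stacking (names here are illustrative).
colnames(all_events) <- c("game_id", "batter_RetroID", "pitcher_hand")
```

Binding a list in one call avoids repeatedly growing a dataframe inside a loop, which becomes slow as the row count approaches the millions.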

Join with Lahman database

Next, I use the Lahman Master table to get the first and last names of players, joining it to the events data on the Retrosheet ID with the left_join function. I then paste the last and first names together to get a single field for player name. Finally, I drop the joined Master columns I do not need.
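A join of this kind can be sketched as follows, assuming the dplyr package is installed. The player IDs, names and the `event_cd` column are made up for illustration; `retroID`, `nameFirst`, `nameLast` and `birthYear` are column names from the Lahman Master table. The separator used when pasting the names is an assumption, since the text does not state it.

```r
library(dplyr)

# Hypothetical events rows keyed by Retrosheet batter ID.
events <- data.frame(batter_RetroID = c("playa001", "playb001"),
                     event_cd = c(20, 23),
                     stringsAsFactors = FALSE)

# Hypothetical excerpt of the Lahman Master table.
master <- data.frame(retroID = c("playa001", "playb001"),
                     nameFirst = c("Jane", "Sam"),
                     nameLast = c("Doe", "Roe"),
                     birthYear = c(1940, 1942),
                     stringsAsFactors = FALSE)

# Keep every event row and attach the matching biographical columns.
joined <- left_join(events, master, by = c("batter_RetroID" = "retroID"))

# Build one player-name field (separator is an assumption).
joined$player_name <- paste(joined$nameLast, joined$nameFirst, sep = ", ")

# Drop the Master columns that are no longer needed.
joined <- dplyr::select(joined, -nameFirst, -nameLast, -birthYear)
```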

Change in Proportion of RHP PA by Year

I begin by looking at the number of Plate Appearances (PA) against left-handed pitchers (LHP) versus right-handed pitchers (RHP) over the 63 years in this dataframe. I first create a column with the year. Then, I filter out any events that do not pertain to the batter (e.g., , or while batting) to create a PA column. Switch-hitters were removed from the analysis.
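The yearly proportion could be computed along these lines. This sketch rests on assumptions: it takes the year from characters 4 to 7 of the Retrosheet game ID (which has the form team, year, month, day, game number), and it uses a hypothetical logical column `batter_event` to mark rows that count as a plate appearance. The text does not say how the author built either column.

```r
# Hypothetical events: game ID, pitcher hand, and a flag for batter events.
pa <- data.frame(
  game_id = c("BAL195504120", "BAL195504120", "BAL195504120",
              "BOS195604170", "BOS195604170"),
  pitcher_hand = c("R", "R", "L", "R", "L"),
  batter_event = c(TRUE, TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Year is embedded in the game ID after the three-letter team code.
pa$year <- as.integer(substr(pa$game_id, 4, 7))

# Keep only rows that pertain to the batter.
pa <- pa[pa$batter_event, ]

# Proportion of PA against RHP in each year.
prop_rhp <- tapply(pa$pitcher_hand == "R", pa$year, mean)
```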

[Figure: PA against RHP / Total PA by year, 1955 to 2015; y-axis ticks from 0.675 to 0.750.]

MLB-wide differences in BA against LHP vs. RHP

I now examine the magnitude of batter splits against LHP and RHP for each year of the study. I consider LHP first and then RHP. Here are the mean batter splits by year against LHP.
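For concreteness, the two league-level splits plotted in this section can be sketched as below. The counts are invented, and the sketch pools hits and at-bats across all batters of a given hand, which is an assumption: the report does not say whether the yearly values are pooled averages or means of individual batting averages.

```r
# Hypothetical hit and at-bat totals by year, batter hand and pitcher hand.
counts <- data.frame(
  year = rep(c(1955, 1956), each = 4),
  bats = rep(c("L", "R", "L", "R"), times = 2),
  pitcher_hand = rep(c("L", "L", "R", "R"), times = 2),
  H = c(20, 30, 28, 25, 22, 27, 29, 26),
  AB = rep(100, 8),
  stringsAsFactors = FALSE
)

# Splits for one season, following the axis labels of the figures:
#   versus LHP: BA of RHB - BA of LHB
#   versus RHP: BA of LHB - BA of RHB
league_split <- function(d) {
  ba <- function(b, p) {
    s <- d[d$bats == b & d$pitcher_hand == p, ]
    sum(s$H) / sum(s$AB)
  }
  c(vs_LHP = ba("R", "L") - ba("L", "L"),
    vs_RHP = ba("L", "R") - ba("R", "R"))
}

# One row per year, one column per pitcher hand.
splits_by_year <- t(sapply(split(counts, counts$year), league_split))
```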

[Figure: BA of RHB − BA of LHB (versus LHP) by year, 1955 to 2015; y-axis ticks from 0.000 to 0.040.]

Here are the mean batter splits by year against RHP.

[Figure: BA of LHB − BA of RHB (versus RHP) by year, 1955 to 2015; y-axis ticks from 0.010 to 0.030.]

Equilibration of Batting Average

To examine individual variability in splits, I need to consider only players with a minimum number of at-bats (AB) against LHP and against RHP in a season, because batting averages computed from small samples are unreliable. A survey of some randomly chosen players indicates that BA, BA against LHP and BA against RHP begin to stabilize after 100 AB. Therefore, I require that a batter have at least 100 AB against LHP and at least 100 AB against RHP in a season to be included in the analysis of that season.
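The qualification rule and the individual split can be sketched as follows. The players and counts are hypothetical, as are the count column names; `lhp_rhp` is the predictor name that appears in the regression output later in this report, and the split is taken as BA versus LHP minus BA versus RHP, matching the axis labels of the histograms.

```r
# Hypothetical season totals for three batters.
season <- data.frame(
  batter_RetroID = c("playa001", "playb001", "playc001"),
  bats = c("L", "R", "R"),
  AB_vs_LHP = c(120, 95, 150),
  H_vs_LHP = c(30, 30, 45),
  AB_vs_RHP = c(400, 380, 300),
  H_vs_RHP = c(120, 100, 78),
  stringsAsFactors = FALSE
)

# Require at least 100 AB against each pitcher hand.
min_ab <- 100
qualified <- season[season$AB_vs_LHP >= min_ab & season$AB_vs_RHP >= min_ab, ]

# Individual split: BA versus LHP minus BA versus RHP.
qualified$lhp_rhp <- qualified$H_vs_LHP / qualified$AB_vs_LHP -
  qualified$H_vs_RHP / qualified$AB_vs_RHP
```

In this toy table the second batter is dropped for having only 95 AB against LHP, regardless of the 380 AB against RHP.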

Individual variation in splits

The next three sets of histograms show the distribution of individual splits (BA against LHP minus BA against RHP) for left-handed and right-handed batters. I randomly chose three seasons to present. A clear pattern is that the proportion of LH batters with opposite splits (higher BA against LHP) is lower than the proportion of RH batters with opposite splits (higher BA against RHP).

[Figure: Data from 1971 MLB Season. Histograms of (Batting Average versus LHP) − (Batting Average versus RHP), in separate panels for L and R batters; y-axis: Frequency.]

[Figure: Data from 2004 MLB Season. Histograms of (Batting Average versus LHP) − (Batting Average versus RHP), in separate panels for L and R batters; y-axis: Frequency.]

[Figure: Data from 1970 MLB Season. Histograms of (Batting Average versus LHP) − (Batting Average versus RHP), in separate panels for L and R batters; y-axis: Frequency.]

Logistic regression using Batting Average splits.
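Each model in this section has the form `glm(bat_code ~ lhp_rhp, family = binomial, data = full_splits)`, as the calls in the output show. The sketch below fits the same formula to simulated data so that it can be run on its own. The data are random, not Retrosheet data, and the coding of `bat_code` as 1 for right-handed and 0 for left-handed batters is an assumption inferred from the figure axis labels.

```r
set.seed(1)
n <- 150

# Simulated handedness: 1 = right-handed, 0 = left-handed (assumed coding).
bat_code <- rbinom(n, 1, 0.7)

# Simulated split (BA vs LHP minus BA vs RHP), shifted by handedness.
lhp_rhp <- rnorm(n, mean = ifelse(bat_code == 1, 0.02, -0.03), sd = 0.04)

full_splits <- data.frame(bat_code, lhp_rhp)

# Logistic regression of handedness on the split.
fit <- glm(bat_code ~ lhp_rhp, family = binomial, data = full_splits)
summary(fit)

# Predicted probability of a right-handed batter at an even split.
p_even <- predict(fit, newdata = data.frame(lhp_rhp = 0), type = "response")
```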

## Call:  glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Coefficients:
## (Intercept)      lhp_rhp
##       1.561       28.355
##
## Degrees of Freedom: 152 Total (i.e. Null);  151 Residual
## Null Deviance:     162.1
## Residual Deviance: 126   AIC: 130
##
## Call:
## glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.6604   0.1244   0.4248   0.6306   1.9995
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   1.5613     0.2564   6.090 1.13e-09 ***
## lhp_rhp      28.3555     5.7860   4.901 9.55e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 162.09  on 152  degrees of freedom
## Residual deviance: 126.03  on 151  degrees of freedom
## AIC: 130.03
##
## Number of Fisher Scoring iterations: 5

[Figure: Data from 1970 MLB Season. x-axis: Batting Average vs LH Pitching − Batting Average vs RH Pitching; y-axis: Log(Odds of RH Hitter/Odds of LH Hitter), ticks from 0.00 to 1.00.]

## Call:  glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Coefficients:
## (Intercept)      lhp_rhp
##       0.928       17.282
##
## Degrees of Freedom: 155 Total (i.e. Null);  154 Residual
## Null Deviance:     192.6
## Residual Deviance: 167.8   AIC: 171.8
##
## Call:
## glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.2596  -0.9671   0.5951   0.7915   1.5595
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.9280     0.1958   4.739 2.14e-06 ***
## lhp_rhp      17.2824     3.8643   4.472 7.74e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 192.58  on 155  degrees of freedom
## Residual deviance: 167.77  on 154  degrees of freedom
## AIC: 171.77
##
## Number of Fisher Scoring iterations: 4

[Figure: Data from 1999 MLB Season. x-axis: Batting Average vs LH Pitching − Batting Average vs RH Pitching; y-axis: Log(Odds of RH Hitter/Odds of LH Hitter), ticks from 0.00 to 1.00.]

## Call:  glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Coefficients:
## (Intercept)      lhp_rhp
##       1.596       40.488
##
## Degrees of Freedom: 113 Total (i.e. Null);  112 Residual
## Null Deviance:     119.9
## Residual Deviance: 80.07   AIC: 84.07
##
## Call:
## glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -3.00875   0.07691   0.30980   0.54284   1.63588
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   1.5958     0.3236   4.931 8.18e-07 ***
## lhp_rhp      40.4879     8.5643   4.728 2.27e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 119.932  on 113  degrees of freedom
## Residual deviance:  80.072  on 112  degrees of freedom
## AIC: 84.072
##
## Number of Fisher Scoring iterations: 6

[Figure: Data from 1961 MLB Season. x-axis: Batting Average vs LH Pitching − Batting Average vs RH Pitching; y-axis: Log(Odds of RH Hitter/Odds of LH Hitter), ticks from 0.00 to 1.00.]
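The reported coefficients can be read more directly by converting them to probabilities. The sketch below takes the intercepts and slopes from the three fits above and computes two quantities: the predicted probability of a right-handed batter at an even split, and the split at which the model is indifferent between the two hands. It assumes `bat_code` is coded 1 for right-handed batters, which the report does not state explicitly.

```r
# Coefficients copied from the three fitted models reported above.
fits <- data.frame(
  season = c(1970, 1999, 1961),
  intercept = c(1.5613, 0.9280, 1.5958),
  slope = c(28.3555, 17.2824, 40.4879)
)

# At an even split (lhp_rhp = 0) the linear predictor is the intercept,
# so the predicted probability is the inverse logit of the intercept.
fits$p_rh_even <- plogis(fits$intercept)

# The predicted probability equals 0.5 where intercept + slope * split = 0.
fits$split_at_half <- -fits$intercept / fits$slope

print(fits)
```

In all three seasons the even-split probability is above one half and the indifference point is a negative split, which is another way of stating the asymmetry in the fitted curves.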

Logistic regressions using weighted On-base Average (wOBA)

Batting Average is a rather coarse measure of batting performance. I calculated weighted on-base averages (wOBA) for all players. wOBA gives a more nuanced assessment of batting performance by incorporating walks and power. I fit logistic regressions to wOBA splits for the same seasons analyzed above with batting average splits (1970, 1999, 1961) so that the strength of the two sets of models can be compared.

Logistic Regression for wOBA splits for the 1970 MLB season

## Call:  glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Coefficients:
## (Intercept)      lhp_rhp
##       1.561       28.355
##
## Degrees of Freedom: 152 Total (i.e. Null);  151 Residual
## Null Deviance:     162.1
## Residual Deviance: 126   AIC: 130
##
## Call:
## glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.6604   0.1244   0.4248   0.6306   1.9995
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   1.5613     0.2564   6.090 1.13e-09 ***
## lhp_rhp      28.3555     5.7860   4.901 9.55e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 162.09  on 152  degrees of freedom
## Residual deviance: 126.03  on 151  degrees of freedom
## AIC: 130.03
##
## Number of Fisher Scoring iterations: 5

[Figure: Data from 1970 MLB Season. x-axis: wOBA vs LH Pitching − wOBA vs RH Pitching; y-axis: Log(Odds of RH Hitter/Odds of LH Hitter), ticks from 0.00 to 1.00.]

[Figure: 1970 MLB Season. Histograms of wOBA versus LHP − wOBA versus RHP, in separate panels for L and R batters; y-axis: Frequency.]

Logistic Regression for wOBA splits for the 1999 MLB season

## Call:  glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Coefficients:
## (Intercept)      lhp_rhp
##       0.928       17.282
##
## Degrees of Freedom: 155 Total (i.e. Null);  154 Residual
## Null Deviance:     192.6
## Residual Deviance: 167.8   AIC: 171.8
##
## Call:
## glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.2596  -0.9671   0.5951   0.7915   1.5595
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.9280     0.1958   4.739 2.14e-06 ***
## lhp_rhp      17.2824     3.8643   4.472 7.74e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 192.58  on 155  degrees of freedom
## Residual deviance: 167.77  on 154  degrees of freedom
## AIC: 171.77
##
## Number of Fisher Scoring iterations: 4

[Figure: Data from 1999 MLB Season. x-axis: wOBA vs LH Pitching − wOBA vs RH Pitching; y-axis: Log(Odds of RH Hitter/Odds of LH Hitter), ticks from 0.00 to 1.00.]

[Figure: 1999 MLB Season. Histograms of wOBA versus LHP − wOBA versus RHP, in separate panels for L and R batters; y-axis: Frequency.]

Logistic Regression for wOBA splits for the 1961 MLB season

## Call:  glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Coefficients:
## (Intercept)      lhp_rhp
##       1.596       40.488
##
## Degrees of Freedom: 113 Total (i.e. Null);  112 Residual
## Null Deviance:     119.9
## Residual Deviance: 80.07   AIC: 84.07
##
## Call:
## glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -3.00875   0.07691   0.30980   0.54284   1.63588
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   1.5958     0.3236   4.931 8.18e-07 ***
## lhp_rhp      40.4879     8.5643   4.728 2.27e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 119.932  on 113  degrees of freedom
## Residual deviance:  80.072  on 112  degrees of freedom
## AIC: 84.072
##
## Number of Fisher Scoring iterations: 6

[Figure: Data from 1961 MLB Season. x-axis: wOBA vs LH Pitching − wOBA vs RH Pitching; y-axis: Log(Odds of RH Hitter/Odds of LH Hitter), ticks from 0.00 to 1.00.]

[Figure: 1961 MLB Season. Histograms of wOBA versus LHP − wOBA versus RHP, in separate panels for L and R batters; y-axis: Frequency.]
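The wOBA splits used in this section could be computed along the following lines. This is a sketch under stated assumptions rather than the author's calculation: it uses the commonly published form of wOBA (linear weights on unintentional walks, hit-by-pitch, singles, doubles, triples and home runs, divided by AB + BB − IBB + SF + HBP), and the default weights are illustrative values only. Published weights differ by season, and the report does not say which weights were used. The function name and argument names are hypothetical.

```r
# wOBA from counting stats. Default weights are illustrative placeholders;
# in practice season-specific weights would be substituted.
woba <- function(BB, IBB, HBP, X1B, X2B, X3B, HR, AB, SF,
                 w = c(uBB = 0.69, HBP = 0.72, X1B = 0.89,
                       X2B = 1.27, X3B = 1.62, HR = 2.10)) {
  uBB <- BB - IBB  # unintentional walks
  num <- w[["uBB"]] * uBB + w[["HBP"]] * HBP + w[["X1B"]] * X1B +
    w[["X2B"]] * X2B + w[["X3B"]] * X3B + w[["HR"]] * HR
  den <- AB + BB - IBB + SF + HBP
  num / den
}

# Hypothetical season lines for one batter against each pitcher hand.
woba_vs_lhp <- woba(BB = 10, IBB = 1, HBP = 1, X1B = 20, X2B = 6,
                    X3B = 1, HR = 4, AB = 120, SF = 1)
woba_vs_rhp <- woba(BB = 40, IBB = 4, HBP = 3, X1B = 70, X2B = 20,
                    X3B = 3, HR = 15, AB = 400, SF = 4)

# The split is defined the same way as for batting average.
woba_split <- woba_vs_lhp - woba_vs_rhp
```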

Summary of Results

1) MLB batters face right-handed pitchers more than twice as often as left-handed pitchers.

2) Average batting splits have varied around an equilibrium value over the period 1955-2017 for left-handed batters. We see a decrease in average splits over the past 20 years for right-handed batters. There is enough variation to necessitate consideration of individual seasons rather than combining all the data.

3) At the individual level, a smaller proportion of left-handed batters are opposite hitters (hitting LHP better than RHP) compared to the proportion of right-handed batters who are opposite hitters (hitting RHP better than LHP). Greater experience against RHP may explain this pattern for right-handed batters.

4) Logistic regression, a technique to predict binary outcomes (in this case, left-handed or right-handed batter), was applied to each of the 63 seasons, testing the ability of BA splits to predict the handedness of a batter. In every case, the logistic regression was statistically significant (p < 0.05). However, the regressions showed a consistent asymmetry: even splits (BA versus LHP = BA versus RHP) predict a right-handed batter. This shift in the regression curve is explained by the dearth of opposite left-handed batters.

5) To provide a more granular analysis, I performed logistic regressions on each of the 63 seasons using splits in weighted On-base Average (wOBA). Again, each of the 63 regressions was statistically significant. Comparison of the AIC (Akaike Information Criterion) values between the BA and wOBA regressions consistently showed that wOBA provided a better fit (lower AIC values).

6) Since batters are exposed to left-handed pitchers less often than right-handed pitchers, right-handed batters get more practice against same-handed pitching than left-handed batters. This result may explain the paucity of left-handed batters with opposite splits and the greater left-right splits for left-handed batters.
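The AIC comparison in point 5 can be illustrated with a short sketch. The data here are simulated, so the result says nothing about which predictor fits real seasons better; the point is only the mechanics of fitting both models to the same batters and comparing them with `AIC()`. The column names `ba_split` and `woba_split` and the 1 = right-handed coding of `bat_code` are assumptions.

```r
set.seed(2)
n <- 150

# Simulated handedness (1 = right-handed, assumed coding) and two splits.
bat_code <- rbinom(n, 1, 0.7)
ba_split <- rnorm(n, mean = ifelse(bat_code == 1, 0.02, -0.03), sd = 0.04)
woba_split <- rnorm(n, mean = ifelse(bat_code == 1, 0.02, -0.04), sd = 0.04)
d <- data.frame(bat_code, ba_split, woba_split)

# Same response, same batters, different predictor.
fit_ba <- glm(bat_code ~ ba_split, family = binomial, data = d)
fit_woba <- glm(bat_code ~ woba_split, family = binomial, data = d)

# Lower AIC indicates the better-fitting model. For a 0/1 response, AIC is
# the residual deviance plus twice the number of estimated coefficients.
aic_table <- AIC(fit_ba, fit_woba)
print(aic_table)
```

Because both models have two coefficients, comparing AIC here is equivalent to comparing residual deviance, which matches the relationship visible in the reported output (for example, a residual deviance of 126.03 and an AIC of 130.03).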
