Batter Handedness Project - Herb Wilson
Contents

Introduction
Data Upload
Join with Lahman database
Change in Proportion of RHP PA by Year
MLB-wide differences in BA against LHP vs. RHP
Equilibration of Batting Average
Individual variation in splits
Logistic regression using Batting Average splits.
Logistic regressions using weighted On-base Average (wOBA)
Summary of Results

Introduction

This project explores batter performance against like-handed and opposite-handed pitchers. It has long been known that, collectively, batters have higher batting averages against opposite-handed pitchers. Differences in performance against left-handed versus right-handed pitchers will be referred to as splits. The general tendency for splits to favor opposite-handed pitchers masks variability in the magnitude of batting splits among batters and variability in splits for a single player among seasons. In this contribution, I test how well split values predict batter handedness and then examine individual variability to explore some of the nuances of the relationship.

The data come primarily from Retrosheet events data, with the Lahman dataset used for some biographical information such as full name. I used the R programming language for all statistical testing and for creating the graphics. A copy of the code is available on request by contacting me at [email protected]

Data Upload

The Retrosheet events data are provided by year. The first step in the analysis is to upload a dataframe for each year, use the rbind function to stitch the yearly dataframes together into a single dataframe, and then use the colnames function to add column names. For this study, I combined the data from 1955 through 2017, yielding a dataframe with nearly 10 million rows.

Join with Lahman database

Next, I use the Lahman Master database to get the first and last names of players, using the left_join function on the Retro_ID number.
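As a rough illustration of the upload and join steps described so far, here is a minimal base-R sketch. It is not the project code: the toy dataframes, the player IDs, and the column names are invented for the example, and base merge with all.x = TRUE stands in for dplyr's left_join so that the sketch runs without any packages.

```r
# Toy stand-ins for two yearly event files (made-up rows and IDs).
# Files read without a header start out with generic names V1, V2, ...
events_1955 <- data.frame(V1 = c("GAME19550001", "GAME19550001"),
                          V2 = c("smitj001", "jonea001"),
                          V3 = c("R", "L"),
                          stringsAsFactors = FALSE)
events_1956 <- data.frame(V1 = "GAME19560001",
                          V2 = "browb001",
                          V3 = "R",
                          stringsAsFactors = FALSE)

# Stitch the yearly dataframes together, then add column names.
events <- do.call(rbind, list(events_1955, events_1956))
colnames(events) <- c("game_id", "batter_RetroID", "pitcher_hand")

# Hypothetical slice of the Lahman Master table (column names assumed).
master <- data.frame(retroID   = c("smitj001", "jonea001"),
                     nameFirst = c("John", "Al"),
                     nameLast  = c("Smith", "Jones"),
                     stringsAsFactors = FALSE)

# Left join in base R: keep every event row and attach names where the
# Retrosheet ID matches; unmatched batters get NA in the name columns.
joined <- merge(events, master,
                by.x = "batter_RetroID", by.y = "retroID",
                all.x = TRUE, sort = FALSE)

print(joined)
```

Keeping all event rows (a left join) means a batter missing from the biographical table shows up with NA names rather than silently dropping plate appearances.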
I then paste the last names and first names together to get a single field for player name. Finally, I drop the joined Master columns I do not need.

Change in Proportion of RHP PA by Year

I begin by looking at the number of Plate Appearances (PA) against left-handed pitchers (LHP) versus right-handed pitchers (RHP) over the 63 years in this dataframe. I first create a column with the year. Then, I filter out any events that do not pertain to the batter (e.g., a stolen base, passed ball, or wild pitch while batting) to create a PA column. Switch-hitters were removed from the analysis.

[Figure: PA against RHP / Total PA, plotted by Year]

MLB-wide differences in BA against LHP vs. RHP

I now examine the magnitude of batter splits against LHP and RHP for each year of the study. I consider LHP first and then RHP. Here are the mean batter splits by year against LHP.

[Figure: BA of RHB − BA of LHB (versus LHP), plotted by Year]

Here are the mean batter splits by year against RHP.

[Figure: BA of LHB − BA of RHB (versus RHP), plotted by Year]

Equilibration of Batting Average

To examine individual variability in splits, I consider only players with a minimum number of at-bats (AB) against both LHP and RHP in a season, to avoid unreliable batting averages caused by small sample sizes. A survey of some randomly chosen players indicates that BA, BA against LHP, and BA against RHP begin to stabilize after 100 AB. Therefore, I require every batter to have at least 100 AB against LHP and at least 100 AB against RHP to be included in an analysis.

Individual variation in splits

The next three histograms show the distribution of splits (BA against LHP minus BA against RHP) for LH batters and for RH batters.
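The 100-AB eligibility rule and the per-batter split can be sketched in base R as follows. This is an illustration under stated assumptions, not the project code: the season totals and player IDs are made up, and the column names are hypothetical apart from lhp_rhp, which mirrors the predictor name that appears in the regression output.

```r
# Hypothetical season totals per batter against each pitcher hand
# (made-up numbers for illustration only).
season <- data.frame(
  batter_RetroID = c("smitj001", "jonea001", "browb001"),
  bats   = c("L", "R", "R"),
  AB_lhp = c(120, 150, 60),
  H_lhp  = c(30, 45, 21),
  AB_rhp = c(400, 300, 350),
  H_rhp  = c(120, 75, 98),
  stringsAsFactors = FALSE
)

min_ab <- 100  # minimum at-bats required against each pitcher hand

# Keep only batters with at least min_ab at-bats against BOTH LHP and RHP.
eligible <- season[season$AB_lhp >= min_ab & season$AB_rhp >= min_ab, ]

# Split = batting average versus LHP minus batting average versus RHP.
eligible$ba_lhp  <- eligible$H_lhp / eligible$AB_lhp
eligible$ba_rhp  <- eligible$H_rhp / eligible$AB_rhp
eligible$lhp_rhp <- eligible$ba_lhp - eligible$ba_rhp

print(eligible[, c("batter_RetroID", "bats", "lhp_rhp")])
```

With this sign convention, a negative value means the batter hit better against RHP and a positive value means the batter hit better against LHP; the third toy batter is dropped because he has only 60 AB against LHP.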
I randomly chose three seasons to present. A clear pattern is that the proportion of LH batters with opposite splits (higher BA against LHP) is lower than the proportion of RH batters with opposite splits (higher BA against RHP).

## Joining, by = "batter_RetroID"

[Figure: Data from 1971 MLB Season. Histograms in two panels (L and R batters); x-axis: (Batting Average versus LHP) − (Batting Average versus RHP); y-axis: Frequency]

## Joining, by = "batter_RetroID"

[Figure: Data from 2004 MLB Season. Histograms in two panels (L and R batters); x-axis: (Batting Average versus LHP) − (Batting Average versus RHP); y-axis: Frequency]

## Joining, by = "batter_RetroID"

[Figure: Data from 1970 MLB Season. Histograms in two panels (L and R batters); x-axis: (Batting Average versus LHP) − (Batting Average versus RHP); y-axis: Frequency]

Logistic regression using Batting Average splits.

## Joining, by = "batter_RetroID"
##
## Call:  glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Coefficients:
## (Intercept)      lhp_rhp
##       1.561       28.355
##
## Degrees of Freedom: 152 Total (i.e. Null);  151 Residual
## Null Deviance:     162.1
## Residual Deviance: 126   AIC: 130
##
## Call:
## glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.6604   0.1244   0.4248   0.6306   1.9995
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   1.5613     0.2564   6.090 1.13e-09 ***
## lhp_rhp      28.3555     5.7860   4.901 9.55e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 162.09  on 152  degrees of freedom
## Residual deviance: 126.03  on 151  degrees of freedom
## AIC: 130.03
##
## Number of Fisher Scoring iterations: 5

[Figure: Data from 1970 MLB Season. x-axis: Batting Average vs LH Pitching − Batting Average vs RH Pitching; y-axis: Log(Odds of RH Hitter/Odds of LH Hitter)]

## Joining, by = "batter_RetroID"
##
## Call:  glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Coefficients:
## (Intercept)      lhp_rhp
##       0.928       17.282
##
## Degrees of Freedom: 155 Total (i.e. Null);  154 Residual
## Null Deviance:     192.6
## Residual Deviance: 167.8   AIC: 171.8
##
## Call:
## glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.2596  -0.9671   0.5951   0.7915   1.5595
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.9280     0.1958   4.739 2.14e-06 ***
## lhp_rhp      17.2824     3.8643   4.472 7.74e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 192.58  on 155  degrees of freedom
## Residual deviance: 167.77  on 154  degrees of freedom
## AIC: 171.77
##
## Number of Fisher Scoring iterations: 4

[Figure: Data from 1999 MLB Season. x-axis: Batting Average vs LH Pitching − Batting Average vs RH Pitching; y-axis: Log(Odds of RH Hitter/Odds of LH Hitter)]

## Joining, by = "batter_RetroID"
##
## Call:  glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Coefficients:
## (Intercept)      lhp_rhp
##       1.596       40.488
##
## Degrees of Freedom: 113 Total (i.e. Null);  112 Residual
## Null Deviance:     119.9
## Residual Deviance: 80.07   AIC: 84.07
##
## Call:
## glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -3.00875   0.07691   0.30980   0.54284   1.63588
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   1.5958     0.3236   4.931 8.18e-07 ***
## lhp_rhp      40.4879     8.5643   4.728 2.27e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 119.932  on 113  degrees of freedom
## Residual deviance:  80.072  on 112  degrees of freedom
## AIC: 84.072
##
## Number of Fisher Scoring iterations: 6

[Figure: Data from 1961 MLB Season. x-axis: Batting Average vs LH Pitching − Batting Average vs RH Pitching; y-axis: Log(Odds of RH Hitter/Odds of LH Hitter)]

Logistic regressions using weighted On-base Average (wOBA)

Batting Average is a rather coarse measure of batting performance. I calculated weighted on-base averages (wOBA) for all players. wOBA gives a more nuanced assessment of batting performance by incorporating walks and power. I run logistic regressions on wOBA splits for the same seasons analyzed above with batting average splits (1970, 1999, and 1961) so that the strength of the models can be compared.

Logistic Regression for wOBA splits for the 1970 MLB season

## Joining, by = "batter_RetroID"
##
## Call:  glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Coefficients:
## (Intercept)      lhp_rhp
##       1.561       28.355
##
## Degrees of Freedom: 152 Total (i.e. Null);  151 Residual
## Null Deviance:     162.1
## Residual Deviance: 126   AIC: 130
##
## Call:
## glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -2.6604   0.1244   0.4248   0.6306   1.9995
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   1.5613     0.2564   6.090 1.13e-09 ***
## lhp_rhp      28.3555     5.7860   4.901 9.55e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 162.09  on 152  degrees of freedom
## Residual deviance: 126.03  on 151  degrees of freedom
## AIC: 130.03
##
## Number of Fisher Scoring iterations: 5

[Figure: Data from 1970 MLB Season. x-axis: wOBA vs LH Pitching − wOBA vs RH Pitching; y-axis: Log(Odds of RH Hitter/Odds of LH Hitter)]

[Figure: 1970 MLB Season. Histograms in two panels (L and R batters); x-axis: wOBA versus LHP − wOBA versus RHP; y-axis: Frequency]

Logistic Regression for wOBA splits for the 1999 MLB season

## Joining, by = "batter_RetroID"
##
## Call:  glm(formula = bat_code ~ lhp_rhp, family = binomial, data = full_splits)
##
## Coefficients:
## (Intercept)      lhp_rhp
##       0.928       17.282
##
## Degrees of Freedom: 155 Total (i.e.